INTRODUCTION TO BIG DATA
Data and Information
Data is nothing but facts and statistics stored or free-flowing over a network; generally, it is raw
and unprocessed. When data are processed, organized, structured, or presented in a given context
so as to make them useful, they are called information.
It is not enough to have data (such as statistics on the economy).
Data themselves are fairly useless, but when these data are interpreted and processed to determine
their true meaning, they become useful and can be called Information.
For example, when you visit a website, it might store your IP address; that is data. In return it
might add a cookie to your browser, marking that you visited the website; that is also data.
Your name is data, and your age is data.
What is Data?
Data is the quantities, characters, or symbols on which operations are performed by a computer, and
which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical,
or mechanical recording media.
3 Actions on Data
– Capture
– Transform
– Store
Big Data
Every day we create 2.5 quintillion bytes of data—in fact, 90 percent of the data in the world today
has been created in the last two years alone.
• This data comes from a wide variety of sources: sensors used to gather climate information, posts
to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS
signals, to name a few.
The New York Stock Exchange is an example of Big Data that generates about one terabyte of new trade
data per day.
Twitter generates 7 TB of data daily.
IBM claims 90% of today’s stored data was generated in just the last two years.
Walmart handles more than 1 million customer transactions every hour.
Facebook handles 40 billion photos from its user base.
Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
According to Gartner, the definition of Big Data –
“Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight and decision making.”
Big Data refers to complex and large data sets that have to be processed and analyzed to uncover
valuable information that can benefit businesses and organizations. However, there are certain
basic tenets of Big Data that will make it even simpler to answer what is Big Data:
·It refers to a massive amount of data that keeps on growing exponentially with time.
·Big data is a term applied to data sets whose size or type is beyond the ability of traditional
relational databases to capture, manage and process the data with low latency.
·It includes data mining, data storage, data analysis, data sharing, and data visualization.
·The term is an all-comprehensive one including data, data frameworks, along with the tools and
techniques used to process and analyze the data.
SOURCES OF BIG DATA
Artificial intelligence (AI), Mobile, Social Media, and the Internet of Things (IoT) are driving data
complexity through new forms and sources of data.
For example, big data comes from Sensors, Devices, Video/Audio, Networks, Log files,
Transactional applications, Web, and Social media — much of it is generated in real-time and at a
very large scale.
The History of Big Data
The 21st century is characterized by rapid advancement in the field of information technology.
IT has become an integral part of daily life as well as of industries such as health,
education, entertainment, science and technology, genetics, and business operations. These
industries generate a lot of data, which can be called Big Data.
Big Data consists of large datasets that cannot be managed efficiently by the common database
management systems.
These datasets range from terabytes to exabytes.
Mobile phones, credit cards, Radio Frequency Identification (RFID) devices, and social
networking platforms create huge amounts of data that may reside unutilized at unknown servers
for many years.
And with the evolution of Big Data, this data can be accessed and analyzed on a regular basis to
generate useful information.
“Big Data” is a relative term depending on who is discussing it. For Example, Big Data to Amazon
or Google is very different from Big Data to a medium-sized insurance organization.
Types of Big Data/Types of digital data
DIGITAL DATA
Digital data is information stored on a computer system as a series of 0’s and 1’s in a binary
language. Digital data jumps from one value to the next in a step-by-step sequence.
Example: Whenever we send an email, read a social media post, or take pictures with our digital
camera, we are working with digital data.
Digital data can be classified into the following forms:
a) Structured
Structured is one of the types of big data and by structured data, we mean data that can be
processed, stored, and retrieved in a fixed format. It refers to highly organized information that
can be readily and seamlessly stored and accessed from a database by simple search engine
algorithms. Relationships exist between entities of data, such as classes and their objects.
Structured data is usually stored in well-defined columns and databases.
– Structured Schema
– Tables with rows and columns of data
– Example: DBMS, RDBMS
For instance, the employee table in a company database will be structured as the employee
details, their job positions, their salaries, etc., will be present in an organized manner.
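As a minimal illustration (the table, names, and salaries below are hypothetical), the sketch uses Python's built-in sqlite3 module to show what such fixed rows and columns look like and why simple queries work so easily on structured data:

    import sqlite3

    # Structured data: a fixed schema of named, typed columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, "
        "job_position TEXT, salary REAL)"
    )
    conn.executemany(
        "INSERT INTO employee (name, job_position, salary) VALUES (?, ?, ?)",
        [("Asha", "Analyst", 55000.0), ("Ravi", "Developer", 65000.0)],
    )

    # Because the format is fixed, simple queries retrieve the data directly.
    for row in conn.execute("SELECT name, salary FROM employee WHERE salary > 60000"):
        print(row)            # ('Ravi', 65000.0)
    conn.close()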
b) Unstructured
Unstructured data refers to data that lacks any specific form or structure and does not follow the
formal structural rules of data models. It does not even have a consistent format, and it is found to
vary all the time. This makes it very difficult and time-consuming to process and analyze
unstructured data. About 80-90% of an organization's data is in this format.
Example: Memos, chat rooms, PowerPoint presentations, images, videos, letters, research, white
papers, the body of an email, etc.
c) Semi-structured
It is considered another form of structured data. It inherits a few properties of structured data, but
major parts of this kind of data lack a definitive structure and do not obey the
formal structure of data models such as RDBMS. To be precise, it refers to data that, although
not classified under a particular repository (database), still contains vital information or
tags that segregate individual elements within the data. However, it is not in a form that can be
used easily by a computer program.
Example: Emails, XML, markup languages like HTML, etc.
Metadata for this data is available but is not sufficient.
– Schema is not defined properly
– JSON, XML, CSV, RSS
– Ex: Transactional history file, Logfile
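A small sketch (the sample records are made up) of why JSON is called semi-structured: each record carries tags that identify its fields, but the records do not share one fixed schema, so the program must handle missing fields itself:

    import json

    # Semi-structured data: self-describing tags, but no fixed schema --
    # the second record has extra and missing fields, and the program must cope.
    records = [
        '{"name": "Asha", "age": 29, "city": "Pune"}',
        '{"name": "Ravi", "email": "ravi@example.com", "skills": ["java", "sql"]}',
    ]

    for raw in records:
        doc = json.loads(raw)                          # parse the tagged elements
        print(doc.get("name"), doc.get("age", "n/a"))  # missing fields handled explicitly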
d) Quasi-structured Data: Textual data with inconsistent formats that can only be made usable
with effort, time, and some tools. Example: web server logs, i.e., a log file created and maintained
by a server that contains a list of activities.
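To illustrate quasi-structured data, the sketch below parses a raw web-server log line with a regular expression; the log line and pattern are assumptions loosely modeled on a common access-log layout, and the point is that tools and effort are needed before the data becomes usable:

    import re

    # A raw web-server log line: textual, with an inconsistent, tool-dependent format.
    line = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'

    pattern = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]+)" '
        r'(?P<status>\d{3}) (?P<size>\d+)'
    )
    match = pattern.match(line)
    if match:
        # Only after this parsing effort does the data become usable by a program.
        print(match.group("ip"), match.group("status"), match.group("request"))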
Introduction to Big Data platform
A big data platform is a type of IT solution that combines the features and capabilities of several
big data applications and utilities within a single solution, which is then used for managing
as well as analyzing Big Data.
It focuses on providing its users with efficient analytics tools for massive datasets.
The users of such platforms can custom-build applications according to their use case, such as
calculating customer loyalty (an e-commerce use case), and so on.
Goal: The main goal of a Big Data Platform is to achieve: Scalability, Availability, Performance,
and Security.
Basics of Big Data Platform
· Big Data platform is an integrated IT solution that combines several Big Data tools and utilities
into one packaged solution for managing and analyzing Big Data.
· It is an enterprise-class IT platform that enables organizations to develop, deploy,
operate, and manage a big data infrastructure/environment.
· There are several Open sources and commercial Big Data platforms in the market with varied
features which can be used in the Big Data environment.
· A big data platform generally consists of big data storage, servers, databases, big data management,
business intelligence, and other big data management utilities.
· It also supports custom development, querying, and integration with other systems.
· The primary benefit of a big data platform is that it reduces the complexity of multiple vendors/
solutions into one cohesive solution.
· Big data platforms are also delivered through the cloud where the provider provides all-inclusive
big data solutions and services.
Features of Big Data Platform
Here are the most important features of any good Big Data Analytics Platform:
a) Big Data platform should be able to accommodate new platforms and tools based on the
business requirement. Because business needs can change due to new technologies or due to
changes in the business processes.
b) It should support linear scale-out
c) It should have the capability for rapid deployment
d) It should support a variety of data formats
e) Platform should provide data analysis and reporting tools
f) It should provide real-time data analysis software
g) It should have tools for searching the data through large data set
Best Big Data Platforms
Based on S, A, P, S which means Scalability, Availability, Performance, and Security, platforms
are listed below:
a) Hadoop Delta Lake Migration Platform
b) Data Catalog Platform
c) Data Ingestion Platform
d) IoT Analytics Platform
e) Data Integration and Management Platform
f) ETL Data Transformation Platform
• Hadoop – Delta Lake Migration Platform: It is an open-source software platform
managed by Apache Software Foundation. It is used to manage and store large data
sets at a low cost and with great efficiency.
• Data Catalog Platform: Provides a single self-service environment to the users,
helping them find, understand, and trust the data source. Helps the users to discover the
new data sources if there are any. Discovering and understanding data sources are the
initial steps for registering the sources.
• Data Ingestion Platform: This layer is the first step for the data coming from variable
sources to start its journey. This means the data here is prioritized and categorized,
making data flow smoothly in further layers in this process flow.
• IoT Analytics Platform: It provides a wide range of tools to work upon big data; this
functionality of it comes in handy while using it over the IoT case.
• Data Integration and Management Platform: ElixirData provides a highly
customizable solution for enterprises. ElixirData provides flexibility, security, and
stability for an enterprise application and Big Data infrastructure, deployable on-premises
and on the public cloud, with cognitive insights using Machine Learning and Artificial Intelligence.
• ETL Data Transformation Platform: This platform can be used to build pipelines
and even schedule the running of the same for data transformation.
Drivers for Big Data
Big Data has quickly risen to become one of the most desired topics in the industry.
The main business drivers for the rising demand for Big Data Analytics are:
1. The digitization of society
2. The drop in technology costs
3. Connectivity through cloud computing
4. Increased knowledge about data science
5. Social media applications
6. The rise of the Internet of Things (IoT)
Example: A number of companies that have Big Data at the core of their strategy, such as Apple,
Amazon, Facebook, and Netflix, became very successful at the beginning of the 21st century.
Big Data Architecture:
Big data architecture refers to the logical and physical structure that dictates how high volumes of
data are ingested, processed, stored, managed, and accessed. Big data architecture is designed to
handle the ingestion, processing, and analysis of data that is too large or complex for traditional
database systems.
Layers in BIG DATA Architecture
Data sources: All big data solutions start with one or more data sources.
For example,
o Application data stores, such as relational databases.
o Static files produced by applications, such as web server log files.
o Real-time data sources, such as IoT devices.
Big Data Ingestion Layer
This layer of Big Data Architecture is the first step for the data coming from variable sources to
start its journey. Data ingestion means the data is prioritized and categorized, making data flow
smoothly in further layers in the Data ingestion process flow.
Tools used in this layer include:
Apache Flume – a straightforward and flexible architecture based on streaming data flows.
Apache NiFi – supports robust and scalable directed graphs of data routing, transformation,
and system mediation logic.
Elastic Logstash – an open-source data ingestion tool and server-side data processing pipeline that
ingests data from many sources simultaneously, transforms it, and then sends it to your
“stash,” i.e., Elasticsearch.
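The tools above are complete ingestion systems; the toy Python sketch below (the source names and routing rules are purely hypothetical) only illustrates the idea this layer implements, namely prioritizing and categorizing incoming records before they flow to later layers:

    # Toy ingestion step: tag each incoming record with a category and priority
    # so that later layers can route high-priority streams first.
    def ingest(record):
        source = record.get("source", "unknown")
        if source in ("sensor", "gps"):
            category, priority = "machine-generated", "high"
        elif source in ("twitter", "facebook"):
            category, priority = "social-media", "medium"
        else:
            category, priority = "other", "low"
        return {**record, "category": category, "priority": priority}

    incoming = [
        {"source": "sensor", "value": 21.4},
        {"source": "twitter", "text": "big data is everywhere"},
    ]
    for r in incoming:
        print(ingest(r))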
Data Collector Layer
In this Layer, more focus is on the transportation of data from the ingestion layer to the rest of the
data pipeline. It is the Layer of data architecture where components are decoupled so that analytic
capabilities may begin.
Data Processing Layer
In this primary layer of Big Data Architecture, the focus is to specialize in the data pipeline
processing system. We can say the data we have collected in the previous layer is processed in this
layer. Here we do some magic with the data to route them to a different destination and classify
the data flow, and it’s the first point where the analysis may occur.
Data Storage Layer
Storage becomes a challenge when the size of the data you are dealing with becomes large. Several
possible solutions, such as data ingestion patterns, can help with this problem. Finding a suitable
storage solution is very important as the size of your data grows. This layer of Big Data
Architecture focuses on “where to store such large data efficiently.”
Data Query Layer
This is the architectural layer where active analytic processing of Big Data takes place. Here, the
primary focus is to gather the data value to be more helpful for the next layer.
Data Visualization Layer
The visualization, or presentation tier, is probably the most prestigious tier, where the data pipeline
users may feel the VALUE of DATA.
BIG DATA CHARACTERISTICS
Back in 2001, the 3 V's of Big Data were defined as Volume, Velocity, and Variety.
In the early stages of development of big data and related terms, only these 3 V's (Volume,
Variety, Velocity) were considered potential characteristics.
But ever-growing technology and tools, and the variety of sources from which information is received,
have expanded these 3 V's into 5 V's, and the list is still evolving.
The five V's of Big Data are Volume, Velocity, Variety, Veracity, and Value.
VOLUME
Volume is one of the characteristics of big data. Volume refers to the unimaginable amounts of
information generated every second. Data storage has grown exponentially because data is now
much more than text data; it can be found in the form of videos, music, and large images on
our social media channels. It is very common for enterprises to have terabytes and petabytes of
storage. As the database grows, the applications and architecture built to support
the data need to be re-evaluated quite often.
Data has grown more in the past few years
than in the past few decades. Social media, web portals, and real-time data from sensors have
increased the amount of data.
For example, Facebook alone generates about a billion messages, records about 4.5 billion clicks of the
“like” button, and receives over 350 million new posts each day. Such a huge
amount of data can only be handled by Big Data technologies.
Sometimes the same data is re-evaluated from multiple angles, and even though the original data
is the same, the newfound intelligence creates an explosion of data. This big volume indeed
represents Big Data.
We currently use distributed systems to store data in several locations, brought
together by a software framework like Hadoop.
VELOCITY
Velocity essentially refers to the speed at which data is being created in real time. In a broader
perspective, it comprises the rate of change, the linking of incoming data sets at varying speeds, and
activity bursts. The data growth and social media explosion have changed how we look at the
data. There was a time when we used to believe that data of yesterday was recent. As a matter
of fact, newspapers are still following that logic. However, news channels and radios have
changed how fast we receive the news.
Today, people rely on social media to keep them updated with the latest happenings. On social media,
a message that is even a few seconds old (a tweet, a status update, etc.) is often not something that
interests users.
They often discard old messages and pay attention to recent updates. The data movement is now
almost real-time, and the update window has reduced to fractions of a second. This high-velocity
data represents Big Data.
Examples of data that is generated with high velocity – Twitter messages or Facebook posts.
VARIETY
Data can be stored in multiple formats, for example in a database, in Excel, as CSV, or
in a simple text file. Sometimes the data is not even in a traditional format; it may be in the
form of video, SMS, PDF, or something different. The organization needs to arrange it
and make it meaningful. This would be easy if the data were all in the same format, however that
is not the case most of the time. The real world has data in many different formats and that is
the challenge we need to overcome with Big Data. This variety of data represents Big Data.
Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered
from multiple sources. While in the past, data could only be collected from spreadsheets and
databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio,
social media posts, and so much more. Variety is one of the important characteristics of big data.
VERACITY
Data veracity, in general, is how accurate or truthful a data set may be. More specifically, when it
comes to the accuracy of big data, it’s not just the quality of the data itself but how trustworthy the
data source, type, and processing of it is.
The data quality of captured data can vary greatly, affecting the accurate analysis.
Example: Facebook posts with hashtags.
VALUE
Value is the major issue that we need to concentrate on. It is not just the amount of data that we
store or process. It is actually the amount of valuable, reliable, and trustworthy data that needs to
be stored, processed, and analyzed to find insights.
· Mining the data is the process of turning raw data into useful data. Value represents the benefits
of data to your business, such as finding insights and results that were not possible earlier.
Big Data Technology Components:
1. Ingestion:
The ingestion layer is the very first step of pulling in raw data.
It comes from internal sources, relational databases, non-relational databases, social media, emails,
phone calls, etc.
There are two kinds of ingestion (a small sketch follows):
Batch, in which large groups of data are gathered and delivered together, and
Streaming, a continuous flow of data handled record by record; this is necessary for real-time data analytics.
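A minimal sketch (the records and handler are hypothetical) contrasting the two styles: the batch is collected and delivered together, while the stream is consumed record by record as it arrives:

    import time

    def process(record):
        print("processed:", record)

    # Batch ingestion: gather a large group of records, then deliver them together.
    batch = [{"order_id": i, "amount": 10 * i} for i in range(3)]
    process(batch)                      # one delivery, many records

    # Streaming ingestion: a continuous flow, handled record by record
    # (needed for real-time analytics).
    def event_stream():
        for i in range(3):
            time.sleep(0.1)             # stand-in for events arriving over time
            yield {"order_id": i, "amount": 10 * i}

    for event in event_stream():
        process(event)                  # each record processed as soon as it arrives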
2. Storage:
Storage is where the converted data is stored in a data lake or warehouse and eventually processed.
The data lake/warehouse is the most essential component of a big data ecosystem.
It needs to contain only thorough, relevant data to make insights as valuable as possible.
It must be efficient with as little redundancy as possible to allow for quicker processing.
3. Analysis:
In the analysis layer, data gets passed through several tools, shaping it into actionable insights.
There are four types of analytics on big data (a sketch of the predictive type follows this list):
● Diagnostic: Explains why a problem is happening.
● Descriptive: Describes the current state of a business through historical data.
● Predictive: Projects future results based on historical data.
● Prescriptive: Takes predictive analytics a step further by projecting the best future efforts.
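As a tiny, self-contained illustration of the predictive type (the monthly sales figures are made up), the sketch below fits a least-squares trend line to historical data and projects the next period:

    # Historical monthly sales (hypothetical) -> project month 7 with a least-squares line.
    months = [1, 2, 3, 4, 5, 6]
    sales  = [100, 110, 125, 130, 150, 165]

    n = len(months)
    mean_x = sum(months) / n
    mean_y = sum(sales) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
             / sum((x - mean_x) ** 2 for x in months))
    intercept = mean_y - slope * mean_x

    print("forecast for month 7:", round(intercept + slope * 7, 1))   # 175.0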
4. Consumption:
The final big data component is presenting the information in a format digestible to the end-user.
This can be in the forms of tables, advanced visualizations, and even single numbers if requested.
The most important thing in this layer is making sure the intent and meaning of the output are
understandable.
A GENERAL OVERVIEW OF HIGH-PERFORMANCE ARCHITECTURE
Most high-performance platforms are created by connecting multiple nodes together via a variety
of network topologies.
The general architecture distinguishes the management of computing resources (and the
corresponding allocation of tasks) from the management of the data across the network of storage
nodes. This arrangement is generally called master/slave architecture.
In this configuration, a master job manager oversees the pool of processing nodes, assigns tasks,
and monitors the activity. At the same time, a storage manager oversees the data storage pool and
distributes datasets across the collection of storage resources. While there is no a priori requirement
that there be any co-location of data and processing tasks, it is beneficial from a performance
perspective to ensure that the threads process data that is local, or close by, in order to minimize the
costs of data access latency.
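A highly simplified sketch of the idea just described (node names, block placement, and load values are assumptions, not any particular product's API): the master prefers to assign a task to a node that already holds a replica of the data block the task needs, so data access stays local:

    # Master/slave scheduling sketch: prefer placing a task on a node that
    # already holds the data block it needs (data locality).
    block_locations = {            # storage manager's view: block -> nodes holding a replica
        "block-1": ["node-A", "node-B"],
        "block-2": ["node-B", "node-C"],
        "block-3": ["node-C", "node-A"],
    }
    node_load = {"node-A": 0, "node-B": 0, "node-C": 0}

    def assign(task, block):
        candidates = block_locations.get(block, list(node_load))   # local replicas first
        node = min(candidates, key=lambda n: node_load[n])         # least-loaded replica
        node_load[node] += 1
        return node

    for task, block in [("t1", "block-1"), ("t2", "block-2"), ("t3", "block-3")]:
        print(task, "->", assign(task, block))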
1. APACHE HADOOP
Apache Hadoop is one of the main supportive elements in Big Data technologies. It simplifies
the processing of large amounts of structured or unstructured data in a cheap manner. Hadoop
is an open-source project from Apache that has been continuously improving over the years. “Hadoop
is basically a set of software libraries and frameworks to manage and process big amounts of
data from a single server to thousands of machines. It provides an efficient and powerful error
detection mechanism based on the application layer rather than relying upon hardware.”
2. MapReduce
MapReduce was introduced by Google to create large web search indexes. It is
basically a framework to write applications that process large amounts of structured or
unstructured data over the web. MapReduce takes a query and breaks it into parts to run on
multiple nodes. Through distributed query processing, it makes it easy to maintain large amounts of
data by dividing the data across several different machines. Hadoop MapReduce is a software
framework for easily writing applications to manage large data sets in a highly fault-tolerant manner.
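The classic word-count example captures the programming model. The pure-Python sketch below (the input lines are made up) mimics the map, shuffle, and reduce phases that Hadoop MapReduce would run in parallel across many nodes:

    from collections import defaultdict

    lines = ["big data is big", "hadoop processes big data"]

    # Map: each line is split into (word, 1) pairs, independently per node.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle: pairs with the same key are grouped together.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce: each group is summed to give the final count per word.
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)   # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}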
3. HDFS (Hadoop Distributed File System)
HDFS is a Java-based file system that is used to store structured or unstructured data over large
clusters of distributed servers. The data stored in HDFS has no restriction or rule applied to it; the
data can be either fully unstructured or purely structured. In HDFS, the work of making data useful
is done by the developer's code only. The Hadoop Distributed File System provides a highly fault-tolerant
environment with deployment on low-cost hardware machines. HDFS is now a part of the
Apache Hadoop project.
4. HIVE
Hive was originally developed by Facebook and has now been open source for some time. Hive
works as something of a bridge between SQL and Hadoop; it is basically used to run SQL
queries on Hadoop clusters. Apache Hive is basically a data warehouse that provides ad-hoc
queries, data summarization, and analysis of huge data sets stored in Hadoop-compatible file
systems. Hive provides an SQL-like query language called HiveQL for querying the huge
amounts of data stored in Hadoop clusters. In January 2013 Apache released Hive 0.10.0; more
information and an installation guide can be found in the Apache Hive documentation.
5. PIG
Pig was introduced by Yahoo and was later made fully open source. It also provides a bridge
to query data over Hadoop clusters but, unlike Hive, it uses a scripting approach to
make Hadoop data accessible to developers and business people. Apache Pig provides a high-level
programming platform for developers to process and analyze Big Data using user-defined
functions and programming effort. In January 2013 Apache released Pig 0.10.1, which is
defined for use with Hadoop 0.10.1 or later releases. More information and an installation guide
can be found in the Apache Pig Getting Started documentation.
BIG DATA USE CASES
Big data techniques can be used to leverage business benefits and increase the value of an
organization. Big data is beneficial in many applications, and in general the following are the
common categories. The list is derived from The Apache Software Foundation's Powered By Hadoop
web page.
·Business intelligence, querying, reporting, searching, including many implementations of
searching, filtering, indexing, speeding up aggregation for reporting and for report generation,
trend analysis, search optimization, and general information retrieval.
·Improved performance for common data management operations, with the majority focusing
on log storage, data storage, and archiving, followed by sorting, running joins,
extraction/transformation/ loading (ETL) processing, other types of data conversions, as well as
duplicate analysis and elimination.
·Non-database applications, such as image processing, text processing in preparation for
publishing, genome sequencing, protein sequencing and structure prediction, web crawling, and
monitoring workflow processes.
·Data mining and analytical applications, including social network analysis, facial recognition,
profile matching, other types of text analytics, web mining, machine learning, information
extraction, personalization and recommendation analysis, ad optimization, and behavior analysis.
In turn, the core capabilities that are implemented using the big data application can be further
abstracted into more fundamental categories:
·Counting functions applied to large bodies of data that can be segmented and distributed among
a pool of computing and storage resources, such as document indexing, concept filtering, and
aggregation (counts and sums).
·Scanning functions that can be broken up into parallel threads, such as sorting, data
transformations, semantic text analysis, pattern recognition, and searching.
·Modeling capabilities for analysis and prediction.
·Storing large datasets while providing relatively rapid access.
Generally, processing applications can combine these core capabilities in different ways. In
today's world, big data has several applications, some of which are listed below:
Tracking Customer Spending Habits and Shopping Behavior:
In big retail stores, the management team keeps data on customers' spending habits, shopping
behavior, most liked products, and which products are searched for/sold the most; based on that data, the
production/collection rate of those products is fixed.
Recommendation:
By tracking customer spending habits and shopping behavior, big retail stores provide
recommendations to their customers.
Smart Traffic System:
Data about the traffic conditions on different roads is collected through cameras and GPS devices
placed in vehicles. All such data are analyzed, and jam-free or less congested and less time-consuming
routes are recommended.
A further benefit is that fuel consumption can be reduced.
Secure Air Traffic System:
Sensors are present at various places in an aircraft. These sensors capture data like the speed of the flight,
moisture, temperature, and other environmental conditions.
Based on the analysis of such data, environmental parameters within the flight are set and varied. By
analyzing the flight's machine-generated data, it can be estimated how long the machine can operate
flawlessly and when it should be replaced/repaired.
Auto Driving Car:
Cameras and sensors placed at various spots on the car gather data like the size of the
surrounding cars, obstacles, the distance from them, etc. These data are analyzed, and then various
calculations are carried out. These calculations help the car take action automatically.
Virtual Personal Assistant Tool:
Big data analysis helps virtual personal assistant tools like Siri, Cortana, and Google Assistant to
provide the answer to the various questions asked by users.
These tools track the location of the user, their local time, the season, other data related to the question
asked, etc. Analyzing all such data, they provide an answer.
Example: Suppose a user asks “Do I need to take an umbrella?” The tool collects data like the
location of the user, the season, and the weather conditions at that location, then analyzes these data to
determine whether there is a chance of rain, and then provides the answer.
IoT:
Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing
such data, it can be predicted how long a machine will work without any problem and when it will require
repair. Thus, the cost of replacing the whole machine can be saved.
Education Sector:
Organizations conducting online educational courses utilize big data to search for candidates interested in
their courses. If someone searches for a YouTube tutorial video on a subject, then an online or offline course
provider for that subject sends that person an online ad about their course.
Media and Entertainment Sector:
Media and entertainment service providers like Netflix, Amazon Prime, and Spotify analyze
data collected from their users. Data such as which types of video or music users are watching or listening
to most, how long users spend on the site, etc., are collected and analyzed to set the next business strategy.
Big Data Importance
The importance of big data does not revolve around how much data a company has but how a
company utilizes the collected data. Every company uses data in its own way; the more efficiently
a company uses its data, the more potential it has to grow. The company can take data from any
source and analyze it to find answers which will enable:
1. Cost Savings: Some tools of Big Data like Hadoop and Cloud-Based Analytics can bring cost
advantages to businesses when large amounts of data are to be stored and these tools also help in
identifying more efficient ways of doing business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics can easily
identify new sources of data which helps businesses analyze data immediately and make quick
decisions based on the learning.
3. Understand the market conditions: By analyzing big data you can get a better understanding
of current market conditions. For example, by analyzing customers’ purchasing behaviors, a
company can find out the products that are sold the most and produce products according to this
trend. By this, it can get ahead of its competitors.
4. Control online reputation: Big data tools can do sentiment analysis. Therefore, you can get
feedback about who is saying what about your company. If you want to monitor and improve the
online presence of your business, then, big data tools can help in all this.
5. Using Big Data Analytics to Boost Customer Acquisition and Retention: The customer is
the most important asset any business depends on. There is no single business that can claim
success without first having to establish a solid customer base. However, even with a customer
base, a business cannot afford to disregard the high competition it faces. If a business is slow to
learn what customers are looking for, then it is very easy to begin offering poor-quality products.
In the end, loss of clientele will result, and this creates an adverse overall effect on business
success. The use of big data allows businesses to observe various customer-related patterns and
trends. Observing customer behavior is important to trigger loyalty.
6. Using Big Data Analytics to Solve Advertiser’s Problem and Offer Marketing Insights:
Big data analytics can help change all business operations. This includes the ability to match
customer expectations, changing the company’s product line, and of course ensuring that the
marketing campaigns are powerful.
7. Big Data Analytics as a Driver of Innovations and Product Development: Another huge
advantage of big data is the ability to help companies innovate and redevelop their products.
Big Data Challenges
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and
visualization.
Big Data is a trend to larger data sets due to the additional information derivable from analysis of
a single large set of related data, as compared to separate smaller sets with the same total amount
of data, allowing correlations to be found to “spot business trends, determine the quality of
research, prevent diseases, link legal citations, combat crime, and determine real-time roadway
traffic conditions.”
Challenges of Big Data
The following are the most important challenges of Big Data:
a) Big data usually includes data sets with sizes beyond the ability of commonly used software
tools to capture, curate, manage, and process data within a tolerable elapsed time. Big data “size”
is a constantly moving target.
Meeting the need for speed
In today’s hypercompetitive business environment, companies not only have to find and analyze
the relevant data they need, but they must also find it quickly.
b) Visualization helps organizations perform analyses and make decisions much more rapidly,
but the challenge is going through the sheer volumes of data and accessing the level of detail
needed, all at a high speed.
c) The challenge only grows as the degree of granularity increases. One possible solution is
hardware. Some vendors are using increased memory and powerful parallel processing to crunch
large volumes of data extremely quickly
d) Understanding the data
It takes a lot of understanding to get data in the RIGHT SHAPE so that you can use visualization
as part of data analysis.
e) Addressing data quality
Even if you can find and analyze data quickly and put it in the proper context for the audience that
will be consuming the information, the value of data for DECISION-MAKING PURPOSES will
be jeopardized if the data is not accurate or timely. This is a challenge with any data analysis.
f) Displaying meaningful results
Plotting points on a graph for analysis becomes difficult when dealing with extremely large
amounts of information or a variety of categories of information.
For example, imagine you have 10 billion rows of retail SKU data that you’re trying to compare.
The user trying to view 10 billion plots on the screen will have a hard time seeing so many data
points. By grouping the data together, or “binning,” you can more effectively visualize the data.
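A small sketch of that binning idea (the values are randomly generated stand-ins for real rows): instead of plotting every point, the data is grouped into buckets whose counts can then be visualized directly:

    import random
    from collections import Counter

    random.seed(42)
    values = [random.gauss(500, 150) for _ in range(100_000)]   # stand-in for raw rows

    bin_width = 100
    bins = Counter(int(v // bin_width) * bin_width for v in values)

    # A handful of bins now summarizes 100,000 points.
    for start in sorted(bins):
        print(f"{start:>5}-{start + bin_width:<5} {bins[start]}")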
g) Dealing with outliers
The graphical representations of data made possible by visualization can communicate trends and
outliers much faster than tables containing numbers and text. Users can easily spot issues that need
attention simply by glancing at a chart. Outliers typically represent about 1 to 5 percent of data,
but when you’re working with massive amounts of data, viewing 1 to 5 percent of the data is rather
difficult. We can also bin the results to both view the distribution of data and see the outliers.
While outliers may not be representative of the data, they may also reveal previously unseen and
potentially valuable insights. Visual analytics enables organizations to take raw data and present
it in a meaningful way that generates the most value. However, when used with big data,
visualization is bound to lead to some challenges.
List of Big Data Platforms
a) Hadoop
b) Cloudera
c) Amazon Web Services
d) Hortonworks
e) MapR
f) IBM Open Platform
g) Microsoft HDInsight
h) Intel Distribution for Apache Hadoop
i) Datastax Enterprise Analytics
j) Teradata Enterprise Access for Hadoop
k) Pivotal HD
Hadoop
· Hadoop is an open-source, Java-based programming framework and server software which is used
to save and analyze data with the help of hundreds or even thousands of commodity servers in a clustered
environment.
· Hadoop is designed to store and process large datasets extremely fast and in a fault-tolerant way.
· Hadoop uses HDFS (Hadoop Distributed File System) for storing data on clusters of commodity computers.
If any server goes down it knows how to replicate the data and there is no loss of data even in
hardware failure.
· Hadoop is an Apache-sponsored project and it consists of many software packages which run on top
of the Apache Hadoop system.
· Hadoop provides a set of tools and software for making the backbone of the Big Data analytics
system.
· Hadoop ecosystem provides necessary tools and software for handling and analyzing Big Data.
· On top of the Hadoop system, many applications can be developed and plugged in to provide an
ideal solution for Big Data needs.
Cloudera
· Cloudera is one of the first commercial Hadoop-based Big Data Analytics Platforms offering Big
Data solutions.
· Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera Data
Science & Engineering, and Cloudera Essentials.
· All these products are based on the Apache Hadoop and provide real-time processing and
analytics of massive data sets.
Website: https://www.cloudera.com
Amazon Web Services
· Amazon is offering a Hadoop environment in the cloud as part of its Amazon Web Services
package.
· The AWS Hadoop solution is a hosted solution which runs on Amazon's Elastic Compute Cloud (EC2) and
Simple Storage Service (S3).
· Enterprises can use Amazon AWS to run their Big Data processing analytics in the cloud
environment.
· Amazon EMR allows companies to set up and easily scale Apache Hadoop, Spark, HBase, Presto,
Hive, and other Big Data Frameworks using its cloud hosting environment.
Website: https://aws.amazon.com/emr/
Open Source Big Data Platform
There are various open-source Big Data platforms that can be used for Big Data handling and data
analytics in a real-time environment. Both small and big enterprises can use these tools for
managing their enterprise data to get the best value from it.
Apache Hadoop
· Apache Hadoop is a Big Data platform and software package which is an Apache-sponsored project.
· Under the Apache Hadoop project, various other software packages are being developed which run on
top of the Hadoop system to provide enterprise-grade data management and analytics solutions to
enterprises.
· Apache Hadoop includes an open-source distributed file system and provides a data processing and
analysis engine for analyzing large sets of data.
· Hadoop can run on Windows, Linux, and OS X operating systems, but it is mostly used on
Ubuntu and other Linux variants.
MapReduce
· The MapReduce engine was originally written by Google, and it is the system that enables
developers to write programs that can run in parallel on hundreds or even thousands of computer nodes to
process vast data sets.
· After processing the jobs on the different nodes, it gathers the results and returns them to the
program which executed the MapReduce job.
· This software is platform-independent and runs on top of the Hadoop ecosystem. It can
process tremendous amounts of data at very high speed in a Big Data environment.
Apache Storm
· Apache Storm is a software for real-time computing and distributed processing.
· It is free and open-source software developed at the Apache Software Foundation. It is a real-time,
parallel processing engine.
· Apache Storm is highly scalable and fault-tolerant, and it supports almost all
programming languages.
Apache Storm can be used in:
· Real-time analytics
· Online machine learning
· Continuous computation
· Distributed RPC
· ETL
· And all other places where real-time processing is required.
Apache Storm is used by Yahoo, Twitter, Spotify, Yelp, Flipboard, and many other data giants.
Website: http://storm.apache.org/
Apache Spark
· Apache Spark is software that runs on top of Hadoop and provides an API for real-time, in-memory
processing and analysis of large sets of data stored in HDFS (a small sketch follows this list).
· It stores the data in memory for faster processing.
· Apache Spark runs programs up to 100 times faster in memory and 10 times faster on disk as compared
to MapReduce.
· Apache Spark exists to speed up the processing and analysis of big data sets in a Big Data
environment.
· Apache Spark is being adopted very fast by businesses to analyze their data sets and get the real
value of their data.
· Website: http://spark.apache.org/
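A minimal PySpark sketch of the in-memory style described above (the file path and application name are placeholders, and it assumes pyspark and a Spark runtime are available): the dataset is cached after the first action so that later actions reuse it from memory instead of rereading it from disk:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    # Read a text file (e.g., from HDFS) into an RDD and keep it in memory.
    lines = spark.sparkContext.textFile("hdfs:///data/events.log").cache()

    total = lines.count()                                   # first action: reads and caches
    errors = lines.filter(lambda l: "ERROR" in l).count()   # reuses the in-memory data

    print(total, errors)
    spark.stop()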
SAMOA
· SAMOA stands for Scalable Advanced Massive Online Analysis.
· It is a system for mining Big Data streams.
· SAMOA is open-source software distributed on GitHub, which can also be used as a distributed machine
learning framework.
· Website: https://github.com/yahoo/samoa
Thus, the Big Data industry has been growing very fast since 2017, and companies are quickly moving their
data to Big Data platforms. There is a huge demand for Big Data expertise in the job market.
CHALLENGES OF CONVENTIONAL SYSTEMS
Conventional Systems
A conventional system here refers to a traditional data management system, such as a relational
database, designed for structured data and moderate data volumes.
· Big data is a huge amount of data that is beyond the processing capacity of conventional
database systems to manage and analyze the data in a specific time interval.
Difference between conventional computing and intelligent computing
· Conventional computing functions logically with a set of rules and calculations while neural
computing can function via images, pictures, and concepts.
· Conventional computing is often unable to manage the variability of data obtained in the real
world. On the other hand, neural computing, like our own brains, is well suited to situations that
have no clear algorithmic solution and is able to manage noisy, imprecise data. This allows it
to excel in those areas that conventional computing often finds difficult.
Comparison of Big Data with Conventional Data
Big Data                                                         | Conventional Data
Huge data sets.                                                  | Data set size in control.
Unstructured data such as text, video, and audio.                | Normally structured data such as numbers and categories, but can take other forms as well.
Hard-to-perform queries and analysis.                            | Relatively easy-to-perform queries and analysis.
Needs a new methodology for analysis.                            | Data analysis can be achieved by using conventional methods.
Needs tools such as Hadoop, Hive, HBase, Pig, Sqoop, and so on.  | Tools such as SQL, SAS, R, and Excel alone may be sufficient.
The aggregated or sampled or filtered data.                      | Raw transactional data.
Used for reporting, basic analysis, and text mining; advanced analytics is only at a starting stage in big data. | Used for reporting, advanced analysis, and predictive modeling.
Big data analysis needs both programming skills (such as Java) and analytical skills to perform analysis. | Analytical skills are sufficient for conventional data; advanced analysis tools don't require expert programming skills.
Petabytes/exabytes of data; millions/billions of accounts; billions/trillions of transactions. | Megabytes/gigabytes of data; thousands/millions of accounts; millions of transactions.
Generated by big financial institutions, Facebook, Google, Amazon, eBay, Walmart, and more. | Generated by small enterprises and small banks.
List of challenges of Conventional Systems
Big data is the storage and analysis of large data sets. These are complex data sets that can be both
structured and unstructured. They are so large that it is not possible to work on them with
traditional analytical tools.
The following challenges have been dominant in the case of conventional systems in real-time scenarios:
1) Uncertainty of Data Management Landscape
2) The Big Data Talent Gap
3) Getting data into the big data platform
4) Need for synchronization across data sources
5) Getting important insights through the use of Big data analytics
1) Uncertainty of Data Management Landscape
Because big data is continuously expanding, there are new companies and technologies that are
being developed every day. A big challenge for companies is to find out which technology works
bests for them without the introduction of new risks and problems.
2) The Big Data Talent Gap
While Big Data is a growing field, there are very few experts available in it. This is because
Big Data is a complex field, and people who understand its complexity and intricate nature
are few and far between.
3) Getting data into the big data platform
Data is increasing every single day. This means that companies have to tackle a limitless amount
of data on a regular basis. The scale and variety of data that is available today can overwhelm any
data practitioner and that is why it is important to make data accessibility simple and convenient
for brand managers and owners.
4) Need for synchronization across data sources
As data sets become more diverse, there is a need to incorporate them into an analytical platform.
If this is ignored, it can create gaps and lead to wrong insights and messages.
5) Getting important insights through the use of Big data analytics:
It is important that companies gain proper insights from big data analytics and it is
important that the correct department has access to this information. A major challenge in big data
analytics is bridging this gap in an effective fashion.
Three other challenges of conventional systems
The three challenges that big data faces are:
1. Data
2. Process
3. Management
1. Data Challenges
Volume
1. The volume of data, especially machine-generated data, is exploding.
2. That data is growing rapidly every year, with new sources of data emerging.
3. For example, in the year 2000, 800,000 petabytes (PB) of data were stored in the world, and this was
expected to reach 35 zettabytes (ZB) by 2020 (according to IBM).
Social media plays a key role: Twitter generates 7+ terabytes (TB) of data every day. Facebook,
10 TB. Mobile devices play a key role as well, as there were an estimated 6 billion mobile phones
in 2011.
The challenge is how to deal with the size of Big Data.
Variety, Combining Multiple Data Sets
More than 80% of today’s information is unstructured and it is typically too big to manage
effectively. Today, companies are looking to leverage a lot more data from a wider variety of
sources both inside and outside the organization.
Things like documents, contracts, machine data, sensor data, social media, health records, emails,
etc. The list is endless really.
A lot of this data is unstructured or has a complex structure that’s hard to represent in rows and
columns.
2. Processing
Processing such a variety (heterogeneous) of data from various sources is a challenging task.
3. Management
· A lot of this data is unstructured, or has a complex structure that’s hard to represent in rows and
columns.
Schema-less and Column-oriented Databases (NoSQL)
We have been using table- and row-based relational databases for years; these databases are just
fine for online transactions and quick updates. When a large amount of unstructured data
comes into the picture, we need databases without a hard-coded schema attached.
There are a number of databases that fit into this category; these databases can store unstructured,
semi-structured, or even fully structured data.
Apart from other benefits, the finest thing about schema-less databases is that they make data
migration very easy. MongoDB is a very popular and widely used NoSQL database these days.
NoSQL and schema-less databases are used when the primary concern is to store a huge amount
of data and not to maintain relationships between elements. “NoSQL (not only SQL) is a type
of database that does not primarily rely upon the schema-based structure and does not use SQL
for data processing.”
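A brief sketch of this schema-less style using MongoDB's Python driver (the connection URI, database, and documents are hypothetical, and a running MongoDB instance is assumed): documents in the same collection can carry different fields, with no schema declared up front:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # assumes a local MongoDB instance
    events = client["demo_db"]["events"]

    # No schema is declared: each document simply carries whatever fields it has.
    events.insert_one({"type": "click", "page": "/home", "user": "u1"})
    events.insert_one({"type": "sensor", "temperature": 21.4, "unit": "C"})

    for doc in events.find({"type": "click"}):
        print(doc)

    client.close()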
The traditional approach works on structured data that has a basic layout and a provided structure.
The structured approach designs the database as per the requirements, in tuples and columns.
Live incoming data, which can be input from an ever-changing scenario, cannot be dealt with
in the traditional approach. The Big Data approach is iterative.
Big data analytics work on unstructured data, where no specific pattern of the data is defined.
The data is not organized in rows and columns. The live flow of data is captured and the analysis
is done on it.