BIG DATA ANALYSIS, Amount of Data Created Each Day on the Internet, HDFS
Big data is a term that describes the large volume of data — both structured and unstructured — that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
BIG DATA HISTORY:
The term “big data” refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around a long time. But the concept of big data gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V’s:
Volume: Organizations collect data from a variety of sources, including business transactions, smart (IoT) devices, industrial equipment, videos, social media and more. In the past, storing it would have been a problem, but cheaper storage on platforms like data lakes and Hadoop has eased the burden.
Velocity: With the growth of the Internet of Things, data streams into businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in near-real time.
Variety: Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audio, stock ticker data and financial transactions.
Two more Vs have emerged over the past few years: variability and veracity.
Variability:
In addition to the increasing velocities and varieties of data, data flows are unpredictable — changing often and varying greatly. It’s challenging, but businesses need to know when something is trending in social media, and how to manage daily, seasonal and event-triggered peak data loads.
Veracity:
Veracity refers to the quality of data. Because data comes from so many different sources, it’s difficult to link, match, cleanse and transform data across systems. Businesses need to connect and correlate relationships, hierarchies and multiple data linkages. Otherwise, their data can quickly spiral out of control.
Why Is Big Data Important?
The importance of big data doesn’t revolve around how much data you have, but around what you do with it. Analyzed effectively, it can help with tasks such as:
- Determining root causes of failures, issues and defects in near-real time.
- Generating coupons at the point of sale based on the customer’s buying habits.
- Recalculating entire risk portfolios in minutes.
- Detecting fraudulent behavior before it affects your organization.
Big Data Challenges:
First, big data is…big. Although new technologies have been developed for data storage, data volumes are doubling in size about every two years. Organizations still struggle to keep pace with their data and find ways to effectively store it.
Second, it’s not enough to just store the data; it must also be curated and put to use to be valuable. Finally, big data technology is changing at a rapid pace.
A few years ago, Apache Hadoop was the popular technology used to handle big data. Then Apache Spark was introduced in 2014. Today, a combination of the two frameworks appears to be the best approach. Keeping up with big data technology is an ongoing challenge.
HADOOP:
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
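To make the "simple programming models" concrete, below is a sketch of the classic word-count job written against Hadoop's Java MapReduce API (essentially the standard tutorial example): the mapper emits (word, 1) pairs, the reducer sums the counts for each word, and the framework takes care of splitting the input, scheduling tasks across the cluster and retrying failures. Input and output paths are assumed to be supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for every input line, emit (word, 1) for each word it contains.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this would typically be submitted with something along the lines of `hadoop jar wordcount.jar WordCount /input /output`.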
The Amount of Data Created Each Day on the Internet:
Given how much data is on the internet, the actual amount of data used is difficult to calculate.
But if we’re talking about how much data is created every day, the current estimate stands at about 1.145 trillion MB.
Here are some approximate figures:
1. 1.7 MB of data was created every second by every individual throughout 2020.
2. 2.5 quintillion bytes of data are produced by humans every day.
3. There are 4.57 billion active internet users around the world.
4. 58.7% of people around the world have access to the internet.
5. Google processes over 3.5 billion search queries every day. That works out to more than 1.2 trillion searches per year worldwide, or over 40,000 search queries per second!
Social Media Usage Statistics:
1. 350 million photos are uploaded to Facebook each day.
2. Every day, 306.4 billion emails are sent and around 682 million tweets are posted.
3. Facebook generates 4 petabytes of data every day.
Video Growth Statistics:
1. A 480p video on YouTube uses 8.3 MB per minute and about 500 MB per hour.
2. A 480p video on Twitch uses between 0.405 GB and 0.54 GB per hour.
3. One hour of sending and receiving Snapchat messages uses around 160 MB of data.
Communication Statistics:
1. One text message uses the equivalent of only 0.0001335 MB of data (roughly 140 bytes).
2. Sharing a message on WhatsApp usually uses only a few kilobytes of data.
3. Expect to use between 0.5 MB and 1.3 MB per minute for a VoIP call.
Big Data Growth Statistics:
Data growth statistics in 2020 tell us that big data is growing at an unprecedented rate: the majority of the world’s data has come about in only the past two years, and machine-generated data will account for about 40% of internet data this year.
1. In the last two years alone, 90% of the world’s data has been created.
2. By 2020, 44 zettabytes will make up the entire digital universe.
3. Machine-generated data will account for over 40% of internet data in 2020, which means roughly 60% of the data generated on the internet will still come from humans.
IoT Growth Statistics:
IoT is showing no sign of slowing down. In fact, the industry is booming. As the number of IoT devices increases, the number of active users and subscriptions increases too.
1. By 2023, there are expected to be around 1.3 billion IoT subscriptions.
2. Last year, the number of active IoT devices was 26.66 billion.
3. In 2020, 31 billion IoT devices will exist.
FACEBOOK STATISTICS:
- Users spend nearly one hour a day on Facebook, but Instagram and Snapchat are quickly catching up.
- Since 2013, the number of Facebook posts shared each minute has increased by about 20%, from 2.5 million to 3 million posts per minute in 2016. Compared with 2011, when around 650,000 posts were shared per minute, that is an increase of more than 300 percent!
- Every minute on Facebook: 510,000 comments are posted, 293,000 statuses are updated, and 136,000 photos are uploaded.
- There are over 38,000 status updates on Facebook every minute.
- Facebook users also click the like button on more than 4 million posts every minute, and the like button has been pressed 13 trillion times in total.
- There are over 2 billion monthly active Facebook users, compared to 1.44 billion at the start of 2015 and 1.65 billion at the start of 2016.
- Facebook has 1.58 billion daily active users on average as of Q2 2019.
- 4.3 billion Facebook messages are posted daily!
- 5.76 billion Facebook likes are given every day.
HDFS:
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.
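As a minimal sketch of client access, the snippet below opens a file stored in HDFS through Hadoop's Java FileSystem API and streams its contents to standard output. The path /user/demo/input.txt is purely illustrative, and the program assumes the cluster configuration (fs.defaultFS in core-site.xml) points at the NameNode.

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // the file system named by fs.defaultFS
    Path file = new Path(args.length > 0 ? args[0] : "/user/demo/input.txt");

    // The client asks the NameNode for metadata, then streams the
    // file's blocks from the DataNodes that hold them.
    try (InputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // copy to stdout; keep stdout open
    }
  }
}
```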
Assumptions and Goals
Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS.
Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.
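Below is a minimal sketch of that write-once-read-many pattern, using the same Java FileSystem API as above: the file is created, written and closed exactly once, and from then on it is only opened for reading. The path and the record contents are illustrative.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/user/demo/events/part-00000");

    // Write phase: the file is created, written and closed once.
    // After close() it is not modified in place.
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeBytes("event-1\n");
      out.writeBytes("event-2\n");
    }

    // Read phase: the closed file can now be read any number of times,
    // by this client or by others (e.g. MapReduce tasks).
    int len = (int) fs.getFileStatus(path).getLen();
    byte[] buf = new byte[len];
    try (FSDataInputStream in = fs.open(path)) {
      in.readFully(buf);
    }
    System.out.print(new String(buf, StandardCharsets.UTF_8));
  }
}
```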
“Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
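The snippet below sketches what those interfaces look like on the file-system side: for a given file, the Java FileSystem API can report which hosts store each block, which is the kind of information a scheduler (for example, MapReduce's input-split computation) uses to place tasks on or near the DataNodes that already hold the data. The file path is hypothetical.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/user/demo/big-dataset.csv");

    FileStatus status = fs.getFileStatus(path);
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());

    // Each entry says which byte range of the file a block covers and which
    // hosts store a replica of it; computation can be scheduled on those hosts.
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          Arrays.toString(block.getHosts()));
    }
  }
}
```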
Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.