The ‘Volume’ characteristic of Big Data

In this article, I’ll analyze Big Data from the perspective of high-velocity capture, storage, processing, and visualization of large volumes of data. I’ll provide examples of applications that create or collect massive amounts of data, briefly explain the process, and expand on some of the challenges that engineers and developers face when analyzing and visualizing Big Data volumes. Also, if you like this article, don’t forget to check out my post on The Significance of Big Data Lifecycle Management (BDLM).

What is Big Data?

Before I move on to the main topic of this article, let’s first describe what the term ‘Big Data’ means. The definition of Big Data is often a convoluted subject, and those new to the field may even feel that the term lacks a proper definition. That is usually because most definitions of Big Data are either very dense or hard to grasp.

One of the densest attempts at defining the term is the so-called ‘improved’ Gartner definition: ‘Big Data (Data Intensive) Technologies are targeting to process high-volume, high velocity, high-variety data (sets/assets) to extract intended data value and ensure high veracity of original data and obtained information that demand cost-effective, innovative forms of data and information processing (analytics) for enhanced insight, decision-making, and processes control; all of those demand (should be supported by) new data models (supporting all data states and stages during the whole data lifecycle) and new infrastructure services and tools that allows also obtaining (and processing data) from a variety of sources (including sensor networks) and delivering data in a variety of forms to different data and information consumers and devices’, as cited by Gandomi and Haider in the International Journal of Information Management (Gandomi & Haider, 2015).

As most can tell, this definition never stood a chance of becoming even remotely popular with the general public. As a matter of fact, it scores an incredible -73.8 on the Flesch reading-ease scale, indicating that even a Ph.D. graduate would have difficulty reading and understanding it. So it’s no wonder that Gartner eventually decided to rework its definition of Big Data. And they did an excellent job, as the new definition is possibly one of the best I’ve seen yet. It reads as follows:

**“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”** (Gartner, 2016).
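
As an aside, the -73.8 readability score mentioned above comes from the Flesch reading-ease formula, which penalizes long sentences and long words. The following is a minimal sketch of that formula; the naive syllable counter is my own assumption for illustration, so its scores will differ slightly from those of dedicated readability tools.

```python
import re

def count_syllables(word):
    """Naive syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch reading ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

# A few short sentences score far higher (easier) than one long, dense sentence.
print(flesch_reading_ease("Big data is data that is too big to handle. It needs new tools."))
print(flesch_reading_ease("Big Data technologies are targeting to process high-volume, "
                          "high-velocity, high-variety data sets to extract intended data value."))
```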

So, as we can see, much of it comes down to the author’s choice of words. This shouldn’t discourage us: the topic of Big Data, while indeed quite complicated, isn’t as intricate as definitions often portray it.

In an effort to make the topic more approachable, the Big Data industry came up with various ways to bring it down a notch and eventually settled on the so-called 6 V’s of Big Data. This is my take on explaining the 6 V’s in plain English:

Volume - there is a lot of it (it refers to the scale of data); this is the main topic of this article.

Variety - there are many different types of data.

Veracity - the data cannot always be trusted to truly represent things (data uncertainty).

Velocity - there is typically a lot of data coming in all at once.

Variability - the data is constantly changing.

Value - value is extracted from the data by using various analysis methods.

To add a little of my own perspective, the following diagram is my personal take on illustrating the main characteristics and basic features of Big Data, often also called the ‘V’s of Big Data.

When I first came across the Big Data definition, it specified only 5+1 V’s. However, data Visualization is, in my view, one of the most important ways to bring data to the consumer. “Big data is set to offer companies tremendous insight and data visualization is becoming an increasingly important component of analytics in the age of big data.” (SAS Institute Inc., 2013).

I am referring to visual charts, graphs, maps, and reports that essentially make the outcomes of data processing more readable. It is, in my opinion, an indispensable part of Big Data, and the following is my way of illustrating that fact.

Big Data - Volume

Now that we’ve covered the general meaning of Big Data, let’s discuss ‘Volume’, which is practically synonymous with Big Data and is, as a matter of fact, also its most distinctive characteristic.

Billions of people are connected to the internet today. We’re using millions of devices and sensors (a modern car has more than 100 of them) and thus generating exponentially larger volumes of data. “Like the physical universe, the digital universe is large – by 2020 containing nearly as many digital bits as there are stars in the universe. It is doubling in size every two years, and by 2020 the digital universe – the data we create and copy annually – will reach 44 zettabytes or 44 trillion gigabytes.” (EMC2, 2014).

[Figure: the 50-fold growth of the digital universe – https://mediareset.files.wordpress.com/2013/01/du-50-fold-growth-lg.jpg (Gray, 2013)]

In 2016, we already had 18.9 billion active network connections, which is about 2.5 connections for every person on Earth.

Wrapping your head around such numbers is hard. Try to imagine one billion hours of TV shows and movies streamed from Netflix per month; that alone seems unfathomable. Let alone the data collected at 2.8 gigabytes per second by the ASKAP radio telescope, the 5 petabytes of data per month generated by the Large Hadron Collider (LHC), or similar scientific instruments.
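
To put those rates side by side, here is a quick back-of-the-envelope conversion (my own arithmetic, assuming a 30-day month and decimal units) comparing the ASKAP ingest rate with the LHC’s quoted monthly output:

```python
# Back-of-the-envelope: convert ASKAP's per-second rate into a monthly volume
# and compare it with the LHC's quoted 5 PB per month. Assumes a 30-day month.
askap_gb_per_second = 2.8
seconds_per_month = 60 * 60 * 24 * 30                  # 2,592,000 seconds
askap_pb_per_month = askap_gb_per_second * seconds_per_month / 1_000_000  # 1 PB = 1,000,000 GB

print(f"ASKAP: ~{askap_pb_per_month:.1f} PB/month")    # ~7.3 PB/month
print("LHC:   ~5 PB/month")
```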

Nevertheless, these are just some of the applications that produce and need to consume vast amounts of data. There is an explosion of data volumes in every field; as a matter of fact, more data has been generated in the past 48 months than in the entire prior history of humanity. Here are some facts about Big Data and the volume of data, as compiled by Pat Stricker, RN, MEd, Senior Vice President of TCS Healthcare Technologies (Stricker, 2016):

- Close to 1.5 billion smartphones, all proficient at collecting data, were shipped in 2015.

- By 2020, about 1.7 MB of new information will be created every second for every person on Earth.

- About 40 thousand Google search queries are submitted every second, which works out to over 1 trillion Google searches per year.

- Up to 300 hours of video are uploaded to YouTube every minute.

- There will be over 6 billion smartphones by the year 2020.

- Within five years, there will be over 50 billion smart connected devices.

- In August 2015, Facebook crossed a major threshold when over 1 billion people used the service in a single day. Users send an average of 31.25 million messages and view 2.77 million videos every minute.

- Estimates suggest that healthcare could save as much as $300 billion a year by better integrating big data. That’s a savings of $1,000 a year for every man, woman, and child.

We would all agree that this is a truly mind-boggling volume of data. However, even more mind-boggling than the astounding amount of data we generate is the outcome of the recent Digital Universe study, which ‘finds that 0.5% of global data is analyzed, and only half of data requiring security measures is protected’ (Burn-Murdoch, 2012).

Jeremy Burton, Executive Vice President of Product Operations and Marketing for EMC, recently stated that "As the volume and complexity of data barraging businesses from all angles increases, IT organizations have a choice: they can either succumb to information-overload paralysis, or they can take steps to harness the tremendous potential teeming within all of those data streams" (Burn-Murdoch, 2012).

The volumes of data we collect are, based on current models, predicted to reach 5.2 GB per person within the next three years (by 2020). Most of us would agree that we don’t want to ‘succumb to information overload’; in fact, there is tremendous untapped potential in front of us.

However, it’s also not hard to imagine the problems posed by such enormous data volumes, and those problems are currently the main topic of Big Data Volume research, as well as of endless academic discussions.

The questions that generate the most discussion around Big Data Volume are mainly technical ones: how do we store, process, analyze, and extract value from such vast amounts of largely unused data?

Dealing with the ‘Volume’ characteristic of Big Data

The following is my take on a high-level summary of the technical issues related to dealing with large amounts of data.

As we can imagine, when it comes to the volume of data, we won’t solve the problem just by purchasing a new, large-enough hard drive. Although for some businesses that may be the case, Big Data in general is not defined by having an extensive database stored on a large disk. We’re referring to a much wider technology environment, coined the Big Data Ecosystem (BDE), which covers all the interconnected parts, ranging from the required infrastructure to the data itself.

When it comes to Big Data volumes, the field is entering a new realm in which many existing technologies need to be reconsidered, as they no longer fit the purpose. Simply put, we need better processes and solutions to address large data volumes.

Let’s illustrate the challenge a bit by looking at the Google search engine. Google answers approximately 40 thousand searches per second; that’s over 1 trillion searches per year. How would you go about analyzing the log files for these search queries? Keep in mind that the searches come from a multitude of Google domains (google.com, google.ca, google.co.uk, etc.) and are stored in Google’s data centers distributed all over the globe, arriving in log files at a rate of 40 thousand per second. And that’s just searches; what about the 100 billion web pages Google stores on its drives, the petabytes of YouTube videos, and so on? “Gross total estimate of all data Google saved by 2016 is at approximately 10 exabytes.” (Vandecauter, 2016).
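
To make the distributed-aggregation idea concrete, here is a minimal toy sketch of the map-reduce pattern that systems such as Hadoop or Google’s own MapReduce are built around (this is my illustration, not Google’s actual pipeline, and the log format is invented): each data center produces partial per-query counts from its local logs, and a reduce step merges them into global totals.

```python
from collections import Counter

# Toy log lines as they might arrive from different data centers.
# Assumed format for illustration: "timestamp domain query"
logs_by_datacenter = {
    "us-east": ["2016-01-14T10:00:01 google.com big data",
                "2016-01-14T10:00:01 google.com weather"],
    "eu-west": ["2016-01-14T10:00:02 google.co.uk big data"],
}

def map_phase(log_lines):
    """Map step: turn raw log lines into partial (query -> count) tallies."""
    counts = Counter()
    for line in log_lines:
        _, _, query = line.split(" ", 2)
        counts[query] += 1
    return counts

def reduce_phase(partial_counts):
    """Reduce step: merge the partial tallies from every data center."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

partials = [map_phase(lines) for lines in logs_by_datacenter.values()]
print(reduce_phase(partials).most_common(3))   # [('big data', 2), ('weather', 1)]
```

The real thing adds sharding, fault tolerance, and scheduling across thousands of machines, but the split into a parallel map step and a merging reduce step is the core idea.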

And that’s a fundamental challenge, because it’s not just Google that has this problem. Thinking about the topic, those V’s of Big Data come to mind again, as all of them eventually relate to dealing with large volumes of data; but those we’ve already covered earlier. There are three fundamental issue areas that need to be addressed in dealing with big data: storage issues, management issues, and processing issues (Katal, Wazid, & Goudar, 2013). I’ll concentrate on some of the more specific issues that come to mind (this could be an evolving topic):

**Infrastructure** – Deals with the questions of building infrastructure that supports the storage, velocity, and processing of such sheer amounts of data. Some current technologies use a grid approach, where many computers are employed to counter some of these issues, but this approach comes with its own disadvantages. We need to build hardware that has enough memory (such as new in-memory technology) and can keep up with high-velocity data, as well as hardware that supports parallel processing, so it can crunch through enormous volumes of data and do so swiftly.

**Data Management** – This is a very complex topic, basically outlining the need for frameworks and centralized systems that allow administrators as well as other stakeholders to manage the data that is collected.

**Data Distribution** – How do we access, search, process, and manage distributed data volumes that live in multiple locations and are often subject to local policies governing how the data can be used?

**Data Quality** – Even if our infrastructure allowed us to store all the information at once, it’s more practical from a cost, processing, and analysis perspective to make sure the data is accurate and contains trustworthy information. We need autonomous tools and mechanisms that help us in this regard and that, based on preconfigured settings, adjust the information automatically (see the short sketch after this list).

**Data Variation** – How do we correlate formless, unstructured data with highly structured historical data, and how do we prevent instability in such a processing and analytics system?

**Data Security** – One of the most common issues with data is its security. We need technologies that can guarantee the trustworthiness and authenticity of our data: a system that supports the distributed nature of the data.

**Availability** – On a big, distributed data set, how do we achieve availability and timeliness, and how do we identify the data we need?

**Analysis and Visualization** – The issues of analyzing the data are among the most complex. Even if we had all of the problems mentioned above resolved, we would still need better, predictive methods to obtain Value from the data. Resolving all of the above issues would amount to nothing if businesses cannot enhance their methods, tools, and products by deriving new insight from the data they collect.
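
As promised under Data Quality above, here is a minimal sketch of the kind of automated, rule-driven check I have in mind. The field names, ranges, and ‘drop the value’ policy are hypothetical, purely for illustration.

```python
# Hypothetical example: automatically flag or fix records against
# preconfigured quality rules before they are stored for analysis.
RULES = {
    "temperature_c": {"min": -90.0, "max": 60.0},   # plausible sensor range
    "humidity_pct":  {"min": 0.0,   "max": 100.0},
}

def clean_record(record):
    """Return (cleaned_record, issues) for a single incoming reading."""
    issues = []
    cleaned = dict(record)
    for field, rule in RULES.items():
        value = record.get(field)
        if value is None:
            issues.append(f"missing {field}")
        elif not (rule["min"] <= value <= rule["max"]):
            issues.append(f"{field}={value} out of range")
            cleaned[field] = None            # drop implausible values
    return cleaned, issues

record, issues = clean_record({"temperature_c": 412.0, "humidity_pct": 55.0})
print(record)   # {'temperature_c': None, 'humidity_pct': 55.0}
print(issues)   # ['temperature_c=412.0 out of range']
```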

The Future

While Google has almost 10 exabytes stored, that’s nothing compared to what we collectively generate worldwide. Thus, if we want to put this information to good use, we will need innovative new technologies not only to store the data but, most importantly, new ways to analyze and interpret it.

Looking at the improvements made in the field of analysis, one can already tell that one of the most pressing issues is the lack of metadata and the deficit of structure in the datasets.

These are issues that cannot easily be resolved manually after collection, nor easily by humans at all. Interestingly enough, this has spurred the growth of machine learning and of data science as a field. Almost everywhere we look there is an abundance of disorganized data from which we are unable to gather any insights, and it seems that using artificial intelligence (AI) to analyze the datasets and make the connections is the way to go.

In its report, IDC predicted that healthcare and manufacturing will be the biggest drivers of cognitive computing and AI revenues between now and 2020, while the education sector will also invest heavily in such technologies. And earlier this month, Tony Baer, principal analyst in information management at Ovum, predicted that machine learning in particular "will be the biggest disruptor for big data analytics in 2017." That trend will also make it increasingly important for organizations to treat data science as a "team sport," he added (Big Data Trends, 2017).

IBM's big data evangelist James Kobielus recently said that "People who can design AI-powered products that combine robotics, embodied cognition, IoT fog computing, deep learning, predictive analytics, emotion analytics, geospatial contextualization, conversational engagement and wearable form factors will be in hot demand." (Network, N., 2016).

In fact, since most of the data collected today is in an unstructured format and conventional business intelligence methods are utterly inadequate for it, AI has a virtual monopoly when it comes to analysis. Dynamic algorithms assisted by machine learning are currently among the best ways to perform timely analytics on enormous data volumes, and while not perfect, we can already see that AI helps a lot with data analysis.

One example of such AI development is DreamQuark, which creates artificial intelligence dedicated to healthcare and insurance. They develop state-of-the-art deep learning technology that allows them to analyze a wide variety of data types (images, texts, sounds, etc.), helping healthcare and insurance actors to invent better prevention, diagnosis, and care systems.

Another good example is the use of artificial intelligence to identify fake news in news collections. "Among the many ‘V’s’ that characterize big data (volume, variety, and velocity being the most familiar), we have now the added challenge of data veracity. Fake news, after all, is in essence a big data veracity challenge. It doesn’t matter how well we move, process, or secure our information if our information is simply incorrect." (Bloomberg, 2017).

One great example of implementing AI to detect fake news is Facebook’s approach. They are training a system to identify fake news based on what sorts of articles people have flagged as misinformation in the past, and they later use these patterns to find bogus news stories. “The most important thing we can do is improve our ability to classify misinformation,” Zuckerberg explains. “This means better technical systems to detect what people will flag as false before they do it themselves.”
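
As a rough illustration of this pattern-learning idea, the sketch below trains a supervised text classifier on articles users have flagged and then scores a new story. It is my own toy example with invented headlines and labels, not Facebook’s actual system, and it assumes scikit-learn is available.

```python
# Toy sketch: learn from user-flagged articles, then score new ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

headlines = [
    "Scientists publish peer-reviewed study on climate data",
    "Miracle cure doctors don't want you to know about",
    "Central bank releases quarterly inflation report",
    "Shocking secret celebrity hoax revealed by anonymous source",
]
flagged = [0, 1, 0, 1]   # 1 = users flagged the article as misinformation

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(headlines, flagged)

new_story = ["Anonymous source reveals shocking miracle secret"]
print(model.predict_proba(new_story)[0][1])   # estimated probability of being flagged
```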

So, in my opinion, using AI technology to build these patterns, together with machine learning, seems to be the current answer: we can use it to learn new behaviors and continually improve pattern recognition, which at the end of the day improves big data analysis.

Conclusion

Big Data is already resolving many real-world issues. Not so long ago, UPS implemented a system that informs them when vehicle parts need to be replaced. It was a huge challenge, but they eventually built a predictive, proactive system that ensures their vehicles are in good shape. To do so, they collect data from hundreds of sensors on each of their vehicles and use Big Data algorithms to analyze the data and predict when a part is likely to break. This has “saved UPS millions in maintenance costs” (Satell, 2013).

However, it's important to note that we can have as much data as we want, but if we're not able to understand it, then for all practical purposes it's useless to us. This is a recognized problem, and there is an entire phase of the Big Data lifecycle that speaks directly to this issue: as we collect a dataset and perform quality assurance on it, and before we discover, integrate, and analyze, we need to describe the data. To do that, we use a set of data called metadata, which describes and gives information about our datasets.
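
To make the idea concrete, here is a minimal, hypothetical metadata record for a sensor dataset. The fields shown (units, collection method, format) are purely illustrative and do not follow any particular standard.

```python
import json

# Hypothetical metadata record describing a raw sensor dataset.
dataset_metadata = {
    "dataset_id": "vehicle-telemetry-2017-01",          # invented identifier
    "description": "Engine temperature readings from delivery vehicles",
    "collected_by": "on-board diagnostic sensors",
    "collection_interval_seconds": 10,
    "units": {"engine_temp": "degrees Celsius", "speed": "km/h"},
    "format": "CSV, UTF-8, one reading per row",
    "created": "2017-01-16",
    "license": "internal use only",
}

print(json.dumps(dataset_metadata, indent=2))
```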

The problem is that most of the data collected today comes from such a variety of devices that it's nearly impossible to govern. Naturally, this raises the question of standardizing big data: the need to introduce a data collection standard with accurate, well-described ways to collect each type of data.

The United Nations' International Telecommunication Union (ITU) agency, which promotes global cooperation in a variety of technical areas, appears to be on top of the issue. In December 2015, it announced its first-ever standard for big data. "This new ITU standard provides internationally agreed fundamentals of cloud-based big data," said Chaesub Lee, director of the ITU's Telecommunication Standardization Bureau. "It will build cohesion in the terminology used to describe cloud-based big data and offer a common basis for the development of big data services and supporting technical standards." (Noyes, 2015)

Outlined in the ITU's new report are recommendations and requirements for visualization, analysis, security, and storage, among other areas, as well as standards for data collection. However, it's one thing to come up with a norm and an entirely different issue to promote and govern it. How do we get the myriad hardware and software manufacturers on board? The process goes hand in hand with educating the industry about the benefits of standardized big data collection; otherwise, we'll be in trouble very quickly (especially considering that the amount of collected data is doubling every two years). While adopting a standard may sound like a trivial thing to do, it's not simple.

That said, even without a standard for describing our data, we unquestionably need to implement metadata. Anyone or anything that produces and collects data must apply a data-specific description method and include information that allows us to understand what the collected data represents. Without knowing how and where the parameters were measured or generated, what units of measurement were used, and what formats the dataset follows, we will be in the dark, lose efficiency and accuracy, and end up with uncertainty.

Unfortunately, even at my place of work, we often don't do this. We collect terabytes of data every day, but it's never used, simply because we don't have metadata that describes the information to the level of detail needed for analysis. This alone speaks to the importance of BDLM.

We have a long way ahead of us when it comes to Big Data. But an estimated 4.4 million IT jobs have already been created to support big data, 2 million of them in the United States alone, and software as well as hardware engineers are steadily resolving many of the issues outlined in this article.

References

Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144. (Accessed: 13 January 2017).

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity. (Accessed: 13 January 2017).

McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J., & Barton, D. (2012). Big data. The management revolution. Harvard Bus Rev, 90(10), 61-67. (Accessed: 13 January 2017).

Katal, A., Wazid, M., & Goudar, R. H. (2013). Big data: issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE. (Accessed: 13 January 2017).

Gartner (2016) What is big data? - Gartner IT glossary - big data. Available at: http://www.gartner.com/it-glossary/big-data/ (Accessed: 14 January 2017).

EMC2 (2014) The digital universe of opportunities: Rich Data and the increasing value of the Internet of things. Available at: https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm (Accessed: 14 January 2017).

Stricker, P. (2016) Zettabytes and other interesting ‘big data’ facts. Available at: http://www.naylornetwork.com/cmsatoday/articles/index-v2.asp?aid=367612&issueID=39069 (Accessed: 14 January 2017).

SAS Institute Inc. (2013) Five big data challenges and how to overcome them with visual analytics. Available at: https://www.sas.com/resources/asset/five-big-data-challenges-article.pdf (Accessed: 14 January 2017).

Gray, S. (2013) 50x current information = lots more disruption. Available at: https://mediareset.com/2013/01/29/50x-current-information-lots-more-disruption/ (Accessed: 14 January 2017).

Taylor, C. and TIBCO (2013) Three enormous problems big data tech solves. Available at: https://www.wired.com/insights/2013/08/three-enormous-problems-big-data-tech-solves/ (Accessed: 14 January 2017).

Vandecauter, M. (2016) How big is Google’s database?. Available at: https://www.quora.com/How-big-is-Googles-database (Accessed: 14 January 2017).

Satell, G. (2013) Yes, big data can solve real world problems. Available at: http://www.forbes.com/sites/gregsatell/2013/12/03/yes-big-data-can-solve-real-world-problems/#6542d5f9298f (Accessed: 14 January 2017).

Burn-Murdoch, J. (2012) Study: Less than 1% of the world’s data is analysed, over 80% is unprotected. Available at: https://www.theguardian.com/news/datablog/2012/dec/19/big-data-study-digital-universe-global-volume (Accessed: 14 January 2017). 

Noyes, K. (2015) Big data gets its first official standard at the ITU. Available at: http://www.computerworld.com/article/3017164/cloud-computing/big-data-gets-its-first-official-standard-at-the-itu.html (Accessed: 16 January 2017).

Big Data Trends (2017) Available at: http://www.cio-today.com/article/index.php?story_id=1000037XTAKS (Accessed: 17 January 2017).

Network, N. (2016) How will big data evolve in the year ahead? | NewsFactor network. Available at: http://www.newsfactor.com/story.xhtml?story_id=011000CXO2JD (Accessed: 17 January 2017).

Bloomberg, J. (2017) Fake news? Big data and artificial intelligence to the rescue. Available at: http://www.forbes.com/sites/jasonbloomberg/2017/01/08/fake-news-big-data-and-artificial-intelligence-to-the-rescue/#6562a7617a21 (Accessed: 17 January 2017).