Impact of Big Data Volume in the Context of Distributed Data, Scalability, Data Access and Storage

Volume is the most characteristic property of Big Data, and to a large extent it affects the other five V’s of Big Data, namely Velocity, Variety, Veracity, Variability, and Value. In this article, I explore Volume’s impact primarily in the context of distributed data and scalability, and of data access and storage, as well as its impact on data transfer.

Big Data Volume

Eric Schmidt, the former CEO of Google, stated in 2010 that “Every 2 days we create as much information as we did up to 2003.” (Siegler, M., 2010)

Now, in 2017, most businesses (not just the IT sector) must deal with enormous quantities of data. The growing volume of data brings with it numerous challenges, mainly in the areas of performance and capacity.

In this paper, I name a few of the ways Big Data Volume makes its impact felt.

Distributed Data

When we use the term ‘distributed data,’ we are mainly referring to a network in which data is kept in multiple locations, whether individual computers or database nodes and clusters, and is frequently replicated to maintain a balance between performance and efficient use of storage space.

The size of the data is a critical factor when it comes to distributed data. Numerous commercial and technical solutions require large amounts of data to be processed effectively. “This has contributed to the big data problem faced by the industry due to the inability of conventional database systems and software tools to manage or process the big data sets within tolerable time limits.” (Patel, A. B., Birla, M., & Nair, U., 2012)

Until recently, one of the most common methods of keeping large amounts of data on disk was to store it in RAID arrays. However, this eventually turned out not to be the best fit for Big Data, especially because RAID systems provided neither the best data resilience nor the best overall data access and retrieval performance. The reason for this is simple: the more drives we involve in the process, the larger the overall latency measured in every segment. Moreover, as the amount of information we need to store grows over time, the rate of drive failures increases as well. This was perfectly summarized in IBM’s Performance and Capacity Implications for Big Data, where the authors state that “a system with a billion cores has an MTBF of one hour. The failure of a particular cluster node affects the overall calculation work of the large infrastructure that is required to process big data transactions.” (Jewell et al., 2014)
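
To make the scale of this reliability problem concrete, here is a minimal back-of-the-envelope sketch in Python (my own illustration, not taken from the IBM paper). Assuming independent component failures, the aggregate mean time between failures of a cluster is roughly the MTBF of a single component divided by the number of components, which is why very large clusters must treat failure as a routine event:

```python
# Back-of-the-envelope reliability sketch. Illustrative assumption: component
# failures are independent, so cluster MTBF ~ component MTBF / number of components.

def cluster_mtbf_hours(component_mtbf_hours: float, num_components: int) -> float:
    """Approximate aggregate MTBF of a cluster of identical components."""
    return component_mtbf_hours / num_components

# Example with a hypothetical 1,000,000-hour per-drive MTBF rating.
for drives in (100, 10_000, 1_000_000):
    mtbf = cluster_mtbf_hours(1_000_000, drives)
    print(f"{drives:>9} drives -> cluster MTBF of roughly {mtbf:,.1f} hours")
```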

So how was this major issue, tied to the Volume of Big Data, eventually overcome?

Ultimately, as the problems of big data became more pressing, the giant IT companies (namely Google and Yahoo) came up with their own ways of dealing with the issue. They built large networks of commodity hardware and, to deal with the known performance issues, they also invented their own distributed file systems.

GOOGLE - Google developed the MapReduce programming model (originally a proprietary Google technology) and invented a distributed file system they named GFS (Google File System). GFS is a proprietary distributed file system created by Google to provide an efficient as well as dependable way to work with data stored in massive clusters of commodity hardware. The high-performance data storage system eventually built on top of GFS and several other Google technologies was named ‘BigTable.’ Nowadays, Google offers Bigtable as a NoSQL Big Data database service rather than keeping it as a purely internal, proprietary data storage system. This Big Data system powers most of Google’s core services, such as Google Search, Google Analytics, Google Maps, and Google’s email solution, Gmail.
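
As a rough illustration of the MapReduce programming model mentioned above, the sketch below implements the classic word-count job in plain, single-process Python. This is a conceptual sketch only, not Google’s proprietary implementation: a real framework would distribute the map and reduce tasks across cluster nodes.

```python
from collections import defaultdict

# Conceptual word count in the MapReduce style (single process, for illustration only;
# a real MapReduce framework runs map and reduce tasks in parallel across a cluster).

def map_phase(record: str):
    """Emit (word, 1) pairs for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework's shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all counts for one word into a single total."""
    return key, sum(values)

records = ["big data is big", "data volume keeps growing"]
intermediate = (pair for record in records for pair in map_phase(record))
results = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(results)  # {'big': 2, 'data': 2, 'is': 1, 'volume': 1, 'keeps': 1, 'growing': 1}
```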

YAHOO - Going back in history, around the same time that Google was building out its Google File System, Hadoop was created (circa 2006) by Doug Cutting, who was working for Yahoo at the time (note that Yahoo was also the first major user of Hadoop). Like BigTable, Hadoop owes its existence to Google’s MapReduce work from the early 2000s, and MapReduce remains Hadoop’s programming model to this day. To minimize performance issues, the Hadoop project, like Google, created its own distributed file system, named HDFS (Hadoop Distributed File System). So, in the same way that GFS is the basis for BigTable in Google’s solution, HDFS is the file system underneath a Hadoop cluster and its BigTable counterpart, HBase. A high-level outline of the Hadoop model can be seen in Figure 1.

Figure 1. High-level outline of the Hadoop model. Image © Bowick, G. (2015)

Figure 2 provides a high-level comparison of these two most popular big data platforms.

Figure 2. High-level comparison of the Google and Hadoop big data platforms. Image © Bowick, G. (2015)

The reasoning behind both companies creating a distributed file system is elegantly explained by Cisco in their white paper “Big Data in the Enterprise - Network Design Considerations”.

The paper states that “to process massive amounts of data efficiently, it was important to move computing to where the data is, using a distributed file system rather than [a] central system for the data. A single large file is split into blocks, and the blocks are distributed among the nodes” (Cisco, 2014), which is a perfect summary of the reasoning behind the distributed file systems on which most of today’s big data clusters sit.
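
To make the idea in the Cisco quote concrete, the following simplified Python sketch (my own illustration, not actual HDFS or GFS code) splits a file into fixed-size blocks and spreads the blocks across nodes in round-robin fashion; real systems use far larger blocks and replication-aware placement, and then schedule computation on the nodes that already hold the relevant blocks.

```python
# Simplified illustration of splitting a file into fixed-size blocks and
# assigning them round-robin to cluster nodes. Real distributed file systems
# use much larger blocks (on the order of tens or hundreds of megabytes) and
# smarter, replication-aware placement.

BLOCK_SIZE = 64          # bytes, kept tiny for demonstration purposes
NODES = ["node-1", "node-2", "node-3"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the raw data into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES):
    """Map block index -> node using simple round-robin placement."""
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}

data = b"x" * 300                      # pretend this is one large file
blocks = split_into_blocks(data)
print(place_blocks(blocks))            # {0: 'node-1', 1: 'node-2', 2: 'node-3', 3: 'node-1', 4: 'node-2'}
```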

Data Storage & Access

The speed of access is another area in which big data systems struggle. As the data resides in multiple locations, it takes time for it to get in and out of the memory of each system, and additional time to travel from one node to another over the network. Then there are questions of how fast the hard drives used as storage can read and write, among many other considerations. Another key issue relates to replication, or any movement of data, which is very impractical for big data solutions, mainly because of the strain such actions place on the network. As you can imagine, “moving petabytes of data across a network in a one-to-one or one-to-many fashion requires an extremely high-bandwidth, low-latency network infrastructure for efficient communication between computer nodes.” (Sabbithi, V., & Rapaka, J. M., 2015)
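
To put the quoted point about bandwidth into perspective, a quick back-of-the-envelope calculation (my own illustration) shows how long it would take to push a single petabyte across links of various speeds, ignoring protocol overhead, latency, and contention:

```python
# Rough transfer-time estimate: time = data volume / link throughput.
# Ignores protocol overhead, latency, and contention, so real transfers take longer.

PETABYTE_BITS = 1_000_000_000_000_000 * 8   # 1 PB expressed in bits (decimal units)

for gbps in (1, 10, 100):
    seconds = PETABYTE_BITS / (gbps * 1_000_000_000)
    print(f"1 PB over a {gbps:>3} Gbps link: ~{seconds / 3600:,.1f} hours")
```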

A good example is the previously mentioned Hadoop, which is not known for moving data efficiently, especially in terms of the overall time it takes to move large datasets. As I mentioned earlier, Hadoop sits on top of HDFS, which, to make sure that data is not lost, writes every byte of data to hard drives three times (a scheme known as triple replication). While this is great because it allows the Hadoop cluster to protect the data, it is also a problem when it comes to data access. The triple redundancy dramatically increases the need for storage capacity, consumes a lot of computing resources (CPU, memory, I/O operations, and so on), and at the end of the day is the reason for added latency and an overall slower system.
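
As a concrete illustration of that overhead, here is a simple sketch assuming HDFS’s default replication factor of 3 (configurable through the dfs.replication property): the raw disk capacity needed is roughly the logical dataset size multiplied by the replication factor.

```python
# Raw capacity needed under HDFS-style block replication:
# raw = logical dataset size * replication factor (HDFS defaults to 3).

def raw_capacity_tb(logical_tb: float, replication_factor: int = 3) -> float:
    """Approximate raw disk capacity required to store a dataset with replication."""
    return logical_tb * replication_factor

for logical in (10, 100, 1000):        # example dataset sizes in TB
    print(f"{logical:>5} TB of data -> ~{raw_capacity_tb(logical):>5.0f} TB of raw disk (3x replication)")
```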

Conclusion

To combat the impact of Volume on the other V’s of Big Data, we need to invent better techniques for dealing with the overall performance of a big data cluster. We also need to find new, innovative ways to make better use of the overall disk capacity required by big data datasets.

One of the areas that can provide immediate results is big data analytics itself. By using predictive models, analyzing our networks, visualizing data transfers, and cleaning up and properly adding metadata to our datasets, we can find the bottlenecks that, once resolved, allow us to lower overall latency as well as optimize total processing time.

References

Sabbithi, V., & Rapaka, J. M. (2015). A Study on Big Data analysis Environment. IJSEAT, 3(11), 871-877. (Accessed: 17 January 2017).

Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010, May). The Hadoop Distributed File System. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on (pp. 1-10). IEEE. (Accessed: 18 January 2017)

Sagiroglu, S., & Sinanc, D. (2013, May). Big data: A review. In Collaboration Technologies and Systems (CTS), 2013 International Conference on (pp. 42-47). IEEE. (Accessed: 18 January 2017)

Cisco (2014) Big data in the enterprise - network design considerations white paper. Available at: http://www.cisco.com/c/en/us/products/collateral/switches/nexus-5000-series-switches/white_paper_c11-690561.html (Accessed: 19 January 2017).

Hadoop History (2016) in Wikipedia. Available at: https://en.wikipedia.org/wiki/Apache_Hadoop#History (Accessed: 19 January 2017).

Jewell, D., Barros, R.D., Diederichs, S., Duijvestijn, L.M., Hammersley, M., Hazra, A., Holban, C., Li, Y., Osaigbovo, O., Plach, A., Portilla, I., Saptarshi, M., Seera, H.P., Stahl, E. and Zolotow, C. (2014) Performance and capacity implications for big data. Available at: http://www.redbooks.ibm.com/redpapers/pdfs/redp5070.pdf (Accessed: 19 January 2017).

Apache Nutch (2016) in Wikipedia. Available at: https://en.wikipedia.org/wiki/Apache_Nutch (Accessed: 19 January 2017).

Bowick, G. (2015) Cloud Computing PaaS Techniques: File System. Available at: http://slideplayer.com/slide/1517745/ (Accessed: 19 January 2017).

Distributed database (2016) in Wikipedia. Available at: https://en.wikipedia.org/wiki/Distributed_database (Accessed: 19 January 2017).

Patel, A. B., Birla, M., & Nair, U. (2012). Addressing big data problem using Hadoop and Map Reduce. In Engineering (NUiCONE), 2012 Nirma University International Conference on (pp. 1-5). IEEE. (Accessed: 19 January 2017).

Siegler, M. (2010) Eric Schmidt: Every 2 days we create as much information as we did up to 2003. Available at: https://techcrunch.com/2010/08/04/schmidt-data/ (Accessed: 19 January 2017).