The main goal of this article is to analyze the significance of Big Data Lifecycle Management (BDLM) within the scope of the Big Data reference architecture. I will concentrate on the BDLM model and workflow and provide some real-world examples.
Big Data Lifecycle and Data Management
Big data academics, scientists, and engineers encounter numerous issues when it comes to big data management. Some of the most pressing ones relate to adopting methodical techniques that help organize distributed, composite, massive data sets efficiently while still providing precise and well-organized service.
“Big data usually has characteristics such as massive, distributed, heterogeneous, association complexity, real-time changes, and so on. Thus, the big data management faces the challenges like distributed processing, semantic integration, association mapping, timeliness, and so on.” (Cheng, X., Hu, C., Li, Y., Lin, W., & Zuo, H., 2013).
For example, in the field of materials science, “timber” can be used in two major categories: building materials and road transport materials. The concept of “timber” is expressed as “roof panels” in building materials and as “wooden bridge” in road transport materials. The related attribute data about “timber” are therefore distributed across two different regions, and researchers must use some technical approach to construct the semantic association between the heterogeneous and distributed data. Meanwhile, when some of the source data change, the updated data should be accessible in a timely manner. For this reason, research on data evolution analysis in the lifecycle of big data is becoming increasingly important.
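The “timber” example can be made concrete with a minimal sketch of a semantic association map. Everything in it is an assumption for illustration: the domain keys, the field names, the placeholder attribute values, and the `records_for_concept` helper do not come from Cheng et al. (2013), and a real system would use a far richer ontology.

```python
# Hypothetical mapping from (domain, local term) to a shared concept.
SEMANTIC_MAP = {
    ("building_materials", "roof panels"): "timber",
    ("road_transport_materials", "wooden bridge"): "timber",
}

# Hypothetical records held in two separate, heterogeneous sources.
# The attribute values are placeholders, not real measurements.
SOURCES = {
    "building_materials": [
        {"term": "roof panels", "attribute": "placeholder-a"},
    ],
    "road_transport_materials": [
        {"term": "wooden bridge", "attribute": "placeholder-b"},
    ],
}


def records_for_concept(concept, sources=SOURCES, semantic_map=SEMANTIC_MAP):
    """Collect records from every source whose local term maps to the concept."""
    found = []
    for domain, records in sources.items():
        for record in records:
            if semantic_map.get((domain, record["term"])) == concept:
                # Keep the originating domain so the result stays traceable.
                found.append({"domain": domain, **record})
    return found


timber_records = records_for_concept("timber")
```

Because the lookup reads the sources at query time, a change to a source record is visible on the next call, which is the simplest possible answer to the timeliness requirement mentioned above.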
As we can see, the Big Data Lifecycle and Big Data management issues are becoming progressively important to address.
So, what is the Big Data Lifecycle?
There isn’t a single definitive answer, as the definition is still evolving, but most references point to between three and five major phases of the Big Data Lifecycle. One of the best illustrations of the challenges of the Big Data Lifecycle is the following graphic by Jimmy Nolan (Lionbridge, Figure 1):
Figure 1
Image © - Nolan, J. (2013)
Exploring the above graphic, we should be able to quickly grasp the importance of the big data management lifecycle. To keep the explanation simple, this is what it all boils down to. We need a very precise way of acquiring data; the Data Storage and Data Acquisition phases relate to that. Once we obtain the data, we require a high level of data awareness, the so-called Data Normalization. Only if all of these are met will we be able to extract valuable, clean, accurate, and timely information from the data through Data Analytics. The only part that I find missing in the image above is Big Data Management, which some refer to as **Data Governance**: a reasonable attempt at creating a framework that manages and controls all primary stages of the lifecycle. The most important point that the image does not convey, however, is that every phase in the progression must be executed precisely and correctly to guarantee data validity.
A slightly more elaborate description of the big data lifecycle is provided by Demchenko, Y., De Laat, C., & Membrey, P. (Figure 2), which illustrates the architecture components of the Big Data Ecosystem as a progression of steps.
Figure 2
Image © - Demchenko, Y., De Laat, C., & Membrey, P. (2014)
This particular model refers to the following stages of the Big Data Lifecycle:
- Data collection and registration
- Data filtering and classification
- Data analysis, modeling, prediction
- Data delivery and visualization
It is a very similar approach to defining the Big Data Lifecycle, although Figure 2 illustrates the process in a little more detail, explaining the interactions between the phases as well as the importance of storage and data models. The flow is as follows. First, the data is collected and registered from the data source. In the next phase, it is cleaned, augmented, enhanced, and classified. Only then can it be processed by the analytics phase of the lifecycle, in which we apply modeling and prediction mechanisms to analyze the data. The next step is data delivery, in which tools are used to visualize the gathered insights. It’s important to note that information at this stage is transferable and mobile; in other words, we can go back and forth between the data filtering and enrichment phase and the data analytics phase. It only depends on our need to repurpose, refactor, or post-process the data we’re working with. The last stage is a consumer data analytics application, in which the data are presented to an end user.
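The four stages described above can be sketched as a chain of small functions. This is only an illustration under my own assumptions: the function names, the record layout, and the threshold of 10 used for classification are hypothetical and are not part of the model by Demchenko et al. (2014).

```python
def collect(raw_items):
    """Data collection and registration: give every item an id."""
    return [{"id": i, "value": v} for i, v in enumerate(raw_items)]


def filter_and_classify(records):
    """Data filtering and classification: drop missing values, add a label."""
    cleaned = [r for r in records if r["value"] is not None]
    # The threshold of 10 is an arbitrary, illustrative choice.
    return [{**r, "label": "high" if r["value"] >= 10 else "low"} for r in cleaned]


def analyze(records):
    """Data analysis: count records per label and compute the mean value."""
    counts = {}
    for r in records:
        counts[r["label"]] = counts.get(r["label"], 0) + 1
    mean = sum(r["value"] for r in records) / len(records) if records else 0.0
    return {"counts": counts, "mean": mean}


def deliver(summary):
    """Data delivery and visualization: render a plain-text report."""
    lines = [f"{label}: {n}" for label, n in sorted(summary["counts"].items())]
    lines.append(f"mean: {summary['mean']:.1f}")
    return "\n".join(lines)


def run_lifecycle(raw_items):
    """Run the stages in order; each stage consumes the previous stage's output."""
    registered = collect(raw_items)
    classified = filter_and_classify(registered)
    return deliver(analyze(classified))


report = run_lifecycle([4, None, 12, 20])
```

Keeping each stage as a separate function mirrors the back-and-forth movement mentioned above: the output of `filter_and_classify` can be fed into `analyze` again after re-filtering, without repeating collection.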
“Figure 1 outlines the new approach to data management and processing in Big Data industry - the Big Data Lifecycle Management (BDLM) model, proposed as a result of analysis of the existing practices in different scientific communities and industry technology domains. New BDLM requires data storage and preservation at all stages what should allow data re-use/re-purposing and secondary research/analytics on the processed data and published results. However, this is possible only if the full data identification, cross-reference, and linkage are implemented in BDI. Data integrity, access control, and accountability must be supported during the whole data lifecycle. Data curation is an important component of the discussed BDLM and must also be done in a secure and trustworthy way.” (Demchenko, Y., De Laat, C., & Membrey, P., 2014).
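To make the requirements in this quotation more tangible, here is a minimal sketch of a store that preserves the output of every stage, gives each entry an identifier, links it to the entry it was derived from, and can verify its integrity. The `LifecycleStore` class, its methods, and the id format are hypothetical names of my own; the cited paper does not prescribe an implementation, and access control and accountability are left out entirely.

```python
import hashlib
import json


class LifecycleStore:
    """In-memory store keeping each stage's output with an id and a parent link."""

    def __init__(self):
        self._entries = {}

    def put(self, stage, data, parent_id=None):
        """Preserve a stage's output and return its identifier."""
        payload = json.dumps(data, sort_keys=True)
        checksum = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        entry_id = f"{stage}:{checksum[:12]}"
        self._entries[entry_id] = {
            "stage": stage,
            "payload": payload,
            "checksum": checksum,
            "parent": parent_id,  # linkage to the data this entry was derived from
        }
        return entry_id

    def verify(self, entry_id):
        """Integrity check: recompute the checksum of the stored payload."""
        entry = self._entries[entry_id]
        actual = hashlib.sha256(entry["payload"].encode("utf-8")).hexdigest()
        return actual == entry["checksum"]

    def lineage(self, entry_id):
        """Follow parent links back to the original source entry."""
        chain = []
        while entry_id is not None:
            chain.append(entry_id)
            entry_id = self._entries[entry_id]["parent"]
        return list(reversed(chain))


store = LifecycleStore()
raw_id = store.put("collected", [4, None, 12])
clean_id = store.put("filtered", [4, 12], parent_id=raw_id)
```

Because nothing is overwritten, the raw data stays available for re-use and secondary analysis, and `store.lineage(clean_id)` shows which source a processed result came from.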
Big Data Reference Architecture (BDRA)
First, it’s important to understand the overarching picture, the so-called Big Data Architecture Framework (BDAF) model, which addresses all aspects of the Big Data Ecosystem. It includes the following components:
- Big Data Infrastructure
- Big Data Analytics
- Data structures and models
- Big Data Security
- Big Data Lifecycle Management, which is only a single part of the entire Big Data Ecosystem
Additionally, the table in Figure 3 shows the interrelation between the Big Data Architecture Framework (BDAF) components, illustrating the importance of Data Management and Lifecycle throughout the BDAF.
Figure 3
Image © - Demchenko, Y., De Laat, C., & Membrey, P. (2014)
Conclusion
We’re living in the age of Big Data. YouTube hosts over 2 billion videos, Facebook has 1 billion users who create 30 billion posts a month, Twitter’s search engine handles 32 billion searches a month, and the average teenager sends almost 5,000 text messages a month using instant messaging apps. All these applications generate enormous amounts of unique, distributed data at a rate of 2.5 quintillion bytes per day worldwide.
Thus, it’s only natural to assume that one of the primary goals of Big Data science is to advance big data solutions. “Market research firm IDC forecasts a 50% increase in revenues from the sale of big data and business analytics software, hardware, and services between 2015 and 2019. The Big Data software solutions will be one of the most major revenue generators in big data and business analytics, according to IDC, with sales expected to generate more than $55 billion in revenues in 2019. Nearly half of those revenues are projected to come from purchases of the end-user query, reporting, and analysis tools, and of data warehouse management tools, according to IDC.” (Davis, J., 2017). The issue is so important that the White House administration has already invested over $200 million in big data research projects, with the primary goal of “improving our ability to extract knowledge and insights from large and complex collections of digital data. The initiative promises to help solve some the Nation’s most pressing challenges.” (Weiss, R. and Zgorski, L.-J., 2012).
I’ll conclude the article by quoting Peter Sondergaard, Senior Vice President of Gartner: “Information is the oil of the 21st century, and analytics is the combustion engine.”
This quote is a great way of saying that we’re entering a new era comparable in scale to an industrial revolution. To prepare for the future, we need to lay a good foundation in the form of new Big Data Lifecycle and Management frameworks and tools that will allow us to process this massive data overload easily and provide new, efficient ways to extract valuable information from all of the data we collect.
References
Cheng, X., Hu, C., Li, Y., Lin, W., & Zuo, H. (2013). Data Evolution Analysis of Virtual DataSpace for Managing the Big Data Lifecycle. 2013 IEEE International Symposium On Parallel & Distributed Processing, Workshops & Phd Forum, 2054. doi:10.1109/IPDPSW.2013.57 (Accessed: 15 January 2017).
Demchenko, Y., De Laat, C., & Membrey, P. (2014). Defining architecture components of the Big Data Ecosystem. In Collaboration Technologies and Systems (CTS), 2014 International Conference on (pp. 104-112). IEEE. (Accessed: 14 January 2017).
Cagle, K. (2015) Understanding the Big Data Life-Cycle. Available at: https://www.linkedin.com/pulse/four-keys-big-data-life-cycle-kurt-cagle (Accessed: 15 January 2017).
Davis, J. (2017) Big data, Analytics sales will reach $187 Billion by 2019. Available at: http://www.informationweek.com/big-data/big-data-analytics/big-data-analytics-sales-will-reach-$187-billion-by-2019/d/d-id/1325631 (Accessed: 15 January 2017).
Weiss, R. and Zgorski, L.-J. (2012) ‘BIG DATA’ INITIATIVE. Available at: https://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release.pdf (Accessed: 15 January 2017).
Nolan, J. (2013) The promise of big data still requires a human touch - business process Crowdsourcing. Available at: http://blog.lionbridge.com/enterprise-crowdsourcing/2013/09/16/the-promise-of-big-data-still-requires-a-human-touch/ (Accessed: 15 January 2017).
Kuketz, D. (2016) The 7 biggest business benefits from big data. Available at: http://archive.utopiainc.com/insights/blog/381-7-biggest-business-benefits-from-big-data (Accessed: 15 January 2017).