Migrating Enterprise Data Warehouse (EDW) to Big Data

In this post, I cover a brief history of the Enterprise Data Warehouse (EDW), analyze the major challenges of EDW solutions, and discuss how well traditional EDWs handle Volume, Variety, and Velocity (three of the V’s of Big Data). I also explore Big Data platforms as a potential alternative to the EDW.

 

Enterprise Data Warehouse – A Short History

The easiest way to illustrate the purpose of an Enterprise Data Warehouse (EDW) is to envision a large enterprise with many departments, such as Administrative, Financial, Marketing, Production, Sales, or IT & Infrastructure. All of these departments generate information in a variety of data formats, and without a central repository and a common data format that information can easily become disorganized.

The problem of data management and warehousing became apparent in the 1970s, when the leading large industries of the time started to deploy more computing power to assist their business processes. Around the same time, in 1970, William H. Inmon (often called the Father of Data Warehousing) began to outline the principles of the data warehouse. By the late 1980s, research into data management practices had increased dramatically, mainly due to the growing computerization of informational and other departmental systems, and terms such as ‘data warehouse’ began appearing in the IT research journals of the day.

The credit for the phrase ‘business data warehouse’ goes to Barry Devlin and Paul Murphy, for their article published in the IBM Systems Journal (1988, Volume 27, Issue 1). Fittingly named ‘An architecture for a business and information system’, the article outlined one of the first models for collecting data from various operational systems and also discussed issues such as data centralization and data analysis.

The first Enterprise Data Warehouse solutions were born primarily from the need described above: to help large corporations capture departmental data and move it into a single central repository. Thus, the main purpose of an Enterprise Data Warehouse (EDW) is to serve as a unified database system and a central repository of data. EDW technology came to be used primarily by large enterprises to manage and report on the various forms of data captured in the course of running their business. In short, the EDW was an answer to the need to organize, classify, and represent data. Figure 1 illustrates a typical EDW deployment.

Transitioning from Enterprise Data Warehouse to Big Data

EDW & Scalability

While the EDW is an excellent way to integrate data from multiple sources, the traditional EDW architecture is now under increasing pressure from competing Big Data solutions, which promise faster and more cost-effective ways to gather insights from data analysis.

As a matter of fact, “The Big Data solutions such as data warehouse management solutions will be one of the main revenue generators in big data and business analytics, with sales expected to generate more than $55 billion in revenues in 2019.” (Davis, J., 2017).

In the paragraphs below, I briefly outline the challenges that businesses may face when migrating from an Enterprise Data Warehouse to Big Data and Cloud Computing technologies.

Enterprise Data Warehouse and Big Data Volume, Variety and Velocity

The main issue with the EDW is its lack of scalability with respect to three of the big V’s of Big Data: Volume, Variety, and Velocity.

Businesses, as well as every single one of us, generate an enormous quantity of data. As a matter of fact, the data we create and collect is growing at an exponential rate. It is estimated that by the year 2020 there will be 50 billion IoT devices, and every single person on earth will produce close to 2 MB of data every second.

This increase in the volume of data is the major reason why the EDW approach is being re-evaluated. Why?

Well, let’s assume there is an enterprise that collects information from all of its departments through ETL into a single SQL database hosted on-premises. How will you address scalability and performance when the volume of collected data doubles every single year? How can you accomplish effective real-time data analysis in such a case? In a nutshell, the EDW is not scalable!
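To put rough numbers on the doubling scenario above, here is a minimal sketch. The function name, the 10 TB starting volume, and the capacity figures are all hypothetical values chosen for illustration, not measurements from any real warehouse.

```python
def years_until_full(current_tb: float, capacity_tb: float,
                     growth_factor: float = 2.0) -> int:
    """Return how many yearly growth steps it takes for the data volume
    to exceed a fixed storage capacity.

    All inputs are illustrative assumptions: volume is multiplied by
    growth_factor once per year (2.0 means it doubles).
    """
    if current_tb <= 0 or growth_factor <= 1:
        raise ValueError("need a positive volume and a growth factor above 1")
    years = 0
    volume = current_tb
    while volume <= capacity_tb:
        volume *= growth_factor  # one more year of growth
        years += 1
    return years


# Hypothetical warehouse holding 10 TB today.
print(years_until_full(10, 100))    # capacity of 100 TB
print(years_until_full(10, 1000))   # ten times more capacity
```

Under these assumptions, buying ten times more capacity for a single server postpones the problem by only about three years, which is why scaling up a central database is hard to sustain against exponential growth.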

For example, centralized EDW databases struggle to adapt quickly and cost-effectively to cumulative increases in the volume and velocity of incoming data. This is mainly due to the highly structured nature of traditional relational databases, which are typically designed to scale up on a single server rather than out across many. They cannot match the performance, cost-effectiveness, or scalability of distributed NoSQL database solutions, which process queries in parallel, scale horizontally, and excel at procedural processing. Additionally, relational databases are often slow to replicate data. The process of synchronizing data between databases in an EDW drags down the performance of the entire system, and the problem becomes even more apparent as volumes grow.
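The idea of processing queries in parallel over horizontally partitioned data can be sketched in a few lines. This is a toy illustration only: the record layout and function names are hypothetical, and threads on one machine stand in for the separate nodes of a real distributed database.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_partitions):
    """Split records into n shards round-robin, as a stand-in for
    spreading data across n nodes."""
    return [records[i::n_partitions] for i in range(n_partitions)]


def count_by_department(shard):
    """Aggregate a single shard independently of all the others."""
    return Counter(rec["department"] for rec in shard)


def parallel_count(records, n_partitions=4):
    """Run the per-shard aggregation concurrently, then merge the
    partial results into one total."""
    shards = partition(records, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(count_by_department, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical departmental records.
records = [{"department": "Sales"}] * 3 + [{"department": "IT"}] * 2
print(parallel_count(records, n_partitions=2))
```

Because each shard is aggregated on its own and only the small partial results are merged, adding capacity means adding more shards and workers rather than buying a bigger single server.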

When it comes to volume, the local data storage traditionally used in EDW systems does not scale linearly and cannot compete with the distributed file systems used in today’s Big Data or Cloud Storage solutions.

Another issue altogether is the variety of data. With today’s numerous data sources, many unstructured data types need to be processed alongside structured ones, and traditional ETL systems do not perform as well with unstructured data.

The Big Data Alternative to Data Warehousing

Let’s look at the Big Data alternative to a traditional solution. The Hadoop ecosystem is a good example. Not only is it designed for massively parallel computing, but it is also built to work with NoSQL databases, where data can be spread across multiple server machines with only a slight performance decrease due to the distributed nature of the system. NoSQL databases do not enforce a fixed schema, which means they can manage unstructured data (data without a schema) as well as structured data. The distributed nature of NoSQL databases also makes them a great solution for rapid bulk data ingestion. This architecture is easily scalable and adaptable for collecting data from numerous sources, not to mention a lot more performant during data analysis than a traditional EDW solution. “Parallel processing and unstructured data ingestion with Big Data technology can be leveraged to enhance the existing ETL capabilities for most enterprises.” (EDW to Big Data, 2013).
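To illustrate what working without a fixed schema means in practice, here is a minimal sketch using plain JSON documents. The sample events and field names are invented for illustration and do not correspond to any particular NoSQL product’s API.

```python
import json

# Hypothetical raw events from three departments. Each document has a
# different shape, and nothing forces them into one table layout.
RAW_EVENTS = [
    '{"source": "sales", "order_id": 17, "amount": 99.5}',
    '{"source": "marketing", "campaign": "spring", "clicks": 1200}',
    '{"source": "support", "text": "Customer reports a late delivery"}',
]


def load_documents(lines):
    """Ingest every document as-is, without validating it against a schema."""
    return [json.loads(line) for line in lines]


def total_sales(documents):
    """Apply structure at read time: only documents that carry an
    'amount' field contribute to the total."""
    return sum(doc["amount"] for doc in documents if "amount" in doc)


def fields_by_source(documents):
    """List which fields each source actually supplied."""
    return {doc["source"]: sorted(k for k in doc if k != "source")
            for doc in documents}


docs = load_documents(RAW_EVENTS)
print(total_sales(docs))
print(fields_by_source(docs))
```

The point of the sketch is that ingestion never rejects a document for having an unexpected shape. Each query decides which fields it needs, which is often called schema-on-read.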

Figure 2 illustrates a Big Data Hadoop alternative architecture to Data Warehousing.

Figure 2 – Grover, M. (2017)

 

Conclusion

Enterprise Data Warehouse technologies were an excellent response to the need to centralize and report on multiple data sources. However, scalability issues are making EDW platforms obsolete. In today’s world of increasing data volumes, Big Data offers economies of scale. Distributed Big Data systems (storage, databases, and analytical engines) are far more performant during data analysis, and they are also more flexible and cost-effective when it comes to handling the rapidly increasing volume, velocity, and variety of data.

 

References

An architecture for a business and information system – IEEE Xplore document (2017) Available at: http://ieeexplore.ieee.org/document/5387658/ (Accessed: 23 February 2017).

Wadkar, S., & Siddalingaiah, M. (2014). Data Warehousing Using Hadoop. In Pro Apache Hadoop (pp. 217-239). Apress. (Accessed: 23 February 2017).

Thusoo, A., Shao, Z., Anthony, S., Borthakur, D., Jain, N., Sen Sarma, J., … & Liu, H. (2010, June). Data warehousing and analytics infrastructure at facebook. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 1013-1020). ACM. (Accessed: 24 February 2017).

3 approaches to healthcare data warehousing: A comparison (2016) Available at: https://www.healthcatalyst.com/whitepaper/3-approaches-healthcare-data-warehousing (Accessed: 25 February 2017).

Davis, J. (2017) Big data, Analytics sales will reach $187 Billion by 2019. Available at: http://www.informationweek.com/big-data/big-data-analytics/big-data-analytics-sales-will-reach-$187-billion-by-2019/d/d-id/1325631 (Accessed: 25 February 2017).

Organisations and management accounting (2017) Available at: http://www.open.edu/openlearn/money-management/organisations-and-management-accounting/content-section-4.1 (Accessed: 25 February 2017).

EDW to Big Data (2013) Offload enterprise data warehouse (EDW) to big data lake. Ample white paper – PDF. Available at: http://docplayer.net/3429702-Offload-enterprise-data-warehouse-edw-to-big-data-lake-ample-white-paper.html (Accessed: 25 February 2017).

Grover, M. (2017) Data warehousing with Hadoop. Available at: https://es.slideshare.net/hadooparchbook/data-warehousing-with-hadoop (Accessed: 25 February 2017).
