Log Management and Big Data Analytics

In the following article, I explore the issue of log collection and analysis, a very specific problem domain for many large organizations. Logging is a good example of a high-volume, high-velocity data set, which makes it a strong candidate for Big Data analytic techniques. This article does not go into the details of how analytic methods perform data classification or other analytic tasks; its main aim is to shed some light on the application of Big Data techniques to logging and to outline some of the business benefits of log analysis.

Problems of Log Management

One of the common difficulties experienced by small and large organizations alike is log management: effectively managing the large volumes of computer-generated log files. The sheer number of files and the variety of formats are often the biggest issues. The following is a very brief introduction to these categories; the main point of this section is to convey the size of the problem and the complexity of dealing with log files.

Problems related to Variety

The following list represents the diversity of sources that are typically collected by a larger enterprise:

  • Hardware & Networking Devices – servers, routers, switches, bridges, hubs, firewalls, etc. – often distributed across multiple datacenters
  • Operating Systems – Windows and Unix-based operating system event logs at various levels of detail, containing registry, event, file system, Syslog, SNMP, NetFlow and other types of logs
  • Database Systems – transaction, query, recovery, configuration and audit logs
  • Applications – HTTP weblogs, FTP logs, Log4J, JMS, JMX, Net events, Portal logs, but mainly the core business applications that often have their own homegrown log formats
  • Customer Facing Data – logs reporting on product usage, click stream data, shopping cart logs, online transaction logs
  • Virtualization & Cloud – hypervisor logs, VM logs, cloud app logs
  • Outside Datacentre – Manufacturing, logistics logs, CDRs & IPDRs, power consumption, RFID or GPS data logs


Problems related to Volume

Combined Size – It is not uncommon for a medium-sized enterprise to collect log files that add up to many terabytes. Log files are frequently at the top of the list when it comes to the volume of information generated by IT organizations, or by any organization that uses computers to conduct its business. Collecting and storing data in these volumes is a challenge in itself. Those are the high-level issues of dealing with volume on a massive scale, but there is also the issue of dealing with log files individually.

Individual Size – Businesses often realize (usually when it is a bit late) that when an application incident occurs, they need to locate the logs and start analyzing them immediately. The usual approach is first to find the trail, in other words to identify the individual log files that contain the valuable information. As I have often seen, even when the employees in charge can find the log files quickly, they usually hit a wall trying to work with these huge files effectively. Sometimes individual log files are truly gigantic. This is often not because the device generating the log lacks support for log rotation or automatic splitting of logs into smaller chunks (that is a whole different problem). It is usually because the default setting of most devices is to keep the log in a single file, and because organizations habitually leave these settings at their defaults; I have rarely seen any procedure for changing log file settings.
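For applications that an organization controls itself, rotation is often a small configuration change. The following is a minimal sketch using the RotatingFileHandler from Python's standard logging library; the logger name, the 50 MB limit and the number of backups are illustrative assumptions, not recommendations, and devices or third-party products will have their own rotation settings.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path, max_bytes=50 * 1024 * 1024, backups=10, name="app"):
    """Return a logger that rolls over to a new file before `path` exceeds max_bytes.

    The name, size limit and backup count are hypothetical defaults for this sketch.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Older files are renamed path.1, path.2, ... and at most `backups` are kept.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Calling build_rotating_logger("app.log") once at start-up would keep each file at or below roughly 50 MB, so no single file grows without bound. Calling it repeatedly with the same name would attach duplicate handlers, so it is meant to run once per process.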

Left unrotated, an individual log file can grow to tens of gigabytes, and that is not something a typical editor such as Notepad++ can easily open. An employee responsible for log file analysis faces issues that often need to be addressed right after a production incident is reported, and that is not the time to be asking any of the following questions:

  • Which log files need to be collected?
  • How do we get the log from a server that is not responding?
  • How do we quickly transfer the log file(s) to the employee’s workstation for analysis?
  • How do we open the large files, and what software should we use?
  • How do we search effectively inside a large file?
  • How do we find the information, a pattern that can assist in the analysis of the problem?

Moreover, these are just some of the questions. Imagine you need to compare two or more large log files because the issue is spread across various servers and domains. What then?

As you can imagine, log management as a whole becomes a very time-consuming and unreliable process, which rarely leads to any immediate positive business insight. I have seen cases where dealing with log files was such a lengthy and cumbersome job that the organization had to drop it entirely and focus on other methods of fixing the issue. Eventually these organisations resort to using the log files after the fact, during the post-problem analysis phase, when there is usually more time to dig through those big log files.
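Two of the questions above, how to search inside a file that is too large to open and how to compare files from several servers, can be approached on a single workstation by streaming the files instead of loading them. The sketch below is one possible approach under stated assumptions: every relevant line starts with a timestamp in the form "2017-02-04 13:05:11", and each file is already in time order. Real log formats will differ, and the function names are my own.

```python
import heapq
import re
from datetime import datetime

# Assumed line layout: "2017-02-04 13:05:11 <message>"; real logs will differ.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_WIDTH = 19


def matching_events(path, pattern):
    """Yield (timestamp, path, line) for lines matching pattern, one line at a time."""
    regex = re.compile(pattern)
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:  # the file is streamed, never loaded whole
            if not regex.search(line):
                continue
            try:
                stamp = datetime.strptime(line[:TIMESTAMP_WIDTH], TIMESTAMP_FORMAT)
            except ValueError:
                continue  # skip lines without a parsable timestamp
            yield stamp, path, line.rstrip("\n")


def merged_timeline(paths, pattern):
    """Merge matches from several time-ordered files into one time-ordered stream."""
    streams = [matching_events(path, pattern) for path in paths]
    # heapq.merge is lazy and relies on each input already being sorted.
    return heapq.merge(*streams)
```

Iterating over merged_timeline(["web.log", "db.log"], r"ERROR") would then yield the matching lines from both (hypothetical) files as a single timeline. Memory use stays small because only one pending line per file is held at a time, although a full scan of a multi-gigabyte file still takes a while.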

Other Issues

As we can see, the problem is largely about Variety, because application and hardware logs are generated in multiple structured and unstructured formats. It is also about the total and individual size of log files (Volume). However, there are other aspects that should be mentioned when it comes to log files in general, such as:

  • Velocity – the high speed at which new information is produced and continuously saved to log files
  • Veracity – the events contained in the log files may not always be accurate, either because of settings (e.g. time configuration differences) or, in security logs, because intruders are known to amend the logs to confuse intrusion detection systems
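The time configuration differences mentioned under Veracity are one of the few accuracy problems that can be partly corrected in code. The sketch below converts host-local timestamps to UTC and applies a per-host clock correction. The host names and skew values are hypothetical, and in practice the corrections would have to be measured, for example against an NTP reference.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-host clock corrections; the values here are made up.
CLOCK_SKEW = {
    "web-01": timedelta(seconds=0),
    "db-01": timedelta(seconds=-42),  # this clock runs 42 s fast, so subtract 42 s
}


def to_utc(host, local_stamp, utc_offset_hours):
    """Convert a naive host-local timestamp to UTC and apply the known skew."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    aware = local_stamp.replace(tzinfo=zone)
    corrected = aware + CLOCK_SKEW.get(host, timedelta(0))
    return corrected.astimezone(timezone.utc)
```

With every event on one clock, events from different servers can be ordered against each other. This does nothing for logs that were deliberately tampered with, which need separate integrity controls.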


Just by looking at these characteristics, it is clear that log file management is an excellent candidate for Big Data analysis. One of the most important characteristics of Big Data for any enterprise is its ability to generate Value, which is another of the V’s of Big Data. My experience tells me that log files in most organizations are not used proactively but reactively, when a production problem occurs, which is not the most efficient approach. The goal of this document is to explore how we can derive business value and insight by applying Big Data to the complex scenario of log management.



Applying Big Data Analytics to Logging

The following section of the document briefly explores the major aspects of the Big Data logging architecture (Figure 1).

For medium and large corporations that experience exponential growth in the volume of log files, log management can quickly become a daunting task. The million-dollar question faced by these corporations is how to find a solution capable of collecting and processing terabytes of data easily, one that also provides accurate and valuable insights and log analysis while keeping the process cost-effective. That is where Big Data solutions can come to the rescue.

Figure 1 is my attempt at visualizing the process by which Big Data solutions typically address the issue of log analysis:

Figure 1


Please note that, when it comes to the Storage part of the above image, I did not mean to present Storage as a real stage in the Big Data lifecycle model, as the following representation shows (image directly below). I simply wanted to illustrate how I would handle log analysis in terms of the components of a Big Data solution.


I explore a real-life scenario in the following new article, check it out:

Processing Apache Access log with K-means Clustering Algorithm


Log Transformation, Collection, and Aggregation

Big Data solutions typically rely on log forwarders to aggregate logs into a single centralized source. Log files are transformed into a single common format during or after this process. This is the crucial initial step that allows the data contained in the logs to be processed and analyzed later. Many classes of solutions are now available on the market, designed specifically to handle high-volume, high-throughput logs from a variety of sources, event collection, and data transformation into a single common format that can later be analyzed effectively. Solutions specific to log aggregation include server tools such as Scribe, Apache Flume and Logstash, which support shipping, parsing and indexing logs as well as moving large amounts of data into storage. A centralized location and a common log format are among the first benefits a business gains from implementing a Big Data solution.
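To make the idea of a common format concrete, here is a small sketch that maps one line in the Apache Common Log Format onto a shared event record and serializes it as JSON. It does not use Scribe, Flume or Logstash; the field names of the record (timestamp, source, client, message, status, bytes) are an assumption of mine, and a real pipeline would define its own schema and one such mapping per log type.

```python
import json
import re
from datetime import datetime, timezone

# Apache Common Log Format: host ident user [time] "request" status size
APACHE_COMMON = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def normalize_apache(line, source):
    """Map one Apache access log line onto the (hypothetical) shared schema."""
    match = APACHE_COMMON.match(line)
    if match is None:
        return None  # leave unparsable lines for separate handling
    stamp = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    size = match["size"]
    return {
        "timestamp": stamp.astimezone(timezone.utc).isoformat(),
        "source": source,
        "client": match["host"],
        "message": match["request"],
        "status": int(match["status"]),
        "bytes": 0 if size == "-" else int(size),
    }


def to_json_line(event):
    """Serialize one normalized event as a single JSON line."""
    return json.dumps(event, sort_keys=True)
```

Once every source is mapped to the same record shape, the downstream storage and analysis steps no longer need to know which device or application produced an event.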

Log Storage

Once the log files are transformed into a common format, they are usually moved into a NoSQL database that sits on top of a distributed file system such as HDFS. This is generally how a Big Data solution stores information of any type. The choice of NoSQL database is driven in this case by the type of data store for which the database is designed. Once the log files are stored inside such a distributed system, the eventual loss of information contained in log files immediately becomes less of an issue. That helps in two specific areas: the overall retention of data (Big Data solutions are designed to hold large volumes) and ease of compliance with regulations.

Log Processing

Solutions such as Hadoop, Apache Spark and others in this space provide many efficient ways to process logs by harnessing the power of Big Data technologies. Here we gain overall intelligence into what is happening on all systems, as solutions such as these sort and reduce the information so that it can be analyzed efficiently.
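The "sort and reduce" idea can be illustrated without a cluster. The sketch below runs in a single Python process and uses none of the Hadoop or Spark APIs; it only shows the shape of the computation those systems distribute. It assumes events are dictionaries with a "status" field, which is my own illustrative schema.

```python
from collections import Counter
from functools import reduce


def map_partition(events):
    """Map step: count HTTP status codes within one partition of events."""
    return Counter(event["status"] for event in events)


def reduce_counts(partials):
    """Reduce step: merge the per-partition counters into one total."""
    return reduce(lambda left, right: left + right, partials, Counter())


def status_totals(partitions):
    """Run the map step on every partition, then reduce the partial results."""
    return reduce_counts(map_partition(events) for events in partitions)
```

In a distributed system each partition would live on a different machine and only the small partial counters would travel over the network, which is what makes the approach workable for terabytes of logs.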

Log Analysis and Visualization

Log analysis is the part of the system in which we use various advanced Big Data analytic techniques, such as data mining, machine learning, pattern recognition and applied statistics, to correlate the data and gather evidence. “All these techniques started in different fields. Machine learning, as well as pattern recognition, came from the area of Artificial Intelligence, whereas Data mining is a branch of machine learning and Applied Statistics existed already as the applied method of statistics. However, increasingly the separation between these fields has become blurred, and they are merging into one field that can be called Analytics” (University of Liverpool, 2014). These techniques allow us to do in-depth analysis and to apply information fusion and other techniques. Machine learning and artificial intelligence technologies, as well as predictive analysis approaches, are also commonly used nowadays. Many other advanced techniques can be implemented during log analysis, but the bottom line is that these tools and methods can provide much value to businesses. For example, automated log analysis can alert us to various security issues, perform analysis of new threats, tell us more about server utilization, etc.
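As a very small taste of the applied statistics side, the sketch below flags minutes in which the number of logged events is unusually high, using a plain z-score. It is deliberately simplistic and is not what a production analytics platform would run; the threshold of three standard deviations and the per-minute counts as input are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def find_spikes(counts_per_minute, threshold=3.0):
    """Return indexes of minutes whose count is more than `threshold`
    standard deviations above the mean of the whole series."""
    if len(counts_per_minute) < 2:
        return []
    average = mean(counts_per_minute)
    spread = pstdev(counts_per_minute)
    if spread == 0:
        return []  # a perfectly flat series has no spikes
    return [
        index
        for index, count in enumerate(counts_per_minute)
        if (count - average) / spread > threshold
    ]
```

A sudden burst of failed logins or error responses would show up as a spike in such a series and could trigger an alert, which is the kind of automated insight described above.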

Here is a short video demo of how business could analyze server logs with Hadoop and Logstash:




Implementing a Big Data solution opens a window into all of the organization’s assets in a way we would never expect. It not only gives us the ability to monitor the entire hardware and software system visually through a portal and various analytic dashboards, but also allows us to see the overall health of all systems, use custom reporting, get the status of automated monitoring, see real-time reporting and react to various near-real-time and real-time alerts. Analysis of log files can be used to monitor servers and applications better, improve business and customer intelligence, prevent fraud and improve the overall security of the entire system for which the logs are collected. Employing a Big Data solution allows businesses to build improved prediction and recommendation models, use continual tuning (automatically changing hardware and software configuration in response to the results of analytics), and generally drive more automation into the entire process of working with logs. The result is that businesses gain a new source of improved business insights, which at the end of the day leads to improved intelligence from data and creates a competitive advantage that is often responsible for business success and more profit.


Applying Big Data solutions to log management will provide the business with a new source of insights, which will lead to improved intelligence and create a competitive advantage that is often responsible for business success and, at the end of the day, more profit.




Downing, J. A. (1979). Aggregation, transformation, and the design of benthos sampling programs. Journal of the Fisheries Board of Canada, 36(12), 1454-1463. (Accessed: 4 February 2017).

Papierniak, K. A., Thaisz, J. E., Diwekar, A. M., & Chiang, L. J. (2000). U.S. Patent No. 6,151,601. Washington, DC: U.S. Patent and Trademark Office. (Accessed: 4 February 2017).

Phaneendra, S. V., & Reddy, E. M. (2013, April). Big Data-solutions for RDBMS problems-A survey. In 12th IEEE/IFIP Network Operations & Management Symposium (NOMS 2010) (Osaka, Japan, Apr 19-23 2013). (Accessed: 4 February 2017).

University of Liverpool (2014) Data Analytics Basics. Available at: https://elearning.uol.ohecampus.com/bbcswebdav/institution/UKL1/201740JAN/MS_CKIT/CKIT_525/readings/UKL1_CKIT_525_Week04_LectureNotes.pdf (Accessed: 5 February 2017).

Nguyen, T. M., Schiefer, J., & Tjoa, A. M. (2005, November). Sense & response service architecture (SARESA): an approach towards a real-time business intelligence solution and its use for a fraud detection application. In Proceedings of the 8th ACM international workshop on Data warehousing and OLAP (pp. 77-86). ACM. (Accessed: 5 February 2017).

Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J. M., & Welton, C. (2009). MAD skills: new analysis practices for big data. Proceedings of the VLDB Endowment, 2(2), 1481-1492. (Accessed: 5 February 2017).

Russom, P. (2011). Big data analytics. TDWI best practices report, fourth quarter, 1-35. (Accessed: 5 February 2017).

Zikopoulos, P., & Eaton, C. (2011). Understanding big data: Analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media. (Accessed: 5 February 2017).

Holmes, A. (2012). Hadoop in practice. Manning Publications Co. (Accessed: 5 February 2017).

Howitt, M. A., Goldfein, J. E., & Nonemacher, M. N. (2007). U.S. Patent No. 7,231,403. Washington, DC: U.S. Patent and Trademark Office. (Accessed: 5 February 2017).