Processing Apache Access logs with K-means Clustering Algorithm

📅 Published 12.02.2017

In this article, I continue exploring logging as a data set. I described this type of data set earlier in the Log Management and Big Data Analytics post. Here, I suggest applying a particular partitioning method, k-means clustering, because I think it is the most suitable candidate for this part of the log file management problem domain. I explain why I consider the k-means technique the most appropriate for this type of data, cover the advantages that this analysis brings to logging in general, and demonstrate the k-means cluster analysis method on a real data set.

The Goal Definition

The main reason enterprises perform log analysis is to uncover valuable insights that would otherwise remain hidden in the log files. The goal is to improve the business's understanding of what the log files record, and also to discover potential competitive advantages. That said, the most important first step is to outline the organizational goal of the log file analysis.

For this article, I cover a hypothetical scenario in which a business runs an Apache web server and needs to analyze the Apache access log files to find the average byte size of the pages accessed by visitors to its website. That would not be too hard: they could simply add together the bytes of each accessed web page (the information is in the log) and then divide the sum by the total number of accesses. However, that on its own would only give us the average size of all pages. What if we wanted to discover the groups that web pages fall into according to their size? That is a harder question, but clustering algorithms were invented to deal with precisely this type of question.
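The simple average described above can be sketched in a few lines of Python. This is only an illustration, not part of the RapidMiner workflow used later: it assumes the log follows Apache's Combined Log Format, and the function names, the regular expression and the sample lines are all made up for the example.

```python
import re

# Assumption: Combined Log Format, where the quoted request is followed by
# the status code and the response size in bytes ("-" when no body was sent).
LINE_RE = re.compile(r'"[^"]*" (\d{3}) (\d+|-)')

def response_sizes(lines):
    """Yield the response size in bytes for every line that reports one."""
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group(2) != "-":
            yield int(match.group(2))

def average_size(lines):
    """Sum of all reported sizes divided by the number of accesses counted."""
    sizes = list(response_sizes(lines))
    return sum(sizes) / len(sizes) if sizes else 0.0

# Hypothetical sample lines, not taken from the real log.
sample = [
    '203.0.113.7 - - [15/Jan/2017:10:00:00 +0000] "GET /a HTTP/1.1" 200 1000 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [15/Jan/2017:10:00:05 +0000] "GET /b HTTP/1.1" 200 3000 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [15/Jan/2017:10:00:09 +0000] "GET /c HTTP/1.1" 304 - "-" "Mozilla/5.0"',
]
print(average_size(sample))
```

Note that this sketch leaves out accesses with no reported size; whether those should count towards the total number of accesses is a choice the analyst has to make.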

In the next section, we will use the k-means algorithm on an actual Apache log file and see if we can generate the natural groups (clusters) formed by the byte sizes of all web pages accessed by the visitors of a particular domain.

K-means Clustering Algorithm

Looking for a good definition of clustering, I came across this one given by Rana, H. and Patel, M. in their 2013 paper ‘A study of web log analysis using clustering techniques’: “Clustering is an unsupervised classification technique widely used for web usage mining with primary objective to group a given collection of unlabeled objects into meaningful clusters” (Rana and Patel, 2013). Simply said, clustering is a technique that allows us to segment the data into natural groups. Figure 1 shows a demonstration of the standard k-means algorithm:

Figure 1
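For readers who prefer code to diagrams, here is a minimal sketch of the standard k-means iteration (often called Lloyd's algorithm) on one-dimensional values such as page sizes. It is not RapidMiner's implementation. The function name and the way the starting centroids are picked (evenly spread across the sorted values) are assumptions made for this example; real implementations usually start from random or more carefully chosen points.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster one-dimensional values into k groups with Lloyd's algorithm."""
    # Assumption: initial centroids are k values spread evenly over the sorted data.
    ordered = sorted(values)
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: every value joins the cluster of its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: every centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # nothing moved, so the clusters are stable
        centroids = updated
    return centroids, labels

# Hypothetical page sizes in KB, chosen only to show three obvious groups.
sizes_kb = [10, 12, 11, 500, 520, 510, 2000, 2100]
centroids, labels = kmeans_1d(sizes_kb, k=3)
print(centroids, labels)
```

The two steps inside the loop, assigning points and then recomputing centroids, repeat until no centroid moves.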

Solution – Analysis of Apache Access log by using K-means Clustering Algorithm

The goal of this exercise is to analyze a real Apache log file and use the k-means method to group the pages of an actively visited website by size, to see whether the resulting clusters open a window into the size segmentation of all pages hosted on the example domain.

There are countless ways of interpreting log file entries, but for gathering clustering metrics and visual reports, the trend is moving towards more automated tooling, which is the preferred method of most business organizations.

So, for the following demo of the k-means clustering process, I have decided to use RapidMiner Studio.

Step 1

The first step was to load the Apache log file.

To work on something tangible, I decided to load the Apache access log for my website joe0.com, which contains every single visitor access to my site between the 15th of January 2017 and the 12th of February 2017.

Apache log files are not in a standard format that RapidMiner recognizes, so I first loaded the access.log file as a text file and split it using the space character as the column delimiter. That allowed me to parse the Apache log file and import it into RapidMiner.

Figure 2 illustrates the Apache log file and all its 129,492 rows loaded into RapidMiner.

As you can see, I ended up with several columns, which I named after the data they contain in the access log file: IP Address, Date of Access, Type of Access, Status Code, Size of the request, Reference Page and Browser type.

Figure 2
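Outside RapidMiner, the same space-delimited import can be approximated with Python's csv module, which keeps quoted fields (the request, the reference page and the browser string) together even though they contain spaces. This is a sketch under the assumption that the log uses the Combined Log Format; the short column names and the sample line are hypothetical stand-ins for the columns listed above.

```python
import csv

# Hypothetical short names for the columns described in the text.
COLUMNS = ["ip", "date", "request", "status", "size", "referrer", "browser"]

def parse_line(line):
    """Split one access log line on spaces, honouring double-quoted fields."""
    row = next(csv.reader([line], delimiter=" ", quotechar='"'))
    # The timestamp is split into two tokens ("[date" and "zone]"); rejoin them.
    timestamp = f"{row[3]} {row[4]}".strip("[]")
    values = [row[0], timestamp, row[5], row[6], row[7], row[8], row[9]]
    return dict(zip(COLUMNS, values))

# Hypothetical sample line, not taken from the real log.
sample_line = (
    '203.0.113.7 - - [15/Jan/2017:10:00:00 +0000] "GET /a HTTP/1.1" '
    '200 1000 "http://example.com/" "Mozilla/5.0 (X11)"'
)
record = parse_line(sample_line)
print(record["ip"], record["size"], record["browser"])
```

Lines whose quoted fields contain escaped quotes would need extra handling that this sketch leaves out.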

Step 2

The next step was to find out which size clusters the majority of the pages hosted on joe0.com belong to.

- To do so, I loaded the Apache log file first.

- Then, to speed up the process, I decided to limit the data to the last 10,000 records.

- I also used the ‘remove duplicates’ operator to get rid of duplicate rows, which would only slow down the process (note: this removed 830 duplicate lines).

- Then I added the ‘filter examples’ operator and got rid of all lines that were missing size-related values.

- The next step was to add the k-means clustering operator, which I set to group all document sizes into 3 clusters.

- Lastly, I added the ‘apply model’ operator.

Figure 3 contains the diagram in RapidMiner Studio.

Figure 3
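The preparation steps that come before the clustering operator can be sketched in plain Python as follows. This is not what RapidMiner does internally: the function name, the representation of a record as a (request, size) tuple, and the treatment of "-" and empty strings as missing sizes are all assumptions for the example.

```python
def prepare(records, limit=10_000):
    """Keep the last `limit` records, drop exact duplicates, drop rows without a size."""
    recent = records[-limit:]
    # dict.fromkeys keeps the first occurrence of each row and preserves order.
    unique = list(dict.fromkeys(recent))
    # Assumption: a missing size shows up as None, an empty string or "-".
    return [row for row in unique if row[1] not in (None, "", "-")]

# Hypothetical records as (request, size) tuples.
records = [
    ("/old", "10"),
    ("/a", "100"),
    ("/a", "100"),   # exact duplicate of the previous row
    ("/b", "-"),     # no size reported
    ("/c", "2500"),
]
cleaned = prepare(records, limit=4)
sizes = [int(size) for _, size in cleaned]
print(cleaned, sizes)
```

The resulting list of sizes is what a k-means step with 3 clusters would then take as input.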

Step 3 – Run the Model

The process took a little while, even though only 10 thousand entries from the access log were used.

However, the results provided a fascinating insight into the distribution of page sizes on the joe0.com domain.

As for the total number of page views per cluster, I discovered that:

- Cluster 0 had almost as many views (45.9%) as Cluster 1 (44.0%)

- Only Cluster 2 was different, with 10.1% of the views.

Cluster 2 sharply stands out as a unique group among the 3 clusters.
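Per-cluster shares like the ones above are just label counts divided by the total. A minimal sketch, assuming the clustering step returns one integer label per record (the function name and the sample labels are made up and do not reproduce the real results):

```python
from collections import Counter

def cluster_shares(labels):
    """Return the percentage of records that fall into each cluster."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: round(100 * n / total, 1) for cluster, n in sorted(counts.items())}

# Hypothetical labels for ten records.
example_labels = [0] * 5 + [1] * 4 + [2] * 1
print(cluster_shares(example_labels))
```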

To find out what’s in Cluster 2, I created a cluster/size chart report (Figure 4), which revealed the following insight:

Cluster 2 consists of the web pages that are less than 100 KB in size.

Figure 4

This showed that Cluster 0 is also a fascinating section of the data, because it accounts for almost 46% of all views and is made up of all the large pages (up to 2 MB in size). So, I decided to create a scatter/size chart (Figure 5) to see how many of those pages lie between 1 and 2 MB. Luckily, most are below 1 MB.

Figure 5

Then I started to think about what is so special about the pages in Cluster 0, the most visited pages of my website. I created a diagram inside RapidMiner that compared the pages in each cluster to their reference page information. That was a real eye-opener: it looks like all those large pages are also the ones most referred to by search engines and other websites.

Conclusion

This was an interesting insight into the inner workings of my own website. Before analyzing the Apache log of my domain, I had no idea that only 10% of all pages of my site (Cluster 2) belong to the group that is less than 100 KB in size.

Nor did I know that 46% of visitors to my site land on Cluster 0, which consists of pages almost 5 times bigger than those in Cluster 2, which is accessed by 10 percent of the website's visitors. This essentially means that only 10% of my site is optimized, and the segment that is most visited is the one that is least optimized. Additionally, I realized why most people land on Cluster 0 pages: these are the pages most referenced by search engines, which suggests that the larger the page I create, the more often it seems to be referenced by others (an interesting insight).

Anyhow, this newly gained information prompted me to look at other details immediately. Eventually, I found that I am not optimizing the website's images, which is likely the reason for the inflated page sizes. So, I will need to fix this problem by optimizing the pictures. That should show up as a decrease in overall bandwidth use, and once the images are optimized, the server should deliver the most popular pages a lot faster than it does now.

Thanks to the k-means algorithm, I was able to identify a couple of insights and also one optimization issue.

What a great way to demonstrate the power of log file analysis!

References

Rana, H. and Patel, M. (2013) ‘A study of web log analysis using clustering techniques’, International Journal of Innovative Research in Computer and Communication Engineering, 1(4). (Accessed: 11 February 2017).

Jain, A.K., Murty, M.N. and Flynn, P.J. (1999) ‘Data clustering: A review’, ACM Computing Surveys, 31(3), pp. 264–323. doi: 10.1145/331499.331504. (Accessed: 12 February 2017).

K-means clustering (2017) in Wikipedia. Available at: https://en.wikipedia.org/wiki/K-means_clustering (Accessed: 13 February 2017).

Hartigan, J.A. and Wong, M.A. (1979) ‘Algorithm AS 136: A k-means clustering algorithm’, Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), pp. 100–108. (Accessed: 13 February 2017).

Bradley, P.S. and Fayyad, U.M. (1998) ‘Refining initial points for k-means clustering’, in ICML, 98, pp. 91–99. (Accessed: 13 February 2017).

Alsabti, K., Ranka, S. and Singh, V. (1997) ‘An efficient k-means clustering algorithm’. (Accessed: 13 February 2017).

Rapidminer k-means (2017) Available at: https://www.youtube.com/results?search_query=rapidminer+k-means (Accessed: 13 February 2017).