Migrating Enterprise Data Warehouse (EDW) to Big Data

In the following post, I cover a brief history of the Enterprise Data Warehouse (EDW), analyze the major challenges of EDW solutions, and discuss how well traditional EDWs handle Volume, Variety, and Velocity (the three V’s of Big Data). I also explore Big Data platforms as a potential alternative to the EDW. [Read more…]

The ‘Percentage Correct’ and other Performance Prediction Methods

The following post discusses the ‘percentage correct’ prediction method and explains why it may not be the most precise way to measure performance. I also examine analytic measurement techniques in general and recommend a suitable substitute prediction method for situations where ‘percentage correct’ is not an appropriate performance measure. [Read more…]
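As a rough illustration of the point (this sketch and its numbers are my own, not taken from the post): a naive model that always predicts the majority class can still score a very high ‘percentage correct’ on an imbalanced data set, which is exactly why accuracy alone can mislead.

```java
// Illustrative sketch: why 'percentage correct' can be misleading on an
// imbalanced data set. The labels below are hypothetical.
public class PercentageCorrectDemo {
    public static void main(String[] args) {
        // Hypothetical labels: 95 negatives (false) and 5 positives (true).
        boolean[] actual = new boolean[100];
        for (int i = 0; i < 5; i++) {
            actual[i] = true;
        }

        int correct = 0;
        for (boolean label : actual) {
            boolean predicted = false; // naive model: always predict the majority class
            if (predicted == label) {
                correct++;
            }
        }

        // Prints 95.0 percent correct, even though the model never finds a positive case.
        System.out.println("Percentage correct: " + (100.0 * correct / actual.length));
    }
}
```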

Processing Apache Access logs with K-means Clustering Algorithm

In this article, I continue exploring logging as a data set, which I described earlier in the Log Management and Big Data Analytics post. Here I suggest applying a particular partitioning method, K-means clustering, because I consider it the most suitable candidate for this part of the log file management problem domain. I explain why I consider the k-means technique the most appropriate for this type of data, cover the advantages this analysis brings to logging in general, and demonstrate the k-means cluster analysis method on a real data set. [Read more…]
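To give a flavour of the technique, here is a minimal, self-contained Java sketch of k-means on a single hypothetical feature (requests per client IP in a time window). The full post works against a real Apache access log data set; the values and the simple seeding below are made up for illustration only.

```java
import java.util.Arrays;

// Minimal k-means sketch over one numeric feature that might be extracted from
// Apache access logs (requests per client IP). Values are hypothetical.
public class LogKMeans {
    public static void main(String[] args) {
        double[] requestsPerClient = {3, 5, 4, 6, 120, 130, 115, 2, 7, 480, 510};
        int k = 3;
        // Simple seeding: pick three points spread across the data as starting centroids.
        double[] centroids = {requestsPerClient[0], requestsPerClient[4], requestsPerClient[9]};
        int[] assignment = new int[requestsPerClient.length];

        for (int iter = 0; iter < 20; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < requestsPerClient.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(requestsPerClient[i] - centroids[c])
                            < Math.abs(requestsPerClient[i] - centroids[best])) {
                        best = c;
                    }
                }
                assignment[i] = best;
            }
            // Update step: move each centroid to the mean of its assigned points.
            for (int c = 0; c < k; c++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < requestsPerClient.length; i++) {
                    if (assignment[i] == c) {
                        sum += requestsPerClient[i];
                        count++;
                    }
                }
                if (count > 0) {
                    centroids[c] = sum / count;
                }
            }
        }

        System.out.println("Centroids: " + Arrays.toString(centroids));
        System.out.println("Assignments: " + Arrays.toString(assignment));
        // The clusters roughly separate light clients, heavy clients, and potential outliers.
    }
}
```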

Log Management and Big Data Analytics

In the following article, I explore the issue of log collection and analysis, a very specific problem domain for many large organizations. Logging is a good example of a high-volume, high-velocity data set, which makes it a strong candidate for the application of Big Data analytic techniques. This article is not meant to go into the details of how analytic methods perform data classification or other analytic tasks; it is mainly intended to shed some light on the application of Big Data techniques to logging and to outline some of the business benefits of log analysis. [Read more…]

Simple Java Web Parser with AI Capabilities (aka Programmatic approach to derive the meaning behind text content)

This article is just me thinking out loud about creating something better than the simple wordcount.java example that is usually bundled with Big Data solutions such as Hadoop, which I covered in the previous post. I wanted a script that would be a bit more complex and relate more closely to meaningful web indexing. I wrote a Java program that acts as a Web Parser and can programmatically provide the meaning of any website by statistically judging its content. If run against Google search results, it can also provide AI-like answers to complex questions (such as ‘who is the president of some country’), or guess the closest meaning behind a set of keywords (for example, ‘gold, color, breed’ results in the response ‘Golden Retriever’); see the examples below. Of course, this is just the result of a bit of spare time, but it is something that could perhaps be explored further as a method to derive the basic meaning behind textual content in Big Data (to get the gist of the content in a couple of words). Anyhow, in its current form it is just a further play on Hadoop’s wordcount.java.

[Read more…]
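The actual Web Parser code and examples live in the post itself; as a very rough sketch of the underlying idea, the snippet below simply tallies word frequencies in a piece of made-up page text and reports the most frequent terms as a crude hint of what the page is about. The sample text and the length-based noise filter are my own illustrative assumptions, not the parser’s real logic.

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the word-frequency idea behind the Web Parser: count terms in
// a page's extracted text and surface the most frequent ones as the page's "gist".
// The sample text below is hypothetical stand-in content.
public class WordMeaningSketch {
    public static void main(String[] args) {
        String pageText = "golden retriever breed dogs golden color retriever friendly breed golden";

        Map<String, Integer> counts = new HashMap<>();
        for (String word : pageText.toLowerCase().split("\\W+")) {
            if (word.length() > 3) { // crude noise filter standing in for a stop-word list
                counts.merge(word, 1, Integer::sum);
            }
        }

        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(3)
                .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
        // The top terms here ("golden", "retriever", "breed") hint at the page's subject.
    }
}
```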