The ‘Percentage Correct’ and other Performance Prediction Methods

The following post discusses the ‘percentage correct’ method of measuring prediction performance and explains why it may not always be the most accurate measure. I also examine analytic measurement techniques more generally and recommend a suitable substitute prediction method for situations where ‘percentage correct’ is not an appropriate way to measure performance. [Read more…]
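As a quick illustration of the point the post makes (with made-up numbers of my own, not data from the post): on an imbalanced data set, the ‘percentage correct’ score can look excellent even when the model is useless.

```python
# Hypothetical imbalanced data set: 95 negative cases, 5 positive ones.
actual = [0] * 95 + [1] * 5
# A useless model that always predicts the majority class...
predicted = [0] * 100

# ...still scores 95% 'percentage correct', while missing every positive.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)  # 0.95
```

The 95% figure hides the fact that not a single positive case was identified, which is exactly why a substitute measure can be needed.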

Processing Apache Access logs with K-means Clustering Algorithm

In this article, I continue exploring logging as a data set, which I described earlier in the Log Management and Big Data Analytics post. Here, I suggest applying a particular partitioning method, K-means clustering, which I consider the most suitable candidate for this specific area of the log file management problem domain. I explain why I consider the K-means technique the most appropriate for this type of data, cover the advantages this analysis brings to logging in general, and demonstrate the use of the K-means cluster analysis method on a real data set. [Read more…]
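As a rough sketch of the idea (the feature and the numbers below are hypothetical, not taken from the post’s data set), even a toy one-dimensional K-means can separate ordinary clients from an anomalously chatty one in an access log:

```python
def kmeans_1d(values, k=2, iters=20):
    """Toy one-dimensional K-means: partition values into k clusters."""
    # Seed with the first k values; a real implementation would pick
    # k distinct random points instead.
    centroids = list(values[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # assign each value to its nearest centroid
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Hypothetical feature extracted from an Apache access log:
# requests per minute seen from each client IP.
requests_per_minute = [2, 3, 4, 3, 2, 250, 260, 255]
print(kmeans_1d(requests_per_minute))  # two well-separated groups emerge
```

The two resulting centroids (around 3 and around 255 here) partition the clients into a ‘normal traffic’ group and a possible scraper or attacker, which is the kind of separation the post applies to real log data.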

Log Management and Big Data Analytics

In the following article, I explore the issue of log collection and analysis, a very specific problem domain for many large organizations. Logging is a good example of a high-volume, high-velocity data set, which makes it a good candidate for the application of Big Data analytic techniques. This article is not meant to go into the details of how analytic methods perform data classification or other analytic tasks; it mainly sheds some light on the application of Big Data techniques to logging and outlines some of the business benefits of log analysis. [Read more…]

Simple Java Web Parser with AI Capabilities (aka Programmatic approach to derive the meaning behind text content)

This article is just me thinking out loud about creating something better than the simple example that is usually bundled with Big Data solutions such as Hadoop – which I covered in the previous post. I wanted a script that would be a bit more complex and relate more to meaningful web indexing. I wrote a Java program that acts as a web parser and can programmatically derive the meaning of any website by statistically judging its content. If run against Google search results, it can also provide AI-like answers to complex questions (such as ‘who is the president of some country’), or guess the closest meaning behind a set of keywords (for example, ‘gold, color, breed’ will result in the response ‘Golden Retriever’) – see the examples below. Of course, this is just the result of a bit of spare time, but it is something that could perhaps be explored further as a method to derive the basic meaning behind textual content in big data (to get the gist of the content in a couple of words). Anyhow, in its current form it’s just a further play on Hadoop’s bundled example.
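The actual program is written in Java and covered in the post, but the core statistical idea can be sketched in a few lines of Python (the stop-word list and sample text below are my own illustration, not the program’s actual code):

```python
import re
from collections import Counter

# A tiny stop-word list for illustration; a real parser needs a larger one.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def gist(text, top=3):
    """Guess the gist of a page by ranking its most frequent content words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top)]

page = ("The Golden Retriever is a breed of dog. The breed's golden "
        "coat color is a defining trait of the retriever.")
print(gist(page))
```

Run on this sample text, the top-ranked words are ‘golden’, ‘retriever’, and ‘breed’ – a crude but workable gist of the page in a couple of words.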

[Read more…]

CAP Theorem and Big Data


The following article analyses the applicability of the CAP theorem to Big Data. I will explain the CAP theorem, explore its three characteristics, and provide a proof of the theorem using an example closely related to a Big Data use case. I will also briefly discuss a couple of possible ways to deal with CAP-related issues in distributed Big Data applications and offer an overview of the implementations that best fit each of the CAP properties. [Read more…]

Introduction to NoSQL & Document Data Store

The following article provides a high-level overview of NoSQL databases and the various data store types associated with them. A section of the article is dedicated to a brief summary of document-oriented NoSQL databases. I provide example data that illustrates how a document NoSQL database stores its records and also outline the most significant differences between relational SQL databases and document-oriented NoSQL databases. [Read more…]
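For instance, a document database might store a blog post together with its tags and comments as one nested record (a hypothetical example of my own, not taken from the article), where a relational schema would split the same data across several joined tables:

```python
import json

# A hypothetical blog-post entity as a document store might hold it:
# the tags and comments are nested inside the document itself, where a
# relational schema would normalize them into separate 'posts', 'tags',
# and 'comments' tables linked by foreign keys.
post = {
    "_id": 1,
    "title": "CAP Theorem and Big Data",
    "tags": ["cap", "big-data"],
    "comments": [
        {"author": "alice", "text": "Nice overview."},
        {"author": "bob", "text": "Thanks for the proof."},
    ],
}

# The whole entity is read and written as one self-contained document.
print(json.dumps(post, indent=2))
```

Retrieving the post needs no joins: one lookup returns the entire entity, which is one of the key differences from the relational model.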

MongoDB and BSON format

Recently I came across a statement that said: “MongoDB uses the BSON format which extends the JSON model to provide additional data types”, and I think this topic deserves a bit of clarification. [Read more…]
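A minimal sketch of why the distinction matters (my own example, not from the quoted statement): plain JSON has no native date type, so a date has to be degraded to a string, whereas BSON stores it as a first-class Date value (among other added types, such as ObjectId and binary data).

```python
import json
from datetime import datetime

record = {"event": "login", "at": datetime(2015, 6, 1, 12, 30)}

# Plain JSON has no date type, so serializing the record fails outright.
try:
    json.dumps(record)
except TypeError as err:
    print("JSON cannot encode it:", err)

# The usual JSON workaround degrades the date to a plain string:
record["at"] = record["at"].isoformat()
print(json.dumps(record))
```

In MongoDB the same field would remain a real Date inside the BSON document, so it can be compared and range-queried without string parsing.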

Microtargeting, Big Data, and Elections

Microtargeting (also micro-targeting or micro-niche targeting) is a method used by the marketing sector to analyze consumer data collected from various sources in order to detect the interests of specific individuals. This data is ordered and classified, and later provides the information used to influence the thinking of specific like-minded groups of people. That said, one of the major aims of microtargeting initiatives is simply to identify the target audience at as granular a level as possible, along with the target’s preferred communication channels. [Read more…]

Big Data & Deforestation – Use Case


The following use case is my attempt at demonstrating the importance of Big Data in relation to the world’s largest food companies and their current impact on the overall trend of deforestation worldwide. The use case encompasses many of the V’s of Big Data and demonstrates that Big Data is increasingly important to consider, especially in connection with the world’s largest food manufacturers and their analysis of current and future deforestation trends. [Read more…]

Impact of Big Data Volume in the Context of Distributed Data, Scalability, Data Access and Storage


Volume is the most characteristic property of Big Data and to a large extent affects the other five V’s of Big Data, namely Velocity, Variety, Veracity, Variability, and Value. In this article, I explore the impact of Volume primarily in the context of distributed data and scalability, and of data access and storage, as well as its impact on data transfer.

[Read more…]