The following post discusses the ‘percentage correct’ method of measuring prediction performance and explains why it may not be the most precise measure. I also examine analytic measurement techniques in general and recommend a suitable substitute for situations where ‘percentage correct’ is not an appropriate performance measure. [Read more…]
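As a quick illustration of why ‘percentage correct’ can mislead, here is a minimal sketch with hypothetical data: on an imbalanced data set, a model that always predicts the majority class looks accurate while detecting nothing. (The substitute measure itself is discussed in the full post.)

```python
# Hypothetical imbalanced data: 95 negatives, 5 positives.
actual = [0] * 95 + [1] * 5
# A useless "model" that always predicts the majority class.
predicted = [0] * 100

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(accuracy)  # 0.95 -- looks great as 'percentage correct'

true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
print(true_positives)  # 0 -- yet every positive case was missed
```

The 95% score hides the fact that the model never identifies a single positive case, which is exactly the weakness the post examines.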
In this article, I continue exploring logging as a data set. I described this type of data set earlier in the Log Management and Big Data Analytics post. In this section, I suggest applying a particular partitioning method, k-means clustering, which I consider the most suitable candidate for this specific part of the log file management problem domain. I explain why I consider the k-means technique the most appropriate for this type of data, cover the advantages this analysis brings to logging in general, and demonstrate the use of the k-means cluster analysis method on a real data set. [Read more…]
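To give a feel for the technique before the full article, here is a minimal hand-rolled k-means sketch on hypothetical log-derived features (events per minute and average message length); it is my own illustration, not the article's actual data or code:

```python
import random

random.seed(42)

def kmeans(points, k, iters=20):
    # Naive Lloyd's algorithm: assign each point to its nearest centroid,
    # then move each centroid to the mean of its assigned points.
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        centroids = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Hypothetical features per log window: (events per minute, avg. message length).
baseline = [(random.gauss(10, 2), random.gauss(40, 5)) for _ in range(50)]
burst = [(random.gauss(200, 10), random.gauss(120, 10)) for _ in range(50)]

centroids, clusters = kmeans(baseline + burst, k=2)
print([len(cl) for cl in clusters])  # cluster sizes; the burst should separate out
```

In a log management context, the appeal is that clusters like the "burst" group surface anomalous activity without any predefined rules.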
In the following article, I explore the issue of log collection and analysis, a very specific problem domain for many large organizations. Logging is a good example of a high-volume, high-velocity data set, which makes it a strong candidate for the application of Big Data analytic techniques. This article is not meant to go into the details of how analytic methods perform data classification or other analytic tasks; it is mainly intended to shed some light on the application of Big Data techniques to logging and to outline some of the business benefits of log analysis. [Read more…]
This article is just me thinking out loud about creating something better than the simple wordcount.java example that is usually bundled with Big Data solutions such as Hadoop – which I covered in the previous post. I wanted a script that would be a bit more complex and relate more to meaningful web indexing. I wrote a Java program that acts as a web parser and can programmatically derive the meaning of any website by statistically judging its content. If run against Google search results, it can also provide AI-like answers to complex questions (such as ‘who is the president of some country’), or guess the closest meaning behind a set of keywords (for example, ‘gold, color, breed’ will result in the response ‘Golden Retriever’) – see the examples below. Of course, this is just the result of a bit of spare time. But it is something that could perhaps be further explored as a method to derive the basic meaning behind textual content in big data (to get the gist of the content in a couple of words). Anyhow, in its current form it is just a further play on Hadoop’s wordcount.java.
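The core idea – ranking content words by frequency to get the gist of a text – can be sketched in a few lines. This is not the original Java parser, just a minimal Python illustration with a made-up sample text and stopword list:

```python
import re
from collections import Counter

# A small, hypothetical stopword list; a real parser would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "is", "and", "to", "in", "it", "that", "its", "s"}

def gist(text, top=3):
    # Lowercase, split into alphabetic tokens, drop stopwords, rank by count.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top)]

sample = ("The Golden Retriever is a breed of dog. The breed's golden coat "
          "is its best-known trait, and the retriever is a popular family dog.")
print(gist(sample))  # ['golden', 'retriever', 'breed']
```

Scaled up over crawled pages via MapReduce, the same frequency ranking is what lets word counts start to look like "meaning".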
This is a short guide on how to install a Hadoop single-node cluster on a Windows computer without Cygwin. The intention behind this little test is to have a Hadoop test environment in your own local Windows environment. [Read more…]
The following article analyses the applicability of the CAP theorem to Big Data. I will explain the CAP theorem, explore its three characteristics, and illustrate a proof of the theorem with an example closely related to a Big Data use case. I will also briefly discuss a couple of possible ways to deal with CAP-related issues in distributed Big Data applications and offer an overview of the implementations that best fit each of the CAP properties. [Read more…]
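The trade-off at the heart of the theorem can be shown with a toy two-replica store; this is my own simplified illustration, not the article's proof. During a network partition, the store must either refuse requests (stay consistent, CP) or answer with possibly stale data (stay available, AP):

```python
class Replica:
    def __init__(self):
        self.data = {}

class TinyCluster:
    """Two replicas; 'consistent=True' picks CP behavior, otherwise AP."""
    def __init__(self, consistent=True):
        self.a, self.b = Replica(), Replica()
        self.partitioned = False
        self.consistent = consistent

    def write(self, key, value):
        if self.partitioned and self.consistent:
            # CP: refuse the write rather than let the replicas diverge.
            raise RuntimeError("unavailable during partition")
        self.a.data[key] = value
        if not self.partitioned:
            self.b.data[key] = value  # replication reaches the other side

    def read_from_b(self, key):
        return self.b.data.get(key)

ap = TinyCluster(consistent=False)
ap.partitioned = True
ap.write("x", 1)
print(ap.read_from_b("x"))  # None: available during the partition, but the read is stale

cp = TinyCluster(consistent=True)
cp.partitioned = True
try:
    cp.write("x", 1)
except RuntimeError as err:
    print(err)  # consistent, but the request was refused
```

Neither behavior is wrong; which one a distributed Big Data store should pick depends on the application, which is exactly what the CAP discussion in the article is about.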
The following article provides a high-level overview of NoSQL databases and the various data store types associated with them. A particular section of the article is dedicated to a brief summary of document-oriented NoSQL databases. I provide example data that illustrate how a document-oriented NoSQL database stores data, and also outline the most significant differences between relational SQL databases and document-oriented NoSQL databases. [Read more…]
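To preview the document-versus-relational contrast, here is a small hypothetical example (my own, not the article's data set): an order kept as one nested, self-describing document, next to the two normalized tables a relational schema would typically use instead.

```python
import json

# Document-oriented form: customer and line items nested in one document.
order_document = {
    "_id": 1001,
    "customer": {"name": "Jane Doe", "email": "jane@example.com"},
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
}

# Relational equivalent: an orders row plus order_items rows joined by order id.
orders_row = (1001, "Jane Doe", "jane@example.com")
order_items_rows = [(1001, "A-1", 2, 9.99), (1001, "B-7", 1, 24.50)]

# The document travels as one unit -- no join is needed to total the order.
total = sum(i["qty"] * i["price"] for i in order_document["items"])
print(round(total, 2))  # 44.48
print(json.dumps(order_document["customer"]))
```

The nesting is the key difference: the document store reads and writes the whole order in one operation, while the relational form needs a join but avoids duplicating customer data across orders.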
Recently I came across a statement that said: “MongoDB uses the BSON format which extends the JSON model to provide additional data types”, and I think this topic deserves a bit of clarification. [Read more…]
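One stdlib-only way to see what "additional data types" means: plain JSON has no native date (or binary) type, which is part of what BSON adds on top of the JSON model. A quick sketch of where plain JSON stops:

```python
import json
from datetime import datetime

doc = {"event": "login", "at": datetime(2014, 1, 1, 12, 0)}

# Plain JSON cannot serialize a datetime natively...
try:
    json.dumps(doc)
except TypeError:
    print("JSON cannot serialize datetime natively")

# ...so the common JSON workaround is a string, losing the type information
# that a BSON date field would preserve.
encoded = json.dumps({"event": "login", "at": doc["at"].isoformat()})
print(encoded)
```

BSON keeps such values as typed fields rather than strings, which is the clarification the post walks through.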
This is just a short look at the popularity of MongoDB, Redis and Apache Cassandra. [Read more…]
Microtargeting (also micro-targeting or micro-niche targeting) is a method used by the marketing sector to analyze consumer data collected from various sources in order to detect the interests of specific individuals. The collected data are ordered and classified, and later provide information that is used to influence the thinking of specific like-minded groups of people. That said, one of the major aims of microtargeting initiatives is simply to identify the target audience at as granular a level as possible, and to identify the target’s preferred communication channel. [Read more…]
The following article is my attempt at exploring the niche market of smart IoT door locking solutions and at investigating, in part, how Big Data analytics could improve this specific sector and, in turn, our personal lives and home security. [Read more…]
The following use case is my attempt at illustrating the importance of Big Data in relation to the world’s largest food companies and their current impact on the overall trend of deforestation in the world. The use case encompasses many of the V’s of Big Data and demonstrates that Big Data is increasingly important to consider, especially in connection with the world’s largest food manufacturers and their analysis of current and future deforestation trends. [Read more…]
Volume is the most characteristic property of Big Data and, to a large extent, it affects the other five V’s of Big Data, namely Velocity, Variety, Veracity, Variability, and Value. In this article, I explore Volume’s impact primarily in the context of distributed data and scalability, data access and storage, as well as its impact on data transfer.
This short post talks about PID services and how they are used across the components of the Big Data Architecture Framework. [Read more…]
The main goal of the following article is to analyze the significance of Big Data Lifecycle Management (BDLM) within the scope of a Big Data reference architecture. I will concentrate on the BDLM model and workflow, and provide some real-world examples. [Read more…]