How to move WordPress images to Amazon S3 – Free Solution!

Web administrators moving existing WordPress sites (and their image resources) to Amazon S3 have a couple of paid plugins to choose from to ease the migration. However, most of these WordPress plugins come at pretty hefty prices, some as high as a couple of hundred dollars. The following guide outlines a step-by-step option that is completely free of charge. [Read more…]
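
The full walkthrough is in the linked post; as a rough illustration of the general idea, the sketch below uses the AWS SDK for PHP to copy a single media file into an S3 bucket. The bucket name, region and file paths are placeholders, not values from the guide.

```php
<?php
// Rough sketch only: copies one file from wp-content/uploads to S3.
// Requires the AWS SDK for PHP (composer require aws/aws-sdk-php);
// credentials come from the environment or the usual AWS config files.
// Bucket name, region and file paths below are placeholders.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1',          // placeholder region
]);

$localFile = '/var/www/html/wp-content/uploads/2020/01/photo.jpg'; // placeholder path
$bucket    = 'my-wordpress-media';                                 // placeholder bucket
$key       = 'wp-content/uploads/2020/01/photo.jpg';               // keep the same relative path

$result = $s3->putObject([
    'Bucket'     => $bucket,
    'Key'        => $key,
    'SourceFile' => $localFile,
    'ACL'        => 'public-read',     // images need to stay publicly readable (bucket policy permitting)
]);

// The returned object URL can then replace the local URL in the WordPress database.
echo $result['ObjectURL'] . PHP_EOL;
```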

How to anonymize traffic programmatically by using PHP/Curl and Tor network

Tor is free software that prevents people from learning your location or browsing habits by letting you communicate anonymously on the Internet. Over the years, I have seen Tor installed in a variety of environments, and I think it’s important to mention that beyond using the Tor Browser for anonymous web browsing, there are various other ways of using Tor. In the following article, I outline one of these options. [Read more…]
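
The article covers the details; the gist of this particular option is to route PHP cURL requests through the local Tor SOCKS proxy, which listens on 127.0.0.1:9050 by default. A minimal sketch, assuming the Tor service is already running on the same machine:

```php
<?php
// Minimal sketch: fetch a page through the local Tor SOCKS proxy.
// Assumes the Tor service is running and listening on 127.0.0.1:9050 (the default).
$ch = curl_init('https://check.torproject.org/');

curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:9050');
// SOCKS5_HOSTNAME makes Tor resolve DNS as well, so lookups don't leak locally.
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

$html = curl_exec($ch);
if ($html === false) {
    die('cURL error: ' . curl_error($ch));
}
curl_close($ch);

// check.torproject.org reports whether the request arrived via the Tor network.
echo strpos($html, 'Congratulations') !== false
    ? "Traffic is going through Tor\n"
    : "Not using Tor\n";
```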

Data Variety and Data Security

Nowadays, the volume, velocity, veracity and variety of Big Data are the primary factors amplifying the security issues experienced in large-scale cloud infrastructures. The upsurge in security issues in Big Data installations has predominantly been driven by the overall increase in the volume and velocity of data. However, sheer data volume is no longer the only factor creating new security challenges: dealing with the diversity of data sources (data variety) is quickly becoming another pressing concern and one of the newest security challenges for Big Data infrastructures.

[Read more…]

Migrating Enterprise Data Warehouse (EDW) to Big Data

In the following post, I cover a brief history of the Enterprise Data Warehouse (EDW), analyze the major challenges of EDW solutions, and discuss traditional EDWs and their capacity to handle Volume, Variety, and Velocity (three of the V’s of Big Data). I also explore Big Data platforms as a potential alternative to the EDW. [Read more…]

The ‘Percentage Correct’ and other Performance Prediction Methods

The following post discusses the method of ‘percentage correct’ predictions and explains why it may not be the most precise way to measure performance. I also examine analytic measurement techniques in general and recommend a suitable substitute prediction method for situations where ‘percentage correct’ is not an appropriate performance measure. [Read more…]
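
For context, ‘percentage correct’ is simply the share of predictions that match the observed outcomes. The sketch below only illustrates that calculation on made-up data; it is not taken from the post, and the substitute method recommended in the article is not shown here.

```php
<?php
// Illustrative only: compute the 'percentage correct' of a set of predictions
// against the observed outcomes. The sample data below is made up.
function percentageCorrect(array $predicted, array $actual): float
{
    $correct = 0;
    foreach ($predicted as $i => $p) {
        if ($p === $actual[$i]) {
            $correct++;
        }
    }
    return 100 * $correct / count($actual);
}

// A naive model that always predicts 'no-failure' on a skewed data set still
// scores 90% correct here, which is one reason the raw percentage can mislead.
$actual    = ['no-failure', 'no-failure', 'failure', 'no-failure', 'no-failure',
              'no-failure', 'no-failure', 'no-failure', 'no-failure', 'no-failure'];
$predicted = array_fill(0, 10, 'no-failure');

printf("Percentage correct: %.1f%%\n", percentageCorrect($predicted, $actual));
```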

Processing Apache Access logs with K-means Clustering Algorithm

In this article, I continue exploring logging as a data set; I described this type of data set earlier in the Log Management and Big Data Analytics post. Here, I suggest an application of a particular partitioning method, K-means clustering, because I think it is the most suitable candidate for this specific part of the log file management problem domain. I explain why I consider the k-means technique the most appropriate for this type of data, cover the advantages that this analysis brings to logging in general, and demonstrate the k-means cluster analysis method on a real data set. [Read more…]
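
To make the idea concrete before reading on: the rough sketch below (not the post’s code or data) derives a couple of simple features per client IP from an Apache access log and partitions the clients with a plain k-means implementation; the log path and the choice of k = 3 are arbitrary placeholders.

```php
<?php
// Rough sketch: cluster clients from an Apache access log (common/combined format)
// by request volume and error rate using a plain k-means implementation.
// The log path and k = 3 are placeholders.

// 1. Build one feature vector per client IP: [request count, share of 4xx/5xx responses].
$stats = [];
foreach (file('/var/log/apache2/access.log') as $line) {            // placeholder path
    if (!preg_match('/^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) /', $line, $m)) {
        continue;
    }
    [$ip, $status] = [$m[1], (int) $m[2]];
    $stats[$ip]['requests'] = ($stats[$ip]['requests'] ?? 0) + 1;
    $stats[$ip]['errors']   = ($stats[$ip]['errors'] ?? 0) + ($status >= 400 ? 1 : 0);
}
$points = [];
foreach ($stats as $ip => $s) {
    $points[$ip] = [$s['requests'], $s['errors'] / $s['requests']];
}

// 2. Standard k-means: assign points to the nearest centroid, then recompute centroids.
function kMeans(array $points, int $k, int $iterations = 50): array
{
    $centroids = array_slice(array_values($points), 0, $k);   // naive initialisation
    $labels = [];
    for ($i = 0; $i < $iterations; $i++) {
        // Assignment step: nearest centroid by squared Euclidean distance.
        foreach ($points as $id => $p) {
            $best = 0;
            $bestDist = PHP_FLOAT_MAX;
            foreach ($centroids as $c => $centroid) {
                $d = ($p[0] - $centroid[0]) ** 2 + ($p[1] - $centroid[1]) ** 2;
                if ($d < $bestDist) { $bestDist = $d; $best = $c; }
            }
            $labels[$id] = $best;
        }
        // Update step: move each centroid to the mean of its members.
        foreach ($centroids as $c => $unused) {
            $members = array_keys($labels, $c, true);
            if (!$members) { continue; }
            $centroids[$c] = [
                array_sum(array_map(fn($id) => $points[$id][0], $members)) / count($members),
                array_sum(array_map(fn($id) => $points[$id][1], $members)) / count($members),
            ];
        }
    }
    return [$labels, $centroids];
}

[$labels, $centroids] = kMeans($points, 3);    // k = 3 is an arbitrary choice here
foreach ($centroids as $c => $centroid) {
    printf("Cluster %d: ~%.0f requests, %.0f%% errors, %d clients\n",
        $c, $centroid[0], $centroid[1] * 100, count(array_keys($labels, $c, true)));
}
```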

Log Management and Big Data Analytics

In the following article, I explore the issue of log collection and analysis, a very specific problem domain for many large organizations. Logging is a good example of a high-volume, high-velocity data set, which makes it a strong candidate for the application of Big Data analytic techniques. This article is not meant to go into the details of how analytic methods perform data classification or other analytic tasks; it’s mainly to shed some light on the application of Big Data techniques to logging and to outline some of the business benefits of log analysis. [Read more…]

Simple Java Web Parser with AI Capabilities (aka Programmatic approach to derive the meaning behind text content)

This article is just me thinking out loud about creating something better than the simple wordcount.java example that is usually bundled with Big Data solutions such as Hadoop, which I covered in the previous post. I wanted a script that would be a bit more complex and relate more to meaningful web indexing. I wrote a Java program that acts as a web parser and can programmatically provide the meaning of any website by statistically judging its content. If run against Google search results, it can also provide AI-like answers to complex questions (such as ‘who is the president of some country’), or guess the closest meaning behind a set of keywords (for example, ‘gold, color, breed’ will produce the response ‘Golden Retriever’) – see the examples below. Of course, this is just the result of a bit of spare time, but it’s something that could perhaps be explored further as a method to derive the basic meaning behind textual content in Big Data (to get the gist of the content in a couple of words). Anyhow, in its current form it’s just a further play on Hadoop’s wordcount.java.

[Read more…]
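
The parser described in the post is written in Java; purely to illustrate the underlying idea of ranking a page’s words to get its gist, here is a quick PHP sketch. The URL and the tiny stop-word list are hypothetical placeholders, not part of the original program.

```php
<?php
// Quick illustration of the core idea only (the parser described in the post is Java):
// fetch a page, strip the markup, count word frequencies and keep the top terms
// as a rough "gist" of the content. The URL and stop-word list are placeholders.
$url = 'https://en.wikipedia.org/wiki/Golden_Retriever';   // placeholder URL

$html  = file_get_contents($url);
$text  = strtolower(strip_tags($html));
$words = preg_split('/[^a-z]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

// Drop very short words and a small, hypothetical stop-word list.
$stopWords = ['the', 'and', 'for', 'that', 'with', 'from', 'this', 'are', 'was'];
$words = array_filter($words, fn($w) => strlen($w) > 2 && !in_array($w, $stopWords, true));

// Rank by frequency; the top few terms approximate what the page is "about".
$counts = array_count_values($words);
arsort($counts);

print_r(array_slice($counts, 0, 10, true));
```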

CAP Theorem and Big Data

The following article analyzes the applicability of the CAP theorem to Big Data. I will explain the CAP theorem, explore its three characteristics, and walk through a proof of the theorem using an example closely related to a Big Data use case. I will also briefly discuss a couple of possible ways to deal with CAP-related issues in distributed Big Data applications and offer an overview of the implementations that best fit each of the CAP properties. [Read more…]