How to query and visualize Apache logs in Splunk with Python

Splunk Enterprise is presumably one of the best platforms for gaining real-time intelligence from data. The following short tutorial explains how to install Splunk server (on Windows or Linux server) and how to use Python to programmatically query Splunk and visualize Apache data by plotting them on a graph by using the Matplotlib library. In this post, show step by step, how to add Apache logs to Splunk, and illustrate how relatively easy is to use Python to query how much bandwidth was consumed by one of my websites (in the past week, in megabytes).

Splunk Server Installation (Windows/Linux)

Windows Installation

If you’re using XAMPP and want to monitor Apache logs with Windows, we’ll need to install Splunk Enterprise on Windows box. You can do this for free, as we get a Splunk Enterprise license for 60 days that lets you index up to 500 megabytes of data per day. After 60 days you can convert to a perpetual free license or purchase a Splunk Enterprise license. To download Splunk, go to https://www.splunk.com/en_us/download/splunk-enterprise.html and simply select the Windows, Linux or Mac OS installation:

Windows installation file is approx. 200 MB and by default Splunk is installed into C:\Program Files\Splunk\ directory.

Once installed, Splunk will be available at the following internal address: http://localhost:8000 and look like this:

 

Centos 7 or Amazon Linux Installation

The following instructions are for those running Apache on Linux and want to monitor their Apache logs locally – directly from the Linux box. To do so, simply download Splunk Enterprise for free, you get a Splunk Enterprise license for 60 days that lets you index up to 500 megabytes of data per day. After 60 days you can convert to a perpetual free license or purchase a Splunk Enterprise license. To download Splunk, go to https://www.splunk.com/en_us/download/splunk-enterprise.html

The file to download has the TGZ extension and will be named something like this: splunk-7.2.6-c0bf0f679ce9-Linux-x86_64.tgz, depending on the current stable version of Splunk.

Copy splunk-7.2.6-c0bf0f679ce9-Linux-x86_64.tgz file over to your /tmp directory, then run the following commands to run the installation

Once above is done, you’ll have the Splunk running on the local system of whatever is your IP address, likely at http://localhost:8000 or your server’s http://your-server-ext-ip:8000:

The last step usually is to enable the port 8000 and also 8089 (API for programmatic access) in your firewall. On your Centos install you can add a permanent rule to firewall like this:

On Amazon Linux, you can do the exception for port 8000 and 8089 through AWS interface.

Once done, you should be able to access Splunk interface through the browser:

And port 8000 and 8089 are configured in Server settings » General settings:

 

ADDING APACHE LOGS TO SPLUNK

Now, let’s add our Apache logs to Splunk. To do so, login to Splunk server interface and select: Settings, then choose an option to ‘Add Data’:

Scroll towards the bottom and select the option to monitor files:

Then, under Files & Directories, you can enable monitoring of an entire directory.

Add in your /var/log/httpd or wherever your Apache logs are:

Once successfully added, you’ll get a message that the ‘File input has been created successfully’:

 

QUERING SPLUNK FROM THE SPLUNK INTERFACE

Now that we have Apache logs inside Splunk, let’s do some simple query.

Let’s say I want to know how much bandwidth was consumed by one of my websites (joe0.com) in the past 7 days, including today, in megabytes.

To do so, I can run a Splunk query like this:

Which produces expected results that outline the daily bandwidth usage of joe0.com website:

That’s cool, I just learned that I on a regular business day, my site joe0.com requires on average less than half a GB of bandwidth. That’s useful.

But how about if we want to run the same query, not from the Splunk interface, but instead by using Python?

Programmatic access to Splunk can surely be useful at times when running queries through Splunk interface are not sufficient for our purposes.

 

HOW TO QUERY SPLUNK BY USING PYTHON PROGRAMMING LANGUAGE

Ok, so now we’re finally getting to the meat of the article. Let’s replicate the same ‘total bandwidth per day’ report, but use Python to query the Splunk database.

Let’s start by creating a new project in our IDE and creating a new python file called: pythonsplunk.py:

Install the package: ‘splunk-sdk‘ and ‘matplotlib‘ by running the following command first:

  • Splunk-SDK: With the Splunk SDK for Python, you can write Python applications to programmatically interact with the Splunk engine. The SDK is built on top of the REST API, providing a wrapper over the REST API endpoints. So that means with fewer lines of code, you can write applications that:
    • Search your data, run saved searches, and work with search jobs.
    • Manage Splunk configurations and objects.
    • Integrate search results into your applications.
    • Log directly to Splunk.
    • Present a custom UI.
  • Matplotlib: Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
  • NumPy: is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Once both libraries are installed, let’s start our code by importing the above-mentioned libraries. Also, note, we’ll be using DateTime library, which is required for date time parsing.

On the next lines, we define the connection details and credentials to our Splunk Server:

Next, we establish a service connection to the Splunk server, with above details:

Now, we’re ready to create a result reader and query Splunk for the daily bandwidth usage of joe0.com website

Let’s create two arrays, one which will hold the dates and one for values (megabytes) we get from Splunk:

Now, let’s connect to Splunk and then read the results of our Splunk query:

The result will look like this:

As we can see, there is some weird thing happening that Splunk produces the results twice in the row.

In order to get around it, I’ve implemented a different way to list the results, that implements parsing the values that occur after the information message “INFO: Your time range was substituted based on your search string” and one that also formats the dates and results and adds them into arrays:

Which results in a greatly improved format of the output:

VISUALISING SPLUNK DATA USING MATPLOTLIB

Now, we can work with that and use the Matplotlib library to plot the dates and values on the bar graph.

This can be done, but adding the following code to our program:

If we run the program now, voila, we’ve programmatically plotted the bandwidth values extracted from Splunk onto a graph by using Matplotlib library, here is what the result looks like:

Screenshot from the IDE:

 

SOURCE CODE

Source code is posted to Github https://github.com/JozefJarosciak/PythonSplunk and also here:

 

 

 

 

 

Facebook Comments