Simple Java Web Parser with AI Capabilities (aka Programmatic approach to derive the meaning behind text content)

This article is just me thinking out loud about creating something better than the simple wordcount.java example that is usually bundled with Big Data solutions such as Hadoop, which I covered in the previous post. I wanted a script that would be a bit more complex and closer to meaningful web indexing. I wrote a Java program that acts as a web parser and can programmatically estimate the meaning of any website by statistically judging its content. If run against Google search results, it can also provide AI-like answers to complex questions (such as ‘who is the president of some country’), or guess the closest meaning behind a set of keywords (such as ‘gold, color, breed’, which results in the response ‘Golden Retriever’); see the examples below. Of course, this is just the result of a bit of spare time. But it’s something that could perhaps be explored further as a method to derive the basic meaning behind textual content in big data (to get the gist of the content in a couple of words). Anyhow, in its current form it’s just a further play on Hadoop’s wordcount.java.

 

What does it do?

– The script lets you specify any URL

– It sends a browser’s user agent (simulating an actual browser user) and in that way grabs the textual content of the website at the specified URL

– Once the text is acquired, the script finds the frequency of occurrence of all word combinations inside the text (it looks at the entire article, so not just single words, but also pairs, triples, quadruples, entire repeating sentences, etc.)

– The script recognizes that capital and small letters could skew the word combination counts, so it looks at the text in a case-insensitive manner (this can be switched off)

– It applies stop words based on a provided list, removing words that are common and thus useless for finding the real frequency of word combinations

– It prints the occurrence count of every word combination, by rank and phrase
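
The counting steps above can be sketched roughly as follows. This is a minimal illustration and not the original script: the class name `NGramCounter`, its method names, the tiny stop-word list, and the `maxWords` limit are all assumptions made for the example. It also assumes stop words are removed before combinations are formed, so words separated only by a stop word end up counted as neighbours.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch (not the original script): counts word combinations
// in plain text after optional lower-casing and stop-word removal.
class NGramCounter {

    // Split text into word tokens and drop stop words.
    // Assumes the stop-word set contains lower-case words.
    static List<String> tokenize(String text, Set<String> stopWords, boolean ignoreCase) {
        String source = ignoreCase ? text.toLowerCase(Locale.ROOT) : text;
        List<String> tokens = new ArrayList<>();
        for (String word : source.split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty() && !stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    // Count every run of 1..maxWords consecutive tokens (word order matters).
    static Map<String, Integer> countNGrams(List<String> tokens, int maxWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= maxWords && start + len <= tokens.size(); len++) {
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(tokens.get(start + len - 1));
                counts.merge(phrase.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Highest count first; ties are broken alphabetically for stable output.
    static List<Map.Entry<String, Integer>> rank(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return ranked;
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("the", "and", "of", "is", "a", "in"); // tiny sample list
        String text = "Consistency and availability. "
                + "The CAP theorem is about consistency and availability.";
        List<String> tokens = tokenize(text, stopWords, true);
        // Print only combinations that repeat, ranked by count.
        for (Map.Entry<String, Integer> entry : rank(countNGrams(tokens, 3))) {
            if (entry.getValue() > 1) {
                System.out.println(entry.getValue() + "  " + entry.getKey());
            }
        }
    }
}
```

Capping the combination length with `maxWords` keeps the number of counted phrases manageable; a larger value would also catch longer repeating phrases, at the cost of a bigger map.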

 

How can it be used?

  • It can programmatically estimate the meaning of any website by statistically judging its content
  • If run against Google search results, it can provide AI-like answers to complex questions (such as ‘who is the president of some country’), or guess the closest meaning behind a set of keywords (such as ‘gold, color, breed’, which results in the response ‘Golden Retriever’); see the examples below.
  • Looking at the most recent news stories, it can guess the most important headlines in any time frame and get the gist of what’s currently happening.

 

Example #1 – Parsing the meaning behind the Wikipedia Page

Let’s try the script on a simple Wikipedia page about CAP Theorem: https://en.wikipedia.org/wiki/CAP_theorem

Once processed by the script, it’ll provide the output cleaned of stop words, such as this:

 

 

The script summarized the output as the first 6 keyword combinations. So what does it tell us? Well, it says that the page is about the CAP theorem and talks about consistency, availability, distributed systems, and distributed computing. It also says that the page somehow relates to the name ‘Eric Brewer’ and that the topic is somehow connected to computer science. Not bad for a one-page script, right?

As we can see, the outcome is a logical analysis of the page, of the kind a web search engine might do, based on the occurrence of word combinations. And it tells us something about the page without even visiting it. Do you agree? In my opinion, it’s a much better way to extract some AI from the site than looking at the single-word counts used in the MapReduce-type scripts of Big Data solutions.

In my opinion, there has to be a connection between words, and once that is combined with the removal of stop words, it’s even more meaningful. Using single-word matching, what would it tell me if I saw the word ‘theorem’ or ‘distributed’ or the name ‘Eric’ ranking high? Nothing! It needs to be ‘Eric Brewer’, ‘Cap Theorem’, ‘Distributed Computing’…

You can immediately see the potential of scripts like this. We don’t need to visit a page to find its overall high-level meaning. We can use a programmatic approach to extract the probable significance and meaning behind the page.

And for an AI system, that could be as good as reading the content of the page.

And imagine if this is run against multiple articles. Could this script provide the answers to complex questions?

Sure, let’s give it a try.

EXAMPLE #2

Imagine you’re thinking about a dog breed of golden color, but can’t recall the breed’s name.

The script can help. All we need to do is run it against a 3-keyword Google search using something we know, such as the words Golden, Color, Breed.
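
To give a rough idea of how such a keyword search could be turned into a page fetch, here is a minimal sketch using the JDK’s own HTTP client (Java 11+). It is not the original script: the class `SearchFetcher`, the search URL pattern, the user-agent string, and the crude tag stripping are all assumptions for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch (not the original script): builds a search URL from
// keywords and fetches it while presenting a browser-like user agent.
class SearchFetcher {

    // Assumed values for illustration; the original script's values are not shown.
    static final String SEARCH_BASE = "https://www.google.com/search?q=";
    static final String USER_AGENT =
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/120.0 Safari/537.36";

    // Join the keywords with spaces and URL-encode them into the query string.
    static URI searchUri(String... keywords) {
        String query = String.join(" ", keywords);
        return URI.create(SEARCH_BASE + URLEncoder.encode(query, StandardCharsets.UTF_8));
    }

    // Build a GET request that carries the browser-like User-Agent header.
    static HttpRequest buildRequest(URI uri) {
        return HttpRequest.newBuilder(uri)
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }

    // Perform the request and return the raw response body (network access needed).
    static String fetch(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        return client.send(buildRequest(uri), HttpResponse.BodyHandlers.ofString()).body();
    }

    // Very crude HTML-to-text step: drop script/style blocks, then all tags.
    static String stripTags(String html) {
        return html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                .replaceAll("<[^>]+>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        URI uri = searchUri("Golden", "Color", "Breed");
        System.out.println(uri);
        // String text = stripTags(fetch(uri)); // then feed the text to the counter
    }
}
```

A real HTML parser would give cleaner text than the regular expressions used here; they only keep the sketch dependency-free.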

The script successfully determines that I am most likely talking about a Golden Retriever and that I am likely referring to a dog breed.

Check it out:

Isn’t that cool? I’ve got my answer.

EXAMPLE #3

How about running it against a complex question?

I’ll ask Google search, through my script, a simple question: Who is the prime minister of Canada?

The script determined that I am most likely talking about the prime minister of Canada, Justin Trudeau.

Cool, isn’t it? :)

 


EXAMPLE #1 – DEMO & CODE

For the purpose of EXAMPLE #1, let’s look at how the script works in terms of two combinations: ‘consistency availability’ and the same words in flipped order, ‘availability consistency’:

As you can see:

– the ‘consistency availability’ combination is reported as occurring 5 times on the page

and

– the ‘availability consistency’ combination is reported as occurring 2 times.
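
The reason the two counts differ is that word order matters: a combination and its flipped version are counted separately. A small sketch of that idea, with a made-up token list rather than the actual Wikipedia page, and a hypothetical `PhraseOrderDemo` class that is not part of the original script:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: shows that a phrase and its flipped order
// are counted as two different combinations.
class PhraseOrderDemo {

    // Count how often `phrase` occurs as consecutive tokens, in that exact order.
    static int countPhrase(List<String> tokens, List<String> phrase) {
        if (phrase.isEmpty()) {
            return 0;
        }
        int hits = 0;
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + phrase.size()).equals(phrase)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up token stream with stop words already removed.
        List<String> tokens = Arrays.asList(
                "consistency", "availability", "partition", "tolerance",
                "availability", "consistency", "consistency", "availability");

        System.out.println("consistency availability: "
                + countPhrase(tokens, Arrays.asList("consistency", "availability")));
        System.out.println("availability consistency: "
                + countPhrase(tokens, Arrays.asList("availability", "consistency")));
    }
}
```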

If you were to look at the real page, you’d find that the script works and also successfully uses the stop words:

PROOF THAT IT WORKS:

 

I’ve made a short YouTube video demo if you want to see it in action: 

 

Anyhow, here is the actual code you can grab and play with:

 

Anyhow, this is my attempt at coding a better word count example than the one we were provided in the Hadoop testing (previous article).

Of course, this is not optimized for Big Data; it’s just something that could be used as an algorithm on the back end.

Mainly it’s just me playing around with it…

 
