How to parse Wikipedia infobox HTML using Java and Jsoup?

📅 Published 21.09.2014

It's pretty straightforward. In this tutorial we will parse out the infobox for Tom Cruise:

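  import org.jsoup.Connection.Response;
  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Element;
  import org.jsoup.select.Elements;

  public class InfoboxParser {
      public static void main(String[] args) throws Exception {
          Response res = Jsoup.connect("http://en.wikipedia.org/wiki/Tom_Cruise").execute();
          String html = res.body();
          // The response is a complete page, so parse it as a full document
          Document doc = Jsoup.parse(html);
          Elements tables = doc.body().getElementsByTag("table");
          for (Element table : tables) {
              // Print the first table whose class attribute contains "infobox"
              if (table.className().contains("infobox")) {
                  System.out.println(table.outerHtml());
                  break;
              }
          }
      }
  }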

table.outerHtml() returns the HTML markup of the infobox table, which renders as the infobox shown on the Wikipedia page.
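As a variation on the loop above, the same table can be picked out with a CSS selector. This is a minimal sketch rather than part of the original code: the class name `InfoboxSelector`, the method `findInfobox`, and the inline sample HTML are hypothetical, and it assumes jsoup 1.11.1 or later, where `selectFirst` is available.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class InfoboxSelector {

    // Returns the first <table> whose class list includes "infobox", or null if none exists.
    static Element findInfobox(Document doc) {
        return doc.selectFirst("table.infobox");
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the downloaded page, so the sketch runs offline.
        String html = "<html><body>"
                + "<table class=\"wikitable\"><tr><td>other</td></tr></table>"
                + "<table class=\"infobox biography vcard\"><tr><th>Occupation</th>"
                + "<td class=\"role\">Actor, producer</td></tr></table>"
                + "</body></html>";
        Element infobox = findInfobox(Jsoup.parse(html));
        System.out.println(infobox == null ? "no infobox found" : infobox.outerHtml());
    }
}
```

Note that `table.infobox` matches "infobox" as a whole class name, while `className().contains("infobox")` also matches class names that merely include that substring, so the two are not strictly equivalent.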

If you only need a specific element, you can again use Jsoup to select it.

Say you need the content of the Occupation row. This is how you can extract its text:

  String occupation = doc.select("td[class=role]").first().text();

The occupation variable will contain the text: Actor, producer
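The one-liner above throws a NullPointerException when no cell matches, because `first()` returns null on an empty selection. If you want every field rather than just one, a possible approach is to walk the rows and collect label/value pairs. The sketch below assumes each infobox row holds its label in a `th` and its value in a `td`, which may not hold for every Wikipedia page; the class `InfoboxRows`, the method `toMap`, and the sample HTML are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class InfoboxRows {

    // Collects each row that has both a <th> label and a <td> value into an ordered map.
    static Map<String, String> toMap(Element infobox) {
        Map<String, String> rows = new LinkedHashMap<>();
        for (Element row : infobox.select("tr")) {
            Element label = row.selectFirst("th");
            Element value = row.selectFirst("td");
            // Skip header-only or image-only rows instead of failing on null.
            if (label != null && value != null) {
                rows.put(label.text(), value.text());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical, simplified infobox markup so the sketch runs offline.
        String html = "<table class=\"infobox\">"
                + "<tr><th colspan=\"2\">Tom Cruise</th></tr>"
                + "<tr><th>Occupation</th><td class=\"role\">Actor, producer</td></tr>"
                + "</table>";
        Element infobox = Jsoup.parse(html).selectFirst("table.infobox");
        System.out.println(toMap(infobox).get("Occupation")); // prints: Actor, producer
    }
}
```

Looking fields up by their visible label keeps the code independent of cell class names such as `role`, at the cost of depending on the label text.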