Converting UTZOO Usenet archive from magnetic tapes to MySQL database using Java

Recently, I came across a post from 2001 that allowed downloading the entire collection of the UTZOO NetNews Archive of the earliest USENET posts. These were essentially the earliest available discussions posted to the Internet by folks working at various universities: millions of posts created between February 1981 and June 1991.

Until 2001, these early Usenet discussions were considered lost, but miraculously Henry Spencer of the University of Toronto's Department of Zoology had been backing them up onto magnetic tapes and kept them stored all those years (apparently at great cost).

H. Spencer had altogether 141 of these magnetic tapes, but they were of no use in that form, so eventually he and a group of motivated people, such as David Wiseman (who dragged the 141 tapes back and forth in his pickup truck), Lance Bailey, Bruce Jones, Bob Webber, Brewster Kahle, and Sue Thielen, embarked on converting all of these tapes into a regular file format accessible to everyone.

And that’s the copy I downloaded. What a treasure, right?

Well, not so fast. Once I unzipped the data, I realized that the archive contains literally millions of small text files (each post in its own file). While it was certainly nice to have, it wasn't something that I or anyone else could easily read. Certainly not in a forum-like discussion format: it wasn't obvious which post starts a discussion or which ones are replies to the thread. And forget about searching through these files; that was simply not possible. Just to put things into perspective, it took me over 5 hours just to unzip the archives.

So it didn't take long for me to decide to develop a Java-based converter that would attempt to convert the entire collection from millions of flat files into a fully searchable MySQL database. The following post describes the process and also includes the Java code of the solution, released as open source.

So, as I said, the UTZOO Usenet archive came to me as a single archive.zip file (2,107,041,162 bytes – 1.96GB) that, once unzipped, contained 161 TAR archive files. It looked like this:

I grabbed a copy of the 7-Zip archiver from https://www.7-zip.org and started decompressing the files. It proved to be quite a time-consuming process; 7-Zip was showing close to 7.5 hours needed to extract all of the compressed tapes. So I cancelled that, opened WinRAR, selected the files and started extracting them… A bit faster: WinRAR took ‘only’ 5 hours to extract all of the compressed tapes. Anyhow, once decompressed, I ended up with over 2,104,830 flat text files in 56,953 folders, which was the entire copy of Henry Spencer’s Usenet archive:

However, examining the extract, I realized that Magnetic Tape 118 sits uncompressed in the \utzoo-wiseman-usenet-archive\news118f1 folder, named tape118, so I had to rename it to tape118.tar and extract it manually, only to realize it's a copy of files I already had. Whoever created the original archive forgot to remove that file. The same goes for three other TARs in the last three folders.

If you opened one of the folders and navigated down to one of the many subfolders, you'd find a file that contained a message. For example, going into the \utzoo-wiseman-usenet-archive\news006f1\b15\net\aviation folder, I was now apparently in the net.aviation Usenet group. But the only way to find out was to open one of the files and look at its content. Here I highlighted what it looked like. As you can see, each file seems to consist of a header, then a single empty line, and the body of the message:
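The header/body split described above can be sketched in Java like this. (A minimal sketch; the class and method names are my own, not part of the released tool.)

```java
public class MessageSplitter {
    // Splits a raw Usenet message into [header, body].
    // The first empty line separates the header block from the body.
    public static String[] split(String rawMessage) {
        int sep = rawMessage.indexOf("\n\n");
        if (sep < 0) {
            // No blank line found: treat the whole file as header, empty body.
            return new String[] { rawMessage, "" };
        }
        String header = rawMessage.substring(0, sep);
        String body = rawMessage.substring(sep + 2);
        return new String[] { header, body };
    }
}
```
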

So, first I decided to build a Java parser that went through all these files, reading the header portion of each message and grouping all unique results together, giving me all the possible headers (From, Subject, Newsgroups, etc.). I found that there were about 79 different types of headers, so it appeared that not all messages adhered to the same basic structure. However, going through the counters, a handful of headers were common to virtually all messages.
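The header-tallying step can be sketched as follows. (Again a rough sketch under my own naming; the actual parser in the repository may structure this differently.)

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeaderCounter {
    // Tallies how often each header key ("From", "Subject", ...) occurs
    // across all messages, so the common fields can be identified.
    public static Map<String, Integer> tally(List<String> headerLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            if (colon <= 0) continue; // not a "Key: value" header line
            String key = line.substring(0, colon).trim();
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

Sorting the resulting map by count is what reveals which of the roughly 79 header types appear in nearly every message.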

Once I had the common fields, I built a database called ‘usenetarchive’ and a table called ‘archive’, where I decided to dump all the data I’d get out of the Java parser. Some of the fields had to be adjusted a couple of times during testing, because the data often would not fit into columns that I had initially created with smaller VARCHAR values. This is what the final table design looks like.
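As a rough sketch of such a table (the column names and sizes here are my own guesses based on the headers discussed, not the actual schema from the screenshot):

```sql
CREATE TABLE archive (
    ID           INT UNSIGNED NOT NULL AUTO_INCREMENT,
    NewsGroup    VARCHAR(255),
    Subject      VARCHAR(1000),
    FromHeader   VARCHAR(500),
    MessageID    VARCHAR(500),
    `References` VARCHAR(2000),   -- backticked: REFERENCES is a MySQL reserved word
    DateParsed   BIGINT,          -- Unix timestamp derived from the Date header
    Header       TEXT,            -- full original header block
    Body         MEDIUMTEXT,      -- full message body
    PRIMARY KEY (ID)
);
```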

I had to do some data conversion massaging in Java to get a proper date in Unix time format, as dates were coming in a variety of formats. I tried Nutty and also Joda-Time, which was the de facto standard date and time library for Java prior to Java 8, but neither did a good job. Surprisingly, java.util.Date.parse(java.lang.String), which is now deprecated, did an awesome job getting the proper date and time as a Unix timestamp out of the string dates.
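For illustration, a minimal wrapper around that deprecated call (the wrapper class and method name are mine):

```java
import java.util.Date;

public class UsenetDateParser {
    // Converts a free-form Usenet date string into a Unix timestamp in
    // milliseconds, using the deprecated but very tolerant
    // java.util.Date.parse(String).
    @SuppressWarnings("deprecation")
    public static long toEpochMillis(String rawDate) {
        return Date.parse(rawDate);
    }
}
```
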

Java Source Code

Well, then it was all about creating a simple Java command-line converter that allowed me to point it at an archive directory, start the MySQL database, and wait :)

At the bottom of this article is the code of the Java solution. It’s not big at all, 360 lines, and it took altogether about 3 hours to create and test (less time than decompressing the files).

The source code is available on GitHub as open source under MIT license: https://github.com/JozefJarosciak/UsenetUTZOOArchiveDBConversionTool

How to Run It?

The final solution artifact is called ‘usenetarchiveparser.jar‘ and accepts 5 arguments:

  • A) MySQL or MariaDB server location
  • B) MySQL name of the database
  • C) MySQL user name
  • D) MySQL password
  • E) Directory with unzipped archives

And this is the way I run it, with some additional JVM parameters adding a bit more memory to the script, which sped it up dramatically. Once started, the conversion utility shows not only how many posts have been processed, but also which directory is currently being processed:
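An invocation might look something like this (the heap sizes, credentials, and paths below are my own example values, not the exact command from the screenshot):

```shell
java -Xms1g -Xmx4g -jar usenetarchiveparser.jar \
    localhost \
    usenetarchive \
    dbuser \
    dbpassword \
    "D:\utzoo-wiseman-usenet-archive"
```

The arguments follow the A–E order listed above: server, database, user, password, and the directory with the unzipped archives.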

In case you need to stop the program and run it later, the utility is smart enough to resume from the last spot it was processing (the information is stored in the DB).

The script can process approximately 10 thousand posts a minute on my old Intel® Core™ i7-2700K processor with 16GB of RAM (btw, hard to believe you can buy this processor on eBay for about 30 dollars) :).

Here is a screenshot of the database after only 5 minutes of conversion:

As you can see, the conversion utility produces a database with all the important headers, as well as a full copy of the original header and body.

Let’s test it out… Let’s say we want to select all messages in the newsgroup ‘comp.lang.prolog’, dedicated to the Prolog programming language.

I want to list the originator of the message, the post’s subject, the date and time, the message ID, and the body. It’s a one-liner :). Here we go; in database format it’s all easy to see and search:
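Such a one-liner would look roughly like this (column names are my assumptions, matching the headers discussed earlier, not necessarily the exact schema):

```sql
SELECT FromHeader, Subject, DateParsed, MessageID, Body
FROM archive
WHERE NewsGroup = 'comp.lang.prolog'
ORDER BY DateParsed;
```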

Now, let’s say we want to follow an entire randomly selected discussion, about ‘Quintus and external lang interfaces’, which I highlighted in the screenshot above; it has Message-ID: <Jun.27.14.28.58.1989.250@caip.rutgers.edu>.

Easy: we take the message ID of the post and write a query that looks for all its references. Now we can see the discussion and how it unfolds in time. The query is simple and, as you can see, for ordering I used the DateParsed column. This is my own DB column: an Epoch (Unix time) conversion of the Date text found in each of the posts.
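A query along these lines would do it (again, the column names are my own assumptions for illustration):

```sql
-- Fetch the original post plus every reply that references its Message-ID.
SELECT FromHeader, Subject, DateParsed, MessageID
FROM archive
WHERE MessageID = '<Jun.27.14.28.58.1989.250@caip.rutgers.edu>'
   OR `References` LIKE '%Jun.27.14.28.58.1989.250@caip.rutgers.edu%'
ORDER BY DateParsed;
```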

Anyhow, the result shows that the original post was made on 27 Jun 1989 by Mark F. Russo, with the first follow-up by Jianhua Zhu posted two days after the initial post. The whole discussion had only three posts.

 

 

So the hardest part is done. You can grab the code from GitHub and adjust it to your liking. Just please don’t judge the code; it’s not pretty, formatted, or commented :) as it was never planned for release. You’re more than welcome to clean it up though :)

This is the illustrated process of getting information from each of the files to the MySQL database.

Screenshot Description:

  • 0. Magnetic Tapes

  • 1. Extracted Flat files

  • 2. Headers and Body example of each of the flat file posts
  • 3. Writing and running Java code to parse out all fields

  • 4. Final MySQL fully searchable database of all the lost Usenet posts between Feb 1981 and June of 1991

 

What’s next?

Well, thinking about it… this took only a couple of hours to do. So why not make it available to everyone?

I’ve registered http://usenetarchives.com and my next plan is to make these archives available online in an easy-to-read, searchable, forum-like format.

But, that’s work for another day…

 

Java Source Code – local copy

See the link further above for Github copy of the source code…

 
