Recently, I came across a resource that allowed me to download the entire UTZOO NetNews Archive of the earliest USENET posts. These were essentially the earliest available discussions posted to the Internet by people working at various universities that were already connected. There are approximately 2.1 million posts in these archives, created between February 1981 and June 1991. This article describes the journey of converting those tapes into a fully searchable PostgreSQL database, and later into the usenetarchives.com website.
How & Why
Until 2001, these early Usenet discussions were considered lost, but miraculously Henry Spencer from the University of Toronto's Department of Zoology had been backing them up onto magnetic tapes and kept them stored for all these years (apparently at great cost).
H. Spencer had altogether 141 of these magnetic tapes, but they were of no use, so eventually he and a group of motivated people, such as David Wiseman (who dragged the 141 tapes back and forth in his pickup truck), Lance Bailey, Bruce Jones, Bob Webber, Brewster Kahle, and Sue Thielen, embarked on a process of converting all of these tapes into a regular format, accessible to everyone.
And that’s the copy I downloaded. What a treasure, right?
Well, not so fast. Once I unzipped the data, I realized that the archives contain literally millions of small text files (each post in its own file). While it was certainly nice to have, it wasn't something that I or anyone else could read, certainly not in a forum-like discussion format. It wasn't obvious which post starts a discussion or which ones are replies to a thread. And forget about searching through these files; that was simply not possible. Just to put things into perspective, it took me over 5 hours to un-tar the archives.
So it didn't take long for me to decide to develop a Python-based converter that would turn the entire collection from millions of flat files into a fully searchable PostgreSQL database. The following post describes the process and also includes the Python code of the solution, released as open source.
The UTZOO Usenet archive can be downloaded here:
- Or using this torrent: utzoo-wiseman-usenet-archive_archive
Once downloaded, you'll see that the archive contains 161 TAR archive files. It looks like this:
So, I grabbed a copy of the 7-Zip archiver from https://www.7-zip.org and started decompressing the files.
I ended up with 2,104,828 flat text files in 56,988 folders, the entire copy of Henry Spencer's Usenet archive.
For those who like numbers, here is each Utzoo tape along with its size, number of files and folders:
While examining the extract, I realized that Magnetic Tape 118 is uncompressed in the \utzoo-wiseman-usenet-archive\news118f1 folder as a file named tape118, so I had to rename it to tape118.tar and extract it manually, only to realize it's a copy of files I already had. Whoever created the original archive forgot to remove that file. There are 3 files in these folders that need to have a .tar extension added and be decompressed as well:
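If you want to script that step, here is a small sketch; the function name is my own, and Python's stdlib tarfile does the actual work of unpacking each renamed file in place:

```python
import os
import tarfile

def rename_and_extract(path):
    """Give an extension-less tape dump a .tar suffix, then unpack it
    into its own folder (e.g. ...\news118f1\tape118 -> tape118.tar)."""
    tar_path = path if path.endswith(".tar") else path + ".tar"
    if tar_path != path:
        os.rename(path, tar_path)
    with tarfile.open(tar_path) as tf:
        tf.extractall(os.path.dirname(tar_path) or ".")
    return tar_path
```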
If you opened one of the folders and navigated down to one of the many subfolders, you’d find a file that contained the message. For example, going into \utzoo-wiseman-usenet-archive\news006f1\b15\net\aviation folder, I was now apparently in the net.aviation Usenet group. But the only way to find out was to open one of the files and look at the content. Here I highlighted what it looked like. As you can see, each file seems to consist of a header, then a single empty line and the body of the message:
So, I decided to build a Python parser that went through all these files, reading the header portion of each message and grouping all unique results together, giving me all the possible headers (From, Subject, Newsgroup, etc.). I found that there were about 79 different types of headers, so it appeared that not all messages adhered to the same basic structure. Still, all messages shared a standard set of common headers.
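A minimal sketch of that census pass; the sample post and helper are mine, modeled on the header/blank-line/body layout shown above (real headers can also be folded across lines, which this ignores):

```python
from collections import Counter

def split_message(raw):
    """Split a Usenet post into a header dict and a body string.
    Headers end at the first empty line; each header line is 'Name: value'."""
    head, _, body = raw.partition("\n\n")
    headers = {}
    for line in head.splitlines():
        if ":" in line:
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
    return headers, body

# A made-up post in the format shown above (header, blank line, body).
sample = (
    "From: jdoe@utzoo.UUCP\n"
    "Newsgroups: net.aviation\n"
    "Subject: Stall speeds\n"
    "Date: Mon, 2-Feb-81 10:00:00 EST\n"
    "\n"
    "Does anyone know ...\n"
)

census = Counter()
headers, body = split_message(sample)
census.update(headers.keys())  # census now maps header names to how often they appear
```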
Once I had the common fields, I created a Postgres database called ‘utzoo’
create database utzoo;
And a new schema called all_messages
create schema all_messages;
The above database and schema were the pre-requisites. Everything else, like table creation, inserting the posts, etc. is part of the Python script and fully automated.
In terms of table creation, the script automatically creates 5 tables for each detected newsgroup:
- headers – parsed headers
- references – references for each message
- body – text of the message
- from – who posted the message
- subjects – list of unique subject lines
This is what the script auto-creates for each unique Group name:
create table all_messages.GroupName_headers (
    id bigserial not null constraint GroupName_headers_pk primary key,
    dateparsed timestamp,
    subj_id bigint,
    ref smallint,
    msg_id text,
    msg_from bigint,
    enc text,
    contype text,
    processed timestamp default CURRENT_TIMESTAMP
);
alter table all_messages.GroupName_headers owner to postgres;

create table all_messages.GroupName_refs (
    id bigint,
    ref_msg text default null
);
alter table all_messages.GroupName_refs owner to postgres;

create table all_messages.GroupName_body (
    id bigint primary key,
    data text default null
);
alter table all_messages.GroupName_body owner to postgres;

create table all_messages.GroupName_from (
    id serial not null constraint GroupName_from_pk primary key,
    data text
);
alter table all_messages.GroupName_from owner to postgres;

create table all_messages.GroupName_subjects (
    id serial not null constraint GroupName_subjects_pk primary key,
    subject text
);
alter table all_messages.GroupName_subjects owner to postgres;
These are the tables where the Python parser dumps all the data, keeping posts properly linked across tables.
The Python script also creates indexes to speed up inserting and later reading of the posts:
create unique index GroupName_headers_uiidx on all_messages.GroupName_headers(id);
create unique index GroupName_headers_umidx on all_messages.GroupName_headers(msg_id);
create unique index GroupName_body_idx on all_messages.GroupName_body(id);
create unique index GroupName_from_idx on all_messages.GroupName_from(data);
create unique index GroupName_subjects_idx on all_messages.GroupName_subjects(subject);
Once created, the structure per group looks like this:
The following screenshot explains how it’s all wired up. I didn’t do any hardcoded relationships, but you can change the script if you want that.
Date Related Issues
The date is an integral part of each message, and I had to do some massaging in Python to get a proper date, as dates came in a variety of formats. I tried various libraries, but the dateutil.parser.parse function from the dateutil library did the best job.
However, I still needed to account for various labellings of the date field in the headers, so if a date wasn't found under the ‘Date’ header, I looked into other headers such as ‘NNTP-Posting-Date’, ‘X-Article-Creation-Date’, ‘Posted’, or ‘Received’.
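The fallback logic might look roughly like this; the article used dateutil.parser.parse, but to keep this sketch dependency-free I use the stdlib RFC 822 parser from email.utils (the helper name and exact header order are illustrative):

```python
from email.utils import parsedate_to_datetime

# Header fields to try, in order, when looking for a usable date.
DATE_HEADERS = ["Date", "NNTP-Posting-Date", "X-Article-Creation-Date",
                "Posted", "Received"]

def best_date(headers):
    """Return the first parseable date found among the known date headers."""
    for name in DATE_HEADERS:
        value = headers.get(name)
        if not value:
            continue
        try:
            return parsedate_to_datetime(value)
        except (TypeError, ValueError):
            continue  # unparseable value; fall through to the next header
    return None

dt = best_date({"Posted": "Mon, 2 Feb 1981 10:15:00 EST"})
```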
Python Source Code
Then it was all about running the Python parser: start PostgreSQL, point the script at an archive directory, and wait :)
At the bottom of this article is the code of the Python solution. It's about 1,000 lines, and it took altogether about a day to create and test. The script is smart enough to keep track of where it left off, so if it's interrupted, it'll know where to continue from to get the job done.
The source code is available on GitHub as open-source under MIT license:
How to Run It?
The final solution artifact is called ‘utzoo2postgres.py’, and it was tested on Python 3.8.
Open the script and define the path to the un-tarred Utzoo archive directories:
# for Windows
positionFilePath = "E:\\Usenet\\Utzoo\\"
# for Linux:
# positionFilePath = "/Usenet/Utzoo/"
Also, define the particulars of your PostgreSQL database:
db_connection = psycopg2.connect(host="localhost", user="", password="", port="5432", database="utzoo")
And then just execute the script!
python3 utzoo2postgres.py
Note: In case you need to stop the program and run it later, the script is smart enough to resume from the last spot it was processing.
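A toy version of that resume logic (the file name and format here are my guesses; the real script keeps its own position file):

```python
import os

POS_FILE = "utzoo.position"  # hypothetical name for the checkpoint file

def load_position():
    """Return the last processed file path, or None on a fresh start."""
    if os.path.exists(POS_FILE):
        with open(POS_FILE) as f:
            return f.read().strip() or None
    return None

def save_position(path):
    """Record the file we just finished, so a rerun can skip past it."""
    with open(POS_FILE, "w") as f:
        f.write(path)
```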
The script will process all Utzoo Archive messages in about 6 hours (depending on the speed of your machine).
Screenshot from processing:
Here is a screenshot of the database after only a couple of minutes of conversion:
As you can see, the conversion utility produces a database with 5 tables per group where messages are linked to each other through auto-created indexes.
Let's say we want to look up all discussions in the net.physics group and sort them by the number of replies.
This is how you can do that:
Now we can look up a particular discussion by its ID. For example, we want ID 1648 from the screenshot above, the discussion with the subject “Question on FTL and quantum mechanics“. That's not so hard either:
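I won't reproduce the screenshots here, but given the schema above, the two lookups could be composed like this (Python f-strings building the SQL; the table-name pattern and the rows-minus-one reply count are my assumptions):

```python
GROUP = "net_physics"  # hypothetical schema prefix for the net.physics group

# Threads sorted by reply count: header rows sharing a subj_id belong to
# one discussion, so rows-per-subject minus one approximates the replies.
threads_sql = f"""
select s.id, s.subject, count(h.id) - 1 as replies
from all_messages.{GROUP}_headers h
join all_messages.{GROUP}_subjects s on s.id = h.subj_id
group by s.id, s.subject
order by replies desc;
"""

# One discussion by subject id (1648 in the example above), with author
# and body pulled in from the side tables.
thread_sql = f"""
select h.dateparsed, f.data as author, b.data as body
from all_messages.{GROUP}_headers h
join all_messages.{GROUP}_body b on b.id = h.id
join all_messages.{GROUP}_from f on f.id = h.msg_from
where h.subj_id = 1648
order by h.dateparsed;
"""
```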
The Final Product
It’s nice to have a database full of posts, but it’s hardly usable that way. I needed something that would allow me to easily access these posts.
So, once everything was done, I built a PHP script around this code and registered https://usenetarchives.com to make all these archives available online as an easy-to-read and searchable (forum-like) website.
The PHP code is not part of this article, but you can head over to https://usenetarchives.com/groups.php?c=utzoo to see how it all works:
So now that it’s all done, I have to say, it was a great journey.
For those who want to play with the code, you can grab it from GitHub and adjust it to your liking. Please don't judge the code; it's not pretty, nor well formatted or commented (for the most part), as I wasn't exactly planning to release it. I did so primarily for posterity. But now that it's out there, you're more than welcome to fork the repo, clean it up, and commit your changes, so others can benefit from your work too.
To conclude this article, this is the illustrated process of getting information from each of the files into the PostgreSQL database.
1. Henry Spencer stores early internet posts on magnetic tapes
2. Downloaded copy of tar files is extracted into millions of flat files
3. Testing header and body examples of each of the flat file posts
4. Writing and running Python code to parse out all header and body fields
5-6. The Python script auto-creates tables and indexes
7. The result: a fully searchable PostgreSQL database of all lost Usenet posts from February 1981 to June 1991
8. Making the whole Utzoo archive available online at https://usenetarchives.com/groups.php?c=utzoo
Python Open Source Code