Rescuing Full Archives From A Google Group

A project I’m newly affiliated with has used Google Groups for all their private group communication so far. Since I’m not a big fan of storing private data in the proprietary data silos of cloud providers, this is a situation I want to rectify. Why use Google Groups when you can set up GNU mailman yourself and not have all data and meta-data pass through Google?

There’s one caveat: While Google provides an export of group member lists, there’s no export functionality for the current archive. Which in this case represented 2 years’ worth of fruitful discussion and organizational knowledge. Some tools exist to try and dump all of a group’s archive, but none really agreed with me. So I rolled my own.

I give to you: https://github.com/henryk/gggd

Inside you’ll find a Python script that uses the lynx browser to access the Google Groups API (so it can work with a Google login cookie as an authenticated user) and will enumerate all messages in a group’s archive and download each into a different file as a standard RFC (2)822 message. While programming this I found that some of the messages are being returned from the API in a mangled form, so I also wrote a tool (can be called with an option from the downloader) that can partially reverse this mangling.

With the message files from my download tool, formail from the procmail package, and some shell scripting I was able to generate an mbox file with the entire groups’ archives which could be easily imported into mailman.