JeeLabs: Squashing tons of source code

And then came the internet, open source software, and public code repositories…

Never before has it been possible to access so much software, of all kinds, in any programming language, for any application, and of all qualities & complexities. Even if not of immediate use “as is”, source code is a phenomenal resource - for knowledge, development ideas, engineering tricks, clever hacks - or simply to learn what others have done and how they have done it.

At JeeLabs, there’s an archive of source code - both “unzipped” and “checked out” - which has grown to some 125 GB over the years. Over half a million files.

Much of it is unknown, will never be seen, and is perhaps even obsolete.

But that’s not the point. Having access to the code, to find things, to see what’s available and to learn from others where possible - it’s a truly fantastic resource. And it’s definitely not the same as going through everything online: local disk access is faster, you can access it via the editor and other tools you know, and hey… it’s even there when the internet link is having a bad day.

Having a large source code repository (or actually a huge set of them) at arm’s length can be very useful during development. But it’s also easy to create a big mess of it. Which version is which, where did it come from? Before you know it, you can drown in semi-identical copies…

Fortunately, most source code now lives in public repositories (git, svn, mercurial, bazaar, cvs, rcs, etc.). Which means you can simply “clone” from the source, and you get as much history as you like with it, as well as READMEs, docs, links to the original “repo”, and more. Extremely convenient with Git & GitHub. That’s exactly what’s being saved more and more at JeeLabs.

As mentioned before, this is really a read-only archive - for reference, browsing, searching, sometimes for re-use, and occasionally even as a basis for modification or derived work.

Storing these files as is on a local disk can be a bit inconvenient:

it eats up space (most files are tiny, with a lot of partial-block waste)
it’s too easy to accidentally change things
it’s hard to refer to, if the archive changes regularly
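
To get a rough feel for the first point, the sketch below compares the bytes actually held in the files with the space the file system has allocated for them. The function name block_waste is made up for this example, and the du figure also includes directory overhead, so treat the result as an estimate only.

```shell
# Hypothetical helper (not from the original setup): compare file contents
# with the space allocated on disk for a directory tree.
block_waste() {
    dir=$1
    # Total bytes stored in regular files (reads every file, so slow on big trees).
    content_bytes=$(find "$dir" -type f -exec cat {} + | wc -c)
    # Space allocated on disk in KiB, including directories.
    allocated_kib=$(du -sk "$dir" | cut -f1)
    echo "content_bytes=$((content_bytes)) allocated_kib=$allocated_kib"
}
```

Calling block_waste /path/to/original/sources prints both figures on one line, so the gap between them is easy to spot.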

For this reason, the current collection of snapshots over the past 10 years has now been turned into a highly-compressed read-only archive. To be extended once a year, perhaps.

There are many ways to do this. One is to burn file collections to ISOs, i.e. CD-ROM images, even if they are never physically stored that way, which is in fact exactly how the source code collection has been managed here until now. But it gets messy, and worst of all, there is a huge amount of duplication, as source gets checked out again, perhaps with some changes.

But there’s a solution called SquashFS, which looks like it could work quite well. It’s a file system with a number of very useful properties in this context:

the file system is created once and is then essentially read-only
each file is compressed, but so is most of the file system meta information
duplicate files are identified and only included once (“poor man’s de-duplication”)
no free space is wasted due to “blocking” with some fixed granularity
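
To estimate beforehand how much the de-duplication might help, one option is to count files whose content repeats elsewhere in the tree. This is only a sketch: count_duplicates is a made-up name, SHA-256 hashing is a stand-in chosen here (not a description of how mksquashfs detects duplicates internally), and sha256sum is assumed to come from GNU coreutils. Unusual file names containing newlines or backslashes may be miscounted.

```shell
# Hypothetical helper: count files whose content repeats an earlier file.
# SHA-256 is an assumption of this sketch, not what mksquashfs uses.
count_duplicates() {
    find "$1" -type f -exec sha256sum {} + |
        awk '{ if (seen[$1]++) dups++ } END { print dups + 0 }'
}
```

Running count_duplicates /path/to/original/sources prints a single number: the files that would not need to be stored a second time.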

With lots of small source files and some occasional duplication across them, SquashFS achieved remarkable compression: those 125 GB ended up as a single 32 GB disk file, roughly a quarter of the original size. Note that this is a lossless transformation: the exact same directory tree ends up inside SquashFS.
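
The lossless claim is easy to check once the archive is mounted next to the original tree. A minimal sketch, assuming both trees are readable; same_tree is a made-up name, and diff -r compares file names and contents only, not ownership or permissions.

```shell
# Hypothetical helper: succeed only if both trees hold identical file contents.
# Note: diff -r does not compare ownership or permission bits.
same_tree() {
    diff -r "$1" "$2" > /dev/null
}
```

For example: same_tree /path/to/original/sources /mnt/sources && echo identical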

SquashFS requires Linux (sort of, see below). But that’s no big deal: we can simply copy that sources.sqsh archive file to a Linux setup, which could be a Raspberry Pi or an Odroid board, and it’ll do the work for us. The result can then be shared as a Samba file server volume.
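
For the Samba part, a share definition along these lines is probably all it takes. The helper below only prints the stanza. The share name, the guest access setting, and the helper name samba_share are assumptions made for this sketch, so adjust them to your own setup.

```shell
# Hypothetical helper: print a read-only Samba share definition.
# The "guest ok" setting is an assumption, drop it if you want authentication.
samba_share() {
    name=$1
    path=$2
    cat <<EOF
[$name]
   path = $path
   read only = yes
   guest ok = yes
EOF
}
```

The output of samba_share sources /mnt/sources can be appended to the Samba configuration file (often /etc/samba/smb.conf, though the location varies by distribution).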

Here’s how to “mount” that file in Linux on an (existing) directory, called /mnt/sources:

sudo mount /sources.sqsh /mnt/sources -t squashfs -o loop
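
A slightly more defensive variant creates the mount point if needed and first checks that the file looks like a SquashFS image. This is a sketch with made-up function names. It relies on SquashFS 4.x images starting with the four bytes "hsqs", and on sudo being available.

```shell
# Hypothetical helper: true if the file starts with the SquashFS magic "hsqs".
is_squashfs_image() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "hsqs" ]
}

# Hypothetical wrapper around the mount command above (needs root via sudo).
mount_sources() {
    image=$1
    target=$2
    is_squashfs_image "$image" || { echo "not a SquashFS image: $image" >&2; return 1; }
    sudo mkdir -p "$target"
    sudo mount "$image" "$target" -t squashfs -o loop
}
```

With these defined, mount_sources /sources.sqsh /mnt/sources does the same as the one-liner, but refuses to proceed on a file that is clearly something else.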

Or, to make this happen automatically on reboot, we can add this line to /etc/fstab:

/sources.sqsh   /mnt/sources   squashfs   ro,loop,defaults   0 0
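
Editing /etc/fstab by hand is fine, but when scripting it, it helps to avoid adding the same line twice. A minimal sketch: add_fstab_entry is a made-up helper, and the fstab path is a parameter so it can be tried on a copy first.

```shell
# Hypothetical helper: append an fstab line unless the mount point is already listed.
# The fstab path is a parameter so the function can be tried on a copy first.
add_fstab_entry() {
    fstab=$1
    entry=$2
    # The mount point is the second whitespace-separated field of the entry.
    mount_point=$(printf '%s\n' "$entry" | awk '{ print $2 }')
    if awk -v mp="$mount_point" '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$fstab"; then
        echo "already present: $mount_point"
    else
        printf '%s\n' "$entry" >> "$fstab"
        echo "added: $mount_point"
    fi
}
```

Pass the fstab file as the first argument and the line shown above as the second. Afterwards, sudo mount -a mounts everything listed in /etc/fstab, which is a quick way to check the entry without rebooting.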

It’s extremely simple to create such a SquashFS archive:

mksquashfs /path/to/original/sources sources.sqsh
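
The defaults work, but a few options are worth knowing about. The wrapper below is a sketch, not the command used for the archive described here: build_archive is a made-up name, and the xz default assumes your mksquashfs was built with xz support (gzip is the tool's own default).

```shell
# Hypothetical wrapper: build a fresh archive with a chosen compressor.
# -noappend overwrites an existing archive instead of appending to it.
build_archive() {
    src=$1
    out=$2
    comp=${3:-xz}   # assumption: xz support is compiled into your mksquashfs
    mksquashfs "$src" "$out" -comp "$comp" -noappend -no-progress
}
```

After build_archive /path/to/original/sources sources.sqsh, running unsquashfs -s sources.sqsh prints the superblock summary, including the compressor and block size used.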

Compression of a large set of files requires a lot of processing power (in this case an 8-core i7 running full blast for several minutes). But it’s no big deal, since 1) it only needs to be done once, and 2) you don’t have to compress on the same machine where the result will be mounted. There’s even a build of the SquashFS tools for Mac OS X (via brew install squashfs).
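
On a machine that cannot mount SquashFS directly, the unsquashfs tool from the same package can still list or extract files. A minimal sketch, assuming unsquashfs is installed alongside mksquashfs; both function names are made up for this example.

```shell
# Hypothetical helper: list the contents of an archive without mounting it.
list_archive() {
    unsquashfs -l "$1"
}

# Hypothetical helper: extract a single path from the archive into a directory.
extract_from_archive() {
    archive=$1
    path=$2
    dest=$3
    unsquashfs -d "$dest" "$archive" "$path"
}
```

For example, list_archive sources.sqsh | grep README finds files by name, and extract_from_archive can then pull out just the path of interest.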

See the SquashFS HOWTO for further information and examples.
