Compression is a key way to increase download speeds and reduce data storage costs, but many compression formats, such as the popular DEFLATE compression used in gzip and zip files, do not support random-access reads. This isn’t a problem if you plan to download and decompress the whole file, but cloud access patterns rely on the ability to grab only the data they need from much larger files (i.e., random-access reads). Since so much data is stored in gzip or zip archives, this severely limits our ability to provide efficient cloud access to older datasets.
I’ve been working on a new project that should make this much easier though! Mark Adler, co-author of the popular zlib compression library, developed a utility called zran to provide pseudo-random reads for these formats via the creation of sidecar index files. These files contain information on the decompression state of the parent file at various points and can be used to start reading at roughly 20 KB block boundaries throughout the file.
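To make the checkpoint idea concrete, here is a minimal sketch that uses only Python's standard library. It is not how zran or the zran Python package works internally: instead of writing a sidecar file, it keeps in-memory snapshots of a `zlib` decompressor (via `copy()`) at intervals and resumes from the nearest one. The names `build_index`, `read_at`, `span`, and `chunk` are made up for this illustration.

```python
import bisect
import gzip
import zlib

CHUNK = 1024  # compressed bytes fed to the decompressor per step (arbitrary)


def build_index(compressed: bytes, span: int = 64 * 1024, chunk: int = CHUNK):
    """Decompress once, saving a snapshot roughly every `span` output bytes.

    Each entry is (uncompressed_offset, compressed_offset, decompressor_copy).
    """
    d = zlib.decompressobj(wbits=31)  # wbits=31 tells zlib to expect a gzip header
    index = [(0, 0, d.copy())]
    out_pos = 0
    last_checkpoint = 0
    for in_pos in range(0, len(compressed), chunk):
        out_pos += len(d.decompress(compressed[in_pos:in_pos + chunk]))
        next_in = in_pos + chunk
        if out_pos - last_checkpoint >= span and next_in < len(compressed):
            index.append((out_pos, next_in, d.copy()))
            last_checkpoint = out_pos
    return index


def read_at(compressed: bytes, index, offset: int, size: int, chunk: int = CHUNK) -> bytes:
    """Return `size` uncompressed bytes starting at `offset`, using the index."""
    starts = [entry[0] for entry in index]
    i = bisect.bisect_right(starts, offset) - 1  # last checkpoint at or before offset
    out_start, in_start, snapshot = index[i]
    d = snapshot.copy()  # copy so the stored snapshot stays reusable
    skip = offset - out_start
    buf = bytearray()
    pos = in_start
    while len(buf) < skip + size and pos < len(compressed):
        buf += d.decompress(compressed[pos:pos + chunk])
        pos += chunk
    return bytes(buf[skip:skip + size])


# Demo on synthetic data: read 100 bytes from the middle without starting over.
data = b"".join(b"line %06d\n" % i for i in range(50_000))
compressed = gzip.compress(data)
index = build_index(compressed)
assert read_at(compressed, index, 300_000, 100) == data[300_000:300_100]
```

The trade-off is the same one a sidecar index makes: one full pass over the file up front, in exchange for reads that only decompress from the nearest checkpoint. In a cloud setting, the slice of `compressed` read here would correspond to a ranged request for just those bytes.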
I’ve taken Mark’s utility and wrapped it as a Python library so that others can use it more easily. You can install the zran Python package from PyPI, or check out a development version on GitHub.
My hope is that we can combine this utility with other projects such as fsspec to provide Xarray-esque access to compressed file archives.
FYI, there is another package that does something similar called indexed_gzip, but it is tailored to a specific neuroimaging use case and the developer doesn’t have much time to devote to the project.