Improved Cloud Access for Archival Formats (gzip and zip)

forrestfwilliams · April 17, 2023, 7:00pm

Compression is a key way to increase download speeds and reduce data storage costs, but many compression types, such as the popular DEFLATE compression used in gzip and zip files, do not support random access reads. This isn’t a problem if you plan to download and decompress the whole file, but cloud access patterns rely on the ability to grab only the data they need from much larger files (i.e., random access reads). Since so much data is stored in gzip or zip archives, this severely limits our ability to provide efficient cloud access to older datasets.

I’ve been working on a new project that should make this much easier, though! Mark Adler, a co-author of the popular zlib compression library, developed a utility called zran to provide pseudo-random reads for these formats via the creation of sidecar index files. These files contain information on the decompression state of the parent file at various points and can be used to start reading at roughly 20 KB block boundaries throughout the file.
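To make the checkpoint idea concrete, here is a minimal sketch using only Python's standard library. It is not the zran API, and it keeps its checkpoints in memory instead of writing a sidecar file: it snapshots the decompressor with `decompressobj.copy()` at regular intervals and later resumes from the nearest snapshot. The names `build_index`, `read_range`, and the `span` parameter are hypothetical and chosen only for illustration.

```python
import bisect
import gzip
import zlib


def build_index(compressed: bytes, span: int = 4096):
    """Snapshot the decompressor state every `span` compressed bytes.

    Each checkpoint is (uncompressed_offset, compressed_offset, state).
    This in-memory list stands in for a sidecar index file.
    """
    d = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip container
    checkpoints = [(0, 0, d.copy())]
    out_pos = 0
    for c_pos in range(0, len(compressed), span):
        out_pos += len(d.decompress(compressed[c_pos:c_pos + span]))
        next_c = c_pos + span
        if next_c < len(compressed) and not d.eof:
            checkpoints.append((out_pos, next_c, d.copy()))
    return checkpoints


def read_range(compressed: bytes, checkpoints, start: int, length: int,
               span: int = 4096) -> bytes:
    """Return `length` uncompressed bytes beginning at offset `start`."""
    offsets = [u for u, _, _ in checkpoints]
    # Last checkpoint whose uncompressed offset is <= start.
    i = bisect.bisect_right(offsets, start) - 1
    out_pos, c_pos, saved = checkpoints[i]
    d = saved.copy()  # leave the stored checkpoint reusable
    skip = start - out_pos
    buf = bytearray()
    while len(buf) < skip + length and c_pos < len(compressed) and not d.eof:
        buf += d.decompress(compressed[c_pos:c_pos + span])
        c_pos += span
    return bytes(buf[skip:skip + length])


# Example: index a gzip payload once, then read a slice from the middle
# without decompressing everything that precedes the chosen checkpoint.
payload = b"".join(b"record %06d\n" % i for i in range(50_000))
archive = gzip.compress(payload)
index = build_index(archive)
chunk = read_range(archive, index, start=400_000, length=28)
```

Only the compressed bytes between the chosen checkpoint and the end of the requested range are decompressed. A persistent index has to serialize each checkpoint's state, which is why index size grows with the number of access points.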

I’ve taken Mark’s utility and wrapped it as a Python library so that others can use it more easily. You can install the zran Python package via PyPI, or check out a development version on GitHub.

My hope is that we can combine this utility with other projects such as kerchunk and fsspec to provide Xarray-esque access to compressed file archives.

FYI, there is another package called indexed_gzip that does something similar, but it is tailored to a specific neuro-imaging use case and the developer doesn’t have much time to devote to the project.

TomAugspurger · April 17, 2023, 8:01pm

Thanks for sharing. I came across sozip (GitHub: sozip/sozip-spec, the specification of a seek-optimized zip file profile) the other day, which sounds related (though I haven’t used it yet).

forrestfwilliams · April 17, 2023, 8:43pm

Oh nice, I hadn’t come across this project before. From a quick read, it looks like sozip will create new zip files with these indexes. Conversely, zran will create indexes for existing DEFLATE-compressed data without rewriting it.
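For contrast, here is a rough sketch of the "rewrite with restart points" approach, again using only the standard library. It illustrates the general technique of flushing the compressor at chunk boundaries so each chunk can be inflated independently; it is an assumption about the approach and does not reproduce the sozip on-disk layout. The function names and `chunk_size` are hypothetical.

```python
import zlib


def compress_with_restart_points(data: bytes, chunk_size: int = 32768):
    """Write a raw DEFLATE stream with a full flush after every chunk.

    Returns the compressed bytes and the compressed offset at which each
    chunk starts. A full flush byte-aligns the output and resets the
    compressor's history, so decoding can begin at any recorded offset.
    """
    comp = zlib.compressobj(wbits=-15)  # negative wbits = raw DEFLATE
    out = bytearray()
    offsets = []
    for pos in range(0, len(data), chunk_size):
        offsets.append(len(out))
        out += comp.compress(data[pos:pos + chunk_size])
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush()  # terminate the stream
    return bytes(out), offsets


def read_chunk(compressed: bytes, offsets, k: int) -> bytes:
    """Inflate only chunk `k`, starting from its recorded offset."""
    end = offsets[k + 1] if k + 1 < len(offsets) else len(compressed)
    return zlib.decompressobj(wbits=-15).decompress(compressed[offsets[k]:end])


data = b"".join(b"row %07d\n" % i for i in range(20_000))
stream, chunk_offsets = compress_with_restart_points(data)
third = read_chunk(stream, chunk_offsets, 2)
```

The trade-off matches the one described above: this needs control over how the file is written and costs a little compression ratio at each flush, whereas an external index works on archives that already exist.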
