FSSpec caching tutorial

Greetings -

I’m working on a recipe for Pangeo-forge and @cisaacstern introduced me to the joys of fsspec. This seems super-handy. I’m trying to see how I might use it in other contexts, but I’m having trouble understanding what’s possible and how to do what I want.

What I’d like to be able to do is generate a sequence (list?) of URLs and have them automatically cached to local storage when they are opened, say with xr.open_dataset(). Following the example in the docs I tried to create a file-opening instance with custom headers in the request

import os
import fsspec

# make_url() and dates are defined elsewhere (not shown)
of = fsspec.open("filecache::" + make_url(dates[0]),
                 https=dict(headers=dict(Authorization=f"Bearer {os.environ['EARTHDATA_TOKEN']}")),
                 same_names=True,
                 filecache={'cache_storage': './cache'})

Sadly for me, xr.open_dataset(of, engine="netcdf4") raises an exception

AttributeError: '_io.BufferedReader' object has no attribute '__fspath__'

Is there a way to use fsspec open file objects to automatically read and cache files from a URL locally? Extra bonus points if the names remain legible.

Thanks - Robert

Hi Rob!

The key problem here is that the netCDF4 Python library cannot open a “file-like” object, because it has to hand the file off to the underlying netCDF-C library, which expects a path rather than a Python object. The thing that fsspec gives you is file-like, but not an actual file on disk.

One way to fix this problem is to use a different engine. engine='h5netcdf' should work, as it uses h5py to open the file, which can handle this type of file-like object.
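As a rough sketch of that first option, the snippet below passes the opened file object (rather than the OpenFile wrapper) to xarray with engine="h5netcdf". The helper name open_cached_dataset, its headers argument, and the ./cache default are illustrative assumptions, and it presumes fsspec, xarray, and h5netcdf are installed.

```python
def open_cached_dataset(url, cache_dir="./cache", headers=None):
    """Open a remote netCDF4/HDF5 file through fsspec's filecache with h5netcdf.

    Hypothetical helper for illustration; not part of fsspec or xarray.
    """
    # Third-party imports are kept local so the helper can be defined
    # even where these optional packages are missing.
    import fsspec
    import xarray as xr

    # In a chained URL, options are keyed by protocol name.
    # same_names=True keeps the original basename in the cache directory.
    options = {"filecache": {"cache_storage": cache_dir, "same_names": True}}
    if headers:
        # Assumption: the target is an https:// URL.
        options["https"] = {"headers": headers}

    # Entering the OpenFile context yields a real file-like object,
    # which h5py (via h5netcdf) can read from directly.
    with fsspec.open("filecache::" + url, mode="rb", **options) as f:
        with xr.open_dataset(f, engine="h5netcdf") as ds:
            return ds.load()  # read into memory before the file closes
```

With the setup from the original question, a call might look like open_cached_dataset(make_url(dates[0]), headers={"Authorization": f"Bearer {os.environ['EARTHDATA_TOKEN']}"}). Loading eagerly keeps the example simple; for large files you would probably want to keep the file open and read lazily instead.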

The other way is to try to find the actual path of your downloaded file (e.g. ./cache/some_filename.nc). I’m not sure how to do that with fsspec. For this scenario, instead, I always reach for Pooch. Pooch is explicitly designed to do one thing: download data files and just give you the local path as a string to open up. While fsspec does support file caching, that is not really its main goal. I find the caching features to be a bit confusing, and the documentation a bit thin. So in the scenario where you just want to download a remote file and open it, I would try Pooch.
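For the Pooch route, a minimal sketch might look like the following. pooch.retrieve and pooch.HTTPDownloader are part of Pooch's public API; the helper name fetch_local, the ./cache directory, and the choice to skip hash verification (known_hash=None) are assumptions made for illustration.

```python
import os
from urllib.parse import urlparse


def fetch_local(url, token=None, cache_dir="./cache", downloader=None):
    """Download url once into cache_dir and return the local path as a string.

    Hypothetical helper for illustration; not part of Pooch.
    """
    import pooch  # local import: optional third-party dependency

    if downloader is None:
        # Extra keyword arguments to HTTPDownloader are forwarded to requests.get.
        headers = {"Authorization": f"Bearer {token}"} if token else {}
        downloader = pooch.HTTPDownloader(headers=headers)

    return pooch.retrieve(
        url=url,
        known_hash=None,  # assumption: no checksum available, so none is verified
        fname=os.path.basename(urlparse(url).path),  # keep the legible remote name
        path=cache_dir,
        downloader=downloader,
    )
```

Because the return value is a real path on disk, it can go straight to xr.open_dataset(path, engine="netcdf4"). A second call with the same URL finds the existing file in cache_dir and skips the download.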

I found myself faced with a similar problem needing local files (workflows with some R mixed in), and I wrote a context manager using tempfile that could give me a local file when fsspec.open_local() couldn’t.

with GetFile(path, fs) as local_path:
    code_that_does_not_like_file_like_objects(local_path)

I should work on contributing it back up to the project at some point, though right now it’s kinda specialized to work as a Dagster resource.

Here is a gist. The most relevant bit will be the GetFile class which should be adaptable without too many changes (though you can probably get rid of my intermediate FsSpecFilesystem wrapper class).
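The gist itself is not reproduced here. As a rough idea of the pattern, a context manager along these lines could copy a remote file into a temporary directory and hand back its path. This is a sketch, not the GetFile class from the gist: the name local_copy is hypothetical, and it assumes fs is an fsspec filesystem (it only relies on fs.get_file(rpath, lpath)).

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def local_copy(path, fs):
    """Yield a real local path holding a temporary copy of path on fs."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Keep the original basename so downstream tools see a familiar name.
        local_path = os.path.join(tmpdir, os.path.basename(path.rstrip("/")))
        fs.get_file(path, local_path)  # download/copy the remote file
        yield local_path
    # The temporary directory and the copy are removed on exit.
```

It would be used the same way as above: with local_copy(path, fs) as local_path, followed by the code that needs a real file. The copy is thrown away on exit, so this suits one-off processing rather than persistent caching.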

Thanks very much @rabernat. I got Pooch working on this problem. Good dog… Thanks for pointing me to the right tool for the job.
