I’ve been working for the last couple of years compiling US electricity system data for use by NGOs working in regulatory and legislative processes, and I think we are finally at the point where we want to make a live copy of the data available to users. Up until this month, folks had to run the whole data processing pipeline themselves to access the outputs, which was more than most of our users were willing or able to do, since many of them come from a more finance / spreadsheet oriented analytical background.

Some of them recognize the limitations of that framework, though, and want to start using Python and Jupyter notebooks for analysis. But the interface and process have kept changing over time, so it has been frustrating for them to keep their local systems & data up to date. Having a JupyterHub with all the processed data loaded on it would let us take care of all the system upkeep, and provide them with access while minimizing the number of things they need to learn to be effective in working with the data (just the Python & Jupyter part… and not all the underlying infrastructure).
I don’t know if there’s a clear line between running JupyterHub on Kubernetes with cloud access to large-ish datasets, and Pangeo proper, but I’m starting to look at how to set these things up. I don’t think our data is really appropriate for zarr/xarray – it’s not big data cubes. The larger datasets are generally time series with ~1e9 to ~1e10 records, which we’re currently storing in Apache Parquet files and accessing with Dask. The smaller datasets are organized into an SQLite database locally, with tables of up to ~1e6 records.
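For a concrete sense of the access pattern, here’s roughly what working with the data looks like today. The file paths, table names, and column names below are made up for illustration, not our real schema:

```python
import sqlite3

import dask.dataframe as dd
import pandas as pd

# Larger datasets: partitioned Parquet files, read lazily with Dask.
hourly = dd.read_parquet("data/hourly_demand/*.parquet")
monthly_demand = (
    hourly[hourly["state"] == "CA"]
    .groupby("month")["demand_mwh"]
    .sum()
    .compute()  # only the small aggregated result comes back as pandas
)

# Smaller datasets: tables of up to ~1e6 records in a local SQLite DB.
con = sqlite3.connect("etl_outputs.sqlite")
plants = pd.read_sql("SELECT * FROM plants", con)
```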
We’re containerizing our ETL process to make it more easily reproducible, and so that we can run it on cloud resources regularly, validating new data and new code on an ongoing basis and generating data release candidates automatically. It seems like the same containers could be used for the JupyterHub, right?
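If so, I imagine the connection point would be telling the hub’s spawner to launch the single-user notebook servers from our ETL image. A rough sketch of what I have in mind, as a `jupyterhub_config.py` for a Kubernetes deployment (the image name and tag are hypothetical, and the image would presumably also need `jupyterhub` and `jupyterlab` installed so it can run as a notebook server):

```python
# jupyterhub_config.py (sketch): spawn each user's notebook server
# from the same container image our ETL pipeline runs in.
# `c` is the config object JupyterHub injects into this file.
c.KubeSpawner.image = "gcr.io/our-project/etl-notebook:0.1.0"  # hypothetical
```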
Does anyone have recommendations on how we ought to store this data for use with a JupyterHub? Generally the data is meant to be read-only, with analysis happening in dataframes / notebooks to generate summaries or figures. Should it all get loaded into something like BigQuery? Should we just keep using the combination of SQLite + Parquet on disk? How does the data get replicated or shared across several different users at the same time?
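For instance, one pattern I can imagine is keeping the Parquet files in an object storage bucket that every user’s notebook reads directly, so there’s a single shared read-only copy and no per-user replication. A sketch, assuming `gcsfs` is installed in the notebook image, with a made-up bucket path and column names:

```python
import dask.dataframe as dd

# One shared, read-only copy in a cloud bucket: every user's notebook
# reads the same files, so nothing needs to be copied per user.
gen = dd.read_parquet("gs://our-data-bucket/hourly_generation/")
peak = gen.groupby("plant_id")["net_generation_mwh"].max().compute()
```

But I don’t know whether that beats a warehouse like BigQuery for this kind of workload, which is why I’m asking.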
I’m sure I could figure it out on my own eventually, but I’d love to get pointed in the right direction up front, so I don’t go down a bunch of dead ends or configure something that doesn’t meet our needs well.