I’ve been working for the last couple of years compiling US electricity system data for use by NGOs working in regulatory and legislative processes, and I think we are finally at the point where we want to make a live copy of the data available to users. Up until this month folks had to run the whole data processing pipeline themselves to access the outputs, which was more than most of our users were willing or able to do, since many of them come from a more finance / spreadsheet oriented analytical background. Some of them recognize the limitations of that framework though, and want to start using Python and Jupyter notebooks for analysis, but the interface and process have kept changing over time, so it has been frustrating for them to keep their local system & data up to date. Having a JupyterHub with all the processed data loaded on it would let us take care of all the system upkeep, and provide them with access while minimizing the number of things they need to learn to be effective in working with the data (just the Python & Jupyter part… and not all the underlying infrastructure).
I don’t know if there’s a clear line between running JupyterHub on Kubernetes with cloud access to large-ish datasets, and Pangeo proper, but I’m starting to look at how to set these things up. I don’t think our data is really appropriate for zarr/xarray – it’s not big data cubes. The larger datasets are generally going to be time series with ~1e9 to ~1e10 records, which we’re currently storing in Apache Parquet files and accessing with Dask. The smaller datasets are organized into an SQLite database locally, with tables that have up to ~1e6 records.
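Concretely, our current local access pattern looks roughly like this (the paths and table name below are just illustrative, not necessarily our real ones):

```python
import dask.dataframe as dd
import pandas as pd
import sqlalchemy as sa

# Large time series: partitioned Parquet, read lazily with dask.
hourly = dd.read_parquet("parquet/hourly_emissions/", engine="pyarrow")

# Smaller relational tables: SQLite queried through SQLAlchemy into pandas.
engine = sa.create_engine("sqlite:///pudl.sqlite")
plants = pd.read_sql("SELECT * FROM plants", engine)  # hypothetical table name
```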
We’re containerizing our ETL process to make it more easily reproducible, and so we can have it run on cloud resources regularly, validating new data and new code on an ongoing basis, and generating data release candidates automatically. It seems like the same containers could be used for the JupyterHub, right?
Does anyone have recommendations on how we ought to store this data for use with a JupyterHub? Generally it’s meant to be read-only, with analyses happening in dataframes / notebooks to generate summary analyses or figures. Should it all get loaded into something like BigQuery? Should we just keep using the combination of SQLite + Parquet on disk? How does the data get replicated or shared across several different users at the same time?
I’m sure I could figure it out eventually, but would love to get pointed in the right direction initially so I don’t end up going down a bunch of dead ends, or configuring something that doesn’t end up meeting our needs well.
Hi @zaneselvans! I’m glad to hear you are moving forward with this project. I’ll provide some opinionated responses here, others may have additional ideas.
For many of our cloud deployments, we’ve been putting our data in cloud object stores (e.g. s3, gcs). For our large multi-dimensional datasets, we’ve been using zarr. But for tabular/time-series datasets, Parquet would be the natural analog. You may have seen this already, but Dask’s dataframe documentation has a nice section on how you can use dask with remote parquet datasets:
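As a minimal sketch, assuming the Parquet data lives in a public GCS bucket (the bucket path and column names below are made up):

```python
import dask.dataframe as dd

# Requires gcsfs so dask can resolve gs:// URLs; token="anon" works for public buckets.
df = dd.read_parquet(
    "gs://example-pudl-data/hourly_emissions/",   # hypothetical bucket/path
    engine="pyarrow",
    storage_options={"token": "anon"},
    columns=["plant_id", "operating_datetime", "co2_mass_tons"],  # read only what's needed
    filters=[("year", "==", 2018), ("state", "==", "CO")],        # prune partitions/row groups
)
annual_co2 = df.co2_mass_tons.sum().compute()
```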
So my opinionated idea would be to just stick with parquet for now and provide access to the data via a public cloud bucket. The cloud object store will handle the replication and parallel/simultaneous access for you.
The last thing you may want to think about is some sort of high-level data broker or catalog application. We’ve been using Intake and I think it could work well for your applications.
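A rough sketch of what the user-facing side could look like (the catalog URL and entry name are hypothetical; the catalog YAML would point each entry at the Parquet data in the bucket, e.g. via the intake-parquet driver):

```python
import intake

# Open a catalog stored alongside the data in the bucket.
cat = intake.open_catalog("gs://example-pudl-data/pudl-catalog.yml")
print(list(cat))                     # discover which datasets are available
df = cat.hourly_emissions.to_dask()  # lazy dask dataframe, equivalent to read_parquet
```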
Okay, so parquet in a storage bucket for the tables with billions of rows. Does it make sense to keep those datasets partitioned as they are now by year and state, so that the whole dataset doesn’t have to be scanned every time it’s queried? (i.e. right now they’re in a directory structure with names like year=2018, each of which contains a bunch of subdirectories like state=CO)
And what do I do with the smaller and more relational data that’s currently living in an SQLite DB for local usage? It’s a few dozen well normalized tables, but less than 1GB in total. We’re using SQLAlchemy internally to query it. Can the pudl.sqlite database file just get dropped in a storage bucket too? That seems… pretty janky. Should it get loaded into some Google SQL offering instead? The archiving format we’re using is tabular data packages, which are just CSVs for the data, with metadata including the database schema stored in JSON files. In theory the same data should be loadable using the datapackage tools into a PostgreSQL DB or BigQuery, though we haven’t played with that yet. But is BigQuery really even the right thing to use? This is just a normal little relational DB, with tables having up to ~1 million rows.
Since parquet is already a sharded format, you could consider abandoning your convention of partitioning the datasets by year and state and instead storing them as one single giant parquet dataset (with year and state as additional columns / indexes).
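In dask terms that might look roughly like the following (names are placeholders). Dask still writes one file per partition, but it becomes a single flat dataset rather than nested year=/state= directories, and readers can still push filters down against the Parquet row-group statistics:

```python
import dask.dataframe as dd

# year and state remain ordinary columns; no partition_on, so no nested directories.
dd.to_parquet(
    df,
    "gs://example-pudl-data/hourly_emissions_flat/",  # hypothetical output path
    engine="pyarrow",
    write_index=False,
)
```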
For your SQLite DB, the “cloud-native” solution would be to put your relational data into a cloud-based database like Google Cloud SQL.
Hmm, so what is the difference between a Parquet dataset that’s partitioned on disk into multiple files/folders and one with some kind of internal indexing / partitioning? I had thought that splitting the data into different files minimized the amount of data that had to be scanned if one was querying against the partitioning columns, reducing read times and the cost of running queries against the data in a cloud hosting context. Dask seems happy to get pointed at a whole partitioned dataset (the top level directory) and then it only reads from the files as required to satisfy a query / operation.
Yet another vote here for staying with Parquet. As for dataset partitioning:
That is true! Parquet will put your data into different chunks anyway, even without explicit content-based partitioning into subdirectories. And since it keeps some information on what is in each chunk, reads will already be optimized; this is probably what @rabernat is referring to. But I also think it won’t hurt, and can be more efficient, to explicitly split your data. As @rabernat says in their last comment, there’s no generic answer.
However, one thing not mentioned yet about using Parquet or any other chunked format in an object store (like Zarr) is that the size of the chunks matters a lot. If your explicit partitioning results in chunks that are too small, performance in an object store will suffer. The chunk size we generally suggest is somewhere between tens and hundreds of MB. Martin Durant, who doesn’t seem to be active in this forum yet, might have things to say about this.
Hmm, interesting. So maybe we should only be partitioning by year or state, and not both. Compressed on disk, each year averages about 200MB of data (and they’re all about the same size), while each state is about 100MB (but they vary wildly in size).
Do you have to do something explicit to implement partitioning that happens inside the files rather than in the filesystem? The two layers of partitioning on disk are just the output of something like:
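(roughly; the dataframe name and output path below are placeholders)

```python
import dask.dataframe as dd

# Writing with partition_on produces the year=2018/state=CO/... directory layout.
dd.to_parquet(
    hourly_emissions_df,
    "parquet/hourly_emissions/",
    engine="pyarrow",
    partition_on=["year", "state"],
)
```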
The 100MB recommendation comes from comparing the time to complete a minimal operation on S3 (originally) with the download rate. Higher values make the overhead even smaller, and the limit there is the amount of memory available to your worker, but remember that 100MB on disk can expand into a much larger in-memory representation. How much larger depends on many factors.
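If you want to aim for that range up front, dask can repartition by target in-memory size, something like this (paths are placeholders):

```python
import dask.dataframe as dd

df = dd.read_parquet("parquet/hourly_emissions/", engine="pyarrow")

# partition_size is the in-memory size of each partition, so the on-disk
# parquet chunks will come out somewhat smaller after compression/encoding.
df = df.repartition(partition_size="300MB")

dd.to_parquet(df, "parquet/hourly_emissions_rechunked/", engine="pyarrow")
```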
One thing not mentioned so far is that if you have many, many files because of partitioning, and you don’t have a centralised “_metadata” file, then at some point you will have to list all the files, which can be time-consuming in itself. The metadata file can be built after the fact, but this is rarely done and not appropriate if you mean to keep expanding the parquet data. A single-file version of the data will necessarily have all the metadata (which is both good and bad, since now you have to parse it all in one go).
Rarely mentioned: partitioning by paths will save you space, because you don’t need to store values for the column(s) being used. However, you would expect those columns to compress particularly well anyway.
To partition “within a file”, you would have to group by the partitioning columns, unstack, and append the pieces one by one (because you don’t know where each piece starts in the file until the previous piece is written); this is not a typical workflow. Lots-of-files is far more common.
Note that more work needs to be done in Dask so that something formulated like df[(df.col0 > val) & (df.col1 == otherval)].finalcol efficiently picks out only the required chunks of data.
Finally, on DBs: the cloud offerings give you a real SQL experience (windowed joins, query optimisation…), so if that’s important, go for it. They are geared to high query volumes, though, and you might pay more than you expect otherwise. For reference DBs of ~MBs, downloading SQLite files is fine; in fact, any format would be fine! Standing up a PostgreSQL server is a Helm one-liner if you happen to have a cluster, so that’s another fine solution.
Heeeey, @rabernat @jhamman, after 6 months buried in other work and pandemic freakout I’m finally back to working on this. We’re trying to come up with a work plan, and some kind of estimate for how much time it’s going to take, and were wondering if y’all could provide some guidance / commentary.
Some things we’re trying to do:
Provide our collaborators with live access to the processed data with as little overhead as possible.
Start doing continuous deployment of the data, by kicking off a full ETL run whenever we merge code into our master or dev branches, which populates resources that are then available for use on the JupyterHub – so a user can choose an environment when they log in, like the current development branch output, and have both the data and the code that corresponds to that branch. Or if they want the most recent released stable version they can choose that.
This would also allow us to set up an at-most-nightly build that tries to run the full ETL on the current sprint branch (which gets code merged into it every few days) so that we can catch bugs that only come up in full integration, and not in unit tests.
Potentially also allow other applications / demonstrations to be built on top of the live, processed data (i.e. have it backing the JupyterHub, but also maybe other web applications).
As far as I understand it, there’s really two separate chunks of work that have to happen:
Setting up a system to automatically populate these cloud-accessible resources, say with a GitHub Action.
Creating a JupyterHub instance which then makes use of those resources.
What would the first step look like? Should we create a Docker container that runs the ETL process, and then somehow hook up a GitHub Action to run that container on GCP nightly if there’s been a push?
Also, given that this is all open data, and we’re mainly supporting academics and non-profit users, and we’d like to make it widely available, do y’all think we might be able to apply to Google for some storage and/or compute credits? Does anyone have a relevant contact?
For the second step, it seems like mostly running through the Zero to JupyterHub on Kubernetes setup guide. How much time would y’all expect that to take someone who is only slightly familiar with Docker & GCP, and who hasn’t touched Helm/Kubernetes before? Like if you were telling a grad student or a post-doc to do it… what would you expect?
Base the image on the existing jupyter/scipy-notebook image?
PUDL package and all dependencies installed.
Provide access to the ferc1 & pudl SQL resources.
Provide access to the Apache Parquet datasets via cloud storage buckets.
Q: Will user Docker containers need to track PUDL software versions and/or sprint / dev / master branches to ensure that the software environment and ETL data are in sync? Or can the appropriate version of the PUDL software / DB connections / storage bucket IDs be specified from within a Jupyter notebook / console? (looks like we can allow users to select one of several pre-generated environments, and have one for each available version / branch).
Q: Where in all of this do we set up a Dask cluster / runner for managing tasks from within a notebook?
Q: How do we make it easy for any notebooks being used for remote analysis to be checked into / synced with a GitHub repo?
Q: How does this setup interact with ongoing module development? Right now we run a local Jupyter server and have a pip install -e ./ version of our software that we edit, commit, push, etc., and it’s automatically reloaded with every cell that’s run. How would this work in parallel? Would we need to push our locally edited code to a repo that’s then being pulled down to the JupyterHub? Would we have to edit / commit / push from the remote server?
Q: Is there any way to track per-user resource costs?
Q: How does a JupyterHub scale up / down and free up resources that aren’t being used to avoid unnecessary costs?
@martindurant thanks for your note about just using a normal DB rather than the managed cloud solution. We have two SQLite DBs right now, which are ~300MB and ~700MB respectively. They’ll grow over time, but I imagine they’ll stay less than 1GB each for the foreseeable future. If we were storing SQLite DB files in a GCS storage bucket, then would we be copying those files into each user’s persistent volume? If we ran our own Postgres server inside the containers, would that be similar – each user would end up with their own personal copy of the DB? Or could it be a single shared DB? Right now we’re treating the DB as read-only for all intents and purposes. It’s a place to go grab the data from quickly in a well-defined structure, so you can play with it in dataframes. Would it be possible to use this DB for other applications outside of the JupyterHub context?
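e.g. I’m imagining each user’s environment doing something like this (the bucket name is made up, and I’m assuming the standard /home/jovyan home directory from the Jupyter docker-stacks images):

```python
import fsspec
import sqlalchemy as sa

# Copy the SQLite file from GCS into the user's persistent volume once,
# then treat it as a local, effectively read-only database.
with fsspec.open("gs://example-pudl-data/pudl.sqlite", "rb") as remote:
    with open("/home/jovyan/pudl.sqlite", "wb") as local:
        local.write(remote.read())

engine = sa.create_engine("sqlite:////home/jovyan/pudl.sqlite")
```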
Anyway, excited to finally get this up and running, and hopefully spend much less time running the ETL process manually and walking users through the admittedly tedious current setup process.