Setting up a US Electricity System data deployment

Hi y’all,

I’ve been working for the last couple of years compiling US electricity system data for use by NGOs working in regulatory and legislative processes, and I think we are finally at the point where we want to make a live copy of the data available to users. Up until this month folks had to run the whole data processing pipeline themselves to access the outputs, which was more than most of our users were willing/able to do, since many of them are coming from a more finance / spreadsheet oriented analytical background. Some of them realize the limitations of that framework though, and want to start using Python and Jupyter notebooks for analysis, but we’ve still had challenges with the interface and process changing over time, so it has been frustrating for them to keep their local system & data up to date. Having a JupyterHub with all the processed data loaded on it would let us take care of all the system upkeep, and provide them with access while minimizing the number of things they need to learn to be effective in working with the data (just the Python & Jupyter part… and not all the underlying infrastructure).

I don’t know if there’s a clear line between running JupyterHub on Kubernetes with cloud access to large-ish datasets, and Pangeo proper, but I’m starting to look at how to set these things up. I don’t think our data is really appropriate for zarr/xarray – it’s not big data cubes. The larger datasets are generally going to be time series with ~1e9 to ~1e10 records, which we’re currently storing in Apache Parquet files and accessing with Dask. The smaller datasets are organized into an SQLite database locally, with tables that have up to ~1e6 records.

We’re containerizing our ETL process to make it more easily reproducible, and so we can have it run on cloud resources regularly, validating new data and new code on an ongoing basis, and generating data release candidates automatically. It seems like the same containers could be used for the JupyterHub, right?

Does anyone have recommendations on how we ought to store this data for use with a JupyterHub? Generally it’s meant to be read-only, with analyses happening in dataframes / notebooks to generate summary analyses or figures. Should it all get loaded into something like BigQuery? Should we just keep using the combination of SQLite + Parquet on disk? How does the data get replicated or shared across several different users at the same time?

I’m sure I could figure it out on my own eventually, but would love to get pointed in the right direction initially so I don’t end up going down a bunch of dead ends on my own, or end up configuring something that doesn’t end up meeting our needs well.

Our most recent data release: https://zenodo.org/record/3672068
The project on Github: https://github.com/catalyst-cooperative/pudl

Hi @zaneselvans! I’m glad to hear you are moving forward with this project. I’ll provide some opinionated responses here, others may have additional ideas.

For many of our cloud deployments, we’ve been putting our data in cloud object stores (e.g. S3, GCS). For our large multi-dimensional datasets, we’ve been using Zarr, but for tabular/time-series datasets, Parquet would be the natural analog. You may have seen this already, but Dask’s dataframe documentation has a nice section on working with remote Parquet datasets:

https://docs.dask.org/en/latest/remote-data-services.html#remote-data

So my opinionated suggestion would be to just stick with Parquet for now and provide access to the data via a public cloud bucket. The cloud object store will handle the duplication and parallel/simultaneous access for you.

The last thing you may want to think about is some sort of high-level data broker or catalog application. We’ve been using Intake, and I think it could work well for your application.
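For a rough idea of what that could look like, here is a sketch of an Intake catalog entry for a remote Parquet dataset. The bucket name, dataset name, and description are all hypothetical placeholders:

```yaml
# pudl-catalog.yml (hypothetical)
sources:
  hourly_generation:
    description: Hourly net generation by plant, partitioned by year and state.
    driver: parquet
    args:
      urlpath: "gcs://my-pudl-bucket/hourly_generation/"
      storage_options:
        token: anon
```

Users would then open the catalog with `intake.open_catalog(...)` and call `.to_dask()` on an entry, without needing to know where or how the data is stored.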

Okay, so parquet in a storage bucket for the tables with billions of rows. Does it make sense to keep those datasets partitioned as they are now, by year and state (i.e. a directory structure with names like year=2018, each containing a bunch of subdirectories like state=CO), so that the whole dataset doesn’t have to be scanned every time it’s queried?

And what do I do with the smaller and more relational data that’s currently living in an SQLite DB for local usage? It’s a few dozen well normalized tables, but less than 1GB in total. We’re using SQLAlchemy internally to query it. Can the pudl.sqlite database file just get dropped in a storage bucket too? That seems… pretty janky. Should it get loaded into some Google SQL offering instead? The archiving format we’re using is tabular data packages, which are just CSVs for the data, with metadata including the database schema stored in JSON files. In theory the same data should be loadable using the datapackage tools into a PostgreSQL DB or BigQuery, though we haven’t played with that yet. But is BigQuery really even the right thing to use? This is just a normal little relational DB, with tables having up to ~1 million rows.

Another :+1: for parquet on GCS. If your users like to use SQL, they can directly query parquet files using BigQuery: https://cloud.google.com/bigquery/external-data-sources

Since parquet is already a sharded format, you could consider abandoning your current year/state partitioning and instead storing everything in a single giant parquet file (with year and state as additional columns / indexes).

For your SQLite DB, the “cloud-native” solution would be to put your relational data into a cloud-based database like Google Cloud SQL.

Hmm, so what is the difference between a Parquet dataset that’s partitioned on disk into multiple files/folders and having some kind of internal indexing / partitioning? I had thought that splitting the data into different files minimized the amount of data that had to be scanned if one was querying against the partitioning columns, reducing read times, and the cost of running queries against the data in a cloud hosting context. Dask seems happy to get pointed at a whole partitioned dataset (the top level directory) and then it only reads from the files as required to satisfy a query / operation.

I think there is no generic answer to these questions. Ultimately one needs to define a few use cases and benchmark the different options.

Yet another :+1: here for staying with Parquet. As for dataset partitioning:

That is true! Parquet will put your data into different chunks anyway, even without explicit content-based partitioning into subdirectories. And since it keeps some information on what is in each chunk, reads will already be somewhat optimized; this is probably what @rabernat is referring to. But I also think it won’t hurt, and can be more efficient, to explicitly split your data. As @rabernat says in his last comment, there is no generic answer.

However, one thing that hasn’t been mentioned yet: when using Parquet (or any other chunked format, like Zarr) in an object store, the size of the chunks matters a lot. If your explicit partitioning results in chunks that are too small, performance against an object store will suffer. The optimal chunk size we generally suggest is somewhere between tens and hundreds of MB. Martin Durant, who doesn’t seem to be active in this forum yet, might have things to say about this.

Hmm, interesting. So maybe we should only be partitioning by year or state, and not both. Compressed on disk, each year averages 200MB of data (and they’re all about the same size), while each state is about 100MB (but they vary wildly in size).

Do you have to do something explicitly to implement the partitioning that happens inside the files, rather than in the filesystem? The two layers of partitioning on disk are just the output of something like:

parquet.write_to_dataset(
    pyarrow.Table.from_pandas(df),
    root_path="output_dir",
    partition_cols=["year", "state"],
)

But I guess I need to go read more about how the file format works and play around with it.

The 100MB recommendation comes from comparing the time to complete a minimal operation on S3 (originally) against the download rate. Higher values make the overhead even smaller, and the limit there is the amount of memory available to your worker. But remember that 100MB on disk can expand into a much larger in-memory representation; how much larger depends on many factors.

One thing not mentioned so far here, is that if you have many many files because of partitions, and you don’t have a centralised “_metadata” file, then you will at some point have to list all the files, which can be time-consuming in itself. The metadata file can be built after-the-fact, but this is rare and not appropriate if you mean to keep expanding the parquet data. A single-file version of the data will necessarily have all the metadata (which is good and bad, since now you have to parse it all in one go).

Rarely mentioned: partitioning by paths will save you space, because you don’t need to store values for the partitioning column(s). However, those columns would be expected to compress particularly well anyway.
To partition “within a file”, you would have to split the data with a group-by on those columns and append the pieces one by one (because you don’t know where a piece ends in the file until the previous piece is written); this is not a typical workflow. Lots-of-files is far more common.

Note that more work needs to be done in Dask so that a query like df[(df.col0 > val) & (df.col1 == otherval)].finalcol efficiently picks out only the required chunks of data.

Finally, on DBs: the cloud offerings give you a real SQL experience (windowed joins, query optimisation…), so if that’s important, go for them. But they are geared toward high query volumes, and you might pay more than you expect otherwise. For reference DBs of a few MB, downloading SQLite files is fine; in fact, any format would be fine! Standing up a PostgreSQL server is a Helm one-liner if you happen to have a cluster, so that’s another fine solution.
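For the download-the-SQLite-file route, one detail worth knowing: after users copy the file down from the bucket, they can open it in read-only mode via a SQLite URI, so many people can safely share the same file. A minimal stdlib sketch (file and table names hypothetical):

```python
import sqlite3

# Pretend this file was just downloaded from the bucket.
con = sqlite3.connect("pudl_demo.sqlite")
con.execute("CREATE TABLE plants (id INTEGER, state TEXT)")
con.executemany("INSERT INTO plants VALUES (?, ?)", [(1, "CO"), (2, "TX")])
con.commit()
con.close()

# Open read-only via a SQLite URI; any attempted write will raise an error.
ro = sqlite3.connect("file:pudl_demo.sqlite?mode=ro", uri=True)
count = ro.execute("SELECT COUNT(*) FROM plants WHERE state = 'CO'").fetchone()[0]
print(count)
```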
