Setting up a US Electricity System data deployment

Hmm, interesting. So maybe we should only be partitioning by year or state, and not both. Compressed on disk, each year averages about 200MB of data (and the years are all about the same size), while each state is about 100MB of data (but the states vary wildly in size).
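
Concretely, dropping the second layer would just mean something like the following sketch (out_dir is a placeholder for wherever the output data lives, and df is assumed to already exist):

import pyarrow as pa
import pyarrow.parquet as pq

# Sketch: partition on the filesystem by year only, so each year's
# directory holds roughly one ~200MB (compressed) chunk of data.
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path=out_dir, partition_cols=["year"])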

Do you have to do something explicitly to implement the partitioning that happens inside the files, rather than in the filesystem? The two layers of partitioning on disk are just the output of something like:

parquet.write_to_dataset(pyarrow.Table.from_pandas(df), root_path, partition_cols=["year", "state"])
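
From what I can tell (and I could be wrong about this), there isn't a separate in-file partitioning step to configure: the unit of organization inside a Parquet file is the row group, and you only influence it indirectly, by how you sort the data before writing and by how large you let each row group get. Something like this sketch, where the filename and row group size are just placeholders:

import pyarrow as pa
import pyarrow.parquet as pq

# Sketch: assuming df holds a single year of data, sort by state before
# writing so each row group covers only a few states; readers can then
# use the per-row-group min/max statistics to skip row groups when
# filtering on state.
table = pa.Table.from_pandas(df.sort_values("state"))
pq.write_table(table, "one_year.parquet", row_group_size=500_000)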

But I guess I need to go read more about how the file format works and play around with it.
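
For the playing-around part, the file metadata seems like the natural place to poke first; a quick sketch (the path is a placeholder):

import pyarrow.parquet as pq

# Print how an existing file is actually laid out: how many row groups
# it has, and how many rows and bytes each one holds.
pf = pq.ParquetFile("one_year.parquet")
md = pf.metadata
print(md.num_row_groups, "row groups,", md.num_rows, "rows total")
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")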