Setting up a US Electricity System data deployment

Hi y’all,

I’ve been working for the last couple of years compiling US electricity system data for use by NGOs working in regulatory and legislative processes, and I think we are finally at the point where we want to make a live copy of the data available to users. Up until this month folks had to run the whole data processing pipeline themselves to access the outputs, which was more than most of our users were willing/able to do, since many of them are coming from a more finance / spreadsheet oriented analytical background. Some of them realize the limitations of that framework though, and want to start using Python and Jupyter notebooks for analysis, but we’ve still had challenges with the interface and process changing over time, so it has been frustrating for them to keep their local system & data up to date. Having a JupyterHub with all the processed data loaded on it would let us take care of all the system upkeep, and provide them with access while minimizing the number of things they need to learn to be effective in working with the data (just the Python & Jupyter part… and not all the underlying infrastructure).

I don’t know if there’s a clear line between running JupyterHub on Kubernetes with cloud access to large-ish datasets, and Pangeo proper, but I’m starting to look at how to set these things up. I don’t think our data is really appropriate for zarr/xarray – it’s not big data cubes. The larger datasets are generally going to be time series with ~1e9 to ~1e10 records, which we’re currently storing in Apache Parquet files and accessing with Dask. The smaller datasets are organized into an SQLite database locally, with tables that have up to ~1e6 records.

We’re containerizing our ETL process to make it more easily reproducible, and so we can have it run on cloud resources regularly, validating new data and new code on an ongoing basis, and generating data release candidates automatically. It seems like the same containers could be used for the JupyterHub, right?

Does anyone have recommendations on how we ought to store this data for use with a JupyterHub? Generally it’s meant to be read-only, with analyses happening in dataframes / notebooks to generate summary analyses or figures. Should it all get loaded into something like BigQuery? Should we just keep using the combination of SQLite + Parquet on disk? How does the data get replicated or shared across several different users at the same time?

I’m sure I could figure it out on my own eventually, but would love to get pointed in the right direction initially so I don’t end up going down a bunch of dead ends on my own, or end up configuring something that doesn’t end up meeting our needs well.

Our most recent data release: https://zenodo.org/record/3672068
The project on Github: https://github.com/catalyst-cooperative/pudl

Hi @zaneselvans! I’m glad to hear you are moving forward with this project. I’ll provide some opinionated responses here, others may have additional ideas.

For many of our cloud deployments, we’ve been putting our data in cloud object stores (e.g. S3, GCS). For our large multi-dimensional datasets, we’ve been using Zarr, but for tabular/time-series datasets, Parquet would be the natural analog. You may have seen this already, but Dask’s dataframe documentation has a nice section on working with remote Parquet datasets:

https://docs.dask.org/en/latest/remote-data-services.html#remote-data

So my opinionated suggestion would be to just stick with Parquet for now and provide access to the data via a public cloud bucket. The cloud object store will handle the duplication and parallel/simultaneous access for you.

The last thing you may want to think about is some sort of high-level data broker or catalog application. We’ve been using Intake, and I think it could work well for your application.
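For a rough idea of what that could look like, here is a sketch of an Intake catalog entry for a remote Parquet dataset. The bucket name, dataset name, and description are all hypothetical placeholders:

```yaml
# pudl-catalog.yml (hypothetical)
sources:
  hourly_generation:
    description: Hourly net generation by plant, partitioned by year and state.
    driver: parquet
    args:
      urlpath: "gcs://my-pudl-bucket/hourly_generation/"
      storage_options:
        token: anon
```

Users would then open the catalog with `intake.open_catalog(...)` and call `.to_dask()` on an entry, without needing to know where or how the data is stored.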

Okay, so parquet in a storage bucket for the tables with billions of rows. Does it make sense to keep those datasets partitioned as they are now, by year and state (i.e. a directory structure with names like year=2018, each containing a bunch of subdirectories like state=CO), so that the whole dataset doesn’t have to be scanned every time it’s queried?

And what do I do with the smaller and more relational data that’s currently living in an SQLite DB for local usage? It’s a few dozen well normalized tables, but less than 1GB in total. We’re using SQLAlchemy internally to query it. Can the pudl.sqlite database file just get dropped in a storage bucket too? That seems… pretty janky. Should it get loaded into some Google SQL offering instead? The archiving format we’re using is tabular data packages, which are just CSVs for the data, with metadata including the database schema stored in JSON files. In theory the same data should be loadable using the datapackage tools into a PostgreSQL DB or BigQuery, though we haven’t played with that yet. But is BigQuery really even the right thing to use? This is just a normal little relational DB, with tables having up to ~1 million rows.

Another :+1: for parquet on GCS. If your users like to use SQL, they can directly query parquet files using BigQuery: https://cloud.google.com/bigquery/external-data-sources

Since parquet is already a sharded format, you could consider abandoning your current year/state partitioning and instead storing everything in a single giant parquet file (with year and state as additional columns / indexes).

For your SQLite DB, the “cloud-native” solution would be to put your relational data into a cloud-based database like Google Cloud SQL.

Hmm, so what is the difference between a Parquet dataset that’s partitioned on disk into multiple files/folders and having some kind of internal indexing / partitioning? I had thought that splitting the data into different files minimized the amount of data that had to be scanned if one was querying against the partitioning columns, reducing read times, and the cost of running queries against the data in a cloud hosting context. Dask seems happy to get pointed at a whole partitioned dataset (the top level directory) and then it only reads from the files as required to satisfy a query / operation.

I think there is no generic answer to these questions. Ultimately one needs to define a few use cases and benchmark the different options.

Yet another :+1: here for staying with Parquet. As for dataset partitioning:

That is true! Parquet will put your data into different chunks anyway, even without explicit content-based partitioning into subdirectories. And since it keeps some information on what is in each chunk, reads will already be somewhat optimized; this is probably what @rabernat is referring to. But I also think it won’t hurt, and can be more efficient, to explicitly split your data. As @rabernat says in his last comment, there is no generic answer.

However, one thing that hasn’t been mentioned yet: when using Parquet (or any other chunked format, like Zarr) in an object store, the size of the chunks matters a lot. If your explicit partitioning results in chunks that are too small, performance against an object store will suffer. The optimal chunk size we generally suggest is somewhere between tens and hundreds of MB. Martin Durant, who doesn’t seem to be active in this forum yet, might have things to say about this.

Hmm, interesting. So maybe we should only be partitioning by year or state, and not both. Compressed on disk, each year averages 200MB of data (and they’re all about the same size), while each state is about 100MB (but they vary wildly in size).

Do you have to do something explicitly to implement the partitioning that happens inside the files, rather than in the filesystem? The two layers of partitioning on disk are just the output of something like:

parquet.write_to_dataset(
    pyarrow.Table.from_pandas(df),
    root_path="output_dir",
    partition_cols=["year", "state"],
)

But I guess I need to go read more about how the file format works and play around with it.

The 100MB recommendation comes from comparing the time to complete a minimal operation on S3 (originally) against the download rate. Higher values make the overhead even smaller, and the limit there is the amount of memory available to your worker. But remember that 100MB on disk can expand into a much larger in-memory representation; how much larger depends on many factors.

One thing not mentioned so far here, is that if you have many many files because of partitions, and you don’t have a centralised “_metadata” file, then you will at some point have to list all the files, which can be time-consuming in itself. The metadata file can be built after-the-fact, but this is rare and not appropriate if you mean to keep expanding the parquet data. A single-file version of the data will necessarily have all the metadata (which is good and bad, since now you have to parse it all in one go).

Rarely mentioned: partitioning by paths will save you space, because you don’t need to store values for the partitioning column(s). However, those columns would be expected to compress particularly well anyway.
To partition “within a file”, you would have to split the data with a group-by on those columns and append the pieces one by one (because you don’t know where a piece ends in the file until the previous piece is written); this is not a typical workflow. Lots-of-files is far more common.

Note that more work needs to be done in Dask so that a query like df[(df.col0 > val) & (df.col1 == otherval)].finalcol efficiently picks out only the required chunks of data.

Finally, on DBs: the cloud offerings give you a real SQL experience (windowed joins, query optimisation…), so if that’s important, go for them. But they are geared toward high query volumes, and you might pay more than you expect otherwise. For reference DBs of a few MB, downloading SQLite files is fine; in fact, any format would be fine! Standing up a PostgreSQL server is a Helm one-liner if you happen to have a cluster, so that’s another fine solution.
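For the download-the-SQLite-file route, one detail worth knowing: after users copy the file down from the bucket, they can open it in read-only mode via a SQLite URI, so many people can safely share the same file. A minimal stdlib sketch (file and table names hypothetical):

```python
import sqlite3

# Pretend this file was just downloaded from the bucket.
con = sqlite3.connect("pudl_demo.sqlite")
con.execute("CREATE TABLE plants (id INTEGER, state TEXT)")
con.executemany("INSERT INTO plants VALUES (?, ?)", [(1, "CO"), (2, "TX")])
con.commit()
con.close()

# Open read-only via a SQLite URI; any attempted write will raise an error.
ro = sqlite3.connect("file:pudl_demo.sqlite?mode=ro", uri=True)
count = ro.execute("SELECT COUNT(*) FROM plants WHERE state = 'CO'").fetchone()[0]
print(count)
```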
