Hmm, so what is the difference between a Parquet dataset that’s partitioned on disk into multiple files/folders and one that has some kind of internal indexing / partitioning? I had thought that splitting the data into different files minimized the amount of data that has to be scanned when querying against the partitioning columns, reducing both read times and the cost of running queries against the data in a cloud hosting context. Dask seems happy to be pointed at a whole partitioned dataset (the top-level directory) and then only reads from the files required to satisfy a query / operation.
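To make it concrete, here's a minimal sketch of the kind of thing I mean, assuming a hive-style layout like `year=2023/month=01/part.0.parquet` under the top-level directory (the bucket path and column names below are just placeholders):

```python
import dask.dataframe as dd

# Point Dask at the top-level directory of the partitioned dataset,
# not at individual files. The path and columns are hypothetical.
df = dd.read_parquet(
    "s3://my-bucket/events/",
    filters=[("year", "==", 2023)],  # predicate on the partitioning column
)

# My understanding: only the files under year=2023/... actually get read,
# so the rest of the dataset is never scanned (or fetched from S3).
counts = df.groupby("user_id").size().compute()
```

That's the behavior I'm describing: the filter on the partitioning column seems to prune which files are touched at all.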