Hmm, so what is the difference between a Parquet dataset that’s partitioned on disk into multiple files/folders and one that has some kind of internal indexing / partitioning? I had thought that splitting the data into different files minimized the amount of data that has to be scanned when querying against the partitioning columns, reducing both read times and the cost of running queries against the data in a cloud hosting context. Dask seems happy to be pointed at a whole partitioned dataset (the top-level directory) and then only reads from the files required to satisfy a query / operation.
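To make it concrete, here's a minimal sketch of the kind of thing I mean, assuming a hive-style layout like `year=2023/month=01/part.0.parquet` under the top-level directory (the bucket path and column names below are just placeholders):

```python
import dask.dataframe as dd

# Point Dask at the top-level directory of the partitioned dataset,
# not at individual files. The path and columns are hypothetical.
df = dd.read_parquet(
    "s3://my-bucket/events/",
    filters=[("year", "==", 2023)],  # predicate on the partitioning column
)

# My understanding: only the files under year=2023/... actually get read,
# so the rest of the dataset is never scanned (or fetched from S3).
counts = df.groupby("user_id").size().compute()
```

That's the behavior I'm describing: the filter on the partitioning column seems to prune which files are touched at all.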