I was given some hydrologic model output in the form of a 10 GB fixed-width ASCII file, where the first two columns are year and month, and the rest of the columns are streamflow at specific locations. There are 1512 rows and 497,000 columns. Nice, eh?
Since this is just an array of numbers, I thought I’d read it with Dask DataFrame, convert it to a Dask array, and write it to Zarr. But I can’t figure out how to write to Zarr:

```python
import dask.dataframe as dd

df = dd.read_fwf('/scratch/streamflow.monthly.all', sample_rows=1, sample=256000, header=None)
da = df.to_dask_array()
da
```
but I can’t just do

```python
da.to_zarr()
```

because it complains about the chunk sizes not being known.
And I tried computing the chunk sizes using:

```python
da = df.to_dask_array(lengths=True)
```

but it ran for over an hour without returning, so that's clearly not the best solution.
If I could just specify

```python
chunks=(1512, 10000)
```

I’d be happy.
Is there a way to just specify the chunk sizes for my Dask array?