Dear Pangeo community, it was recommended that I ask here for advice. I come from X-ray physics, where we scan samples with an X-ray beam in raster mode. In my current case, a scan consists of thousands of HDF5 files, and each HDF5 file contains 3-10 numpy arrays; these are my 2d images. I need to process these images with another Python library to extract data from them, but I also want to keep the processed data. In addition, I have other HDF5 files from which I need to extract some numbers and store them with the processed data. These are motor positions, in essence x,y values telling me where the X-ray beam was. As there are so many files, it would be nice if this could be easily parallelised, which is why I was looking at dask and dask dataframes. But I am not sure whether this would work.
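To make this concrete, here is a minimal sketch of the kind of parallel loop I have in mind, using h5py and dask.delayed. The file pattern, the `entry/data` dataset path, and the `process_image` function are placeholders for my real layout and analysis library:

```python
import glob

import dask
import h5py
import numpy as np


def process_image(image):
    # Stand-in for the real analysis done by the other Python library.
    return float(np.sum(image))


@dask.delayed
def load_and_process(path):
    # Open one HDF5 file and process every 2d image inside it.
    results = []
    with h5py.File(path, "r") as f:
        group = f["entry/data"]        # hypothetical group holding the 3-10 arrays
        for name in group:
            image = group[name][()]    # read the dataset as a numpy array
            results.append(process_image(image))
    return results


paths = sorted(glob.glob("scan_0001/*.h5"))   # thousands of files per scan
tasks = [load_and_process(p) for p in paths]
all_results = dask.compute(*tasks)            # executes the tasks in parallel
```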
My aim is to have each scan as a dataframe: each column then contains a different quantity, and each row holds one scan point. I hope this makes sense.
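Something like the following sketch is what I imagine, assembling that dataframe lazily with `dask.dataframe.from_delayed`. The motor file layout, the dataset paths, and the column names are made up for illustration:

```python
import glob

import dask
import dask.dataframe as dd
import h5py
import pandas as pd


@dask.delayed
def point_as_frame(image_path, motor_path):
    # One row per scan point: motor positions plus a value extracted from the image.
    with h5py.File(motor_path, "r") as f:
        x = float(f["entry/motors/x"][()])   # hypothetical dataset paths
        y = float(f["entry/motors/y"][()])
    with h5py.File(image_path, "r") as f:
        image = f["entry/data/image_0"][()]
    return pd.DataFrame({"x": [x], "y": [y], "intensity": [float(image.sum())]})


image_paths = sorted(glob.glob("scan_0001/images/*.h5"))
motor_paths = sorted(glob.glob("scan_0001/motors/*.h5"))

# meta tells dask the column names and dtypes without computing anything
meta = pd.DataFrame({"x": [0.0], "y": [0.0], "intensity": [0.0]})
frames = [point_as_frame(ip, mp) for ip, mp in zip(image_paths, motor_paths)]
scan_df = dd.from_delayed(frames, meta=meta)
```

I realise one-row partitions are probably inefficient, so batching several scan points per delayed call may be the better design; I would welcome advice on that.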
All files are stored on a cluster, and I also work on that cluster, where I can create my own conda environment.
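For running on the cluster, I imagine scaling out with dask-jobqueue, roughly like the sketch below, which assumes the cluster runs SLURM (other schedulers have analogous classes such as `PBSCluster`; the resource numbers are placeholders):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=8,             # cores per batch job
    memory="16GB",       # memory per batch job
    walltime="01:00:00",
)
cluster.scale(jobs=10)   # submit 10 batch jobs as dask workers
client = Client(cluster)
```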
Hence, while asking around for advice, somebody kindly pointed me in the direction of Pangeo. If you have any code examples that perhaps fit my plan, I would be grateful to look at them and learn. Thank you for your guidance.