Hi Pangeo Community,
I’m excited to share a work-in-progress project called raw2zarr, a Python package developed to convert weather radar raw data into an Analysis-Ready Cloud-Optimized (ARCO) format. This project aims to streamline the process of working with weather radar data, particularly for operational radars like those used by National Weather Services.
Motivation:
Weather radars, such as those in the Colombian National Weather Radar Network and NEXRAD, produce large volumes of data daily, making it challenging to access and analyze efficiently. By converting raw radar data into ARCO format, I aim to enhance data accessibility, interoperability, and usability across various platforms. This ARCO approach follows the FAIR principles and allows seamless integration with cloud-based workflows and scalable computing environments like Pangeo.
Key Features:
- Zarr-Append Pattern: The package follows a Zarr-append pattern, enabling the dataset to grow over time, which is critical for live, operational radar data updates.
- DataTree Structure: Given that radar sweeps can differ in their dimensions, raw2zarr leverages datatree to store radar sweeps at different nodes, allowing flexibility in data storage and access.
- Scalability: The package was initially developed for the Colombian radar network but can be extended to NEXRAD, which could greatly improve radar data access and usability.
Data Source:
raw2zarr uses data from the Colombian National Weather Radar Network, available on an AWS bucket at this link. This open dataset allows easy access to radar data for testing and analysis.
Current Progress:
I’ve successfully tested raw2zarr on a small dataset, which worked as expected. However, as the dataset grows, I’ve noticed longer load times when opening the data. As shown in the following snippet, the time to open the dataset grows as its size increases:
I suspect this might be due to the Zarr-append pattern and the increasing dataset size, as I came across a discussion on the Pangeo forum (Puzzling S3 Xarray Open Zarr Latency). However, since my data is still stored locally, I don’t believe this is the issue. On the other hand, I’ve been collaborating with the Xarray datatree team to improve the open_datatree performance. Despite the progress made, I’m still facing challenges with lazy loading. I’m unsure if this issue stems from the data model (datatree), Zarr, or Xarray, and I’m continuing to investigate.
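One way to narrow this down is to time the lazy open separately from the actual data load, so that metadata overhead and I/O are not mixed together. The helper below is a minimal sketch using only the standard library; `time_call` and `profile_open` are hypothetical names, and the xarray calls in the comment are an assumed usage, not measurements from raw2zarr.

```python
import time


def time_call(fn, repeats=3):
    """Return (best wall-clock seconds over `repeats` calls, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result


def profile_open(open_fn, load_fn, repeats=3):
    """Time the (ideally lazy) open step and the eager load step separately."""
    open_s, obj = time_call(open_fn, repeats)
    load_s, _ = time_call(lambda: load_fn(obj), repeats)
    return {"open_s": open_s, "load_s": load_s}


# Assumed usage with xarray (store path and group name are placeholders):
#   import xarray as xr
#   profile_open(
#       lambda: xr.open_zarr("radar.zarr", group="sweep_0"),
#       lambda ds: ds.load(),
#   )

# Stand-in callables so the sketch runs without any data on disk.
demo = profile_open(lambda: list(range(1000)), sum)
print(demo)
```

If `open_s` grows with the number of appended time steps while `load_s` of a fixed selection stays flat, the slowdown is more likely in metadata handling or index creation at open time than in reading chunks.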
Reproducibility:
If anyone is interested in reproducing a small subset of our results, you can clone the raw2zarr repo, create the environment, and run the included notebook. This should show you how the package works with radar data and give you a sample dataset to explore.
Call for Feedback:
I’m sharing this here to gather feedback and suggestions from the community. I’d love to hear your thoughts on improving performance and streamlining the workflow.
Thank you!