Converting Weather Radar Raw Data into Analysis-Ready Cloud-Optimized (ARCO) Datasets

Hi Pangeo Community,

I’m excited to share a work-in-progress project called raw2zarr, a Python package developed to convert weather radar raw data into an Analysis-Ready Cloud-Optimized (ARCO) format. This project aims to streamline the process of working with weather radar data, particularly for operational radars like those used by National Weather Services.

Motivation:

Weather radars, such as those in the Colombian National Weather Radar Network and NEXRAD, produce large volumes of data daily, making it challenging to access and analyze efficiently. By converting raw radar data into ARCO format, I aim to enhance data accessibility, interoperability, and usability across various platforms. This ARCO approach follows the FAIR principles and allows seamless integration with cloud-based workflows and scalable computing environments like Pangeo.

Key Features:

  • Zarr-Append Pattern: The package follows a Zarr-append pattern, letting the dataset grow over time, which is critical for live, operational radar data updates (see the sketch after this list).
  • DataTree Structure: Because radar sweeps can have unequal dimensions, raw2zarr leverages datatree to store each sweep at a separate node, allowing flexibility in data storage and access.
  • Scalability: The package is initially developed for the Colombian radar network but can be extended to NEXRAD, which could greatly improve radar data access and usability.
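
To make the first two points concrete, here is a minimal, self-contained sketch of the append-per-node idea (illustrative only, not raw2zarr's actual code; the variable, dimension, and group names are made up):

import numpy as np
import xarray as xr

store = "radar_demo.zarr"
times = ["2024-01-01T00:00", "2024-01-01T00:05"]

def make_sweep(t):
    """Build a toy single-sweep dataset for one volume scan time."""
    return xr.Dataset(
        {
            "DBZH": (
                ("vcp_time", "azimuth", "range"),
                np.random.rand(1, 360, 100).astype("float32"),
            )
        },
        coords={
            "vcp_time": [np.datetime64(t)],
            "azimuth": np.arange(360.0),
            "range": np.arange(100) * 150.0,
        },
    )

for i, t in enumerate(times):
    for sweep in ("sweep_0", "sweep_1"):  # sweeps live in separate nodes
        ds = make_sweep(t)
        if i == 0:
            # the first volume scan creates each group (assumes a fresh store)
            ds.to_zarr(store, group=sweep, mode="a")
        else:
            # later scans are appended along the time-like dimension
            ds.to_zarr(store, group=sweep, mode="a", append_dim="vcp_time")

# the whole hierarchy can then be opened lazily as a single tree
dt = xr.backends.api.open_datatree(store, engine="zarr", chunks={})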

Data Source:

raw2zarr uses data from the Colombian National Weather Radar Network, available on an AWS bucket at this link. This open dataset allows easy access to radar data for testing and analysis.
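
If you want to browse the raw files directly, something along these lines should work (the bucket name below is my reading of the AWS Open Data registry entry for the IDEAM radar network, so treat it as an assumption and double-check it against the link above):

import s3fs

# anonymous access to the public bucket (bucket name assumed from the AWS Open Data registry)
fs = s3fs.S3FileSystem(anon=True)
print(fs.ls("s3-radaresideam")[:5])  # peek at the first few top-level prefixes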

Current Progress:

I’ve successfully tested raw2zarr on a small dataset, and it worked as expected. However, I’ve noticed that load times get longer as the dataset grows: the time to open the store increases with its size.
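
For anyone who wants to check this locally, a minimal timing sketch looks roughly like this (the store path matches the one used later in this thread; the timing code itself is not part of raw2zarr):

import time
from xarray.backends.api import open_datatree

store = "../zarr/Guaviare_V2.zarr"  # local store written by raw2zarr

t0 = time.perf_counter()
dtree = open_datatree(store, engine="zarr", consolidated=True, chunks={})
elapsed = time.perf_counter() - t0

n_nodes = sum(1 for _ in dtree.subtree)  # count nodes in the tree
print(f"open_datatree: {elapsed:.2f} s for {n_nodes} nodes")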

I initially suspected the Zarr-append pattern and the increasing dataset size, prompted by a discussion I came across on the Pangeo forum (Puzzling S3 Xarray Open Zarr Latency). However, since my data is still stored locally, I don’t believe S3 latency is the issue here. I’ve also been collaborating with the Xarray datatree team to improve open_datatree performance. Despite the progress made, I’m still facing challenges with lazy loading, and I’m not yet sure whether the bottleneck lies in the data model (datatree), Zarr, or Xarray; I’m continuing to investigate.

Reproducibility:

If anyone is interested in reproducing a small subset of our results, you can clone the raw2zarr repo, create the environment, and run the included notebook, which demonstrates how the package works with radar data and includes a sample dataset.

Call for Feedback:

I’m sharing this here to gather feedback and suggestions from the community. I’d love to hear your thoughts on improving performance and streamlining the workflow.

Thank you!


I imagine normal Python profiling tools can tell you where time is being spent while opening a dataset. It could be listing files, reading the metadata, making coordinate arrays, or something else.

Thanks again for the suggestion, @martindurant! I ran the profiling on the open_datatree function using the following code:

from xarray.backends.api import open_datatree


def main():
    # local Zarr store written by raw2zarr
    path = "../zarr/Guaviare_V2.zarr"
    # open the full tree lazily (chunks={} returns dask-backed variables)
    dt = open_datatree(
        path,
        engine="zarr",
        consolidated=True,
        chunks={},
    )


if __name__ == "__main__":
    main()
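
The script itself doesn’t invoke the profiler; a rough sketch of how a profile like the dtree.prof attached below could be produced and then inspected (the script filename here is just an example) would be:

# first run the script above under cProfile, e.g.:
#   python -m cProfile -o dtree.prof open_dtree.py
import pstats

stats = pstats.Stats("dtree.prof")
stats.sort_stats("cumulative").print_stats(20)  # top 20 entries by cumulative time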

And here are some insights from the results:

  1. Overall time consumption: The total time spent in open_datatree is significant, with a cumulative time of around 18.7 seconds.
  2. Deepcopy operations: deepcopy is called over 4.65 million times, for a cumulative time of 9 seconds, 4.5 seconds of which is its own processing time. This suggests that excessive deep copying may be a significant factor in the performance issue.
  3. Alignment and copying operations: Alignment-related functions (such as align and reindex_all) also take considerable time, contributing around 9 seconds in total. This suggests that reindexing and aligning the data during open_datatree is part of the overhead.

Here are the cProfile output files in case you want to take a look:

dtree.prof

dtree.pstat

I am working on posting a small dataset to an S3 bucket so these results can be reproduced.

Please let me know your thoughts or ideas.

@aladinor, please raise these findings on the xarray issue tracker, particularly on Performance of deep DataTrees · Issue #9511 · pydata/xarray · GitHub. And tell us how many variables / groups are in the tree that you’re opening here.

I am working on posting a small dataset to an S3 bucket so these results can be reproduced.

This would be very helpful.


Thanks, @TomNicholas, for your suggestion. This is a minimal reproducible example, in case you want to look at it.

import s3fs
import xarray as xr


def main():
    # anonymous connection to the S3-compatible Jetstream2 bucket
    URL = "https://js2.jetstream-cloud.org:8001/"
    path = "pythia/radar/erad2024"
    fs = s3fs.S3FileSystem(anon=True, client_kwargs=dict(endpoint_url=URL))
    store = s3fs.S3Map(f"{path}/zarr_radar/Guaviare_test.zarr", s3=fs)

    # open the datatree stored in Zarr
    dtree = xr.backends.api.open_datatree(
        store,
        engine="zarr",
        consolidated=True,
        chunks={},
    )


if __name__ == "__main__":
    main()

Opening this small datatree, with around ten nodes and ~1 GB in size, takes around 7 to 10 seconds.

I will post this in the performance issue of deep DataTrees (#9511) as well.

I see that _chunk_getitems is being called 70 times, which corresponds to the number of coordinate variables in the tree. Each variable’s values are eagerly loaded, and this happens serially: all the chunks of one coordinate variable are indeed fetched concurrently, but the next set of chunks isn’t requested until that is done. There are only a few chunks per coordinate; it would be entirely possible to load all of the chunks concurrently in a single call.

Zarr v3’s more pervasive async internal model may help with this, but I don’t know if xarray is (yet) ready to make use of it.
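
To make the batching idea concrete, here is a rough sketch (not an xarray or zarr change; the coordinate names are assumptions about this particular store) showing that fsspec can already fetch many chunk keys in one concurrent call:

import s3fs

# same Jetstream2 endpoint as the example above
URL = "https://js2.jetstream-cloud.org:8001/"
fs = s3fs.S3FileSystem(anon=True, client_kwargs=dict(endpoint_url=URL))
root = "pythia/radar/erad2024/zarr_radar/Guaviare_test.zarr"

# gather the data-chunk keys of two (assumed) coordinate arrays across all groups,
# skipping the ".z*" metadata files
keys = [
    k for k in fs.find(root)
    if ("/azimuth/" in k or "/range/" in k)
    and not k.rsplit("/", 1)[-1].startswith(".")
]

# a single fs.cat() on a list of paths fetches them all concurrently
# (async under the hood) and returns a {path: bytes} mapping
chunks = fs.cat(keys)
print(f"fetched {len(chunks)} coordinate chunks in one concurrent call")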


Thanks @martindurant, that’s very helpful - I’ve raised Slow open_datatree for zarr stores with many coordinate variables · Issue #9640 · pydata/xarray · GitHub as an avenue for tracking this specific issue and continuing the discussion on possibly using async to load coordinate variables from zarr.