New to Pangeo? A Quickstart Guide for Data Analysts and Engineers

I’m involved in several projects that are hiring new data engineers to help with the technical aspects of research projects. These folks need to get up to speed quickly on the main tools, technologies, and architectures used in our community. Rather than putting this information in private emails, I’ve decided to share it here on the forum.

Disclaimer: this is a very biased list that comes from my personal experience and perspective! I’m actively seeking feedback to include more resources. Leave your comments and suggestions below, and I’ll update the post on a rolling basis.

Audience

This is written for data analysts and engineers who already have general experience but do not have experience specific to geospatial, weather, climate, ocean, etc. data. In particular, I assume you already have the following skills:

  • Can use the core scientific Python packages numpy and pandas for data loading and processing, and matplotlib for data visualization.
  • Comfortable writing and running Python code in Jupyter notebooks and standalone scripts.
  • Comfortable using Git and GitHub.
  • Basic understanding of how to package and share Python code.
  • Solid foundation in the fundamentals of data: data types (e.g. float, int, text), data volumes, throughput, latency, arrays vs. tables, schemas, binary files, JSON, CSV, etc.

Learn about the Data Model and Data Formats

Here is a page I wrote for my course with a quick overview of data models and formats.
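To make the distinction between the two data models you will encounter most often concrete, here is a minimal sketch with made-up data: tables (heterogeneous columns, the pandas/SQL world) vs. n-dimensional arrays (homogeneous blocks, the numpy/netCDF world).

```python
import numpy as np
import pandas as pd

# Tabular data model: rows of heterogeneous, named columns
table = pd.DataFrame({"station": ["A", "B"], "temp_c": [21.5, 19.0]})

# Array data model: a single homogeneous n-dimensional block,
# e.g. a (time, lat, lon) cube of gridded values
cube = np.zeros((24, 180, 360), dtype="f4")
```

Most of the libraries below exist to make the array data model as pleasant to work with as pandas makes the tabular one.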

Here are some other resources to dig deeper:

Learn to Use the Core Libraries

Go through these tutorials:

Xarray
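Before diving into the tutorial, here is a minimal sketch (with synthetic data) of what xarray adds on top of numpy: labeled dimensions and coordinates let you select and aggregate by name rather than by axis position.

```python
import numpy as np
import xarray as xr

# A small 2D field with labeled dimensions and coordinates
temps = xr.DataArray(
    np.random.rand(3, 4),
    dims=["time", "lon"],
    coords={"time": [0, 1, 2], "lon": [10.0, 20.0, 30.0, 40.0]},
    name="temperature",
)

# Label-based selection instead of positional indexing
subset = temps.sel(lon=20.0)

# Reductions over a named dimension, not a numeric axis
mean_over_time = temps.mean(dim="time")
```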

Dask
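The core idea in Dask is lazy, chunked computation: operations build a task graph over array chunks, and nothing runs until you ask for a result. A minimal sketch:

```python
import dask.array as da

# A lazy, chunked array: 16 chunks of shape (250, 250)
x = da.ones((1000, 1000), chunks=(250, 250))

total = x.sum()          # builds a task graph; computes nothing yet
result = total.compute() # executes the graph (in parallel where possible)
```

The same pattern scales to arrays far larger than memory, and xarray can use dask arrays as a drop-in backend.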

Zarr

RasterIO and RioXarray

GeoPandas
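GeoPandas extends pandas with a geometry column and CRS-aware operations. A minimal sketch with two made-up points (the EPSG codes are standard; the buffer distance is arbitrary):

```python
import geopandas as gpd
from shapely.geometry import Point

# A GeoDataFrame: a regular DataFrame plus a geometry column and a CRS
gdf = gpd.GeoDataFrame(
    {"name": ["a", "b"]},
    geometry=[Point(0, 0), Point(1, 1)],
    crs="EPSG:4326",  # lon/lat
)

# Reproject to a metric CRS before doing distance-based operations
buffered = gdf.to_crs("EPSG:3857").buffer(1000)  # 1 km buffers
```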

Understand Advanced Use Cases and Challenges

These are documented on our forum

Explore the more Experimental Libraries and Projects

These are all recent projects that have emerged from this community in response to specific user needs.

Learn about Cloud Storage

Cloud data storage is an area where we are really lacking documentation, tutorials, guides, etc.

Some high level material about “why cloud?” can be found here:

More technical material
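One building block worth knowing here is fsspec, which gives a uniform filesystem interface over local disk, memory, and cloud object stores (s3://, gs://, etc.); it is what xarray and zarr use under the hood to read from the cloud. A minimal sketch using the in-memory backend so it runs without any credentials; in principle, swapping the protocol string (and adding auth) is what changes for S3 or GCS:

```python
import fsspec

# The "memory" protocol behaves like a cloud object store, locally
fs = fsspec.filesystem("memory")

# Write and read bytes through the same filesystem-like API
with fs.open("/demo/hello.txt", "wb") as f:
    f.write(b"hello")

contents = fs.cat("/demo/hello.txt")
```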

Clearly this is an area where we have work to do in terms of documenting workflows. Does anyone have any more material they can suggest here?


For intermediate/advanced cloud storage, you could add something on the limits of PUT and GET requests, e.g. to_zarr to s3 with asynchronous=False · Discussion #5869 · pydata/xarray · GitHub


I smell the start of a bootcamp…


Thanks Ryan, that’s awesome!

This is really complete, and I agree we lack some documentation about Cloud storage and how to store data at scale. You already provide some content with your Zarr tutorials and other links, and you also have some interesting content in your Pangeo slide decks. For example, this one.

We also tried to explain things like chunking in recent courses, maybe this is relevant?

And I hesitated to share this because it is a bit rough and hard to use, but I made some content for courses I have given in past years: GitHub - CNES/big-data-processing-course: Course on Big Data Processing and Cloud computing, for up to 4 days of training. There are, for example, materials about Cloud and Big Data infrastructure and about Object storage (but a lot of this comes from your slides :wink:).
