New to Pangeo? A Quickstart Guide for Data Analysts and Engineers

I’m involved in several projects that are hiring new data engineers to help with the technical aspects of research projects. These folks need to get up to speed quickly on the main tools, technologies, and architectures used in our community. Rather than putting this information in private emails, I’ve decided to share it here on the forum.

Disclaimer: this is a very biased list that comes from my personal experience and perspective! I’m actively seeking feedback to include more resources. Leave your comments and suggestions below, and I’ll update the post on a rolling basis.

Audience

This is written for data analysts and engineers who already have general experience but do not have experience specific to geospatial, weather, climate, ocean, etc. data. In particular, I assume you already have the following skills:

  • Can use the core scientific Python packages numpy and pandas for data loading and processing, and matplotlib for data visualization.
  • Comfortable writing and running Python code in Jupyter notebooks and standalone scripts.
  • Comfortable using Git and GitHub.
  • Basic understanding of how to package and share Python code.
  • Solid foundation in the fundamentals of data: data types (e.g. float, int, text), data volumes, throughput, latency, arrays vs. tables, schemas, binary files, JSON, CSV, etc.

Learn about the Data Model and Data Formats

Here is a page I wrote for my course with a quick overview of data models and formats.
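To make the distinction between the two data models you will encounter most often concrete, here is a minimal sketch with made-up data: tables (heterogeneous columns, the pandas/SQL world) vs. n-dimensional arrays (homogeneous blocks, the numpy/netCDF world).

```python
import numpy as np
import pandas as pd

# Tabular data model: rows of heterogeneous, named columns
table = pd.DataFrame({"station": ["A", "B"], "temp_c": [21.5, 19.0]})

# Array data model: a single homogeneous n-dimensional block,
# e.g. a (time, lat, lon) cube of gridded values
cube = np.zeros((24, 180, 360), dtype="f4")
```

Most of the libraries below exist to make the array data model as pleasant to work with as pandas makes the tabular one.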

Here are some other resources to dig deeper:

Learn to Use the Core Libraries

Go through these tutorials:

Xarray
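Before diving into the tutorial, here is a minimal sketch (with synthetic data) of what xarray adds on top of numpy: labeled dimensions and coordinates let you select and aggregate by name rather than by axis position.

```python
import numpy as np
import xarray as xr

# A small 2D field with labeled dimensions and coordinates
temps = xr.DataArray(
    np.random.rand(3, 4),
    dims=["time", "lon"],
    coords={"time": [0, 1, 2], "lon": [10.0, 20.0, 30.0, 40.0]},
    name="temperature",
)

# Label-based selection instead of positional indexing
subset = temps.sel(lon=20.0)

# Reductions over a named dimension, not a numeric axis
mean_over_time = temps.mean(dim="time")
```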

Dask
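The core idea in Dask is lazy, chunked computation: operations build a task graph over array chunks, and nothing runs until you ask for a result. A minimal sketch:

```python
import dask.array as da

# A lazy, chunked array: 16 chunks of shape (250, 250)
x = da.ones((1000, 1000), chunks=(250, 250))

total = x.sum()          # builds a task graph; computes nothing yet
result = total.compute() # executes the graph (in parallel where possible)
```

The same pattern scales to arrays far larger than memory, and xarray can use dask arrays as a drop-in backend.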

Zarr

RasterIO and RioXarray

GeoPandas
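GeoPandas extends pandas with a geometry column and CRS-aware operations. A minimal sketch with two made-up points (the EPSG codes are standard; the buffer distance is arbitrary):

```python
import geopandas as gpd
from shapely.geometry import Point

# A GeoDataFrame: a regular DataFrame plus a geometry column and a CRS
gdf = gpd.GeoDataFrame(
    {"name": ["a", "b"]},
    geometry=[Point(0, 0), Point(1, 1)],
    crs="EPSG:4326",  # lon/lat
)

# Reproject to a metric CRS before doing distance-based operations
buffered = gdf.to_crs("EPSG:3857").buffer(1000)  # 1 km buffers
```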

Understand Advanced Use Cases and Challenges

These are documented on our forum

Explore the more Experimental Libraries and Projects

These are all recent projects that have emerged from this community in response to specific user needs.

Learn about Cloud Storage

Cloud data storage is an area where we are really lacking documentation, tutorials, guides, etc.

Some high level material about “why cloud?” can be found here:

More technical material
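One building block worth knowing here is fsspec, which gives a uniform filesystem interface over local disk, memory, and cloud object stores (s3://, gs://, etc.); it is what xarray and zarr use under the hood to read from the cloud. A minimal sketch using the in-memory backend so it runs without any credentials; in principle, swapping the protocol string (and adding auth) is what changes for S3 or GCS:

```python
import fsspec

# The "memory" protocol behaves like a cloud object store, locally
fs = fsspec.filesystem("memory")

# Write and read bytes through the same filesystem-like API
with fs.open("/demo/hello.txt", "wb") as f:
    f.write(b"hello")

contents = fs.cat("/demo/hello.txt")
```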

Clearly this is an area where we have work to do in terms of documenting workflows. Does anyone have any more material they can suggest here?


For intermediate/advanced cloud storage, you could add something on the limits of PUT and GET requests, e.g. to_zarr to s3 with asynchronous=False · Discussion #5869 · pydata/xarray · GitHub


I smell the start of a bootcamp…


Thanks Ryan, that’s awesome!

This is really complete, and I agree we lack some documentation about Cloud storage and how to store data at scale. You already provide some content with your Zarr tutorials and other links, and you also have some interesting content in your Pangeo slide decks. For example, this one.

We also tried to explain things like chunking in recent courses, maybe this is relevant?

And I hesitated to share this because it is a bit rough and hard to use, but I made some content for courses I have given in past years: GitHub - CNES/big-data-processing-course: Course on Big Data Processing and Cloud computing, for up to 4 days of training. There are, for example, materials about Cloud and Big Data infrastructure and about Object storage (but a lot of this comes from your slides :wink:).
