I’m involved in several projects that are hiring new data engineers to help with the technical aspects of research projects. These folks need to quickly get up to speed on the main tools, technologies, and architectures used in our community. Rather than putting this information in private emails, I’ve decided to share it here on the forum.
Disclaimer: this is a very biased list that comes from my personal experience and perspective! I’m actively seeking feedback to include more resources. Leave your comments and suggestions below, and I’ll update the post on a rolling basis.
Audience
This is written for data analysts and engineers who already have general experience but do not have experience specific to geospatial, weather, climate, ocean, and similar data. In particular, I assume you already have the following skills:
- Can use the core scientific Python packages numpy and pandas for data loading and processing, and matplotlib for data visualization.
- Comfortable writing and running Python code in Jupyter notebooks and standalone scripts.
- Comfortable using git and GitHub.
- Basic understanding of how to package and share Python code.
- Solid foundation in the fundamentals of data: data types (e.g. float, int, text), data volumes, throughput, latency, arrays vs. tables, schemas, binary files, json, csv, etc.
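As a quick self-check on that last item, here is a minimal sketch of the kind of back-of-the-envelope estimate it refers to: how big is an array, and roughly how long would it take to read? The grid size, throughput, and latency values are hypothetical, and the function names are made up for illustration.

```python
import math


def array_nbytes(shape, itemsize):
    """Uncompressed size in bytes of a dense array with the given shape."""
    return math.prod(shape) * itemsize


def transfer_seconds(nbytes, throughput_bytes_per_s, latency_s, n_requests=1):
    """Rough read time for sequential requests: latency per request plus bytes / throughput."""
    return n_requests * latency_s + nbytes / throughput_bytes_per_s


# Hypothetical dataset: one year of hourly float32 values on a 721 x 1440 grid.
shape = (8760, 721, 1440)
nbytes = array_nbytes(shape, itemsize=4)
print(f"{nbytes / 1e9:.1f} GB uncompressed")

# Hypothetical link: 100 MB/s sustained, 50 ms per request, one request per day of data.
seconds = transfer_seconds(nbytes, 100e6, 0.05, n_requests=365)
print(f"about {seconds / 60:.1f} minutes to read sequentially")
```

If estimates like this feel routine, you have the background this guide assumes.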
Learn about the Data Model and Data Formats
Here is a page I wrote for my course with a quick overview of data models and formats.
Here are some other resources to dig deeper
Learn to Use the Core Libraries
Go through these tutorials
Xarray
- Welcome to the Xarray Tutorial!
- Material from Ryan’s class:
- Xarray Fundamentals — Earth and Environmental Data Science
- Assignment: Xarray Fundamentals with Atmospheric Radiation Data — Earth and Environmental Data Science
- Xarray Interpolation, Groupby, Resample, Rolling, and Coarsen — Earth and Environmental Data Science
- Assignment: More Xarray with El Niño-Southern Oscillation (ENSO) Data — Earth and Environmental Data Science
- Xarray — Pythia Foundations
Dask
Zarr
- Tutorial — zarr 2.16.1 documentation
- Ryan’s Zarr tutorial for OGC Cloud Native Outreach Event
RasterIO and RioXarray
- Python Quickstart — rasterio 1.4dev documentation
- Raster processing using Python Tools: Working with Raster Datasets
- Welcome to rioxarray’s documentation! — rioxarray 0.15.0 documentation
GeoPandas
- Introduction to GeoPandas — GeoPandas documentation
- Geopandas Tutorial — Pangeo Gallery documentation
Understand Advanced Use Cases and Challenges
These are documented on our forum
Explore the more Experimental Libraries and Projects
These are all recent projects that have emerged from this community in response to specific user needs.
Learn about Cloud Storage
Cloud data storage is an area where we are really lacking documentation, tutorials, guides, etc.
Some high level material about “why cloud?” can be found here:
- Cloud-Native Repositories for Big Scientific Data | IEEE Journals & Magazine | IEEE Xplore
- https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2020AV000354
More technical material
- Pangeo and Data — Pangeo documentation (Note that this is pretty old and out of date)
- Big Arrays, Fast: Profiling Cloud Storage Read Throughput — Pangeo Gallery documentation
- The “Zarr in the Cloud” section of Ryan’s Zarr Tutorial has a bit of material about how cloud data access works.
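To give a rough feel for what chunked cloud access means in practice, here is a minimal sketch that works out which chunk objects a reader would need to fetch for a given array selection. It assumes Zarr v2-style chunk keys (chunk grid indices joined by "."), and the helper name `chunk_keys_for_slice` is hypothetical, not part of any library.

```python
from itertools import product


def chunk_keys_for_slice(starts, stops, chunks, sep="."):
    """Keys of the chunks overlapping the half-open index range [start, stop) on each axis."""
    per_axis = []
    for start, stop, chunk_len in zip(starts, stops, chunks):
        first = start // chunk_len        # chunk holding the first selected index
        last = (stop - 1) // chunk_len    # chunk holding the last selected index
        per_axis.append(range(first, last + 1))
    # Every combination of per-axis chunk indices is one object to fetch.
    return [sep.join(str(i) for i in idx) for idx in product(*per_axis)]


# Hypothetical 2D array stored in 100 x 100 chunks; select rows 50..149, columns 0..99.
keys = chunk_keys_for_slice(starts=(50, 0), stops=(150, 100), chunks=(100, 100))
print(keys)
```

Each key corresponds to one object in the bucket, so the number of keys is roughly the number of requests a read will issue. That is why chunk shape matters so much for cloud read performance.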
Clearly this is an area where we have work to do in terms of documenting workflows. Does anyone have any more material they can suggest here?