Strange error using pangeo-forge-recipes/apache beam in parallel

I’m encountering an error that I’m not sure what to do with when running Apache Beam (via pangeo-forge-recipes) on the direct runner with multiprocessing enabled:

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::erase: __pos (which is 16) > this->size() (which is 15)

As far as I can tell, this occurs just before the actual data processing in the pipeline starts.

I’m curious whether anyone has seen this before. A trawl through GitHub, Stack Overflow, etc. hasn’t been very helpful so far.

Some more detail available here: Create example for the UKCEH GEAR-1hrly dataset · Issue #3 · NERC-CEH/object_store_tutorial · GitHub

and the recipe I’m trying to run here: object_store_tutorial/scripts/convert_GEAR_beam.py at GEAR · NERC-CEH/object_store_tutorial · GitHub

Hey @matbro. I do not have a concrete clue about this error, but in my experience with beam debugging (which is just gnarly in general) it might be helpful to look at what step the pipeline was at before this occurs.


Adding on to what @jbusecke said: debugging Beam pipelines can be a real PITA. One trick I’ve found helpful is adding a | beam.Map(print) PTransform after a stage in your pipeline to see what the previous PTransform is outputting. For example:

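import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenWithXarray

# `pattern` is the FilePattern already defined in your recipe
recipe = (
    beam.Create(pattern.items())
    | OpenWithXarray(file_type=pattern.file_type)
    | beam.Map(print)  # print each element emitted by the previous step
)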
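
One caveat with beam.Map(print) is that print returns None, so anything placed after it in the pipeline receives None rather than the original element. A minimal sketch of a pass-through variant is below. The names tap and run_demo are hypothetical helpers made up for illustration, not part of Beam or pangeo-forge-recipes, and the demo pipeline uses plain integers as a stand-in for real recipe elements.

```python
import logging


def tap(label):
    """Return a function that logs each element with a stage label and passes it on unchanged."""

    def _log_and_pass(element):
        # Look the logger up here so the closure carries no module-level state
        logging.getLogger("pipeline_debug").info("[%s] %r", label, element)
        return element

    return _log_and_pass


def run_demo():
    """Small stand-in pipeline; requires apache-beam to be installed."""
    import apache_beam as beam  # imported lazily so tap() is usable without Beam

    logging.basicConfig(level=logging.INFO)
    with beam.Pipeline() as p:
        (
            p
            | "Create" >> beam.Create([1, 2, 3])
            | "TapCreate" >> beam.Map(tap("after Create"))
            | "Square" >> beam.Map(lambda x: x * x)
            | "TapSquare" >> beam.Map(tap("after Square"))
        )
```

Calling run_demo() logs each element twice, once per labelled tap. Because the taps return their input, they can be left between real stages while you narrow down which step runs last before the crash.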

Thanks both!
I’ve narrowed it down to something to do with newer versions of pyarrow (>8.0.1) and/or its dependencies, but other than that, who knows xD
I’ve parked it for now, given that I’ve managed to assemble a working environment with older versions of pyarrow, but it’s something I’m likely to revisit in future when it comes to deploying this workflow/recipe in anger!!
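
For anyone wanting to check whether their environment falls on the affected side, here is a rough sketch that compares the installed pyarrow version against the 8.0.1 figure mentioned above. Treat that threshold as an assumption taken from this thread, not a confirmed boundary. The helper names (version_tuple, is_newer_than, installed_version, check_pyarrow) are made up for illustration.

```python
import re
from importlib import metadata

# Threshold taken from the post above; whether it is the exact boundary is unconfirmed.
LAST_KNOWN_GOOD = "8.0.1"


def version_tuple(version):
    """Turn a version string into a tuple of its leading integer components."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev3"
        parts.append(int(match.group()))
    return tuple(parts)


def is_newer_than(installed, threshold=LAST_KNOWN_GOOD):
    """True if `installed` is strictly newer than `threshold`."""
    return version_tuple(installed) > version_tuple(threshold)


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pyarrow():
    """Describe how the installed pyarrow compares with the threshold."""
    found = installed_version("pyarrow")
    if found is None:
        return "pyarrow is not installed"
    if is_newer_than(found):
        return f"pyarrow {found} is newer than {LAST_KNOWN_GOOD}; the crash may reappear"
    return f"pyarrow {found} is within the range reported to work"
```

Running print(check_pyarrow()) before launching the pipeline gives a quick sanity check. It only reports the version and says nothing about which of pyarrow's dependencies might be involved.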


Sounds good @matbro. I also wanted to flag that we have reorganized bi-weekly working group meetings for PGF. That might be a good spot to talk in more detail!


Amazing, thanks @jbusecke that definitely sounds like something I should be a part of!
I can’t tell which weeks it is happening as the https://pangeo.io/meeting-notes.html#meetings-calendar page seems to be 404ing at the moment - is there another place I can find it?

The meetings will start this week; see the calendar.

The website has been changed, and the meetings page has been moved to Meetings | Pangeo (the calendar is missing, though, I pulled the link out of the old website’s sources – see link to the meetings calendar? · Issue #40 · pangeo-data/pangeo.io · GitHub).

thanks @keewis
Can’t make this week but should be able to going forward
