Geopandas bbox and mask params return empty dataframe (fiona, pyogrio) for file geodatabase

I’m perplexed. I’ve downloaded the zipped file geodatabases for Defra/RPA’s CROME data for years 2016 through 2021. You don’t get a choice in the download format if requesting the whole dataset (ArcGIS Web Application).

I have a geodataframe comprising the field boundaries for a farm (land_covers). I’m using

import geopandas as gp
gp.read_file(pathspec, engine='pyogrio', bbox=tuple(land_covers.total_bounds.tolist()))
# the same as above, also tried with engine='fiona'
# or
gp.read_file(pathspec, engine='fiona', mask=land_covers)

where the pathspec for 2021 is
and for 2020 is

The only, repeat only, difference I can readily see is that within the zipfile for 2021 there are data.gdb and docs directories, whereas for 2020 (and earlier years, in fact) there is only an FGDB.gdb directory. But if I replace the bbox/mask param with a simple rows=10, I get 10 rows. It’s only the bbox and/or mask params that fail to find any data.

For 2021, I get the desired intersection. For 2020, I get

/home/guy/anaconda3/envs/geocube/lib/python3.10/site-packages/pyogrio/ UserWarning: Layer 'b'Crop_Map_Of_England_2020'' does not have any features to read
  result = ogr_read(

All data are in EPSG:27700, as per

from pyogrio import read_info
read_info(pathspec)

and get
{'crs': 'EPSG:27700',
 'encoding': 'UTF-8',
 'fields': array(['prob', 'county', 'cromeid', 'lucode', 'shape_Length',
        'shape_Area'], dtype=object),
 'dtypes': array(['float64', 'object', 'object', 'object', 'float64', 'float64'],
        dtype=object),
 'geometry_type': 'MultiPolygon',
 'features': 31436011,
 'capabilities': {'random_read': 1,
  'fast_set_next_by_index': 1,
  'fast_spatial_filter': 1}}

where you can also see that pyogrio’s read_info function reports that there definitely are features present!

Both the pyogrio and fiona versions I’m using support the bbox param, but only my fiona supports the mask param. Neither works. So I seem to have features present; I definitely have features, since I can read them in and get a geodataframe. They’re in the same CRS as my field geodataframe as well. I’ve tried unzipping and reading from the uncompressed location. I’ve also tried creating a data.gdb symlink pointing to the FGDB.gdb directory, in case something funky is happening with assumed locations within the directory structure.

My point of entry is obviously geopandas, but that relies on either pyogrio or fiona, and neither of them works. I don’t know whether it’s ultimately some lower-level dependency of theirs that’s failing, or whether they’re both failing on the same problem with the data, but I can’t think what to look for next. As I say, I can read in a sample of rows from the data. I can’t think what the data problem might be; the paths within the zip archives changed prior to 2021, but even those earlier years read fine when not using the bbox or mask param, and they’re all in the same CRS, etc.

I know I’m not providing a reproducible example here, but based on the above, where should I focus my attention? I’m in the classic “it should work, but it doesn’t” hole.

I can’t open the zipped file with ogrinfo (part of GDAL/OGR, which all the libraries you mention use under the hood), but it works fine once I unzip the file:

INFO: Open of `FGDB.gdb/'
      using driver `OpenFileGDB' successful.

Layer name: rpa_rpa_admin_rpa_crop_map_of_england_2019
Geometry: Multi Polygon
Feature Count: 31436011
Extent: (82753.600000, 5333.600000) - (655586.312600, 657544.125000)
Layer SRS WKT:
PROJCRS["OSGB36 / British National Grid",
    DATUM["Ordnance Survey of Great Britain 1936",
        ELLIPSOID["Airy 1830",6377563.396,299.3249646]],
    CONVERSION["British National Grid",
        METHOD["Transverse Mercator"],
        PARAMETER["Latitude of natural origin",49],
        PARAMETER["Longitude of natural origin",-2],
        PARAMETER["Scale factor at natural origin",0.9996012717],
        PARAMETER["False easting",400000],
        PARAMETER["False northing",-100000]],
    SCOPE["Engineering survey, topographic mapping."],
    AREA["United Kingdom (UK) - offshore to boundary of UKCS within 49°45'N to 61°N and 9°W to 2°E; onshore Great Britain (England, Wales and Scotland). Isle of Man onshore."]]
Data axis to CRS axis mapping: 1,2
Geometry Column = shape
county: String (200.0)
cromeid: String (16.0)
lucode: String (4.0)
shape_Length: Real (0.0)
shape_Area: Real (0.0)

In Python (which basically wraps the same GDAL/OGR code as above), it’s also possible to read the data.

In [1]: from osgeo import ogr

In [2]: g=ogr.Open("FGDB.gdb") # Opens OK

In [3]: import geopandas as gpd

In [4]: df=gpd.read_file("FGDB.gdb", rows=100)

In [5]: df
Out[5]:
   county  ...                                           geometry
0     BED  ...  MULTIPOLYGON (((507846.313 260038.484, 507806....
1     BED  ...  MULTIPOLYGON (((502206.313 241263.047, 502166....
2     BED  ...  MULTIPOLYGON (((515226.313 239011.391, 515186....
3     BED  ...  MULTIPOLYGON (((511386.313 240743.438, 511346....
4     BED  ...  MULTIPOLYGON (((498726.313 224981.781, 498686....
..    ...  ...                                                ...
95    BED  ...  MULTIPOLYGON (((509466.313 221136.625, 509426....
96    BED  ...  MULTIPOLYGON (((497466.313 251412.875, 497426....
97    BED  ...  MULTIPOLYGON (((504146.313 250027.234, 504126....
98    BED  ...  MULTIPOLYGON (((521286.313 247082.750, 521246....
99    BED  ...  MULTIPOLYGON (((493386.313 257648.250, 493346....

[100 rows x 6 columns]

However, filtering by a bounding box doesn’t work: e.g. ogrinfo -spat xmin ymin xmax ymax returns no hits. Filtering by an attribute such as county works OK, though (well, it takes forever, but you get a sensible answer).
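For reference, that -spat test can be scripted; a sketch (the layer name is taken from the ogrinfo output above, the extent values are illustrative):

```python
# Build the ogrinfo invocation that exercises the spatial filter.
# -ro = read-only, -so = summary only, -spat = bounding-box filter.
xmin, ymin, xmax, ymax = 490000, 220000, 525000, 260000  # illustrative BNG coordinates

cmd = [
    "ogrinfo", "-ro", "-so",
    "-spat", str(xmin), str(ymin), str(xmax), str(ymax),
    "FGDB.gdb", "rpa_rpa_admin_rpa_crop_map_of_england_2019",
]
# import subprocess; subprocess.run(cmd, check=True)
# On the affected years this reports "Feature Count: 0" even though features exist.
```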

Maybe you can split by e.g. county, and convert into a SpatiaLite/GeoPackage file? Those formats are better supported than File Geodatabase.
(Hint: I did that to extract a geojson over some area of interest from that dataset)
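A minimal sketch of that split-and-convert approach, delegating the work to ogr2ogr (the layer and field names come from the ogrinfo output above; the output filename is made up):

```python
import subprocess  # used when you actually run the command


def split_county_to_gpkg(src_gdb, layer, county, dst_gpkg):
    """Build the ogr2ogr command that extracts one county from the
    FileGDB layer into a GeoPackage. Run it with subprocess.run(cmd, check=True)."""
    cmd = [
        "ogr2ogr", "-f", "GPKG", dst_gpkg, src_gdb, layer,
        "-where", f"county = '{county}'",  # attribute filter, avoids the broken spatial index
    ]
    return cmd


# Example (hypothetical output name):
# cmd = split_county_to_gpkg("FGDB.gdb",
#                            "rpa_rpa_admin_rpa_crop_map_of_england_2019",
#                            "BED", "crome_2019_BED.gpkg")
# subprocess.run(cmd, check=True)
```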

There are some comments (and examples of conversion etc) in the relevant GDAL driver page.

Is File Geodatabase the only format you can download them in? I know they changed/updated the platform through which you download them, but I’ve worked with the CROME datasets for the past few years now without issues, downloading them as geopackages instead - using geopandas!

Guy, are there any sort of weird geometries/mixtures in your new one that could be causing your problem - e.g. with your bbox, does a random small box anywhere else in the dataset also fail?

Thanks @kieransambajpai ,

Yeah, they do like to change things. :wink: I remember they used to provide regional collections, but then switched to county by county. When downloading county by county (tedious, much), you get a choice of several download formats. When requesting the whole-country data (or defining a simple polygon, or uploading one as a zipped shapefile), the only available download option is File Geodatabase.

If you’re downloading as geopackages, I’m guessing you’re doing it on a county by county basis?

Thanks for looking at this @Jose ,

You gave me some confirmation and a couple of ideas. I think you’re saying that you found ogrinfo failed with the -spat parameter in effect even on the unzipped file geodatabase. I checked and concur: after unzipping, the same xmin, ymin etc. that produced over 3000 features for 2021 reported 0 for 2020 and earlier.

I don’t know much about the file geodatabase format, other than that it’s ESRI’s and has been reverse engineered for the performant open-source reader. Initial digging and my gut suggest the .spx files contain the spatial index, so if there’s a problem with that, any software that queries it to check whether features intersect a region of interest will fail. All I can say is that there are a00000004.spx and a00000009.spx files in each; the latter is large and the same size in 2021 and 2019, for example, but a00000004.spx is about a third of the size in 2019 compared to 2021. I don’t know what they should be, but I smell something there.
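Not an answer to what the .spx files should contain, but a quick way to compare the spatial-index files between years is to list their sizes. A standard-library sketch, assuming (as here) that the .spx files sit directly inside the .gdb directory:

```python
from pathlib import Path


def spx_sizes(gdb_dir):
    """Map each .spx spatial-index file in a FileGDB directory to its size in bytes."""
    return {p.name: p.stat().st_size for p in sorted(Path(gdb_dir).glob("*.spx"))}


# Example (hypothetical paths):
# print(spx_sizes("CROME_2019/FGDB.gdb"))
# print(spx_sizes("CROME_2021/data.gdb"))
```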

So I used ogr2ogr to convert the downloaded file geodatabases to geopackages. I didn’t split the data at all; the result is about 9 GB per year, but pah. The spatial mask and bbox params now work against the geopackage data, and the result comes back super fast as well. I can only conclude there’s some issue with the spatial index data in years prior to 2021 that causes software relying on that information to misread the spatial bounds of the data.
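For completeness, the conversion described above is a one-liner with ogr2ogr, shown here built as a Python command list (the output filename is made up); writing to GeoPackage makes GDAL build a fresh spatial index, after which bbox/mask behave as expected:

```python
# Convert the whole FileGDB to a GeoPackage; the spatial index is rebuilt on the way.
cmd = ["ogr2ogr", "-f", "GPKG", "crome_2020.gpkg", "FGDB.gdb"]
# import subprocess; subprocess.run(cmd, check=True)

# Afterwards the original read works, e.g. (assuming land_covers as in the question):
# import geopandas as gp
# gp.read_file("crome_2020.gpkg", engine="pyogrio",
#              bbox=tuple(land_covers.total_bounds))
```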