I’ve been thinking a lot recently about how none of the existing data catalog offerings really do everything I would want them to.
So I wrote a long blog post about how what science needs is a social network for sharing big data.
One thing the post gets at is that providing a decentralized global subscribable data catalog is fundamentally a network protocol problem, somewhat similar to RSS.
The social network analogy is particularly generative here: because the desired network structure is similar to that of federated social media, the protocol I want would be structurally very similar to the protocols underlying attempts to decentralize social media platforms. I therefore think it might well be possible to build what I’m suggesting by piggybacking off of Bluesky’s AT Protocol or the Fediverse/Mastodon’s ActivityPub protocol.
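To make the "subscribable catalog entry" idea concrete, here's a rough sketch in Python. This is purely illustrative: the record type, lexicon-style identifier, and every field name are my invention (by analogy with AT Protocol records like `app.bsky.feed.post`), not part of any existing protocol or lexicon.

```python
# Hypothetical sketch: a dataset announcement modelled as an
# AT-Protocol-style record that a federated registry could index
# and that followers of this account would receive in their feed.
# The "$type" NSID and all field names below are invented.

def make_dataset_record(name, description, storage_url, license_id):
    """Build a catalog entry for a federated data registry."""
    return {
        # Invented lexicon identifier, analogous to app.bsky.feed.post
        "$type": "org.example.catalog.dataset",
        "name": name,
        "description": description,
        # Pointer to the actual bytes, e.g. an S3 bucket / Icechunk store
        "storageUrl": storage_url,
        "license": license_id,
    }

record = make_dataset_record(
    "era5-subset",
    "Hourly 2m temperature over Europe, 1990-2020",
    "s3://example-bucket/era5-subset/",
    "CC-BY-4.0",
)
print(record["$type"])
```

The point is just that the catalog entry is a small structured record pointing at external storage; the heavy data never travels over the social protocol itself.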
Wouldn’t a marketplace, one that would eventually become the de facto place to post datasets, be a more suitable analogy for this problem, though?
A social media analogy focuses on the utilization of the data, the scientific background, the variety of use cases… I’m thinking of ResearchGate.
A marketplace analogy is all about data availability, access, type & format, costs, licensing, …
As a data consumer for my application, I want to go to one place to find out what I want or need (social stuff), then go to another place and obtain it (marketplace stuff).
As a data provider, I set up sensors, data loggers, licensing, etc., and publish my APIs to the Marketplace. I may want to provide scientific evidence about why this niche “new wavelength measurement” has potential or not (social stuff); or perhaps the datasets I provide are standard, and all I am offering is higher resolution or the like (marketplace stuff).
That being said, the Marketplace would enforce or suggest specifications for the published datasets.
I dunno… just sharing thoughts. What do you think?
That’s a great question, but I think in terms of technical stack, the only difference between what I’m describing and a marketplace is an intermediate layer between the catalog registry and the storage layer, one that generates access credentials for the linked S3 bucket if and only if you pay for them.
That’s very similar to what @jedsundwall originally suggested with Source Coop, but actually with these separable components (storage on S3, version-control via Icechunk, pay-for-access-credentials authentication layer, federated registry, discoverability via separate search UIs) you could build all sorts of business models or free services.
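To sketch what that pay-for-access layer might look like: in an AWS setting, after confirming payment you could mint temporary, scoped-down credentials (e.g. via STS AssumeRole with a session policy) that grant read-only access to just that dataset's prefix. The bucket name, prefix, and helper below are hypothetical; this only builds the policy document, the piece that would be passed to STS.

```python
import json

def build_read_policy(bucket, prefix):
    """Scoped-down session policy granting read-only access to one
    dataset prefix. In a real pay-for-access layer you would pass this
    as the Policy argument to an STS AssumeRole call (e.g. via boto3)
    after verifying payment, then hand the resulting temporary
    credentials back to the buyer."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow downloading objects under this dataset's prefix only
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # Allow listing, but only within that prefix
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }

policy = build_read_policy("example-datasets", "era5-subset")
print(json.dumps(policy, indent=2))
```

Because this layer sits between the registry and the storage, swapping it out (or omitting it entirely) is what turns a paid marketplace into a free catalog, without touching the other components.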
> enforce or suggest specifications
I don’t think anything should be enforced beyond the requirements in the post, because:

1. As soon as you enforce anything, you raise the barrier to entry, reducing adoption;
2. any enforcement will inevitably bake in assumptions that seem reasonable in your field but aren’t meetable in general, so you end up making the whole thing less general.
Note that GitHub enforces nothing, not even having a license or a README (though it does very strongly suggest them). It doesn’t try to force you to use pyproject.toml for a Python project or anything like that; it leaves that entirely up to the Python community.
I think every type of quality control and metadata standardization should similarly be left up to the relevant community. A layered architecture facilitates this - for example, you could create a public catalog UI that only displays a dataset if its metadata matches some community-standardized schema. That would incentivise data providers to make their metadata compliant, but not block them from sharing it if they don’t.
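A minimal sketch of that "display only if compliant" filter, with everything made up: the schema, field names, and catalog shape are invented for illustration, and the hand-rolled type check stands in for what would more realistically be a JSON Schema validation.

```python
# Illustrative only: a catalog UI filter that shows a dataset's
# listing only if its metadata satisfies a community schema.
# A real implementation would likely validate against a proper
# JSON Schema instead of this simplified field -> type mapping.

COMMUNITY_SCHEMA = {  # required field -> expected type (invented)
    "title": str,
    "license": str,
    "variables": list,
}

def is_compliant(metadata):
    """True if metadata has every required field with the right type."""
    return all(
        isinstance(metadata.get(field), expected)
        for field, expected in COMMUNITY_SCHEMA.items()
    )

def visible_datasets(catalog):
    """Filter a registry feed down to entries the UI will display.
    Non-compliant datasets stay in the registry and remain shareable;
    they just aren't surfaced by this particular UI."""
    return [d for d in catalog if is_compliant(d["metadata"])]

catalog = [
    {"id": "a", "metadata": {"title": "SST", "license": "CC0",
                             "variables": ["sst"]}},
    {"id": "b", "metadata": {"title": "Untagged dump"}},  # missing fields
]
print([d["id"] for d in visible_datasets(catalog)])
```

The incentive structure lives entirely in the UI layer: the registry accepts everything, and each community decides how strict its own front end is.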