I’ve been thinking a lot recently about how none of the existing data catalog offerings really do everything I would want them to.
So I wrote a long blog post about how what science needs is a social network for sharing big data.
One thing the post gets at is that providing a decentralized global subscribable data catalog is fundamentally a network protocol problem, somewhat similar to RSS.
The social network analogy is particularly generative here: because the desired network structure is similar to that of federated social media, the protocol I want would be structurally very similar to the protocols underlying attempts to decentralize social media platforms. I therefore think it might well be possible to build what I’m suggesting by piggybacking off of Bluesky’s AT Protocol or the Fediverse/Mastodon’s ActivityPub protocol.
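To make the "subscribable catalog entry" idea concrete, here's a rough sketch in Python. This is purely illustrative: the record type, lexicon-style identifier, and every field name are my invention (by analogy with AT Protocol records like `app.bsky.feed.post`), not part of any existing protocol or lexicon.

```python
# Hypothetical sketch: a dataset announcement modelled as an
# AT-Protocol-style record that a federated registry could index
# and that followers of this account would receive in their feed.
# The "$type" NSID and all field names below are invented.

def make_dataset_record(name, description, storage_url, license_id):
    """Build a catalog entry for a federated data registry."""
    return {
        # Invented lexicon identifier, analogous to app.bsky.feed.post
        "$type": "org.example.catalog.dataset",
        "name": name,
        "description": description,
        # Pointer to the actual bytes, e.g. an S3 bucket / Icechunk store
        "storageUrl": storage_url,
        "license": license_id,
    }

record = make_dataset_record(
    "era5-subset",
    "Hourly 2m temperature over Europe, 1990-2020",
    "s3://example-bucket/era5-subset/",
    "CC-BY-4.0",
)
print(record["$type"])
```

The point is just that the catalog entry is a small structured record pointing at external storage; the heavy data never travels over the social protocol itself.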
Wouldn’t a marketplace, one that would eventually become the de facto place to post datasets, be a more suitable analogy for this problem, though?
A social media analogy focuses on the utilization of the data, the scientific background, the variety of use cases… I’m thinking of ResearchGate.
A marketplace analogy is all about data availability, access, type & format, costs, licensing, …
As a data consumer for my application, I want to go to one place to find out what I want or need (social stuff), then go to another place and obtain it (marketplace stuff).
As a data provider, I set up sensors, data loggers, licensing, etc., and publish my APIs to the Marketplace. I may want to provide scientific evidence about why this niche “new wavelength measurement” has potential or not (social stuff); or perhaps the datasets I provide are standard, and all I am offering is higher resolution or the like (marketplace stuff).
That being said, the Marketplace would enforce or suggest specifications for the published datasets.
I dunno… just sharing thoughts. What do you think?
That’s a great question, but I think in terms of technical stack, the only difference between what I’m describing and a marketplace is an intermediate layer between the catalog registry and the storage layer, one that generates access credentials for the linked S3 bucket if and only if you pay for them.
That’s very similar to what @jedsundwall originally suggested with Source Coop, but actually with these separable components (storage on S3, version-control via Icechunk, pay-for-access-credentials authentication layer, federated registry, discoverability via separate search UIs) you could build all sorts of business models or free services.
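To sketch what that pay-for-access layer might look like: in an AWS setting, after confirming payment you could mint temporary, scoped-down credentials (e.g. via STS AssumeRole with a session policy) that grant read-only access to just that dataset's prefix. The bucket name, prefix, and helper below are hypothetical; this only builds the policy document, the piece that would be passed to STS.

```python
import json

def build_read_policy(bucket, prefix):
    """Scoped-down session policy granting read-only access to one
    dataset prefix. In a real pay-for-access layer you would pass this
    as the Policy argument to an STS AssumeRole call (e.g. via boto3)
    after verifying payment, then hand the resulting temporary
    credentials back to the buyer."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow downloading objects under this dataset's prefix only
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # Allow listing, but only within that prefix
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }

policy = build_read_policy("example-datasets", "era5-subset")
print(json.dumps(policy, indent=2))
```

Because this layer sits between the registry and the storage, swapping it out (or omitting it entirely) is what turns a paid marketplace into a free catalog, without touching the other components.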
> enforce or suggest specifications
I don’t think anything should be enforced beyond the requirements in the post, because:

1. As soon as you enforce anything, you raise the barrier to entry, reducing adoption;
2. any enforcement will inevitably bake in assumptions that seem reasonable in your field but aren’t meetable in general, so you end up making the whole thing less general.
Note that GitHub enforces nothing, not even having a license or a README (though it does very strongly suggest them). It doesn’t try to force you to use pyproject.toml for a Python project or anything like that; it leaves that entirely up to the Python community.
I think every type of quality control and metadata standardization should similarly be left up to the relevant community. A layered architecture facilitates this - for example, you could create a public catalog UI that only displays a dataset if its metadata matches some community-standardized schema. That would incentivise data providers to make their metadata compliant, but not block them from sharing it if they don’t.
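A minimal sketch of that "display only if compliant" filter, with everything made up: the schema, field names, and catalog shape are invented for illustration, and the hand-rolled type check stands in for what would more realistically be a JSON Schema validation.

```python
# Illustrative only: a catalog UI filter that shows a dataset's
# listing only if its metadata satisfies a community schema.
# A real implementation would likely validate against a proper
# JSON Schema instead of this simplified field -> type mapping.

COMMUNITY_SCHEMA = {  # required field -> expected type (invented)
    "title": str,
    "license": str,
    "variables": list,
}

def is_compliant(metadata):
    """True if metadata has every required field with the right type."""
    return all(
        isinstance(metadata.get(field), expected)
        for field, expected in COMMUNITY_SCHEMA.items()
    )

def visible_datasets(catalog):
    """Filter a registry feed down to entries the UI will display.
    Non-compliant datasets stay in the registry and remain shareable;
    they just aren't surfaced by this particular UI."""
    return [d for d in catalog if is_compliant(d["metadata"])]

catalog = [
    {"id": "a", "metadata": {"title": "SST", "license": "CC0",
                             "variables": ["sst"]}},
    {"id": "b", "metadata": {"title": "Untagged dump"}},  # missing fields
]
print([d["id"] for d in visible_datasets(catalog)])
```

The incentive structure lives entirely in the UI layer: the registry accepts everything, and each community decides how strict its own front end is.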