Are free-form metadata searches useful? #5

jonmjoyce · 2022-08-08T23:33:01Z

jonmjoyce
Aug 8, 2022
Maintainer

There's been some debate about whether search-engine-supported metadata is a useful feature for the community. It seems like most data consumers already know which dataset and fields they want. In a machine-to-machine application, search is unlikely to be used. A lot of effort can go into creating data portals, and they are seldom able to serve all constituents well.

What do you think? Should we consider search an important feature to explore in these prototypes?

dpsnowden · 2023-06-10T19:49:15Z

dpsnowden
Jun 10, 2023

This is such a great question. I'd start out by challenging one of your assumptions: "...most data consumers tend to know which dataset and fields they want". I don't believe that's broadly true. It's frustrating to me to hear that people still struggle to find data, and how often one hears FAIR spoken about further confirms that. I believe we still have a data discovery problem. However, I don't believe that a network of portals is the solution (even our IOOS portal at data.ioos.us, which happens to be down today, perhaps further confirms my point). The APIs don't give results as helpful as they need to be, and each portal adds to the number of websites users need to remember or scour when looking for data. I still hold out some hope that using curated catalogs like ours as a means of reaching Google might help. But I've never seen a thorough study on data discovery, so I don't even know if that's true.

To understand what we can usefully accomplish in this project, we must be more specific about the user or user persona we're trying to serve. You've mentioned STAC in some of your architecture docs. I don't know how mature that technology is or whether it's superior to the OGC CS/W or CKAN API search we've pursued for years in IOOS. You're also researching intake catalogs. Again, I'm very curious to know how mature or useful they are.

This question also relates to #3. As ChatGPT and other AI tools have exploded onto the scene, we can't discount the need for open descriptions of our data that Large Language Models can parse and analyze. I recently had a conversation with an AI researcher who tried to convince me that structured data models for representing semantic understanding of data will soon be a thing of the past: LLMs will be able to parse all the meaning they need. He further told me to use GitHub to publish all our data (descriptions) because it is rapidly becoming the most open and ubiquitous interface for developers creating tomorrow's AI/ML algorithms. Maybe we should just publish all our metadata as markdown files on GitHub?

Sorry, this was more of a rabbit hole than I intended. For this project, I think we should focus on a particular user persona and get a little more specific about their discovery needs. Given the architecture we're discussing, I think the user persona is a relatively tech-savvy scientist comfortable with Python and JupyterHub. Then we can dive into STAC or intake for a couple of examples.
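
To make the two access patterns in this thread concrete, here is a minimal sketch in plain Python (standard library only, deliberately not STAC or intake). It contrasts a direct lookup, where the consumer already knows the dataset ID, with a free-form search over metadata text. Everything in it is an assumption for illustration: the `DatasetRecord` fields, the `CATALOG` entries, and the `get_dataset` and `search` helpers are hypothetical and do not describe any real IOOS catalog or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; the field names are illustrative only."""

    dataset_id: str
    title: str
    description: str
    keywords: list[str] = field(default_factory=list)

    def searchable_text(self) -> str:
        # Concatenate all free-text fields into one lowercase string.
        return " ".join([self.title, self.description, *self.keywords]).lower()


# Made-up catalog entries (not real datasets), keyed by dataset ID.
CATALOG = {
    "wave-buoy-demo": DatasetRecord(
        "wave-buoy-demo",
        "Demo wave buoy observations",
        "Significant wave height and period from a moored buoy",
        ["waves", "buoy"],
    ),
    "sst-model-demo": DatasetRecord(
        "sst-model-demo",
        "Demo sea surface temperature forecast",
        "Gridded model output of sea surface temperature",
        ["temperature", "model"],
    ),
    "glider-demo": DatasetRecord(
        "glider-demo",
        "Demo glider profiles",
        "Temperature and salinity profiles from an underwater glider",
        ["glider", "temperature", "salinity"],
    ),
}


def get_dataset(dataset_id: str) -> DatasetRecord:
    """Direct lookup: the caller already knows which dataset they want."""
    return CATALOG[dataset_id]


def search(query: str) -> list[DatasetRecord]:
    """Free-form search: records whose text contains every query term."""
    terms = query.lower().split()
    return [
        record
        for record in CATALOG.values()
        if all(term in record.searchable_text() for term in terms)
    ]


if __name__ == "__main__":
    print(get_dataset("glider-demo").title)
    print([record.dataset_id for record in search("temperature")])
```

The substring matching here is intentionally naive. A real catalog would need ranking, synonyms, and spatial or temporal filters, which is where a persona-specific look at STAC or intake would come in.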
