[infer] Goals and implementations discussion #721
Replies: 12 comments
-
@rufuspollock
It should be mentioned that high-level requirements for the infer operation could differ:
E.g. a column with 99 integers and 1 string will be a string in the first case and an integer in the second. @Stephen-Gates @loleg (cc - an interesting discussion)
-
Definitely a supporter of
-
I think the best way to handle these high-level requirements (100% type compatibility or not) is to return a set of types and probabilities, as Rufus suggested in the first message. Something like:

    result = infer(['100', '200', '300', 'a'])
    # [
    #   { 'type': 'string', 'confidence': '1' },
    #   { 'type': 'integer', 'confidence': '0.75' },
    # ]

Maybe it doesn't make sense to even include
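A minimal sketch of that idea, assuming only three candidate types and Python's built-in `int`/`float` parsing as the checks. The names `CHECKERS`, `infer` and `pick_type` are hypothetical and not taken from any Frictionless library, and confidences are returned as floats rather than strings:

```python
from typing import Callable, Dict, Iterable, List


def _parses(cast: Callable[[str], object], value: str) -> bool:
    """Return True if `cast(value)` succeeds."""
    try:
        cast(value)
        return True
    except ValueError:
        return False


# Candidate types, most specific first (an assumption of this sketch).
CHECKERS: Dict[str, Callable[[str], bool]] = {
    "integer": lambda value: _parses(int, value),
    "number": lambda value: _parses(float, value),
    "string": lambda value: True,
}


def infer(values: Iterable[str]) -> List[dict]:
    """Return every candidate type with the share of values it accepts."""
    values = list(values)
    if not values:
        return []
    results = [
        {"type": name, "confidence": sum(check(v) for v in values) / len(values)}
        for name, check in CHECKERS.items()
    ]
    # Highest confidence first; the sort is stable, so ties keep CHECKERS order.
    results.sort(key=lambda item: item["confidence"], reverse=True)
    return results


def pick_type(results: List[dict], min_confidence: float = 1.0) -> str:
    """Pick the most specific type whose confidence reaches the threshold."""
    confidence = {item["type"]: item["confidence"] for item in results}
    for name in CHECKERS:
        if confidence.get(name, 0.0) >= min_confidence:
            return name
    return "string"
```

With this shape the client decides how strict to be: `pick_type(infer(column))` demands 100% compatibility, while `pick_type(infer(column), min_confidence=0.95)` would treat a column of 99 integers and 1 string as an integer column.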
-
But I think it's kind of simple to provide a way for a client to choose on this. E.g. I think libs like
-
For example in this code: accepting With
-
@vitorbaptista @roll so do you have a suggestion on the signature of an infer function e.g.
Or should the sample length be merged into the iterator in some way (i.e. you pass an iterator with only 100 or 1m items)? Should the infer method be more like chardet and be usable incrementally, e.g.
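One possible shape for an incremental interface, sketched under assumptions: the class `IncrementalInferrer` and its `feed`/`result` members are hypothetical names, loosely modelled on the feed-then-read-result pattern of chardet's `UniversalDetector`. It applies the strict rule (a type holds only if every value seen so far fits it) and covers only integer, number and string:

```python
from itertools import islice
from typing import Callable, Dict, Optional


def _parses(cast: Callable[[str], object], value: str) -> bool:
    """Return True if `cast(value)` succeeds."""
    try:
        cast(value)
        return True
    except ValueError:
        return False


class IncrementalInferrer:
    """Hypothetical incremental inferrer; feed values one at a time."""

    def __init__(self) -> None:
        self.seen = 0
        self.matches: Dict[str, int] = {"integer": 0, "number": 0}

    def feed(self, value: str) -> None:
        """Update the per-type match counters with one more value."""
        self.seen += 1
        if _parses(int, value):
            self.matches["integer"] += 1
        if _parses(float, value):
            self.matches["number"] += 1

    @property
    def result(self) -> Optional[str]:
        """Narrowest type that fits every value fed so far (None if empty)."""
        if self.seen == 0:
            return None
        for name in ("integer", "number"):
            if self.matches[name] == self.seen:
                return name
        return "string"


# The sample length stays outside the inferrer: the caller limits the iterator.
values = iter(["1", "2", "3.5", "oops"])
inferrer = IncrementalInferrer()
for value in islice(values, 3):  # only look at the first 3 items
    inferrer.feed(value)
sampled_type = inferrer.result
```

Keeping the sample size in the iterator (here via `itertools.islice`) means the inferrer needs no `sample` parameter, and the caller can read `result` at any point while feeding.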
-
@rufuspollock
-
@roll @lauragift21 @lwinfree I feel this is the kind of thing that merits a page somewhere - in some sense it could go on specs as some kind of protocol (but I don't think that is a great match) or it could go on the main site - maybe in the guide or even in its own page under "jobs". Any thoughts?
-
For me, I would rather get back to the discussion of adding a Data Quality Spec to the specs in some more general form than it is now, which is related to https://github.com/frictionlessdata/project/issues/459
-
@roll it's partly implementation, but I think it is actually something people need to do quite a bit and it would be nice to get the key analysis points recorded somewhere - maybe this should just move to the forum for now (there's no explicit task really atm).
-
It can be a research piece or a blog post, I think. Honestly, our tools use quite primitive approaches, at least regarding performance. So someone coming up with better algorithms would really help.
-
@roll & @sapetti9 it might be nice to have this as part of the Q&A here https://framework.frictionlessdata.io/docs/faq (related to the current question on Discord about infer: https://discord.com/channels/695635777199145130/806182489865060423/956942072038965268)
-
This is a discussion issue about infer implementations in the context of Frictionless Data.
infer is an operator that takes a set of values (or rows = arrays of values) and infers type information. Usually the focus is on basic types, e.g. integer, number, string, etc., but it could be widened to semantic types.
Example:
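A small sketch of the row-based form of the operator, assuming rows of equal length holding string cells and only the three basic types named above. `infer_value`, `infer_rows` and `WIDENING_ORDER` are hypothetical names for illustration, not the API of any Frictionless library:

```python
from typing import List, Sequence

# Types ordered from narrowest to widest (an assumption of this sketch).
WIDENING_ORDER = ["integer", "number", "string"]


def infer_value(value: str) -> str:
    """Return the narrowest basic type that a single cell parses as."""
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            cast(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_rows(rows: Sequence[Sequence[str]]) -> List[str]:
    """Infer one type per column: the widest type needed by any cell in it."""
    columns = zip(*rows)  # transpose rows into columns
    return [
        max((infer_value(cell) for cell in column), key=WIDENING_ORDER.index)
        for column in columns
    ]


rows = [
    ["1", "2.5", "alice"],
    ["2", "3", "bob"],
]
column_types = infer_rows(rows)
```

Here a column is typed by the widest type any of its cells requires, so a single non-numeric cell turns the whole column into a string column.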
Tasks
Notes
Semantic type inference
Originally from this HackMD by @pwalsh: https://hackmd.io/KwIwHAzAjAbDAMBaATAY1ogLMzJEE4ATGfLMTbAQwFNV8ZQg?both
Context
Frictionless Data
The Table Schema Frictionless Data libraries do some basic type inference. The Table Schema spec has "types" and "formats". The Python library infers types only, and the JavaScript library infers types and most formats too.
The algorithms are super simple.
Some of what is on offer here is part of what is referred to in the Twitter conversation, e.g. inferring email addresses as a format of the string type (in Table Schema).
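As an illustration of format inference on top of type inference, here is a sketch that labels a string column with the Table Schema `email` format when every value looks like an email address, and `default` otherwise. The function name `infer_string_format` and the regular expression are assumptions of this sketch; the pattern is deliberately loose and is not a full email validator:

```python
import re
from typing import Iterable

# Loose shape check: something@something.something, no spaces or extra "@".
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def infer_string_format(values: Iterable[str]) -> str:
    """Return "email" if every value matches the pattern, else "default"."""
    values = list(values)
    if values and all(EMAIL_PATTERN.fullmatch(value) for value in values):
        return "email"
    return "default"


contact_format = infer_string_format(["a@example.com", "b@example.org"])
mixed_format = infer_string_format(["a@example.com", "not an email"])
```

The same pattern could be extended to other string formats by adding further checks before the `default` fallback.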
Use cases
Defining Dedupe Models
For dedupe users, making the data model for comparing records is their hardest task. For this task the user
From our experience, a handful of types of information are very likely to be good candidates for a dedupe data model: names, postal addresses, email addresses. If we could infer the semantics of a field, we could make good recommendations to the user for how to compare their data (both which fields to use and how to compare them).
Related work
Schema matching
Last year DataMade wrote a whitepaper on the related problem of schema matching for columnar data. While the problems of semantic typing and schema matching are not the same, many of the techniques that could be used for schema matching can be used, to good effect, for semantic typing.