[infer] Goals and implementations discussion #721

rufuspollock · 2018-02-21T05:15:17Z

rufuspollock
Feb 21, 2018
Maintainer

This is a discussion issue about infer implementations in the context of Frictionless Data.

infer is an operator that takes a set of values (or rows, i.e. arrays of values) and infers type information. The focus is usually on basic types, e.g. integer, number, string etc, but it could be widened to semantic types.

def infer(values):
    # values is a list such as [value1, value2]
    return type  # or could be a set of probabilities for types

# can have a set of rows instead - where a row is a set of values
# this may be more complex than simply doing infer on each column if we believe there is a connection between columns
def infer(rows):
    # rows is a list such as [row1, row2]
    # array of types corresponding to columns
    return [types]

Example:

infer(['10', '12', '15', '2017']) => integer

infer(['1999', '2000', '2001', '2002']) => year
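
A minimal sketch of how the two examples above could come out of a single function, assuming every value arrives as a string and that candidate types are tried from most to least specific. The type list, the 4-digit rule for year and the helper names are assumptions for illustration, not the behaviour of any tableschema library.

```python
import re


def _is_float(value):
    """True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


# Hypothetical candidate types, ordered from most to least specific.
# Treating any 4-digit string as a year is an assumption of this sketch.
CHECKS = [
    ("year", lambda v: re.fullmatch(r"\d{4}", v) is not None),
    ("integer", lambda v: re.fullmatch(r"[+-]?\d+", v) is not None),
    ("number", _is_float),
    ("string", lambda v: True),
]


def infer(values):
    """Return the most specific type that every value satisfies."""
    if not values:
        return "string"  # nothing to learn from an empty column
    for name, check in CHECKS:
        if all(check(v) for v in values):
            return name
    return "string"


print(infer(["10", "12", "15", "2017"]))        # integer
print(infer(["1999", "2000", "2001", "2002"]))  # year
```

The ordering of the candidates is what encodes the inclusion between types: a column only falls through to a broader type when a narrower one rejects at least one value.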

Tasks

Define needs
Research existing work, including implementations within FD and outside
- infer in tableschema-py, tableschema-js, tableschema-go etc
Test data
Outline Algorithms

Notes

casting and inferring have much in common
set of types would come from TableSchema or elsewhere
Need to think about inclusions, e.g. integer is a refinement of number
Handling Nulls: e.g. '-'
https://twitter.com/forestgregg/status/927938008497745926 - thread by

Semantic type inference

Originally from this hackmd by @pwalsh: https://hackmd.io/KwIwHAzAjAbDAMBaATAY1ogLMzJEE4ATGfLMTbAQwFNV8ZQg?both

Context

Frictionless Data

The Table Schema Frictionless Data libraries do some basic type inference. The Table Schema spec has "types" and "formats". The Python library infers types only, and the JavaScript library infers types and most formats too.

The algorithms are super simple.

Some of what is on offer here is part of what is referred to in the Twitter conversation, e.g. inferring email addresses as a format of the string type (in Table Schema).

Use cases

Defining Dedupe Models

For dedupe users, making the data model for comparing records is their hardest task. For this task the user needs to

know which fields will be useful to compare
decide how to compare each of those fields

From our experience, a handful of types of information are very likely to be good candidates for a dedupe data model: names, postal addresses, email addresses. If we could infer the semantics of a field, we could make good recommendations to the user for how to compare their data (both which fields to use and how to compare them).

Related work

Schema matching

Last year DataMade wrote a whitepaper on the related problem of schema matching for columnar data. While the problems of semantic typing and schema matching are not the same, many of the techniques that could be used for schema matching can be used, to good effect, for semantic typing.

roll · 2018-02-21T06:57:04Z

roll
Feb 21, 2018
Maintainer

@rufuspollock
The latest iteration of infer in the FD core libs is tableschema.js:infer:

a fast probabilistic algorithm
ability to infer formats like email etc

It should be mentioned that high-level requirements for the infer operation could differ:

getting 100% type-value compatibility (greatest common denominator)
getting the most reasonable result assuming that there are outliers in the values to fix

E.g. a column with 99 integers and 1 string will be a string in the first case and an integer in the second.
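
To make the difference concrete, here is a rough sketch of the two strategies on exactly that column. The function names are hypothetical, only integer and string are distinguished to keep it short, and the column is assumed to be non-empty.

```python
from collections import Counter


def guess_type(value):
    """Classify one string value as 'integer' or 'string' (sketch only)."""
    try:
        int(value)
        return "integer"
    except ValueError:
        return "string"


def infer_strict(values):
    """Every value must fit: a single non-integer demotes the column to string."""
    if all(guess_type(v) == "integer" for v in values):
        return "integer"
    return "string"


def infer_tolerant(values):
    """Pick the most common per-value type and treat the rest as outliers."""
    return Counter(guess_type(v) for v in values).most_common(1)[0][0]


column = [str(n) for n in range(99)] + ["n/a"]
print(infer_strict(column))    # string
print(infer_tolerant(column))  # integer
```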

@Stephen-Gates @loleg (cc - an interesting discussion)

Stephen-Gates · 2018-02-21T07:07:53Z

Stephen-Gates
Feb 21, 2018

Definitely a supporter of

getting the most reasonable result assuming that there are outliers in the values to fix

vitorbaptista · 2018-02-21T14:20:57Z

vitorbaptista
Feb 21, 2018

I think the best way to handle these high-level requirements (100% type compatibility or not) is to return a set of types and probabilities, as suggested by Rufus in the first message. Something like:

result = infer(['100', '200', '300', 'a'])
# [
#   { 'type': 'string', 'confidence': 1 },
#   { 'type': 'integer', 'confidence': 0.75 },
# ]

Maybe it doesn't make sense to even include string, as anything can be a string anyway. This return type could be defined in a "base" infer method, with other, more user-friendly methods that use it and already pick the best option (maybe falling back to string if no option is above a certain confidence threshold).
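
A possible shape for that split, as a sketch only: `infer_candidates` is the "base" method returning types with confidences, and `infer_best` is the friendlier wrapper with a string fallback. Both names, the candidate type list and the default threshold are assumptions for illustration.

```python
def _casts(value, caster):
    """True if `caster` accepts the string value."""
    try:
        caster(value)
        return True
    except ValueError:
        return False


# Hypothetical candidate types; "string" is left out on purpose
# because every value can be read as a string.
CASTERS = {"integer": int, "number": float}


def infer_candidates(values):
    """Return [{'type', 'confidence'}], highest confidence first."""
    total = len(values)
    results = []
    for name, caster in CASTERS.items():
        hits = sum(_casts(v, caster) for v in values)
        if hits:
            results.append({"type": name, "confidence": hits / total})
    # sorted() is stable, so ties keep the CASTERS order (narrower type first)
    return sorted(results, key=lambda r: r["confidence"], reverse=True)


def infer_best(values, threshold=0.9):
    """Pick the top candidate, falling back to 'string' below the threshold."""
    candidates = infer_candidates(values)
    if candidates and candidates[0]["confidence"] >= threshold:
        return candidates[0]["type"]
    return "string"


values = ["100", "200", "300", "a"]
print(infer_candidates(values))
print(infer_best(values))                 # string
print(infer_best(values, threshold=0.7))  # integer
```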

roll · 2018-02-22T09:17:43Z

roll
Feb 22, 2018
Maintainer

@Stephen-Gates

Definitely a supporter of

But I think it's kind of simple to provide a way for a client to choose on this. E.g. infer(confidence=100) could transform a probabilistic algorithm into a full-scan algorithm (most reasonable result -> 100% compatibility). So there should be no need to select only one high-level strategy at the implementation level.

I think libs like chardet provide good examples for this topic - http://chardet.readthedocs.io/en/latest/usage.html#basic-usage

roll · 2018-02-22T09:22:03Z

roll
Feb 22, 2018
Maintainer

For example in this code:

https://github.com/frictionlessdata/tableschema-js/blob/master/src/schema.js#L179-L212

accepting config.INFER_THRESHOLD and config.INFER_CONFIDENCE as arguments will make it fully customizable.

With confidence=1 and threshold=sample.length it will guarantee that the inferred schema types are 100% compatible with every value in the sample (cc @anuveyatsu - frictionlessdata/tableschema-js#111)

rufuspollock · 2018-02-26T09:31:01Z

rufuspollock
Feb 26, 2018
Maintainer Author

@vitorbaptista @roll so do you have a suggestion on the signature of an infer function e.g.

def infer(iterator, sample_length, threshold):
    ...

Or should the sample length be merged into the iterator in some way (i.e. you pass an iterator with only 100 or 1m items)?
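
For the "merged into the iterator" option, a small sketch: the caller trims the input with itertools.islice, so infer itself needs no sample_length argument. The infer body and the read_values generator are hypothetical stand-ins, and only integer and string are distinguished.

```python
from itertools import islice


def infer(values, threshold=1.0):
    """Return 'integer' if at least `threshold` of the values parse as int."""
    total = hits = 0
    for value in values:  # consumes whatever the caller passes in
        total += 1
        try:
            int(value)
            hits += 1
        except ValueError:
            pass
    return "integer" if total and hits / total >= threshold else "string"


def read_values():
    """Stand-in for a large or unbounded source of values."""
    n = 0
    while True:
        yield str(n)
        n += 1


# The caller decides the sample size; infer() never sees the rest.
print(infer(islice(read_values(), 100)))  # integer
```

The trade-off is that infer can no longer decide by itself when it has seen enough, which is what the incremental style below is for.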

Should the infer method be more like chardet and be able to be used incrementally e.g.

import urllib.request
from chardet.universaldetector import UniversalDetector

usock = urllib.request.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    # feed the detector line by line and stop once it is confident
    detector.feed(line)
    if detector.done:
        break
detector.close()
usock.close()
print(detector.result)
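
Carried over to type inference, the same feed / done / close / result flow might look like the sketch below. TypeDetector, max_values and min_confidence are hypothetical names; only integer and string are distinguished, and string is reported with confidence 1 because any value fits it.

```python
class TypeDetector:
    """Hypothetical incremental detector; its feed/done/close/result shape mirrors chardet."""

    def __init__(self, max_values=1000, min_confidence=0.9):
        self.max_values = max_values          # stop asking for data after this many values
        self.min_confidence = min_confidence  # share of integers needed to report 'integer'
        self.seen = 0
        self.integer_hits = 0
        self.done = False
        self.result = None

    def feed(self, value):
        """Consume one string value; sets `done` once enough values were seen."""
        if self.done:
            return
        self.seen += 1
        try:
            int(value)
            self.integer_hits += 1
        except ValueError:
            pass
        self.done = self.seen >= self.max_values

    def close(self):
        """Finalise and store the result."""
        self.done = True
        share = self.integer_hits / self.seen if self.seen else 0.0
        if share >= self.min_confidence:
            self.result = {"type": "integer", "confidence": share}
        else:
            self.result = {"type": "string", "confidence": 1.0}
        return self.result


detector = TypeDetector(max_values=3)
for value in ["1", "2", "x", "4", "5"]:
    detector.feed(value)
    if detector.done:
        break
detector.close()
print(detector.result)
```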

roll · 2018-02-26T11:13:33Z

roll
Feb 26, 2018
Maintainer

@rufuspollock
I think it really depends on the software goals. E.g. tableschema.infer just gets a path/url/etc (to achieve simplicity). Some extended API could provide chardet-style customizability.

rufuspollock · 2020-04-21T15:06:51Z

rufuspollock
Apr 21, 2020
Maintainer Author

@roll @lauragift21 @lwinfree I feel this is the kind of thing that merits a page somewhere. In some sense it could go on specs as some kind of protocol (but I don't think that is a great match), or it could go on the main site, maybe in the guide or even on its own page under "jobs". Any thoughts?

roll · 2020-04-21T15:13:28Z

roll
Apr 21, 2020
Maintainer

For me infer is more an internal thing (implementation details).

I would rather get back to the discussion of adding a Data Quality Spec to the specs in a more general form than it has now, which is related to https://github.com/frictionlessdata/project/issues/459

rufuspollock · 2020-04-21T15:21:54Z

rufuspollock
Apr 21, 2020
Maintainer Author

@roll it's partly implementation but i think it is actually something people need to do quite a bit and it would be nice to get the key analysis points recorded somewhere - maybe this should just move to the forum for now (there's no explicit task really atm).

roll · 2020-04-22T07:18:20Z

roll
Apr 22, 2020
Maintainer

It can be a research piece or a blog post, I think. Honestly, our tools use quite primitive approaches, at least regarding performance. So someone coming up with better algorithms would really help.

lwinfree · 2022-03-28T19:36:34Z

lwinfree
Mar 28, 2022

@roll & @sapetti9 it might be nice to have this as part of the Q&A here https://framework.frictionlessdata.io/docs/faq (related to the current question on Discord about infer: https://discord.com/channels/695635777199145130/806182489865060423/956942072038965268)
