Add range type #316

Open
hesamsagha opened this issue Sep 28, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@hesamsagha
Member

Feature

We could add a range type for data whose values are specified as strings but can still be ordered.

For example, when we don't want to be precise about the birth year (or are not allowed to be, due to privacy), 10-year categories will do the job:

 "_1990" : means 1990 and less
 "1991_2000": means between years 1991 to 2000 inclusive
 "2001_2010": means between years 2001 to 2010 inclusive
 "2011_2020": means between years 2011 to 2020 inclusive
 "2020_": means 2020 and afterward
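
To make the idea concrete, here is a minimal sketch of how such labels could be ordered by their bounds. The helper `parse_range` is hypothetical and not part of audformat; it only assumes the `low_high` naming convention from the list above, with a missing bound treated as open-ended.

```python
import math


def parse_range(label: str) -> tuple[float, float]:
    """Turn a label like '_1990', '1991_2000' or '2020_' into (low, high).

    Hypothetical helper: a missing bound is treated as open-ended.
    """
    low, sep, high = label.partition("_")
    if not sep:
        raise ValueError(f"not a range label: {label!r}")
    return (
        float(low) if low else -math.inf,
        float(high) if high else math.inf,
    )


labels = ["2001_2010", "2020_", "_1990", "2011_2020", "1991_2000"]

# Sort by (lower bound, upper bound) instead of by string value
ordered = sorted(labels, key=parse_range)
print(ordered)
```

Sorting by the parsed bounds only works for labels that follow this convention, so it would not generalize to arbitrary string labels.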
@hagenw
Member

hagenw commented Sep 28, 2022

One solution would be to add a range type that expects labels, which are then stored as an ordered pandas.CategoricalDtype. Another would be to extend the str type with an option indicating that the labels should be ordered. As there is obviously no easy way to decide whether "1991_2000" > "2001_2010" holds for arbitrary entries, we could fix the order to the order in which the labels are given when creating the scheme.
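
As a rough illustration of the first option, the following sketch uses plain pandas (no audformat involved) and takes the order of the labels from the order in which they are listed. The example values in `years` are made up for illustration.

```python
import pandas as pd

# The list order defines the label order: "_1990" < ... < "2020_"
labels = ["_1990", "1991_2000", "2001_2010", "2011_2020", "2020_"]
dtype = pd.CategoricalDtype(categories=labels, ordered=True)

# Made-up example values
years = pd.Series(["2001_2010", "_1990", "2020_"], dtype=dtype)

print(years.min())
print(years.sort_values().tolist())

# Comparing the series against a label respects the category order
print((years < "2011_2020").tolist())
```

No string comparison is involved here, so the labels themselves could be arbitrary.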

@hagenw hagenw added the enhancement New feature or request label Sep 28, 2022
@hagenw
Member

hagenw commented Oct 19, 2022

I would not create an extra range type, as I don't see an immediate advantage in having one. Instead, I would extend labels in Scheme to support an order, e.g.

db.schemes['ordered_alphabet'] = audformat.Scheme('str', labels=['a', 'b', 'c'], ordered=True)

If the labels are given as a misc table, the order of the index would then determine the order of the labels.

@frankenjoe
Collaborator

I'm with @hagenw here: I would not introduce a range type, and I propose we close the issue.

@hagenw
Member

hagenw commented Jan 18, 2023

What about the proposed ordering of the labels? It might be important to store information such as '2001_2010' < '2020_'. CategoricalDtype supports providing an order, which so far we don't make use of.

@frankenjoe
Collaborator

> If the labels are given as a misc table, the order of the index would then determine the order of the labels.

Ok, so that you can do something like this:

import audformat
import audformat.testing

db = audformat.testing.create_db(minimal=True)

# labels are listed in the desired order: 2 < 1 < 0
db.schemes['scheme'] = audformat.Scheme(labels=[2, 1, 0])

db['table'] = audformat.Table(audformat.filewise_index(['f1', 'f2', 'f3']))
db['table']['column'] = audformat.Column(scheme_id='scheme')
db['table']['column'].set([0, 1, 2])

# force order (this sets a private attribute of the pandas dtype)
db['table'].df.column.dtype._ordered = True

y = db['table']['column'].get()
y.min()
2

Unfortunately, the ordering gets lost when you compare values directly:

y[2] < y[0]
False
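
For reference, a small pandas-only sketch of why this happens and one possible workaround. It rebuilds a series like `y` directly with an ordered `CategoricalDtype` instead of going through audformat, so the index values and variable names are assumptions for illustration only.

```python
import pandas as pd

# Same label order as in the example above: 2 < 1 < 0
dtype = pd.CategoricalDtype(categories=[2, 1, 0], ordered=True)
y = pd.Series([0, 1, 2], index=["f1", "f2", "f3"], dtype=dtype)

# Single elements come back as plain integers,
# so this is the ordinary numeric comparison 2 < 0
plain = bool(y.iloc[2] < y.iloc[0])

# Category codes are positions in the category list,
# so comparing codes follows the category order
codes = y.cat.codes
by_code = bool(codes.iloc[2] < codes.iloc[0])

# Comparing the whole series against a label also follows the category order
mask = y < 1

print(plain, by_code, mask.tolist())
```

The order lives in the dtype of the series, not in the individual values, which is why it is lost once single elements are extracted.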

@hagenw
Member

hagenw commented Jan 18, 2023

Hm, it's of course a little bit unfortunate that

>>> y[2] < y[0]
False

is not working.

Maybe users who work with this kind of data could tell us whether it would be beneficial for them if the labels were marked as ordered?

/cc @monicagoma, @Pascal-H
