Validating tall / long / tidy data #5288
aberges-grd started this conversation in Community Use Cases
Replies: 0 comments
What is tidy data?
As you may know, tidy data is a fairly common format in data science, with few columns and many rows. Essentially, you encode semantics in the column names and let categorical columns do the indexing/grouping. This is particularly well suited to, e.g., timeseries from industrial settings, where variables usually have time dependence and a hierarchical structure (a variable comes from a sensor ∈ machine part ∈ machine ∈ manufacturing group ∈ factory), so the data is usually organized as a timestamp column, a value column, and one or more "name" columns.
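To make the layout concrete, here is a minimal sketch of such a table in pandas. The column names (`timestamp`, `machine`, `sensor`, `value`) and the readings are made up for illustration; a real dataset would likely carry more hierarchy levels.

```python
import pandas as pd

# Hypothetical long-format readings: one row per (timestamp, sensor) observation.
tidy = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 00:00",
                "2024-01-01 00:00",
                "2024-01-01 00:01",
                "2024-01-01 00:01",
            ]
        ),
        "machine": ["milling", "milling", "milling", "milling"],
        "sensor": ["temperature", "vibration", "temperature", "vibration"],
        "value": [71.2, 0.03, 71.9, 0.04],
    }
)

# The categorical "name" columns do the indexing: each (machine, sensor)
# pair selects one timeseries out of the single "value" column.
series_count = tidy.groupby(["machine", "sensor"]).ngroups
```

Adding a new sensor only adds rows, never columns, which is why this shape is convenient for hierarchical data.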
What do I want?
I want Great Expectations to either fully support tidy data as the "go-to" format for hierarchical data structures such as the ones arising in industry, OR provide a way to "pivot" the data (via e.g. pandas) prior to passing it to a suite. Ideally, it would also support multi-index dataframes, so I can do table checks on levels of the multi-index (e.g. check that all required sensors are present).
What would it mean to support tidy data? Essentially: being able to apply "column expectations" inside GroupBy objects. So I could check, for example, that the group [temperature sensor, milling machine] (which defines a timeseries) has values within MIN and MAX (safety / operational limits) and a median within a nominal value ± tolerance.
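For illustration, this is roughly what I mean, sketched in plain pandas rather than with any Great Expectations API. The `check_groups` helper, the `LIMITS` table, the column names, and the readings are all hypothetical.

```python
import pandas as pd

# Hypothetical tidy readings for one machine with two sensors.
readings = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 00:01", "2024-01-01 00:02"] * 2
        ),
        "machine": ["milling"] * 6,
        "sensor": ["temperature"] * 3 + ["vibration"] * 3,
        "value": [71.2, 71.9, 95.0, 0.03, 0.04, 0.03],
    }
)

# Hypothetical MIN/MAX and nominal ± tolerance per (machine, sensor) group.
LIMITS = {
    ("milling", "temperature"): {"min": 60.0, "max": 90.0, "nominal": 72.0, "tol": 2.0},
    ("milling", "vibration"): {"min": 0.0, "max": 0.1, "nominal": 0.03, "tol": 0.02},
}


def check_groups(df, limits):
    """Apply column-style checks to the 'value' column of each group."""
    rows = []
    for key, group in df.groupby(["machine", "sensor"]):
        spec = limits.get(key)
        if spec is None:
            continue  # no limits defined for this group, so nothing to check
        values = group["value"]
        rows.append(
            {
                "machine": key[0],
                "sensor": key[1],
                # every value lies in [min, max]
                "within_range": bool(values.between(spec["min"], spec["max"]).all()),
                # median lies in nominal ± tol
                "median_ok": bool(abs(values.median() - spec["nominal"]) <= spec["tol"]),
            }
        )
    return pd.DataFrame(rows)


report = check_groups(readings, LIMITS)

# The "pivot" alternative: one column per (machine, sensor), which yields
# multi-index columns whose levels can be checked like a table property.
wide = readings.pivot(index="timestamp", columns=["machine", "sensor"], values="value")
required_sensors = {"temperature", "vibration"}
missing_sensors = required_sensors - set(wide.columns.get_level_values("sensor"))
```

Here `report` has one row per group with a pass/fail flag per check, and `missing_sensors` is the "all sensors needed are present" check on a multi-index level. Passing a list to `columns` in `pivot` assumes pandas 1.1 or later.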