
I think sum of NULLs in the aggregate query should be zero #20

Open
jbothma opened this issue May 30, 2016 · 5 comments

jbothma commented May 30, 2016

Right now babbage just does what the database does. In Postgres that means the sum of 100 and NULL is 100, while the sum of only NULLs is NULL. I'd expect all values in a sum aggregate to be integers (assuming floats aren't supported).

What do you think?
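The behaviour described above can be reproduced with any standard SQL engine; here is a minimal sketch using Python's built-in sqlite3, which follows the same rule as Postgres here (SUM skips NULLs, but a SUM over only NULLs yields NULL rather than 0):

```python
import sqlite3

# In-memory database standing in for the Postgres backend babbage queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (amount INTEGER)")
conn.executemany("INSERT INTO facts (amount) VALUES (?)", [(100,), (None,)])

# SUM ignores the NULL row, so 100 + NULL aggregates to 100.
mixed = conn.execute("SELECT SUM(amount) FROM facts").fetchone()[0]
print(mixed)  # 100

# With only NULLs left, SUM returns NULL (None in Python), not 0.
conn.execute("UPDATE facts SET amount = NULL")
all_null = conn.execute("SELECT SUM(amount) FROM facts").fetchone()[0]
print(all_null)  # None
```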


akariv commented May 30, 2016

I agree.


Stiivi commented Jun 7, 2016

SUM of NULLs should be NULL, as it indicates erroneous data. In the analytical layer (the multi-dimensional data store), NULL should not exist in fields used by the analytical engines, that is: fact keys, dimension keys, and measures. If a NULL exists in the tables at this stage, something is wrong with the data production process that generates those tables, and it should be checked and corrected.

NULL in this case is an indicator of a broken ETL process or insufficiently prepared data.


jbothma commented Jun 7, 2016

If that's the approach, then the sum of NULL and anything else should also be NULL; or, even better, any NULL result should raise an exception.

Is this a common view in the ETL or OLAP world?

I tend to consider NULL a bad sign in a database, as it suggests a flawed data model: each occurrence of NULL requires further interpretation.

In the case of https://github.com/Code4SA/municipal-data, it means there was a NULL in the data supplied by our treasury. We try to stay close to that data, and we suggest that applications on our API treat NULLs as zero.
For OpenSpending that would probably mean additional preprocessing, forcing the user to interpret NULLs at import, which might be a good thing.


akariv commented Jun 7, 2016

When handling official fiscal data we try to keep as close to the original as possible.

In these files, a missing value (aka NULL) might sometimes be meaningful, and hold a different meaning than a '0' in the same cell.

The intuitive way for me to think about summing values, some of which may be missing, is to ignore the missing values in the list, so that 2 + NULL + 2 == 4. The question of what an "all NULL" aggregation should return is more philosophical in my opinion, but convention says it should be 0 (https://en.wikipedia.org/wiki/Empty_sum).
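The empty-sum convention can be enforced at the query level with COALESCE, which maps a NULL aggregate back to 0. A minimal sketch, again using sqlite3 as a stand-in (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (amount INTEGER)")
conn.execute("INSERT INTO facts (amount) VALUES (NULL)")

# COALESCE replaces a NULL SUM with 0, matching the empty-sum convention.
total = conn.execute(
    "SELECT COALESCE(SUM(amount), 0) FROM facts"
).fetchone()[0]
print(total)  # 0
```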


pwalsh commented Jun 8, 2016

I'm with @akariv

I've dealt with this dilemma in fiscal data in the past. It is all too common to get data with budget/spend lines that have "no monetary amount", which is distinct from a zero monetary amount.

Of course, we could handle this in the ETL, but we are then altering the meaning of the source data (where "NULL" has a distinct meaning from 0, and is not necessarily erroneous).

For the least surprising interpretation of summing NULLs, then, for fiscal data of this type, I think treating NULL as 0 is a reasonable decision.

However, as Babbage is not necessarily tied to this particular fiscal-data use case, I'd rather see something like a "null handling strategy", so that in other cases we can treat NULL as an error (although that should probably be caught when loading the data).
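The "null handling strategy" idea could look something like the following sketch. The names here are illustrative only, not part of Babbage's actual API; it simply shows the three positions taken in this thread as selectable behaviours:

```python
from enum import Enum
from typing import Iterable, Optional


class NullStrategy(Enum):
    TREAT_AS_ZERO = "zero"   # fiscal-data default: NULL behaves like 0
    PROPAGATE = "propagate"  # any NULL makes the whole sum NULL
    ERROR = "error"          # fail fast: NULL indicates broken ETL


def aggregate_sum(values: Iterable[Optional[float]],
                  strategy: NullStrategy = NullStrategy.TREAT_AS_ZERO):
    """Sum a measure column under a configurable NULL-handling strategy."""
    values = list(values)
    if any(v is None for v in values):
        if strategy is NullStrategy.ERROR:
            raise ValueError("NULL encountered in measure column")
        if strategy is NullStrategy.PROPAGATE:
            return None
    # TREAT_AS_ZERO: skip NULLs; an all-NULL list yields the empty sum, 0.
    return sum(v for v in values if v is not None)
```

For example, `aggregate_sum([100, None])` returns 100, while the same input under `NullStrategy.PROPAGATE` returns None, and under `NullStrategy.ERROR` raises.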
