
I think sum of NULLs in the aggregate query should be zero #20

Open
jbothma opened this issue May 30, 2016 · 5 comments

jbothma commented May 30, 2016

Right now babbage just does what the database does. In Postgres that means the sum of 100 and NULL is 100, while the sum of only NULLs is NULL. I'd expect all values in a sum aggregate to be integers (assuming floats aren't supported).

What do you think?
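The behaviour described above can be reproduced with any standard SQL engine; here is a minimal sketch using Python's built-in sqlite3, which follows the same rule as Postgres here (SUM skips NULLs, but a SUM over only NULLs yields NULL rather than 0):

```python
import sqlite3

# In-memory database standing in for the Postgres backend babbage queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (amount INTEGER)")
conn.executemany("INSERT INTO facts (amount) VALUES (?)", [(100,), (None,)])

# SUM ignores the NULL row, so 100 + NULL aggregates to 100.
mixed = conn.execute("SELECT SUM(amount) FROM facts").fetchone()[0]
print(mixed)  # 100

# With only NULLs left, SUM returns NULL (None in Python), not 0.
conn.execute("UPDATE facts SET amount = NULL")
all_null = conn.execute("SELECT SUM(amount) FROM facts").fetchone()[0]
print(all_null)  # None
```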


akariv commented May 30, 2016

I agree.


Stiivi commented Jun 7, 2016

SUM of NULLs should be NULL, as it indicates erroneous data. In the analytical layer (the multi-dimensional data store), NULL should not exist in fields used by the analytical engines, that is: fact keys, dimension keys, and measures. If a NULL exists in the tables at this stage, something is wrong with the data production process that generates those tables, and it should be checked and corrected.

NULL in this case is an indicator of a broken ETL process or insufficiently prepared data.


jbothma commented Jun 7, 2016

If that's the approach, then the sum of NULL and anything else should also be NULL; or, even better, any NULL result should raise an exception.

Is this a common view in the ETL or OLAP world?

I tend to consider NULL a bad sign in a database, as it suggests a flawed data model: each occurrence of NULL requires further interpretation.

In the case of https://github.com/Code4SA/municipal-data, it means there was a NULL in the data supplied by our treasury. We try to stay close to that data, and we suggest that applications on our API treat NULLs as zero.
For OpenSpending that would probably mean additional preprocessing, forcing the user to interpret NULLs at import, which might be a good thing.


akariv commented Jun 7, 2016

When handling official fiscal data we try to keep as close to the original as possible.

In these files, a missing value (aka NULL) might sometimes be meaningful, and hold a different meaning than a '0' in the same cell.

The intuitive way for me to think about summing values, some of which may be missing, is to ignore the missing values in the list, so that 2 + NULL + 2 == 4. The question of what an "all NULL" aggregation should return is more philosophical in my opinion, but convention says it should be 0 (https://en.wikipedia.org/wiki/Empty_sum).
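The empty-sum convention can be enforced at the query level with COALESCE, which maps a NULL aggregate back to 0. A minimal sketch, again using sqlite3 as a stand-in (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (amount INTEGER)")
conn.execute("INSERT INTO facts (amount) VALUES (NULL)")

# COALESCE replaces a NULL SUM with 0, matching the empty-sum convention.
total = conn.execute(
    "SELECT COALESCE(SUM(amount), 0) FROM facts"
).fetchone()[0]
print(total)  # 0
```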


pwalsh commented Jun 8, 2016

I'm with @akariv

I've dealt with this dilemma in fiscal data in the past. It is all too common to get data with budget/spend lines that have "no monetary amount", which is distinct from a zero monetary amount.

Of course, we could handle this in the ETL, but we are then altering the meaning of the source data (where "NULL" has a distinct meaning from 0, and is not necessarily erroneous).

For the least surprising interpretation of summing NULLs, then, for fiscal data of this type, I think treating NULL as 0 is a reasonable decision.

However, as Babbage is not necessarily tied to this particular fiscal-data use case, I'd rather see something like a "null handling strategy", so that in other cases we can treat NULL as an error (although that should probably be caught when loading the data).
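The "null handling strategy" idea could look something like the following sketch. The names here are illustrative only, not part of Babbage's actual API; it simply shows the three positions taken in this thread as selectable behaviours:

```python
from enum import Enum
from typing import Iterable, Optional


class NullStrategy(Enum):
    TREAT_AS_ZERO = "zero"   # fiscal-data default: NULL behaves like 0
    PROPAGATE = "propagate"  # any NULL makes the whole sum NULL
    ERROR = "error"          # fail fast: NULL indicates broken ETL


def aggregate_sum(values: Iterable[Optional[float]],
                  strategy: NullStrategy = NullStrategy.TREAT_AS_ZERO):
    """Sum a measure column under a configurable NULL-handling strategy."""
    values = list(values)
    if any(v is None for v in values):
        if strategy is NullStrategy.ERROR:
            raise ValueError("NULL encountered in measure column")
        if strategy is NullStrategy.PROPAGATE:
            return None
    # TREAT_AS_ZERO: skip NULLs; an all-NULL list yields the empty sum, 0.
    return sum(v for v in values if v is not None)
```

For example, `aggregate_sum([100, None])` returns 100, while the same input under `NullStrategy.PROPAGATE` returns None, and under `NullStrategy.ERROR` raises.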
