-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
I think sum of NULLs in the aggregate query should be zero #20
Comments
I agree. |
SUM of nulls should be NULL as it indicates erroneous data. In the analytical layer (multi-dimensional data store) NULL should not exist in fields used by the analytical engines, that is: fact keys, dimension keys and measures. If a NULL happens to exist in the tables at this stage, it means there is something wrong with the data production process that generates said tables and it should be checked and corrected. NULL in this case is indicator of broken ETL or insufficiently prepared data. |
if that's the approach then sum of NULL and something else should also be NULL or even better, any null result should result in an exception. Is this a common view in the ETL or OLAP world? I tend to consider null as a bad thing in a database as it suggests a bad data model - always requiring further interpretation of this case of NULL. In the case of https://github.com/Code4SA/municipal-data it means there was a null in the data supplied by our treasury and we try to keep close to that data and suggest applications on our API treat nulls as zero. |
When handling official fiscal data we try to keep as close to the original as possible. In these files, a missing value (aka NULL) might sometimes be meaningful, and hold a different meaning than a '0' in the same cell. The intuitive way for me of thinking about summation of values - out of which some might be missing - is to ignore the missing values in the list, so that |
I'm with @akariv I've dealt with this dilemma in fiscal data in the past. It is too common to get data with budget/spend lines which have "no monetary amount", which is distinct from zero monetary amount. Of course, we could handle this in the ETL, but we are then altering the meaning of the source data (where "NULL" has a distinct meaning from 0, and is not necessarily erroneous). For the least surprising interpretation of summing NULLs then, for fiscal data of this type, I think it is a reasonable decision to treat NULL as 0. However, as Babbage is not necessarily about this particular fiscal data use case, I'd rather see something like a "null handling strategy" so that, in other potential cases we allow NULL to be considered an error (although, that should likely be caught when loading data). |
right now babbage just does what the database does. In postgres that means sum of 100 and null is 100, while sum of only nulls is null. I'd expect all values in a sum aggregate to be integers (assuming floats aren't supported)
What do you think?
The text was updated successfully, but these errors were encountered: