Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid division by zero #52

Closed
wants to merge 3 commits into from
Closed

Avoid division by zero #52

wants to merge 3 commits into from

Conversation

vpa1977
Copy link

@vpa1977 vpa1977 commented Oct 5, 2023

This addresses #51

While it is not addressing the root cause (sample_counts[0] == 0), it allows to avoid undefined behaviour and allows the test to pass.

@wasade
Copy link
Member

wasade commented Oct 5, 2023

Thanks, @vpa1977!

@sfiligoi, could you look when you have a moment? Any idea why this test wasn't flagged on prior CI runs?

@sfiligoi
Copy link
Collaborator

sfiligoi commented Oct 5, 2023

Pretty sure we are never supposed to have (sample_counts[0] == 0) in there!
(i.e., it is an implicit assumption)
Don't think this is the right approach.... we need to find why would anyone ever invoke this function with "wrong" counts.

But I need some time to think it through.

@vpa1977
Copy link
Author

vpa1977 commented Oct 5, 2023

Pretty sure we are never supposed to have (sample_counts[0] == 0) in there! (i.e., it is an implicit assumption) Don't think this is the right approach.... we need to find why would anyone ever invoke this function with "wrong" counts.

Here is the h5 dump of the data used in the unit test:

HDF5 "/tmp/table.biom" {
GROUP "/" {
   ATTRIBUTE "creation-date" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "2023-10-06T12:05:44.430414"
      }
   }
   ATTRIBUTE "format-url" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "http://biom-format.org"
      }
   }
   ATTRIBUTE "format-version" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): 2, 1
      }
   }
   ATTRIBUTE "generated-by" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "Table for unit testing"
      }
   }
   ATTRIBUTE "id" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "No Table ID"
      }
   }
   ATTRIBUTE "nnz" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SCALAR
      DATA {
      (0): 0
      }
   }
   ATTRIBUTE "shape" {
      DATATYPE  H5T_STD_I64LE
      DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
      (0): 5, 1
      }
   }
   ATTRIBUTE "type" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): ""
      }
   }
   GROUP "observation" {
      GROUP "group-metadata" {
      }
      DATASET "ids" {
         DATATYPE  H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SIMPLE { ( 5 ) / ( 5 ) }
         DATA {
         (0): "OTU1", "OTU2", "OTU3", "OTU4", "OTU5"
         }
      }
      GROUP "matrix" {
         DATASET "data" {
            DATATYPE  H5T_IEEE_F64LE
            DATASPACE  SIMPLE { ( 0 ) / ( 0 ) }
            DATA {
            }
         }
         DATASET "indices" {
            DATATYPE  H5T_STD_I32LE
            DATASPACE  SIMPLE { ( 0 ) / ( 0 ) }
            DATA {
            }
         }
         DATASET "indptr" {
            DATATYPE  H5T_STD_I32LE
            DATASPACE  SIMPLE { ( 6 ) / ( 6 ) }
            DATA {
            (0): 0, 0, 0, 0, 0, 0
            }
         }
      }
      GROUP "metadata" {
      }
   }
   GROUP "sample" {
      GROUP "group-metadata" {
      }
      DATASET "ids" {
         DATATYPE  H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): "foo"
         }
      }
      GROUP "matrix" {
         DATASET "data" {
            DATATYPE  H5T_IEEE_F64LE
            DATASPACE  SIMPLE { ( 0 ) / ( 0 ) }
            DATA {
            }
         }
         DATASET "indices" {
            DATATYPE  H5T_STD_I32LE
            DATASPACE  SIMPLE { ( 0 ) / ( 0 ) }
            DATA {
            }
         }
         DATASET "indptr" {
            DATATYPE  H5T_STD_I32LE
            DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
            DATA {
            (0): 0, 0
            }
         }
      }
      GROUP "metadata" {
      }
   }
}
}

It has 1 sample foo and 5 observations OTU1..OTU2 without any data points.
Since there are no data points, the sample_counts[] are zero.

set_proportions has normalize = true as a default argument.
I am wondering if normalisation makes sense at all in this case and we should just have a do not normalize clause for the columns that lack data.

@sfiligoi
Copy link
Collaborator

sfiligoi commented Oct 5, 2023

Can you check what happens when you run the code through the command line faithpd?

(unifrac-dev-311) -bash-4.2$ faithpd -i /tmp/table.biom -t /tree/tree.biom  -o a.txt
(unifrac-dev-311) -bash-4.2$ cat a.txt 
#SampleID	faith_pd
foo	0

Additionally, can you check that the generated tree actually makes sense?

(unifrac-dev-311) -bash-4.2$ cat /tree/tree.biom
(((((OTU1:0.5,OTU2:0.5):0.5,OTU3:1.0):1.0):0.0,(OTU4:0.75,OTU5:0.75):1.25):0.0)root;

@sfiligoi
Copy link
Collaborator

sfiligoi commented Oct 6, 2023

Anyway, you are right that we should not be dividing by zero, but we are...
I traced the code, and
0/0 -> -NAN
on my machine.
Which makes the test
(value <= 0)
pass (by pure luck)

Still, we have to figure out how to properly deal with this situation.
There really is no good answer to what 0/0 is, so we should avoid getting in there in the first place.
(likely some kind of filtering before invoking the function)

@wasade Any insight on what the proper semantics should be?

@wasade
Copy link
Member

wasade commented Oct 6, 2023

The correct faith's pd result in this case is zero. Any issue with if(sample_counts[i] == 0) { props[i] = 0; continue; }?

@vpa1977
Copy link
Author

vpa1977 commented Oct 6, 2023

Updated as per comment.

@sfiligoi
Copy link
Collaborator

sfiligoi commented Oct 6, 2023

While this fixes one instance of division by zero, it does not fix the root cause.
(neither it fixes the many other places where we are assuming sample_counts[i] != 0)

FYI, looking into ways to address the root issue of the broken assumption.

@sfiligoi
Copy link
Collaborator

sfiligoi commented Oct 6, 2023

Opened PR #53 as an alternative implementation.

@vpa1977 vpa1977 closed this Oct 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants