Add mean imputation function #892

tszfungc · 2022-08-17T04:23:06Z

Add mean impute function for call_dosage, call_genotype, and call_genotype_probability

tomwhite

Thanks for the contribution @tszfungc!

The overall structure looks fine to me. Hoping @jeromekelleher and @timothymillar can take a look too.

tomwhite · 2022-08-17T14:08:44Z

sgkit/stats/preprocessing.py

+        Dataset containing the variable to be imputed.
+    variable
+        Input variable name
+        ``f"{variable}"`` and ``f"{variable}_masked"`` must be present in ``ds``.


Don't think f-strings work here?

timothymillar · 2022-08-18T09:11:54Z

Thanks for looking into this @tszfungc! I think this could be a great approach for imputing call_dosage and call_genotype_probability. However, I don't think it will produce the desired result for call_genotype.

The values in call_genotype are (potentially unsorted) alleles whose order along the ploidy dimension doesn't have any particular meaning. So, as far as I can tell, the mean of those alleles can't really be used for anything.

tszfungc · 2022-08-18T19:29:02Z

Thanks for the review, @tomwhite and @timothymillar. I agree that the allele order doesn't have a particular meaning. The order along ploidy can be ignored by computing the mean along dim=['samples', 'ploidy'], but this also seems like an unusual use to me.
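To make the allele-order point concrete, here is a minimal sketch using plain NumPy rather than the function in this PR. The arrays and the `masked_mean` helper are hypothetical, and -1 is assumed to mark a missing allele. Axis 0 stands in for `samples` and axis 1 for `ploidy`.

```python
import numpy as np

# Hypothetical diploid calls for one variant, shape (samples, ploidy).
# -1 is assumed to mark a missing allele.
calls_a = np.array([[0, 1], [0, 1], [-1, -1]])
calls_b = np.array([[0, 1], [1, 0], [-1, -1]])  # same genotypes, second call reordered


def masked_mean(calls, axis):
    """Mean of the non-missing alleles along the given axis (or axes)."""
    masked = np.where(calls < 0, np.nan, calls.astype(float))
    return np.nanmean(masked, axis=axis)


# Mean over samples only: one value per ploidy slot, so allele order leaks in.
per_slot_a = masked_mean(calls_a, axis=0)
per_slot_b = masked_mean(calls_b, axis=0)

# Mean over samples and ploidy together: a single order-independent value.
pooled_a = masked_mean(calls_a, axis=(0, 1))
pooled_b = masked_mean(calls_b, axis=(0, 1))
```

The two inputs encode the same genotypes, yet the per-slot means differ, while the pooled means agree. For a biallelic variant the pooled value is the alternate allele frequency among observed alleles; whether writing that value into call_genotype is meaningful is a separate question.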

jeromekelleher

The approach basically looks good to me, but I'm not convinced about creating new _imputed variables. It would be simpler/better to just replace the missing data and reset the missingness mask in the returned dataset, I think.

jeromekelleher · 2022-08-25T08:44:31Z

sgkit/stats/preprocessing.py

+    dim: Union[Hashable, Sequence[Hashable]] = "samples",
+    merge: bool = True,
+) -> Dataset:
+    """Mean impute a masked variable


It would be helpful to give a more descriptive follow-up sentence here, such as:

This replaces missing data for the specified variable with the mean of the non-missing values.
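For readers of the thread, the behaviour described by that sentence can be sketched roughly as follows. This is a NumPy illustration under stated assumptions, not the code in this PR: `mean_impute` here is a hypothetical helper, the mask is passed explicitly, and missing dosages are assumed to be stored as NaN.

```python
import numpy as np


def mean_impute(values, mask, axis):
    """Replace masked entries with the mean of the unmasked entries along `axis`."""
    values = np.asarray(values, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    # Hide masked entries so they do not contribute to the mean.
    observed = np.where(mask, np.nan, values)
    # keepdims=True lets the means broadcast back against the input shape.
    means = np.nanmean(observed, axis=axis, keepdims=True)
    return np.where(mask, means, values)


# Hypothetical dosages, shape (variants, samples); NaN marks a missing call.
dosage = np.array([[0.0, 2.0, np.nan], [1.0, np.nan, 1.0]])
mask = np.isnan(dosage)
imputed = mean_impute(dosage, mask, axis=1)  # mean over samples, per variant
```

Non-missing values pass through unchanged. A variant with no observed values has no mean to fill in, so in this sketch it stays NaN (and NumPy emits a warning).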

jeromekelleher · 2022-08-25T08:48:40Z

sgkit/variables.py

@@ -214,6 +214,15 @@ def _check_field(
    )
 )

+call_dosage_imputed, call_dosage_imputed_spec = SgkitVariables.register_variable(


I'm not sure we want to create a whole new bunch of variables here. Wouldn't it be simpler if we returned a copy of the original dataset in which all the missing data for the variable in question was replaced with the mean, and the mask was unset?

This would be more useful for downstream work, wouldn't it? We'd surely want to use the (say) imputed call_dosage in downstream analyses, and we wouldn't want to need to change variable names in order to do this.

timothymillar · 2022-09-01T21:45:46Z

@jeromekelleher the trade-off between returning new variables or replacing existing variables was previously discussed in https://github.com/pystatgen/sgkit/pull/308#issuecomment-705706571. I personally have a slight preference for replacing existing variables but there are some good points raised in that discussion. The primary concern seems to be that replacing existing variables is effectively a mutate operation, which goes against the general pattern of treating arrays as immutable.
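The two return conventions under discussion can be contrasted with a small sketch. A plain dict of NumPy arrays stands in for the dataset here; the function names and the `f"{name}_mask"` lookup are assumptions made for illustration, not sgkit's API.

```python
import numpy as np


def _fill_with_mean(values, mask):
    """Fill masked entries with the overall mean of the unmasked entries."""
    observed = np.where(mask, np.nan, values)
    return np.where(mask, np.nanmean(observed), values)


def impute_as_new_variable(ds, name):
    """Convention A: keep the original variable and add `<name>_imputed`."""
    filled = _fill_with_mean(ds[name], ds[f"{name}_mask"])
    return {**ds, f"{name}_imputed": filled}


def impute_replacing_variable(ds, name):
    """Convention B: return a copy where `name` is filled and its mask is cleared."""
    mask = ds[f"{name}_mask"]
    filled = _fill_with_mean(ds[name], mask)
    return {**ds, name: filled, f"{name}_mask": np.zeros_like(mask)}


# Hypothetical stand-in dataset with one missing dosage.
dosage = np.array([0.0, 2.0, np.nan])
ds = {"call_dosage": dosage, "call_dosage_mask": np.isnan(dosage)}

ds_a = impute_as_new_variable(ds, "call_dosage")
ds_b = impute_replacing_variable(ds, "call_dosage")
```

In both cases the input is left untouched. The difference is that under convention B the name `call_dosage` refers to different values before and after the call, which is the mutate-like behaviour raised in the linked discussion, whereas convention A requires downstream code to ask for the `_imputed` name.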

jeromekelleher · 2022-09-13T08:47:34Z

I see, thanks. Hmm, not much choice other than to create a bunch of new variables then.

mergify · 2023-03-29T13:17:50Z