Internal and external handling of rounded and censored variates & data #50

Open
pglpm opened this issue Aug 5, 2024 · 0 comments
Labels
enhancement New feature or request

pglpm (Owner) commented Aug 5, 2024

Rounded data – sometimes technically called grouped – can lead to artefacts if they're treated as continuous; see for example the studies in https://doi.org/10.1214/ss/1177012601, https://doi.org/10.1214/aos/1176348396, and in the references cited there. This is especially true for Bayesian nonparametric methods: owing to rounding, multiple datapoints can end up having identical values, and the nonparametric inference would conclude that there must be a concentration of probability – delta-distributions – at such values.
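The appearance of identical values can be illustrated with a small sketch. The measurements and the helper `count_ties` below are hypothetical and are not part of the software; the snippet only shows that values which are all distinct before rounding can all become tied afterwards.

```python
from collections import Counter

# Hypothetical precise measurements; in practice these are never observed.
precise = [1.62, 1.58, 1.71, 1.66, 1.74, 1.69, 1.57, 1.63]

# The same measurements as they would be recorded with one decimal place.
rounded = [round(x, 1) for x in precise]


def count_ties(values):
    """Count the datapoints that share their value with at least one other."""
    counts = Counter(values)
    return sum(n for n in counts.values() if n > 1)


ties_precise = count_ties(precise)  # no two precise values coincide
ties_rounded = count_ties(rounded)  # every rounded value has duplicates
```

A method that treats the rounded values as exact continuous observations would see only two distinct values, each repeated four times, and could read that repetition as evidence for point masses.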

The present software can handle rounded data properly, so no such artefacts appear.

There is, however, a difference in how such variates must be handled when drawing inferences about new points or subjects, depending on which of two cases applies:

  1. Although the data used for learning are rounded, the values of new points will not be rounded.
  2. The values of new points will be rounded, just like the data used for learning.

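In case 1. we have two options: ("round") round the precise value in the same way as the data used for learning, and use the rounded value for the inference; ("keep") use the precise value for the inference. Option ("keep") can in some situations lead to improved inferences, since the precise value carries more information than its rounded counterpart. In case 2. only the rounded value is available, so it is used as it is.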
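The difference between the two options can be sketched with a toy model. The Gaussian distribution and all function names below are assumptions made purely for illustration; the software itself is nonparametric and this is not its implementation. Under ("round") a new value enters the inference through the probability of its whole rounding interval, whereas under ("keep") it enters through the density at the precise value.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a Gaussian (toy model)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def normal_pdf(x, mean, sd):
    """Probability density of a Gaussian (toy model)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def round_to_step(x, step):
    """Round x to the nearest multiple of step, with halves rounded up."""
    return step * math.floor(x / step + 0.5)


def evidence_round(x, step, mean, sd):
    """Option ("round"): probability of the rounding interval that contains x."""
    centre = round_to_step(x, step)
    upper = normal_cdf(centre + step / 2.0, mean, sd)
    lower = normal_cdf(centre - step / 2.0, mean, sd)
    return upper - lower


def evidence_keep(x, mean, sd):
    """Option ("keep"): density evaluated at the precise value x."""
    return normal_pdf(x, mean, sd)


# Two hypothetical new points that fall in the same rounding interval [1.5, 2.5).
x1, x2 = 1.52, 1.98
same_under_round = evidence_round(x1, 1.0, 0.0, 1.0) == evidence_round(x2, 1.0, 0.0, 1.0)
differ_under_keep = evidence_keep(x1, 0.0, 1.0) != evidence_keep(x2, 0.0, 1.0)
```

In this sketch the two points are indistinguishable under ("round") but not under ("keep"), which is the sense in which the precise value can carry extra information.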

Both options could be implemented in the software, but for the moment only option ("round") is used. Future development could offer option ("keep") as well. This requires some thought on how to implement it efficiently in functions like samplesFdistribution() and mutualinfo().

Censored data are a special case of this, where the grouping only happens at the boundaries of the variate's domain. The same considerations and options apply. The software for the moment uses option ("round") for these too.
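As a minimal sketch of why censoring is grouping at the boundaries, the hypothetical helper below (not a function of the software) records any value beyond the domain boundaries as the boundary itself and leaves interior values untouched. The domain [0, 10] and the values are made up for the example.

```python
def censor(x, lower, upper):
    """Record x as the nearest boundary if it lies outside [lower, upper]."""
    return min(max(x, lower), upper)


# Hypothetical precise values and a hypothetical domain [0, 10].
precise = [-1.3, 3.2, 7.8, 10.4, 12.5]
recorded = [censor(x, 0.0, 10.0) for x in precise]
# Interior values are unchanged; both values above 10 collapse onto 10.0.
```

Applying the same function to the precise value of a new point before the inference corresponds to option ("round") for a censored variate.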

@pglpm pglpm added enhancement New feature or request invalid This doesn't seem right labels Aug 5, 2024
@pglpm pglpm self-assigned this Aug 5, 2024
@pglpm pglpm removed the invalid This doesn't seem right label Aug 5, 2024