Multivariate implementation #110
* Port bivariate estimation to multivariate estimation
* Remove now superfluous implementation of bivariate estimation
* Export MultivariateDistribution
* Support x and y properties of BivariateKDE again
* Fix small issues
* Add tests for multivariate estimation
* Generalize interpolation
* Implement syntactic sugar for trivariate KDEs
* Simplify generated function
* Add test for multivariate dimensions
* Add trivariate impl
* Update Readme
* Extend interpolation section in readme
* Fix readme
Are there any plans to merge this PR?
It is on my TODO list to review and merge, it looks very nicely done so I am sure it will get merged. Apologies for the delay, this repository is community-maintained.
No problem. I completely understand. The reason I ask is that I have a use case for this feature and would like to use it as soon as possible. I went through the PR mostly from the perspective of a user to see whether I could understand how to use the new feature. I wanted to test the multivariate KDE on a multivariate normal. Here is my setup:
However, it was unclear to me how to compare the interpolated and true pdfs. Here is what I found:
I tried passing the test data in different forms, but none of them returned the expected output. Another issue I noticed is that
such that the rows are samples, and the columns are indexed variables. If possible, please let me know how to use Let me know if I can help in any way.
Update
It appears that n = 3 is the state of the art. The following Python code also hangs, but doesn't nearly freeze my computer. It might be good to note this limitation in the README and docstrings.
Note that a KDE of 4-dimensional data must produce a 4-dimensional array. So it is likely that your computer runs out of memory. You are right that a warning regarding that could be added to the documentation.
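To make the memory argument concrete, here is a rough sketch of how a dense grid of density values grows with the number of dimensions. The helper `grid_bytes` is hypothetical, and the grid size of 256 points per dimension is an assumption for illustration; the grid size this PR actually uses may differ.

```julia
# Hypothetical helper (not part of KernelDensity.jl): bytes needed to store a
# dense grid of Float64 density values with `npoints` grid points along each
# of `ndims` dimensions.
grid_bytes(npoints::Int, ndims::Int) = npoints^ndims * sizeof(Float64)

# Assumed grid size of 256 points per dimension, for illustration only.
for d in 1:5
    gib = grid_bytes(256, d) / 2^30
    println("dimensions = ", d, ": ", gib, " GiB")
end
```

Under these assumptions, three dimensions need 0.125 GiB while four dimensions already need 32 GiB, which would be consistent with a machine running out of memory at n = 4.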
I would claim that this is rather a peculiarity of Distributions.jl. Usually, Julia's column-major memory layout for arrays suggests that samples are represented as rows and features as columns. I just checked that recently for another purpose and to my knowledge all implementations of the Tables.jl interface store features contiguously in memory. So it is reasonable that the API of
So my implementation just generalises what the bivariate one did before. However, you might be right that the implementation was and is a bit strange. Let's have a look.
old:
pdf(ik::InterpKDE, xs::AbstractVector, ys::AbstractVector) = [ik.itp(x, y) for x in xs, y in ys]
new:
function pdf(ik::InterpKDE{K, I}, xs::Vararg{AbstractVector, N}) where {N, R, K <: MultivariateKDE{N, R}, I}
    [ik.itp(x...) for x in Iterators.product(xs...)]
end
So Maybe someone more familiar with the original implementation can comment on that? According to
Edit: I just noted that the same confusion arose in #102.
@andreasKroepelin, thank you for your explanation. What you said makes sense to me. You are right. I was expecting a vector with 20 elements, comparable to It might be worth changing that behavior. In the meantime, can you recommend a workaround?
You can basically implement the right behaviour yourself, maybe in combination with what was suggested in #102:
kd = kde(kde_data)
test_vals = rand(true_dist, 20)
est_dist = InterpKDE(kd)
est_pdfs = [est_dist.itp(col...) for col in eachcol(test_vals)]
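The difference between the grid evaluation done by the `pdf` method quoted earlier and the pointwise evaluation in this workaround can be sketched without KernelDensity.jl at all. In the snippet below, `f` is only a stand-in for `ik.itp` (an assumption for illustration, not the real interpolant), and `pts` mimics the variables-by-samples layout that `rand(true_dist, 20)` returns for a three-dimensional distribution.

```julia
# Stand-in for `ik.itp`: any function of three coordinates.
# This is an assumption for illustration, not the real interpolant.
f(x, y, z) = exp(-(x^2 + y^2 + z^2) / 2)

# 3 variables (rows) x 20 samples (columns).
pts = randn(3, 20)

# Grid evaluation, in the style of the multivariate `pdf` method:
# one value for every combination of the given coordinates.
grid_vals = [f(p...) for p in Iterators.product(pts[1, :], pts[2, :], pts[3, :])]

# Pointwise evaluation, in the style of the workaround:
# one value per sample (column).
point_vals = [f(col...) for col in eachcol(pts)]

println(size(grid_vals))   # 20 x 20 x 20 array
println(size(point_vals))  # vector with 20 elements
```

The pointwise values are exactly the "diagonal" entries `grid_vals[i, i, i]` of the grid result, so the grid version computes far more values than are needed to compare against the true pdf at the sample points.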
This PR ports the previously existing bivariate implementation to a multivariate one, such that KernelDensity.jl now supports arbitrary dimensions.
The bivariate estimation is now just a special case of the more general implementation. Note that this is backwards compatible because a BivariateKDE has the same properties as before and the multivariate implementation introduces no runtime overhead compared to the previous bivariate-specific one.