
Multivariate implementation #110

Open · wants to merge 1 commit into base: master
Conversation

andreasKroepelin

This PR ports the previously existing bivariate implementation to a multivariate one, such that KernelDensity.jl now supports arbitrary dimensions.
The bivariate estimation is now just a special case of the more general implementation. Note that this is backwards compatible, because a BivariateKDE has the same properties as before and the multivariate implementation introduces no runtime overhead compared to the previous bivariate-specific one.

* Port bivariate estimation to multivariate estimation

* Remove now superfluous implementation of bivariate estimation

* Export MultivariateDistribution

* Support x and y properties of BivariateKDE again

* Fix small issues

* Add tests for multivariate estimation

* Generalize interpolation

* Implement syntactic sugar for trivariate KDEs

* Simplify generated function

* Add test for multivariate dimensions

* Add trivariate impl

* Update Readme

* Extend interpolation section in readme

* Fix readme
@itsdfish

itsdfish commented Feb 8, 2023

Are there any plans to merge this PR?

@tpapp
Collaborator

tpapp commented Feb 8, 2023

It is on my TODO list to review and merge; it looks very nicely done, so I am sure it will get merged.

Apologies for the delay, this repository is community-maintained.

@itsdfish

itsdfish commented Feb 9, 2023

No problem. I completely understand.

The reason I ask is that I have a use case for this feature and would like to use it as soon as possible.

I went through the PR mostly from the perspective of a user to see whether I could understand how to use the new feature.

I wanted to test the multivariate KDE on a multivariate normal. Here is my setup:

using KernelDensity
using Interpolations
using Distributions
using LinearAlgebra
using Random 

Random.seed!(50)

# number of dimensions 
n = 2
# vector of means 
μ = zeros(n)
# covariance matrix 
Σ = I(n)
# true distribution 
true_dist = MvNormal(μ, Σ)
# data for the kde
data = rand(true_dist, 10_000)

kde_data = Tuple(data[i,:] for i ∈ 1:size(data,1))
kd = kde(kde_data)
# estimated distribution 
est_dist = InterpKDE(kd)

However, it was unclear to me how to compare the interpolated and true pdfs. Here is what I found:

test_vals = rand(true_dist, 20)
kde_test_vals = Tuple(test_vals[i,:] for i ∈ 1:size(test_vals,1))
# why does this return a 20 X 20 matrix? 
est_pdfs = pdf(est_dist, kde_test_vals...)
# throws an error 
# est_pdfs = pdf(est_dist, test_vals)
true_pdfs = pdf(true_dist, test_vals)

I tried passing the test data in different forms, but none of them returned the expected output.

Another issue I noticed is that kde hangs and nearly freezes my computer when n=4. I'm not sure whether that is to be expected, but that limitation is not clear to the user. One last issue that might be addressed in the README file: the dimensions of the matrix should be explained. When the data are sampled from MvNormal, they must be transposed:

kd = kde(data')

such that the rows are samples and the columns are variables.

If possible, please let me know how to use pdf with multiple inputs.

Let me know if I can help in any way.

Update

It appears that n = 3 is the state of the art. The following Python code also hangs, but doesn't nearly freeze my computer. It might be good to note this limitation in the README and doc strings.

import numpy as np
from fastkde import fastKDE
import pylab as PP

N = 100
var1 = 50*np.random.normal(0, 1, N) + 0.1
var2 = 50*np.random.normal(0, 1, N) + 0.1
var3 = 50*np.random.normal(0, 1, N) + 0.1
var4 = 50*np.random.normal(0, 1, N) + 0.1

myPDF,axes = fastKDE.pdf(var1,var2, var3, var4)

@andreasKroepelin
Author

andreasKroepelin commented Feb 9, 2023

Another issue I noticed is that kde hangs and nearly freezes my computer when n=4.

Note that a KDE of 4-dimensional data must produce a 4-dimensional array. So it is likely that your computer runs out of memory. You are right that a warning regarding that could be added to the documentation.
I guess it is possible to have a lot of fun with sparse data structures to make this work for higher dimensions, but as far as I understand, this package is supposed to be a general purpose "standard" implementation without such fancy things.
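To make the memory growth concrete, here is a rough back-of-the-envelope sketch. The grid size of 256 points per dimension is an assumption for illustration only, not necessarily the package's actual default:

```julia
# Dense KDE output grid: g points per dimension, d dimensions,
# 8 bytes per Float64 entry. Purely illustrative numbers.
grid_bytes(g, d) = 8.0 * Float64(g)^d

for d in 1:4
    gib = grid_bytes(256, d) / 2^30
    println("d = $d: ", round(gib, sigdigits = 3), " GiB")
end
```

With 256 points per dimension, the jump from d = 3 (0.125 GiB) to d = 4 (32 GiB) would explain why a 4-dimensional call can exhaust memory.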

One last issue that might be addressed in the README file: the dimensions of the matrix should be explained. When the data are sampled from MvNormal, they must be transposed:

I would claim that this is rather a peculiarity of Distributions.jl. Usually, Julia's column-major memory layout for arrays suggests that samples are represented as rows and features as columns. I checked this recently for another purpose, and to my knowledge all implementations of the Tables.jl interface store features contiguously in memory. So it is reasonable that the API of kde assumes that as well. Maybe it could be documented better.
I believe the reason why Distributions.jl's rand function behaves differently is because samples are produced one after the other so for them it is beneficial cache-wise to store samples as columns.
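A minimal sketch of the two layouts, using a plain random matrix in place of an actual MvNormal sample (the shapes match what rand(dist, n) from Distributions.jl produces):

```julia
# Distributions.jl's rand(dist, n) returns a d × n matrix:
# one sample per column.
data = randn(2, 5)                  # shape (2, 5), samples as columns

# kde expects one sample per row, so transpose first:
samples_as_rows = permutedims(data) # shape (5, 2), samples as rows
size(samples_as_rows)               # (5, 2)
```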

# why does this return a 20 X 20 matrix?

So my implementation just generalises what the bivariate one did before. However, you might be right that the implementation was and is a bit strange. Let's have a look.
old:

pdf(ik::InterpKDE,xs::AbstractVector,ys::AbstractVector) = [ik.itp(x,y) for x in xs, y in ys]

new:

function pdf(ik::InterpKDE{K, I}, xs::Vararg{AbstractVector, N}) where
    {N, R, K <: MultivariateKDE{N, R}, I}

    [ik.itp(x...) for x in Iterators.product(xs...)]
end

So for x in xs, y in ys becomes for x in Iterators.product(xs...), basically. This product is the reason why you get a 20x20 matrix. I believe the implementation should instead be for xz in zip(xs, ys) (old), or for x_tuple in zip(xs...) (new). Then, you would obtain a vector of length 20 in your case, which is probably what you are expecting, right?
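The difference between the two iteration schemes can be seen directly in a minimal sketch, independent of the KDE code:

```julia
xs = [1, 2, 3]
ys = [10, 20, 30]

# product: every combination of coordinates -> a 3×3 matrix of
# tuples. This is why pdf currently returns a 20×20 matrix when
# given 20 x-coordinates and 20 y-coordinates.
grid = collect(Iterators.product(xs, ys))
size(grid)    # (3, 3)

# zip: element-wise pairing -> a length-3 vector of tuples,
# i.e. one evaluation point per entry.
pairs = collect(zip(xs, ys))
length(pairs) # 3
```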

Maybe someone more familiar with the original implementation can comment on that? According to git blame, the code is at least seven years old, so there is probably no point in asking the author 😅

Edit: I just noted that the same confusion arose in #102.

@itsdfish

itsdfish commented Feb 9, 2023

@andreasKroepelin, thank you for your explanation. What you said makes sense to me.

You are right. I was expecting a vector with 20 elements, comparable to pdf(MvNormal()).

It might be worth changing that behavior. In the meantime, can you recommend a workaround?

@andreasKroepelin
Author

You can basically implement the right behaviour yourself, maybe in combination with what was suggested in #102:

kd = kde(kde_data)
test_vals = rand(true_dist, 20)
est_dist = InterpKDE(kd)
est_pdfs = [est_dist.itp(col...) for col in eachcol(test_vals)]
