[WIP] Add a mutual information CQM selector #16

Open · wants to merge 1 commit into main
Conversation

juansebastianl (Contributor)

This pull request aims to add a mutual information feature selector by constructing a mutual information matrix whose diagonal elements are $I(Y;X_i)$ and whose off-diagonal elements are $I(Y;X_i,X_j)$. This is estimated via a discrete approximation which is $O(m^2 \cdot (\log n)^3)$ in time and space complexity, where $n$ is the number of observations and $m$ is the number of features. I want to improve this, at least in space complexity, because much of the large matrix is sparse and memory quickly becomes a constraint; there are also still some numerical errors in this version.
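For intuition, here is a minimal pure-NumPy sketch of the histogram-based discrete estimator described above. The function names, the fixed bins argument, and the symmetric fill of the off-diagonal entries are illustrative assumptions, not this PR's actual API.

import numpy as np

# Illustrative sketch only -- not the PR's implementation.
def discrete_mi(*cols, bins=10):
    """Estimate I(cols[0]; cols[1:]) in nats from a joint histogram."""
    joint, _ = np.histogramdd(np.column_stack(cols), bins=bins)
    p = joint / joint.sum()
    p_y = p.sum(axis=tuple(range(1, p.ndim)), keepdims=True)  # marginal of Y
    p_x = p.sum(axis=0, keepdims=True)                        # joint marginal of the X's
    nz = p > 0  # only cells with mass contribute to the sum
    return float((p[nz] * np.log(p[nz] / (p_y * p_x)[nz])).sum())

def mi_matrix(X, y, bins=10):
    m = X.shape[1]
    M = np.empty((m, m))
    for i in range(m):
        M[i, i] = discrete_mi(y, X[:, i], bins=bins)  # I(Y; X_i)
        for j in range(i + 1, m):
            # I(Y; X_i, X_j), stored symmetrically
            M[i, j] = M[j, i] = discrete_mi(y, X[:, i], X[:, j], bins=bins)
    return M

With bins on the order of $\log n$, each off-diagonal entry requires an $O((\log n)^3)$ joint histogram and there are $O(m^2)$ pairs, matching the complexity quoted above; most histogram cells are empty, which is presumably the sparsity behind the memory concern.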

Comment on lines +5 to +7

cdef extern from "math.h":
double log(double x) nogil
Member

Suggested change
cdef extern from "math.h":
double log(double x) nogil
from libc.math cimport log

You can see it here. If you're wondering how to find what Cython actually has already implemented... the answer is that I literally search the source files every. single. time. 😄
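One hedged shortcut for that search, assuming the standard layout of a released Cython install: the bundled declaration files live under Cython/Includes, so you can list and grep them from Python.

# Locate Cython's bundled .pxd declarations to check what is already wrapped.
# The Includes/libc layout is the standard one in released Cython versions.
import os
import Cython

libc_dir = os.path.join(os.path.dirname(Cython.__file__), "Includes", "libc")
print(sorted(os.listdir(libc_dir)))  # e.g. ['math.pxd', 'stdio.pxd', ...]
with open(os.path.join(libc_dir, "math.pxd")) as f:
    print("log" in f.read())         # True: log is already declared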

cdef extern from "math.h":
double log(double x) nogil

def calculate_mi(np.ndarray[np.float_t,ndim = 2] X,
Member

In modern Cython, they want us to use Typed Memoryviews. So for instance here it would be

Suggested change
def calculate_mi(np.ndarray[np.float_t,ndim = 2] X,
def calculate_mi(np.float_t[:, :] X,

np.ndarray[np.float_t,ndim = 1] y,
unsigned long refinement_factor = 5):
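Extending that suggestion to the rest of the signature, a sketch (the ::1 slots assume C-contiguous inputs, which NumPy produces by default):

# Typed-memoryview version of the full signature: 2-D for X, 1-D for y.
# Declaring C-contiguity with ::1 enables the fastest indexing mode.
def calculate_mi(np.float_t[:, ::1] X,
                 np.float_t[::1] y,
                 unsigned long refinement_factor = 5):
    ...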

cdef unsigned long n_obs = X.shape[0]
Member

In general, I have found it far less of a headache, when going back and forth between NumPy and Cython, to use fixed-width types everywhere. So

Suggested change
cdef unsigned long n_obs = X.shape[0]
cdef np.uint64_t n_obs = X.shape[0]

Cython, because everything gets mediated through an intermediate generated C file, tends to get confused about typing, and keeping everything fixed-width tends to help.
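A small sketch of what that looks like in practice, pairing each C declaration with the matching NumPy dtype:

# Fixed-width C types paired with the corresponding NumPy dtypes, so the
# two sides agree exactly on every platform.
cdef np.uint64_t n_obs = X.shape[0]
cdef np.float64_t acc = 0.0
cdef np.int64_t[::1] counts = np.zeros(n_obs, dtype=np.int64)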

raise ValueError("y is the wrong shape")

cdef long sub_index_size = refinement_factor*int(np.round(log(n_obs))) + 2
cdef total_state_space_size = pow(sub_index_size,3)
Member

I think you want `pow`?
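One possible typed version of that line, as a sketch; note that cdef total_state_space_size as written has no C type, so it falls back to a Python object:

# Explicitly typed: a fixed-width declaration, with ** keeping the
# exponentiation in C instead of calling Python's builtin pow().
cdef np.int64_t total_state_space_size = sub_index_size ** 3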


cdef extern from "math.h":
double log(double x) nogil

Member

You'll get a pretty big performance boost (at the cost of safety) by using

Suggested change
@cython.boundscheck(False) # Deactivate bounds checking
@cython.wraparound(False) # Deactivate negative indexing.

see Cython for NumPy users, which is an excellent tutorial.
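Applied to this module, it would look something like this sketch:

cimport cython

@cython.boundscheck(False)  # skip per-access bounds checks
@cython.wraparound(False)   # give up negative indexing in exchange for speed
def calculate_mi(np.float_t[:, ::1] X,
                 np.float_t[::1] y,
                 unsigned long refinement_factor = 5):
    ...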


cdef assign_nn(np.ndarray[np.float_t,ndim = 2] X,
long sub_index_size,
short seed = 12121):
Member

In general, it's preferred to make the default seed None.

Also, in this case you're better off not specifying a C type for seed, because it's only passed to a Python function, np.random.default_rng(seed), below. If it starts out as a C type, it has to be converted to a Python type first. Of course, in this case the performance hit is so minor as to be pointless to worry about... but the other advantage of using object here rather than short is that it supports None or another rng.
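A sketch of the suggested signature (the 2-D memoryview for X is assumed, consistent with the other comments):

cdef assign_nn(np.float_t[:, ::1] X,
               long sub_index_size,
               object seed = None):
    # seed stays an untyped Python object defaulting to None, so callers can
    # pass nothing, an int, or an existing np.random.Generator straight
    # through to default_rng.
    rng = np.random.default_rng(seed)
    ...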

return query(query_number, query_list[:new_len], idx - new_idx_unit)
else:
#print("query : ", query_number, " pivot: ", pivot_value, " new index: ", idx + new_idx_unit)
return query(query_number, query_list[new_len:], idx + new_idx_unit)
Member

I think you may be able to get tail recursion if, instead of using query_list[new_len:], you pass the index. query_list[new_len:] makes a new object, a view, so it doesn't copy the underlying data, but it's still a new object being created.
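A hedged sketch of the index-passing version; I'm assuming query is bisecting a sorted query_list for the bin containing query_number, and writing it as an explicit loop since neither Python nor Cython eliminates tail calls:

# Bisect with [lo, hi) indices over one shared memoryview instead of slicing;
# the loop replaces the recursion, so no intermediate view objects are made.
cdef long query_by_index(double query_number, double[::1] query_list):
    cdef long lo = 0
    cdef long hi = query_list.shape[0]
    cdef long mid
    while hi - lo > 1:
        mid = lo + (hi - lo) // 2
        if query_number < query_list[mid]:  # mid plays the pivot role
            hi = mid
        else:
            lo = mid
    return lo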

Comment on lines +81 to +84
cdef long[::1] full_index = np.zeros(n_obs, dtype=long)

for i in range(n_obs):
full_index[i] = i
Member

Suggested change
cdef long[::1] full_index = np.zeros(n_obs, dtype=long)
for i in range(n_obs):
full_index[i] = i
cdef np.int64_t[::1] full_index = np.arange(n_obs, dtype=np.int64)

install_requires=[
"numpy",
],
ext_modules = cythonize(["./dwave/plugins/sklearn/nearest_neighbors.pyx"]),
Member

Suggested change
ext_modules = cythonize(["./dwave/plugins/sklearn/nearest_neighbors.pyx"]),
ext_modules = cythonize(["./dwave/plugins/sklearn/nearest_neighbors.pyx"], annotate=True),

Cython annotations are super helpful. We do some stuff with environment variables in some of our other packages (e.g. dimod) but there's no harm in just turning it on always. At some point I'll get around to making a PR to dimod to just change that to always be True.
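For reference, the environment-variable pattern mentioned above looks roughly like this; CYTHON_ANNOTATE is a hypothetical name, not dimod's actual variable:

# setup.py sketch: gate annotation on an environment variable.
import os
from setuptools import setup
from Cython.Build import cythonize

setup(
    install_requires=["numpy"],
    ext_modules=cythonize(
        ["./dwave/plugins/sklearn/nearest_neighbors.pyx"],
        annotate=bool(os.getenv("CYTHON_ANNOTATE", "")),
    ),
)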
