-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add a mutual information CQM selector #16
base: main
Are you sure you want to change the base?
[WIP] Add a mutual information CQM selector #16
Conversation
|
||
cdef extern from "math.h": | ||
double log(double x) nogil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cdef extern from "math.h": | |
double log(double x) nogil | |
from libc.math cimport log |
You can see it here. If you're wondering how to find what Cython actually has already implemented... the answer is that I literally search the source files every. single. time. 😄
cdef extern from "math.h": | ||
double log(double x) nogil | ||
|
||
def calculate_mi(np.ndarray[np.float_t,ndim = 2] X, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In modern Cython, they want us to use Typed Memoryviews. So for instance here it would be
def calculate_mi(np.ndarray[np.float_t,ndim = 2] X, | |
def calculate_mi(np.float_t[:] X, |
np.ndarray[np.float_t,ndim = 1] y, | ||
unsigned long refinement_factor = 5): | ||
|
||
cdef unsigned long n_obs = X.shape[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I have found it far less headache when going back and forth between NumPy and Cython to use fixed-width types everywhere. So
cdef unsigned long n_obs = X.shape[0] | |
cdef np.uint64_t n_obs = X.shape[0] |
Cython, because everything gets moderated through an intermediate generated file, tends to get confused about typing, and keeping everything fixed tends to help.
raise ValueError("y is the wrong shape") | ||
|
||
cdef long sub_index_size = refinement_factor*int(np.round(log(n_obs))) + 2 | ||
cdef total_state_space_size = pow(sub_index_size,3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want pow?
|
||
cdef extern from "math.h": | ||
double log(double x) nogil | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll get a pretty big performance boost (at the cost of safety) by using
@cython.boundscheck(False) # Deactivate bounds checking | |
@cython.wraparound(False) # Deactivate negative indexing. |
see Cython for NumPy users which is an excellent tutorial.
|
||
cdef assign_nn(np.ndarray[np.float_t,ndim = 2] X, | ||
long sub_index_size, | ||
short seed = 12121): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, it's preferred to to make the default seed None
.
Also, in this case you're better off not specifying a C type for seed
. This is because it's only passed to a python function, np.random.default_rng(seed)
, below. So if it starts out as a C-type, it will have to be converted to a Python type first. Of course, in this case the performance hit is so minor as to be pointless... but the other advantage of using object
here rather than short
is it supports None
or another rng.
return query(query_number, query_list[:new_len], idx - new_idx_unit) | ||
else: | ||
#print("query : ", query_number, " pivot: ", pivot_value, " new index: ", idx + new_idx_unit) | ||
return query(query_number, query_list[new_len:], idx + new_idx_unit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you may be able to get tail recursion if instead of using query_list[new_len:]
you pass the index. query_list[new_len:]
makes a new object, a view, so it doesn't copy the underlying data, but it's still a new object being created.
cdef long[::1] full_index = np.zeros(n_obs, dtype=long) | ||
|
||
for i in range(n_obs): | ||
full_index[i] = i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cdef long[::1] full_index = np.zeros(n_obs, dtype=long) | |
for i in range(n_obs): | |
full_index[i] = i | |
cdef np.int64_t[::1] full_index = np.ones(n_obs, dtype=np.int64) |
install_requires=[ | ||
"numpy", | ||
], | ||
ext_modules = cythonize(["./dwave/plugins/sklearn/nearest_neighbors.pyx"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ext_modules = cythonize(["./dwave/plugins/sklearn/nearest_neighbors.pyx"]), | |
ext_modules = cythonize(["./dwave/plugins/sklearn/nearest_neighbors.pyx"], annotate=True), |
Cython annotations are super helpful. We do some stuff with environment variables in some of our other packages (e.g. dimod) but there's no harm in just turning it on always. At some point I'll get around to making a PR to dimod to just change that to always be True
.
This pull requests aims to add a mutual information feature selector by constructing a mutual information matrix where the diagonal elements are$I(Y;X_i)$ and the off-diagonal elements are $I(Y;X_i,X_j)$ . This is estimated via discrete approximation which is $O(m^2 \cdot (\log(n))^3)$ in time and space complexity, where the number of observations is $n$ and the number of features is $m$ . I want to improve this, at least in space complexity because a lot of the large matrix is sparse and memory becomes a constraint quickly, and there are still some numerical errors in this version.