Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Composite option, dataString compression, and correct splits percentage #40

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

pjaselin
Copy link

@pjaselin pjaselin commented Dec 27, 2021

Hi @topepo! I've been working on a port of your code into Python (I believe Kirk mentioned that) over here: https://github.com/pjaselin/Cubist. Thank you so much for all the work you are your colleagues have done on this!

Some improvements/fixes I've made here:

  • Composite option: in one parameter you can choose whether to use instance-based correction or let Cubist decide. I think this is helpful because instance-based correction adds a little opacity to the prediction process.
  • Moved number of nearest neighbors to cubistControl: I understand your intent was to change the number of neighbors at predict time since that's where it comes into use but I wonder if it makes more sense to be part of training as it is used in model evaluation.
  • dataString compression: I have the dataString compressed so the model has a smaller memory footprint when using instance-based correction.
  • Correct splits percentage: I noticed that you had hardcoded "<=" at
    sum(x[, as.character(splits$variable[i])] <= splits$value[i]) / nrow(x)

    so I came up with a way to get the right comparison operator based on the model. (This is probably the one definite fix here)

Let me know if you'd like to break this apart and I'd be happy to take feedback!

@pjaselin
Copy link
Author

pjaselin commented Dec 27, 2021

Also, giving the composite option should help accelerate prediction times for cases like #28

@topepo
Copy link
Owner

topepo commented Feb 4, 2022

Thanks for doing this; it's great to have an outside contribute.

I've been looking at it for a few days and I'd like to tweak the ui. It's a little awkward to have an argument that could be logical or character. How about we

  • control the composite/model-only decision using the neighbors argument (values > 0 are composite)?
  • control the auto/manual decision with auto = TRUE/FALSE?

I think that would get us to the same place. Also, the current specification breaks a lot of existing analyses (in books, vignettes, and so on). I'd like it to be backward compatible and defaulting to auto = FALSE would do that.

@pjaselin
Copy link
Author

pjaselin commented Feb 8, 2022

Hi @topepo, I really appreciate your feedback and ideas here. Also thank you for allowing me to contribute! This is a much cleaner UI and I think I'll modify my Python implementation to match it.
Changes made:

  • auto=FALSE: decides whether to allow Cubist to decide whether to use composite models
  • neighbors=NA: decides whether to use composite models
  • cv=NA: run k-fold cross-validation (this is the last of my improvements in the Python code if you'd like to use it here, note that no model is returned by Cubist)

Also definitely please make sure this all passes your tests beyond mine!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants