Separating repeated processing from classifier models #70

Open

kkarrancsu opened this issue Jan 25, 2018 · 2 comments

@kkarrancsu
Contributor

Between different runs of ATM, the outputs of every step in the pipeline are "static" except for the input and output of the classifier chosen by BTB. For example, if PCA is part of the pipeline, ATM/BTB will recompute the PCA on the same dataset every time it chooses a new model to run. Unless I'm misunderstanding the data flow, this seems inefficient. The current pipeline is fairly simple (scaling/PCA), but people may want to add more computationally intensive steps.

We could split the pipeline in two: a "static" pipeline, whose outputs are stored to disk so they can be recalled between runs, and a "dynamic" pipeline, which is essentially the classifier plus any blocks that change with the ATM/BTB model being run.
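To make the split concrete, here is a minimal sketch using scikit-learn-style components. The variable names and the `make_classification` data are illustrative assumptions, not ATM's actual API:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, y_train = make_classification(n_samples=500, n_features=20)

# "Static" half: identical for every model BTB proposes, so it only
# needs to be fit once per dataset and its output can be reused.
static_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=10)),
])
X_static = static_pipeline.fit_transform(X_train)

# "Dynamic" half: rebuilt for each ATM/BTB proposal; it consumes the
# precomputed static output instead of re-running scaling and PCA.
for params in [{'C': 0.1}, {'C': 1.0}, {'C': 10.0}]:
    clf = LogisticRegression(**params)
    clf.fit(X_static, y_train)
```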

If you think this is a good idea, how should we architect it? One approach is to compute the static pipeline before the test_classifier method runs and save its output to the data directory where the train/test dataset is already being saved.
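For the on-disk side, something like the following could work. This is a sketch only; the `cached_static_transform` helper, file naming, and hash-based cache key are assumptions, not ATM's actual data-directory layout:

```python
import hashlib
import os

import numpy as np


def cached_static_transform(static_pipeline, X, data_dir):
    """Fit-transform the static pipeline, reusing a cached result if the
    same dataset has already been processed."""
    # Key the cache on the raw data, so a changed dataset invalidates it.
    key = hashlib.sha1(np.ascontiguousarray(X).tobytes()).hexdigest()[:12]
    path = os.path.join(data_dir, 'static_{}.npy'.format(key))
    if os.path.exists(path):
        return np.load(path)
    X_static = static_pipeline.fit_transform(X)
    np.save(path, X_static)
    return X_static
```

A real implementation would presumably also need to persist the fitted pipeline itself (e.g., with joblib) so the held-out test set can be transformed consistently with the training data.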

@bcyphers
Contributor

Good points.

There actually was architecture for this in earlier versions of ATM: if PCA was part of the pipeline, it was computed first, the intermediate data representation was saved to disk, and the cached version was loaded later on. I removed that functionality during a big refactor because it was complicating other parts of the code, and caching just the PCA didn't give us much speedup.

I do think it makes sense to do this down the line, but only if we add other static preprocessing steps (as you mentioned in #71). Until then, I think building the caching infrastructure would be premature optimization.

@micahjsmith
Member

This could be implemented as a feature of MLBlocks (#113) and should wait on the resolution of that issue.
