Separating repeated processing from classifier models #70

Open

kkarrancsu opened this issue Jan 25, 2018 · 2 comments

@kkarrancsu
Contributor

Between different runs of ATM, the outputs of every step in the pipeline are "static" except for the input and output of the classifier chosen by BTB. For example, if PCA is part of the pipeline, ATM/BTB will recompute the PCA on the same dataset every time it chooses a new model to run. Unless I'm misunderstanding the data flow, this seems inefficient. The current pipeline is fairly simple (scaling/PCA), but people may want to add more computationally intensive steps.

We could split the pipeline in two: a "static" pipeline, whose outputs are stored to disk so they can be recalled between runs, and a "dynamic" pipeline, which is essentially the classifier plus any blocks that change with the ATM/BTB model being run.
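To make the split concrete, here is a minimal sketch using scikit-learn-style components. The variable names and the `make_classification` data are illustrative assumptions, not ATM's actual API:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, y_train = make_classification(n_samples=500, n_features=20)

# "Static" half: identical for every model BTB proposes, so it only
# needs to be fit once per dataset and its output can be reused.
static_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=10)),
])
X_static = static_pipeline.fit_transform(X_train)

# "Dynamic" half: rebuilt for each ATM/BTB proposal; it consumes the
# precomputed static output instead of re-running scaling and PCA.
for params in [{'C': 0.1}, {'C': 1.0}, {'C': 10.0}]:
    clf = LogisticRegression(**params)
    clf.fit(X_static, y_train)
```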

If you think this is a good idea, how should we architect it? One approach is to compute the static pipeline before the test_classifier method runs and save its output to the data directory where the train/test dataset is already being saved.
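For the on-disk side, something like the following could work. This is a sketch only; the `cached_static_transform` helper, file naming, and hash-based cache key are assumptions, not ATM's actual data-directory layout:

```python
import hashlib
import os

import numpy as np


def cached_static_transform(static_pipeline, X, data_dir):
    """Fit-transform the static pipeline, reusing a cached result if the
    same dataset has already been processed."""
    # Key the cache on the raw data, so a changed dataset invalidates it.
    key = hashlib.sha1(np.ascontiguousarray(X).tobytes()).hexdigest()[:12]
    path = os.path.join(data_dir, 'static_{}.npy'.format(key))
    if os.path.exists(path):
        return np.load(path)
    X_static = static_pipeline.fit_transform(X)
    np.save(path, X_static)
    return X_static
```

A real implementation would presumably also need to persist the fitted pipeline itself (e.g., with joblib) so the held-out test set can be transformed consistently with the training data.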

@bcyphers
Contributor

Good points.

There actually was architecture for this in earlier versions of ATM: if PCA was part of the pipeline, it was computed first, the intermediate data representation was saved to disk, and the cached version was loaded later on. I removed that functionality during a big refactor because it was complicating other parts of the code, and caching just the PCA didn't give us much speedup.

I do think it makes sense to do this down the line, but only if we add other static preprocessing steps (as you mentioned in #71). Until then, I think building the caching infrastructure would be premature optimization.

@micahjsmith
Member

This could be implemented as a feature of MLBlocks (#113) and should wait on the resolution of that issue.
