Between different runs of ATM, the outputs of all the steps of the pipeline are "static," except for the input and output of the classifier chosen by BTB. What I mean is: suppose PCA is in the pipeline; then every time ATM/BTB chooses a new model to run, it recomputes the PCA on the same dataset. Unless I'm misunderstanding the flow of data, this seems inefficient. Although the current pipeline is pretty simple (scaling/PCA), people may want to add more computationally intensive steps to it.
We could split the pipeline in two: a "static" pipeline, whose outputs are stored to disk so they can be recalled between runs, and a "dynamic" pipeline, which is essentially the classifier plus any blocks that change based on the ATM/BTB model being run.
If you think this is a good idea, how do we want to go about architecting this from a software perspective? One approach is to compute the static pipeline before the `test_classifier` method is run and save its output to the data directory where the train/test datasets are saved (a rough sketch of that approach is below).
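To make the idea concrete, here's a minimal sketch of what I mean. The function name `static_features`, the cache path, and `n_components=10` are all made up for illustration; the real integration point would be wherever ATM prepares the data before `test_classifier` runs:

```python
import os

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def static_features(X, cache_path):
    """Return the output of the static pipeline, computing it at most once.

    On the first run, fit the static steps (scaling + PCA) and cache the
    transformed matrix to disk; later ATM/BTB runs just reload it.
    """
    if os.path.exists(cache_path):
        return np.load(cache_path)
    X_out = StandardScaler().fit_transform(X)
    X_out = PCA(n_components=10).fit_transform(X_out)
    np.save(cache_path, X_out)  # np.save expects a path ending in .npy
    return X_out


# Each BTB proposal then only re-fits the "dynamic" part, e.g.:
#   X_static = static_features(X_train, "data/train_static.npy")
#   clf = <classifier proposed by BTB>
#   clf.fit(X_static, y_train)
```

(One wrinkle this sketch glosses over: the fitted scaler/PCA would also need to be persisted so the same transformation can be applied to the test split.)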
There actually was architecture for this in earlier versions of ATM: if PCA was part of the pipeline, it would be computed first, the intermediate data representation would be saved to disk, and the cached version would be loaded later on. I removed that function during a big refactor because it was complicating other parts of the code, and caching just the PCA didn't give us much of a speedup anyway.
I do think it makes sense to do this down the line, but only if we add other static preprocessing steps (as you mentioned in #71). Until then, I think it's premature optimization to build in the caching infrastructure.