[Splink4] Use fresh SQLPipeline for all linker methods #2060
Probably need to start with a mini Splink example using the dbapi.

Example code:
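A minimal sketch of the kind of example meant, written against the eventual Splink 4 public API; names like `DuckDBAPI`, `SettingsCreator` and `linker.inference.predict()` are assumptions relative to whatever the original snippet contained:

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

# Small bundled dataset, convenient for a minimal end-to-end run
df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
    ],
    blocking_rules_to_generate_predictions=[block_on("first_name")],
)

# The dbapi object owns the connection; the linker delegates sql execution to it
linker = Linker(df, settings, db_api=DuckDBAPI())

df_predict = linker.inference.predict()
print(df_predict.as_pandas_dataframe().head())
```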
I've tried a solution that modifies sql tasks to allow them to be materialised, and then extends SQLPipeline so it can have sub-pipelines (i.e. be subdivided into CTE sequences) which are executed in turn. It's possibly viable, but a big problem is that the pipeline class then needs to somehow be able to predict the name of the output. Maybe that could be enabled by allowing a task to know about its output table name:
```python
class SQLTask:
    def __init__(self, sql, output_table_name, materialise=False):
        self.sql = sql
        self.output_table_name = output_table_name
        self.materialise = materialise


class CTEPipeline:
    ...


class SQLPipeline:
    ...
```
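For concreteness, a hypothetical sketch (none of these names are from the actual codebase) of how a task list might be subdivided at materialisation points, with each resulting sequence compiled to a single CTE statement:

```python
# Break the task list wherever a task is flagged for materialisation;
# each resulting chunk becomes one CTE sequence to execute.
def split_into_cte_sequences(tasks):
    sequences, current = [], []
    for task in tasks:
        current.append(task)
        if task.materialise:
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return sequences


# Compile one sequence to a single statement: every task except the
# last becomes a CTE named after its output table.
def sequence_to_sql(tasks):
    *ctes, final = tasks
    if not ctes:
        return final.sql
    with_clause = ",\n".join(f"{t.output_table_name} as ({t.sql})" for t in ctes)
    return f"with {with_clause}\n{final.sql}"
```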
But you still have a problem that if you have materialisations, it's not clear what input tables need to carry forward to the next step. That is, it assumes a single linear DAG, whereas in general a CTE sequence could be materialised halfway but still need the input tables further down. It's still solvable by carrying everything forwards, but I think at that point we're building a generic DAG implementation, which is probably a bit too much.

However, IIRC materialisation is very rarely required mid-pipeline, perhaps only in `predict()`, so I'm instead thinking of a simpler option: sql generation methods always do something like the first sketch below, and then in the special case that materialisation is optionally needed, we just do something like the second sketch.
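(Both sketches are illustrative only: `enqueue_sql`, `sql_pipeline_to_splink_dataframe` and the `input_dataframes` argument are assumed names, and `db_api` and the `*_sql` strings stand in for whatever the calling method has to hand.)

```python
# General pattern: build a fresh pipeline, enqueue sql, execute exactly once
pipeline = SQLPipeline()
pipeline.enqueue_sql(blocking_sql, "__splink__df_blocked")
df_blocked = db_api.sql_pipeline_to_splink_dataframe(pipeline)
```

```python
# Special case: materialise part-way, then seed a fresh pipeline with the
# materialised table as an input and continue from there
pipeline = SQLPipeline()
pipeline.enqueue_sql(concat_sql, "__splink__df_concat_with_tf")
df_concat = db_api.sql_pipeline_to_splink_dataframe(pipeline)

pipeline = SQLPipeline(input_dataframes=[df_concat])
pipeline.enqueue_sql(predict_sql, "__splink__df_predict")
df_predict = db_api.sql_pipeline_to_splink_dataframe(pipeline)
```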
Or something along those lines. But I'm trying to stick with a general pattern of the pipeline simply being used to enqueue stuff and then being executed. Once executed it should be discarded; in the long run, we want to prevent enqueuing once executed. Will be interesting to observe the extent to which `predict()` is now decoupled from the linker as I go through this.
Was closed by #2062
Rather than a shared mutable `self._pipeline`, instead use a fresh `pipeline = SQLPipeline()` for all methods such as `predict()`.
One challenge with this at the moment is materialisation.
At the moment, we have methods like `_initialise_df_concat_with_tf`, which is difficult to understand due to how it mutates things and returns things:
- it mutates `self._pipeline` by adding sql to it
- it sometimes returns a `SplinkDataframe`, in which case the pipeline will then be flushed because it's been executed
- it sometimes returns `None`
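A hypothetical illustration of that shape (the body below is invented to show the pattern, not the actual implementation):

```python
def _initialise_df_concat_with_tf(self, materialise=True):
    sql = self._concat_with_tf_sql()  # hypothetical sql-generation helper
    # Mutation: queue sql onto the shared pipeline
    self._pipeline.enqueue_sql(sql, "__splink__df_concat_with_tf")
    if materialise:
        # Executing flushes self._pipeline and yields a SplinkDataframe
        return self._execute_sql_pipeline()
    # Otherwise the sql stays queued on self._pipeline and callers get None
    return None
```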
One reason for this behaviour is that the table `_initialise_df_concat_with_tf` produces is created once and used many times, so often we want to compute it so it can be cached.

Can this be handled somehow by the pipeline itself?
Specifically, can we extend the concept of the pipeline to allow us to tell it that some parts of the pipeline need to be materialised?
How does this interact with caching? For any pipeline, can we magically check whether any part of it is already cached?
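One conceivable approach, sketched under the assumption that cache keys are derived from the sql text; the helper below is hypothetical, not existing Splink code:

```python
import hashlib


def execute_with_cache(task, cache, execute_sql):
    # Key materialised steps by a hash of the sql that produces them
    key = hashlib.sha256(task.sql.encode()).hexdigest()
    if task.materialise and key in cache:
        return cache[key]  # already computed: reuse the cached table
    result = execute_sql(task.sql, task.output_table_name)
    if task.materialise:
        cache[key] = result
    return result
```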