[4/n][dagster-fivetran] Implement fetch_fivetran_workspace_data
#25788
Conversation
Some code structure + testing qs
```python
connector_id = connector_details["id"]

setup_state = connector_details.get("status", {}).get("setup_state")
if setup_state and setup_state in ("incomplete", "broken"):
    continue

schema_config = client.get_schema_config_for_connector(connector_id=connector_id)

augmented_connector_details = {
    **connector_details,
    "schema_config": schema_config,
    "destination_id": group_id,
}
connectors.append(
    FivetranContentData(
        content_type=FivetranContentType.CONNECTOR,
        properties=augmented_connector_details,
    )
)
```
Apologies if this comment comes off a bit nitty, but I find this logic to be a little confusing, and the amount of dictionary indexing is freaking me out lol.
To quantify that a bit more:
- why is the setup_state two gets? Is this actually optional, and what are the cases in which it's optional?
- I don't like using bare strings to refer to objects (`"incomplete"`, `"broken"`). Nit, but I would prefer a top-level constant.
- I don't feel like the way `FivetranContentData` is being used here is quite the same as in the other integrations (although correct me if I'm wrong). It feels like we're making decisions about what data to put in that object, and then just stuffing it into a raw dictionary, whereas in other content data examples, we're just stuffing a raw response body into properties. If we're going to make decisions about what data we put in there, I think we should explicitly type it so that users (and future integration authors) are better able to understand what data comes raw from the API and what data we've pulled in from disparate sources. So, something like a TypedDict at the very least, or making properties a union of strongly typed objects.
Also, having all of this complex logic live in a loop like this freaks me out a bit. I think a better way to structure it could be the following (rough sketch below):
- wrap the responses to `destination_details`, `connectors_details` in objects that we can pull properties off of (i.e. `DestinationDetails`, `ConnectorDetails`)
- have something like a `.to_content_data` that you can use to create the `FivetranContentData` object off of the `DestinationDetails` / `ConnectorDetails`.
Then, this logic is independently testable and removed from the hot path of this loop.
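For illustration, a minimal sketch of that structure; the class, field, and method names are hypothetical, and `FivetranContentData` / `FivetranContentType` are taken from the snippet above:

```python
from dataclasses import dataclass
from typing import Any, Mapping

@dataclass(frozen=True)
class ConnectorDetails:
    # Hypothetical typed wrapper around one raw connector payload from the API.
    raw: Mapping[str, Any]  # untouched API response body
    schema_config: Mapping[str, Any]  # fetched separately, per connector
    destination_id: str  # pulled in from the enclosing group

    @property
    def setup_state(self) -> str:
        return self.raw["status"]["setup_state"]

    def to_content_data(self) -> "FivetranContentData":
        # The decision about what goes into properties lives here, in one
        # independently testable place, instead of in the fetch loop.
        return FivetranContentData(
            content_type=FivetranContentType.CONNECTOR,
            properties={
                **self.raw,
                "schema_config": self.schema_config,
                "destination_id": self.destination_id,
            },
        )
```

The loop body then collapses to building a `ConnectorDetails` and calling `.to_content_data()` on it.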
Sorry for the essay - trying to up my code review game and I saw an opportunity for improvement here, hope it doesn't come off as harsh or anything like that, I've certainly done much worse in my day here 😉
> why is the setup_state two gets? Is this actually optional, and what are the cases in which it's optional?

That logic was taken from the legacy code, but the fields are now required in the response, so I will update that.

> I don't like using bare strings to refer to objects ("incomplete", "broken"). Nit, but would prefer a top-level constant.

That's fair, will update.
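A minimal sketch of that update, assuming the `status.setup_state` field is indeed required in the response (the constant name is hypothetical):

```python
# Module-level constant replacing the bare strings in the loop above.
INCOMPLETE_SETUP_STATES = ("incomplete", "broken")

# Inside the connector loop: direct indexing is safe once the fields
# are required in the API response.
setup_state = connector_details["status"]["setup_state"]
if setup_state in INCOMPLETE_SETUP_STATES:
    continue
```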
> I don't feel like the way FivetranContentData is being used here is quite the same as the other integrations (although correct me if I'm wrong)

This is what we do for Power BI, Sigma and Tableau - we augment the data before adding the properties in the content data. That said, I agree that we could wrap these responses in all integrations to keep only what we need and make this easier to maintain. Maybe this could be done separately in another PR as a nice-to-have improvement?
I just saw your comment on the next PR.
For context, the code here and in the next PR is mainly based on the legacy code; the goal is to initially reproduce the exact same behavior we had before, but with the new integration pattern.
I agree that all of this code needs improvement and cleanup to make it more maintainable.
Yea, what I had in mind regarding "other integrations" was `dbt_resource_props = dbt_nodes[unique_id]` from the dbt integration, where we just copy the raw response. Makes sense if this was a pattern we copied over from the BI integrations, but yea, I do think that's kind of an antipattern. If we're making any alterations to the data, the data model should reflect that IMO.
I'm fine with you doing it in a follow-up PR; I understand that rebasing is a pain. That said, I don't necessarily think of it as a nice-to-have: seeing how it carries over to the translator, with all of the additional raw dict indexing we have to do there, I do think it's important to fix.
Gotcha, missed that message. Makes sense regarding just switching the pattern first and improving things later.
```
@@ -202,6 +208,169 @@
    },
}

SAMPLE_SCHEMA_CONFIG_FOR_CONNECTOR = {
```
yea I really want to understand where these are coming from and how we can generate them (maybe even a script or something would be awesome)
This is directly copied and pasted from the Fivetran API documentation. I can add comments with the URL where each example comes from.
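For example, the annotation could look something like this; the URL below is a placeholder, not the actual docs page:

```python
# Sample response copied verbatim from the Fivetran REST API documentation:
# https://fivetran.com/docs/rest-api/...  (placeholder - link the exact page)
SAMPLE_SCHEMA_CONFIG_FOR_CONNECTOR = {
    # ...sample payload as published in the docs...
}
```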
Ah ok nice. Yea comment seems sufficient then if it's that simple
```python
resource = FivetranWorkspace(api_key=api_key, api_secret=api_secret)

with workspace_data_api_mocks_fn(include_sync_endpoints=False):
```
why do we not include sync_endpoints here? I'm not sure I follow. Feels like the API should still be able to return a valid response here and we should be asserting that it's not used, but maybe I'm missing something.
`responses.RequestsMock()` raises an error if not all mocked responses/endpoints are used in the test - it's mainly to avoid duplicating code.
There is a set of endpoints that are used to fetch the data, and another set of endpoints that are used to sync and poll the data. We will be using both to test the sync and poll methods in subsequent PRs.
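For reference, a minimal sketch of that behavior with the `responses` library (the endpoint and payload here are illustrative, not the actual conftest):

```python
import responses

# RequestsMock asserts by default that every registered response is fired
# (assert_all_requests_are_fired=True), so registering the sync endpoints in a
# test that only fetches workspace data would make the test fail on exit.
with responses.RequestsMock() as rsps:
    rsps.add(
        responses.GET,
        "https://api.fivetran.com/v1/groups",  # illustrative fetch endpoint
        json={"data": {"items": []}},
    )
    # ...the code under test must issue a GET to the mocked endpoint here,
    # otherwise leaving the context raises an AssertionError...
```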
Ah gotcha, that honestly makes me feel even more strongly about using a client fake as opposed to mocking. If we have to make these somewhat arbitrary booleans to get things to work across all tests, I think that's pretty undesirable. When a new person comes to this code I think that will be really confusing.
I really think something like what we did for dlift could be powerful here - we basically built a "fake dbt cloud with jaffle shop" (see `def jaffle_shop_contents()`).
I updated the code to avoid using the boolean in 7408fa2 - I agree that it was kinda awkward.
About the client fake: in this case, I think I see more value in mocking requests, for 2 reasons.
- It makes the conftest file more straightforward. It also makes it obvious that the only things we are mocking are the API requests and responses.
- I think the client fake solution is great, but it kinda adds a layer of complexity for external contributors who would like to contribute to the integration. Like you mentioned, it's more complex to build, so I would avoid a solution that would be harder for contributors to update. Since mocking for unit tests is widely adopted by the Python community, it seems like the most logical strategy to adopt here.
Yea, it's tricky. For what it's worth, I don't think it being more complex to build initially means it's more complex to build on top of. I think the idea would be that a client fake solution requires more up-front effort, but for 90% of use cases doesn't require users to touch the API surface area at all - they just interact with the instance methods themselves.
I do see your point around mocking kinda being the "devil we know", so to speak, regarding external contributors, though. It's complexity they're familiar with.
I like the solution you landed on with the hierarchical API mocks more; it feels more structured. Given that there are only a few APIs we're mocking out here for all the tests, I think I'm OK with pushing this through with the mocking solution. Thanks for talking through it!
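For readers following along, a rough sketch of what such hierarchical mocks could look like; the fixture names, endpoints, and payloads are hypothetical, not the actual conftest:

```python
from contextlib import contextmanager

import responses

FIVETRAN_API_BASE = "https://api.fivetran.com/v1"  # illustrative base URL

@contextmanager
def fetch_api_mocks():
    # Base layer: only the endpoints used to fetch workspace data.
    with responses.RequestsMock() as rsps:
        rsps.add(
            responses.GET,
            f"{FIVETRAN_API_BASE}/groups",
            json={"data": {"items": []}},
        )
        yield rsps

@contextmanager
def sync_api_mocks():
    # Sync/poll layer stacked on top of the fetch layer, for tests that
    # exercise the sync and poll methods in the subsequent PRs.
    with fetch_api_mocks() as rsps:
        rsps.add(
            responses.POST,
            f"{FIVETRAN_API_BASE}/connectors/some_connector/sync",  # hypothetical
            json={"data": {}},
        )
        yield rsps
```

Each test pulls in exactly the layer it needs, so no endpoint is registered without being fired.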
After discussion, lgtm
…nation (#25889)

## Summary & Motivation
This PR implements `FivetranConnector` and `FivetranDestination`, and removes `FivetranContentData` and `FivetranContentType`. This addresses the concerns raised [here](#25788 (comment)) about the legacy code.

## How I Tested These Changes
BK with same tests.
Summary & Motivation
This PR implements `fetch_fivetran_workspace_data`, which is based on the legacy `FivetranInstanceCacheableAssetsDefinition._get_connectors` code. We are fetching groups, destinations, connectors, and their schema configs to create the workspace data object, which represents the raw data fetched using the API.
How I Tested These Changes
Additional unit test
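To make the fetch flow concrete, here is a hypothetical sketch of how the pieces described above could fit together; apart from `get_schema_config_for_connector` (seen in the snippet earlier in this thread), the client method names and the `FivetranWorkspaceData` constructor are assumptions, not the actual implementation:

```python
def fetch_fivetran_workspace_data(client) -> "FivetranWorkspaceData":
    # Sketch: walk groups -> destinations -> connectors -> schema configs,
    # collecting the raw API payloads into one workspace data object.
    connectors, destinations = [], []
    for group in client.get_groups():  # assumed helper returning group dicts
        group_id = group["id"]
        destinations.append(client.get_destination_details(destination_id=group_id))
        for connector_details in client.get_connectors_for_group(group_id=group_id):
            schema_config = client.get_schema_config_for_connector(
                connector_id=connector_details["id"]
            )
            connectors.append(
                {
                    **connector_details,
                    "schema_config": schema_config,
                    "destination_id": group_id,
                }
            )
    return FivetranWorkspaceData(connectors=connectors, destinations=destinations)
```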