Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4/n][dagster-fivetran] Implement fetch_fivetran_workspace_data #25788

Merged
merged 7 commits into from
Nov 12, 2024

Conversation

maximearmstrong
Copy link
Contributor

@maximearmstrong maximearmstrong commented Nov 7, 2024

Summary & Motivation

This PR implements fetch_fivetran_workspace_data, which is based on the legacy FivetranInstanceCacheableAssetsDefinition._get_connectors code.

We are fetching groups, destinations, connectors and their schema config to create the workspace data object, which represents the raw data fetched using the API.

How I Tested These Changes

Additional unit test

Copy link
Contributor Author

maximearmstrong commented Nov 7, 2024

@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-3 branch from 93c3d6b to 9a0a143 Compare November 7, 2024 22:22
@maximearmstrong maximearmstrong changed the title [4/n][dagster-tableau] Implement fetch_fivetran_workspace_data [4/n][dagster-fivetran] Implement fetch_fivetran_workspace_data Nov 7, 2024
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 0cbc9eb to b17bd45 Compare November 7, 2024 22:22
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-3 branch from 9a0a143 to e6d7f2d Compare November 8, 2024 14:02
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from b17bd45 to 252be0c Compare November 8, 2024 14:02
@maximearmstrong maximearmstrong changed the title [4/n][dagster-fivetran] Implement fetch_fivetran_workspace_data [4/n][dagster-fivetran] Implement fetch_fivetran_workspace_data Nov 8, 2024
@maximearmstrong maximearmstrong marked this pull request as ready for review November 8, 2024 14:08
@maximearmstrong maximearmstrong self-assigned this Nov 8, 2024
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-3 branch from e6d7f2d to 1461946 Compare November 8, 2024 16:44
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 252be0c to 56205b0 Compare November 8, 2024 16:44
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-3 branch from 1461946 to 1846939 Compare November 8, 2024 17:00
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 56205b0 to 6c50270 Compare November 8, 2024 17:00
Copy link
Contributor

@dpeng817 dpeng817 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some code structure + testing qs

Comment on lines 644 to 674
connector_id = connector_details["id"]

setup_state = connector_details.get("status", {}).get("setup_state")
if setup_state and setup_state in ("incomplete", "broken"):
continue

schema_config = client.get_schema_config_for_connector(connector_id=connector_id)

augmented_connector_details = {
**connector_details,
"schema_config": schema_config,
"destination_id": group_id,
}
connectors.append(
FivetranContentData(
content_type=FivetranContentType.CONNECTOR,
properties=augmented_connector_details,
)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies if this comes comment comes off a bit nitty: but I find this logic to be a little confusing, and the amount of dictionary indexing is freaking me out lol.
To quantify that a bit more:

  • why is the setup_state two gets? Is this actually optional, and what are the cases in which it's optional?
  • I don't like using bare strings to refer to objects ("incomplete", "broken"). Nit; but would prefer a top-level constant.
  • I don't feel like the way FivetranContentDatais being used here is quite the same as the other integrations (although correct me if I'm wrong). It feels like we're making decisions about what data to put in that object, and then just stuffing it into a raw dictionary, whereas in other content data examples, we're just stuffing a raw response body as properties. If we're going to make decisions about what data we're going to put in there, I think we should explicitly type it so that users (and future integration authors) are better able to understand what data is raw coming from the API and what data we've pulled in from disparate sources. So, something like a typeddict at the very least or making properties a union of strongly typed objects.

Also, having all of this complex logic live in a loop like this freaks me out a bit. I think a better way to structure it could be the following:

  • wrap the response to destination_details, connectors_details in an object that we can pull properties off of (ie DestinationDetails, ConnectorDetails)
  • Have like a .to_content_data that you can use to create the FivetranContentData object off of the DestinationDetails, ConnectorsDetails.
    Then, this logic is independently testable and removed from the hot path of this loop.

Sorry for the essay - trying to up my code review game and I saw an opportunity for improvement here, hope it doesn't come off as harsh or anything like that, I've certainly done much worse in my day here 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the setup_state two gets? Is this actually optional, and what are the cases in which it's optional?

That logic was taken from the legacy code, but the fields are now required in the response, so I will update that.

I don't like using bare strings to refer to objects ("incomplete", "broken"). Nit; but would prefer a top-level constant.

That's fair, will update.

I don't feel like the way FivetranContentData is being used here is quite the same as the other integrations (although correct me if I'm wrong)

This is what we do for Power BI, Sigma and Tableau - we augment the data before adding the properties in the content data. That said, I agree that we could wrap these responses in all integrations to keep only what we need and make this easier to maintain. Maybe this could be done separately in another PR as a nice-to-have improvement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just saw your comment on the next PR.

For the context, the code here and in the next PR is my mainly based on the legacy code, the goal being to initially reproduce the exact same behavior as we had before but with the new integration pattern.

I agree that all of this code need improvement and cleaning to make it more maintainable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea what I had in mind regarding "other integrations" was dbt_resource_props = dbt_nodes[unique_id] from the dbt integration, where we just copy the raw response. Makes sense if this was a property we copied over from the BI integrations, but yea I do think that's kinda an antipattern. If we're making any alterations to the data, the data model should reflect that IMO.

I'm fine with you doing it in a follow up PR, I understand that rebasing is a pain. That said, I don't necessarily think of it as a nice to have, seeing how it carries over to the translator with all of the additional raw dict indexing we have to do there, I do think it's important to fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, missed that message. Makes sense regarding just switching the pattern first and improving things later.

@@ -202,6 +208,169 @@
},
}

SAMPLE_SCHEMA_CONFIG_FOR_CONNECTOR = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea I really want to understand where these are coming from and how we can generate them (maybe even a script or something would be awesome)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is directly copied and pasted from Fivetran API documentation. I can add comments with the URL where the example is coming from.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok nice. Yea comment seems sufficient then if it's that simple


resource = FivetranWorkspace(api_key=api_key, api_secret=api_secret)

with workspace_data_api_mocks_fn(include_sync_endpoints=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we not include sync_endpoints here? I'm not sure I follow. Feels like the API should still be able to return a valid response here and we should be asserting that it's not used, but maybe I'm missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

responses.RequestsMock() raises an error if not all mocked responses/endpoints are used in the test - it's mainly to avoid duplicating code.

There is a set of endpoints that are used to fetch the data, and another set of endpoints that are used to sync and poll the data. We will be using both to test sync and poll methods in subsequent PRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha, that honestly makes me feel even more strongly about using a client fake as opposed to mocking. If we have to make these somewhat arbitrary booleans to get things to work across all tests I think that's pretty undesirable. When a new person comes to this code I think that will be really confusing.

I really think something like what we did for dlift could be powerful here, we basically built a "fake dbt cloud with jaffle shop"

and then all of the tests can just run against that thing. Kinda annoying to build, but test logic stays simple. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the code to avoid using the boolean in 7408fa2 - I agree that it was kinda awkward.

About the client fake, in this case, I think I see more value in mocking requests for 2 reasons.

  1. It makes the conftest file more straightforward. It also makes obvious that the only thing that we are mocking are the API requests and responses.
  2. I think the client fake solution is great, but it kinda adds a layer of complexity for external contributors that would like to contribute to the integration. Like you mentioned, it's more complex to build, so I would avoid using a solution that would be harder to update for contributors. Since mocking for unit tests is widely adopted by the Python community, it seems like the most logical strategy to adopt here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea it's tricky. For what it's worth, I don't think it being more complex to build initially means it's more complex to build on top of. I think the idea would be a client fake solution requires more up front effort, but for 90% of use cases doesn't require users to touch the API surface area at all - they just interact with the instance methods themselves.

I do see your point however around mocking kinda being the "devil we know" so to speak, regarding external contributors, though. It's complexity they're familiar with.

I like the solution you landed on with the hierarchical API mocks more, feels more structured. Given that there's only a few APIs that we're mocking out here for all the tests, I think I'm OK with pushing this through with the mocking solution. Thanks for talking through it!

@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-3 branch from 1846939 to af34d01 Compare November 8, 2024 17:41
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 6c50270 to 55248a3 Compare November 8, 2024 17:41
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-3 branch from af34d01 to c714e38 Compare November 8, 2024 19:45
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 55248a3 to 5c21300 Compare November 8, 2024 19:45
Base automatically changed from maxime/rework-fivetran-3 to master November 8, 2024 20:22
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 5c21300 to 011989c Compare November 8, 2024 20:47
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 011989c to 71bf953 Compare November 11, 2024 20:14
@maximearmstrong maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 71bf953 to 7dd39c7 Compare November 11, 2024 22:21
Copy link
Contributor

@dpeng817 dpeng817 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion, lgtm

@maximearmstrong maximearmstrong merged commit ccef3e7 into master Nov 12, 2024
1 check passed
@maximearmstrong maximearmstrong deleted the maxime/rework-fivetran-4 branch November 12, 2024 18:11
maximearmstrong added a commit that referenced this pull request Nov 15, 2024
…nation (#25889)

## Summary & Motivation

This PR implements `FivetranConnector` and `FivetranDestination`, and
removes `FivetranContentData` and `FivetranContentType`.

This addresses the concerns raised
[here](#25788 (comment))
about the legacy code.

## How I Tested These Changes

BK with same tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants