[4/n][dagster-fivetran] Implement `fetch_fivetran_workspace_data` #25788

maximearmstrong · 2024-11-07T18:15:37Z

Summary & Motivation

This PR implements fetch_fivetran_workspace_data, which is based on the legacy FivetranInstanceCacheableAssetsDefinition._get_connectors code.

We are fetching groups, destinations, connectors and their schema config to create the workspace data object, which represents the raw data fetched using the API.

How I Tested These Changes

Additional unit test

maximearmstrong · 2024-11-07T18:15:45Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

dpeng817

Some code structure + testing qs

dpeng817 · 2024-11-08T16:58:43Z

python_modules/libraries/dagster-fivetran/dagster_fivetran/resources.py

+                connector_id = connector_details["id"]
+
+                setup_state = connector_details.get("status", {}).get("setup_state")
+                if setup_state and setup_state in ("incomplete", "broken"):
+                    continue
+
+                schema_config = client.get_schema_config_for_connector(connector_id=connector_id)
+
+                augmented_connector_details = {
+                    **connector_details,
+                    "schema_config": schema_config,
+                    "destination_id": group_id,
+                }
+                connectors.append(
+                    FivetranContentData(
+                        content_type=FivetranContentType.CONNECTOR,
+                        properties=augmented_connector_details,
+                    )
+                )


Apologies if this comment comes off a bit nitty, but I find this logic a little confusing, and the amount of dictionary indexing is freaking me out lol.
To quantify that a bit more:

why is the setup_state two gets? Is this actually optional, and what are the cases in which it's optional?

I don't like using bare strings to refer to objects ("incomplete", "broken"). Nit; but would prefer a top-level constant.

I don't feel like the way FivetranContentData is being used here is quite the same as the other integrations (although correct me if I'm wrong). It feels like we're making decisions about what data to put in that object, and then just stuffing it into a raw dictionary, whereas in other content data examples, we're just stuffing a raw response body as properties. If we're going to make decisions about what data we're going to put in there, I think we should explicitly type it so that users (and future integration authors) are better able to understand what data is coming raw from the API and what data we've pulled in from disparate sources. So, something like a TypedDict at the very least, or making properties a union of strongly typed objects.
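A minimal sketch of the last two points, assuming `status.setup_state` is always present in the connector response. The names `SKIPPED_SETUP_STATES`, `AugmentedConnectorDetails` and `should_skip` are hypothetical and do not come from the PR; only the two string values and the two added keys are taken from the diff above.

```python
from typing import Any, Dict, TypedDict

# Hypothetical constant name; the two values are the ones used in the diff.
SKIPPED_SETUP_STATES = frozenset({"incomplete", "broken"})


class AugmentedConnectorDetails(TypedDict):
    """Hypothetical shape: a subset of raw API fields plus the fields the integration adds."""

    id: str  # raw, from the connector response
    status: Dict[str, Any]  # raw, from the connector response
    schema_config: Dict[str, Any]  # added: fetched with a separate request
    destination_id: str  # added: copied from the enclosing group id


def should_skip(connector_details: Dict[str, Any]) -> bool:
    """Return True for connectors whose setup state is in the skipped set."""
    return connector_details["status"]["setup_state"] in SKIPPED_SETUP_STATES
```

With a type like this, a reader can tell at a glance which keys come straight from the API and which were added afterwards.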

Also, having all of this complex logic live in a loop like this freaks me out a bit. I think a better way to structure it could be the following:

wrap the response to destination_details, connectors_details in an object that we can pull properties off of (ie DestinationDetails, ConnectorDetails)

Have like a .to_content_data that you can use to create the FivetranContentData object off of the DestinationDetails, ConnectorDetails.
Then, this logic is independently testable and removed from the hot path of this loop.
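The structure proposed above could look roughly like the following. This is only a sketch: `FivetranContentType` and `FivetranContentData` are simplified stand-ins for the classes in the diff (their real definitions are not shown in this thread), and the fields and properties on `ConnectorDetails` are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Mapping


class FivetranContentType(Enum):
    # Stand-in for the enum used in the diff; the member values are assumed.
    CONNECTOR = "connector"
    DESTINATION = "destination"


@dataclass(frozen=True)
class FivetranContentData:
    # Stand-in for the content data class used in the diff.
    content_type: FivetranContentType
    properties: Mapping[str, Any]


@dataclass(frozen=True)
class ConnectorDetails:
    """Hypothetical wrapper around one connector response."""

    raw: Mapping[str, Any]  # raw connector response from the API
    schema_config: Mapping[str, Any]  # fetched with a separate request
    group_id: str  # id of the group the connector belongs to

    @property
    def connector_id(self) -> str:
        return self.raw["id"]

    @property
    def setup_state(self) -> str:
        return self.raw["status"]["setup_state"]

    def to_content_data(self) -> FivetranContentData:
        # Same augmentation as in the loop, but testable on its own.
        return FivetranContentData(
            content_type=FivetranContentType.CONNECTOR,
            properties={
                **self.raw,
                "schema_config": self.schema_config,
                "destination_id": self.group_id,
            },
        )
```

The loop then only builds a `ConnectorDetails`, checks `setup_state`, and calls `to_content_data()`, so the dictionary indexing lives in one place.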

Sorry for the essay - trying to up my code review game and I saw an opportunity for improvement here, hope it doesn't come off as harsh or anything like that, I've certainly done much worse in my day here 😉

> why is the setup_state two gets? Is this actually optional, and what are the cases in which it's optional?

That logic was taken from the legacy code, but the fields are now required in the response, so I will update that.

> I don't like using bare strings to refer to objects ("incomplete", "broken"). Nit; but would prefer a top-level constant.

That's fair, will update.

> I don't feel like the way FivetranContentData is being used here is quite the same as the other integrations (although correct me if I'm wrong)

This is what we do for Power BI, Sigma and Tableau - we augment the data before adding the properties in the content data. That said, I agree that we could wrap these responses in all integrations to keep only what we need and make this easier to maintain. Maybe this could be done separately in another PR as a nice-to-have improvement?

I just saw your comment on the next PR.

For context, the code here and in the next PR is mainly based on the legacy code, the goal being to initially reproduce the exact same behavior as we had before but with the new integration pattern.

I agree that all of this code needs improvement and cleaning to make it more maintainable.

Yea what I had in mind regarding "other integrations" was dbt_resource_props = dbt_nodes[unique_id] from the dbt integration, where we just copy the raw response. Makes sense if this was a property we copied over from the BI integrations, but yea I do think that's kinda an antipattern. If we're making any alterations to the data, the data model should reflect that IMO.

I'm fine with you doing it in a follow up PR, I understand that rebasing is a pain. That said, I don't necessarily think of it as a nice to have, seeing how it carries over to the translator with all of the additional raw dict indexing we have to do there, I do think it's important to fix.

Gotcha, missed that message. Makes sense regarding just switching the pattern first and improving things later.

dpeng817 · 2024-11-08T16:59:28Z

python_modules/libraries/dagster-fivetran/dagster_fivetran_tests/experimental/conftest.py

@@ -202,6 +208,169 @@
    },
 }

+SAMPLE_SCHEMA_CONFIG_FOR_CONNECTOR = {


yea I really want to understand where these are coming from and how we can generate them (maybe even a script or something would be awesome)

This is directly copied and pasted from Fivetran API documentation. I can add comments with the URL where the example is coming from.

Ah ok nice. Yea comment seems sufficient then if it's that simple

dpeng817 · 2024-11-08T17:01:08Z

...n_modules/libraries/dagster-fivetran/dagster_fivetran_tests/experimental/test_asset_specs.py

+
+    resource = FivetranWorkspace(api_key=api_key, api_secret=api_secret)
+
+    with workspace_data_api_mocks_fn(include_sync_endpoints=False):


why do we not include sync_endpoints here? I'm not sure I follow. Feels like the API should still be able to return a valid response here and we should be asserting that it's not used, but maybe I'm missing something.

responses.RequestsMock() raises an error if not all mocked responses/endpoints are used in the test - it's mainly to avoid duplicating code.

There is a set of endpoints that are used to fetch the data, and another set of endpoints that are used to sync and poll the data. We will be using both to test sync and poll methods in subsequent PRs.

Ah gotcha, that honestly makes me feel even more strongly about using a client fake as opposed to mocking. If we have to make these somewhat arbitrary booleans to get things to work across all tests I think that's pretty undesirable. When a new person comes to this code I think that will be really confusing.

I really think something like what we did for dlift could be powerful here, we basically built a "fake dbt cloud with jaffle shop"

dagster/examples/experimental/dagster-dlift/dagster_dlift_tests/conftest.py

Line 47 in 0a0c77b

def jaffle_shop_contents() -> (

and then all of the tests can just run against that thing. Kinda annoying to build, but test logic stays simple. What do you think?
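To make the client fake idea concrete, here is a minimal in-memory sketch. It is not the dlift fixture referenced above and not part of dagster-fivetran: `FakeFivetranClient`, `collect_connectors` and `get_connectors_for_group` are hypothetical names, and only `get_schema_config_for_connector` appears in the diff under review.

```python
from typing import Any, Dict, List


class FakeFivetranClient:
    """Hypothetical in-memory stand-in for an API client."""

    def __init__(
        self,
        connectors_by_group: Dict[str, List[Dict[str, Any]]],
        schema_configs: Dict[str, Dict[str, Any]],
    ) -> None:
        self._connectors_by_group = connectors_by_group
        self._schema_configs = schema_configs
        self.calls: List[str] = []  # records which methods were used

    def get_connectors_for_group(self, group_id: str) -> List[Dict[str, Any]]:
        self.calls.append(f"connectors:{group_id}")
        return list(self._connectors_by_group.get(group_id, []))

    def get_schema_config_for_connector(self, connector_id: str) -> Dict[str, Any]:
        self.calls.append(f"schema_config:{connector_id}")
        return self._schema_configs[connector_id]


def collect_connectors(client: FakeFivetranClient, group_id: str) -> List[Dict[str, Any]]:
    """Simplified version of the loop in the diff, written against the fake."""
    collected = []
    for details in client.get_connectors_for_group(group_id):
        if details["status"]["setup_state"] in ("incomplete", "broken"):
            continue  # skip connectors that are not fully set up
        schema_config = client.get_schema_config_for_connector(connector_id=details["id"])
        collected.append(
            {**details, "schema_config": schema_config, "destination_id": group_id}
        )
    return collected
```

A test can then build the fake with a small fixed workspace, call the code under test, and inspect `client.calls` to assert which endpoints were or were not used, without registering any HTTP mocks.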

I updated the code to avoid using the boolean in 7408fa2 - I agree that it was kinda awkward.

About the client fake, in this case, I think I see more value in mocking requests for 2 reasons.

1. It makes the conftest file more straightforward. It also makes it obvious that the only things we are mocking are the API requests and responses.

2. I think the client fake solution is great, but it kinda adds a layer of complexity for external contributors who would like to contribute to the integration. Like you mentioned, it's more complex to build, so I would avoid using a solution that would be harder for contributors to update. Since mocking for unit tests is widely adopted by the Python community, it seems like the most logical strategy to adopt here.

Yea it's tricky. For what it's worth, I don't think it being more complex to build initially means it's more complex to build on top of. I think the idea would be a client fake solution requires more up front effort, but for 90% of use cases doesn't require users to touch the API surface area at all - they just interact with the instance methods themselves.

I do see your point, though, about mocking kinda being the "devil we know", so to speak, for external contributors. It's complexity they're familiar with.

I like the solution you landed on with the hierarchical API mocks more, feels more structured. Given that there's only a few APIs that we're mocking out here for all the tests, I think I'm OK with pushing this through with the mocking solution. Thanks for talking through it!

re-requesting

dpeng817

After discussion, lgtm

…nation (#25889)

## Summary & Motivation

This PR implements `FivetranConnector` and `FivetranDestination`, and removes `FivetranContentData` and `FivetranContentType`. This addresses the concerns raised [here](#25788 (comment)) about the legacy code.

## How I Tested These Changes

BK with same tests.

This was referenced Nov 7, 2024

[1/n][dagster-fivetran] Scaffold FivetranWorkspace for rework #25750

Merged

[2/n][dagster-fivetran] Update DagsterFivetranTranslator and related classes for rework #25751

Merged

[3/n][dagster-fivetran] Implement FivetranClient for rework #25756

Merged

maximearmstrong force-pushed the maxime/rework-fivetran-3 branch from 93c3d6b to 9a0a143 Compare November 7, 2024 22:22

maximearmstrong changed the title ~~[4/n][dagster-tableau] Implement fetch_fivetran_workspace_data~~ [4/n][dagster-fivetran] Implement fetch_fivetran_workspace_data Nov 7, 2024

maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 0cbc9eb to b17bd45 Compare November 7, 2024 22:22

maximearmstrong mentioned this pull request Nov 7, 2024

[5/n][dagster-fivetran] Implement FivetranWorkspaceData to FivetranConnectorTableProps method #25797

Merged

maximearmstrong force-pushed the maxime/rework-fivetran-3 branch from 9a0a143 to e6d7f2d Compare November 8, 2024 14:02

maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from b17bd45 to 252be0c Compare November 8, 2024 14:02

maximearmstrong changed the title ~~[4/n][dagster-fivetran] Implement fetch_fivetran_workspace_data~~ [4/n][dagster-fivetran] Implement fetch_fivetran_workspace_data Nov 8, 2024

maximearmstrong marked this pull request as ready for review November 8, 2024 14:08

maximearmstrong self-assigned this Nov 8, 2024

maximearmstrong requested review from benpankow and dpeng817 November 8, 2024 14:08

maximearmstrong force-pushed the maxime/rework-fivetran-3 branch from e6d7f2d to 1461946 Compare November 8, 2024 16:44

maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 252be0c to 56205b0 Compare November 8, 2024 16:44

This was referenced Nov 8, 2024

[6/n][dagster-fivetran] Implement FivetranWorkspaceDefsLoader #25807

Merged

[7/n][dagster-fivetran] Implement load_fivetran_asset_specs #25808

Merged

maximearmstrong force-pushed the maxime/rework-fivetran-3 branch from 1461946 to 1846939 Compare November 8, 2024 17:00

maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 56205b0 to 6c50270 Compare November 8, 2024 17:00

dpeng817 previously requested changes Nov 8, 2024

maximearmstrong force-pushed the maxime/rework-fivetran-3 branch from 1846939 to af34d01 Compare November 8, 2024 17:41

maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 6c50270 to 55248a3 Compare November 8, 2024 17:41

maximearmstrong force-pushed the maxime/rework-fivetran-3 branch from af34d01 to c714e38 Compare November 8, 2024 19:45

maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 55248a3 to 5c21300 Compare November 8, 2024 19:45

maximearmstrong requested a review from dpeng817 November 8, 2024 20:04

Base automatically changed from maxime/rework-fivetran-3 to master November 8, 2024 20:22

maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 5c21300 to 011989c Compare November 8, 2024 20:47

maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 011989c to 71bf953 Compare November 11, 2024 20:14

maximearmstrong added 7 commits November 11, 2024 17:21

[4/n][dagster-tableau] Implement fetch_fivetran_workspace_data

9653928

Update connector data

7f30f1a

Add comments for samples

0ff64ca

Update api mocks

b046d73

Lint

eb851e4

Update fetch_fivetran_workspace_data

f4f07a3

Fix pyright

7dd39c7

maximearmstrong force-pushed the maxime/rework-fivetran-4 branch from 71bf953 to 7dd39c7 Compare November 11, 2024 22:21

dpeng817 approved these changes Nov 12, 2024

maximearmstrong merged commit ccef3e7 into master Nov 12, 2024
1 check passed

maximearmstrong deleted the maxime/rework-fivetran-4 branch November 12, 2024 18:11
