Implement the dagster-openai integration library #19697
Conversation
my overall reaction is we can limit the MVP to only enable a subset of APIs and allow for more in a later stack
```python
FINE_TUNING = "fine_tuning"


API_RESOURCES_TO_ENDPOINT_METHODS_MAPPING = {
```
this looks a little scary to me in terms of maintenance cost. wonder if we could slim it down to an even smaller set to start with.
We could start with the 3 main ones, `Chat`, `Completions` and `Embeddings`, and add more from there?
```python
add_to_asset_metadata(context, "openai.calls", 1, output_name)
add_to_asset_metadata(context, "openai.total_tokens", usage.total_tokens, output_name)
add_to_asset_metadata(context, "openai.prompt_tokens", usage.prompt_tokens, output_name)
if hasattr(usage, "completion_tokens"):
    add_to_asset_metadata(
        context, "openai.completion_tokens", usage.completion_tokens, output_name
    )
```
nit: we can probably call `add_to_asset_metadata` once, since we will have all the dict values ready here [1]
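For illustration, the batching this comment proposes could look roughly like the sketch below. `usage_to_metadata` is a hypothetical helper (not from the PR) that collects the counters into one dict before a single metadata call:

```python
from types import SimpleNamespace

def usage_to_metadata(usage):
    """Collect all usage counters into one dict so the metadata
    helper only needs to be called once per API response."""
    metadata = {
        "openai.calls": 1,
        "openai.total_tokens": usage.total_tokens,
        "openai.prompt_tokens": usage.prompt_tokens,
    }
    # `completion_tokens` is absent on embeddings responses,
    # hence the hasattr guard.
    if hasattr(usage, "completion_tokens"):
        metadata["openai.completion_tokens"] = usage.completion_tokens
    return metadata

usage = SimpleNamespace(total_tokens=30, prompt_tokens=12, completion_tokens=18)
print(usage_to_metadata(usage))
```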
```python
context_to_counters = WeakKeyDictionary()


def add_to_asset_metadata(
```
re: [1], then you would probably change it so it updates a dict instead of adding a single value to a metadata key
```python
client.fine_tuning.jobs.create = with_usage_metadata(
    client.fine_tuning.jobs.create
)
client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",
    training_file="some_training_file"
)
```
worth adding a comment here to explain the different behaviors
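As a rough illustration of the wrapping pattern the snippet above relies on, here is a plain-Python sketch. `record` stands in for the Dagster metadata plumbing, and `fake_create` is a purely hypothetical endpoint, not the PR's code:

```python
import functools
from types import SimpleNamespace

def with_usage_metadata(method, record):
    # Wrap an endpoint method so the `usage` block of each response
    # is forwarded to a callback; responses without usage pass through.
    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        response = method(*args, **kwargs)
        usage = getattr(response, "usage", None)
        if usage is not None:
            record(usage)
        return response
    return wrapper

seen = []

def fake_create(**kwargs):
    # Stand-in for an endpoint like client.fine_tuning.jobs.create.
    return SimpleNamespace(usage=SimpleNamespace(total_tokens=5))

wrapped = with_usage_metadata(fake_create, seen.append)
wrapped(model="gpt-3.5-turbo", training_file="some_training_file")
print(len(seen), seen[0].total_tokens)  # → 1 5
```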
```python
# If none of the OpenAI API resource types are specified,
# we wrap by default the methods of
# `openai.resources.Completions`, `openai.resources.Embeddings`
# and `openai.resources.chat.Completions`.
# This allows the usage metadata to be captured.
if not api_resources:
    api_resources = [
        ApiResourceEnum.COMPLETIONS,
        ApiResourceEnum.CHAT,
        ApiResourceEnum.EMBEDDINGS,
    ]
```
i was imagining this to be an internal implementation to start with.
[2] seems a bit heavy IMO - maybe we could stack a separate PR for this pattern. the MVP can just not allow `api_resources` - it's easier and better to restrict more and open it up later.
It makes sense, I can remove the option.
```python
api_resources = [
    ApiResourceEnum.CHAT,
    ApiResourceEnum.FINE_TUNING,
]
with openai.get_client(context, api_resources=api_resources) as client:
```
[2]
have you tested this in a toy dagster project, and what does that look like in 1) asset metadata, which is available across oss and cloud, 2) cloud insights?
```python
from pydantic import Field, PrivateAttr


class ApiResourceEnum(Enum):
```
Naming quibble - since this name overloads the Dagster "Resource" noun, it could be pretty confusing to users (they construct an OpenAI Resource & pass it API resources). Are these called resources in OpenAI docs?
Could potentially call this `ApiFeaturesEnum` or something similar?
> Naming quibble - since this name overloads the Dagster "Resource" noun, it could be pretty confusing to users (they construct an OpenAI Resource & pass it API resources). Are these called resources in OpenAI docs?

I agree, but indeed, they are called resources in the OpenAI library we are wrapping. I also thought of `ApiEndpointClassesEnum` to make it obvious that it is not related to a dagster resource. I can make the change.
`ApiEndpointClassesEnum` sounds not bad to me - i'd lean towards very descriptive to start with.
```python
@contextmanager
def get_client(
    self,
    context: AssetExecutionContext,
```
This can also be retrieved contextually via `AssetExecutionContext.get()`, if that makes your impl cleaner (e.g. this could be optional). Having the option to explicitly supply it might be nice for testing though.
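A sketch of what the optional-context signature could look like, with `FakeContext` standing in for `AssetExecutionContext` (whose `.get()` classmethod returns the ambient context inside an asset body); everything here is illustrative, not the PR's implementation:

```python
from contextlib import contextmanager

class FakeContext:
    """Stand-in for AssetExecutionContext in this sketch."""
    _current = None

    @classmethod
    def get(cls):
        return cls._current

@contextmanager
def get_client(context=None):
    # If no context is supplied explicitly, fall back to the
    # ambient one - mirroring AssetExecutionContext.get().
    if context is None:
        context = FakeContext.get()
    yield ("client", context)

FakeContext._current = FakeContext()
with get_client() as (client, ctx):
    print(ctx is FakeContext._current)  # → True
```

Explicit passing still works for tests, where no ambient context exists.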
```python
API_RESOURCES_TO_ENDPOINT_METHODS_MAPPING = {
    "completions": [["create"]],
    "chat": [["completions", "create"]],
```
nit: To avoid duplicating strings/promote type checking, could use an enum->list mapping:

```diff
-    "chat": [["completions", "create"]],
+    ApiResourceEnum.CHAT: [["completions", "create"]],
```
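Under that suggestion the full mapping might look like the sketch below; the `EMBEDDINGS` entry's method path is an assumption for illustration, and the enum values mirror the ones discussed earlier in the review:

```python
from enum import Enum

class ApiResourceEnum(Enum):
    COMPLETIONS = "completions"
    CHAT = "chat"
    EMBEDDINGS = "embeddings"

# Keying by enum members instead of raw strings avoids string
# duplication and lets a type checker catch typos at the call site.
API_RESOURCES_TO_ENDPOINT_METHODS_MAPPING = {
    ApiResourceEnum.COMPLETIONS: [["create"]],
    ApiResourceEnum.CHAT: [["completions", "create"]],
    ApiResourceEnum.EMBEDDINGS: [["create"]],
}

print(API_RESOURCES_TO_ENDPOINT_METHODS_MAPPING[ApiResourceEnum.CHAT])  # → [['completions', 'create']]
```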
It was fully tested locally with a toy dagster project - the metadata is displayed as expected. Next step was cloud; I wanted to make sure we were on the right track before completing all tests.
@yuhan @benpankow I updated the code in ebe9367 to match the first reviews. User can now use
The code was tested in a toy repo locally and in a local cloud deployment. Missing:
looking close! look forward to the complete docstring and test cases!
```python
def get_client(
    self,
    context: Union[AssetExecutionContext, OpExecutionContext],
    output_name: Optional[str] = None,
```
note on unit tests, since you have output_name here: worth testing the behavior in the following cases:
- multi_assets (https://docs.dagster.io/concepts/assets/multi-assets)
- assets with partitions (https://docs.dagster.io/concepts/partitions-schedules-sensors/partitioning-assets)
- graph-backed assets (https://docs.dagster.io/concepts/assets/graph-backed-assets) - i doubt this will behave any different than a regular asset but worth throwing a test case for complexity
```python
def get_client(
    self,
    context: Union[AssetExecutionContext, OpExecutionContext],
    asset_key: Optional[AssetKey] = None,
```
per internal convo, might be worth switching to `get_client` and `get_client_for_asset` to keep consistency with `InsightsSnowflakeResource`
Done in ec36d2c
```python
result = (
    Definitions(
        assets=[openai_asset],
        jobs=[
            define_asset_job(
                name="openai_asset_job", selection=AssetSelection.assets(openai_asset)
            )
        ],
        resources={
            "openai_resource": OpenAIResource(api_key="xoxp-1234123412341234-12341234-1234")
        },
    )
    .get_job_def("openai_asset_job")
    .execute_in_process()
)
```
[2] you can test an asset with resources by directly invoking it: https://docs.dagster.io/concepts/testing#testing-assets-with-resources
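As a rough illustration of the direct-invocation pattern linked above (everything here is a hypothetical stub, not the PR's code): the asset body is just a function, so a test can call it with a fake resource and skip job construction entirely.

```python
from contextlib import contextmanager

class FakeOpenAIResource:
    """Hypothetical stub standing in for OpenAIResource."""

    @contextmanager
    def get_client(self, context):
        # A real test would yield a mocked OpenAI client here.
        yield "fake-client"

def openai_asset(context, openai_resource):
    # The asset body under test, invoked as a plain function.
    with openai_resource.get_client(context) as client:
        return client

print(openai_asset(context=None, openai_resource=FakeOpenAIResource()))  # → fake-client
```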
```python
Definitions(
    assets=[openai_asset],
    jobs=[
        define_asset_job(
            name="openai_asset_job", selection=AssetSelection.assets(openai_asset)
        )
    ],
    resources={
        "openai_resource": OpenAIResource(api_key="xoxp-1234123412341234-12341234-1234")
    },
)
```
probably no need to call `Definitions`. you can use `materialize_to_memory` instead:

```python
materialize_to_memory(
    [openai_asset],
    resources={"openai_resource": OpenAIResource(api_key="xoxp-1234123412341234-12341234-1234")},
)
```

https://docs.dagster.io/concepts/testing#testing-multiple-software-defined-assets-together
Done in d3581d5
```python
outs={
    "status": AssetOut(),
    "result": AssetOut(),
},
```
let's update it to use specs, which is a more blessed pattern.
Done in d3581d5
```python
result = (
    Definitions(
        assets=[openai_multi_asset],
        jobs=[
            define_asset_job(
                name="openai_multi_asset_job",
                selection=AssetSelection.assets(openai_multi_asset),
            )
        ],
        resources={
            "openai_resource": OpenAIResource(api_key="xoxp-1234123412341234-12341234-1234")
        },
    )
    .get_job_def("openai_multi_asset_job")
    .execute_in_process()
)
```
same as [2]
Done in d3581d5
Deploy preview for dagster-university ready! ✅ Preview Built with commit ec36d2c.
Deploy preview for dagit-core-storybook ready! ✅ Preview Built with commit ec36d2c.
```python
# Set up an OpenAI client based on the API key.
self._client = Client(api_key=self.api_key)


@contextmanager
```
probably need a `@public` decorator here to make it show up in the API docs. we can add and test that in the docs PR tho.
```python
    )
    """
    yield from self._get_client(context=context, asset_key=None)
```
`@public`
Let's merge this in! And iterate if anything else comes up as we build some examples using it.
Before landing, please make sure we aren't releasing it this week. See how to whitelist in release pipeline here: https://dagsterlabs.slack.com/archives/C03A0D72A6T/p1708645224684229
## Summary & Motivation

This PR adds a new `dagster-openai` library to our set of libraries. The main goal of this library is to log OpenAI API usage in the metadata. To do so, we need to wrap the methods called through the client, get the results, and update the metadata. The initial code snippets hardcoded 3 methods, but we want to give the user some flexibility.

Constraints:
- Results must be captured at the method level - the data we seek is included in the OpenAI API response. The results can't be captured at the client level, at teardown for instance.
- Not all the methods existing in the OpenAI library should be wrapped (private methods, etc.)
- Methods are overloaded in the API Resource classes, so wrapping the methods should be done on the instance.

**Solution**

Implement `OpenAIResource.get_client`, `OpenAIResource.get_client_for_asset` and the function wrapper `with_usage_metadata`. By default, for assets, the methods of the 3 main API endpoint classes, `Completions`, `Chat` and `Embeddings`, are wrapped when instantiating the client - wrapping the methods allows logging the usage metadata provided in an OpenAI Completion response. If another endpoint should be wrapped, a user can apply `with_usage_metadata` to it and log the metadata. `OpenAIResource.get_client` can be used for assets and ops, but the metadata will not be logged for ops. `OpenAIResource.get_client_for_asset` can only be used with assets, and the metadata will be logged.

## TO-DOs
- [x] implement the resource
- [x] add docstrings
- [x] implement tests

## How I Tested These Changes
- Local implementation
- BK
- Dogfood in Purina with a toy example
## Summary & Motivation

This PR adds the docs for the `dagster-openai` integration added in PR #19697

## How I Tested These Changes

BK