Move new provider docs to documentation folder (#3273)
* move new provider docs to documentation folder

* edit title

* reference link to correct location

* use github links

* use github links
ngken0995 authored Nov 2, 2023
1 parent 2c6267e commit 4f8281d
Showing 4 changed files with 16 additions and 12 deletions.
File renamed without changes
File renamed without changes
@@ -8,7 +8,7 @@ automated by
[Airflow DAGs](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html)
generated for each provider. A simple provider DAG looks like this:

![Example DAG](assets/provider_dags/simple_dag.png)
![Example DAG](/_static/simple_dag.png)

At a high level the steps are:

@@ -24,7 +24,7 @@ When a provider supports multiple media types (for example, `audio` _and_
`images`), the `pull` step consumes data of all types, but separate `load` steps
are generated:

![Example Multi-Media DAG](assets/provider_dags/multi_media_dag.png)
![Example Multi-Media DAG](/_static/multi_media_dag.png)

## Adding a New Provider

@@ -39,19 +39,20 @@ provider DAG, you need to:

We call the code that pulls data from our provider APIs "Provider API scripts".
You can find examples in
[`provider_api_scripts` folder](../dags/providers/provider_api_scripts). This
code will be run during the `pull` steps of the provider DAG.
[`provider_api_scripts` folder](https://github.com/WordPress/openverse/tree/main/catalog/dags/providers/provider_api_scripts).
This code will be run during the `pull` steps of the provider DAG.

At a high level, a provider script should iteratively request batches of records
from the provider API, extract data in the format required by Openverse, and
commit it to local storage. Much of this logic is implemented in a
[`ProviderDataIngester` base class](../dags/providers/provider_api_scripts/provider_data_ingester.py)
[`ProviderDataIngester` base class](https://github.com/WordPress/openverse/blob/main/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py)
(which also provides additional testing features _<TODO: link to documentation
for testing features like ingestion_limit, skip_ingestion_errors etc>_). To add
a new provider, extend this class and implement its abstract methods.
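
The batch-ingestion pattern described here can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual `ProviderDataIngester`: the `SketchIngester` base class, its method names, the `FAKE_API_PAGES` data, and the record fields are all assumptions made for the example, so consult the linked base class for the real abstract methods.

```python
from abc import ABC, abstractmethod


class SketchIngester(ABC):
    """Hypothetical stand-in for a batch-ingesting base class.

    This is NOT the real ProviderDataIngester; method names are assumptions.
    """

    def __init__(self):
        # Stands in for "local storage" in this sketch.
        self.records = []

    @abstractmethod
    def get_next_query_params(self, prev_params):
        """Return the query parameters for the next batch request."""

    @abstractmethod
    def get_batch(self, query_params):
        """Request one batch from the provider API and return raw items."""

    @abstractmethod
    def get_record_data(self, item):
        """Map a raw item to a record dict, or return None to skip it."""

    def ingest(self):
        """Iteratively pull batches until the API returns an empty one."""
        params = None
        while True:
            params = self.get_next_query_params(params)
            batch = self.get_batch(params)
            if not batch:
                break
            for item in batch:
                record = self.get_record_data(item)
                if record is not None:
                    self.records.append(record)
        return len(self.records)


# Fake paginated API responses used in place of real HTTP requests.
FAKE_API_PAGES = {
    1: [
        {"id": "a", "url": "https://example.com/a.jpg", "license": "cc0"},
        {"id": "b", "url": None, "license": "by"},
    ],
    2: [{"id": "c", "url": "https://example.com/c.jpg", "license": "by-sa"}],
}


class ExampleIngester(SketchIngester):
    def get_next_query_params(self, prev_params):
        # Start at page 1, then advance one page per batch.
        if prev_params is None:
            return {"page": 1}
        return {"page": prev_params["page"] + 1}

    def get_batch(self, query_params):
        # A missing page yields an empty batch, which ends ingestion.
        return FAKE_API_PAGES.get(query_params["page"], [])

    def get_record_data(self, item):
        # Skip items that lack the data this sketch treats as required.
        if not item.get("url"):
            return None
        return {
            "foreign_identifier": item["id"],
            "url": item["url"],
            "license": item["license"],
        }


ingester = ExampleIngester()
count = ingester.ingest()
```

The subclass only supplies the provider-specific pieces (pagination, fetching, and record extraction), while the loop that drives ingestion stays in the base class.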

We provide a [script](../dags/templates/create_provider_ingester.py) that can be
used to generate the files you'll need and get you started:
We provide a
[script](https://github.com/WordPress/openverse/blob/main/catalog/templates/create_provider_ingester.py)
that can be used to generate the files you'll need and get you started:

```
# PROVIDER_NAME: The name of the provider
@@ -84,23 +85,25 @@ NOTE: You will also need to add a new ProviderWorkflow dataclass configuration t
```

This generates a provider script with a templated `ProviderDataIngester` for you
in the [`provider_api_scripts` folder](../dags/providers/provider_api_scripts),
in the
[`provider_api_scripts` folder](https://github.com/WordPress/openverse/tree/main/catalog/dags/providers/provider_api_scripts),
as well as a corresponding test file. Complete the TODOs detailed in the
generated files to implement behavior specific to your API.

Some APIs may not fit perfectly into the established `ProviderDataIngester`
pattern. For advanced use cases and examples of how to modify the ingestion
flow, see the [`ProviderDataIngester` FAQ](provider_data_ingester_faq.md).
flow, see the
[`ProviderDataIngester` FAQ](https://github.com/WordPress/openverse/blob/main/catalog/docs/provider_data_ingester_faq.md).

### Add a `ProviderWorkflow` configuration class

Now that you have an ingester class, you're ready to wire up a provider DAG in
Airflow to automatically pull data and load it into our Catalog database. This
is done by defining a `ProviderWorkflow` configuration dataclass and adding it
to the `PROVIDER_WORKFLOWS` list in
[`provider_workflows.py`](../dags/providers/provider_workflows.py). Our DAG
factories will pick up the configuration and generate a complete new DAG in
Airflow!
[`provider_workflows.py`](https://github.com/WordPress/openverse/blob/main/catalog/dags/providers/provider_workflows.py).
Our DAG factories will pick up the configuration and generate a complete new DAG
in Airflow!
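
As a rough illustration of this wiring, the sketch below registers one configuration entry in a list and lets a toy "factory" derive a DAG id from it. Everything here is a stand-in: `SketchProviderWorkflow`, its fields, `ExampleDataIngester`, and `build_dag_ids` are hypothetical names for this example, and the real `ProviderWorkflow` dataclass in `provider_workflows.py` may define different fields and behavior.

```python
from dataclasses import dataclass


class ExampleDataIngester:
    """Placeholder for a provider ingester class (hypothetical)."""


@dataclass
class SketchProviderWorkflow:
    """Hypothetical stand-in for a workflow configuration dataclass.

    The field names below are assumptions, not the real ProviderWorkflow API.
    """

    ingester_class: type
    schedule: str = "@monthly"  # assumed default, for illustration only


# Mirrors the idea of the PROVIDER_WORKFLOWS list: one entry per provider.
PROVIDER_WORKFLOWS = [
    SketchProviderWorkflow(ingester_class=ExampleDataIngester),
]


def build_dag_ids(workflows):
    """Toy factory: derive one DAG id per configuration entry."""
    return [f"{wf.ingester_class.__name__}_workflow" for wf in workflows]


dag_ids = build_dag_ids(PROVIDER_WORKFLOWS)
```

The point of the pattern is that adding a provider means appending a configuration entry rather than writing a new DAG by hand.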

At minimum, you'll need to provide the following in your configuration:

1 change: 1 addition & 0 deletions documentation/catalog/guides/index.md
@@ -5,4 +5,5 @@
quickstart
deploy
adding_a_new_provider
```
