diff --git a/catalog/docs/assets/provider_dags/multi_media_dag.png b/documentation/_static/multi_media_dag.png
similarity index 100%
rename from catalog/docs/assets/provider_dags/multi_media_dag.png
rename to documentation/_static/multi_media_dag.png
diff --git a/catalog/docs/assets/provider_dags/simple_dag.png b/documentation/_static/simple_dag.png
similarity index 100%
rename from catalog/docs/assets/provider_dags/simple_dag.png
rename to documentation/_static/simple_dag.png
diff --git a/catalog/docs/adding_a_new_provider.md b/documentation/catalog/guides/adding_a_new_provider.md
similarity index 81%
rename from catalog/docs/adding_a_new_provider.md
rename to documentation/catalog/guides/adding_a_new_provider.md
index 97a0d04b638..a2a3f3dff23 100644
--- a/catalog/docs/adding_a_new_provider.md
+++ b/documentation/catalog/guides/adding_a_new_provider.md
@@ -8,7 +8,7 @@ automated by
 [Airflow DAGs](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html)
 generated for each provider. A simple provider DAG looks like this:
 
-![Example DAG](assets/provider_dags/simple_dag.png)
+![Example DAG](/_static/simple_dag.png)
 
 At a high level the steps are:
 
@@ -24,7 +24,7 @@ When a provider supports multiple media types (for example, `audio` _and_
 `images`), the `pull` step consumes data of all types, but separate `load` steps
 are generated:
 
-![Example Multi-Media DAG](assets/provider_dags/multi_media_dag.png)
+![Example Multi-Media DAG](/_static/multi_media_dag.png)
 
 ## Adding a New Provider
 
@@ -39,19 +39,20 @@ provider DAG, you need to:
 
 We call the code that pulls data from our provider APIs "Provider API scripts".
 You can find examples in
-[`provider_api_scripts` folder](../dags/providers/provider_api_scripts). This
-code will be run during the `pull` steps of the provider DAG.
+[`provider_api_scripts` folder](https://github.com/WordPress/openverse/tree/main/catalog/dags/providers/provider_api_scripts).
+This code will be run during the `pull` steps of the provider DAG.
 
 At a high level, a provider script should iteratively request batches of records
 from the provider API, extract data in the format required by Openverse, and
 commit it to local storage. Much of this logic is implemented in a
-[`ProviderDataIngester` base class](../dags/providers/provider_api_scripts/provider_data_ingester.py)
+[`ProviderDataIngester` base class](https://github.com/WordPress/openverse/blob/main/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py)
 (which also provides additional testing features __). To add a new provider,
 extend this class and implement its abstract methods.
 
-We provide a [script](../dags/templates/create_provider_ingester.py) that can be
-used to generate the files you'll need and get you started:
+We provide a
+[script](https://github.com/WordPress/openverse/blob/main/catalog/templates/create_provider_ingester.py)
+that can be used to generate the files you'll need and get you started:
 
 ```
 # PROVIDER_NAME: The name of the provider
@@ -84,13 +85,15 @@ NOTE: You will also need to add a new ProviderWorkflow dataclass configuration t
 ```
 
 This generates a provider script with a templated `ProviderDataIngester` for you
-in the [`provider_api_scripts` folder](../dags/providers/provider_api_scripts),
+in the
+[`provider_api_scripts` folder](https://github.com/WordPress/openverse/tree/main/catalog/dags/providers/provider_api_scripts),
 as well as a corresponding test file. Complete the TODOs detailed in the
 generated files to implement behavior specific to your API.
 
 Some APIs may not fit perfectly into the established `ProviderDataIngester`
 pattern. For advanced use cases and examples of how to modify the ingestion
-flow, see the [`ProviderDataIngester` FAQ](provider_data_ingester_faq.md).
+flow, see the
+[`ProviderDataIngester` FAQ](https://github.com/WordPress/openverse/blob/main/catalog/docs/provider_data_ingester_faq.md).
 
 ### Add a `ProviderWorkflow` configuration class
 
@@ -98,9 +101,9 @@ Now that you have an ingester class, you're ready to wire up a provider DAG in
 Airflow to automatically pull data and load it into our Catalog database. This
 is done by defining a `ProviderWorkflow` configuration dataclass and adding it
 to the `PROVIDER_WORKFLOWS` list in
-[`provider_workflows.py`](../dags/providers/provider_workflows.py). Our DAG
-factories will pick up the configuration and generate a complete new DAG in
-Airflow!
+[`provider_workflows.py`](https://github.com/WordPress/openverse/blob/main/catalog/dags/providers/provider_workflows.py).
+Our DAG factories will pick up the configuration and generate a complete new DAG
+in Airflow!
 
 At minimum, you'll need to provide the following in your configuration:
 
diff --git a/documentation/catalog/guides/index.md b/documentation/catalog/guides/index.md
index 24fd9270037..c9cec0b97ad 100644
--- a/documentation/catalog/guides/index.md
+++ b/documentation/catalog/guides/index.md
@@ -5,4 +5,5 @@
 
 quickstart
 deploy
+adding_a_new_provider
 ```