Object storage big clarification (#706)
* Object storage big clarification

* Fix build
hcourdent authored Sep 19, 2024
1 parent 2bf18b0 commit 8a065db
Showing 41 changed files with 581 additions and 467 deletions.
6 changes: 3 additions & 3 deletions blog/2023-11-15-launch-week-1/index.mdx
@@ -297,11 +297,11 @@ If that's not sufficient you can even build your own app in React.

## Day 5

-For the last day of our launch week, today we focused on features that will help you in your ETLs, with restartable flows and S3 integration for data pipelines.
+For the last day of our launch week, today we focused on features that will help you in your ETLs, with restartable flows and Workspace object storage for data pipelines.

-### Windmill for data pipelines - S3 Integration
+### Windmill for data pipelines - Workspace object storage

-![Windmill for data pipelines - S3 Integration](../2023-11-24-data-pipeline-orchestrator/data_pipelines.png.webp 'Windmill for data pipelines - S3 Integration')
+![Windmill for data pipelines - Workspace object storage](../2023-11-24-data-pipeline-orchestrator/data_pipelines.png.webp 'Windmill for data pipelines - Workspace object storage')

_Run your ETLs on-prem up to 5x faster using Windmill compared to Spark while simplifying your infra._

2 changes: 1 addition & 1 deletion blog/2023-11-20-ai-flow-builder/index.mdx
@@ -103,7 +103,7 @@ You can see below an example of a simple workflow with a [for-loop](/docs/flows/
![Windmill DAG](./media/windmill-dag.png.webp)

:::info Workflow engine vs Analytics engine
-All the examples above focus on small API integrations, but data pipelines that would usually run on a dedicated analytics engine are a great fit for Windmill when combined with s3 and dataframe/OLAP libraries such as Polars or DuckDB.
+All the examples above focus on small API integrations, but data pipelines that would usually run on a dedicated analytics engine are a great fit for Windmill when combined with S3 and dataframe/OLAP libraries such as Polars or DuckDB.
Indeed, thanks to these integrations and Windmill's lack of boilerplate, Windmill offers state-of-the-art performance for data processing at scale while keeping complexity low.

<br />
@@ -106,7 +106,7 @@ import ScatterChart from '@site/src/components/ScatterChart';
/>
</div>

-[Benchmarking data and dedicated methodology documentation](https://www.Windmill.dev/docs/misc/benchmarks/competitors).
+[Benchmarking data and dedicated methodology documentation](/docs/misc/benchmarks/competitors).

You've known Windmill to be a productive environment to monitor, write and iterate on workflows, but we wanted to prove it's also the best system to deploy at scale in production.

@@ -149,7 +149,7 @@ That being said, Temporal is amazing at what it does and if there are overlaps b

We leave analytics/ETL engines such as Spark or Dagster out of it for today as they are not workflow engines _per se_ even if they are built on top of ones.

-ETL and analytics workflows will be covered later this week, and you will find that Windmill offers best-in-class performance for analytics workloads leveraging s3, DuckDB and Polars.
+ETL and analytics workflows will be covered later this week, and you will find that Windmill offers best-in-class performance for analytics workloads leveraging S3, DuckDB and Polars.

:::

@@ -299,7 +299,7 @@ json_path.map(|x| x.split(".").map(|x| x.to_string()).collect::<Vec<_>>())

- share data in a temporary folder
Flows can be configured to be wholly executed on the same worker. When that is the case, a folder is shared and symlinked inside every job's ephemeral folder (jobs are started in an ephemeral folder that is removed at the end of their execution)
-- pass data in S3 using the S3 integration (updates specific to that part to be presented on day 5)
+- pass data in S3 using the [workspace object storage](/docs/core_concepts/object_storage_in_windmill#workspace-object-storage) (updates specific to that part to be presented on day 5)
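
A minimal sketch of the shared-folder option, assuming the flow is configured to execute on a single worker and that the shared directory is mounted at `./shared` as described in the Windmill docs (the file name is illustrative):

```typescript
import { writeFile } from "node:fs/promises";

// Step 1 of a same-worker flow: materialize an intermediate dataset in
// the shared folder instead of returning it as a job result.
export async function main(): Promise<string> {
  const rows = [{ id: 1 }, { id: 2 }];
  await writeFile("./shared/rows.json", JSON.stringify(rows));
  // Pass only the path; a later step can read it back with
  // readFile("./shared/rows.json", "utf8"), since both jobs see the
  // same symlinked folder.
  return "./shared/rows.json";
}
```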

## Workers efficiency

4 changes: 2 additions & 2 deletions blog/2023-11-24-data-pipeline-orchestrator/index.mdx
@@ -57,7 +57,7 @@ And for storage, you can now link a Windmill workspace to an S3 bucket and use i
The vast majority of ETLs can be processed step-wise on single nodes and Windmill provides (one of) the best models for orchestrating non-sharded compute. Using this model, your ETLs will see a massive performance improvement, your infrastructure
will be easier to manage and your pipeline will be easier to write, maintain, and monitor.

-## Windmill integration with an external S3 storage
+## Windmill integration with an external object storage

In Windmill, a data pipeline is implemented using a [flow](/docs/flows/flow_editor), and each step of the pipeline is a script. One of the key features of Windmill flows is to easily [pass a step result to its dependent steps](/docs/flows/architecture). But
because those results are serialized to the Windmill database and kept as long as the job is stored, this obviously won't work when the result is a dataset of millions of rows. The solution is to save the datasets to an external storage at the end of each script.
@@ -69,7 +69,7 @@ The first step is to define an [S3 resource](/docs/integrations/s3) in Windmill
![S3 workspace settings](./workspace_s3_settings.png 'S3 workspace settings')

From now on, Windmill will be connected to this bucket and you'll have easy access to it from the code editor and the job run details. If a script takes an `s3object` as input, you will see in the input form on the right a button helping you choose the file directly from the bucket.
-Same for the result of the script. If you return an `s3object` containing a key `s3` pointing to a file inside your bucket, in the result panel there will be a button to open the bucket explorer to visualize the file.
+Same for the result of the script. If you return an `s3object` containing a [key](/docs/core_concepts/rich_display_rendering#s3) `s3` pointing to a file inside your bucket, in the result panel there will be a button to open the bucket explorer to visualize the file.
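
As a sketch of what this looks like in a script, assuming the `S3Object` type exported by `windmill-client` and an illustrative output key:

```typescript
import type { S3Object } from "windmill-client";

// The `input_file` argument renders as a bucket file picker in the
// input form; returning an object with an `s3` key adds a
// bucket-explorer button to the result panel.
export async function main(input_file: S3Object): Promise<S3Object> {
  console.log(`transforming ${input_file.s3}`);
  // ... process the file and write the output back to the bucket ...
  return { s3: "output/result.parquet" };
}
```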

![Windmill code editor](./s3_object_code_editor.png 'Windmill code editor')

2 changes: 1 addition & 1 deletion blog/2024-07-12-airflow-alternatives/index.mdx
@@ -138,7 +138,7 @@ Windmill is an [open-source](https://github.com/windmill-labs/windmill) workflow

Windmill was designed by developers for developers, ranging from semi-technical (low code builders) to senior/staff software engineers with high standards for production-grade yet flexible and customizable with code. Windmill was built to address the challenge of turning high-value code containing business logic, data transformation, and internal API calls into scalable microservices and tools without the usual heavy lifting.

-On the other hand, the support of [Python](/docs/getting_started/scripts_quickstart/python) as a primary language and the integration of a workspace with [object storage](/docs/core_concepts/persistent_storage/large_data_files) (in particular, S3) make Windmill an excellent fit for data engineers, particularly for building [data pipelines](/docs/core_concepts/data_pipelines).
+On the other hand, the support of [Python](/docs/getting_started/scripts_quickstart/python) as a primary language and the integration of a workspace with [object storage](/docs/core_concepts/object_storage_in_windmill) (in particular, S3) make Windmill an excellent fit for data engineers, particularly for building [data pipelines](/docs/core_concepts/data_pipelines).

Windmill has three editors (or products), all compatible, each independently functioning:
1. The [Script Editor](/docs/script_editor) is an integrated development environment that allows you to write code in various languages like TypeScript, Python, Go, Bash, SQL, or even run any Docker container through Windmill's Bash support.
4 changes: 2 additions & 2 deletions changelog/2024-05-31-secondary-storage/index.md
@@ -4,11 +4,11 @@ version: v1.340.0
title: Secondary Storage
tags: ['Persistent Storage']
image: ./secondary_storage.png
-description: With all Windmill S3 Integration features, read and write from a storage that is not your main storage by specifying it in the s3 object as "secondary_storage" with the name of it.
+description: Read and write from a storage that is not your main storage by specifying it in the S3 object as "secondary_storage" with the name of it.
features:
[
'Add additional storages from S3, Azure Blob, AWS OIDC or Azure Workload Identity.',
'From script, specify the secondary storage with an object with properties `s3` (path to the file) and `storage` (name of the secondary storage).'
]
-docs: /docs/core_concepts/persistent_storage/large_data_files#secondary-s3-storage
+docs: /docs/core_concepts/object_storage_in_windmill#secondary-s3-storage
---
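
For illustration, a minimal script sketch targeting such a secondary storage; the storage name "archive" and the key are assumptions, while the object shape with `s3` and `storage` properties follows the feature description above:

```typescript
import type { S3Object } from "windmill-client";

export async function main(): Promise<S3Object> {
  // `s3` is the path inside the bucket; `storage` names the secondary
  // storage to read from/write to instead of the workspace's main one.
  return { s3: "exports/2024-05.parquet", storage: "archive" };
}
```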
4 changes: 2 additions & 2 deletions docs/advanced/14_dependencies_in_typescript/index.mdx
@@ -188,7 +188,7 @@ Windmill CLI, it is done automatically on `wmill sync push` for any script that

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="Codebases & Bundles"
title="Codebases & bundles"
description="Deploy scripts with any local relative imports as bundles."
href="/docs/core_concepts/codebases_and_bundles"
/>
@@ -254,7 +254,7 @@ Note that path in Windmill can have as many depth as needed, so you can have pat

You can use private npm registries and private npm packages in your TypeScript scripts.

-This applies to all methods above. Only, if using Codebases & Bundles locally, there is nothing to configure in Windmill, because the bundle is built locally using your locally-installed modules (which support traditional npm packages and private npm packages).
+This applies to all methods above. Only, if using Codebases & bundles locally, there is nothing to configure in Windmill, because the bundle is built locally using your locally-installed modules (which support traditional npm packages and private npm packages).

![Private NPM registry](../6_imports/private_registry.png 'Private NPM registry')

6 changes: 4 additions & 2 deletions docs/advanced/18_instance_settings/index.mdx
@@ -92,12 +92,14 @@ This setting is only available on [Enterprise Edition](/pricing).

### S3/Azure for Python/Go cache & large logs

-Bucket to [store large logs](../../core_concepts/20_jobs/index.mdx#s3azure-for-python-cache--large-logs) and global cache for Python and Go.
+[Connect your instance](../../core_concepts/38_object_storage_in_windmill/index.mdx#instance-object-storage) to an S3 bucket to [store large logs](../../core_concepts/20_jobs/index.mdx#large-logs-management-with-s3) and [global cache for Python and Go](../../misc/13_s3_cache/index.mdx).

-This feature has no overlap with the [Workspace S3 integration](../../core_concepts/11_persistent_storage/large_data_files.mdx).
+This feature has no overlap with the [Workspace object storage](../../core_concepts/38_object_storage_in_windmill/index.mdx#workspace-object-storage).

+You can choose to use either S3 or Azure Blob Storage. For each you will find a button to test settings from a server or from a worker.

![S3/Azure for Python/Go cache & large logs](../../core_concepts/20_jobs/s3_azure_cache.png "S3/Azure for Python/Go cache & large logs")

+This setting is only available on [Enterprise Edition](/pricing).

### Critical alert channels
2 changes: 1 addition & 1 deletion docs/advanced/1_self_host/index.mdx
@@ -515,7 +515,7 @@ enterprise:

You will want to disable the postgresql provided with the helm chart and set the database_url to your own managed postgresql.

-For high-scale deployments (> 20 workers), we recommend using the [global S3 cache](../../misc/13_s3_cache/index.md). You will need an object storage compatible with the S3 protocol.
+For high-scale deployments (> 20 workers), we recommend using the [global S3 cache](../../misc/13_s3_cache/index.mdx). You will need an object storage compatible with the S3 protocol.

## Run Windmill without using a Postgres superuser

2 changes: 1 addition & 1 deletion docs/advanced/3_cli/sync.mdx
@@ -152,7 +152,7 @@ export interface SyncOptions {
}
```

-## Example Repo for Syncing with Windmill in Git
+## Example repo for syncing with Windmill in git

We provide an example repo for syncing with Windmill:

2 changes: 1 addition & 1 deletion docs/advanced/5_sharing_common_logic/index.mdx
@@ -115,7 +115,7 @@ Windmill CLI, it is done automatically on `wmill sync push` for any script that

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="Codebases & Bundles"
title="Codebases & bundles"
description="Deploy scripts with any local relative imports as bundles."
href="/docs/core_concepts/codebases_and_bundles"
/>
4 changes: 2 additions & 2 deletions docs/advanced/6_imports/index.mdx
@@ -170,7 +170,7 @@ See the dedicated pages for TypeScript and Python to learn how to handle depende

To import other scripts from your workspace, see [Sharing common logic](../5_sharing_common_logic/index.mdx).

-To import from a custom codebase, see [Codebases & Bundles](../../core_concepts/33_codebases_and_bundles/index.mdx).
+To import from a custom codebase, see [Codebases & bundles](../../core_concepts/33_codebases_and_bundles/index.mdx).

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
@@ -179,7 +179,7 @@ To import from a custom codebase, see [Codebases & Bundles](../../core_concepts/
href="/docs/advanced/sharing_common_logic"
/>
<DocCard
title="Codebases & Bundles"
title="Codebases & bundles"
description="Deploy scripts with any local relative imports as bundles."
href="/docs/core_concepts/codebases_and_bundles"
/>
2 changes: 1 addition & 1 deletion docs/compared_to/peers.mdx
@@ -53,7 +53,7 @@ By comparison, in Windmill one would just write the canonical python or typescri

<br />

-> [1]: Windmill is not just a workflow engine, it is also a function as a service (FaaS) infrastructure where it can run arbitrary scripts in TypeScript/Python/Bash/Go. Contrary to Lambda or GCP cloud functions, we do not need the functions to be pre-packaged and deployed in advance AOT. For TypeScript, we rely on the deno runtime that leverages v8 isolates and the immutable caching capabilities of deno. For Python, we have implemented our own dependency resolver that will override the Python virtual path and create a unique virtual environment for that specific script that will respect the lockfile generated at time of saving the script/flow for reproducibility. Given that those are interpreted languages, we pay no performance penalty to interpret that code on demand. So the only limiting factor for task execution is that in the event that dependencies are not cached by the worker, they need to be installed at time of execution. With a limited number of workers, the likelihood of a cache miss is low as soon as one script/workflow is executed more than once. With a large fleet of workers, cache misses increase, and hence we have implemented a global caching mechanism that relies on syncing the cache through s3. It is only available in our [enterprise edition](/pricing). With it in place, we run tasks and workflows with 0 overhead versus running the same scripts on bare-metal. You can even leverage hardware acceleration without any additional configuration.
+> [1]: Windmill is not just a workflow engine, it is also a function as a service (FaaS) infrastructure where it can run arbitrary scripts in TypeScript/Python/Bash/Go. Contrary to Lambda or GCP cloud functions, we do not need the functions to be pre-packaged and deployed in advance AOT. For TypeScript, we rely on the deno runtime that leverages v8 isolates and the immutable caching capabilities of deno. For Python, we have implemented our own dependency resolver that will override the Python virtual path and create a unique virtual environment for that specific script that will respect the lockfile generated at time of saving the script/flow for reproducibility. Given that those are interpreted languages, we pay no performance penalty to interpret that code on demand. So the only limiting factor for task execution is that in the event that dependencies are not cached by the worker, they need to be installed at time of execution. With a limited number of workers, the likelihood of a cache miss is low as soon as one script/workflow is executed more than once. With a large fleet of workers, cache misses increase, and hence we have implemented a global caching mechanism that relies on syncing the cache through S3. It is only available in our [enterprise edition](/pricing). With it in place, we run tasks and workflows with 0 overhead versus running the same scripts on bare-metal. You can even leverage hardware acceleration without any additional configuration.

</details>

6 changes: 4 additions & 2 deletions docs/core_concepts/10_error_handling/index.mdx
@@ -69,10 +69,12 @@ There are other tricks to do Error handling in flows, see:
/>
</div>

-## Schedules' Error Handlers
+## Schedules Error Handlers

Add a special script or flow to execute in case of an error in your [scheduled](../1_scheduling/index.mdx) script or flow.

+Schedule Error handler is an [Enterprise Edition](/pricing) feature.

You can pick the Slack pre-set schedule error handler or define your own.
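
A hypothetical sketch of a custom handler; the exact parameters Windmill injects should be taken from the pre-set template, so the ones below are assumptions for illustration:

```typescript
// Runs whenever a scheduled script/flow errors.
export async function main(
  path: string, // path of the script/flow that failed (assumed name)
  error: object, // error returned by the failed run (assumed name)
  schedule_path: string, // path of the schedule that triggered it (assumed)
  failed_times: number // consecutive failures so far (assumed)
) {
  console.error(
    `${path} (schedule ${schedule_path}) failed ${failed_times} time(s):`,
    error
  );
  // e.g. forward the error to your alerting channel of choice here
}
```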

<video
@@ -92,7 +94,7 @@

## Workspace Error Handler

-Define a script or flow to be executed automatically in case of error in the workspace.
+Define a script or flow to be executed automatically in case of error in the workspace (e.g. a scheduled job fails to re-schedule).

### Workspace Error Handler on Slack

14 changes: 7 additions & 7 deletions docs/core_concepts/11_persistent_storage/index.mdx
@@ -56,19 +56,19 @@ All details at:
/>
</div>

-## Object Storage for Large Data: S3, R2, MinIO, Azure Blob
+## Large data: S3, R2, MinIO, Azure Blob

For heavier data objects & unstructured data storage, [Amazon S3](https://aws.amazon.com/s3/) (Simple Storage Service) and its alternatives [Cloudflare R2](https://www.cloudflare.com/developer-platform/r2/) and [MinIO](https://min.io/), as well as [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), are highly scalable and durable object storage services that provide secure, reliable, and cost-effective storage for a wide range of data types and use cases.

-Windmill comes with a [native integration with S3 and Azure Blob](./large_data_files.mdx#connect-your-windmill-workspace-to-your-s3-bucket-or-your-azure-blob-storage), making it the recommended storage for large objects like files and binary data.
+Windmill comes with a [native integration with S3 and Azure Blob](./large_data_files.mdx), making it the recommended storage for large objects like files and binary data.

-![S3 Integration Infographic](./s3_infographics.png "S3 Integration Infographic")
+![Workspace object storage Infographic](./s3_infographics.png "Workspace object storage Infographic")

All details at:

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="Object Storage for Large Data: S3, R2, MinIO, Azure Blob"
title="Large data: S3, R2, MinIO, Azure Blob"
description="Windmill comes with a native integration with S3 and Azure Blob, making it the recommended storage for large objects like files and binary data."
href="/docs/core_concepts/persistent_storage/large_data_files"
/>
@@ -82,21 +82,21 @@

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="Big Structured SQL Data: Postgres (Supabase, Neon.tech)"
title="Big structured SQL data: Postgres (Supabase, Neon.tech)"
description="For Postgres databases (best for structured data storage and retrieval, where you can define schema and relationships between entities), we recommend using Supabase or Neon.tech."
href="/docs/core_concepts/persistent_storage/structured_databases"
/>
</div>

-## NoSQL and Document Databases (Mongodb, Key-Value Stores)
+## NoSQL & Document databases (Mongodb, Key-Value Stores)

Key-value stores are a popular choice for managing non-structured data, providing a flexible and scalable solution for various data types and use cases. In the context of Windmill, you can use MongoDB Atlas, Redis, and Upstash to store and manipulate non-structured data effectively.

All details at:

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="NoSQL and Document Databases (Mongodb, Key-Value Stores)"
title="NoSQL & Document databases (Mongodb, Key-Value Stores)"
description="Key-value stores are a popular choice for managing non-structured data, providing a flexible and scalable solution for various data types and use cases."
href="/docs/core_concepts/persistent_storage/key_value_stores"
/>
@@ -1,4 +1,4 @@
-# NoSQL and Document Databases (Mongodb, Key-Value Stores)
+# NoSQL & Document databases (Mongodb, Key-Value Stores)
# NoSQL & Document databases (Mongodb, Key-Value Stores)

This page is part of our section on [Persistent Storage & Databases](./index.mdx) which covers where to effectively store and manage the data manipulated by Windmill. Check that page for more options on data storage.
