Object storage big clarification (#706)
* Object storage big clarification

* Fix build
hcourdent authored Sep 19, 2024
1 parent 2bf18b0 commit 8a065db
Showing 41 changed files with 581 additions and 467 deletions.
6 changes: 3 additions & 3 deletions blog/2023-11-15-launch-week-1/index.mdx
@@ -297,11 +297,11 @@ If that's not sufficient you can even build your own app in React.

## Day 5

-For the last day of our launch week, today we focused on features that will help you in your ETLs, with restartable flows and S3 integration for data pipelines.
+For the last day of our launch week, today we focused on features that will help you in your ETLs, with restartable flows and Workspace object storage for data pipelines.

-### Windmill for data pipelines - S3 Integration
+### Windmill for data pipelines - Workspace object storage

-![Windmill for data pipelines - S3 Integration](../2023-11-24-data-pipeline-orchestrator/data_pipelines.png.webp 'Windmill for data pipelines - S3 Integration')
+![Windmill for data pipelines - Workspace object storage](../2023-11-24-data-pipeline-orchestrator/data_pipelines.png.webp 'Windmill for data pipelines - Workspace object storage')

_Run your ETLs on-prem up to 5x faster using Windmill compared to Spark while simplifying your infra._

2 changes: 1 addition & 1 deletion blog/2023-11-20-ai-flow-builder/index.mdx
@@ -103,7 +103,7 @@ You can see below an example of a simple workflow with a [for-loop](/docs/flows/
![Windmill DAG](./media/windmill-dag.png.webp)

:::info Workflow engine vs Analytics engine
-All the examples above focus on small API integrations, but data pipelines that would usually run on a dedicated analytics engine are a great fit for Windmill when combined with s3 and dataframe/OLAP libraries such as Polars or DuckDB.
+All the examples above focus on small API integrations, but data pipelines that would usually run on a dedicated analytics engine are a great fit for Windmill when combined with S3 and dataframe/OLAP libraries such as Polars or DuckDB.
Indeed, thanks to these integrations and Windmill's lack of boilerplate, Windmill offers state-of-the-art performance for data processing at scale while keeping complexity low.

<br />
@@ -106,7 +106,7 @@ import ScatterChart from '@site/src/components/ScatterChart';
/>
</div>

-[Benchmarking data and dedicated methodology documentation](https://www.Windmill.dev/docs/misc/benchmarks/competitors).
+[Benchmarking data and dedicated methodology documentation](/docs/misc/benchmarks/competitors).

You've known Windmill to be a productive environment to monitor, write and iterate on workflows, but we wanted to prove it's also the best system to deploy at scale in production.

@@ -149,7 +149,7 @@ That being said, Temporal is amazing at what it does and if there are overlaps b

We leave analytics/ETL engines such as Spark or Dagster out of it for today as they are not workflow engines _per se_ even if they are built on top of ones.

-ETL and analytics workflows will be covered later this week, and you will find that Windmill offers best-in-class performance for analytics workloads leveraging s3, DuckDB and Polars.
+ETL and analytics workflows will be covered later this week, and you will find that Windmill offers best-in-class performance for analytics workloads leveraging S3, DuckDB and Polars.

:::

@@ -299,7 +299,7 @@ json_path.map(|x| x.split(".").map(|x| x.to_string()).collect::<Vec<_>>())

- share data in a temporary folder
Flows can be configured to be wholly executed on the same worker. When that is the case, a folder is shared and symlinked inside every job's ephemeral folder (jobs are started in an ephemeral folder that is removed at the end of their execution)
-- pass data in S3 using the S3 integration (updates specific to that part to be presented on day 5)
+- pass data in S3 using the [workspace object storage](/docs/core_concepts/object_storage_in_windmill#workspace-object-storage) (updates specific to that part to be presented on day 5)
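
A minimal sketch of the shared-folder option, assuming the flow is configured to execute on a single worker and that the shared directory is mounted at `./shared` as described in the Windmill docs (the file name is illustrative):

```typescript
import { writeFile } from "node:fs/promises";

// Step 1 of a same-worker flow: materialize an intermediate dataset in
// the shared folder instead of returning it as a job result.
export async function main(): Promise<string> {
  const rows = [{ id: 1 }, { id: 2 }];
  await writeFile("./shared/rows.json", JSON.stringify(rows));
  // Pass only the path; a later step can read it back with
  // readFile("./shared/rows.json", "utf8"), since both jobs see the
  // same symlinked folder.
  return "./shared/rows.json";
}
```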

## Workers efficiency

4 changes: 2 additions & 2 deletions blog/2023-11-24-data-pipeline-orchestrator/index.mdx
@@ -57,7 +57,7 @@ And for storage, you can now link a Windmill workspace to an S3 bucket and use i
The vast majority of ETLs can be processed step-wise on single nodes and Windmill provides (one of) the best models for orchestrating non-sharded compute. Using this model, your ETLs will see a massive performance improvement, your infrastructure
will be easier to manage and your pipeline will be easier to write, maintain, and monitor.

-## Windmill integration with an external S3 storage
+## Windmill integration with an external object storage

In Windmill, a data pipeline is implemented using a [flow](/docs/flows/flow_editor), and each step of the pipeline is a script. One of the key features of Windmill flows is to easily [pass a step result to its dependent steps](/docs/flows/architecture). But
because those results are serialized to the Windmill database and kept as long as the job is stored, this obviously won't work when the result is a dataset of millions of rows. The solution is to save the datasets to an external storage at the end of each script.
@@ -69,7 +69,7 @@ The first step is to define an [S3 resource](/docs/integrations/s3) in Windmill
![S3 workspace settings](./workspace_s3_settings.png 'S3 workspace settings')

From now on, Windmill will be connected to this bucket and you'll have easy access to it from the code editor and the job run details. If a script takes an `s3object` as input, you will see in the input form on the right a button helping you choose the file directly from the bucket.
-Same for the result of the script. If you return an `s3object` containing a key `s3` pointing to a file inside your bucket, in the result panel there will be a button to open the bucket explorer to visualize the file.
+Same for the result of the script. If you return an `s3object` containing a [key](/docs/core_concepts/rich_display_rendering#s3) `s3` pointing to a file inside your bucket, in the result panel there will be a button to open the bucket explorer to visualize the file.
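
As a sketch of what this looks like in a script, assuming the `S3Object` type exported by `windmill-client` and an illustrative output key:

```typescript
import type { S3Object } from "windmill-client";

// The `input_file` argument renders as a bucket file picker in the
// input form; returning an object with an `s3` key adds a
// bucket-explorer button to the result panel.
export async function main(input_file: S3Object): Promise<S3Object> {
  console.log(`transforming ${input_file.s3}`);
  // ... process the file and write the output back to the bucket ...
  return { s3: "output/result.parquet" };
}
```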

![Windmill code editor](./s3_object_code_editor.png 'Windmill code editor')

2 changes: 1 addition & 1 deletion blog/2024-07-12-airflow-alternatives/index.mdx
@@ -138,7 +138,7 @@ Windmill is an [open-source](https://github.com/windmill-labs/windmill) workflow

Windmill was designed by developers for developers, ranging from semi-technical (low code builders) to senior/staff software engineers with high standards for production-grade yet flexible and customizable with code. Windmill was built to address the challenge of turning high-value code containing business logic, data transformation, and internal API calls into scalable microservices and tools without the usual heavy lifting.

-On the other hand, the support of [Python](/docs/getting_started/scripts_quickstart/python) as a primary language and the integration of a workspace with [object storage](/docs/core_concepts/persistent_storage/large_data_files) (in particular, S3) make Windmill an excellent fit for data engineers, particularly for building [data pipelines](/docs/core_concepts/data_pipelines).
+On the other hand, the support of [Python](/docs/getting_started/scripts_quickstart/python) as a primary language and the integration of a workspace with [object storage](/docs/core_concepts/object_storage_in_windmill) (in particular, S3) make Windmill an excellent fit for data engineers, particularly for building [data pipelines](/docs/core_concepts/data_pipelines).

Windmill has three editors (or products), all compatible, each independently functioning:
1. The [Script Editor](/docs/script_editor) is an integrated development environment that allows you to write code in various languages like TypeScript, Python, Go, Bash, SQL, or even run any Docker container through Windmill's Bash support.
4 changes: 2 additions & 2 deletions changelog/2024-05-31-secondary-storage/index.md
@@ -4,11 +4,11 @@ version: v1.340.0
title: Secondary Storage
tags: ['Persistent Storage']
image: ./secondary_storage.png
-description: With all Windmill S3 Integration features, read and write from a storage that is not your main storage by specifying it in the s3 object as "secondary_storage" with the name of it.
+description: Read and write from a storage that is not your main storage by specifying it in the S3 object as "secondary_storage" with the name of it.
features:
[
'Add additional storages from S3, Azure Blob, AWS OIDC or Azure Workload Identity.',
'From script, specify the secondary storage with an object with properties `s3` (path to the file) and `storage` (name of the secondary storage).'
]
-docs: /docs/core_concepts/persistent_storage/large_data_files#secondary-s3-storage
+docs: /docs/core_concepts/object_storage_in_windmill#secondary-s3-storage
---
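
For illustration, a minimal script sketch targeting such a secondary storage; the storage name "archive" and the key are assumptions, while the object shape with `s3` and `storage` properties follows the feature description above:

```typescript
import type { S3Object } from "windmill-client";

export async function main(): Promise<S3Object> {
  // `s3` is the path inside the bucket; `storage` names the secondary
  // storage to read from/write to instead of the workspace's main one.
  return { s3: "exports/2024-05.parquet", storage: "archive" };
}
```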
4 changes: 2 additions & 2 deletions docs/advanced/14_dependencies_in_typescript/index.mdx
@@ -188,7 +188,7 @@ Windmill CLI, it is done automatically on `wmill sync push` for any script that

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="Codebases & Bundles"
title="Codebases & bundles"
description="Deploy scripts with any local relative imports as bundles."
href="/docs/core_concepts/codebases_and_bundles"
/>
@@ -254,7 +254,7 @@ Note that path in Windmill can have as many depth as needed, so you can have pat

You can use private npm registries and private npm packages in your TypeScript scripts.

-This applies to all methods above. Only, if using Codebases & Bundles locally, there is nothing to configure in Windmill, because the bundle is built locally using your locally-installed modules (which support traditional npm packages and private npm packages).
+This applies to all methods above. Only, if using Codebases & bundles locally, there is nothing to configure in Windmill, because the bundle is built locally using your locally-installed modules (which support traditional npm packages and private npm packages).

![Private NPM registry](../6_imports/private_registry.png 'Private NPM registry')

6 changes: 4 additions & 2 deletions docs/advanced/18_instance_settings/index.mdx
@@ -92,12 +92,14 @@ This setting is only available on [Enterprise Edition](/pricing).

### S3/Azure for Python/Go cache & large logs

-Bucket to [store large logs](../../core_concepts/20_jobs/index.mdx#s3azure-for-python-cache--large-logs) and global cache for Python and Go.
+[Connect your instance](../../core_concepts/38_object_storage_in_windmill/index.mdx#instance-object-storage) to an S3 bucket to [store large logs](../../core_concepts/20_jobs/index.mdx#large-logs-management-with-s3) and [global cache for Python and Go](../../misc/13_s3_cache/index.mdx).

-This feature has no overlap with the [Workspace S3 integration](../../core_concepts/11_persistent_storage/large_data_files.mdx).
+This feature has no overlap with the [Workspace object storage](../../core_concepts/38_object_storage_in_windmill/index.mdx#workspace-object-storage).

+You can choose to use either S3 or Azure Blob Storage. For each you will find a button to test settings from a server or from a worker.

![S3/Azure for Python/Go cache & large logs](../../core_concepts/20_jobs/s3_azure_cache.png "S3/Azure for Python/Go cache & large logs")

+This setting is only available on [Enterprise Edition](/pricing).

### Critical alert channels
2 changes: 1 addition & 1 deletion docs/advanced/1_self_host/index.mdx
@@ -515,7 +515,7 @@ enterprise:

You will want to disable the postgresql provided with the helm chart and set the database_url to your own managed postgresql.

-For high-scale deployments (> 20 workers), we recommend using the [global S3 cache](../../misc/13_s3_cache/index.md). You will need an object storage compatible with the S3 protocol.
+For high-scale deployments (> 20 workers), we recommend using the [global S3 cache](../../misc/13_s3_cache/index.mdx). You will need an object storage compatible with the S3 protocol.

## Run Windmill without using a Postgres superuser

2 changes: 1 addition & 1 deletion docs/advanced/3_cli/sync.mdx
@@ -152,7 +152,7 @@ export interface SyncOptions {
}
```

-## Example Repo for Syncing with Windmill in Git
+## Example repo for syncing with Windmill in git

We provide an example repo for syncing with Windmill:

2 changes: 1 addition & 1 deletion docs/advanced/5_sharing_common_logic/index.mdx
@@ -115,7 +115,7 @@ Windmill CLI, it is done automatically on `wmill sync push` for any script that

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="Codebases & Bundles"
title="Codebases & bundles"
description="Deploy scripts with any local relative imports as bundles."
href="/docs/core_concepts/codebases_and_bundles"
/>
4 changes: 2 additions & 2 deletions docs/advanced/6_imports/index.mdx
@@ -170,7 +170,7 @@ See the dedicated pages for TypeScript and Python to learn how to handle depende

To import other scripts from your workspace, see [Sharing common logic](../5_sharing_common_logic/index.mdx).

-To import from a custom codebase, see [Codebases & Bundles](../../core_concepts/33_codebases_and_bundles/index.mdx).
+To import from a custom codebase, see [Codebases & bundles](../../core_concepts/33_codebases_and_bundles/index.mdx).

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
@@ -179,7 +179,7 @@ To import from a custom codebase, see [Codebases & Bundles](../../core_concepts/
href="/docs/advanced/sharing_common_logic"
/>
<DocCard
title="Codebases & Bundles"
title="Codebases & bundles"
description="Deploy scripts with any local relative imports as bundles."
href="/docs/core_concepts/codebases_and_bundles"
/>
2 changes: 1 addition & 1 deletion docs/compared_to/peers.mdx
@@ -53,7 +53,7 @@ By comparison, in Windmill one would just write the canonical python or typescri

<br />

-> [1]: Windmill is not just a workflow engine, it is also a function as a service (FaaS) infrastructure where it can run arbitrary scripts in TypeScript/Python/Bash/Go. Contrary to Lambda or GCP cloud functions, we do not need the functions to be pre-packaged and deployed in advance AOT. For TypeScript, we rely on the deno runtime that leverages v8 isolates and the immutable caching capabilities of deno. For Python, we have implemented our own dependency resolver that will override the Python virtual path and create a unique virtual environment for that specific script that will respect the lockfile generated at time of saving the script/flow for reproducibility. Given that those are interpreted languages, we pay no performance penalty to interpret that code on demand. So the only limiting factor for task execution is that in the event that dependencies are not cached by the worker, they need to be installed at time of execution. With a limited number of workers, the likelihood of a cache miss is low as soon as one script/workflow is executed more than once. With a large fleet of workers, cache misses increase, and hence we have implemented a global caching mechanism that relies on syncing the cache through s3. It is only available in our [enterprise edition](/pricing). With it in place, we run tasks and workflows with 0 overhead versus running the same scripts on bare-metal. You can even leverage hardware acceleration without any additional configuration.
+> [1]: Windmill is not just a workflow engine, it is also a function as a service (FaaS) infrastructure where it can run arbitrary scripts in TypeScript/Python/Bash/Go. Contrary to Lambda or GCP cloud functions, we do not need the functions to be pre-packaged and deployed in advance AOT. For TypeScript, we rely on the deno runtime that leverages v8 isolates and the immutable caching capabilities of deno. For Python, we have implemented our own dependency resolver that will override the Python virtual path and create a unique virtual environment for that specific script that will respect the lockfile generated at time of saving the script/flow for reproducibility. Given that those are interpreted languages, we pay no performance penalty to interpret that code on demand. So the only limiting factor for task execution is that in the event that dependencies are not cached by the worker, they need to be installed at time of execution. With a limited number of workers, the likelihood of a cache miss is low as soon as one script/workflow is executed more than once. With a large fleet of workers, cache misses increase, and hence we have implemented a global caching mechanism that relies on syncing the cache through S3. It is only available in our [enterprise edition](/pricing). With it in place, we run tasks and workflows with 0 overhead versus running the same scripts on bare-metal. You can even leverage hardware acceleration without any additional configuration.

</details>

6 changes: 4 additions & 2 deletions docs/core_concepts/10_error_handling/index.mdx
@@ -69,10 +69,12 @@ There are other tricks to do Error handling in flows, see:
/>
</div>

-## Schedules' Error Handlers
+## Schedules Error Handlers

Add a special script or flow to execute in case of an error in your [scheduled](../1_scheduling/index.mdx) script or flow.

+Schedule Error handler is an [Enterprise Edition](/pricing) feature.

You can pick the Slack pre-set schedule error handler or define your own.
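
A hypothetical sketch of a custom handler; the exact parameters Windmill injects should be taken from the pre-set template, so the ones below are assumptions for illustration:

```typescript
// Runs whenever a scheduled script/flow errors.
export async function main(
  path: string, // path of the script/flow that failed (assumed name)
  error: object, // error returned by the failed run (assumed name)
  schedule_path: string, // path of the schedule that triggered it (assumed)
  failed_times: number // consecutive failures so far (assumed)
) {
  console.error(
    `${path} (schedule ${schedule_path}) failed ${failed_times} time(s):`,
    error
  );
  // e.g. forward the error to your alerting channel of choice here
}
```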

<video
@@ -92,7 +94,7 @@

## Workspace Error Handler

-Define a script or flow to be executed automatically in case of error in the workspace.
+Define a script or flow to be executed automatically in case of error in the workspace (e.g. a scheduled job fails to re-schedule).

### Workspace Error Handler on Slack

14 changes: 7 additions & 7 deletions docs/core_concepts/11_persistent_storage/index.mdx
@@ -56,19 +56,19 @@ All details at:
/>
</div>

-## Object Storage for Large Data: S3, R2, MinIO, Azure Blob
+## Large data: S3, R2, MinIO, Azure Blob

For heavier data objects & unstructured data storage, [Amazon S3](https://aws.amazon.com/s3/) (Simple Storage Service) and its alternatives [Cloudflare R2](https://www.cloudflare.com/developer-platform/r2/) and [MinIO](https://min.io/), as well as [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), are highly scalable and durable object storage services that provide secure, reliable, and cost-effective storage for a wide range of data types and use cases.

-Windmill comes with a [native integration with S3 and Azure Blob](./large_data_files.mdx#connect-your-windmill-workspace-to-your-s3-bucket-or-your-azure-blob-storage), making it the recommended storage for large objects like files and binary data.
+Windmill comes with a [native integration with S3 and Azure Blob](./large_data_files.mdx), making it the recommended storage for large objects like files and binary data.

-![S3 Integration Infographic](./s3_infographics.png "S3 Integration Infographic")
+![Workspace object storage Infographic](./s3_infographics.png "Workspace object storage Infographic")

All details at:

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="Object Storage for Large Data: S3, R2, MinIO, Azure Blob"
title="Large data: S3, R2, MinIO, Azure Blob"
description="Windmill comes with a native integration with S3 and Azure Blob, making it the recommended storage for large objects like files and binary data."
href="/docs/core_concepts/persistent_storage/large_data_files"
/>
@@ -82,21 +82,21 @@

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="Big Structured SQL Data: Postgres (Supabase, Neon.tech)"
title="Big structured SQL data: Postgres (Supabase, Neon.tech)"
description="For Postgres databases (best for structured data storage and retrieval, where you can define schema and relationships between entities), we recommend using Supabase or Neon.tech."
href="/docs/core_concepts/persistent_storage/structured_databases"
/>
</div>

-## NoSQL and Document Databases (Mongodb, Key-Value Stores)
+## NoSQL & Document databases (Mongodb, Key-Value Stores)

Key-value stores are a popular choice for managing non-structured data, providing a flexible and scalable solution for various data types and use cases. In the context of Windmill, you can use MongoDB Atlas, Redis, and Upstash to store and manipulate non-structured data effectively.

All details at:

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="NoSQL and Document Databases (Mongodb, Key-Value Stores)"
title="NoSQL & Document databases (Mongodb, Key-Value Stores)"
description="Key-value stores are a popular choice for managing non-structured data, providing a flexible and scalable solution for various data types and use cases."
href="/docs/core_concepts/persistent_storage/key_value_stores"
/>
@@ -1,4 +1,4 @@
-# NoSQL and Document Databases (Mongodb, Key-Value Stores)
+# NoSQL & Document databases (Mongodb, Key-Value Stores)
# NoSQL & Document databases (Mongodb, Key-Value Stores)

This page is part of our section on [Persistent Storage & Databases](./index.mdx) which covers where to effectively store and manage the data manipulated by Windmill. Check that page for more options on data storage.
