-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add datalake network to clickhouse #339
Conversation
4040c4f
to
1d65da2
Compare
WalkthroughThe changes made in the pull request involve the addition of a new network configuration named Changes
Sequence Diagram(s)sequenceDiagram
participant Service as Analytics Datastore Clickhouse
participant Network as Datalake Network
Service->>Network: Connect to Datalake
Network-->>Service: Establish connection
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
analytics-datastore-clickhouse/docker-compose.yml (1)
Line range hint
62-64
: LGTM! Consider documenting the network integration.The network configuration correctly references the external MinIO network. This enables ClickHouse to access S3-compatible storage provided by MinIO.
Consider adding a comment in the docker-compose file to document:
- The purpose of this network integration (S3 connectivity)
- Any required MinIO bucket configurations or permissions
- The dependency on MinIO service being up before ClickHouse
Example comment:
datalake: + # Enables ClickHouse to access S3-compatible storage via MinIO + # Requires: MinIO service to be running with appropriate bucket permissions name: minio_public external: true
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
analytics-datastore-clickhouse/docker-compose.yml
(1 hunks)
🔇 Additional comments (1)
analytics-datastore-clickhouse/docker-compose.yml (1)
25-25
: LGTM! Verify network dependency order.
The addition of the datalake network to the ClickHouse service looks good and aligns with the PR objective of enabling S3 connectivity.
Let's verify the network dependency exists:
✅ Verification successful
Network dependency verified and properly configured
The network configuration is correct and well-structured:
- MinIO service is properly defined in
datalake/docker-compose.yml
and creates the network - The
minio_public
network is correctly marked as external in the ClickHouse compose file - The network is consistently configured across all services that need MinIO access
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the minio_public network exists and is used by MinIO service
# Expected: Find docker-compose files that define or use the minio_public network
# Search for minio_public network definition or usage
rg -l "minio_public"
# Search for MinIO service definition
rg -l "image:.*(minio)"
Length of output: 198
Script:
#!/bin/bash
# Let's examine the network configurations in the found files
echo "=== analytics-datastore-clickhouse/docker-compose.yml ==="
rg -A 3 "minio_public" analytics-datastore-clickhouse/docker-compose.yml
echo -e "\n=== datalake/docker-compose.yml ==="
rg -A 3 "minio_public" datalake/docker-compose.yml
echo -e "\n=== monitoring/docker-compose.yml ==="
rg -A 3 "minio_public" monitoring/docker-compose.yml
# Let's also check if MinIO service is defined in datalake compose
echo -e "\n=== MinIO service in datalake/docker-compose.yml ==="
rg "services:" -A 20 datalake/docker-compose.yml
Length of output: 1608
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (4)
test/cucumber/features/cluster-mode/clickhouse.cluster.feature (1)
Line range hint
1-33
: Consider documenting the network architecture and data flow.The implementation properly enables ClickHouse-datalake connectivity while maintaining network isolation. Consider:
- Documenting the network topology and data flow patterns
- Adding monitoring for S3 operations and network connectivity
- Implementing retry mechanisms for S3 operations in case of network issues
datalake/docker-compose.cluster.yml (1)
57-57
: Document port changes in architecture documentationThe port changes are consistently applied across all MinIO nodes. Consider:
- Updating architecture documentation to reflect the new port configuration
- Adding a comment in the docker-compose file explaining the reason for the port change
- Creating a migration guide for existing deployments
Also applies to: 62-62
test/cucumber/features/single-mode/recipe.feature (2)
33-33
: Enhance test coverage for datalake integrationWhile adding MinIO service check is good, consider enhancing the test coverage:
- Add specific tests for ClickHouse-MinIO connectivity
- Verify data retrieval from datalake via ClickHouse's S3 function
- Include error scenarios (e.g., MinIO unavailability)
Would you like me to help create additional test scenarios for the datalake integration?
Line range hint
1-85
: Add dedicated scenarios for datalake functionalityConsider adding the following scenarios to thoroughly test the datalake integration:
Scenario: Verify ClickHouse can access datalake When I store test data in MinIO Then ClickHouse should be able to query it via S3 function Scenario: Handle datalake connectivity issues Given MinIO service is stopped When I attempt to query datalake from ClickHouse Then appropriate error handling should occur
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
datalake/docker-compose.cluster.yml
(3 hunks)openfn/docker-compose.yml
(1 hunks)test/cucumber/features/cluster-mode/clickhouse.cluster.feature
(2 hunks)test/cucumber/features/cluster-mode/monitoring.feature
(0 hunks)test/cucumber/features/single-mode/clickhouse.feature
(2 hunks)test/cucumber/features/single-mode/monitoring.feature
(1 hunks)test/cucumber/features/single-mode/recipe.feature
(1 hunks)
💤 Files with no reviewable changes (1)
- test/cucumber/features/cluster-mode/monitoring.feature
✅ Files skipped from review due to trivial changes (1)
- openfn/docker-compose.yml
🔇 Additional comments (4)
test/cucumber/features/single-mode/clickhouse.feature (1)
9-9
: LGTM! Addition of minio_public network enables S3 connectivity.
The addition of the minio_public
network to ClickHouse's network configuration aligns with the PR objective of enabling datalake access via S3.
Let's verify the corresponding MinIO configuration:
Also applies to: 22-22
✅ Verification successful
Let me check the docker-compose.yml file to verify the MinIO configuration.
Let me check the configuration files for any S3-related settings since they might contain the actual MinIO connection configuration.
Network configuration for MinIO connectivity is properly set up
The docker-compose.yml correctly defines the datalake
network that maps to the external minio_public
network, which aligns with the network configuration in the Cucumber feature file. While we don't see explicit S3/MinIO connection settings in the configuration files, this is expected as such credentials are typically injected at runtime through environment variables or separate configuration files managed by the deployment system.
- ✓ Network configuration in docker-compose.yml matches the feature file
- ✓ External network
minio_public
is properly declared - ✓ Network is added to the ClickHouse service configuration
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if ClickHouse has the necessary S3 configurations to connect to MinIO
rg -l 's3|minio' analytics-datastore-clickhouse/
Length of output: 100
Script:
#!/bin/bash
# Check the docker-compose.yml for MinIO/S3 configuration
cat analytics-datastore-clickhouse/docker-compose.yml
Length of output: 1699
Script:
#!/bin/bash
# Check configuration files for S3/MinIO settings
rg -l "s3|minio" analytics-datastore-clickhouse/general_configs/
Length of output: 66
test/cucumber/features/single-mode/monitoring.feature (1)
18-18
: Verify the intentional reduction in volume count.
The reduction from 6 to 3 volumes appears to be related to the decoupling of MinIO from the monitoring package. This aligns with the broader changes of establishing a dedicated datalake configuration.
Let's confirm this is intentional:
✅ Verification successful
Volume count reduction is correct and intentional
The docker-compose.yml shows exactly 3 volumes defined for the monitoring package:
- prometheus-data
- grafana-data
- loki-data
This matches the updated test assertion "There should be 3 volumes" in the monitoring.feature file. The volume count is accurate and reflects the current architecture where MinIO storage has been moved to a dedicated datalake configuration, as evidenced by the presence of MinIO-related files in the separate datalake directory shown in the earlier search results.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any remaining MinIO volume definitions in monitoring
rg -l 'minio.*volume' monitoring/
# Check for any MinIO references in monitoring configs
rg -l 'minio' monitoring/config/
Length of output: 127
Script:
#!/bin/bash
# Let's search for MinIO and volume-related configurations in the entire codebase
rg -l "minio" .
# Search for volume-related test scenarios in cucumber features
fd "monitoring.feature" . --exec grep -l "volume" {}
# Get the content of the monitoring feature file to understand the context
fd "monitoring.feature" . --exec cat {}
Length of output: 5029
Script:
#!/bin/bash
# Let's check the docker-compose files to understand the volume configuration
cat monitoring/docker-compose.yml
# Check if there are any previous versions of the file in git history
git log -p --all -- monitoring/docker-compose.yml | head -n 100
Length of output: 10109
test/cucumber/features/cluster-mode/clickhouse.cluster.feature (1)
9-9
: LGTM! Consistent network configuration across cluster nodes.
The addition of minio_public
network is properly implemented across all ClickHouse cluster nodes, ensuring uniform S3 access capabilities throughout the cluster.
Let's verify the cluster-wide S3 configuration:
Also applies to: 12-12, 15-15, 18-18, 33-33
✅ Verification successful
Based on the review of the docker-compose files and network configurations, I can now provide a final response:
Network configuration properly implemented for S3/MinIO access
The minio_public
network has been correctly added to all ClickHouse nodes in the cluster setup, with consistent configuration across the test scenarios. The network is properly defined as external in the docker-compose files, ensuring proper connectivity between ClickHouse and MinIO services.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for consistent S3 configurations across cluster nodes
rg -l 'cluster.*s3|cluster.*minio' analytics-datastore-clickhouse/
# Check for any cluster-specific S3 settings
rg -l 'clickhouse-\d+.*s3|clickhouse-\d+.*minio' analytics-datastore-clickhouse/
Length of output: 151
Script:
#!/bin/bash
# Let's check the feature file content to understand the cluster setup and network configuration
cat test/cucumber/features/cluster-mode/clickhouse.cluster.feature
# Also check for any docker-compose files that might contain cluster configuration
fd docker-compose --exec cat {}
Length of output: 67600
Script:
#!/bin/bash
# Let's check for any docker-compose files in the analytics-datastore-clickhouse directory
fd docker-compose analytics-datastore-clickhouse/ --exec cat {}
# Also check for any network configurations in the package
rg -l "minio_public.*network" analytics-datastore-clickhouse/
Length of output: 12126
Script:
#!/bin/bash
# Let's check the cluster configuration files for any S3/MinIO related settings
fd -e xml cluster_configs/ analytics-datastore-clickhouse/ --exec cat {}
# Also check the config importer files for any S3/MinIO configurations
fd -e js config/ analytics-datastore-clickhouse/ --exec cat {}
Length of output: 143
datalake/docker-compose.cluster.yml (1)
35-35
: Configuration consistent with MinIO-02
The changes mirror those in MinIO-02, maintaining consistency across the cluster.
Also applies to: 40-40
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
monitoring/loki/loki-config.yml (1)
42-43
: Consider adding retry mechanism for S3 operationsGiven that this configuration is part of the datalake network integration, consider adding S3 retry configurations to handle temporary network issues or MinIO failovers. This is especially important for the monitoring stack to maintain data consistency.
Example addition:
aws: s3: http://${MO_SECURITY_ADMIN_USER}:${MO_SECURITY_ADMIN_PASSWORD}@minio-01.:9090/loki s3forcepathstyle: true + max_retries: 5 + retry_wait: 30s
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
monitoring/loki/loki-config.yml
(1 hunks)
🔇 Additional comments (1)
monitoring/loki/loki-config.yml (1)
42-42
: Verify MinIO port configuration across the platform
The MinIO endpoint port has been changed from 9000 to 9090. While this aligns with the changes in datalake/docker-compose.cluster.yml
, we should ensure this port change is consistently applied across all services that interact with MinIO.
✅ Verification successful
Let me check the actual port configurations in these files to ensure consistency.
Port configuration is consistent across MinIO services
The port change from 9000 to 9090 is consistently implemented across all relevant configurations:
- MinIO server is configured to listen on port 9090 in both docker-compose files
- Loki's S3 endpoint correctly uses port 9090 to connect to MinIO
- The port is properly exposed through traefik load balancer
- No port conflicts were found as 9090 is exclusively used for MinIO API
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining references to port 9000 for MinIO
# and verify the new port 9090 configuration is consistent
echo "Checking for MinIO port configurations..."
rg -l "minio.*:9000" --type yaml
echo "Verifying MinIO service definitions..."
rg -l "minio.*:9090" --type yaml
echo "Checking for potential port conflicts..."
rg -l "port.*9090" --type yaml
Length of output: 497
Script:
#!/bin/bash
# Check the actual MinIO port configurations in the identified files
echo "Content of monitoring/loki/loki-config.yml:"
rg "minio.*:90" monitoring/loki/loki-config.yml -A 2 -B 2
echo -e "\nContent of datalake/docker-compose.cluster.yml:"
rg "minio.*:90" datalake/docker-compose.cluster.yml -A 2 -B 2
echo -e "\nContent of datalake/docker-compose.yml:"
rg "port.*9090|minio.*:90" datalake/docker-compose.yml -A 2 -B 2
Length of output: 2056
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This allows clickhouse to connect and fetch assets from datalake using the s3 function
Summary by CodeRabbit
New Features
datalake
network.minio_public
network for the Analytics Datastore Clickhouse package.minio-01
service in the initialization process.Bug Fixes
minio-01
service.Chores