StarRocks · alvin-celerdata · Aug 7, 2024 · Aug 6, 2024 · Aug 6, 2024 · Aug 6, 2024
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
@@ -41,3 +41,13 @@ If yes, please specify the type of change:
   - [ ] 3.1
   - [ ] 3.0
   - [ ] 2.5
+
+## Documentation PRs only:
+
+If your PR adds or changes English documentation and does not include the matching
+Chinese documentation, check the **Yes, translate** box below to have GPT translate the
+English docs to Chinese. Also uncheck the **Do not translate** box; both are required.
+After this PR is merged, the workflow opens a new PR containing the Chinese translation.
+
+- [ ] Yes, translate English markdown files with GPT
+- [x] Do not translate
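
The two checkboxes above feed the `if:` condition on the `changed_files` job in `translate.yml`. A minimal Python sketch of that gate follows; the function name `should_translate` and the sample bodies are hypothetical, and the lowercasing is there on the assumption that GitHub's `contains()` ignores case.

```python
YES_CHECKED = "[x] Yes, translate English markdown files with GPT"
OPT_OUT_CLEARED = "[ ] Do not translate"


def should_translate(merged: bool, body: str) -> bool:
    """Mirror the three checks in the workflow's `if:` expression."""
    text = body.lower()  # assumption: contains() compares case-insensitively
    return (
        merged
        and YES_CHECKED.lower() in text
        and OPT_OUT_CLEARED.lower() in text
    )


# The template default: translation is not requested.
template_default = (
    "- [ ] Yes, translate English markdown files with GPT\n"
    "- [x] Do not translate\n"
)

# What an author must submit to trigger the translation job.
requested = (
    "- [x] Yes, translate English markdown files with GPT\n"
    "- [ ] Do not translate\n"
)

print(should_translate(True, template_default))
print(should_translate(True, requested))
```

With the template left at its defaults the gate stays closed, so authors who ignore the section never trigger a translation.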
diff --git a/.github/workflows/translate.yml b/.github/workflows/translate.yml
@@ -0,0 +1,96 @@
+name: Translate changes to Chinese
+
+on:
+  pull_request:
+    branches:
+      - main
+    types:
+      - closed
+    paths:
+      - 'docs/en/**'
+defaults:
+  run:
+    shell: bash  # explicit bash runs each step with -eo pipefail
+
+jobs:
+  # -------------------------------------------------------------
+  # Event `pull_request`: Returns all changed pull request files.
+  # --------------------------------------------------------------
+  changed_files:
+    # NOTE:
+    # This workflow will only translate docs if:
+    # - the PR is merged
+    # - the `Yes, translate...` box is checked
+    # - the `Do not translate` box is unchecked
+
+    if: github.event.pull_request.merged == true && contains(toJson(github.event.pull_request.body), '[x] Yes, translate English markdown files with GPT') && contains(toJson(github.event.pull_request.body), '[ ] Do not translate')
+    runs-on: ubuntu-latest  # windows-latest || macos-latest
+    name: Translate changed docs
+    permissions:
+      pull-requests: write
+
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 2
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install gpt_translate
+
+      - name: Get changed files
+        id: changed-files
+        uses: tj-actions/changed-files@v44
+        with:
+          files: docs/en/**/*.{md,mdx}
+          output_dir: '.github/outputs' # this is the default dir
+          write_output_files: 'true'
+
+      - name: List all changed files
+        env:
+          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
+        run: |
+          for file in $ALL_CHANGED_FILES; do
+            echo "$file"
+          done
+          echo "also cat the generated file all_changed_files.txt"
+          cat ./.github/outputs/all_changed_files.txt
+
+      - name: Translate files
+        env:
+          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
+        run: |
+          cp -r docs/translation/configs . # ugh, only works if configs is in cwd
+          gpt_translate.files \
+            --input_file ./.github/outputs/all_changed_files.txt \
+            --config_folder ./configs
+          rm -rf configs
+
+      - name: Fix sidebar display
+        env:
+          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
+        run: |
+          sed "s#docs/en#docs/zh#g" ./.github/outputs/all_changed_files.txt > ./.github/outputs/new_files.txt
+          cat ./.github/outputs/new_files.txt
+          while IFS="" read -r zh_file || [ -n "$zh_file" ]
+          do
+            sed -i'' '/displayed_sidebar:/s/English/Chinese/' "$zh_file"
+          done < ./.github/outputs/new_files.txt
+
+      - name: Create Pull Request
+        uses: peter-evans/create-pull-request@v6
+        with:
+          token: ${{ secrets.TRANSLATE_PAT }}
+          commit-message: Translated Docs
+          title: Automatic translation
+          body: |
+            This PR was automatically created by the translate-action after merging #${{ github.event.pull_request.number }}.
+            Please review the changes and merge if they are correct.
+          branch: translate-pr-${{ github.event.pull_request.number }}
+          base: main
+          delete-branch: true
+          labels: translation
+          add-paths: |
+            docs/zh
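
The "Fix sidebar display" step above does two things with `sed`: it maps each changed `docs/en` path to its `docs/zh` counterpart, then rewrites the `displayed_sidebar:` front-matter line in the translated file. A rough Python equivalent, useful for checking the behavior locally, might look like this; the helper names and the sample path and front matter are hypothetical.

```python
def zh_path(en_path: str) -> str:
    """Same mapping as `sed "s#docs/en#docs/zh#g"`."""
    return en_path.replace("docs/en", "docs/zh")


def fix_sidebar(markdown: str) -> str:
    """Same edit as `sed '/displayed_sidebar:/s/English/Chinese/'`.

    Only lines containing `displayed_sidebar:` are touched, and only the
    first `English` on such a line is replaced.
    """
    fixed = []
    for line in markdown.splitlines(keepends=True):
        if "displayed_sidebar:" in line:
            line = line.replace("English", "Chinese", 1)
        fixed.append(line)
    return "".join(fixed)


# Hypothetical file path and front matter, for illustration only.
page = '---\ndisplayed_sidebar: "English"\n---\nEnglish text stays as-is.\n'
print(zh_path("docs/en/loading/example.md"))
print(fix_sidebar(page))
```

Body text that happens to contain the word "English" is left alone, which matches the address-restricted `sed` expression in the workflow.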
diff --git a/docs/translation/configs/config.yaml b/docs/translation/configs/config.yaml
@@ -0,0 +1,25 @@
+# Logs:
+debug: false  # Debug mode
+weave_project: "gpt-translate"  # Weave project
+silence_openai: true  # Silence OpenAI logger
+
+# Translation:
+language: "zh"  # Language to translate to
+replace: true  # Replace existing file
+remove_comments: true  # Remove comments
+do_evaluation: true  # Do evaluation
+do_translate_header_description: true  # Translate the header description
+max_openai_concurrent_calls: 7  # Max number of concurrent calls to OpenAI
+
+# Files:
+input_file: "docs/intro.md"  # File to translate
+out_file: "intro_zh.md"  # File to save the translated file to
+input_folder: ./docs/en  # Folder to translate
+out_folder: ./docs/zh  # Folder to save the translated files to
+
+limit: null  # Limit number of files to translate (useful for testing)
+
+# Model:
+model: "gpt-4o"
+temperature: 0.2
+max_tokens: 4096
diff --git a/docs/translation/configs/evaluation_prompt.txt b/docs/translation/configs/evaluation_prompt.txt
@@ -0,0 +1,9 @@
+How good is the translation regarding the instructions you were given? Provide a detailed analysis.
+Return a json object with the following keys:
+- analysis: a detailed analysis of the translation
+- completeness: a boolean indicating whether the translation is complete, with nothing missing at the end.
+- translation_rating: a rating from 1 to 10 indicating the quality of the translation
+- product_words: a boolean indicating whether the translation respects the given dictionary; check each dictionary entry carefully against the corresponding translation. Make sure StarRocks product terms are not translated.
+- code_comments: A boolean indicating if the code comments are translated correctly
+- links: A boolean indicating if the links are translated correctly
+
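
The evaluation prompt above asks the model for a JSON object with six keys. How `gpt_translate` consumes that reply is not shown in this PR, so the following is only a sketch of how a reviewer could sanity-check a reply by hand; `check_evaluation` and the sample reply are hypothetical.

```python
import json

# Keys and types requested by evaluation_prompt.txt.
EXPECTED_TYPES = {
    "analysis": str,
    "completeness": bool,
    "translation_rating": int,
    "product_words": bool,
    "code_comments": bool,
    "links": bool,
}


def check_evaluation(reply: str) -> list[str]:
    """Return a list of problems found in a JSON evaluation reply."""
    data = json.loads(reply)
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif type(data[key]) is not expected:
            problems.append(f"{key}: expected {expected.__name__}")
    rating = data.get("translation_rating")
    if type(rating) is int and not 1 <= rating <= 10:
        problems.append("translation_rating: outside 1-10")
    return problems


# Hypothetical reply, for illustration only.
sample_reply = json.dumps({
    "analysis": "Terminology follows the dictionary.",
    "completeness": True,
    "translation_rating": 8,
    "product_words": True,
    "code_comments": True,
    "links": True,
})
print(check_evaluation(sample_reply))
```

The exact-type comparison is deliberate: in Python `True` is also an `int`, so `isinstance` would accept a boolean where a rating is expected.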
diff --git a/docs/translation/configs/human_prompt.txt b/docs/translation/configs/human_prompt.txt
@@ -0,0 +1,51 @@
+Regarding StarRocks specifics, we have a list of product names and technical phrases that are always associated with the product and must *never* be translated. Keep them in English.
+- StarRocks
+- starrocks
+- external catalog
+- catalog
+- Default Catalog
+- Hive
+- Leader FE
+- Follower FE
+- Observer FE
+- tablet
+- property enforcement
+- Operator
+- Data Cache
+- Query Cache
+- Delete Vector
+- Compaction
+- Stream Load
+- Broker Load
+- Routine Load
+- Spark Load
+- schema change
+- Colocate Join
+- Lateral Join
+- Shuffle Join
+- Broadcast Join
+- Colocation Group
+- Sorted streaming aggregate
+- Flat JSON
+- Query Profile
+- Information Schema
+- Docker
+- Kubernetes
+- Kubernetes secret
+
+These words appear often on lists like this:
+
+1. [**word**](link/target.md): Something about StarRocks
+2. [**word2**](link/target2.md): Something about StarRocks again
+etc...
+
+Never translate them in this context.
+
+Here is a chunk of documentation in Docusaurus Markdown format to translate.
+
+```markdown
+{md_chunk}
+```
+
+Return the translation only in markdown format, without adding anything else. Do not add the ```markdown``` tags or any backticks.
+
diff --git a/docs/translation/configs/language_dicts/zh.yaml b/docs/translation/configs/language_dicts/zh.yaml
@@ -0,0 +1,139 @@
+FEs: FE
+BEs: BE
+Data loading: 数据导入
+Data unloading: 数据导出
+load: 导入
+native table: 内表 
+Cloud-native table: 存算分离表
+External Table: 外部表
+File External Table: 文件外部表
+Hive external table: Hive 外表
+hierarchy of data objects: 数据库模式层次结构
+storage layering: 存储分层
+separation of storage and compute: 存算分离 
+shared-data mode: 存算分离模式
+shared-nothing mode: 存算一体模式
+shared-data cluster: 存算分离集群
+shared-nothing cluster: 存算一体集群
+zero-migration: 0数据迁移
+native vectorized engine: 原生向量化引擎  
+query federation/federated query: 联邦查询 
+columnar storage: 列式存储 
+row storage: 行存储
+intelligent materialized view: 智能物化视图 
+base table: 基表 
+materialized view: 物化视图
+synchronous materialized view: 同步物化视图
+asynchronous materialized view: 异步物化视图
+unified batch and streaming/batch-stream integrated: 流批一体
+high availability: 高可用 
+high scalability: 高可扩展性 
+data ingestion: 数据摄取 
+denormalized table/flat table: 大宽表  
+pre-aggregation: 预聚合 
+aggregate query: 聚合查询
+star schema: 星形模型 
+snowflake schema: 雪花模型 
+point query: 点查询 
+table type: 表模型
+Duplicate Key table: 明细表
+Aggregate table: 聚合表
+Primary Key table: 主键表
+Unique Key table: 更新表
+data cleaning: 数据清洗 
+global dictionary: 全局字典 
+global dictionary for low-cardinality optimization: 全局低基数字典优化 
+low cardinality: 低基数 
+warehousing logistics: 仓储物流 
+query performance: 查询性能 
+data acquisition: 数据采集 
+multi-table join query: 多表关联查询 
+cost-based optimizer: CBO优化器 
+separation of storage and compute: 存算分离 
+cost estimation: 成本估算 
+binary tree: 二叉树 
+data analytics: 数据分析 
+data lake analytics: 数据湖分析
+detailed data: 明细数据 
+distinct count: 去重 
+exact distinct count: 精准去重 
+approximate distinct count: 近似去重 
+table schema: 表结构
+sort key: 排序键 
+bucketing key: 分桶键
+partitioning column: 分区列 
+partition key: 分区键
+random bucketing: 随机分桶
+partitioning method: 分区方式 
+automatic partitioning: 自动分区
+dynamic partitioning: 动态分区
+expression partitioning: 表达式分区
+list partitioning: LIST 分区
+replica: 副本 
+user profiling: 用户画像 
+user retention: 用户留存 
+precision marketing: 精准营销 
+group analysis: 群体分析 
+late materialization: 延迟物化 
+data locality: 数据局部性  
+prefix index: 前缀索引 
+suffix column: 后置列 
+prefix column: 前置列 
+tiered storage: 分级存储 
+inverted indexing: 倒排索引 
+approximation algorithm: 近似算法 
+nested query: 嵌套查询 
+memory leak: 内存泄漏 
+memory bloat: 内存膨胀  
+secondary partitioning: 二次分区 
+pruning: 分区裁剪 
+partition file: 分区文件 
+abstract syntax tree: 抽象语法树 
+constant folding: 常量折叠 
+constant propagation: 常量传播 
+pessimistic locking: 悲观加锁 
+optimistic locking: 乐观加锁 
+rolling update: 滚动升级 
+stratified search: 分层搜索 
+unified search: 统一搜索 
+populate a table: 给表添加数据 
+working directory: 工作目录 
+compute node: 计算节点 
+compute engine: 计算引擎
+equi-height histogram: 等深直方图 
+equi-width histogram: 等宽直方图 
+paginated query: 分页查询 
+correlated columns: 相关列
+literal: 字面量
+generated column: 生成列
+auto increment: 自增列
+cloud-native: 云原生
+persistent index: 持久化索引
+utility function: 工具函数
+aggregate function: 聚合函数
+spill to disk: 中间结果落盘
+query rewrite: 查询改写
+data re-distribution: 数据均衡
+local disk: 本地磁盘
+automatic cooldown: 自动降冷
+storage medium: 存储介质
+garbage collection: 垃圾回收
+session variable: 会话变量
+global variable: 全局变量
+big query: 大查询
+classifier: 分类器
+storage volume: 存储卷
+remote storage: 远端存储
+cardinality-preserving join: 基数保持 JOIN
+parallelism: 并行度
+partial update: 部分列更新
+conditional update: 条件更新
+higher-order function: 高阶函数
+async refresh: 异步刷新
+query acceleration: 查询加速
+segment file: Segment 文件
+Hybrid row-column storage: 行列混存
+List Partitioning: List 分区
+Range Partitioning: Range 分区
+Bitmap index: Bitmap 索引
+Bloom filter index: Bloom filter 索引
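
The glossary above is a flat list of `term: translation` pairs, and `separation of storage and compute` appears in it twice. Many YAML loaders silently keep only the last value for a repeated key, so a small check before merging can help. This is a minimal sketch that assumes one `term: translation` pair per line; `duplicate_terms` is a hypothetical helper, and the sample reuses entries from the file.

```python
from collections import Counter


def duplicate_terms(yaml_text: str) -> list[str]:
    """Return source terms that occur more than once as a key."""
    keys = []
    for line in yaml_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not `key: value`.
        if not line or line.startswith("#") or ": " not in line:
            continue
        keys.append(line.split(": ", 1)[0])
    counts = Counter(keys)
    return [term for term, n in counts.items() if n > 1]


sample = (
    "base table: 基表\n"
    "separation of storage and compute: 存算分离\n"
    "replica: 副本\n"
    "separation of storage and compute: 存算分离\n"
)
print(duplicate_terms(sample))
```

The check is case-sensitive, so `list partitioning` and `List Partitioning` count as distinct entries, as they do in the file.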