[Tool] Auto translate #49457

Merged
merged 5 commits into from
Aug 7, 2024
10 changes: 10 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -41,3 +41,13 @@ If yes, please specify the type of change:
- [ ] 3.1
- [ ] 3.0
- [ ] 2.5

## Documentation PRs only:

If you are submitting a PR that adds or changes English documentation and have not
included Chinese documentation, you can check the box below to request that GPT translate the
English docs to Chinese. If you want the translation, also uncheck the **Do not translate** box.
The workflow generates a new PR with the Chinese translation after this PR is merged.

- [ ] Yes, translate English markdown files with GPT
- [x] Do not translate
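
The two checkboxes act together: the translation workflow in this PR runs only when the PR is merged, the "Yes" box is checked, and the "Do not translate" box is unchecked. A minimal Python sketch of that check, assuming the PR body is available as a plain string (the function name `should_translate` is hypothetical; the real check is the `if:` expression in `translate.yml`):

```python
# Substrings the workflow looks for in the PR body.
YES_BOX = "[x] Yes, translate English markdown files with GPT"
NO_BOX_UNCHECKED = "[ ] Do not translate"


def should_translate(merged: bool, body: str) -> bool:
    """Mirror the workflow gate: merged, 'Yes' checked, 'Do not translate' unchecked."""
    return merged and YES_BOX in body and NO_BOX_UNCHECKED in body


# The template's default state: translation is off.
default_body = "- [ ] Yes, translate English markdown files with GPT\n- [x] Do not translate\n"
# The author opted in by flipping both boxes.
opted_in_body = "- [x] Yes, translate English markdown files with GPT\n- [ ] Do not translate\n"

print(should_translate(True, default_body))   # False
print(should_translate(True, opted_in_body))  # True
```

Checking "Yes" while leaving "Do not translate" checked fails the second condition, so no translation PR is created in that case.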
96 changes: 96 additions & 0 deletions .github/workflows/translate.yml
@@ -0,0 +1,96 @@
name: Translate changes to Chinese

on:
  pull_request:
    branches:
      - main
    types:
      - closed
    paths:
      - 'docs/en/**'
defaults:
  run:
    shell: bash  # default shell is sh

jobs:
  # -------------------------------------------------------------
  # Event `pull_request`: Returns all changed pull request files.
  # --------------------------------------------------------------
  changed_files:
    # NOTE:
    # This workflow will only translate docs if:
    # - the PR is merged
    # - the `Yes, translate...` box is checked
    # - the `Do not translate` box is unchecked

    if: github.event.pull_request.merged == true && contains(toJson(github.event.pull_request.body), '[x] Yes, translate English markdown files with GPT') && contains(toJson(github.event.pull_request.body), '[ ] Do not translate')
    runs-on: ubuntu-latest  # windows-latest || macos-latest
    name: Test changed-files
    permissions:
      pull-requests: write

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install gpt_translate

      - name: Get changed files
        id: changed-files
        uses: tj-actions/changed-files@v44
        with:
          files: docs/en/**/*.{md,mdx}
          output_dir: '.github/outputs'  # this is the default dir
          write_output_files: 'true'

      - name: List all changed files
        env:
          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
        run: |
          for file in $ALL_CHANGED_FILES; do
            echo "$file"
          done
          echo "also cat the generated file all_changed_files.txt"
          cat ./.github/outputs/all_changed_files.txt

      - name: Translate files
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
        run: |
          cp -r docs/translation/configs .  # ugh, only works if configs is in cwd
          gpt_translate.files \
            --input_file ./.github/outputs/all_changed_files.txt \
            --config_folder ./configs
          rm -rf configs

      - name: Fix sidebar display
        env:
          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
        run: |
          sed "s#docs/en#docs/zh#g" ./.github/outputs/all_changed_files.txt > ./.github/outputs/new_files.txt
          cat ./.github/outputs/new_files.txt
          while IFS="" read -r english || [ -n "$english" ]
          do
            sed -i'' '/displayed_sidebar:/s/English/Chinese/' "$english"
          done < ./.github/outputs/new_files.txt

      - name: Create Pull Request
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.TRANSLATE_PAT }}
          commit-message: Translated Docs
          title: Automatic translation
          body: |
            This PR was automatically created by the translate-action when merging [PR](${{ github.event.pull_request.number }})
            Please review the changes and merge if they are correct.
          branch: translate-pr-${{ github.event.pull_request.number }}
          base: main
          delete-branch: true
          labels: translation
          add-paths: |
            docs/zh
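
The `Fix sidebar display` step does two things: it maps each changed `docs/en` path to its `docs/zh` counterpart, then rewrites `English` to `Chinese` on the `displayed_sidebar:` line of each translated file. A rough Python equivalent, for illustration only (the helper names, the sample path, and the sample front matter are hypothetical; like the `sed` command, it replaces only the first `English` on a matching line):

```python
def to_zh_path(english_path: str) -> str:
    """Same mapping as: sed "s#docs/en#docs/zh#g"."""
    return english_path.replace("docs/en", "docs/zh")


def fix_sidebar(text: str) -> str:
    """Same edit as: sed '/displayed_sidebar:/s/English/Chinese/'."""
    fixed = []
    for line in text.splitlines(keepends=True):
        # Only lines that mention displayed_sidebar: are touched.
        if "displayed_sidebar:" in line:
            line = line.replace("English", "Chinese", 1)
        fixed.append(line)
    return "".join(fixed)


front_matter = '---\ndisplayed_sidebar: "English"\n---\n# Title in English\n'
print(to_zh_path("docs/en/some_page.md"))  # docs/zh/some_page.md
print(fix_sidebar(front_matter))
```

The heading line keeps its `English` because the address `/displayed_sidebar:/` restricts the substitution to the front-matter line.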
25 changes: 25 additions & 0 deletions docs/translation/configs/config.yaml
@@ -0,0 +1,25 @@
# Logs:
debug: false # Debug mode
weave_project: "gpt-translate" # Weave project
silence_openai: true # Silence OpenAI logger

# Translation:
language: "zh" # Language to translate to
replace: true # Replace existing file
remove_comments: true # Remove comments
do_evaluation: true # Do evaluation
do_translate_header_description: true # Translate the header description
max_openai_concurrent_calls: 7 # Max number of concurrent calls to OpenAI

# Files:
input_file: "docs/intro.md" # File to translate
out_file: " intro_ja.md" # File to save the translated file to
input_folder: ./docs/en # Folder to translate
out_folder: ./docs/zh # Folder to save the translated files to

limit: null # Limit number of files to translate (useful for testing)

# Model:
model: "gpt-4o"
temperature: 0.2
max_tokens: 4096
9 changes: 9 additions & 0 deletions docs/translation/configs/evaluation_prompt.txt
@@ -0,0 +1,9 @@
How good is the translation regarding the instructions you were given? Provide a detailed analysis.
Return a json object with the following keys:
- analysis: a detailed analysis of the translation
- completeness: a boolean indicating if the translation is complete, not missing a piece at the end.
- translation_rating: a rating from 1 to 10 indicating the quality of the translation
- product_words: A boolean indicating if the translation respects the given dictionary; carefully check the dictionary against the corresponding translation. Make sure StarRocks product terms are not translated.
- code_comments: A boolean indicating if the code comments are translated correctly
- links: A boolean indicating if the links are translated correctly
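
The evaluation prompt asks for a fixed set of keys, so the reply can be checked mechanically before it is trusted. A minimal sketch, assuming the model's reply arrives as a raw JSON string (the function `check_evaluation` is hypothetical and not part of `gpt_translate`):

```python
import json

# Expected keys and types, taken from the prompt above.
EXPECTED = {
    "analysis": str,
    "completeness": bool,
    "translation_rating": int,
    "product_words": bool,
    "code_comments": bool,
    "links": bool,
}


def check_evaluation(raw: str) -> list[str]:
    """Return a list of problems found in the evaluation reply (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(data, dict):
        return ["reply is not a JSON object"]
    problems = []
    for key, expected_type in EXPECTED.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif type(data[key]) is not expected_type:
            # Exact type match, so a boolean is not accepted as a rating.
            problems.append(f"wrong type for {key}")
    rating = data.get("translation_rating")
    if type(rating) is int and not 1 <= rating <= 10:
        problems.append("translation_rating out of range 1-10")
    return problems


reply = (
    '{"analysis": "ok", "completeness": true, "translation_rating": 8, '
    '"product_words": true, "code_comments": true, "links": false}'
)
print(check_evaluation(reply))  # []
```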

51 changes: 51 additions & 0 deletions docs/translation/configs/human_prompt.txt
@@ -0,0 +1,51 @@
Regarding StarRocks specifics, we have a list of product names and technical phrases that are always associated with the product and must *never* be translated. Keep them in English.
- StarRocks
- starrocks
- external catalog
- catalog
- Default Catalog
- Hive
- Leader FE
- Follower FE
- Observer FE
- tablet
- property enforcement
- Operator
- Data Cache
- Query Cache
- Delete Vector
- Compaction
- Stream Load
- Broker Load
- Routine Load
- Spark Load
- schema change
- Colocate Join
- Lateral Join
- Shuffle Join
- Broadcast Join
- Colocation Group
- Sorted streaming aggregate
- Flat JSON
- Query Profile
- Information Schema
- Docker
- Kubernetes
- Kubernetes secret

These words often appear in lists like this:

1. [**word**](link/target.md): Something about StarRocks
2. [**word2**](link/target2.md): Something about StarRocks again
etc...

Never translate them in this context.

Here is a chunk of documentation in docusaurus Markdown format to translate.

```markdown
{md_chunk}
```

Return the translation only in markdown format, without adding anything else. Do not add the ```markdown``` tags or any backticks.
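
The `{md_chunk}` placeholder is where each piece of the source document is substituted before the prompt is sent. How `gpt_translate` performs that substitution is not shown in this PR; the sketch below simply assumes a literal string replacement, with a shortened stand-in template and a hypothetical `build_prompt` helper:

```python
FENCE = "`" * 3  # built at runtime so this example nests cleanly

# Shortened stand-in for human_prompt.txt; only the placeholder matters here.
TEMPLATE = (
    "Here is a chunk of documentation in docusaurus Markdown format to translate.\n\n"
    f"{FENCE}markdown\n"
    "{md_chunk}\n"
    f"{FENCE}\n"
)


def build_prompt(template: str, md_chunk: str) -> str:
    """Substitute the chunk with str.replace, so other braces in the template stay literal."""
    return template.replace("{md_chunk}", md_chunk)


chunk = '# Title\n\nJSON such as {"key": 1} stays intact.'
print(build_prompt(TEMPLATE, chunk))
```

`str.replace` is used instead of `str.format` so that a prompt file containing other curly braces would not raise a formatting error.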

139 changes: 139 additions & 0 deletions docs/translation/configs/language_dicts/zh.yaml
@@ -0,0 +1,139 @@
FEs: FE
BEs: BE
Data loading: 数据导入
Data unloading: 数据导出
load: 导入
native table: 内表
Cloud-native table: 存算分离表
External Table: 外部表
File External Table: 文件外部表
Hive external table: Hive 外表
hierarchy of data objects: 数据库模式层次结构
storage layering: 存储分层
separation of storage and compute: 存算分离
shared-data mode: 存算分离模式
shared-nothing mode: 存算一体模式
shared-data cluster: 存算分离集群
shared-nothing cluster: 存算一体集群
zero-migration: 0数据迁移
native vectorized engine: 原生向量化引擎
query federation/federated query: 联邦查询
columnar storage: 列式存储
row storage: 行存储
intelligent materialized view: 智能物化视图
base table: 基表
materialized view: 物化视图
synchronous materialized view: 同步物化视图
asynchronous materialized view: 异步物化视图
unified batch and streaming,batch-stream integrated: 流批一体
high availability: 高可用
high scalability: 高可扩展性
data ingestion: 数据摄取
denormalized table/flat table: 大宽表
pre-aggregation: 预聚合
aggregate query: 聚合查询
star schema: 星形模型
snowflake schema: 雪花模型
point query: 点查询
table type: 表模型
Duplicate Key table: 明细表
Aggregate table: 聚合表
Primary Key table: 主键表
Unique Key table: 更新表
data cleaning: 数据清洗
global dictionary: 全局字典
global dictionary for low-cardinality optimization: 全局低基数字典优化
low cardinality: 低基数
warehousing logistics: 仓储物流
query performance: 查询性能
data acquisition: 数据采集
multi-table join query: 多表关联查询
cost-based optimizer: CBO优化器
cost estimation: 成本估算
binary tree: 二叉树
data analytics: 数据分析
data lake analytics: 数据湖分析
detailed data: 明细数据
distinct count: 去重
exact distinct count: 精准去重
approximate distinct count: 近似去重
table schema: 表结构
sort key: 排序键
bucketing key: 分桶键
partitioning column: 分区列
partition key: 分区键
random bucketing: 随机分桶
partitioning method: 分区方式
automatic partitioning: 自动分区
dynamic partitioning: 动态分区
expression partitioning: 表达式分区
list partitioning: LIST 分区
replica: 副本
user profiling: 用户画像
user retention: 用户留存
precision marketing: 精准营销
group analysis: 群体分析
late materialization: 延迟物化
data locality: 数据局部性
prefix index: 前缀索引
suffix column: 后置列
prefix column: 前置列
tiered storage: 分级存储
inverted indexing: 倒排索引
approximation algorithm: 近似算法
nested query: 嵌套查询
memory leak: 内存泄漏
memory bloat: 内存膨胀
secondary partitioning: 二次分区
pruning: 分区裁剪
partition file: 分区文件
abstract syntax tree: 抽象语法树
constant folding: 常量折叠
constant propagation: 常量传播
pessimistic locking: 悲观加锁
optimistic locking: 乐观加锁
rolling update: 滚动升级
stratified search: 分层搜索
unified search: 统一搜索
populate a table: 给表添加数据
working directory: 工作目录
compute node: 计算节点
compute engine: 计算引擎
equi-height histogram: 等深直方图
equi-width histogram: 等宽直方图
paginated query: 分页查询
correlated columns: 相关列
literal: 字面量
generated column: 生成列
auto increment: 自增列
cloud-native: 云原生
persistent index: 持久化索引
utility function: 工具函数
aggregate function: 聚合函数
spill to disk: 中间结果落盘
query rewrite: 查询改写
data re-distribution: 数据均衡
local disk: 本地磁盘
automatic cooldown: 自动降冷
storage medium: 存储介质
garbage collection: 垃圾回收
session variable: 会话变量
global variable: 全局变量
big query: 大查询
classifier: 分类器
storage volume: 存储卷
remote storage: 远端存储
cardinality-preserving join: 基数保持 JOIN
parallelism: 并行度
partial update: 部分列更新
conditional update: 条件更新
higher-order function: 高阶函数
async refresh: 异步刷新
query acceleration: 查询加速
segment file: Segment 文件
Hybrid row-column storage: 行列混存
List Partitioning: List 分区
Range Partitioning: Range 分区
Bitmap index: Bitmap 索引
Bloom filter index: Bloom filter 索引
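
A glossary this long is easy to get wrong by hand: depending on the YAML parser, a repeated key is either rejected or silently overrides an earlier entry. A small check, sketched here under the assumption that every line is a flat `term: translation` pair as in this file (the function `find_duplicate_terms` is hypothetical; comparison is case-insensitive so that entries such as `list partitioning` and `List Partitioning` are reported together):

```python
from collections import defaultdict


def find_duplicate_terms(text: str) -> dict[str, list[int]]:
    """Map each repeated term (lowercased) to the 1-based line numbers where it appears."""
    seen = defaultdict(list)
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        term, sep, _ = line.partition(":")
        if sep:
            seen[term.strip().lower()].append(number)
    return {term: lines for term, lines in seen.items() if len(lines) > 1}


sample = (
    "list partitioning: LIST 分区\n"
    "replica: 副本\n"
    "List Partitioning: List 分区\n"
)
print(find_duplicate_terms(sample))  # {'list partitioning': [1, 3]}
```

Whether case variants should share one translation is a judgment call for the glossary's maintainers; the check only surfaces them.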