Update docs after datagen split (#5206)
Closes #4721
robertbastian authored Jul 9, 2024
1 parent 60eca99 commit 05264a4
Showing 17 changed files with 92 additions and 141 deletions.
3 changes: 2 additions & 1 deletion CODEOWNERS
@@ -29,7 +29,8 @@ ffi/ecma402/ @filmil
ffi/harfbuzz/ @hsivonen
provider/blob/ @sffc @Manishearth
provider/core/ @sffc @Manishearth
provider/datagen/ @sffc @robertbastian @Manishearth
provider/source/ @sffc @robertbastian @Manishearth
provider/export/ @sffc @robertbastian
provider/fs/ @sffc
provider/macros/ @Manishearth @sffc
tools/benchmark/binsize/ @gnrunge
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -105,7 +105,7 @@ Our wider testsuite is organized as `ci-job-foo` make tasks corresponding to eac
- `ci-job-test-tutorials`: Builds all our tutorials against both local code (`locale`), and released ICU4X (`cratesio`).
<br/>

- `ci-job-testdata`: Runs an `icu_datagen` integration test with a subset of CLDR, ICU, and LSTM source data.
- `ci-job-testdata`: Runs an `icu_provider_source` integration test with a subset of CLDR, ICU, and LSTM source data.
- `ci-job-full-datagen`: Generates compiled data for all crates.
<br/>

2 changes: 1 addition & 1 deletion components/experimental/tests/transliterate/data/gen.sh
@@ -4,7 +4,7 @@ cargo run -p icu4x-datagen --features experimental_components -- \
--locales full \
--deduplication none \
--no-internal-fallback \
--cldr-root $(dirname $0)/../../../../../provider/datagen/tests/data/cldr \
--cldr-root $(dirname $0)/../../../../../provider/source/tests/data/cldr \
--format mod \
--out $(dirname $0)/baked \
--pretty \
2 changes: 1 addition & 1 deletion components/plurals/src/rules/mod.rs
@@ -155,7 +155,7 @@
#[doc(hidden)]
pub mod reference;
// Need to expose it for `icu_datagen` use, but we don't
// Need to expose it for datagen, but we don't
// have a reason to make it fully public, so hiding docs for now.
#[cfg(feature = "experimental")]
mod raw_operands;
2 changes: 1 addition & 1 deletion components/segmenter/src/grapheme.rs
@@ -105,7 +105,7 @@ pub type GraphemeClusterBreakIteratorUtf16<'l, 's> =
/// Thus, if the data supplied by the provider comprises all
/// [grapheme cluster boundary rules][Rules] from Unicode Standard Annex #29,
/// _Unicode Text Segmentation_, which is the case of default data
/// (both test data and data produced by `icu_datagen`), the `segment_*`
/// (both test data and data produced by `icu_provider_source`), the `segment_*`
/// functions return extended grapheme cluster boundaries, as opposed to
/// legacy grapheme cluster boundaries. See [_Section 3, Grapheme Cluster
/// Boundaries_][GC], and [_Table 1a, Sample Grapheme Clusters_][Sample_GC],
2 changes: 1 addition & 1 deletion documents/process/graduation.md
@@ -30,7 +30,7 @@ This document contains a checklist for the requirements to migrate a component f
- [ ] The APIs should follow ICU4X style
- [ ] All options bags should be `Copy` (and contain references if they need to). Exceptions can be made by discussion.
- [ ] The data structs should fully follow ZeroVec style
- [ ] Deserialization should not have a "zero-copy violation" in the [make-testdata](https://github.com/unicode-org/icu4x/blob/main/provider/datagen/tests/make-testdata.rs) test
- [ ] Deserialization should not have a "zero-copy violation" in the [make-testdata](https://github.com/unicode-org/icu4x/blob/main/provider/source/src/tests/make_testdata.rs) test
- [ ] Constructors should avoid allocating memory in the common case
- [ ] Opaque blobs of data should be avoided if possible (instead use VarZeroVec, ZeroMap, etc.)
- [ ] Data structs should not be panicky to load/deserialize and conform to [data_safety.md](https://github.com/unicode-org/icu4x/blob/main/documents/design/data_safety.md)
100 changes: 33 additions & 67 deletions documents/process/writing_a_new_data_struct.md
@@ -16,7 +16,8 @@ The following steps take place at build time:
2. The source data is parsed and transformed into a runtime data struct. This step can be expensive, because it is normally run as an offline build step.
3. The runtime data struct is stored in a way so that a provider can use it: a postcard blob, JSON directory tree, Rust module, etc.

These steps are performed by the `icu_datagen`, but clients can also write their own data generation logic.
Steps 1 and 2 are performed by the `icu_provider_source` crate, but clients can also write their own source provider. Step 3 is performed by the `icu_provider_export` crate.
The `icu4x-datagen` tool pulls these two crates together as a CLI.
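
For illustration, a minimal sketch of an invocation; the flags mirror the transliterator test script in this repository, and depending on your configuration additional flags (such as marker selection or source-data roots) may be required, so treat this as a starting point rather than a complete command:

```console
$ cargo run -p icu4x-datagen -- \
    --locales full \
    --format mod \
    --out baked \
    --pretty
```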

When deserializing from the blob store, it is a design principle of ICU4X that no heap allocations will be required. We have many utilities and abstractions to help make this safe and easy.
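
For example, a minimal sketch of such a struct (hypothetical names; assumes `serde` with the `derive` feature and `zerovec` with its `serde` feature enabled):

```rust
use std::borrow::Cow;
use zerovec::ZeroVec;

/// A hypothetical zero-copy data struct (illustrative only, not a real marker).
#[derive(Debug, serde::Deserialize)]
pub struct ExampleV1<'data> {
    /// Borrows string data from the serialized blob instead of allocating a `String`.
    #[serde(borrow)]
    pub message: Cow<'data, str>,
    /// Borrows a packed list of `u32`s instead of allocating a `Vec<u32>`.
    #[serde(borrow)]
    pub values: ZeroVec<'data, u32>,
}
```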

@@ -42,83 +43,43 @@ Additionally, data structs should keep internal invariants to a minimum. For mor

The first step to introduce data into the ICU4X pipeline is to download it from an external source. This corresponds to step 1 above.

When clients use ICU4X, this is generally a manual step, although we may provide tooling to assist with it. For the purpose of ICU4X test data, the tool `download-repo-sources` should automatically download data from the external source and save it in the ICU4X tree. `download-repo-sources` should not do anything other than downloading the raw source data.
When clients use ICU4X, this is generally an automatic step. For the purpose of ICU4X test data, the tool `download-repo-sources` should automatically download data from the external source and save it in the ICU4X tree. `download-repo-sources` should not do anything other than downloading the raw source data.

To download test data into the ICU4X source tree, run:
To add new files to the repo, edit `tools/testdata-scripts/globs.rs.data`, and run:

```console
$ cargo make download-repo-sources
```

### Source Data Providers

"Source data providers" read from a source data file, deserialize it, and transform it to an ICU4X data struct. This corresponds to steps 2 and 3 above.
"Source data providers" read from a source data file, deserialize it, and transform it to an ICU4X data struct. This corresponds to step 2 above.

To add a new source data provider, implement the following traits on [`DatagenProvider`](https://unicode-org.github.io/icu4x/rustdoc/icu_datagen/struct.DatagenProvider.html) for your data marker(s):
To enable generation of a new data marker `M`, add the following implementations to ICU4X's source data provider, [`SourceDataProvider`](https://unicode-org.github.io/icu4x/rustdoc/icu_provider_source/struct.SourceDataProvider.html):

- `DataProvider<M>` for one or more data markers `M`; this impl is the main step where data transformation takes place
- `IterableDataProviderInternal<M>`, which automatically results in a cached impl of `IterableDataProvider<M>`
- `DataProvider<M>`; this impl is the main step where data transformation takes place
- `IterableDataProviderCached<M>`, which automatically results in a cached impl of `IterableDataProvider<M>`

Source data providers are often complex to write. Rules of thumb:
Source data provider implementations are often complex to write. Rules of thumb:

- Optimize for readability and maintainability. The source data providers are not used in production, so performance is not a driving concern; however, we want the transformer to be fast enough to provide a good developer experience.
- If the data source is similar to an existing data source (e.g., importing new data from CLDR JSON), try to share code with existing data providers for that source.
- If the data source is novel, feel free to add a new module under `icu_datagen::transform`.

### Data Exporters and Runtime Data Providers
As the last step, add the marker to the [registry](https://unicode-org.github.io/icu4x/rustdoc/icu_provider_registry/index.html).

"Data exporters" read from one or more ICU4X data structs and dump them to storage. This corresponds to step 4 above.
You can now run `cargo make testdata` to test your implementation on our testing locales. This will generate JSON data in `provider/source/data/debug`, which you can use for debugging.

Examples of data exporters include:
After you are done, add your data marker to the component's `provider::KEYS` list, and run `cargo make bakeddata` to generate compiled data for inclusion in the crate.
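
For example, the typical flow from the repository root, using the commands named above, is:

```console
$ cargo make testdata
$ cargo make bakeddata
```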

- [`FilesystemExporter`](https://unicode-org.github.io/icu4x/rustdoc/icu_provider_fs/export/fs_exporter/struct.FilesystemExporter.html)
- [`BlobExporter`](https://unicode-org.github.io/icu4x/rustdoc/icu_provider_blob/export/struct.BlobExporter.html)
### Data export and runtime data providers

"Runtime data providers" are ones that read serialized ICU4X data structs and deserialize them for use at runtime. These are the providers where performance is the key driving factor.
[`ExportDriver`](https://unicode-org.github.io/icu4x/rustdoc/icu_provider_export/struct.ExportDriver.html) reads from a source data provider and dumps the data to storage. This corresponds to step 3 above. It is parameterized by a [`DataExporter`](https://unicode-org.github.io/icu4x/rustdoc/icu_provider/export/trait.DataExporter.html), which produces data for one of the high-performance runtime providers:

Examples of runtime data providers include:
- [`FilesystemExporter`](https://unicode-org.github.io/icu4x/rustdoc/icu_provider_fs/export/fs_exporter/struct.FilesystemExporter.html) for [`FsDataProvider`](https://unicode-org.github.io/icu4x/rustdoc/icu_provider_fs/struct.FsDataProvider.html)
- [`BlobExporter`](https://unicode-org.github.io/icu4x/rustdoc/icu_provider_blob/export/struct.BlobExporter.html) for [`BlobDataProvider`](https://unicode-org.github.io/icu4x/rustdoc/icu_provider_blob/struct.BlobDataProvider.html)
- [`BakedExporter`](https://unicode-org.github.io/icu4x/rustdoc/icu_provider_baked/export/struct.BakedExporter.html) for a baked data provider

- [`FsDataProvider`](https://unicode-org.github.io/icu4x/rustdoc/icu_provider_fs/struct.FsDataProvider.html)
- [`BlobDataProvider`](https://unicode-org.github.io/icu4x/rustdoc/icu_provider_blob/struct.BlobDataProvider.html)

**Most ICU4X contributors will not need to touch the data exporters or runtime data providers.** New implementations are only necessary when adding a new ICU4X data struct storage mechanism.

### Data Generation Tool (`icu4x-datagen`)

The [data generation tool, i.e., `icu4x-datagen`](https://unicode-org.github.io/icu4x/rustdoc/icu_datagen/index.html), ties together the source data providers with a data exporter.

When adding new data structs, it is necessary to make `icu4x-datagen` aware of your source data provider. To do this, edit
[*provider/datagen/src/registry.rs*](https://github.com/unicode-org/icu4x/blob/main/provider/datagen/src/registry.rs) and add your data provider to the macro

```rust,compile_fail
registry!(
// ...
FooV1Marker,
)
```

as well as to the list of keys

```rust

use std::borrow::Cow;
use icu_provider::prelude::*;

#[derive(Debug, PartialEq, Clone)]
#[icu_provider::data_struct(marker(FooV1Marker, "foo/bar@1"))]
pub struct FooV1<'data> {
message: Cow<'data, str>,
}

```

When finished, run from the top level:

```bash
$ cargo make testdata
```

If everything is hooked together properly, JSON files for your new data struct should appear under *provider/datagen/tests/data/json*.
**Most ICU4X contributors will not need to touch data export or runtime data providers.** New implementations are only necessary when adding a new ICU4X data struct storage mechanism.

## Example

@@ -172,7 +133,7 @@ The above example is an abridged definition for `DecimalSymbolsV1`. Note how the

### CLDR JSON Deserialize

[*provider/datagen/src/transform/cldr/cldr_serde/numbers.rs*](https://github.com/unicode-org/icu4x/blob/main/provider/datagen/src/transform/cldr/cldr_serde/numbers.rs)
[*provider/source/src/cldr_serde/numbers.rs*](https://github.com/unicode-org/icu4x/blob/main/provider/source/src/cldr_serde/numbers.rs)


```rust
@@ -206,35 +167,40 @@ pub struct Resource {
}
```


The above example is an abridged definition of the Serde structure corresponding to CLDR JSON. Since this Serde definition is not used at runtime, it does not need to be zero-copy.
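
For example, a heavily abridged, hypothetical sketch in the same style (not the actual definition); owned types such as `String` are fine because this struct only exists at build time:

```rust
/// A hypothetical, heavily abridged mirror of CLDR JSON number data.
#[derive(Debug, serde::Deserialize)]
pub struct Numbers {
    /// Owned `String`s are acceptable; this type never reaches runtime.
    #[serde(rename = "defaultNumberingSystem")]
    pub default_numbering_system: String,
    #[serde(rename = "minimumGroupingDigits")]
    pub minimum_grouping_digits: String,
}
```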

### Transformer

[*provider/core/src/data_provider.rs*](https://github.com/unicode-org/icu4x/blob/main/provider/core/src/data_provider.rs)

[*provider/core/src/datagen/iter.rs*](https://github.com/unicode-org/icu4x/blob/main/provider/core/src/datagen/iter.rs)

```rust,compile_fail
impl DataProvider<FooV1Marker> for DatagenProvider {
impl DataProvider<FooV1Marker> for SourceDataProvider {
fn load(
&self,
req: DataRequest,
) -> Result<DataResponse<FooV1Marker>, DataError> {
// Use the data inside self.source and emit it as an ICU4X data struct.
// Use the data inside self and emit it as an ICU4X data struct.
// This is the core transform operation. This step could take a lot of
// work, such as pre-parsing patterns, re-organizing the data, etc.
// This method will be called once per option returned by supported_locales.
// This method will be called once per option returned by iter_locales.
}
}
impl IterableDataProviderInternal<FooV1Marker> for FooProvider {
fn supported_locales_impl(
impl IterableDataProviderCached<FooV1Marker> for SourceDataProvider {
fn iter_locales_cached(
&self,
) -> Result<HashSet<DataLocale>, DataError> {
// This should list all supported locales.
}
}
```

The above example is an abridged snippet of code illustrating the most important boilerplate for implementing an ICU4X data transform.
### Registry

```rust,compile_fail
registry!(
// ...
icu::foo::provider::FooV1Marker = "foo/bar@1",
// ...
)
```
2 changes: 2 additions & 0 deletions provider/baked/src/export.rs
@@ -543,6 +543,8 @@ impl DataExporter for BakedExporter {
.parse::<TokenStream>()
.unwrap();

self.dependencies.insert("icu_provider_baked");

let search = if !needs_fallback {
quote! {
let metadata = Default::default();
21 changes: 6 additions & 15 deletions provider/core/README.md


16 changes: 6 additions & 10 deletions provider/core/src/export/mod.rs
@@ -2,11 +2,10 @@
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

//! This module contains various utilities required to generate ICU4X data files, typically
//! via the `icu_datagen` reference crate. End users should not need to consume anything in
//! this module as a library unless defining new types that integrate with `icu_datagen`.
//! This module contains types required to export ICU4X data via the `icu_provider_export` crate.
//! End users should not need to consume anything in this module.
//!
//! This module can be enabled with the `datagen` Cargo feature on `icu_provider`.
//! This module is enabled with the `export` Cargo feature.
mod payload;

@@ -96,18 +95,15 @@ impl ExportableProvider for Box<dyn ExportableProvider> {
}
}

/// This macro can be used on a data provider to allow it to be used for data generation.
/// This macro can be used on a data provider to allow it to be exported by `ExportDriver`.
///
/// Data generation 'compiles' data by using this data provider (which usually translates data from
/// different sources and doesn't have to be efficient) to generate data structs, and then writing
/// them to an efficient format like [`BlobDataProvider`] or [`BakedDataProvider`]. The requirements
/// them to an efficient format like `BlobDataProvider` or `BakedDataProvider`. The requirements
/// for `make_exportable_provider` are:
/// * The data struct has to implement [`serde::Serialize`](::serde::Serialize) and [`databake::Bake`]
/// * The provider needs to implement [`IterableDataProvider`] for all specified [`DataMarker`]s.
/// This allows the generating code to know which [`DataLocale`] to collect.
///
/// [`BlobDataProvider`]: ../../icu_provider_blob/struct.BlobDataProvider.html
/// [`BakedDataProvider`]: ../../icu_datagen/index.html
/// This allows the generating code to know which [`DataIdentifierCow`]s to export.
#[macro_export]
#[doc(hidden)] // macro
macro_rules! __make_exportable_provider {