Skip to content

Latest commit

 

History

History
885 lines (620 loc) · 17.5 KB

rustgraz_talk_06.md

File metadata and controls

885 lines (620 loc) · 17.5 KB
author title date
Lukas Prokop
Rust Graz – 06 Unicode
18th of December, 2019

Unicode; 18th of December 2019 Rust Graz, lab10

Prologue


Clarification 1: debug/release build, stack size, cargo

fn count_calls(n: u64) -> u64 {
    println!("{:p}", &n);
    if n < 1 {
        0
    } else {
        1 + count_calls(n - 1)
    }
}

fn main() {
    println!("{}", count_calls(174470))
}

Clarification 1: debug/release

% cargo run
…
0x7ffc9324f6b0
0x7ffc9324f610
0x7ffc9324f570

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
[1]    11645 abort      cargo run

⇒ result of last time: 160 bytes per stackframe


Clarification 1: debug/release

% cargo run --release
…
0x7ffe628fe5a8
0x7ffe628fe548
0x7ffe628fe4e8

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
[1]    11803 abort      cargo run --release

⇒ 96 bytes per stackframe


Clarification 1: debug/release

“The Manifest Format” via Cargo book

# The development profile, used for `cargo build`.
[profile.dev]
# controls the `--opt-level` the compiler builds with.
# 0-1 is good for debugging. 2 is well-optimized. Max is 3.
# 's' attempts to reduce size, 'z' reduces size even more.
opt-level = 0

# (u32 or bool) Include debug information (debug symbols).
# Equivalent to `-C debuginfo=2` compiler flag.
debug = true

# Link Time Optimization usually reduces size of binaries
# and static libraries. Increases compilation time.
# If true, passes `-C lto` flag to the compiler, and if a
# string is specified like 'thin' then `-C lto=thin` will
# be passed.
lto = false

# The release profile, used for `cargo build --release`
# (and the dependencies for `cargo test --release`,
# including the local library or binary).
[profile.release]
opt-level = 3
debug = false
lto = false

# The testing profile, used for `cargo test` (for `cargo
# test --release` see the `release` and `bench` profiles).
[profile.test]
opt-level = 0
debug = 2
lto = false

# The benchmarking profile, used for `cargo bench` (and the
# test targets and unit tests for `cargo test --release`).
[profile.bench]
opt-level = 3
debug = false
lto = false

Clarification 1: debug/release

debug build (160 bytes, 2.4MB):

% ls -l ./target/debug/buildtest 
-rwxrwxr-x 2 user user 2514680 Dec 17 22:15 ./target/debug/buildtest

release build (96 bytes, 2.4MB):

% ls -l ./target/release/buildtest 
-rwxrwxr-x 2 user user 2497912 Dec 17 22:22 ./target/release/buildtest

Clarification 1: debug/release

Old Cargo.toml:

[package]
name = "buildtest"
version = "0.1.0"
authors = ["meisterluk <[email protected]>"]
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]

Clarification 1: debug/release

New Cargo.toml:

[package]
name = "buildtest"
version = "0.1.0"
authors = ["meisterluk <[email protected]>"]
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[profile.release]
opt-level = 3
debug = true
lto = true

[dependencies]

Clarification 1: debug/release

custom release build (always opt-level=3)

  • debug=true, lto=false:
    96 bytes stackframe & 2507520 bytes executable
  • debug=true, lto=true: 96 & 974440 bytes
  • debug=false, lto=true: 96 & 965904 bytes

I don't know how to go below 96 bytes stack frames.


Clarification 2: PartialEq and Eq

PartialEq

symmetric
a == b implies b == a; and
transitive
a == b and b == c implies a == c.

Eq

additionally reflexive
a == a;

Clarification 2: PartialEq and Eq

use std::f64;
fn main() {
    println!("{}", f64::NAN == f64::NAN); // false
}

⇒ If you implement PartialEq then #[derive(Eq)] as well unless you can’t

Related traits: Hash, PartialOrd, Ord

Dialogue: Unicode


Unicode

Unicode maps numbers to code points.

Unicode 12.1 maps numbers to 137,994 unicode code points.

How can we encode 137,994 Unicode code points to bytes? ⇒ Unicode Transformation Format (UTF).


れいReiwa combined Kanji

  • 2019/04/30: Emperor Akihito abdicated. 2019/05/01: Emperor Naruhito ascended the throne. 1 character added in 12.1
  • Previously へいせい ⇒ ㍻: U+5E73 CJK UNIFIED IDEOGRAPH-5E73 and U+6210 CJK UNIFIED IDEOGRAPH-6210 merged into U+337B SQUARE ERA NAME HEISEI

Unicode

I came up with some UTF. I will introduce 5 versions of a “Complementary Properties Encoding” (CPE). Let's discuss its properties.


CPE5

2 bytes = 16 bits. Fixed-width encoding. What are potential problems?

Custom ‘CPE5’ fixed-width encoding using 2 bytes


CPE4

Custom ‘CPE4’ encoding using up to 4 bytes. Leading bytes and continuation bytes are distinguished by suffix bit


CPE3

Custom ‘CPE3’ encoding using up to 4 bytes. Leading bytes and continuation bytes are distinguished by leading bit


CPE2

Custom ‘CPE2’ encoding using up to 4 bytes. First byte denotes how many continuation bytes follow


CPE1

Custom ‘CPE1’ encoding using up to 4 bytes. First byte denotes how many continuation bytes follow. Continuation bytes have distinguishing prefix


UTF-8

UTF-8 encoding using up to 4 bytes


UTF-8 properties

Backward/ASCII compatibility
Setting one special bit of single byte, we have 7 remaining bit with same assignment like ASCII
Extended ASCII detection/fallback
UTF-8 multibyte strings are rarely linguistically legit Extended ASCII strings.
þ ⇒ þ, ø ⇒ ø, ß ⇒ ß
Prefix freedom
There is no whole code word in the system that is a prefix of any other code word in the system

UTF-8 properties

Self-synchronization
If we jump to some byte, we can easily determine the start of the next character
Sorting order
Lexicographical order of bytes equal unicode codepoint order

via Wikipedia


UTF-8 fallback example

Wikipedia Mojibake example

UTF-8 encoded Japanese Wikipedia rendered in cp1252


UTF-8 fallback example

use std::fs::File;
use std::io::prelude::*;

fn main() -> std::io::Result<()> {
    let mut fd = File::create("pile_of_poo.html")?;
    fd.write(b"<!DOCTYPE html>\n<head><title>\
\xf0\x9f\x92\xa9</title>\n")?;
    Ok(())
}

UTF-8 fallback example

use std::fs::File;
use std::io::prelude::*;

fn main() -> std::io::Result<()> {
    let mut fd = File::create("mojibake.html")?;
    fd.write(b"<!DOCTYPE html>\n<head><title>\xda \
\xf0\x9f\x92\xa9</title>\n")?;
    Ok(())
}

UTF-8 fallback example

Pile of Poo HTML title

Pile of Poo HTML title


UTF-8 fallback example

Mojibake in title

Mojibake in title


CPE

versionASCII compatfallbackprefix-freeself-syncsort
5?
4?
3?
2?
1?

Unicode / UTF-8 terminology

Mojibake
Character rendered in wrong encoding
Han unification
Korean and Japanese writing systems are based on Chinese characters ⇒ huge overlap ⇒ merge different writing systems
Overlong encoding
Remove leading zeros in your binary string. Then cram those bits into 1-4 UTF-8 bytes; as few as needed! If you take more bytes, you have overlong encoding; which is disallowed.

UTF-8

UTF-8 encoding using up to 4 bytes


Unicode: Han Unification

One possible rationale is the desire to limit the size of the full Unicode character set, where CJK characters as represented by discrete ideograms may approach or exceed 100,000 characters. Version 1 of Unicode was designed to fit into 16 bits and only 20,940 characters (32%) out of the possible 65,536 were reserved for these CJK Unified Ideographs.


Unicode: Han Unification

Han unification


Unicode: Han Unification

TRON Code is a multi-byte character encoding used in the TRON project. It is similar to Unicode but does not use Unicode's Han unification process: each character from each CJK character set is encoded separately, including archaic and historical equivalents of modern characters

via Wikipedia: TRON encoding


Unicode / UTF-8 terminology

Surrogates
In UTF16: Extension to encode 2.5 bytes in a 2 bytes fixed-width encoding.
In UTF8: Invalid bit patterns for compatibility with UTF16.
Basic Multilingual Plane
Plane of common use characters (65,536 code points)

Unicode: Surrogates

UTF-16 surrogates

compare with Wikipedia


Unicode

UTF-8 encoding of Black female detective

via twitter


Unicode

  1. shapecatcher.com
  2. joelonsoftware.com: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

UTF-8 in rust

const TEXT: &str = "Héllö Wør̶l̶d";

fn main() {
    println!("{}", TEXT);
}

String

fn main() {
    let s = String::from("Hello Graz");
    println!("{}", s);
}

String

fn main() {
    // let s = String::with_capacity(11);
    let s = String::from("Hello Graz");
    s += "!";  // also:  s.push_str("!");
    println!("{}", s);
}

Does it compile? No, it's immutable.


String

fn main() {
    let mut s = String::from("Hello Graz");
    s += "!";
    println!("{}", s);
}

Does it compile? Yes.


String

  • data must be valid UTF-8 string
  • owns its data (dropping String means deallocate data)
  • ⇒ “owned string”
  • does not implement Copy, thus move semantics apply
  • consists of {&data, length, capacity}

String

// \xD8 is a surrogate code point
fn main() {
    let s = String::from("Hello\xD8Graz!");
    println!("{}", s);
}

String

error: this form of character escape may only
       be used with characters in the range [\x00-\x7f]
 --> src/main.rs:3:32
  |
2 |     let s = String::from("Hello\xD8Graz!");
  |                                ^^^^

String

fn sub(arg: String) {
    println!("{}", arg);
}

fn main() {
    let s = String::from("Hello Graz");
    sub(s);
    println!("{}", s);
}

Does it compile?


String

error[E0382]: borrow of moved value: `s`
 --> src/main.rs:8:20
  |
6 |     let s = String::from("Hello Graz");
  |         - move occurs because `s` has type
  |           `std::string::String`, which does
  |           not implement the `Copy` trait
7 |     sub(s);
  |         - value moved here
8 |     println!("{}", s);
  |                    ^ value borrowed here after move

&str

  • data must be valid UTF-8 string
  • stored in .data/.rodata, does not deallocate
  • ⇒ “borrowed string”, lives as long as the program
  • consists of {&data, length}

&str: Syntax

fn main() {
    let s = "H\x65\u{6C}lo \
             Graz";
    println!("{}", s);
}

&str: Syntax

fn main() {
    println!("\u{1F4A9}");
}

💩


&str

fn main() {
    let s: &str = "Hello Graz";
    s += "!";
    println!("{}", s);
}

&str

error[E0368]: binary assignment operation
    `+=` cannot be applied to type `&str`
 --> src/main.rs:3:5
  |
3 |     s += "!";
  |     -^^^^^^^
  |     |
  |     cannot use `+=` on type `&str`

&str

fn sub(arg: &str) {
    println!("{}", arg);
}

fn main() {
    let s: &str = "Hello Graz!";
    sub(s);
    println!("{}", s);
}

Does it compile? Yes.


&str

fn main() {
    let s = "Hello Graz!";
    let b = "Hello Graz!";
    println!("{:p} {:p}", s, b);
    // 0x557ab7f09cc0 0x557ab7f09cc0
}

memory size

use std::mem;

fn main() {
    println!("{}", mem::size_of::<A>());
}

where A is u8 (1), u32 (4), f64 (8), &u8 (8), String (24), &str (16), Vec<u8> (24) or &[char] (16).


Other types

  • Vec<u8> can contain an arbitrary non-UTF-8 string
  • char is always 4 bytes and thus can contain any UTF-8 code point
  • [u8] is a slice of u8. Cumbersome to handle
  • OsString, ffi::CString … if you need compatibility strings.

String operations

fn main() {
    println!("{}", "ß".to_uppercase());
}

Output: SS

Compare with Unicode casemap F.A.Q.


string's lifetime

fn main() {
    let s = {
        let alt: &str = "Graz";
        alt
    };
    println!("{}", s);
}

Does it compile? Yes.


string's lifetime

fn main() {
    let s = {
        let alt: String = "Graz".to_string();
        alt
    };
    println!("{}", s);
}

Does it compile? Yes.


string's lifetime

fn main() {
    let s = {
        let alt: String = "Graz".to_string();
        alt.as_str()
    };
    println!("{}", s);
}

Does it compile?


string's lifetime

error[E0597]: `alt` does not live long enough
 --> src/main.rs:4:9
  |
2 |     let s = {
  |         - borrow later stored here
3 |         let alt: String = "Graz".to_string();
4 |         alt.as_str()
  |         ^^^ borrowed value does not live long enough
5 |     };
  |     - `alt` dropped here while still borrowed

Deref trait magic

fn takes_str(s: &str) {}

let s = String::from("Hello");

takes_str(&s);

via std::string::String


Indexing

fn main() {
    let mut s = "合気道";
    println!("{}", s[1]);
}

Indexing

error[E0277]: the type `str` cannot be indexed by `{integer}`
 --> src/main.rs:3:20
  |
3 |     println!("{}", s[1]);
  |                    ^^^^
  | string indices are ranges of `usize`

Indexing

fn main() {
    let mut s = "合気道".chars();
    println!("{}", s.nth(1).unwrap()); // 気
}

Indexing

fn main() {
    for s in "देवनागरी".chars() {
        println!("{}", s);
    }
}
द
े
व
न
ा
ग
र
ी

Unicode

Interesting read: Stackoverflow “Why does modern Perl avoid UTF-8 by default?”

Epilogue


Quiz

ASCII is a __-bit encoding
ASCII is a 7-bit encoding
Maximum number of bytes of a UTF-8 code point?
4
Which string types does rust define?
&str, String
std::mem::size_of::<char>() gives?
4
How to iterate over characters of a string?
let mut s = "Hello world".chars();

Next session

Wed, 2019/01/29 19:00

Topic: traits


Thanks!