This repository has been archived by the owner on Nov 3, 2024. It is now read-only.

Pronouncer

A Rust-based text-to-speech synthesizer that uses the CMU phonetic dictionary and pre-recorded phonemes to generate funny-sounding speech in my voice. To use your own voice, replace the samples in pronouncer_lib/audio with your own recordings; they are compiled into the program at build time.

Features

  • Text-to-speech synthesis using CMU phonetic dictionary
  • High-quality pre-recorded phonemes for natural sound
  • Crossfading between phonemes for smoother audio transitions
  • Outputs standard WAV audio files (44.1 kHz, 16-bit)
  • Static compilation of audio data for standalone binaries

Installation

  1. Ensure you have Rust installed (https://rustup.rs/)
  2. Clone this repository
  3. Build the project:
cargo build --release

Usage

Pass the text to synthesize as a command-line argument:

cargo run --release -- "hello world"

Or run it without arguments and type the text at the prompt:

cargo run --release
Enter a string: hello world

The program will generate an output.wav file containing the synthesized speech.


Project Structure

The project is organized as a Rust workspace containing two main crates:

pronouncer_lib

Core library containing the text-to-speech engine:

  • src/lib.rs - Main library interface and audio processing
  • src/phoneme.rs - Phoneme enum and conversion functions
  • build.rs - Build script for processing dictionary and audio files
  • audio/ - Pre-recorded WAV files for each phoneme
  • build/ - Build-time resources including CMU dictionary

pronouncer_bin

Command-line interface executable:

  • src/main.rs - CLI implementation
  • Handles argument parsing and file I/O

Key Components

  1. Build System

    • Processes CMU dictionary at compile time
    • Serializes phoneme WAV files into binary data
    • Generates optimized lookup tables
  2. Phoneme System

    • 39 distinct phonemes based on CMU dictionary
    • Each phoneme has a corresponding WAV recording
    • Efficient enum-based representation
  3. Audio Processing

    • 44.1kHz 16-bit mono WAV output
    • Crossfading algorithm for smooth transitions
    • Fileless audio storage - phoneme WAV data is serialized and embedded directly into the binary
  4. Dictionary System

    • CMU dictionary-based word to phoneme conversion
    • Fallback to character-by-character pronunciation
    • Efficient hashmap-based lookups
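The dictionary lookup and its fallback can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names (parse_entry, spell_out, word_to_phonemes) are hypothetical, phonemes are kept as plain strings instead of the project's enum, and std's HashMap stands in for hashbrown. The fallback assumes the dictionary contains entries for single letters, so an unknown word is spelled out letter by letter.

```rust
use std::collections::HashMap;

/// Parse one CMU-dictionary-style line, e.g. "HELLO  HH AH0 L OW1",
/// into a lowercase word and its phonemes with stress digits removed.
fn parse_entry(line: &str) -> Option<(String, Vec<String>)> {
    // Lines starting with ";;;" are treated as comments and skipped.
    if line.starts_with(";;;") {
        return None;
    }
    let mut parts = line.split_whitespace();
    let word = parts.next()?.to_lowercase();
    let phonemes: Vec<String> = parts
        .map(|p| p.trim_end_matches(|c: char| c.is_ascii_digit()).to_string())
        .collect();
    if phonemes.is_empty() {
        None
    } else {
        Some((word, phonemes))
    }
}

/// Hypothetical fallback: pronounce an unknown word one character at a time,
/// silently skipping characters that have no dictionary entry.
fn spell_out(word: &str, dict: &HashMap<String, Vec<String>>) -> Vec<String> {
    word.chars()
        .filter_map(|c| dict.get(&c.to_string()))
        .flatten()
        .cloned()
        .collect()
}

/// Look the word up case-insensitively; fall back to spelling it out.
fn word_to_phonemes(word: &str, dict: &HashMap<String, Vec<String>>) -> Vec<String> {
    let key = word.to_lowercase();
    match dict.get(&key) {
        Some(phonemes) => phonemes.clone(),
        None => spell_out(&key, dict),
    }
}
```

With a dictionary built from the lines "HELLO  HH AH0 L OW1", "H  EY1 CH" and "I  AY1", looking up "Hello" returns the stored entry, while the unknown word "hi" is assembled from the entries for "h" and "i".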

Technical Details

Build Process

  1. The build script (build.rs) processes the CMU dictionary and WAV files
  2. The dictionary is converted to a binary lookup table using bincode serialization
  3. WAV files are serialized and embedded directly into the binary
  4. Static initialization provides immediate access to audio data at runtime
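As an illustration of steps 3 and 4, the sketch below embeds a few bytes of audio in the binary and decodes them once on first access. It is a simplified stand-in under stated assumptions: the project itself uses bincode and lazy_static, whereas this example uses only the standard library (std::sync::OnceLock) and raw little-endian 16-bit PCM. The names EMBEDDED_PCM, decode_pcm16 and phoneme_samples are hypothetical.

```rust
use std::sync::OnceLock;

// Stand-in for data a build script would embed in the binary, for example
// with `include_bytes!` on a generated file (the real file name is not
// given in this README). Four little-endian 16-bit samples: 0, 1, -1, 256.
static EMBEDDED_PCM: [u8; 8] = [0x00, 0x00, 0x01, 0x00, 0xFF, 0xFF, 0x00, 0x01];

/// Decode raw little-endian 16-bit PCM bytes into samples.
/// A trailing odd byte, if any, is ignored.
fn decode_pcm16(bytes: &[u8]) -> Vec<i16> {
    bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

/// Decode the embedded bytes on first call, then return the cached samples.
fn phoneme_samples() -> &'static [i16] {
    static SAMPLES: OnceLock<Vec<i16>> = OnceLock::new();
    SAMPLES.get_or_init(|| decode_pcm16(&EMBEDDED_PCM))
}
```

Because the bytes live in the executable itself, the first call to phoneme_samples() touches no files, and later calls return the same cached slice.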

Audio Synthesis Process

  1. Input text is normalized and split into words
  2. Words are looked up in the CMU dictionary
  3. Unknown words fall back to character-by-character pronunciation
  4. Phoneme sequences are converted to audio samples
  5. Crossfading is applied between adjacent phonemes
  6. The final audio is written to a WAV file
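The crossfading step joins consecutive phoneme recordings so that one fades out while the next fades in. This README does not specify the exact algorithm, so the sketch below assumes a plain linear crossfade over a fixed number of overlapping samples; the function name append_crossfaded and the overlap parameter are hypothetical.

```rust
/// Append `next` to `out`, blending the last `overlap` samples of `out`
/// with the first `overlap` samples of `next` using a linear ramp.
fn append_crossfaded(out: &mut Vec<i16>, next: &[i16], overlap: usize) {
    // Clamp the overlap so it never exceeds either buffer.
    let n = overlap.min(out.len()).min(next.len());
    let start = out.len() - n;
    for i in 0..n {
        // Weight of the incoming phoneme, rising strictly between 0 and 1.
        let t = (i + 1) as f32 / (n + 1) as f32;
        let mixed = out[start + i] as f32 * (1.0 - t) + next[i] as f32 * t;
        out[start + i] = mixed.round() as i16;
    }
    // The rest of the incoming phoneme is copied unchanged.
    out.extend_from_slice(&next[n..]);
}
```

Each blended sample is a weighted average of two 16-bit values, so it stays within the 16-bit range. The output is shorter than a plain concatenation by the number of overlapping samples, and an overlap of 0 reduces to simple concatenation.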

Performance Considerations

  • Audio data is compiled directly into the binary, eliminating runtime file I/O
  • Efficient bincode serialization for compact data storage
  • High-performance hashmap-based dictionary lookups
  • Optimized crossfading algorithm for smooth transitions

Dependencies

Core dependencies:

  • bincode: Fast serialization
  • hashbrown: High-performance hashmaps
  • hound: WAV file handling
  • lazy_static: Lazily initialized statics
  • serde: Serialization framework

License

MIT