English | 日本語
Shin Rakuda is a powerful and flexible tool for benchmarking the performance of large language models (LLMs) on given datasets. It provides researchers and developers with an easy-to-use interface to load datasets, select models, run benchmarks, and visualize results. Key features:
- Support for multiple inference libraries (Hugging Face and vLLM)
- Flexible configuration for models, datasets, and evaluation parameters
- Easy-to-use command-line interface
- Visualization of benchmarking results
- Support for both API-based and local models
Prerequisites:

- Python 3.9 or higher
- pip or Poetry for dependency management
- Access to required model APIs (if using API-based models)
- Sufficient computational resources for running local models (if applicable)
- Copy the `.env.example` file to `.env` and configure the model API keys if necessary:

  ```bash
  cp .env.example .env
  ```
- Edit the `config.yaml` file to configure the project. The configuration file is divided into several sections:
  - Models: define the LLMs you want to benchmark
  - Evaluation Datasets: specify the datasets for evaluation
  - Judge Model: configure the model used for judging responses
  - Evaluation Configurations: set up directories and other evaluation parameters

For detailed explanations of each configuration option, please refer to the comments in the `config_template.yaml` file.
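For orientation, here is a minimal sketch of how these sections fit together in one configuration file; the values are illustrative placeholders, not project defaults:

```yaml
# Hypothetical top-level layout of config.yaml (schemas for each section follow below)
models:
  - model_name: my-model        # LLMs to benchmark
eval_datasets:
  - dataset_name: my-dataset    # datasets used for evaluation
judge_models:
  - model_name: my-judge        # model(s) used to judge responses
eval_datasets_dir: ./datasets   # directories and other evaluation parameters
log_dir: ./logs
result_dir: ./results
inference_library: vllm
```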
The `models` section defines each model to benchmark:

```yaml
models:
  # API model
  - model_name: string
    api: boolean # whether the model runs inference via an API; defaults to true for API models
    provider: string # the provider of the model

  # Local model
  - model_name: string # can be any name you want
    api: boolean # whether the model runs inference via an API; defaults to false for local models
    provider: string # model hosting provider; defaults to huggingface
    system_prompt: string
    do_sample: boolean
    vllm_config: # vLLM config section
      model: string # full model name or model ID
      max_model_len: int # maximum model context length
    vllm_sampling_params: # vLLM sampling parameters section
      temperature: float
      top_p: float
      max_tokens: int # maximum number of tokens to generate
      repetition_penalty: float
    hf_pipeline: # Hugging Face pipeline section
      task: string
      model: string # full model name or model ID
      torch_dtype: string
      max_new_tokens: int
      device_map: string
      trust_remote_code: boolean
      return_full_text: boolean
    hf_chat_template: # Hugging Face chat template section
      chat_template: string # either the complete chat template or its format, such as `ChatML`
      tokenize: boolean # whether to tokenize the chat template
      add_generation_prompt: boolean # whether to add a generation prompt
```
You can add further Hugging Face or vLLM configuration parameters as you see fit, and Shin Rakuda will process them accordingly. Note, however, that Shin Rakuda will NOT work if the configuration contains extra parameters that are not supported by the chosen inference library.
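As a concrete illustration, here is a hypothetical `models` section with one API model and one local vLLM model; the model names, provider values, and parameter settings are assumptions for this sketch, not recommendations:

```yaml
models:
  # API-based model (hypothetical name and provider)
  - model_name: gpt-4o-mini
    api: true
    provider: openai

  # Local model served through vLLM (hypothetical values)
  - model_name: my-local-llama
    api: false
    provider: huggingface
    system_prompt: "You are a helpful assistant."
    do_sample: true
    vllm_config:
      model: meta-llama/Meta-Llama-3-8B-Instruct
      max_model_len: 4096
    vllm_sampling_params:
      temperature: 0.7
      top_p: 0.9
      max_tokens: 1024
      repetition_penalty: 1.1
```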
Each evaluation dataset is configured as follows:

```yaml
eval_datasets:
  - dataset_name: string # dataset name
    judge_prompt_template: string # judge prompt template
    num_questions: int # optional; number of questions to evaluate
    random_questions: boolean # optional; whether to select questions randomly when num_questions is provided
    use_jinja: boolean # whether to use Jinja templating for the judge prompt
    score_keyword: string # keyword used to extract the score from the model output; see the config_template.yaml file for the format
```
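For example, a dataset entry might look like the following; the prompt template and score keyword are illustrative assumptions:

```yaml
eval_datasets:
  - dataset_name: japanese_mt_bench
    judge_prompt_template: "Rate the assistant's answer from 1 to 10. End with 'Rating: <score>'."  # hypothetical template
    num_questions: 20
    random_questions: true
    use_jinja: false
    score_keyword: "Rating:"  # hypothetical keyword; match it to your template
```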
The `judge_models` section configures the model(s) used for judging:

```yaml
judge_models:
  - model_name: string # repo_id of the model
    api: boolean # whether the model runs inference via an API
    provider: string # the provider of the model
```
Finally, the top-level evaluation configuration:

```yaml
eval_datasets_dir: string # directory containing the evaluation datasets
log_dir: string # directory where the logs are saved
result_dir: string # directory where the evaluation results are saved
existing_eval_dir: string # optional; directory containing existing results to compare with, so evaluation is not re-run for models already covered
inference_library: string # inference library to use: "hf" or "huggingface" for Hugging Face, "vllm" for vLLM
```
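Putting the last two sections together, a hypothetical judge and evaluation configuration might look like this (the judge model and paths are assumptions):

```yaml
judge_models:
  - model_name: gpt-4o    # hypothetical judge model
    api: true
    provider: openai

eval_datasets_dir: ./datasets
log_dir: ./logs
result_dir: ./results
inference_library: vllm   # or "hf" / "huggingface"
```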
To set up the project with pip:

```bash
# Create a virtual environment
python3 -m venv .venv

# Activate the virtual environment
source .venv/bin/activate

# Install the dependencies
pip install -r requirements.txt

# Upgrade filelock to resolve a known bug
pip install --upgrade filelock
```
Alternatively, this project uses `pyproject.toml` for dependency management with Poetry. To install additional dependencies:

- Add the dependency to the `pyproject.toml` file:

  ```bash
  poetry add <dependency>
  ```

- Run `poetry install` to update your environment.
Run the end-to-end evaluation script:

```bash
python3 scripts/evaluate_llm.py --config-name config_xxx
```

Replace `config_xxx` with the name of your configuration file (without the `.yaml` extension) located in the `configs` directory.
Example output:

```
Start Shin Rakuda evaluation...
Processing datasets: 100%|██████████| 2/2 [00:00<00:00, 5.01it/s]
Evaluating japanese_mt_bench...
Processing models: 100%|██████████| 3/3 [00:00<00:00, 15.08it/s]
...
```
After the evaluation is complete, you can find the results and visualizations in the `result_dir` specified in your configuration file.
Roadmap:

- Add support for Llama 3.1 models
- Improve Hugging Face pipeline support
- Upgrade vLLM (pinned until a vLLM release that properly supports GPU memory release)
We welcome contributions to Shin Rakuda! Here's how you can help:
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Please make sure to update tests as appropriate and adhere to the project's coding standards.
- If you encounter CUDA out-of-memory errors, try reducing the `max_model_len` or `max_tokens` parameters in your model configuration (see the sketch after this list).
- For issues with specific models or datasets, check the model provider's documentation or the dataset source for any known limitations or requirements.
- If you're having trouble with dependencies, make sure you're using the correct version of Python and have installed all required packages.
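A minimal sketch of those memory-related knobs, assuming a local vLLM model entry like the hypothetical one above; the numbers are illustrative starting points, not tuned values:

```yaml
models:
  - model_name: my-local-llama  # hypothetical entry
    api: false
    provider: huggingface
    vllm_config:
      model: meta-llama/Meta-Llama-3-8B-Instruct  # hypothetical model ID
      max_model_len: 2048   # halved to shrink the KV cache
    vllm_sampling_params:
      max_tokens: 512       # cap generated tokens to reduce memory pressure
```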
For more help, please open an issue on the GitHub repository.
If you use Shin Rakuda in your research, please cite it as follows:
```bibtex
@software{shin_rakuda,
  author = {YuzuAI},
  title = {Shin Rakuda: A Flexible LLM Benchmarking Tool},
  year = {2024},
  url = {https://github.com/yourusername/shin-rakuda}
}
```