This release includes a number of new features, improvements and bug fixes.
New Features
- Added a benchmarking tool for measuring data loading performance with gcsfuse. (#863)
- Added a Prometheus server to the Latency Profile Generator (LPG) running on port 9090, along with new metrics for prompt_length, response_length, and time_per_output_token. (#857)
- Added support for Google Cloud Monitoring and Managed Collection for the gke-batch-refarch. (#856)
- Added a tutorial on packaging models and low-rank adapters (LoRA) from Hugging Face as images, pushing them to Artifact Registry, and deploying them in GKE. (#855)
Improvements
- Updated outdated references to the Text Generation Inference (TGI) container to use the Hugging Face Deep Learning Containers (DLCs) hosted on Google Cloud's Artifact Registry. (#816)
- Added the ability to benchmark multiple models concurrently in the LPG. (#850)
- Added support for "inf" (infinity) request rate and number of prompts in the LPG. (#847)
- Fixed the latency_throughput_curve.sh script to correctly parse non-integer request rates and added "errors" to the benchmark results. (#850)
Bug Fixes
- Fixed an issue where the README was not rendering correctly. (#862)
New Contributors
- @alvarobartt made their first contribution in #816
- @liu-cong made their first contribution in #850
- @coolkp made their first contribution in #855
- @JamesDuncanNz made their first contribution in #856
Full Changelog: v1.6...v1.7