🎉 Enhancements
- Add support for adapter loading in mllama by @ajtejankar in #669
- Record number of skipped tokens in the response by @tgaddair in #681
- Record TTFT and TPOT in response headers by @tgaddair in #684
- Add cli arg --speculation-max-batch-size by @tgaddair in #686
- Use --predibase-api-token parameter when downloading by @joseph-predibase in #687
- Launcher args for compile max batch size and rank by @tgaddair in #690
🐛 Bugfixes
- Fix stella embeddings + Integration tests for lorax by @magdyksaleh in #668
- Fix lora loading and indexing bug in mllama by @ajtejankar in #682
- Set maximum grpc message receive size to 2GiB by @tgaddair in #667
- Fix frequency_penalty and presence_penalty by @tgaddair in #672
- Fix scores (remove debug code) by @tgaddair in #673
- Fix top_p to allow setting it to 1.0 by @magdyksaleh in #676
- Format fixes tool calling by @magdyksaleh in #680
- Use predibase API token when downloading pbase files by @joseph-predibase in #688
- Pbase adapter source resolution by @magdyksaleh in #689
- fix: Make logprob field optional for response Pydantic validation by @jeffreyftang in #692
🔧 Maintenance
- Only use sha tag for running int tests by @magdyksaleh in #674
- Fix int tests 2 by @magdyksaleh in #675
- Always build and push image before running IT by @arnavgarg1 in #678
- Only push main if int tests pass by @magdyksaleh in #677
- Remove bad check by @magdyksaleh in #683
Full Changelog: v0.12.0...v0.12.1