
Proof of concept implementation for OpenAI compatible API format #237

Open · wants to merge 1 commit into main
Conversation

ayancey (Collaborator) commented on Aug 7, 2024

Quick and dirty implementation of what it would look like to support OpenAI's API format. This is an attempt to satisfy #227.

Example OpenAI output from /v1/audio/transcriptions:

{
    "task": "transcribe",
    "language": "english",
    "duration": 9.90999984741211,
    "text": "The dog jumped over the big fence and then it ran over to the farm.",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 11.0,
            "text": " The dog jumped over the big fence and then it ran over to the farm.",
            "tokens": [
                50364,
                440,
                3000,
                13864,
                670,
                264,
                955,
                15422,
                293,
                550,
                309,
                5872,
                670,
                281,
                264,
                5421,
                13,
                50914
            ],
            "temperature": 0.0,
            "avg_logprob": -0.3397972285747528,
            "compression_ratio": 1.0634920597076416,
            "no_speech_prob": 0.02906951494514942
        }
    ]
}

Some notes:

  • OpenAI's implementation uses form data, not JSON input (see the route sketch after this list).
  • The response formats offered are json, text, srt, verbose_json, and vtt. json returns only a "text" key, whereas verbose_json includes the other basic info shown above and is used together with the timestamp_granularities[] array to provide segment- or word-level timestamps. Since we always get segments internally, we have to throw them away when the json format is used (second sketch below).
  • Need a way to get the duration of the file to match OpenAI's output. I can divide the length of the NumPy array by the sample rate to get the number of seconds, but I also have to divide by 2. I doubt it's stereo; that wouldn't make much sense (see the duration sketch below).
  • The abstraction between the endpoint route methods and the core.py methods for whisper/faster-whisper needs to change, mostly so the JSON keys can be modified before they're turned into a StringIO stream.
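
Since the route takes multipart form data rather than a JSON body, the parameters have to be declared as Form/File fields. A minimal sketch of what that could look like with FastAPI; the parameter names follow OpenAI's documented fields, and everything past the signature is deliberately omitted:

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

@app.post("/v1/audio/transcriptions")
async def transcriptions(
    file: UploadFile = File(...),         # the audio upload, sent as form data
    model: str = Form("whisper-1"),       # accepted for compatibility; likely ignored here
    response_format: str = Form("json"),  # json | text | srt | verbose_json | vtt
    temperature: float = Form(0.0),
):
    ...  # decode the upload, run whisper/faster-whisper, shape the response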
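
Dropping the segments for the plain json format can then be a small post-processing step. A sketch using a hypothetical shape_response helper (not an existing function in this repo):

def shape_response(result: dict, response_format: str) -> dict:
    # json exposes only the transcript; verbose_json keeps task, language,
    # duration, and the segments we always compute internally anyway.
    if response_format == "json":
        return {"text": result["text"]}
    return result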
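
On the duration question: Whisper's load_audio decodes everything to mono float32 at 16 kHz, so seconds should just be the array length divided by 16000. The extra divide-by-2 is more plausibly because the measured buffer holds raw 16-bit PCM bytes (2 bytes per sample) than because the audio is stereo. A sketch under that assumption:

import numpy as np

SAMPLE_RATE = 16000  # whisper resamples all input to 16 kHz mono

def duration_seconds(audio: np.ndarray) -> float:
    # decoded float32 array: one entry per sample
    return audio.shape[0] / SAMPLE_RATE

def duration_from_pcm_bytes(raw: bytes) -> float:
    # raw 16-bit PCM: each sample occupies 2 bytes, which would explain
    # the extra divide-by-2 without the audio being stereo
    return len(raw) / 2 / SAMPLE_RATE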

ahmetoner self-assigned this on Aug 19, 2024
ahmetoner added the enhancement label on Aug 19, 2024
ahmetoner (Owner) commented:

I am planning to merge this update for version 2.
