
Support importing/pasting/dropping audio files #722

Open
humphd opened this issue Nov 10, 2024 · 9 comments · May be fixed by #745

@humphd (Collaborator) commented Nov 10, 2024

We just added support for more file types when you attach/paste/drop them. We also have support for turning audio into text (see src/lib/speech-recognition.ts). Let's add support for importing audio files, which converts the audio to text and includes it in a new message.

@AryanK1511

I can pick this one up. Are you able to assign it to me so I can start working on it?

@humphd (Collaborator, Author) commented Nov 16, 2024

@AryanK1511 it's all yours

@AryanK1511

@humphd, just to confirm my understanding of this issue:

Are we aiming to add the ability to attach, paste, or drop audio files into the chatbot, have their content transcribed into text, and then include that text in a new message? For example, if I attach a pre-recorded audio file saying, "Tell me about the universe," the chatbot would transcribe it and include the text as part of the conversation.

Is that correct?

@humphd (Collaborator, Author) commented Nov 18, 2024

Correct. It should work for the audio file types that we can natively send to the LLMs. Later we could add a step to transcode if necessary, but let's not start with that.

So if I have a podcast MP3, I can attach this file and chat with it (the transcript will be injected into the chat for me).

@humphd (Collaborator, Author) commented Nov 18, 2024 via email

@AryanK1511

Hi @humphd,

After spending a few hours digging into the codebase, I managed to get this working. However, my solution feels quite hacky. I'd like to explain my approach, why I believe the current structure makes a clean solution difficult, and ask for your guidance on how to improve it.

Background

The issue involves handling audio files similarly to how PDFs are processed in the src/hooks/use-file-import.tsx file. Here's the current flow for PDFs:

  1. The following snippet converts PDF files to text via an external API and returns the content:

    if (file.type === "application/pdf") {
        const contents = await pdfToMarkdown(file);
        assertContents(contents);
        return contents;
    }
  2. Later in the same file, the content is added to the chat:

    else if (file.type === "application/pdf") {
        const document = (contents as JinaAiReaderResponse).data;
        chat.addMessage(new ChatCraftHumanMessage({ text: `${document.content}\n` }));
    }

My Approach for Audio Files

To implement audio file handling, the ideal approach would be to mimic the PDF workflow by creating an audioToText function:

if (file.type.startsWith("audio/")) {
    const contents = await audioToText(file);
    assertContents(contents);
    return contents;
}
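
For reference, here is a minimal sketch of what such an audioToText helper might look like (hypothetical; it takes the OpenAI client as a parameter, since where that client would come from outside a component is exactly the problem described below):

import OpenAI from "openai";

// Hypothetical sketch: transcribe an audio File to text using a
// caller-supplied OpenAI client. The model name is an assumption.
async function audioToText(file: File, openai: OpenAI): Promise<string> {
  const transcription = await openai.audio.transcriptions.create({
    file,
    model: "whisper-1",
  });
  return transcription.text;
}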

However, since the project already includes functionality for transcribing audio in src/lib/speech-recognition.ts, I leveraged the transcribe method:

async transcribe(audio: File): Promise<string> {
    const transcriptions = new OpenAI.Audio.Transcriptions(this._openai);
    const transcription = await transcriptions.create({
        file: audio,
        model: this._sttModel,
    });
    return transcription.text;
}

The Problem

Using transcribe requires creating an instance of the SpeechRecognition class, which depends on a model and client:

constructor(sttModel: string, openai: OpenAI) {
    this._sttModel = sttModel;
    this._openai = openai;
}

This initialization is done in src/components/PromptForm/MicIcon.tsx via the useModels hook:

const { getSpeechToTextClient, isSpeechToTextSupported, allProvidersWithModels } = useModels();
const sttClient = await getSpeechToTextClient();
const sttProvider = allProvidersWithModels.find((p) => p.apiUrl === sttClient.baseURL);
const sttModel = sttProvider?.models.find((model) => isSpeechToTextModel(model.name))?.name;
speechRecognitionRef.current = new SpeechRecognition(sttModel, sttClient);

The issue is that hooks like useModels can only be used in React components, whereas the file-processing logic resides outside of React components.
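To illustrate the constraint, React's Rules of Hooks forbid calling a hook from a plain helper function like this (names here are illustrative only):

// Invalid: useModels() is a hook, and hooks may only be called from
// React function components or from other custom hooks.
async function processAudioFile(file: File) {
  const { getSpeechToTextClient } = useModels(); // violates the Rules of Hooks
  // ...
}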

Current Solution

For now, I work around this by doing the following:

  1. Returning an empty string for audio files in use-file-import.tsx:

    if (file.type.startsWith("audio/")) {
        return "";
    }
  2. Handling audio files downstream in the React component:

    else if (file.type.startsWith("audio/")) {
        const sttClient = await getSpeechToTextClient();
        const sttProvider = allProvidersWithModels.find((p) => p.apiUrl === sttClient.baseURL);
        const sttModel = sttProvider?.models.find((model) => isSpeechToTextModel(model.name))?.name;
        const sr = new SpeechRecognition(sttModel, sttClient);
        const text = await sr.transcribe(file);
        chat.addMessage(new ChatCraftHumanMessage({ text: `${text}\n` }));
    }

This approach works, as shown in the attached video.

Screen.Recording.2024-11-18.at.8.18.38.PM.mov

However, it introduces two significant problems:

  1. The file-processing progress indicator never gets triggered for audio files:
    const progressId = progress({
        title: `Processing file${files.length > 1 ? "s" : ""}`,
        progressPercentage: 0,
    });
  2. The logic for audio file handling is fragmented and not aligned with the existing structure.

Request for Guidance

I believe the current structure of the code makes it difficult to implement audio file handling in a clean way. How would you suggest proceeding?

Would you like me to open a PR with my current changes so we can work collaboratively on refining the solution? Note that I’ve made a few other changes to ensure the input works as expected.

Thank you for your guidance!

@humphd (Collaborator, Author) commented Nov 19, 2024

Yeah, that's not ideal. We're going to need to refactor the AI logic out of hooks so that background processes like this can use it as well.

For now, let's make this work with OpenAI and Whisper, per https://platform.openai.com/docs/guides/speech-to-text. You can use settings.ts to call getSettings() and see if you have an OpenAI API key stored. If you do, show audio files as an option; otherwise, don't bother. Then pass that key and an OpenAI client instance to SpeechRecognition.
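
A rough sketch of that flow (import paths, the settings field name, and the Whisper model name are all assumptions, not the project's confirmed API):

import OpenAI from "openai";
import { getSettings } from "../lib/settings";
import { SpeechRecognition } from "../lib/speech-recognition";

// Sketch only: transcribe an audio file if an OpenAI key is stored;
// otherwise return null so the UI can skip offering audio import.
async function transcribeAudioFile(file: File): Promise<string | null> {
  const settings = getSettings();
  const apiKey = settings.apiKey; // assumed field name
  if (!apiKey) {
    return null; // no OpenAI key stored: don't show audio files as an option
  }
  const openai = new OpenAI({ apiKey, dangerouslyAllowBrowser: true });
  const sr = new SpeechRecognition("whisper-1", openai);
  return sr.transcribe(file);
}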

I'd make this work at all first, then we can figure out how to make it work in general. To be honest, the STT stuff is kind of separate from our other chat model handling, so we should probably extract it out, similar to what we're doing with Jina.ai.

cc'ing @Amnish04, who might have other thoughts.

@AryanK1511 commented Nov 19, 2024

Sounds good to me 🫡

Lemme make those changes and send in a PR

@AryanK1511 AryanK1511 linked a pull request Nov 19, 2024 that will close this issue
@AryanK1511

@humphd Everything works perfectly now, and the code is very consistent too. You can have a look at the PR I just sent in and let me know if everything looks good.
