
Support importing/pasting/dropping audio files #722

Open
humphd opened this issue Nov 10, 2024 · 9 comments · May be fixed by #745

@humphd (Collaborator) commented Nov 10, 2024

We just added support for more file types when you attach/paste/drop them. We also have support for turning audio into text (see src/lib/speech-recognition.ts). Let's add support for importing audio files, which converts the audio to text and includes it in a new message.

@AryanK1511

I can pick this one up. Are you able to assign it to me so I can start working on it?

@humphd (Collaborator, Author) commented Nov 16, 2024

@AryanK1511 it's all yours

@AryanK1511

@humphd, just to confirm my understanding of this issue:

Are we aiming to add the ability to attach, paste, or drop audio files into the chatbot, have their content transcribed into text, and then include that text in a new message? For example, if I attach a pre-recorded audio file saying, "Tell me about the universe," the chatbot would transcribe it and include the text as part of the conversation.

Is that correct?

@humphd (Collaborator, Author) commented Nov 18, 2024

Correct. It should work for the audio file types that we can natively send to the LLMs. Later we could add a step to transcode if necessary, but let's not start with that.

So if I have a podcast MP3, I can attach this file and chat with it (the transcript will be injected into the chat for me).

@humphd (Collaborator, Author) commented Nov 18, 2024 via email

@AryanK1511

Hi @humphd,

After spending a few hours digging into the codebase, I managed to get this working. However, my solution feels quite hacky. I'd like to explain my approach, why I believe the current structure makes a clean solution difficult, and ask for your guidance on how to improve it.

Background

The issue involves handling audio files similarly to how PDFs are processed in the src/hooks/use-file-import.tsx file. Here's the current flow for PDFs:

  1. The following snippet converts PDF files to text via an external API and returns the content:

    if (file.type === "application/pdf") {
        const contents = await pdfToMarkdown(file);
        assertContents(contents);
        return contents;
    }
  2. Later in the same file, the content is added to the chat:

    else if (file.type === "application/pdf") {
        const document = (contents as JinaAiReaderResponse).data;
        chat.addMessage(new ChatCraftHumanMessage({ text: `${document.content}\n` }));
    }

My Approach for Audio Files

To implement audio file handling, the ideal approach would be to mimic the PDF workflow by creating an audioToText function:

if (file.type.startsWith("audio/")) {
    const contents = await audioToText(file);
    assertContents(contents);
    return contents;
}
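
For reference, here is a minimal sketch of what such an audioToText helper might look like (hypothetical; it takes the OpenAI client as a parameter, since where that client would come from outside a component is exactly the problem described below):

import OpenAI from "openai";

// Hypothetical sketch: transcribe an audio File to text using a
// caller-supplied OpenAI client. The model name is an assumption.
async function audioToText(file: File, openai: OpenAI): Promise<string> {
  const transcription = await openai.audio.transcriptions.create({
    file,
    model: "whisper-1",
  });
  return transcription.text;
}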

However, since the project already includes functionality for transcribing audio in src/lib/speech-recognition.ts, I leveraged the transcribe method:

async transcribe(audio: File): Promise<string> {
    const transcriptions = new OpenAI.Audio.Transcriptions(this._openai);
    const transcription = await transcriptions.create({
        file: audio,
        model: this._sttModel,
    });
    return transcription.text;
}

The Problem

Using transcribe requires creating an instance of the SpeechRecognition class, which depends on a model and client:

constructor(sttModel: string, openai: OpenAI) {
    this._sttModel = sttModel;
    this._openai = openai;
}

This initialization is done in src/components/PromptForm/MicIcon.tsx via the useModels hook:

const { getSpeechToTextClient, isSpeechToTextSupported, allProvidersWithModels } = useModels();
const sttClient = await getSpeechToTextClient();
const sttProvider = allProvidersWithModels.find((p) => p.apiUrl === sttClient.baseURL);
const sttModel = sttProvider?.models.find((model) => isSpeechToTextModel(model.name))?.name;
speechRecognitionRef.current = new SpeechRecognition(sttModel, sttClient);

The issue is that hooks like useModels can only be used in React components, whereas the file-processing logic resides outside of React components.
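To illustrate the constraint, React's Rules of Hooks forbid calling a hook from a plain helper function like this (names here are illustrative only):

// Invalid: useModels() is a hook, and hooks may only be called from
// React function components or from other custom hooks.
async function processAudioFile(file: File) {
  const { getSpeechToTextClient } = useModels(); // violates the Rules of Hooks
  // ...
}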

Current Solution

For now, I work around this by doing the following:

  1. Returning an empty string for audio files in use-file-import.tsx:

    if (file.type.startsWith("audio/")) {
        return "";
    }
  2. Handling audio files downstream in the React component:

    else if (file.type.startsWith("audio/")) {
        const sttClient = await getSpeechToTextClient();
        const sttProvider = allProvidersWithModels.find((p) => p.apiUrl === sttClient.baseURL);
        const sttModel = sttProvider?.models.find((model) => isSpeechToTextModel(model.name))?.name;
        const sr = new SpeechRecognition(sttModel, sttClient);
        const text = await sr.transcribe(file);
        chat.addMessage(new ChatCraftHumanMessage({ text: `${text}\n` }));
    }

This approach works, as shown in the attached video.

Screen.Recording.2024-11-18.at.8.18.38.PM.mov

However, it introduces two significant problems:

  1. The file-processing progress indicator never gets triggered for audio files:
    const progressId = progress({
        title: `Processing file${files.length > 1 ? "s" : ""}`,
        progressPercentage: 0,
    });
  2. The logic for audio file handling is fragmented and not aligned with the existing structure.

Request for Guidance

I believe the current structure of the code makes it difficult to implement audio file handling in a clean way. How would you suggest proceeding?

Would you like me to open a PR with my current changes so we can work collaboratively on refining the solution? Note that I’ve made a few other changes to ensure the input works as expected.

Thank you for your guidance!

@humphd (Collaborator, Author) commented Nov 19, 2024

Yeah, that's not ideal. We're going to need to refactor the AI logic out of hooks so that background processes like this can use it as well.

For now, let's make this work with OpenAI and Whisper, per https://platform.openai.com/docs/guides/speech-to-text. You can use settings.ts to call getSettings() and see if you have an OpenAI API key stored. If you do, show audio files as an option; otherwise, don't bother. Then pass that key and an OpenAI client instance to SpeechRecognition.
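
A rough sketch of that flow (import paths, the settings field name, and the Whisper model name are all assumptions, not the project's confirmed API):

import OpenAI from "openai";
import { getSettings } from "../lib/settings";
import { SpeechRecognition } from "../lib/speech-recognition";

// Sketch only: transcribe an audio file if an OpenAI key is stored;
// otherwise return null so the UI can skip offering audio import.
async function transcribeAudioFile(file: File): Promise<string | null> {
  const settings = getSettings();
  const apiKey = settings.apiKey; // assumed field name
  if (!apiKey) {
    return null; // no OpenAI key stored: don't show audio files as an option
  }
  const openai = new OpenAI({ apiKey, dangerouslyAllowBrowser: true });
  const sr = new SpeechRecognition("whisper-1", openai);
  return sr.transcribe(file);
}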

I'd make this work at all first, then we can figure out how to make it work in general. To be honest, the STT stuff is kind of separate from our other chat model handling, so we should probably extract it out, similar to what we're doing with Jina.ai.

cc'ing @Amnish04, who might have other thoughts.

@AryanK1511 commented Nov 19, 2024

Sounds good to me 🫡

Lemme make those changes and send in a PR

@AryanK1511 AryanK1511 linked a pull request Nov 19, 2024 that will close this issue
@AryanK1511

@humphd Everything works perfectly now, and the code is very consistent too. You can have a look at the PR I just sent in and let me know if everything looks good.
