preternatural-explore/photo-translator

Important

Created by Preternatural AI, an exhaustive client-side AI infrastructure for Swift.
This project and the frameworks it uses are currently in the alpha stage of development.

PhotoTranslator: Generate Creative Sentences in a Foreign Language from a Photo

The PhotoTranslator app uses OpenAI's Vision API to bring translations into the user's surroundings. The user takes a photo, and an on-device YOLO model identifies the objects in it. The app then generates creative sentences in the target language about the picture as a whole and about each detected object, along with audio in that language produced by the ElevenLabs API, making language learning engaging and immersive.

MIT License

Usage

Supported Platforms

macOS, iOS, iPadOS

To install and run the PhotoTranslator app:

  1. Download and open the project
  2. Add your OpenAI API Key in the LLMClientManager file:
// AIManagers/LLMClientManager
private static let client: any LLMRequestHandling = OpenAI.Client(
    apiKey: "YOUR_API_KEY"
)

You can get the OpenAI API key on the OpenAI developer website. Note that you have to set up billing and add a small amount of money for the API calls to work (this will cost you less than 1 dollar).

  3. Add your ElevenLabs API Key in the TTSClientManager file:
// AIManagers/TTSClientManager
static let client = ElevenLabs.Client(apiKey: "YOUR_API_KEY")

ElevenLabs is a text-to-speech service that the PhotoTranslator app uses to generate audio of the translated sentence in the target language. You can get your ElevenLabs API key on the ElevenLabs website; it is located in your user profile.

  4. Select the target language for translation. The app is currently set to Hindi.
// AIManagers/LLMClientManager
private static let targetLanguage = "Hindi"
  5. Create the target language speaker in AIManagers/Speakers. The app is currently set to a HindiSpeaker:
// AIManagers/Speakers
// change the speaker to your target language
// you can find the voice for your target language on the ElevenLabs website
struct HindiSpeaker: Speaker { 
    let speakerName: String = "Akshay"
    let elevenLabsVoiceID = "qO2mI1DuN2aagyvZHwwt"
}
  6. Run the app on a physical device (iPhone, iPad, or Mac), since a camera is required to take a photo.
  7. Take a photo and wait for the app to generate creative sentences about the photo in your target language, each with an English translation.

Bug: The photo is currently displayed rotated by 90 degrees on iPhone and iPad.
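
The installation steps above hardcode both API keys in source files, which makes it easy to commit them by accident. One alternative, shown here as a minimal sketch rather than anything the project does, is to read the keys from environment variables. The helper `apiKey(named:in:)`, the error type `APIKeyError`, and the variable name `OPENAI_API_KEY` are all hypothetical names introduced for illustration.

```swift
import Foundation

/// Hypothetical error type (not part of the project).
enum APIKeyError: Error, Equatable {
    case missing(String)
}

/// Hypothetical helper: looks up an API key by name and rejects
/// missing or blank values. The environment is injectable for testing.
func apiKey(
    named name: String,
    in environment: [String: String] = ProcessInfo.processInfo.environment
) throws -> String {
    guard
        let value = environment[name]?
            .trimmingCharacters(in: .whitespacesAndNewlines),
        !value.isEmpty
    else {
        throw APIKeyError.missing(name)
    }
    return value
}

// Possible use inside LLMClientManager (assumes the variable is set):
// OpenAI.Client(apiKey: try apiKey(named: "OPENAI_API_KEY"))
```

In Xcode, environment variables can be set per scheme, so they are only present when the app is launched from Xcode. For anything beyond local experimentation, a different mechanism for storing secrets would be needed.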

Key Concepts

The PhotoTranslator app was developed to demonstrate the following key concepts:

  • Using OpenAI's Vision API
  • Function calling to get structured data from LLMs
  • Integrating ElevenLabs Multilingual Audio generation
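
To illustrate the second concept: with function calling, the model returns its answer as JSON arguments that match a declared schema, and the app can decode them into typed values. The sketch below assumes a hypothetical schema; the type names (`TranslatedSentence`, `ObjectDescription`, `PhotoDescription`) and their fields are illustrative and may differ from what LLMClientManager actually defines.

```swift
import Foundation

// Hypothetical shapes, not the project's actual types.
struct TranslatedSentence: Codable, Equatable {
    let sentence: String            // sentence in the target language
    let transliteration: String     // Latin-script rendering
    let englishTranslation: String
}

struct ObjectDescription: Codable, Equatable {
    let boxNumber: Int              // number of the box drawn on the photo
    let sentence: TranslatedSentence
}

struct PhotoDescription: Codable, Equatable {
    let overall: TranslatedSentence // sentence about the whole picture
    let objects: [ObjectDescription]
}

/// Decodes the JSON arguments string of a function call into typed values.
/// Throws if the model's output does not match the expected shape.
func decodePhotoDescription(fromArguments json: String) throws -> PhotoDescription {
    try JSONDecoder().decode(PhotoDescription.self, from: Data(json.utf8))
}
```

Decoding into `Codable` types means a malformed response fails at one well-defined point instead of surfacing later as missing text in the UI.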

Preternatural Frameworks

The following Preternatural Frameworks were used in this project:

  • AI: The definitive, open-source Swift framework for interfacing with generative AI.
  • Media: Media makes it stupid simple to work with media capture & playback in Swift.

Technical Specifications

The PhotoTranslator uses several AI frameworks in the following steps:

  1. The user captures a photo
  2. The photo is analyzed by the on-device YOLOv8 model, which detects and identifies individual objects within the image. Each object is highlighted with a uniquely colored, numbered box. See PhotoObjectDetectionManager for the implementation.
  3. The processed photo is sent to OpenAI using the completion API with function calling. This step generates creative sentences in the app's target language about the picture as a whole and about each individual object identified in it. A transliteration and an English translation are also provided for each sentence. See LLMClientManager for the implementation.
  4. Finally, the translated text is converted into spoken audio using ElevenLabs' voice synthesis technology, so the user can learn how to say the sentence in the app's target foreign language. See TTSClientManager for implementation.

As a result, the PhotoTranslator app exemplifies the effective integration of diverse AI technologies to create a comprehensive and interactive language learning tool.
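
The steps above can be sketched as a small pipeline in which each stage sits behind a protocol. Every name below (`ObjectDetecting`, `SentenceGenerating`, `SpeechSynthesizing`, `PhotoTranslationPipeline`, and the value types) is a hypothetical stand-in rather than the project's actual API, and the calls are synchronous only to keep the sketch short; real network requests would normally be `async`.

```swift
import Foundation

// Hypothetical value types, not the project's actual ones.
struct DetectedObject: Equatable {
    let boxNumber: Int
    let label: String
}

struct SpokenSentence: Equatable {
    let text: String
    let audio: Data
}

// One protocol per stage: detection, sentence generation, speech synthesis.
protocol ObjectDetecting {
    func detectObjects(in photo: Data) throws -> [DetectedObject]
}

protocol SentenceGenerating {
    func sentences(for photo: Data,
                   objects: [DetectedObject],
                   language: String) throws -> [String]
}

protocol SpeechSynthesizing {
    func speech(for text: String) throws -> Data
}

struct PhotoTranslationPipeline {
    let detector: any ObjectDetecting
    let generator: any SentenceGenerating
    let synthesizer: any SpeechSynthesizing

    /// Runs detection, then sentence generation, then speech synthesis
    /// for every generated sentence, preserving sentence order.
    func run(photo: Data, language: String) throws -> [SpokenSentence] {
        let objects = try detector.detectObjects(in: photo)
        let texts = try generator.sentences(for: photo,
                                            objects: objects,
                                            language: language)
        return try texts.map { text in
            SpokenSentence(text: text, audio: try synthesizer.speech(for: text))
        }
    }
}
```

Keeping each stage behind a protocol lets the on-device model, the LLM client, and the text-to-speech client be swapped for stubs in tests without touching the orchestration code.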

License

This package is licensed under the MIT License.
