A framework to enable multimodal models to operate a computer.
Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
- Compatibility: Designed for various multimodal models.
- Integration: Currently integrated with GPT-4V, Gemini Pro Vision, Claude 3, and LLaVA.
- Future Plans: Support for additional models.
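The view-decide-act cycle described above can be sketched as a small loop. This is a minimal illustration, not the framework's actual code: the `Action` schema, `run_objective`, and the stand-in `fake_capture` and `fake_decide` functions are all hypothetical names chosen for this example, and a real run would replace them with a screenshot call, a multimodal model request, and a mouse/keyboard driver.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema, for illustration only)."""
    kind: str                    # "click", "type", or "done"
    x: Optional[int] = None      # screen coordinates for a click
    y: Optional[int] = None
    text: Optional[str] = None   # text to type


def run_objective(
    objective: str,
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat screenshot -> decide -> act until the model reports "done"."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()
        action = decide(objective, screenshot, history)
        if action.kind == "done":
            break
        execute(action)
        history.append(action)
    return history


# Stand-ins so the sketch runs without a real screen or model.
def fake_capture() -> bytes:
    return b"fake-screenshot"


def fake_decide(objective: str, screenshot: bytes, history: List[Action]) -> Action:
    scripted = [Action("click", x=100, y=200), Action("type", text=objective)]
    return scripted[len(history)] if len(history) < len(scripted) else Action("done")


performed: List[Action] = []
steps = run_objective("open the browser", fake_capture, fake_decide, performed.append)
print([a.kind for a in steps])  # ['click', 'type']
```

The `max_steps` cap is a design choice for the sketch: it keeps the loop from running forever if the model never reports that the objective is complete.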
At HyperwriteAI, we are developing Agent-1-Vision, a multimodal model with more accurate click location predictions.
We will soon be offering API access to our Agent-1-Vision model.
If you're interested in gaining access to this API, sign up here.
- Install the project
pip install self-operating-computer
- Run the project
operate
- Enter your OpenAI Key: If you don't have one, you can obtain an OpenAI key here
- Give the Terminal app the required permissions: As a last step, the Terminal app will ask for permission for "Screen Recording" and "Accessibility" in the "Security & Privacy" page of the Mac's "System Preferences".
An additional model is now compatible with the Self-Operating Computer Framework. Try Google's gemini-pro-vision by following the instructions below.
Start operate with the Gemini model:
operate -m gemini-pro-vision
Enter your Google AI Studio API key when the terminal prompts you for it. If you don't have one, you can obtain a key here after setting up your Google AI Studio account. You may also need to authorize credentials for a desktop application. It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.
Use Claude 3 with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the Claude dashboard to get an API key and run the command below to try it.
operate -m claude-3
If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
Note: Ollama currently only supports macOS and Linux.
First, install Ollama on your machine from https://ollama.ai/download.
Once Ollama is installed, pull the LLaVA model:
ollama pull llava
This will download the model to your machine, which takes approximately 5 GB of storage.
When Ollama has finished pulling LLaVA, start the server:
ollama serve
That's it! Now start operate and select the LLaVA model:
operate -m llava
Important: Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.
Learn more about Ollama at its GitHub Repository
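Before running operate -m llava, it can help to confirm that the server started by ollama serve is actually answering. The snippet below is a small sketch for that check; it assumes the server listens on Ollama's default local address, http://localhost:11434, and the helper name `server_is_up` is made up for this example.

```python
import urllib.error
import urllib.request

# Assumption: Ollama's default local address; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434"


def server_is_up(url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status all count as "not up".
        return False


if __name__ == "__main__":
    if server_is_up():
        print("Ollama server is reachable; try: operate -m llava")
    else:
        print("No response; start the server with: ollama serve")
```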
The framework supports voice inputs for the objective. Try voice by following the instructions below. Clone the repo to a directory on your computer:
git clone https://github.com/OthersideAI/self-operating-computer.git
Change into the directory:
cd self-operating-computer
Install the additional requirements from requirements-audio.txt:
pip install -r requirements-audio.txt
Install device requirements.
For Mac users:
brew install portaudio
For Linux users:
sudo apt install portaudio19-dev python3-pyaudio
Run with voice mode
operate --voice
The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the gpt-4-with-ocr mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to click an element by its text, and the code then looks up that text in the hash map to get the coordinates of the element GPT-4 wanted to click.
Based on recent tests, OCR performs better than SoM and vanilla GPT-4, so we made it the default for the project. To use the OCR mode, you can simply write:
operate
Running operate -m gpt-4-with-ocr will also work.
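The lookup behind the OCR mode can be illustrated with a short sketch. It is not the project's implementation: the OCR result format, the lower-casing of keys, and the function names `build_click_map` and `resolve_click` are assumptions made for this example. The idea it shows is the one described above: text detected on screen is stored in a hash map, and the text the model asks to click is resolved to coordinates.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical OCR output: (text, (left, top, right, bottom)) with pixel coordinates.
OcrResult = Tuple[str, Tuple[int, int, int, int]]
Point = Tuple[int, int]


def build_click_map(results: List[OcrResult]) -> Dict[str, Point]:
    """Map each detected text (lower-cased) to the center of its bounding box."""
    click_map: Dict[str, Point] = {}
    for text, (left, top, right, bottom) in results:
        key = text.strip().lower()
        # Keep the first occurrence when the same text appears more than once.
        click_map.setdefault(key, ((left + right) // 2, (top + bottom) // 2))
    return click_map


def resolve_click(click_map: Dict[str, Point], requested_text: str) -> Optional[Point]:
    """Return the click coordinates for the text the model chose, or None if absent."""
    return click_map.get(requested_text.strip().lower())


ocr_results: List[OcrResult] = [
    ("Sign in", (100, 40, 180, 60)),
    ("Search", (300, 40, 380, 60)),
]
click_map = build_click_map(ocr_results)

print(resolve_click(click_map, "search"))    # (340, 50)
print(resolve_click(click_map, "Checkout"))  # None
```

Returning None for text that was not detected lets the caller decide what to do next, for example taking a fresh screenshot and asking the model again.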
The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the gpt-4-with-som command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.
Learn more about SoM Prompting in the detailed arXiv paper: here.
For this initial version, a simple YOLOv8 model is trained for button detection, and the best.pt file is included under model/weights/. Users are encouraged to swap in their best.pt file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
Start operate with the SoM model:
operate -m gpt-4-with-som
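To make the SoM idea concrete, here is a rough sketch of how detected buttons could be turned into numbered marks that a model can refer to. It assumes a detector has already produced bounding boxes; the box format, the reading-order sort, the numeric labels, and the `assign_marks` helper are all hypothetical and may differ from what the framework does.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Point = Tuple[int, int]


def assign_marks(boxes: List[Box]) -> Dict[str, Point]:
    """Label boxes "1", "2", ... in reading order and map each label to its box center."""
    # Sort top-to-bottom, then left-to-right, so labels follow a predictable order.
    ordered = sorted(boxes, key=lambda box: (box[1], box[0]))
    marks: Dict[str, Point] = {}
    for index, (left, top, right, bottom) in enumerate(ordered, start=1):
        marks[str(index)] = ((left + right) // 2, (top + bottom) // 2)
    return marks


# Hypothetical detector output for three buttons.
detected: List[Box] = [(300, 40, 380, 60), (100, 40, 180, 60), (100, 200, 220, 240)]
marks = assign_marks(detected)

# If the model answers "click mark 2", the label resolves to screen coordinates.
print(marks["2"])  # (340, 50)
```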
If you want to contribute yourself, see CONTRIBUTING.md.
For any input on improving this project, feel free to reach out to Josh on Twitter.
For real-time discussions and community support, join our Discord server.
- If you're already a member, join the discussion in #self-operating-computer.
- If you're new, first join our Discord Server and then navigate to the #self-operating-computer channel.
Stay updated with the latest developments:
- This project is compatible with macOS, Windows, and Linux (with X server installed).
The gpt-4-vision-preview model is required. To unlock access to this model, your account needs to spend at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5.
Learn more here