Welcome to YO-FLO, a proof-of-concept implementation of YOLO-like object detection using the Florence-2-base-ft model. Inspired by the YOLO (You Only Look Once) object detection framework, YO-FLO leverages the Florence-2 foundation vision model to achieve real-time inference while maintaining a lightweight footprint.
- Introduction
- Features
- Installation
- Usage
- Error Handling
- Contributing
- License
YO-FLO explores whether the Florence-2 foundation vision model can be used in a YOLO-like format for object detection. Florence-2 is designed by Microsoft as a unified vision-language model capable of handling diverse tasks such as object detection, captioning, and segmentation. To achieve this, it uses a sequence-to-sequence framework in which an image and a task-specific prompt are processed together to generate the desired text output. The model's architecture combines a DaViT vision encoder with a transformer-based multi-modal encoder-decoder, making it both versatile and efficient.
Florence-2 has been trained on the extensive FLD-5B dataset, containing 126 million images and over 5 billion annotations, ensuring high-quality performance across multiple tasks. Despite its relatively small size, Florence-2 demonstrates strong zero-shot and fine-tuning capabilities, making it an excellent choice for real-time applications.
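Florence-2 returns task results as a dictionary keyed by the task prompt; for object detection (`<OD>`) the value holds parallel `bboxes` and `labels` lists. The sketch below shows how such a result can be filtered down to a single target class, the way YO-FLO's class-specific detection works conceptually. The `filter_detections` helper and the mocked result are illustrative, not part of YO-FLO's actual API:

```python
def filter_detections(result, target_class=None):
    """Filter a Florence-2 '<OD>' result down to one class.

    `result` is assumed to look like:
    {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['cat', ...]}}
    """
    data = result.get("<OD>", {})
    pairs = zip(data.get("bboxes", []), data.get("labels", []))
    if target_class is None:
        return list(pairs)  # no class filter: show all detections
    return [(box, label) for box, label in pairs
            if label.lower() == target_class.lower()]

# Example with a mocked model output:
mock = {"<OD>": {"bboxes": [[10, 20, 110, 220], [5, 5, 50, 50]],
                 "labels": ["cat", "dog"]}}
print(filter_detections(mock, "cat"))  # only the 'cat' box survives
```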
- Real-Time Object Detection: Achieve YOLO-like performance using the Florence-2-base-ft model.
- Class-Specific Detection: Specify the class of objects you want to detect (e.g., 'cat', 'dog').
- Expression Comprehension: Detect objects or states by asking natural-language yes/no questions (e.g., 'Is the person smiling?').
- Beep and Screenshot on Detection: Toggle options to beep and take screenshots when the target class or phrase is detected.
- Tkinter GUI: A user-friendly graphical interface for easy interaction.
- Cross-Platform Compatibility: Works on Windows, macOS, and Linux.
- Toggle Headless Mode: Enable or disable headless mode for running without GUI.
- Update Inference Rate: Display the rate of inferences per second during real-time detection.
- Screenshot on Yes/No Inference: Automatically save screenshots based on yes/no answers from expression comprehension.
- Visual Grounding: Identify and highlight specific regions in an image based on descriptive phrases.
- Evaluate Inference Tree: Use a tree of inference phrases to evaluate multiple conditions in a single run.
- Plot Bounding Boxes: Visualize detection results by plotting bounding boxes on the image.
- Save Screenshots: Save screenshots of detected objects or regions of interest.
- Robust Error Handling: Comprehensive error management for smooth operation.
- Webcam Detection Control: Start and stop webcam-based detection with ease.
- Debug Mode: Toggle detailed logging for development and troubleshooting purposes.
- Python 3.7 or higher
- pip
Install the required dependencies with pip:

pip install torch transformers pillow opencv-python colorama simpleaudio huggingface-hub
To start YO-FLO, run the following command:
python yo-flo.py
You will be presented with a menu of options:
- Select Model Path: Choose a local directory containing the Florence model.
- Download Model from HuggingFace: Download and initialize the Florence-2-base-ft model from HuggingFace.
- Set Class Name: Specify the class name you want to detect (leave blank to show all detections).
- Set Phrase: Enter the phrase for comprehension detection (e.g., 'Is the person smiling?', 'Is the cat lying down?').
- Set Visual Grounding Phrase: Enter the phrase for visual grounding.
- Set Inference Tree: Enter multiple inference phrases to evaluate several conditions.
- Toggle Beep on Detection: Enable or disable the beep sound on detection.
- Toggle Screenshot on Detection: Enable or disable taking screenshots on detection.
- Toggle Screenshot on Yes/No Inference: Enable or disable taking screenshots based on yes/no inference results.
- Start Webcam Detection: Begin real-time object detection using your webcam.
- Stop Webcam Detection: Stop the webcam detection and return to the menu.
- Toggle Debug Mode: Enable or disable debug mode for detailed logging.
- Toggle Headless Mode: Enable or disable headless mode for running without GUI.
- Exit: Exit the application.
A typical workflow:
- Select Model Path or Download Model from HuggingFace.
- Set Class Name to specify what you want to detect (e.g., 'cat', 'dog').
- Set Phrase for specific phrase-based inference.
- Set Visual Grounding Phrase to identify and highlight the regions described by a phrase.
- Set Inference Tree for evaluating multiple conditions.
- Toggle Beep on Detection if you want an audible alert.
- Toggle Screenshot on Detection if you want to save screenshots of detections.
- Toggle Screenshot on Yes/No Inference to save screenshots based on comprehension results.
- Start Webcam Detection to begin detecting objects in real-time.
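The workflow above boils down to a capture-infer-react loop. Below is a minimal sketch with the frame source and inference function injected so the loop itself is easy to test; all names are illustrative, not YO-FLO's actual API. With a real webcam, `frame_source` would wrap `cv2.VideoCapture(0)` and `infer` would call the Florence-2 model:

```python
import time

def detection_loop(frame_source, infer, on_detect, max_frames=None):
    """Run inference on frames until the source is exhausted.

    frame_source: callable returning a frame, or None to stop.
    infer:        callable frame -> list of detections.
    on_detect:    callback invoked with the frame when detections occur.
    Returns (frames processed, mean inference rate in frames/sec).
    """
    processed, start = 0, time.perf_counter()
    while max_frames is None or processed < max_frames:
        frame = frame_source()
        if frame is None:               # webcam closed / stream ended
            break
        detections = infer(frame)
        if detections:
            on_detect(frame, detections)  # e.g. beep, save screenshot
        processed += 1
    elapsed = time.perf_counter() - start
    return processed, processed / elapsed if elapsed > 0 else 0.0

# Dry run with three fake frames and a stub detector:
frames = iter(["f1", "f2", "f3"])
hits = []
n, fps = detection_loop(lambda: next(frames, None),
                        lambda f: ["obj"] if f == "f2" else [],
                        lambda f, d: hits.append(f))
print(n, hits)  # 3 frames processed; only 'f2' triggered on_detect
```

Returning the mean rate from the loop is one simple way to surface the inferences-per-second figure mentioned in the features list.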
YO-FLO includes robust error handling to ensure smooth operation:
- Model Initialization Errors: Handles cases where the model path is incorrect or the model fails to load.
- Webcam Access Errors: Notifies if the webcam cannot be accessed.
- Image Processing Errors: Catches errors during frame processing and provides detailed messages.
- File Not Found Errors: Alerts if required files (e.g., beep sound file) are missing.
- General Exception Handling: Catches and logs any unexpected errors to prevent crashes.
Common error messages include:
- Error loading model: Model path not found or model failed to load.
- Error running object detection: Issues during object detection process.
- Error plotting bounding boxes: Problems with visualizing detection results.
- Error toggling beep: Issues enabling or disabling the beep sound.
- Error saving screenshot: Problems saving detection screenshots.
- OpenCV error: Errors related to OpenCV operations.
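Each of these messages corresponds to a guarded operation: the failing action is logged with a descriptive label and the application keeps running. A generic sketch of that pattern, with illustrative names not taken from YO-FLO's source:

```python
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("yo-flo")

def guarded(action_name, func, *args, **kwargs):
    """Run func; on failure, log a descriptive message instead of crashing."""
    try:
        return func(*args, **kwargs)
    except FileNotFoundError as e:
        log.error("File not found while %s: %s", action_name, e)
    except Exception as e:
        log.error("Error %s: %s", action_name, e)  # catch-all, keep running
    return None

# A failing operation is logged instead of raising:
result = guarded("running object detection", lambda: 1 / 0)
print(result)  # None
```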
We welcome contributions to improve YO-FLO. Please follow these steps:
- Fork the repository.
- Create a new branch (git checkout -b feature-branch).
- Commit your changes (git commit -am 'Add new feature').
- Push to the branch (git push origin feature-branch).
- Create a new Pull Request.
YO-FLO is licensed under the MIT License.
Thank you for using YO-FLO! We are excited to see what amazing applications you will build with this tool. Happy detecting!