YOLOE 50+ FPS face-tracking project: a duet between a laptop and an ESP32

This project combines a high-speed YOLOE vision engine running on a laptop (50+ FPS) with an ESP32 that wirelessly controls servos, sensors, and peripherals in real time. The result is a powerful dual-device architecture that delivers desktop-class AI performance together with microcontroller precision—without a single control wire. From intelligent face tracking to robotic arms, smart surveillance, and AI-powered automation, this project opens the door to a new generation of maker innovations.


Introduction: This project is a duet between a YOLOE detection model running on a laptop and servos plus a ToF sensor running on an ESP32. Together they form a 50+ FPS (frames per second) face-tracking system, in which the detection part runs on the laptop while the remaining control and sensing tasks run on the ESP32.

Earlier face-tracking projects based on the Raspberry Pi 5 + Hailo-8L combination produced around 30 FPS. On the other hand, when YOLOE models in ONNX, PT, or NCNN format are run directly on a Raspberry Pi 5, the output typically reaches only 3 to 5 FPS. On top of that, these models generally need to run within a GUI environment on the Raspberry Pi, which further increases the load on the system.

In contrast, when the same models are executed on a moderately powerful computer or laptop, they run at substantially higher speeds, typically 10 to 55+ FPS depending on the hardware and model type. However, laptops do not provide a straightforward way to directly control hardware accessories such as relays, motors, or servos, which limits their use in robotics or physical automation projects.

Thus, the advantages of both worlds—high-speed AI processing available on laptops and efficient hardware control through microcontrollers—often remain underutilized when these systems operate independently. To address this limitation, we designed a hybrid architecture that extends the capabilities of the laptop through peripheral hardware control.

In the proposed system, the laptop performs the computationally intensive AI tasks such as face detection, while an ESP32 microcontroller manages the physical hardware components, including servos and sensors. The ESP32 is interfaced with a 128×64 OLED display for real-time status visualization. Communication between the laptop and the ESP32 is established through the ESPConnect protocol, enabling the laptop to transmit commands while the ESP32 executes hardware control operations in real time. (On a computer with a GPU, the frame rate can be expected to improve further.)

A VL53L0X Time-of-Flight (ToF) distance sensor is connected to the ESP32, and the measured distance is transmitted to the laptop and displayed on the live video feed. This reading is likely to be useful once the camera has locked on to an object, for example when a robotic arm has to reach out and grasp it. Simultaneously, the laptop sends servo control commands and the detected face count to the ESP32, where they are displayed on the OLED screen for local monitoring. To ensure uninterrupted execution of the computer vision pipeline, the Python program on the MacBook Air is launched using the following command.

$> caffeinate -i python your_python_code.py  # prevents the system from entering sleep mode during operation
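The two-way exchange described above (servo commands and face count going to the ESP32, ToF distance coming back) can be pictured with a small message codec. The article does not specify the wire format used over ESPConnect, so the line-based text format below, including the `CMD` and `TOF` tags and both function names, is purely a hypothetical sketch and not the authors' implementation.

```python
from typing import Optional


def encode_command(pan_deg: int, tilt_deg: int, face_count: int) -> bytes:
    """Pack pan/tilt angles and the face count into one text line.

    Hypothetical format: b"CMD,<pan>,<tilt>,<faces>\n".
    Angles are clamped to the 0-180 degree range of a hobby servo.
    """
    pan = max(0, min(180, int(pan_deg)))
    tilt = max(0, min(180, int(tilt_deg)))
    faces = max(0, int(face_count))
    return f"CMD,{pan},{tilt},{faces}\n".encode("ascii")


def decode_distance(line: bytes) -> Optional[int]:
    """Parse a hypothetical b"TOF,<millimetres>\n" reply from the ESP32.

    Returns the distance in millimetres, or None if the line is malformed.
    """
    try:
        tag, value = line.decode("ascii").strip().split(",")
    except (UnicodeDecodeError, ValueError):
        return None
    if tag != "TOF" or not value.isdigit():
        return None
    return int(value)
```

Keeping each message on a single short line makes it easy to log and to parse on the microcontroller side, whatever transport actually carries it.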

Changes in scan methodology: Unlike the Raspberry Pi + Hailo-8L implementation where a zig-zag scanning method was used, this system uses a raster scan methodology. The reason for this change is that the pan–tilt control is handled by the ESP32, which has slower command execution compared to the direct control available on the Raspberry Pi + Hailo-8L setup.

In the raster scan approach, the camera performs a fast horizontal sweep (0–180° pan) and then steps vertically (tilt) before sweeping again in the opposite direction. This pattern systematically covers the entire field of view and allows the system to acquire faces much faster and more reliably than the earlier zig-zag scan. In addition, the tracker is never allowed to lock onto a single face. Instead, it keeps exploring the frame slowly, discovers new faces as they enter, and settles on the updated group centre. This behaviour is similar to that of meeting-room auto-tracking cameras [shown below].
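The raster pattern described above can be sketched as a generator of (pan, tilt) angles. This is a minimal illustration and not the authors' code: the function name, the step sizes, and the tilt limits are assumptions chosen for the example, and only the 0–180° pan range comes from the text.

```python
from typing import Iterator, Tuple


def raster_scan(pan_min: int = 0, pan_max: int = 180, pan_step: int = 10,
                tilt_min: int = 30, tilt_max: int = 90,
                tilt_step: int = 15) -> Iterator[Tuple[int, int]]:
    """Yield (pan, tilt) angles row by row, reversing the sweep on each row."""
    left_to_right = True
    for tilt in range(tilt_min, tilt_max + 1, tilt_step):
        pans = list(range(pan_min, pan_max + 1, pan_step))
        if not left_to_right:
            pans.reverse()  # sweep back in the opposite direction
        for pan in pans:
            yield pan, tilt
        left_to_right = not left_to_right  # alternate direction after each tilt step
```

In a search loop, the laptop would send each yielded pair to the ESP32 and stop iterating as soon as a face is detected, handing over to the tracking logic.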

Camera placement: The pan–tilt servos operate based on a closed-loop control mechanism, where positional adjustments are driven by feedback derived from the camera’s field of view. Therefore, it is essential to mount the camera directly onto the pan–tilt assembly so that any movement in the visual frame can be immediately translated into corresponding servo actions.
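The closed-loop idea above can be illustrated with a simple proportional correction: the offset of the face group's centre from the middle of the frame is turned into a small change of servo angle. This is a sketch under stated assumptions, not the authors' controller. The gain, the deadband, and the sign of the correction (which depends on how the servo is mounted) are all hypothetical values, as are the function names.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def group_centre(boxes: List[Box]) -> Optional[Tuple[float, float]]:
    """Return the mean centre of all face boxes, or None if there are none."""
    if not boxes:
        return None
    cx = sum((x1 + x2) / 2 for x1, _, x2, _ in boxes) / len(boxes)
    cy = sum((y1 + y2) / 2 for _, y1, _, y2 in boxes) / len(boxes)
    return cx, cy


def servo_update(angle: float, target_px: float, frame_size_px: float,
                 gain: float = 0.05, deadband_px: float = 20.0,
                 lo: float = 0.0, hi: float = 180.0) -> float:
    """One proportional step for a single axis (pan or tilt).

    The error is the distance of the target from the frame centre in pixels.
    Inside the deadband the angle is left unchanged to avoid jitter.
    The minus sign assumes that increasing the angle moves the view towards
    smaller pixel coordinates; flip it if the servo is mounted the other way.
    """
    error = target_px - frame_size_px / 2
    if abs(error) <= deadband_px:
        return angle
    return max(lo, min(hi, angle - gain * error))
```

Because the camera rides on the pan–tilt assembly, every correction changes the next frame, so the error shrinks step by step until the group centre sits inside the deadband.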

To achieve this, a separate USB camera (webcam) has been used instead of the laptop’s built-in camera, allowing the camera to move physically along with the pan–tilt mechanism. The Python code automatically detects the connected camera, making the setup straightforward. Alternatively, the entire laptop can be mounted onto the pan–tilt system, although this is generally less practical due to size and weight constraints.

YOLOE-Based Detection and System Architecture: Having understood the working principles of YOLO, we now extend the discussion to YOLOE, introduced by its authors under the title “Real-Time Seeing Anything”, which represents a significant evolution in object detection. While traditional YOLO models are restricted to detecting a fixed set of predefined object classes, YOLOE introduces a prompt-driven mechanism that allows users to dynamically define objects of interest without requiring model retraining.

In conventional workflows, incorporating a new object into the detection pipeline involves collecting large datasets and performing computationally intensive training or fine-tuning, often requiring high-end GPU resources. YOLOE eliminates this constraint by shifting from a training-centric approach to a prompt-based approach. Users can define objects either through text prompts (for objects already present in the base model) or image prompts (for new or custom objects). The model encodes these prompts and associates them with internal representations, enabling it to recognise visually similar objects during inference without further training.

This capability enables highly specific and context-aware detection. For example, the system can be configured to detect not only general categories such as a person, but also contextual conditions such as a person wearing glasses or the presence of a smartphone. It can also identify unique objects, such as a specific personal item, using only a few reference images. By narrowing the detection scope to user-defined prompts, the computational load is reduced, resulting in faster and more efficient inference.

To further enhance performance, the YOLOE model is converted into the NCNN format, which is optimised for CPU-based inference. This conversion significantly reduces runtime overhead and enables efficient deployment on systems without dedicated GPUs. In the implemented setup, a laptop performs all YOLOE inference, delivering performance that is substantially superior to embedded platforms such as Raspberry Pi, while an ESP32 microcontroller handles pan–tilt servo control and sensor interfacing. Communication between the two is achieved using the lightweight ESPConnect protocol, ensuring low-latency and reliable command transmission. This hybrid architecture enables smooth and responsive real-time tracking. Two practical approaches are used for customising detection.

1. Objects Available in the Base Model (Text Prompting): When the desired objects already exist in the pre-trained YOLOE model (e.g., yoloe-11s-seg.pt), they can be selected using text prompts. These prompts are embedded directly into the exported model, resulting in a lighter and more focused NCNN model optimised for deployment. This approach enables efficient multi-object detection with real-time performance. In practice, the system achieves approximately 15–30 FPS for multiple objects on a laptop, while specialisation to a single object (such as person tracking) can increase performance to around 50 FPS.

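Text-Prompt-NCNN-Conversion.py:
from ultralytics import YOLOE

# Load the pretrained checkpoint ('...11l' for the large model, '...11s' for the small one)
model = YOLOE("yoloe-11s-seg.pt")
# Text prompts: the only classes the exported model will detect
names = ["person", "remote", "glasses", "magnifying glass", "smartphone", "clock", "banana", "book"]
# Embed the text prompts into the model before exporting it
model.set_classes(names, model.get_text_pe(names))
# Export to NCNN format with a 640-pixel input size
model.export(format="ncnn", imgsz=640)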

This code generates an NCNN-optimized model with the listed prompts embedded in it, ready for deployment on the laptop.
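Frame rates such as those quoted above are easiest to compare when they are measured the same way on every machine. The following rolling FPS meter is one possible way to do that. It is an illustrative helper, not part of the authors' code, and the class name and the 30-frame window are assumptions.

```python
from collections import deque


class FpsMeter:
    """Rolling frames-per-second estimate over the last `window` frames."""

    def __init__(self, window: int = 30):
        self._stamps = deque(maxlen=window)  # oldest timestamps drop out automatically

    def tick(self, now: float) -> float:
        """Record one frame at time `now` (seconds) and return the current FPS."""
        self._stamps.append(now)
        if len(self._stamps) < 2:
            return 0.0  # not enough frames yet to form an interval
        span = self._stamps[-1] - self._stamps[0]
        return (len(self._stamps) - 1) / span if span > 0 else 0.0
```

In the detection loop, calling `meter.tick(time.perf_counter())` once per processed frame gives a smoothed figure that can be drawn on the video feed or logged.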

2. Objects Not Present in the Model (Image Prompting): For objects not included in the base model, YOLOE supports image prompting, where a small number of reference images (captured from different viewpoints) are used to guide the model. The model learns the visual patterns from these examples and associates them with a unique internal identifier, enabling detection of similar objects in subsequent frames. This method allows rapid and lightweight customisation without retraining, making it highly suitable for real-world applications.

Overall, the combination of prompt-based object definition, NCNN optimisation, and distributed processing between a laptop and ESP32 results in an efficient and scalable system. The laptop provides high-speed AI inference, while the ESP32 ensures responsive hardware control, together enabling a practical solution for real-time intelligent tracking applications.

Current Limitation: During experimentation, it was observed that text prompting and image prompting cannot yet be combined into a single model pipeline. At present, these two approaches must be implemented as separate models. It is possible that future versions of YOLOE may support hybrid prompting, enabling both methods to coexist within the same detection model.

Necessary additional software: For any Python-based development work, it is always recommended to work inside a virtual environment. This ensures that new installations do not interfere with an already existing and stable system-wide Python setup. Once your work is complete, you can simply deactivate the virtual environment and return to your original configuration without any side effects.

Python virtual environment setup:
$> python3 -m venv venv
$> source venv/bin/activate
$> # ... do all your work here ...
$> deactivate

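The following Python modules represent the minimum required set for working with YOLO / YOLOE:

pip install ultralytics opencv-python-headless numpy onnx onnxruntime torch torchvision torchaudio

Note that opencv-python-headless is built without GUI support, so window functions such as cv2.imshow are not available with it. If the live video feed is to be shown in an OpenCV window, install opencv-python instead.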
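After installation, a quick check inside the virtual environment can confirm that every package is importable. The sketch below uses only the standard library. The helper name is hypothetical, and the list holds the import names of the packages installed above (note that the OpenCV package is imported as `cv2`).

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names corresponding to the pip packages listed above
REQUIRED = ["ultralytics", "cv2", "numpy", "onnx", "onnxruntime",
            "torch", "torchvision", "torchaudio"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only looks up each module without importing it, so it runs quickly even when large packages such as torch are present.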

NCNN model for face: Since the standard YOLOE models used in this experimentation do not include a predefined “face” object category, a separate pretrained YOLOv8 face detection model was obtained and converted for use within the system. A suitable PyTorch model was downloaded from the following GitHub repository:

https://github.com/lindevs/yolov8-face/releases

In these model names, the suffix indicates the model size: ‘n’ for nano, ‘s’ for small, ‘m’ for medium, and ‘l’ for large. Smaller models are faster but slightly less accurate, while larger models generally provide better detection accuracy at the cost of higher computational requirements.

After downloading the PyTorch model, it was converted into the NCNN inference format, which is optimized for CPU-based edge devices and improves runtime efficiency. The conversion can be performed with either of the following commands. The first uses the default input size, while the second sets the input size explicitly:

$> yolo export model=yolov8s-face-lindevs.pt format=ncnn

$> yolo export model=yolov8s-face-lindevs.pt format=ncnn imgsz=320  # or imgsz=640

Once converted, the resulting NCNN model can be deployed directly for real-time face detection within the application pipeline.

ESP32 Process flow & Schematic: On its I2C bus, the ESP32 is connected to the SH110 OLED display, the VL53L0X ToF sensor, and the PCA9685 servo driver. The PCA9685 can drive up to 16 servo motors, enough for several robotic arm movements such as holding, picking, and shifting. However, each new movement requires modifying the ESP32 code and uploading it to the ESP32 again.
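The arithmetic behind driving a servo from the PCA9685 can be worked through in a few lines. The chip divides each PWM period into 4096 steps (12-bit resolution), so a servo angle is first mapped to a pulse width and then to a step count. The sketch below shows this calculation in Python for illustration only. The ESP32 firmware is not listed in the article, and the 50 Hz update rate and the 500–2500 µs pulse range are assumptions that should be checked against the datasheet of the servo actually used.

```python
def angle_to_ticks(angle_deg: float, freq_hz: float = 50.0,
                   min_us: float = 500.0, max_us: float = 2500.0) -> int:
    """Convert a servo angle (0-180 degrees) to a PCA9685 'off' tick count.

    min_us and max_us are the assumed pulse widths for 0 and 180 degrees.
    """
    angle = max(0.0, min(180.0, angle_deg))          # clamp to the servo range
    pulse_us = min_us + (max_us - min_us) * angle / 180.0
    period_us = 1_000_000.0 / freq_hz                # 20000 us at 50 Hz
    return round(pulse_us / period_us * 4096)        # 4096 steps per period
```

With these assumed values, the mid position of 90° corresponds to a 1500 µs pulse, which is 7.5% of the 20 ms period.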

A few typical use cases of this dual-device project:

1. Smart Surveillance & Active Monitoring: The system can be used for intelligent surveillance, where it automatically tracks people or specific objects in real time. Unlike fixed CCTV cameras, the pan–tilt mechanism ensures full-area coverage, while YOLOE allows detection of context-specific events such as a person carrying a particular object or entering restricted zones.

2. Meeting Room / Conference Auto-Framing: In office or home environments, the system can function as an auto-tracking camera that dynamically frames participants during meetings. By tracking multiple faces (using centroid tracking), it ensures that all participants remain within view, similar to professional conferencing systems.

3. Assistive Monitoring (Elderly / Patient Care): The system can monitor elderly individuals or patients by tracking their movement and detecting specific conditions (e.g., presence, inactivity, or unusual posture). Combined with the fall detection project published earlier in EFY, it can become a fully automated tracking and safety monitoring system.

4. Human–Robot Interaction Interface: With YOLOE’s prompt-based detection, the system can recognise gestures or objects (e.g., hand signals, remote, tools) and trigger actions via ESP32-controlled devices. This enables a low-cost human–machine interaction system without complex training pipelines.

5. Smart Retail / Customer Analytics: In retail environments, the system can track customer movement, count visitors, or identify interactions with specific products (e.g., person picking up an item). The pan–tilt mechanism allows dynamic focus on areas of interest, improving coverage compared to static cameras.

6. Personal Object Tracking: Using image prompting, the system can track specific personal items (e.g., glasses, phone, tools). This is useful in workshops or homes where locating frequently used objects quickly is important.

7. Security for Restricted Objects: The system can be configured to detect specific sensitive objects (e.g., equipment, devices) and trigger alerts if they are moved or accessed. Since YOLOE allows custom object definition without retraining, it is highly adaptable.

8. Educational and Research Platform: This setup is an excellent experimental platform for computer vision and robotics. It demonstrates integration of:

• Real-time AI inference (YOLOE + NCNN)
• Embedded control (ESP32 + servos + sensors)
• Network communication (ESPConnect)
making it ideal for prototyping advanced AI-based systems.

9. Event / Crowd Monitoring: In events or public spaces, the system can track groups of people and dynamically adjust the camera to follow crowd movement. Multi-face centroid tracking fits perfectly here for smooth group framing.

10. Smart Home Automation Extension: The system can act as a vision-based controller—for example:

• Turn lights ON when a person is detected near a display gallery etc.
• Track user movement across rooms
• Trigger appliances based on object detection (ornaments, remote, phone, etc.)

R-1. Vision-Guided Robotic Arm (Pick-and-Place System): By adding multiple servos (base rotation, shoulder, elbow, wrist, gripper), the system can be extended into a robotic arm capable of picking and placing objects. YOLOE enables identification of target objects through text or image prompts (e.g., bottle, tool, component), while the laptop computes object position and sends coordinated commands to the ESP32. The arm can then approach, grasp, and relocate objects autonomously.
Applications: small-scale automation, lab sorting, assistive robotics, automatic shopkeeper

R-2. Intelligent Object Retrieval System: This extension focuses on locating and fetching specific items. Using image prompting, the system can recognise a unique object (e.g., a particular charger, remote, or tool). With an added weight sensor, it could also pick up, say, 1 kg of apples from a basket. The pan–tilt unit first scans and locks onto the object, the ToF sensor measures the distance, and the robotic arm then aligns itself and performs a targeted pick-up operation.
Applications: assistive systems for elderly users, smart workshops, inventory handling, automatic shopkeeper.

R-3. Autonomous Surveillance Turret with Actuation: With additional servos for directional control and a mechanical actuator (e.g., pointer, laser, or alarm trigger), the system can act as an active surveillance turret. It not only tracks a subject but can also respond physically—for example, pointing at intrusions, triggering alarms, or activating deterrent mechanisms or firing arms. YOLOE enables detection of specific conditions (e.g., person in restricted area, object removal).
Applications: security systems, restricted zone monitoring, industrial safety.

R-4. Gesture-Controlled Robotic Manipulator: By combining gesture recognition with YOLOE, the system can be extended into a gesture-driven robotic arm. The camera detects hand gestures or poses, and corresponding commands are sent to the ESP32 to control multiple servos in real time. This allows intuitive human–robot interaction, where users can guide arm movement, gripping, or object placement using simple gestures.
Applications: touchless control systems, assistive robotics, interactive demos.

Conclusion: This project demonstrates how prompt-driven vision using YOLOE can eliminate traditional training bottlenecks while enabling real-time intelligent tracking. The integration of high-performance laptop inference with ESP32-based actuation creates a balanced and efficient dual-device architecture. Its flexibility allows seamless extension from passive observation to active robotic interaction. Overall, the system establishes a scalable foundation for future AI-driven automation and assistive technologies.

Project operation video: https://youtu.be/Y20zCTBk9f8

Bye bye

Somnath Bera and Biswajit Purkait.