Hand-Raise Detection System
A computer vision system that detects hand-raising gestures in videos using YOLOv8 object detection and pose estimation models.
Overview
This system uses a two-stage approach:
- Object Detection: Detects people in the video frame using YOLOv8
- Pose Estimation: Analyzes upper body keypoints to determine if hands are raised
The system outputs an annotated video with bounding boxes and skeleton overlays, plus JSON event logs with timestamps of hand-raise events.
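In outline, the per-frame loop might look like the sketch below. It assumes the Ultralytics YOLO API; the weight files (yolov8n.pt, yolov8n-pose.pt) and the hard-coded input path are illustrative, not necessarily what the script uses.

```python
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")         # stage 1: person detection
pose_model = YOLO("yolov8n-pose.pt")  # stage 2: keypoint estimation

cap = cv2.VideoCapture("input_video.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Stage 1: find people (COCO class 0) above the confidence threshold.
    people = detector(frame, classes=[0], conf=0.3, verbose=False)[0]
    for x1, y1, x2, y2 in people.boxes.xyxy.int().tolist():
        # Stage 2: run pose estimation on the upper-body crop of each person.
        crop = frame[y1:(y1 + y2) // 2, x1:x2]
        if crop.size == 0:
            continue
        keypoints = pose_model(crop, verbose=False)[0].keypoints
        # ...inspect shoulder/wrist geometry to decide whether a hand is raised
cap.release()
```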
Requirements
```
pip install ultralytics opencv-python torch
```
Supports Python 3.13+ with optional GPU acceleration via CUDA.
Usage
```
python hand_raise_detector.py [input_video.mp4]
```
Outputs:
- out.avi: Annotated video with visual overlays
- events.json: Structured event log
- events.ndjson: Newline-delimited JSON events
Configuration
Key parameters in the script:
- CONF_THRES: Detection confidence threshold (default: 0.3)
- UPPER_BODY_RATIO: Portion of the person bounding box to analyze (default: 0.5)
- SUSTAIN_FRAMES: Consecutive frames required for a valid detection (default: 15)
Development Challenges
1. Color Space Encoding (BGR vs RGB)
Getting the correct color encoding when streaming frames through Ultralytics required careful handling. OpenCV decodes frames as BGR by default, while the models expect RGB, so every frame must be converted before inference. The converted array also needs np.ascontiguousarray() to guarantee the contiguous memory layout the model input requires.
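A minimal sketch of the conversion; the helper name prepare_frame is hypothetical:

```python
import numpy as np

def prepare_frame(frame_bgr: np.ndarray) -> np.ndarray:
    """Convert an OpenCV BGR frame into the RGB layout the model expects."""
    frame_rgb = frame_bgr[:, :, ::-1]  # reverse the channel axis: BGR -> RGB
    # The reversed slice is a non-contiguous view; force a contiguous
    # memory layout before handing the array to the model.
    return np.ascontiguousarray(frame_rgb)
```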
2. Upper Body Isolation
Not fully solved. The current approach crops the top 50% of each detected person's bounding box, but this is a crude geometric approximation: it ignores pose variation, sitting versus standing positions, and any semantic understanding of anatomy.
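In code, the heuristic is roughly the following (crop_upper_body is a hypothetical helper; boxes are in xyxy pixel coordinates):

```python
import numpy as np

UPPER_BODY_RATIO = 0.5  # fraction of the person bbox (from the top) to keep

def crop_upper_body(frame: np.ndarray, box_xyxy) -> np.ndarray:
    """Crop the top UPPER_BODY_RATIO of a detected person's bounding box.

    A purely geometric cut: it knows nothing about where the torso
    actually is, which is why seated or bent poses break it.
    """
    x1, y1, x2, y2 = (int(v) for v in box_xyxy)
    return frame[y1:y1 + int((y2 - y1) * UPPER_BODY_RATIO), x1:x2]
```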
3. Person ID Tracking
Major limitation. The system assigns IDs based on detection order within each frame, and these IDs are not consistent across frames. A person detected first in one frame might be detected third in the next, so the event logs can attribute hand-raises to the wrong individuals.
4. Handling Overlapping/Intersecting People
When people overlap or stand close together, bounding boxes merge or occlude each other, pose keypoints may be assigned to the wrong person, and cropped regions may contain multiple people, confusing the pose model.
5. Confidence Thresholds & Temporal Smoothing
Finding optimal values required significant trial and error. These values are scene-dependent and may need tuning per video:
- Detection confidence (0.3): Balance between detection rate and false positives
- Keypoint confidence (0.4): Ensure reliable joint detections
- Sustain frames (15): Balance between responsiveness and stability
- Vertical threshold (0.15): Prevents "hands near face" false positives
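A sketch of how these parameters might interact per person (the detection confidence is applied earlier, at the model call). Keypoint indices follow the COCO layout used by YOLOv8 pose models (5/6 shoulders, 9/10 wrists); the helper names and normalizing the vertical threshold by bbox height are assumptions:

```python
KPT_CONF_THRES = 0.4  # per-keypoint confidence
SUSTAIN_FRAMES = 15   # consecutive positive frames required
VERT_THRES = 0.15     # wrist must be this far above the shoulder,
                      # as a fraction of the person bbox height

# COCO keypoint indices used by YOLOv8 pose models.
L_SHOULDER, R_SHOULDER, L_WRIST, R_WRIST = 5, 6, 9, 10

def hand_raised(kpts, kpt_conf, bbox_height: float) -> bool:
    """True if either wrist is clearly above its shoulder.

    kpts: (17, 2) array of (x, y) keypoints; image y grows downward,
    so 'above' means a smaller y value.
    """
    for shoulder, wrist in ((L_SHOULDER, L_WRIST), (R_SHOULDER, R_WRIST)):
        if kpt_conf[shoulder] < KPT_CONF_THRES or kpt_conf[wrist] < KPT_CONF_THRES:
            continue  # skip joints the pose model is unsure about
        if kpts[wrist][1] < kpts[shoulder][1] - VERT_THRES * bbox_height:
            return True
    return False

def update_sustain(counter: int, raised_now: bool) -> tuple[int, bool]:
    """Debounce per-frame flags: report only after SUSTAIN_FRAMES in a row."""
    counter = counter + 1 if raised_now else 0
    return counter, counter >= SUSTAIN_FRAMES
```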
Known Issues
False Positives:
- Quick arm movements and stretching/yawning gestures
- Occlusions causing inconsistent keypoint detections
- Poor lighting reducing keypoint confidence
- Crowded scenes confusing detection and pose models
False Negatives:
- Partial hand-raises (elbow bent, hand not high enough)
- Profile views with low keypoint visibility
- Brief hand-raises that end before the 15-frame sustain window is met
- Occlusions from furniture or other people
Possible Improvements
Quick Wins:
- Add person tracking: Use ByteTrack or DeepSORT for consistent IDs across frames (see the sketch after this list)
- Improve upper body detection: Use pose-based crops (shoulder-to-hip distance) instead of fixed ratio
- Expose config as CLI args: Enable per-video parameter tuning
- Add temporal smoothing: Moving average of hand positions over 3-5 frames
- Implement hysteresis: Prevent rapid on/off flickering with cooldown logic
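For the tracking quick win, Ultralytics ships a built-in ByteTrack integration; a sketch of what stable IDs could look like (model file and source path are placeholders):

```python
from ultralytics import YOLO

# A pose model also reports person boxes, so one tracked pass can
# serve both detection and keypoints.
model = YOLO("yolov8n-pose.pt")

results = model.track(source="input_video.mp4", stream=True,
                      persist=True, tracker="bytetrack.yaml", conf=0.3)
for result in results:
    if result.boxes.id is None:
        continue  # tracker has not assigned IDs yet (e.g., first frames)
    for track_id, box in zip(result.boxes.id.int().tolist(),
                             result.boxes.xyxy.tolist()):
        ...  # key hand-raise state on track_id, not on detection order
```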
Longer-Term Enhancements:
- Multi-person pose estimation: Single model pass for all people simultaneously
- Action recognition models: Use temporal models (SlowFast, X3D) instead of per-frame pose
- Calibration mode: Auto-tune thresholds using labeled ground truth clips
- Intent classification: Distinguish deliberate signals from casual gestures
- End-to-end fine-tuning: Train on hand-raise specific dataset
Performance
- Processing speed: roughly real-time on an RTX 4090 (~960 ms to process 1 second of video)
- Memory usage: ~2-4GB depending on video resolution
- Optimization opportunities: Batch processing, frame skipping, async I/O, model quantization
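As one example, frame skipping is a small change: run the models on every Nth frame and reuse the previous results in between. A sketch, with process() standing in for the full detection + pose pass:

```python
import cv2

DETECT_EVERY = 3  # run the models on every 3rd frame only

def process(frame):
    """Stand-in for the full detection + pose pass."""
    return []

cap = cv2.VideoCapture("input_video.mp4")
frame_idx, last_results = 0, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % DETECT_EVERY == 0:
        last_results = process(frame)  # expensive model pass
    # annotate/log using last_results; skipped frames reuse stale results
    frame_idx += 1
cap.release()
```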