Hand-Raise Detection System
A computer vision system that detects hand-raising gestures in videos using YOLOv8 object detection and pose estimation models.
Overview
This system uses a two-stage approach:
- Object Detection: Detects people in the video frame using YOLOv8
- Pose Estimation: Analyzes upper body keypoints to determine if hands are raised
The system outputs an annotated video with bounding boxes and skeleton overlays, plus JSON event logs with timestamps of hand-raise events.
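In outline, the per-frame loop might look like the sketch below. It assumes the Ultralytics YOLO API; the weight files (yolov8n.pt, yolov8n-pose.pt) and the hard-coded input path are illustrative, not necessarily what the script uses.

```python
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")         # stage 1: person detection
pose_model = YOLO("yolov8n-pose.pt")  # stage 2: keypoint estimation

cap = cv2.VideoCapture("input_video.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Stage 1: find people (COCO class 0) above the confidence threshold.
    people = detector(frame, classes=[0], conf=0.3, verbose=False)[0]
    for x1, y1, x2, y2 in people.boxes.xyxy.int().tolist():
        # Stage 2: run pose estimation on the upper-body crop of each person.
        crop = frame[y1:(y1 + y2) // 2, x1:x2]
        if crop.size == 0:
            continue
        keypoints = pose_model(crop, verbose=False)[0].keypoints
        # ...inspect shoulder/wrist geometry to decide whether a hand is raised
cap.release()
```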
Requirements
```
pip install ultralytics opencv-python torch
```
Supports Python 3.13+ with optional GPU acceleration via CUDA.
Usage
```
python hand_raise_detector.py [input_video.mp4]
```
Outputs:
- out.avi: Annotated video with visual overlays
- events.json: Structured event log
- events.ndjson: Newline-delimited JSON events
Configuration
Key parameters in the script:
- CONF_THRES: Detection confidence threshold (default: 0.3)
- UPPER_BODY_RATIO: Portion of the person bounding box to analyze (default: 0.5)
- SUSTAIN_FRAMES: Consecutive frames required for a valid detection (default: 15)
Development Challenges
1. Color Space Encoding (BGR vs RGB)
Getting the correct color encoding when streaming frames through Ultralytics required careful handling. OpenCV decodes frames as BGR by default, while the models expect RGB, so every frame must be converted before inference. The converted array also needs np.ascontiguousarray() to guarantee the contiguous memory layout the model input requires.
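A minimal sketch of the conversion; the helper name prepare_frame is hypothetical:

```python
import numpy as np

def prepare_frame(frame_bgr: np.ndarray) -> np.ndarray:
    """Convert an OpenCV BGR frame into the RGB layout the model expects."""
    frame_rgb = frame_bgr[:, :, ::-1]  # reverse the channel axis: BGR -> RGB
    # The reversed slice is a non-contiguous view; force a contiguous
    # memory layout before handing the array to the model.
    return np.ascontiguousarray(frame_rgb)
```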
2. Upper Body Isolation
Not fully solved. The current approach crops the top 50% of each detected person's bounding box, but this is a crude geometric approximation: it ignores pose variation, sitting versus standing positions, and any semantic understanding of anatomy.
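In code, the heuristic is roughly the following (crop_upper_body is a hypothetical helper; boxes are in xyxy pixel coordinates):

```python
import numpy as np

UPPER_BODY_RATIO = 0.5  # fraction of the person bbox (from the top) to keep

def crop_upper_body(frame: np.ndarray, box_xyxy) -> np.ndarray:
    """Crop the top UPPER_BODY_RATIO of a detected person's bounding box.

    A purely geometric cut: it knows nothing about where the torso
    actually is, which is why seated or bent poses break it.
    """
    x1, y1, x2, y2 = (int(v) for v in box_xyxy)
    return frame[y1:y1 + int((y2 - y1) * UPPER_BODY_RATIO), x1:x2]
```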
3. Person ID Tracking
Major limitation. The system assigns IDs based on detection order within each frame, and these IDs are not consistent across frames. A person detected first in one frame might be detected third in the next, so the event logs can attribute hand-raises to the wrong individuals.
4. Handling Overlapping/Intersecting People
When people overlap or stand close together, bounding boxes merge or occlude each other, pose keypoints may be assigned to the wrong person, and cropped regions may contain multiple people, confusing the pose model.
5. Confidence Thresholds & Temporal Smoothing
Finding optimal values required significant trial and error. These values are scene-dependent and may need tuning per video:
- Detection confidence (0.3): Balance between detection rate and false positives
- Keypoint confidence (0.4): Ensure reliable joint detections
- Sustain frames (15): Balance between responsiveness and stability
- Vertical threshold (0.15): Prevents "hands near face" false positives
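A sketch of how these parameters might interact per person (the detection confidence is applied earlier, at the model call). Keypoint indices follow the COCO layout used by YOLOv8 pose models (5/6 shoulders, 9/10 wrists); the helper names and normalizing the vertical threshold by bbox height are assumptions:

```python
KPT_CONF_THRES = 0.4  # per-keypoint confidence
SUSTAIN_FRAMES = 15   # consecutive positive frames required
VERT_THRES = 0.15     # wrist must be this far above the shoulder,
                      # as a fraction of the person bbox height

# COCO keypoint indices used by YOLOv8 pose models.
L_SHOULDER, R_SHOULDER, L_WRIST, R_WRIST = 5, 6, 9, 10

def hand_raised(kpts, kpt_conf, bbox_height: float) -> bool:
    """True if either wrist is clearly above its shoulder.

    kpts: (17, 2) array of (x, y) keypoints; image y grows downward,
    so 'above' means a smaller y value.
    """
    for shoulder, wrist in ((L_SHOULDER, L_WRIST), (R_SHOULDER, R_WRIST)):
        if kpt_conf[shoulder] < KPT_CONF_THRES or kpt_conf[wrist] < KPT_CONF_THRES:
            continue  # skip joints the pose model is unsure about
        if kpts[wrist][1] < kpts[shoulder][1] - VERT_THRES * bbox_height:
            return True
    return False

def update_sustain(counter: int, raised_now: bool) -> tuple[int, bool]:
    """Debounce per-frame flags: report only after SUSTAIN_FRAMES in a row."""
    counter = counter + 1 if raised_now else 0
    return counter, counter >= SUSTAIN_FRAMES
```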
Known Issues
False Positives:
- Quick arm movements and stretching/yawning gestures
- Occlusions causing inconsistent keypoint detections
- Poor lighting reducing keypoint confidence
- Crowded scenes confusing detection and pose models
False Negatives:
- Partial hand-raises (elbow bent, hand not high enough)
- Profile views with low keypoint visibility
- Brief hand-raises that end before the 15-frame sustain window is met
- Occlusions from furniture or other people
Possible Improvements
Quick Wins:
- Add person tracking: Use ByteTrack or DeepSORT for consistent IDs across frames (see the sketch after this list)
- Improve upper body detection: Use pose-based crops (shoulder-to-hip distance) instead of fixed ratio
- Expose config as CLI args: Enable per-video parameter tuning
- Add temporal smoothing: Moving average of hand positions over 3-5 frames
- Implement hysteresis: Prevent rapid on/off flickering with cooldown logic
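For the tracking quick win, Ultralytics ships a built-in ByteTrack integration; a sketch of what stable IDs could look like (model file and source path are placeholders):

```python
from ultralytics import YOLO

# A pose model also reports person boxes, so one tracked pass can
# serve both detection and keypoints.
model = YOLO("yolov8n-pose.pt")

results = model.track(source="input_video.mp4", stream=True,
                      persist=True, tracker="bytetrack.yaml", conf=0.3)
for result in results:
    if result.boxes.id is None:
        continue  # tracker has not assigned IDs yet (e.g., first frames)
    for track_id, box in zip(result.boxes.id.int().tolist(),
                             result.boxes.xyxy.tolist()):
        ...  # key hand-raise state on track_id, not on detection order
```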
Longer-Term Enhancements:
- Multi-person pose estimation: Single model pass for all people simultaneously
- Action recognition models: Use temporal models (SlowFast, X3D) instead of per-frame pose
- Calibration mode: Auto-tune thresholds using labeled ground truth clips
- Intent classification: Distinguish deliberate signals from casual gestures
- End-to-end fine-tuning: Train on hand-raise specific dataset
Performance
- Processing speed: roughly real-time on an RTX 4090 (~960 ms to process 1 second of video)
- Memory usage: ~2-4GB depending on video resolution
- Optimization opportunities: Batch processing, frame skipping, async I/O, model quantization
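As one example, frame skipping is a small change: run the models on every Nth frame and reuse the previous results in between. A sketch, with process() standing in for the full detection + pose pass:

```python
import cv2

DETECT_EVERY = 3  # run the models on every 3rd frame only

def process(frame):
    """Stand-in for the full detection + pose pass."""
    return []

cap = cv2.VideoCapture("input_video.mp4")
frame_idx, last_results = 0, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % DETECT_EVERY == 0:
        last_results = process(frame)  # expensive model pass
    # annotate/log using last_results; skipped frames reuse stale results
    frame_idx += 1
cap.release()
```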