dev_to · March 15, 2026


# Extracting Pedestrian Trajectories from Street Video as JSON

Tags: python, opencv-python, onnxruntime, yolox-tiny, computer-vision


## Motivation

Why extract pedestrian trajectories from smartphone video footage? This approach serves multiple purposes in my research on urban social movements:

- **GIS-ready data:** JSON output integrates seamlessly with geographic information systems and mapping tools
- **Cost-effective data collection:** Eliminates the need for expensive GPS trackers or surveillance infrastructure
- **Understanding pedestrian behavior:** Reveals how people move and interact in urban environments
- **Measuring protest reactions:** Quantifies how standing demonstrations affect surrounding pedestrian flow

This project emphasizes rapid deployment for protest monitoring. The entire setup requires only a smartphone and a tripod, enabling quick response to emerging events.

In urban planning and transportation studies, understanding pedestrian movement patterns is crucial for designing safer and more efficient public spaces. Previous methods like manual observation or GPS tracking have limitations in coverage and cost. Computer vision offers a scalable alternative through video analysis.

This article demonstrates how to extract pedestrian trajectories from street video footage using open-source tools. I'll use YOLOX-Tiny for real-time person detection and implement a custom centroid-based tracker to generate structured JSON trajectory data. The sample videos used in this project were captured on a smartphone, which keeps the setup lightweight and easy to deploy.

## Person Detection with YOLOX-Tiny

YOLOX-Tiny is a lightweight object detection model optimized for real-time inference. I use the ONNX export for cross-platform compatibility with OpenCV and ONNX Runtime.
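The letterbox preprocessing step resizes a frame to the model's input size while preserving aspect ratio, padding the remainder with a constant value; the returned ratio is what maps detections back to original pixel coordinates. Here is a minimal sketch of that step in plain NumPy (a nearest-neighbour resize stands in for `cv2.resize`, and the function name and 114 pad value are conventions, not the project's exact code):

```python
import numpy as np

def letterbox(img, dst_h=416, dst_w=416, pad_value=114):
    """Fit img inside (dst_h, dst_w) preserving aspect ratio; pad the rest."""
    src_h, src_w = img.shape[:2]
    ratio = min(dst_h / src_h, dst_w / src_w)
    new_h, new_w = int(src_h * ratio), int(src_w * ratio)
    # Nearest-neighbour resize in plain NumPy (the real pipeline would use cv2.resize)
    rows = np.minimum((np.arange(new_h) / ratio).astype(int), src_h - 1)
    cols = np.minimum((np.arange(new_w) / ratio).astype(int), src_w - 1)
    resized = img[rows][:, cols]
    # Paste the resized image into a padded canvas
    canvas = np.full((dst_h, dst_w, 3), pad_value, dtype=img.dtype)
    canvas[:new_h, :new_w] = resized
    return canvas, ratio

# Example: a 1080p frame scales by 416/1920, filling only the top 234 rows
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
padded, ratio = letterbox(frame)
```

Keeping the ratio around matters: a box detected at `(x, y)` in the 416x416 input corresponds to `(x / ratio, y / ratio)` in the original frame.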
The detection pipeline:

1. **Preprocessing:** Letterbox resizing to maintain aspect ratio
2. **Inference:** The YOLOX model processes the frame
3. **Postprocessing:** Convert detections to bounding boxes
4. **Filtering:** Confidence thresholding and non-maximum suppression (NMS)

## Centroid Tracking

For tracking detected persons across frames, I implement a simple but effective centroid tracker:

- Each detection's bounding box center becomes a centroid
- Tracks are maintained by matching centroids between frames
- New tracks are registered for unmatched detections
- Lost tracks are deregistered after a maximum disappearance threshold

For visualization, the tracked detections are also drawn onto the footage.

## Trajectory Metrics

For each complete trajectory, I extract:

- **Duration:** Total time the person was tracked
- **Distance:** Total pixels traveled
- **Direction:** Movement angle in degrees
- **Start/end positions:** Entry and exit points
- **Screen exit detection:** Whether the person left the frame

## Setup

```bash
# Required packages
pip install opencv-python numpy onnxruntime

# Download YOLOX-Tiny ONNX model
# From: https://github.com/Megvii-BaseDetection/YOLOX
```

## Detection Function

```python
def detect_persons(frame, session):
    # Preprocess frame
    blob, ratio = preprocess_yolox(frame, 416, 416)

    # Run inference
    output = session.run(None, {session.get_inputs()[0].name: blob})[0]

    # Postprocess detections
    # ... (filter by confidence, apply NMS)
    return boxes, confidences
```

## Full CentroidTracker Implementation

```python
from collections import defaultdict
import numpy as np

class CentroidTracker:
    """
    Centroid-based tracking algorithm for associating detected bounding
    boxes across frames.

    In addition to tracking centroids, it also maintains trajectories based
    on the foot point of the bounding box (the point where the person
    touches the ground), which is more stable for movement analysis.
    """

    def __init__(self, max_disappeared=50):
        self.next_object_id = 0
        self.objects = {}                       # ID: (centroid_x, centroid_y)
        self.disappeared = {}                   # ID: disappeared_frame_count
        self.trajectories = defaultdict(list)   # ID: [(x, y, frame), ...]
        self.first_seen = {}                    # ID: first frame detected
        self.last_seen = {}                     # ID: last frame detected
        self.max_disappeared = max_disappeared

    def register(self, centroid, foot_point, frame_num):
        """Register a new object with a unique ID."""
        self.objects[self.next_object_id] = centroid
        self.disappeared[self.next_object_id] = 0
        self.trajectories[self.next_object_id].append(
            (foot_point[0], foot_point[1], frame_num)
        )
        self.first_seen[self.next_object_id] = frame_num
        self.last_seen[self.next_object_id] = frame_num
        self.next_object_id += 1

    def deregister(self, object_id):
        """Deregister an object and remove it from tracking."""
        del self.objects[object_id]
        del self.disappeared[object_id]

    def update(self, rects, frame_num):
        """
        Update the tracker with new bounding box detections.

        Args:
            rects: list of detected bounding boxes [(x1, y1, x2, y2), ...]
            frame_num: the current frame number

        Returns:
            objects: a dictionary mapping object IDs to their current
            centroids (cx, cy)
        """
        # When no detections are present, mark existing objects as disappeared
        if len(rects) == 0:
            for object_id in list(self.disappeared.keys()):
                self.disappeared[object_id] += 1
                if self.disappeared[object_id] > self.max_disappeared:
                    self.deregister(object_id)
            return self.objects

        # Compute centroids and foot points for the current detections
        input_centroids = np.zeros((len(rects), 2), dtype="int")
        input_feet = np.zeros((len(rects), 2), dtype="int")
        for i, (x1, y1, x2, y2) in enumerate(rects):
            cx = int((x1 + x2) / 2.0)
            input_centroids[i] = (cx, int((y1 + y2) / 2.0))
            input_feet[i] = (cx, y2)  # foot point is the bottom center of the bounding box

        # If no existing objects, register all input centroids
        if len(self.objects) == 0:
            for i in range(len(input_centroids)):
                self.register(input_centroids[i], input_feet[i], frame_num)
        # Existing objects are present: match input centroids to existing object centroids
        else:
            object_ids = list(self.objects.keys())
            object_centroids = list(self.objects.values())

            # Compute distance matrix between existing object centroids and input centroids
            D = np.zeros((len(object_centroids), len(input_centroids)))
            for i, oc in enumerate(object_centroids):
                for j, ic in enumerate(input_centroids):
                    D[i, j] = np.linalg.norm(oc - ic)

            # Find the smallest-distance pairs (existing object to input centroid)
            rows = D.min(axis=1).argsort()
            cols = D.argmin(axis=1)[rows]

            used_rows = set()
            used_cols = set()
            for (row, col) in zip(rows, cols):
                if row in used_rows or col in used_cols:
                    continue
                # Consider it a match only when the distance is below a threshold
                if D[row, col] > 100:
                    # If the distance is too large, ignore the match (this threshold can be tuned)
                    continue
                object_id = object_ids[row]
                self.objects[object_id] = input_centroids[col]  # use the centroid for tracking
                self.disappeared[object_id] = 0
                self.trajectories[object_id].append(
                    # use the foot point for trajectory analysis
                    (input_feet[col][0], input_feet[col][1], frame_num)
                )
                self.last_seen[object_id] = frame_num
                used_rows.add(row)
                used_cols.add(col)

            # Existing objects that were not matched
            unused_rows = set(range(D.shape[0])) - used_rows
            for row in unused_rows:
                object_id = object_ids[row]
                self.disappeared[object_id] += 1
                if self.disappeared[object_id] > self.max_disappeared:
                    self.deregister(object_id)

            # Input centroids that were not matched
            unused_cols = set(range(D.shape[1])) - used_cols
            for col in unused_cols:
                self.register(input_centroids[col], input_feet[col], frame_num)

        return self.objects
```

## JSON Output

The trajectory data is saved as structured JSON:

```json
{
  "video_name": "street_footage.mp4",
  "fps": 30,
  "resolution": "1920x1080",
  "tracks": [
    {
      "id": 1,
      "duration": 12.5,
      "total_distance": 320.4,
      "trajectory": [
        {"x": 100, "y": 200, "frame": 10, "time_sec": 0.333},
        {"x": 105, "y": 202, "frame": 11, "time_sec": 0.367}
      ],
      "geometry": {
        "type": "LineString",
        "coordinates": [[100, 200], [105, 202]]
      }
    }
  ]
}
```

Processing a 5-minute video at 30 FPS typically yields:

The JSON output provides rich data for further analysis:

- Spatial patterns of movement
- Temporal distribution of pedestrian activity
- Flow direction analysis

## Advantages of This Approach

- **Cost-effective:** Uses commodity hardware and free software
- **Scalable:** Can process hours of footage automatically
- **Structured output:** JSON format integrates with GIS and analysis tools
- **Real-time capable:** YOLOX-Tiny enables live processing

The sample videos in this project were captured on a smartphone, but the same pipeline can be applied to fixed surveillance cameras for longer-term monitoring.

## Observations from Real Footage

Processing demonstration videos from Shinbashi station revealed insights about centroid tracking performance and pedestrian behavior during protests:

- **Commuter indifference:** In Japan, individual protests are uncommon, so commuters typically ignore demonstrators. Additionally, most people are busy office workers who tend to focus on their commute rather than noticing activities around them.
- **Camera height issues:** Using a smartphone camera with a low tripod created unreliable detections. People near the camera appeared with unnatural up-and-down trajectories due to the low-angle perspective.
- **ID swapping during interactions:** When pedestrians crossed paths or interacted closely, their tracking IDs would swap, creating fragmented trajectories for the same individuals.

Overall, the system successfully captured general movement patterns. Future improvements could include filtering trajectories with sudden angle changes after intersections or removing outliers based on historical movement differences.

## Limitations

- **Occlusion handling:** Simple centroid tracking fails in crowds
- **Camera motion:** Assumes a static camera position
- **Identity persistence:** No re-identification across camera cuts
- **Stopping behavior:** People who stop moving in videos sometimes lose their tracking ID due to the centroid distance threshold, leading to fragmented trajectories (e.g., ID 5 → 110 → 430 as the same person gets re-detected with new IDs)

For crowded scenes, more sophisticated trackers like DeepSORT or ByteTrack would improve performance.
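The angle-change filtering idea mentioned above could be sketched as a post-processing pass over the tracker's `trajectories` dictionary. This is a hypothetical filter, not part of the project code; the function names and the 120° threshold are my own illustrative choices:

```python
import math

def heading(p, q):
    """Movement angle in degrees from point p to point q."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))

def max_turn(points):
    """Largest frame-to-frame change of heading along a trajectory."""
    angles = [heading(points[i], points[i + 1]) for i in range(len(points) - 1)]
    turns = []
    for a, b in zip(angles, angles[1:]):
        d = abs(b - a) % 360.0
        turns.append(min(d, 360.0 - d))  # handle wrap-around (e.g. 350 deg vs 10 deg)
    return max(turns, default=0.0)

def drop_id_swaps(trajectories, turn_threshold=120.0):
    """Discard trajectories whose sharpest turn exceeds the threshold;
    a near-reversal mid-track is a common symptom of an ID swap."""
    return {tid: pts for tid, pts in trajectories.items()
            if max_turn(pts) <= turn_threshold}
```

The intuition: a pedestrian rarely reverses heading within a single frame, so a track that jumps from 0° to 180° most likely stitched two different people together at a crossing.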
Camera motion compensation using optical flow could extend applicability to moving platforms. In this project, I prioritized spending time on analysis and visualization rather than implementing the most advanced tracking pipeline; that tradeoff made it easier to iterate quickly with real data.

## Applications

This trajectory data serves as input for:

- **Urban planning:** Identifying pedestrian flow bottlenecks
- **Safety analysis:** Detecting high-risk crossing patterns
- **Traffic engineering:** Optimizing signal timing
- **Accessibility studies:** Understanding mobility patterns

The structured JSON format makes it easy to integrate with mapping libraries like MapLibre GL JS for visualization, as I'll explore in the next article.

## Conclusion

By combining YOLOX-Tiny detection with centroid tracking, I can extract meaningful pedestrian trajectory data from video footage. The resulting JSON structure provides a foundation for spatial analysis of urban movement patterns. While the current implementation works well for moderate-density scenarios, future enhancements could address occlusion and camera motion challenges. In the next article, I'll visualize these trajectories on an interactive map using MapLibre GL JS.

## References

- YOLOX-Tiny ONNX model: https://github.com/Megvii-BaseDetection/YOLOX
- Centroid tracker: https://pyimagesearch.com/2025/07/14/people-tracker-with-yolov12-and-centroid-tracker/
- My GitHub project: https://github.com/TOKIHISA/people_trajectory_analysis/blob/main/src/detects_people.py