arxiv_cs_cv February 10, 2026

Neural Sentinel: Unified Vision Language Model (VLM) for License Plate Recognition with Human-in-the-Loop Continual Learning

Original Content

arXiv:2602.07051v1 Announce Type: new Abstract: Traditional Automatic License Plate Recognition (ALPR) systems employ multi-stage pipelines in which an object detection network is followed by a separate Optical Character Recognition (OCR) module, introducing compounding errors, increased latency, and architectural complexity. This research presents Neural Sentinel, a novel unified approach that leverages Vision Language Models (VLMs) to perform license plate recognition, state classification, and vehicle attribute extraction in a single forward pass. Our primary contribution lies in demonstrating that a fine-tuned PaliGemma 3B model, adapted via Low-Rank Adaptation (LoRA), can simultaneously answer multiple visual questions about vehicle images, achieving 92.3% plate recognition accuracy, a 14.1% improvement over the EasyOCR baseline and a 9.9% improvement over the PaddleOCR baseline. We introduce a Human-in-the-Loop (HITL) continual learning framework that incorporates user corrections while preventing catastrophic forgetting through experience replay, maintaining a 70:30 ratio of original training data to correction samples. The system achieves a mean inference latency of 152 ms with an Expected Calibration Error (ECE) of 0.048, indicating well-calibrated confidence estimates. Additionally, the VLM-first architecture enables zero-shot generalization to auxiliary tasks including vehicle color detection (89%), seatbelt detection (82%), and occupancy counting (78%) without task-specific training. Through extensive experimentation on real-world toll plaza imagery, we demonstrate that unified vision-language approaches represent a paradigm shift in ALPR systems, offering superior accuracy, reduced architectural complexity, and emergent multi-task capabilities that traditional pipeline approaches cannot achieve.
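The abstract mentions two mechanisms that are easy to illustrate in isolation: experience replay at a 70:30 ratio of original data to corrections, and Expected Calibration Error. The following is a minimal sketch, not the authors' implementation. The function names, the sampling-with-replacement strategy, and the choice of 10 equal-width bins are assumptions made here for illustration; the abstract specifies only the 70:30 ratio and the reported ECE value.

```python
import random


def build_replay_batch(original, corrections, batch_size,
                       original_fraction=0.7, rng=None):
    """Mix original training samples with user corrections in one batch.

    original_fraction=0.7 mirrors the 70:30 ratio stated in the abstract.
    The sampling strategy itself is an assumption, not the paper's method.
    """
    rng = rng or random.Random()
    n_original = round(batch_size * original_fraction)
    n_corrections = batch_size - n_original
    # Sample with replacement so a small correction pool can still fill
    # its share of the batch (both pools must be non-empty).
    batch = (rng.choices(original, k=n_original)
             + rng.choices(corrections, k=n_corrections))
    rng.shuffle(batch)
    return batch


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Uses equal-width bins over [0, 1]; n_bins=10 is an assumed default.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece
```

With `batch_size=10`, `build_replay_batch` draws 7 original samples and 3 corrections. An ECE near zero means the predicted confidences closely track the observed accuracy, which is how the abstract interprets its value of 0.048.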