I Read One Paper and Ended Up Swapping Visual AI Models 3 Times
One day I stumbled across a paper called ShowUI. A vision model that looks at screenshots and understands UI elements. "That sounds fun" — I thought. That curiosity led to 3 model swaps, an accessibility app concept, and a project I never shipped.
I came across ShowUI-2B by OpenBMB. Feed it a screenshot, and it detects buttons, text fields, icons — all the UI elements on screen. A Vision model purpose-built for understanding interfaces.
"I could build something with this." That thought started everything.
When I actually ran it, the results didn't match the paper. On Korean-language UIs — especially heavily styled sites with custom CSS — it was bad. It couldn't even locate the username and password input fields. Not "low accuracy." It couldn't find them at all. Maybe 1 success out of 10 attempts. The model was also 4.7GB — not small.
The testing environment was painful too. I couldn't set up a proper GPU environment, so I force-quantized the model and ran it on CPU. A simple test — feed a screenshot, get back UI element coordinates — took up to 5 minutes to return results. On a GPU, this would take seconds. Instead, I could make coffee and come back to find it still running.
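That simple test amounted to one model call plus trivial post-processing. ShowUI-style grounding models return a normalized [x, y] click point as text; the exact output format here is an assumption from the paper's examples, and `to_pixels` is an illustrative helper, not part of any library:

```python
import ast

def to_pixels(model_output: str, width: int, height: int) -> tuple[int, int]:
    """Convert a normalized "[x, y]" click point (values in 0..1),
    as ShowUI-style models emit, into pixel coordinates."""
    x, y = ast.literal_eval(model_output.strip())
    return round(x * width), round(y * height)

# Example: map a predicted point onto a 1080x2400 phone screenshot
print(to_pixels("[0.32, 0.71]", 1080, 2400))  # (346, 1704)
```

The post-processing runs in microseconds; it was the model call in front of it that took minutes on CPU.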
The concept of "AI that understands UIs" was compelling. This particular model just wasn't good enough.
ShowUI wasn't perfect, but the idea of "AI that sees and understands screens" stuck with me. As I searched for better models, the concept expanded.
If AI can understand UI elements on screen, could it also read traffic light colors? Bus numbers? Could it help visually impaired people navigate daily life?
That's how I started planning an accessibility assistant app for Android — the camera sees the world, AI processes it, and voice tells the user what's happening.
Features I needed:
Traffic light recognition (red/green)
Bus number reading (OCR)
App UI automation (detect buttons and fields → automate interactions)
To fix ShowUI's accuracy issues, I found UI-TARS-2B by ByteDance (the company behind TikTok).
It was definitely better than ShowUI. More accurate at distinguishing specific UI elements, and about 2GB with INT8 quantization — less than half the size. But this model could only understand UIs.
My tech stack at this point looked like:
Traffic lights / bus numbers → Qwen-7B (general vision model)
UI detection → UI-TARS-2B (UI specialist)
Two models to manage simultaneously. Memory allocation, model switching logic, error handling — everything doubled in complexity.
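A minimal sketch of what that dual-model juggling looked like, assuming only one model fits in memory at a time. All names here are illustrative stand-ins, not real APIs:

```python
class ModelManager:
    """Routes each task to its model; only one model stays resident."""
    ROUTES = {"traffic_light": "qwen-7b", "bus_ocr": "qwen-7b",
              "ui_detect": "ui-tars-2b"}

    def __init__(self):
        self.loaded = None  # name of the currently resident model

    def _switch_to(self, name: str):
        # In a real app this is an expensive unload + load cycle
        if self.loaded != name:
            self.loaded = name

    def run(self, task: str, frame) -> str:
        self._switch_to(self.ROUTES[task])
        return f"{self.loaded} -> {task}"  # stand-in for real inference

mgr = ModelManager()
print(mgr.run("ui_detect", None))      # ui-tars-2b -> ui_detect
print(mgr.run("traffic_light", None))  # triggers a model swap first
```

Every boundary between the two specialists is a place where memory pressure, swap latency, and error handling can go wrong; collapsing to one model deletes that whole class of code.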
Then it clicked. Qwen-7B is a general-purpose Vision Language Model. Can't it understand UIs too?
I tested on desktop and the results were promising:
Traffic light recognition: 88%
Bus number OCR: 80%
UI element detection: 75%
75% is lower than UI-TARS's 90%, but being able to do everything with a single model meant cutting complexity in half. UI-TARS was no longer necessary.
The question was: could this run on a phone?
"Just run the AI on a server" — obviously that would be easier. But for this app, it wasn't an option.
When a visually impaired person points their camera at the world, the footage captures their home, their routes, the people around them. Continuously streaming this sensitive video data to a server is a serious privacy problem. Especially since assistive devices tend to stay on all day — you'd essentially be enabling real-time location tracking.
So all AI inference had to happen on the device itself. Camera data never leaves the phone. This decision became the constraint that shaped every model choice that followed.
With that in mind, I set out to move Qwen-7B from desktop to mobile.
Here's where reality hit hard.
First, Android can't run PyTorch or HuggingFace models directly. You must convert to ONNX format. Finding a good model isn't enough — you also need to confirm it can be converted to ONNX and that performance holds after conversion.
I tried converting Qwen-7B to ONNX myself, but converting a 7B-parameter VLM turned out to be far more complex than expected. I gave up. And even before conversion was a problem, the model was simply too large — most devices ran out of memory and couldn't even load it.
The direction — "one model for everything" — was right. But 7B was more than mobile could handle.
The final answer was Qwen-2B VL. A smaller version of Qwen-7B that retains Vision Language capabilities. Where 7B couldn't even load on mobile, 2B VL actually ran.
Qwen-2B VL on-device results:
Size: 1.2–1.5GB (INT4)
Inference speed: 7–9 seconds
Battery life: 3–6 hours
Heat: 42–44°C
Traffic light recognition: 88%
Bus number OCR: 80%
UI element detection: 75%
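Qwen-2B VL's 1.2–1.5GB INT4 footprint checks out with back-of-envelope arithmetic, assuming roughly 2.2B weights at 4 bits each; the remaining few hundred MB would come from layers kept in higher precision plus tokenizer and runtime files:

```python
params = 2.2e9                # approximate weight count of a "2B" model
bytes_total = params * 4 / 8  # INT4 = 4 bits = half a byte per weight
gb = bytes_total / 1024**3
print(round(gb, 2))  # ~1.02 GB for the quantized weights alone
```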
The accuracy wasn't stellar, but I figured that could be improved later with fine-tuning. What mattered was that it actually ran on a phone. And an ONNX-converted version was already available on HuggingFace — no manual conversion needed.
Technically, I'd finally found the answer.
The model problem was solved. But when I stepped back and looked at the full project, the scope was impossible for one person.
Traffic light recognition, bus OCR, UI automation, voice guidance, GPS navigation, accessibility testing — each of these is a project on its own. The biggest burden was accessibility testing. Building an app that blind users can actually use requires TalkBack (screen reader) compatibility, voice feedback timing, haptic pattern design — specialized domains that you can't just learn solo. It requires iterative testing with actual visually impaired users.
And for a service targeting blind users, "mostly works" isn't acceptable. 88% accuracy is fine for a regular app, but misreading a traffic light is a matter of life and safety. Even if fine-tuning could improve accuracy, collecting and validating that fine-tuning data would be yet another project in itself.
Researching existing apps like BlindSquare confirmed it. This space has dedicated teams who've been refining their products for years. Trying to build an MVP solo in 4 weeks wasn't a technology problem — it was a scope problem.
Stopping the project wasn't giving up. It was redirecting my resources where they could actually make an impact.
Read ShowUI paper → "This looks fun" → Tested it, 1 in 10 success rate
Idea expands → "Could this help blind people?" → Start planning the app
Switch to UI-TARS → "A better model" → Complexity doubles
Merge into Qwen-7B → "One model for everything" → Can't run on mobile
Find Qwen-2B VL → "This is it!" → Actually works on phone
Reality check → "Too big for one person" → Project ends
1. Paper benchmarks ≠ real-world performance
ShowUI's paper looked impressive, but on Korean UIs with heavy CSS styling, it couldn't even find input fields — 1 in 10 attempts. Papers report results under optimal conditions. Your environment will be different.
2. One generalist model > multiple specialists
Rather than ShowUI for UIs, a separate model for traffic lights, another for OCR — a single Vision Language Model like Qwen VL did "everything well enough." One model at 75% across all tasks beats four models at 90% each, in practice.
3. Mobile is a different world
A 7B model that runs beautifully on desktop couldn't even load on a phone. If you're planning on-device AI, start with mobile constraints — memory, battery, heat — not desktop performance.
4. The detour was worth it
This app never shipped, but I don't regret it. Quantizing models by hand, waiting 5 minutes for CPU inference on a single coordinate test, measuring phone temperatures — you can't learn this stuff from tutorials. Most importantly, it was fun. That experience is why I can confidently run local AI (Ollama + Qwen3) in TalkWith.chat today.