Why my real-time Google Meet translator runs on your laptop, not my server
$0/month in infra costs. Audio that never leaves the user's device. Real-time two-way voice translation in Google Meet. Here's the architecture trick that made all three possible at the same time.
I built a Chrome extension that does real-time, two-way voice translation in Google Meet. You speak Russian, your colleague hears English. They reply in German, you hear Russian. Subtitles, TTS, the whole thing.
Then I had to figure out how to ship it.
Most "AI in your meeting" tools follow the same playbook: the client streams mic audio to a backend, the backend pays for STT + LLM + TTS, and the user pays a subscription that hopefully covers the bill plus some. That model has two problems I didn't want:
Every minute of conversation is a row in my AWS bill, and I have no upside on heavy users.
Every minute of conversation is also someone else's microphone going through my server. That's a privacy story I didn't want to maintain.
So MeetVoice ships a different way.
The architecture is two things you don't usually combine:
Bring-your-own-key (BYOK): users plug in their own Deepgram + Groq + (optional) OpenAI keys. Free-tier Edge TTS as default — Microsoft pays for that one (unofficial endpoint, but it's been stable for years).
The "server" runs on the user's laptop. I ship a small Electron tray app for Windows and macOS that boots a local WebSocket server on 127.0.0.1:18900. The Chrome extension connects to it.
What I get:
Zero infra cost. No EC2, no Cloud Run, no serverless cold starts. My recurring infra bill is one Cloudflare Worker for the marketing site.
Audio never leaves the device (modulo the user's chosen STT provider, which is on their key — and they picked it).
Scaling is free. New user = new laptop = new server.
What I trade away:
Onboarding is harder. "Download an app" is more friction than "install extension and sign in."
I can't auto-update server-side bug fixes without an electron-updater roundtrip (R2 + electron-updater handles this fine, but it's another moving part).
Licensing has to live on the desktop side (LemonSqueezy + a tiny Cloudflare Worker for entitlement checks).
For an indie SaaS, that tradeoff is a no-brainer. Now let me show you the technically interesting part.
Mic / Tab audio
│
▼
Deepgram Nova-3 (streaming WebSocket, diarization)
│
▼
TranscriptBuffer (sentence boundary + speaker change + 4s safety timeout)
│
▼
Groq Llama 3.3 70B (streaming, sentence-chunked translation)
│
▼
Edge TTS (free, Microsoft Neural voices)
│
▼
Audio injection back into Meet
Two of these run in parallel per call:
Incoming pipeline (peerLang → userLang): tab audio → translated voice played through your speakers, plus subtitles.
Outgoing pipeline (userLang → peerLang): your mic → translated voice spoken into the meeting as if you said it, plus subtitles for the other side.
Both pipelines share one WebSocket. I multiplex direction with a prefix byte (0x00 incoming, 0x01 outgoing). Cheap, schemaless, works.
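A minimal sketch of that framing (helper names are mine; the real wire protocol may differ):

```javascript
// Direction prefixes for the shared WebSocket, as described above.
const DIR_INCOMING = 0x00; // peerLang -> userLang
const DIR_OUTGOING = 0x01; // userLang -> peerLang

// Prepend one direction byte to a raw audio chunk before sending.
function frame(direction, audioBytes) {
  const out = new Uint8Array(audioBytes.length + 1);
  out[0] = direction;
  out.set(audioBytes, 1);
  return out;
}

// Split a received frame back into direction + payload.
function deframe(frameBytes) {
  return { direction: frameBytes[0], payload: frameBytes.subarray(1) };
}
```

One byte of overhead per message buys you two logical channels on one socket, with no JSON envelope around binary audio.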
End-to-end latency is around 1.5–2 seconds in the steady state. Most of it is Deepgram waiting to confidently mark a chunk as is_final.
Now the two parts that took the longest to get right.
This is the cool one.
When Meet wants your microphone, it calls navigator.mediaDevices.getUserMedia({ audio: true }). It gets back a MediaStream, and that's what flows to the other participants.
So I just... return a different stream.
// content script, world: "MAIN", runAt: "document_start"
const origGetUserMedia = navigator.mediaDevices.getUserMedia
.bind(navigator.mediaDevices);
navigator.mediaDevices.getUserMedia = async (constraints) => {
if (!constraints?.audio) return origGetUserMedia(constraints);
// Get the real mic, but don't hand it to Meet directly
const realStream = await origGetUserMedia(constraints);
// Build a controllable stream Meet will hold a reference to
const controlStream = new MediaStream();
for (const t of realStream.getAudioTracks()) controlStream.addTrack(t);
for (const t of realStream.getVideoTracks()) controlStream.addTrack(t);
// After the next user gesture, swap the audio tracks for our mixed stream
document.addEventListener("click", trySetupGraph, true);
return controlStream;
};
The mixed stream is built with Web Audio:
audioCtx = new AudioContext({ sampleRate: 48000 });
destination = audioCtx.createMediaStreamDestination();
micSource = audioCtx.createMediaStreamSource(realStream);
micGainNode = audioCtx.createGain(); // mic, with ducking
ttsGainNode = audioCtx.createGain(); // injected TTS, with boost
micSource.connect(micGainNode).connect(destination);
ttsGainNode.connect(destination);
// Swap tracks on the stream Meet is already holding a reference to
for (const t of controlStream.getAudioTracks()) controlStream.removeTrack(t);
for (const t of destination.stream.getAudioTracks()) controlStream.addTrack(t);
When the server sends translated TTS audio back:
Decode the chunks into an AudioBuffer.
Duck micGainNode to 20%, so you don't talk over yourself.
Play the buffer through ttsGainNode → destination.
On source.onended, restore the mic gain.
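Those four steps can be sketched over the graph above (node names from the earlier snippet; the gain values and callback shape are illustrative, not the extension's exact code):

```javascript
// Play one decoded TTS segment into the meeting, ducking the mic meanwhile.
// `ctx` is the AudioContext; `micGainNode`/`ttsGainNode` are from the graph above.
function playTtsBuffer(ctx, micGainNode, ttsGainNode, buffer, onDone) {
  micGainNode.gain.value = 0.2;            // step 2: duck mic to 20%
  const source = ctx.createBufferSource(); // step 3: route buffer through TTS gain
  source.buffer = buffer;
  source.connect(ttsGainNode);
  source.onended = () => {
    micGainNode.gain.value = 1.0;          // step 4: restore mic gain
    if (onDone) onDone();
  };
  source.start();
  return source;
}
```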
From the other participants' point of view, they hear you speaking their language. Their Meet client doesn't know there's a synthesised voice in the pipe — it's just bytes on the same MediaStream Meet asked for.
A few things that bit me:
AudioContext needs a user gesture to start in the running state. So getUserMedia returns the real stream first, and the swap happens on the next click/keydown. Skip this and Chrome creates the context in suspended state — silent failure mode where nothing throws but no audio flows.
The override script runs in the MAIN world, which means no chrome.* APIs. All extension communication goes through window.postMessage with targetOrigin: "https://meet.google.com" (never "*" — defense-in-depth).
A sequential TTS queue is mandatory. Two segments arriving back-to-back and decoded in parallel will overlap and sound like two drunk synths arguing. A single isPlaying flag plus playNext() in source.onended is enough.
A monotonic activePlaybackId counter, bumped on every new playback. Stale onended callbacks from a previous segment check it and bail out. Without this, a fast-arriving newer segment got its mic gain restored by an older callback and the next one started full-volume.
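The last two bullets combine into one small pattern. A condensed sketch (names are mine, not necessarily the extension's):

```javascript
// Sequential TTS playback with a stale-callback guard.
const ttsQueue = [];
let isPlaying = false;
let activePlaybackId = 0;

// `play(onEnded)` starts one segment and calls onEnded when it finishes.
function enqueueTts(play) {
  ttsQueue.push(play);
  if (!isPlaying) playNext();
}

function playNext() {
  const play = ttsQueue.shift();
  if (!play) { isPlaying = false; return; }
  isPlaying = true;
  const myId = ++activePlaybackId; // bumped on every new playback
  play(() => {
    // A stale onended from an older segment must not advance the queue
    // (or restore mic gain) out from under the current one.
    if (myId !== activePlaybackId) return;
    playNext();
  });
}
```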
Deepgram emits two kinds of finalized transcripts: is_final (this chunk is locked in) and speech_final (the speaker just took a breath). If you translate every is_final chunk you get garbage — three-word fragments, no context, awful cache behavior. If you wait for speech_final you get clean translations but the user waits 2+ seconds before hearing anything.
The compromise is a TranscriptBuffer that flushes on whichever happens first:
push(text, speaker, endTime) {
// Speaker switched — flush the previous speaker first
if (speaker !== this.speaker && this.segments.length) this.flush();
this.segments.push(text);
const accumulated = this.segments.join(" ");
if (SENTENCE_BOUNDARY_RE.test(accumulated) && accumulated.length > 20) {
this.flush(); // sentence done
} else if (wordCount(accumulated) >= 30) {
this.flush(); // long monologue
} else if (!this.timer) {
this.timer = setTimeout(() => this.flush(), 4000); // silence safety
}
}
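push() leans on flush(), which isn't shown. A plausible reconstruction consistent with the fields used above (my sketch, not the actual source):

```javascript
// Minimal TranscriptBuffer skeleton: just flush() and the state push() relies on.
class TranscriptBuffer {
  constructor(onFlush) {
    this.onFlush = onFlush; // receives each stable chunk for translation
    this.segments = [];
    this.speaker = null;
    this.timer = null;
  }
  flush() {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    if (!this.segments.length) return;
    const text = this.segments.join(" ");
    this.segments = [];
    this.onFlush(text, this.speaker); // hand the stable chunk downstream
  }
}
```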
On the translation side: instead of waiting for the LLM to finish the whole sentence, the Groq response is streamed and re-chunked by sentence (regex [.!?] after 20+ chars). Each sentence is sent to TTS as soon as it lands, not at end-of-stream. This pipelines TTS synthesis on top of LLM generation — first audible word arrives noticeably faster than the naive "translate, then synthesize" loop.
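The re-chunking step can be sketched like this (regex and 20-char threshold from the text; the function shape is my own, and short sentences simply ride along until the next boundary):

```javascript
const SENTENCE_END_RE = /[.!?]/;
const MIN_CHARS = 20;

// Feed streamed LLM tokens in; get completed sentences out as they land.
function makeSentenceChunker(onSentence) {
  let buf = "";
  return (token) => {
    buf += token;
    let idx;
    // Emit each sentence as soon as its terminator appears past MIN_CHARS.
    while ((idx = buf.search(SENTENCE_END_RE)) !== -1 && idx + 1 >= MIN_CHARS) {
      onSentence(buf.slice(0, idx + 1).trim());
      buf = buf.slice(idx + 1);
    }
  };
}
```

A final flush at end-of-stream would hand over any trailing remainder (omitted here).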
Subtitles update on the interim transcripts (so the user sees them live), but TTS only plays on stable sentences. Best of both.
Deepgram Nova-3 — only streaming STT I tried that handles speaker diarization well in noisy meetings.
Groq + Llama 3.3 70B — fastest LLM I can afford for a BYOK product. Cheaper per token than GPT-4o-mini and a few times higher throughput. OpenAI is the fallback.
Edge TTS (msedge-tts, MIT-licensed) — Microsoft's Neural voices, free, sound great. OpenAI tts-1 is an optional upgrade.
WXT — best WebExtension framework I've used. Manifest V3, Vite, TypeScript, content-script worlds, all just work.
Electron 41 with an ESM tray app — surprisingly clean. utilityProcess runs the WS server in a child process so it can crash without taking the tray with it.
Astro 6 for the marketing site — static, fast, file-based i18n.
What I rejected:
OpenAI Whisper API — the standard /v1/audio/transcriptions endpoint takes a finished file, not a stream. (The newer Realtime API with gpt-4o-transcribe exists, but it's a different beast and came too late for this design.)
ElevenLabs — beautiful voices, but the per-minute price would make BYOK unaffordable for daily users.
A traditional VPS backend — the entire point of this design.
BYOK + local server is a real pattern. Cost-of-revenue collapses to $0. Privacy goes from a marketing line to an architecture property. The price you pay is onboarding friction — and most pro users will gladly trade that for control.
Manifest V3 is harder than the docs admit. You can't keep state in the service worker. You need an offscreen document for anything stateful (audio, persistent WebSocket). chrome.storage is not available in the offscreen doc, so you message-pass with retry. Plan for it.
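The message-pass-with-retry part is generic enough to sketch. This is a hypothetical helper (retry counts and delays are illustrative); in the extension it would wrap the real chrome.runtime.sendMessage, which can fail while the service worker is still waking up:

```javascript
// Retry an async send until it succeeds — early messages to a sleeping
// MV3 service worker can race its startup.
async function sendWithRetry(send, msg, { tries = 5, delayMs = 200 } = {}) {
  let lastErr;
  for (let i = 0; i < tries; i++) {
    try {
      return await send(msg);
    } catch (err) {
      lastErr = err;
      await new Promise((r) => setTimeout(r, delayMs * (i + 1))); // linear backoff
    }
  }
  throw lastErr;
}
// In the extension, roughly:
// await sendWithRetry((m) => chrome.runtime.sendMessage(m), { type: "get-keys" });
```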
Electron is not as bad as Twitter says. A tray-only app is ~200 MB on disk and ~80 MB RAM idle. electron-builder handles signing on Mac/Windows. GitHub Actions builds the macOS DMG on macos-latest for free.
If you want to try the thing: download MeetVoice for Windows or macOS at meetvoice.app and install the Chrome extension. You'll need a Deepgram key (free tier is enough to test); the rest is optional.
Happy to answer questions in the comments — especially about the audio graph or the MV3 offscreen-doc dance. Those took the most pain to figure out.
Russian version: На русском