dev_to 2026年3月7日

生産の不具合が現れたときから、確実な修理までの時間を減らす

Reducing the time between a production crash and a fix

Translated: 2026/3/7 7:37:06

Japanese Translation

コードをリリースすると何となく動作しますが、忽然として生産でクラッシュが発生しました。

Original Content

You ship code, everything works — and then suddenly a crash appears in production. Even in well-instrumented systems, the investigation process often looks like this: check the monitoring alert dig through logs search the codebase try to reproduce the issue write a fix open a pull request In many teams, this process can easily take hours. After several years working on complex applications and critical data workflows, I started wondering if part of this investigation process could be automated. Could we shorten the loop between crash detection and a validated fix? This is what led me to start building Crashloom. Crashloom is an experiment around using AI agents to investigate crashes, identify potential root causes, and propose fixes that can be validated before creating a pull request. The idea is to reduce the time between a production crash and a safe fix by assisting developers in the investigation workflow. crash → investigation → sandbox validation → pull request The project is still early stage, and I'm curious how other teams handle production incidents today. How long does it usually take in your case to go from crash detection → merged fix?