dev.to, April 17, 2026

Part 2: The Dataset - Labels, Heuristics, Synthetic Data, and Why AI Starts Before the Model

#ai #machine-learning #data-pipeline #deep-learning #backend-engineering

Before we begin: if you have come directly to this post (Part 2 of 6), here is Part 1, where I explain the basics and set expectations for the series: Part 1: What We Built - A Tiny AI System for Support Ticket Classification (Prince Raj, Apr 16).

When many developers first approach AI, they jump straight to the model. They ask: Which neural network should I use? Should I use transformers? How many layers should I add?

Those are fair questions, but not the first questions. For this project, the first real job was: define what the model is supposed to mean.

That sounds obvious, but it is the foundation of everything else. If your labels are vague, inconsistent, or impossible to infer from text, the model will struggle no matter how fancy the architecture is.

This classifier does not output one label. It outputs five: department, sentiment, lead_intent, churn_risk, intent. That means every training example needs a shape like this:

```json
{
  "text": "refund nahi mila yet",
  "department": "billing",
  "sentiment": "negative",
  "lead_intent": "low",
  "churn_risk": "high",
  "intent": "refund"
}
```

This is the canonical schema of the training set. Plain-English version: every ticket must be translated into one consistent answer sheet.

Imagine you have data from three places: a banking support dataset, a sentiment dataset, and a general intent dataset. None of them naturally match your product. One dataset may have label=payment_issue. Another may only know positive vs negative sentiment. Yet another may say nothing about churn risk at all.

So the job is not only "load data." The job is: convert different sources into one shared language. That is what the dataset pipeline in this project does.

Let's go through each output the way a backend engineer would.

**department**

This is a routing problem. The question is: which team should probably handle this?
Examples:

- refund -> billing
- password reset -> technical
- tracking issue -> logistics
- pricing request -> sales

This label is operational.

**sentiment**

This measures emotional tone: positive, neutral, negative. This is not the same as intent. A pricing question can be neutral. A refund request can be negative. A thank-you note can be positive. This label helps downstream prioritization and messaging.

**lead_intent**

This is where business context starts to matter. The question is: does this message look like a buying opportunity?

Examples:

- demo request -> high
- pricing inquiry -> high
- feature request -> medium
- complaint -> low

This label is not just language understanding. It is business interpretation. That matters later, because it is one reason small custom models can beat general-purpose LLMs on narrow tasks.

**churn_risk**

This estimates whether the customer may leave.

Examples:

- cancellation request -> high
- repeated refund frustration -> high
- neutral tracking question -> low

Again, this is partly semantic and partly business logic.

**intent**

This is the most specific task. Examples: refund, cancellation, delivery_issue, pricing_inquiry, technical_issue.

The training pipeline pulls data from multiple sources:

- Hugging Face datasets like banking77
- sentiment data like tweet_eval/sentiment
- intent datasets like clinc_oos
- local JSONL files
- synthetic examples
- manual correction data

But raw source labels do not line up nicely with our five-task schema, so we normalize them. Technical term: schema normalization. Plain-English version: we take many different spreadsheets and convert them into one house format.

Here is an important beginner lesson: not every training label has to come from a human manually writing every field. Sometimes a dataset gives you only one known label. You can infer the others using domain rules.
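To make this concrete, here is a minimal sketch of such domain rules in Python. The mapping tables and the helper name are illustrative, not the project's actual code; they just mirror the example rules discussed in this post.

```python
# Illustrative domain rules: infer the remaining labels of the
# five-task schema from a single known intent label.
# These tables are examples, not the project's real mappings.

INTENT_TO_DEPARTMENT = {
    "refund": "billing",
    "technical_issue": "technical",
    "delivery_issue": "logistics",
    "pricing_inquiry": "sales",
}
INTENT_TO_LEAD = {"pricing_inquiry": "high", "demo_request": "high"}
INTENT_TO_CHURN = {"cancellation": "high", "refund": "high"}
INTENT_TO_SENTIMENT = {"complaint": "negative", "refund": "negative"}

def infer_labels(text: str, intent: str) -> dict:
    """Fill the canonical five-label schema from one known intent."""
    return {
        "text": text,
        "intent": intent,
        "department": INTENT_TO_DEPARTMENT.get(intent, "support"),
        "lead_intent": INTENT_TO_LEAD.get(intent, "low"),
        "churn_risk": INTENT_TO_CHURN.get(intent, "low"),
        "sentiment": INTENT_TO_SENTIMENT.get(intent, "neutral"),
    }

example = infer_labels("refund nahi mila yet", "refund")
```

The defaults (e.g. falling back to "support" or "neutral") are a design choice: heuristic labels should be conservative when the rule tables have no opinion.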
For example:

- if intent is refund, department is probably billing
- if intent is pricing_inquiry, lead intent is probably high
- if intent is complaint, sentiment is probably negative
- if intent is cancellation, churn risk is probably high

That is exactly what this project does. In plain language: when we know one strong clue, we can responsibly fill in related labels. This is not perfect.

**Synthetic data**

This is one of my favorite parts of the project, because it is very relatable for backend engineers. Real support data is usually messy in two ways: it is incomplete, and it is uneven. Maybe you have lots of billing messages but not many sales leads. So the pipeline generates synthetic tickets using templates.

Examples of synthetic patterns:

- "I want a refund for my subscription"
- "Refund nahi mila for my order"
- "Can I get a demo for my team?"
- "Payment failed but money got deducted"

Then it adds style noise: typos, shorthand, uppercase, casual phrasing, Hinglish variants.

Plain-English version: we manufacture extra training examples for situations we care about but do not have enough of. Technical term: synthetic data generation, or data augmentation.

**Hinglish and messy real-world text**

A lot of AI tutorials quietly assume clean English input. Real production systems do not get that luxury. Users write things like:

- refund chahiye
- paisa mila nahi
- app kharab hai jaldi fix karo

If you ignore that kind of variation, your model will feel fragile in production. So this project includes simple but valuable normalization rules that map common Hinglish words to normalized English equivalents:

- nahi -> not
- paisa -> money
- kharab -> broken
- chahiye -> want

This is not "full multilingual AI." It is targeted robustness for the language patterns your users actually type.

**Corrections from production**

This project also supports a corrections.jsonl file. That means once the model is live, you can capture corrected labels and feed them back into training.
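As a sketch, capturing and reloading corrections could look like this. Only the corrections.jsonl file name comes from the article; the helper functions are hypothetical.

```python
import json

# Hypothetical sketch of the corrections feedback loop:
# each corrected prediction is appended as one JSON line, so the
# next training run can pick it up alongside the other sources.

def append_correction(path: str, example: dict) -> None:
    """Append one corrected training example as a JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

def load_corrections(path: str) -> list[dict]:
    """Read corrected examples back for the next training run."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

append_correction("corrections.jsonl", {
    "text": "refund nahi mila yet",
    "department": "billing",
    "sentiment": "negative",
    "lead_intent": "low",
    "churn_risk": "high",
    "intent": "refund",
})
```

Append-only JSONL is a convenient format here: each correction is one self-contained line, so the file can grow in production without any coordination.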
The workflow looks like this:

1. Model makes a prediction in production
2. Human or system corrects bad labels
3. Corrected example gets appended to corrections.jsonl
4. Next training run boosts those corrections

I love this because it feels very familiar to backend teams. It is not mystical. It is a feedback loop. You ship. You observe. You correct. You retrain. That is how production systems grow up.

**Train/validation split**

After collecting all examples, the pipeline splits them into training data and validation data. Why do we need validation? Because if we only measure performance on the same examples the model learned from, the scores can be misleading.

Plain-English version: training data is the study material; validation data is the exam.

The project also tries to stratify by intent when splitting. That means it attempts to preserve label balance, so the validation set does not accidentally miss important classes.

**Why AI starts before the model**

At this point, we still have not talked about embeddings, dense layers, or PyTorch math. And that is the point. The AI project already contains a lot of engineering value before the neural network starts training:

- schema design
- label definitions
- heuristics
- dataset normalization
- synthetic example generation
- production corrections
- validation setup

This is why I keep telling backend engineers: you already have a lot of the mindset needed for AI systems. Good AI pipelines reward the same habits as good backend systems: consistent contracts, thoughtful data modeling, clear assumptions, measurable feedback loops.

Remember this: training data is not "whatever text you found." You are deciding:

- what the model should notice
- what tradeoffs it should care about
- what your labels really mean in the business

That is the real beginning of AI work.

In Part 3, we will finally answer the question that makes many people feel like AI is magic: how does text become numbers?
I will explain:

- bag-of-words
- keyword flags
- token IDs
- embeddings
- why this project combines all of them

And I'll do it in plain language first, then connect each idea to the proper technical terms.

Disclosure: AI was used to frame the article.