arxiv_cs_cv February 10, 2026

Privacy in Image Datasets: A Case Study on Pregnancy Ultrasounds

Tags: privacy, image-datasets, pregnancy-ultrasounds, laion, clip

arXiv:2602.07149v1
Announce Type: new

Abstract: The rise of generative models has led to increased use of large-scale datasets collected from the internet, often with minimal or no data curation. This raises concerns about the inclusion of sensitive or private information. In this work, we explore the presence of pregnancy ultrasound images, which contain sensitive personal information and are often shared online. Through a systematic examination of the LAION-400M dataset using CLIP embedding similarity, we retrieve images containing pregnancy ultrasounds and detect thousands of private-information entities, such as names and locations. Our findings reveal that multiple images contain high-risk information that could enable re-identification or impersonation. We conclude with recommended practices for dataset curation, data privacy, and ethical use of public image datasets.
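The abstract names CLIP embedding similarity as the retrieval mechanism but does not give implementation details. The following is a minimal sketch of similarity-based retrieval in general, not the authors' pipeline. It assumes the embeddings are already computed; the toy 3-dimensional vectors stand in for real CLIP embeddings, and the function names, the `threshold` value, and `top_k` are all hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_similar(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
    top_k: int = 10,         # assumed result limit, not taken from the paper
) -> List[Tuple[str, float]]:
    """Return up to top_k (image_id, score) pairs with score >= threshold,
    sorted from most to least similar to the query embedding."""
    scored = [
        (image_id, cosine_similarity(query, embedding))
        for image_id, embedding in index.items()
    ]
    hits = [(image_id, score) for image_id, score in scored if score >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:top_k]


# Toy vectors standing in for precomputed image embeddings (hypothetical data).
toy_index = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.8, 0.2, 0.1],
}
# Stand-in for the embedding of a query such as a reference ultrasound image.
toy_query = [1.0, 0.0, 0.0]

results = retrieve_similar(toy_query, toy_index, threshold=0.8, top_k=2)
for image_id, score in results:
    print(f"{image_id}: {score:.3f}")
```

At the scale of LAION-400M, a brute-force loop like this would be impractical; an approximate nearest-neighbor index would typically replace it. The retrieved images would then go to a separate step that detects names, locations, and similar entities, which the abstract mentions but does not specify.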