Your Struct is Wasting Memory and You Don't Know It
We write structs by listing fields in whatever order feels readable. Name, then age, then score. It compiles. It runs. The compiler silently bloats it, misaligns it, or both, and you ship it without ever checking. Here are three structs holding the exact same six fields:

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct Good {
    double balance;
    uint64_t transaction_id;
    int32_t account_type;
    int16_t region_code;
    char status;
    char currency[4];
};

struct Bad {
    char status;
    double balance;
    int16_t region_code;
    uint64_t transaction_id;
    char currency[4];
    int32_t account_type;
};

struct __attribute__((packed)) PackedBad {
    char status;
    double balance;
    int16_t region_code;
    uint64_t transaction_id;
    char currency[4];
    int32_t account_type;
};

int main(void) {
    printf("Good: %zu bytes\n", sizeof(struct Good));
    printf("Bad: %zu bytes\n", sizeof(struct Bad));
    printf("PackedBad: %zu bytes\n", sizeof(struct PackedBad));
    return 0;
}
```

```
Good: 32 bytes
Bad: 40 bytes
PackedBad: 27 bytes
```

Same fields. 27, 32, and 40 bytes. The difference is not the data. It is the order and whether you let the compiler do its job.

Before touching any struct, you need to understand how the CPU actually talks to RAM. Three buses connect them; the two that matter here are the address bus and the data bus. The address bus carries the memory address the CPU wants to read. It is 48 to 52 physical wires on a modern system. The CPU puts a number on these wires and RAM listens. The data bus carries the actual bytes back. It is 64 bits wide, so 8 bytes travel in parallel per transfer. But the CPU does not stop at 8 bytes. It keeps bursting transfers across the data bus until it has filled a full cache line. A cache line is 64 bytes. That is the only unit of communication between RAM and your L1/L2 cache. The CPU never fetches 1 byte. It never fetches 8 bytes. It always fetches 64 bytes.
When you read a single char, the CPU puts that char's address on the address bus, pulls the entire 64-byte block containing it across the data bus, stores it in cache, and then gives you your one byte out of it.

Every cache line starts at an address that is a multiple of 64. Cache line 0 covers 0x0000 to 0x003F (0 to 63). Cache line 1 covers 0x0040 to 0x007F (64 to 127). Cache line 2 covers 0x0080 to 0x00BF (128 to 191). The boundaries are fixed and always at multiples of 64. This is the rule everything else in this post follows.

Every data type has an alignment requirement equal to its own size. A double (8 bytes) must start at an address divisible by 8. A uint32_t (4 bytes) must start at an address divisible by 4. A char (1 byte) can go anywhere. When a field sits at a naturally aligned address, the CPU reads it in one bus transaction. It fits cleanly inside one cache line fetch. When a field is misaligned, it can straddle a cache line boundary.

Say a double starts at 0x003C (60). It is 8 bytes, so it occupies 0x003C to 0x0043 (60 to 67). Cache line 0 ends at 0x003F (63). Cache line 1 starts at 0x0040 (64). Your double is split across both. The CPU issues an address request for cache line 0, waits for the data bus to deliver, then issues a second address request for cache line 1, waits again, and stitches both halves together in hardware. Two full round trips to memory for one field read.

Now think about why address mod data_size == 0 prevents this. Cache line boundaries sit at multiples of 64. A naturally aligned double sits at a multiple of 8. The worst case is a double at 0x0038 (56), occupying bytes 56 to 63. It ends exactly at the cache line boundary, never crossing it. This works because 64 is itself a multiple of 8. A field aligned to its own size mathematically cannot straddle a boundary that is also a multiple of that same size.

So address mod data_size == 0 is not a style convention. It is the condition that guarantees your field lives inside exactly one cache line, fetched in exactly one bus transaction, with no possibility of being split. The compiler inserts padding between fields to maintain this guarantee. Bad field ordering forces it to insert a lot of padding. And packed removes all of it.

Here is struct Good's layout:

```
0x0000      0x0008      0x0010  0x0014  0x0016 0x0017    0x001B  0x001F
(0)         (8)         (16)    (20)    (22)   (23)      (27)    (31)
|-----------|-----------|-------|-------|------|---------|-------|
 balance     trans_id    acct    reg     st     currency  pad
```

- balance at 0x0000 (0): 0 mod 8 = 0. Aligned.
- transaction_id at 0x0008 (8): 8 mod 8 = 0. Aligned.
- account_type at 0x0010 (16): 16 mod 4 = 0. Aligned.
- region_code at 0x0014 (20): 20 mod 2 = 0. Aligned.
- status at 0x0016 (22): char, goes anywhere.
- currency at 0x0017 (23): char array, goes anywhere.

Every field starts exactly where the previous one ended. Zero internal padding. The 5 bytes at the end are tail padding. In an array, the second element must start at an address divisible by 8, the largest field alignment. Without tail padding the second element begins at 0x001B (27) and its balance field lands there too. 27 mod 8 = 3. Misaligned. So the compiler rounds 27 up to 32. The second element starts at 0x0020 (32). 32 mod 8 = 0. Clean. Zero bytes wasted internally. The tail padding is structural and unavoidable.

Now struct Bad:

```
0x0000 0x0001   0x0008     0x0010 0x0012   0x0018     0x0020 0x0024 0x0028
(0)    (1)      (8)        (16)   (18)     (24)       (32)   (36)   (40)
|------|--------|----------|------|--------|----------|------|------|
 st     7B pad   balance    reg    6B pad   trans_id   curr   acct
```

status sits at 0x0000 (0), one byte. The next field is balance, a double that needs a multiple of 8. After byte 1, the nearest multiple of 8 is 0x0008 (8). The compiler inserts 7 bytes of padding between them that store nothing and do nothing. Then region_code lands at 0x0010 (16), two bytes, ending at 0x0011 (17). The field after it is transaction_id, which needs a multiple of 8. The nearest is 0x0018 (24). Six more bytes gone.
13 bytes wasted purely from putting char status first. In an array of a million of these structs that is 13 MB of RAM holding nothing. The struct is 25% larger than it needs to be, meaning fewer elements fit per cache line and more trips to RAM on every access pattern.

And struct PackedBad:

```
0x0000 0x0001    0x0009 0x000B      0x0013  0x0017 0x001B
(0)    (1)       (9)    (11)        (19)    (23)   (27)
|------|---------|------|-----------|-------|------|
 st     balance   reg    trans_id    curr    acct
```

__attribute__((packed)) removes all padding. Fields sit back to back. 27 bytes. But look at where each field actually lands:

- balance at 0x0001 (1): 1 mod 8 = 1. Not 0. Misaligned.
- region_code at 0x0009 (9): 9 mod 2 = 1. Not 0. Misaligned.
- transaction_id at 0x000B (11): 11 mod 8 = 3. Not 0. Misaligned.
- account_type at 0x0017 (23): 23 mod 4 = 3. Not 0. Misaligned.

Four fields, zero aligned. In an array, whether a given element straddles a cache line boundary depends on its index. You can check with:

```
(index * struct_size) mod 64 + struct_size > 64
```

For element 2 of PackedBad: (2 * 27) mod 64 + 27 = 54 + 27 = 81. Since 81 > 64, element 2 straddles. Its 27 bytes run from 0x0036 (54) through 0x0050 (80), crossing the cache line boundary at 0x0040 (64). The CPU issues an address request for cache line 0 (0x0000 to 0x003F), waits for the data bus, issues a request for cache line 1 (0x0040 to 0x007F), waits again, and stitches both halves. Two full round trips for one struct read. You saved 13 bytes on paper and doubled your memory traffic in practice.

The straddle is slow. In single-threaded code it is just slower. In multithreaded code it is also wrong. The CPU guarantees a memory access is atomic, meaning indivisible and instantaneous from every other thread's perspective, only when:

```
address mod data_size == 0
```

That condition guarantees the field sits inside one cache line and the CPU fetches it in one bus transaction. One transaction means no window for another thread to slip in. When balance sits at 0x0001 (1) in PackedBad, 1 mod 8 = 1.
The condition fails. The CPU fetches the first portion of balance in one bus transaction, then the second portion in a separate bus transaction. There is a real time gap between them. If another thread writes to that same balance field inside that gap, the reading thread gets the first half from before the write and the second half from after it. A value assembled from two different points in time. A number that was never logically written anywhere in your program. No segfault. No assertion. No log line. The field silently reads as garbage. In a monitoring system this corrupts your metrics. In a financial system this is a balance that never existed reaching your business logic.

Good and Bad are both padded by the compiler so every field satisfies address mod data_size == 0. Tearing cannot happen. PackedBad has four fields that fail this condition in every element.

| Struct    | Size     | Internal padding | Misaligned fields |
|-----------|----------|------------------|-------------------|
| Good      | 32 bytes | 0 bytes          | 0                 |
| Bad       | 40 bytes | 13 bytes         | 0                 |
| PackedBad | 27 bytes | 0 bytes          | 4                 |

Bad pays in memory. PackedBad pays in correctness. Good pays nothing.

Order fields from largest alignment requirement to smallest:

```c
struct Good {
    double   balance;          // 8 bytes
    uint64_t transaction_id;   // 8 bytes
    int32_t  account_type;     // 4 bytes
    int16_t  region_code;      // 2 bytes
    char     status;           // 1 byte
    char     currency[4];      // 1-byte alignment
};
```

The compiler has nothing to pad because each field naturally follows the previous one without any gap. No attributes, no pragmas. Just ordering. Verify with sizeof. Inspect individual field positions with __builtin_offsetof(struct Foo, field) when something looks off.

__attribute__((packed)) has one valid use: serializing data onto a network socket or disk, where you control both ends and the CPU never does arithmetic directly on the packed bytes. You pack the struct, write the raw bytes to the wire, and on the receiving end you copy into a properly aligned struct before reading any field.
The packed struct is a transport container, not a data structure your code operates on. The moment you read fields out of a packed struct in a running program, you pay the straddle penalty on every access and you are one concurrent write away from tearing.

So you fix your field order. You remove packed. Everything is aligned. You go multithreaded, and all cores pin at 100% while throughput collapses.

struct Good is 32 bytes. Two of them fit inside one 64-byte cache line. Say your array starts at 0x1000 (4096). arr[0] lives at 0x1000 to 0x101F (4096 to 4127). arr[1] lives at 0x1020 to 0x103F (4128 to 4159). Both sit inside the single cache line spanning 0x1000 to 0x103F (4096 to 4159). Thread 1 writes to arr[0]. Thread 2 writes to arr[1]. Different structs. No shared fields. No mutex involved. But both live in the same 64-byte cache line.

Every time Thread 1 writes to arr[0], the CPU's MESI cache coherency protocol broadcasts an invalidation across the ring bus to every other core: the cache line at 0x1000 was modified, your copies are stale, drop them. Thread 2 has its L1 cache entry for arr[1] ripped away even though nobody touched arr[1]. It takes an L1 miss, goes out to L3, fetches the 64-byte line again, modifies arr[1], and now Thread 1 gets invalidated. Back and forth. The cores spend the vast majority of their time passing one cache line across the ring bus and almost no time doing actual work.

The fix is to give each struct its own cache line:

```c
struct __attribute__((aligned(64))) NodeMetrics {
    double balance;
    uint64_t transaction_id;
    int32_t account_type;
    int16_t region_code;
    char status;
    char currency[4];
};
```

Now arr[0] owns 0x1000 to 0x103F (4096 to 4159) entirely. arr[1] owns 0x1040 to 0x107F (4160 to 4223) entirely. Thread 1 and Thread 2 never touch the same cache line, and the coherency protocol never fires between them. You waste 32 bytes per struct. You get linear scaling across every core.

Order fields largest to smallest. Verify with sizeof. Check offsets with __builtin_offsetof when something feels off. Use packed only for wire or disk formats where you control both ends. Pad to 64 bytes with aligned(64) only when multiple threads write to adjacent elements of an array.