dev_to · April 25, 2026


NumPy Arrays: Why Not Just Use a Python List?





You have been using NumPy arrays since post 17. `np.array([1, 2, 3])`. `np.zeros((3, 4))`. `np.random.randn(100)`. You have typed these dozens of times without stopping to ask why. Why not just use a Python list? Lists hold numbers. Lists can be looped over. Lists support indexing. What does NumPy actually add?

The answer matters more than you might think. When you understand why NumPy arrays are different, you stop fighting the library and start using it the way it was designed to be used.

```python
import numpy as np
import time

size = 5_000_000
python_list = list(range(size))
numpy_array = np.arange(size, dtype=np.float64)

start = time.time()
result_list = [x * 2.5 for x in python_list]
list_time = time.time() - start

start = time.time()
result_numpy = numpy_array * 2.5
numpy_time = time.time() - start

print(f"Python list: {list_time:.4f} seconds")
print(f"NumPy array: {numpy_time:.4f} seconds")
print(f"NumPy is {list_time / numpy_time:.0f}x faster")
```

Output:

```
Python list: 0.8341 seconds
NumPy array: 0.0089 seconds
NumPy is 94x faster
```

94 times faster. On 5 million numbers. That difference scales. When you are processing image datasets of millions of images, or training on millions of records, that gap is the difference between waiting 2 minutes and waiting 3 hours.

The reason is how memory works. A Python list stores references to objects scattered across memory. Each number is a full Python object with overhead, type information, and a reference count. To multiply a list by 2.5, Python visits each object individually, one at a time.

A NumPy array stores raw numbers packed tightly into one contiguous block of memory. All the same type. No overhead. NumPy passes that block to optimized C code that processes everything in parallel, using the CPU vectorization instructions designed exactly for this.

One scatters. One packs. Packing wins.

Every element in a NumPy array must be the same type. That is the trade: less flexibility, enormous speed.
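The scattered-versus-packed claim can be measured directly. A minimal sketch (exact byte counts vary by Python version and platform):

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# The list object holds n pointers, and each int is a separate heap
# object (~28 bytes each on CPython), so the total is pointers + objects.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# The array stores n raw 8-byte integers in one contiguous buffer.
array_bytes = np_arr.nbytes

print(f"list total: ~{list_bytes} bytes")  # tens of kilobytes
print(f"array data: {array_bytes} bytes")  # exactly 8000 bytes
```

The gap in bytes is the same gap that shows up in the timing benchmark: pointer chasing versus one linear scan.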
```python
int_array = np.array([1, 2, 3, 4])
float_array = np.array([1.0, 2.0, 3.0])
bool_array = np.array([True, False, True])

print(int_array.dtype)    # int64
print(float_array.dtype)  # float64
print(bool_array.dtype)   # bool
```

Output:

```
int64
float64
bool
```

int64 means 64-bit integer. It can store numbers from roughly -9 quintillion to +9 quintillion. float64 means 64-bit floating point, the standard double-precision format.

You can specify the dtype explicitly.

```python
small_ints = np.array([1, 2, 3], dtype=np.int8)
half_float = np.array([1.0, 2.0, 3.0], dtype=np.float32)

print(f"int8 range: {np.iinfo(np.int8).min} to {np.iinfo(np.int8).max}")
print(f"Memory: {small_ints.nbytes} bytes vs {np.array([1, 2, 3], dtype=np.int64).nbytes} bytes")
```

Output:

```
int8 range: -128 to 127
Memory: 3 bytes vs 24 bytes
```

int8 uses 3 bytes for three numbers. int64 uses 24. One eighth the memory. When you are loading image pixel data (values 0-255), using uint8 instead of float64 cuts your memory usage by 8x. On a dataset of 100,000 images, that is the difference between fitting in RAM and not.

This matters in deep learning. GPU memory is limited and expensive. Using float32 instead of float64 halves your memory usage with minimal precision loss. Most neural network training uses float32 for exactly this reason.

```python
print(np.zeros((3, 4)))      # all zeros
print(np.ones((2, 3)))       # all ones
print(np.full((2, 3), 7))    # filled with a value
print(np.eye(4))             # identity matrix
print(np.arange(0, 10, 2))   # like range(), returns array
print(np.linspace(0, 1, 5))  # 5 evenly spaced from 0 to 1
```

Output:

```
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
[[1. 1. 1.]
 [1. 1. 1.]]
[[7 7 7]
 [7 7 7]]
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
[0 2 4 6 8]
[0.   0.25 0.5  0.75 1.  ]
```

np.linspace is the one most beginners miss. It gives you n evenly spaced numbers between a start and end value, inclusive. When you are plotting a function or creating a range of learning rates to test, this is what you reach for.
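The float32 point is easy to verify. A quick sketch using a hypothetical 1000 x 1000 weight matrix (the shape is illustrative, not from the text):

```python
import numpy as np

weights64 = np.random.randn(1000, 1000)   # float64 by default
weights32 = weights64.astype(np.float32)  # downcasting halves the footprint

print(weights64.nbytes)  # 8000000 bytes
print(weights32.nbytes)  # 4000000 bytes

# The precision lost in the cast is tiny relative to typical weight values.
print(np.abs(weights64 - weights32).max())
```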
Everything from Python lists, extended to multiple dimensions.

```python
data = np.array([
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120]
])

print(data[1, 2])       # single element: 70
print(data[0, :])       # first row: [10 20 30 40]
print(data[:, 1])       # second column: [20 60 100]
print(data[1:, 2:])     # bottom right 2x2 block
print(data[[0, 2], :])  # rows 0 and 2
```

Output:

```
70
[10 20 30 40]
[ 20  60 100]
[[ 70  80]
 [110 120]]
[[ 10  20  30  40]
 [ 90 100 110 120]]
```

That last one, `data[[0, 2], :]`, is fancy indexing. Pass a list of indices and you get those specific rows back. This is how you select a subset of training samples without a loop.

This is one of the most useful NumPy features, and most beginners do not use it enough.

```python
scores = np.array([72, 88, 45, 91, 63, 54, 79, 96, 38, 82])
passing = scores[scores >= 60]

print(f"All scores: {scores}")
print(f"Passing only: {passing}")

mask = scores >= 60
print(f"Mask: {mask}")
```

Output:

```
All scores: [72 88 45 91 63 54 79 96 38 82]
Passing only: [72 88 91 63 79 96 82]
Mask: [ True  True False  True  True False  True  True False  True]
```

`scores >= 60` creates a boolean array. Using that boolean array as an index filters the original array. No loop. No list comprehension. One line. This is how data filtering works in NumPy at scale.

```python
students = np.array([
    [72, 23, 1],
    [88, 25, 0],
    [45, 19, 1],
    [91, 31, 1],
    [54, 22, 0]
])

high_scorers_over_21 = students[(students[:, 0] >= 70) & (students[:, 1] > 20)]
print(high_scorers_over_21)
```

Output:

```
[[72 23  1]
 [88 25  0]
 [91 31  1]]
```

A score of at least 70 AND an age above 20. Two conditions, one line, no loop.

Broadcasting is NumPy's way of doing operations between arrays of different shapes.

```python
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
row = np.array([10, 20, 30])

result = matrix + row
print(result)
```

Output:

```
[[11 22 33]
 [14 25 36]
 [17 28 39]]
```

matrix is (3, 3). row is (3,). They are different shapes. NumPy automatically expanded row across all three rows of matrix before adding.
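Fancy indexing is exactly how mini-batch sampling works in practice. A sketch with made-up shapes (the dataset and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8))  # 1000 samples, 8 features
y = rng.integers(0, 2, size=1000)   # binary labels

# Draw 32 distinct row indices, then pull those rows out in one step.
batch_idx = rng.choice(1000, size=32, replace=False)
X_batch = X[batch_idx]
y_batch = y[batch_idx]

print(X_batch.shape)  # (32, 8)
print(y_batch.shape)  # (32,)
```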
The broadcasting rule: dimensions are compared from the right. Either they are equal, or one of them is 1. If one dimension is 1, it gets stretched to match.

The most common broadcasting you will do:

```python
data = np.random.randn(1000, 8)
col_means = data.mean(axis=0)  # shape (8,)
col_stds = data.std(axis=0)    # shape (8,)

normalized = (data - col_means) / col_stds  # (1000, 8) - (8,) works via broadcasting
print(f"Normalized shape: {normalized.shape}")
print(f"Column means after: {normalized.mean(axis=0).round(4)}")
```

Output:

```
Normalized shape: (1000, 8)
Column means after: [-0.  0.  0. -0.  0.  0.  0. -0.]
```

Subtract the mean vector from every row, divide every row by the std vector. Zero loops. One line. Broadcasting handles the shape mismatch automatically.

```python
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])

print(arr.sum())             # 39
print(arr.min())             # 1
print(arr.max())             # 9
print(arr.mean())            # 3.9
print(round(arr.std(), 2))   # 2.34
print(np.sort(arr))          # sorted copy
print(np.argsort(arr))       # indices that would sort it
print(np.unique(arr))        # unique values
print(np.cumsum(arr))        # running total
```

Output:

```
39
1
9
3.9
2.34
[1 1 2 3 3 4 5 5 6 9]
[1 3 6 0 9 2 8 4 7 5]
[1 2 3 4 5 6 9]
[ 3  4  8  9 14 23 25 31 36 39]
```

np.argsort returns the indices that would sort the array. So index 1 comes first (value 1), then index 3 (also value 1), and so on. Useful when you want to know the ranking of items, not just their sorted values. For example, ranking recommendations or finding top predictions.

```python
flat = np.arange(24)
grid = flat.reshape(4, 6)
print(f"Flat: {flat.shape} Grid: {grid.shape}")

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

vertical = np.vstack([a, b])
horizontal = np.hstack([a, b])

print(f"\nvstack: {vertical.shape}")
print(vertical)
print(f"\nhstack: {horizontal.shape}")
print(horizontal)
```

Output:

```
Flat: (24,) Grid: (4, 6)

vstack: (4, 2)
[[1 2]
 [3 4]
 [5 6]
 [7 8]]

hstack: (2, 4)
[[1 2 5 6]
 [3 4 7 8]]
```

vstack stacks vertically (more rows). hstack stacks horizontally (more columns).
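The "top predictions" use of np.argsort looks like this in practice. A small sketch (the scores are invented):

```python
import numpy as np

probs = np.array([0.12, 0.55, 0.08, 0.91, 0.34])  # hypothetical model scores

# argsort is ascending, so take the last k indices and reverse them
# to get the k highest scores in best-first order.
k = 3
top_k = np.argsort(probs)[-k:][::-1]

print(top_k)         # [3 1 4]
print(probs[top_k])  # [0.91 0.55 0.34]
```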
You will use these when combining datasets, concatenating batches, or building feature matrices from multiple sources.

Create numpy_practice.py. You have exam scores for 200 students across 5 subjects:

```python
np.random.seed(99)
scores = np.random.randint(30, 101, size=(200, 5))
subject_names = ["Math", "Science", "English", "History", "CS"]
```

Do all of the following without loops:

1. Calculate each student's average score (mean across axis 1).
2. Find the top 10 students by average score using np.argsort.
3. Calculate each subject's mean and standard deviation (across axis 0). Which subject has the highest average? Which has the most variance?
4. Find all students who failed (below 40) in at least one subject. How many are there? Hint: use boolean indexing and np.any.
5. Normalize the entire scores matrix so each subject has mean 0 and std 1. Verify by printing the column means and stds after normalization.
6. Build a new matrix containing only the rows of students who passed all five subjects (all scores above 40). What is its shape?

You know NumPy deeply now. The next tool is Pandas. If NumPy is for raw numerical computation, Pandas is for data that has labels, column names, mixed types, and the messy structure of real-world datasets. It is where you spend most of your time before models even enter the picture.
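As a postscript to the practice exercise: here is one possible shape of the first few tasks, for checking your own attempt afterwards (a sketch, not the only correct solution):

```python
import numpy as np

np.random.seed(99)
scores = np.random.randint(30, 101, size=(200, 5))

# Task 1: per-student average across the five subjects (axis 1).
student_avgs = scores.mean(axis=1)

# Task 2: argsort is ascending, so the last 10 indices are the top
# students; reverse them to list the best student first.
top10 = np.argsort(student_avgs)[-10:][::-1]

# Task 4: a row fails if any of its five scores is below 40.
failed = scores[np.any(scores < 40, axis=1)]

print(student_avgs.shape)  # (200,)
print(top10.shape)         # (10,)
print(len(failed))         # number of students with at least one fail
```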