
[Python] Improving Pandas and NumPy Performance (feat. Pandas vs NumPy)

by dev.py 2024. 5. 6.

This post describes how I fixed abnormally long response times while processing large log files with Pandas and NumPy at work.

Three-line summary up front

  1. Among the Pandas index accessors, at is the fastest.
  2. Building a row as a NumPy array and replacing it in one step is faster than updating a Pandas DataFrame cell by cell.
  3. Replacing a Pandas DataFrame with a NumPy ndarray can be faster still.

 

 

Pandas vs Numpy 

When you search for how to handle matrix data in Python, Pandas usually comes up first and has the most examples.

However, if reads and writes are frequent, NumPy's ndarray performs better than the Pandas DataFrame.

https://www.geeksforgeeks.org/difference-between-pandas-vs-numpy/

 


I translated and summarized the table from the article above.

                 Pandas                    NumPy
Data type        Tabular data              Numerical data
Main tools       DataFrame, Series         ndarray, array
Memory usage     Consumes more memory      Memory efficient
Performance      Better with 500K+ rows    Better with 50K or fewer rows
Indexing speed   Very slow                 Very fast
Data structure   2D table object           Multidimensional arrays

 

The performance gap shows up most clearly with frequent indexing and frequent data reads and writes.

์‹œ๋„ํ•œ ๋ฐฉ๋ฒ•

1. Pandas cell ์ ‘๊ทผ ํ•จ์ˆ˜์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ ์ฐจ์ด

import pandas as pd
import numpy as np
import time

# Create a 500,000 x 10,000 DataFrame of floats (about 40 GB as float64; shrink the shape to try this locally)
data = np.random.uniform(0, 100, size=(500000, 10000))
df = pd.DataFrame(data)

# Generate random indices to use with each accessor
num_accesses = 1000  # number of accesses
random_rows = np.random.randint(0, 500000, size=num_accesses)
random_cols = np.random.randint(0, 10000, size=num_accesses)

# Measure the time for each method
methods = ['loc', 'iloc', 'at', 'iat']
results = {}

for method in methods:
    start_time = time.perf_counter_ns()
    if method == 'loc':
        for i in range(num_accesses):
            value = df.loc[random_rows[i], random_cols[i]]
    elif method == 'iloc':
        for i in range(num_accesses):
            value = df.iloc[random_rows[i], random_cols[i]]
    elif method == 'at':
        for i in range(num_accesses):
            value = df.at[random_rows[i], random_cols[i]]
    elif method == 'iat':
        for i in range(num_accesses):
            value = df.iat[random_rows[i], random_cols[i]]
    end_time = time.perf_counter_ns()
    total_time_ns = end_time - start_time
    avg_time_ns = total_time_ns / num_accesses
    results[method] = {'total_time_ns': total_time_ns, 'avg_time_ns': avg_time_ns}

# Compute relative performance as a percentage, with 'at' as the baseline
at_total_time = results['at']['total_time_ns']
relative_performance = {method: (results[method]['total_time_ns'] / at_total_time) * 100 for method in methods}

# Print the results
for method, times in results.items():
    print(f"{method} method total time: {times['total_time_ns']} ns, average time: {times['avg_time_ns']} ns")

for method, performance in relative_performance.items():
    print(f"{method} performance relative to 'at': {performance:.2f}%")

 

Results

        Average time     Relative to at
at      1593.416 ns      100%
iat     5735.292 ns      360%
iloc    7973.834 ns      500%
loc     195609.041 ns    12276%

From fastest to slowest, the measurements came out as at > iat > iloc >>> loc.

The gap between at and loc is especially large.

It will vary with the shape of the data, but if you have a performance problem, use at and iat instead of iloc and loc.
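To make the four accessors concrete, here is a minimal sketch on a small toy frame. The labels and values are made up for illustration and are not from the benchmark above. at and iat only ever return a single scalar, whereas loc and iloc also have to handle slices, lists, and masks, which is a plausible reason for the extra overhead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the row and column labels are illustrative only.
df = pd.DataFrame(
    np.arange(9, dtype=float).reshape(3, 3),
    index=["r0", "r1", "r2"],
    columns=["a", "b", "c"],
)

# Label-based access: loc is general-purpose, at is scalar-only.
v_loc = df.loc["r1", "b"]
v_at = df.at["r1", "b"]

# Position-based access: iloc is general-purpose, iat is scalar-only.
v_iloc = df.iloc[1, 1]
v_iat = df.iat[1, 1]

# All four expressions read the same cell.
print(v_loc, v_at, v_iloc, v_iat)
```

In a hot loop that touches one cell at a time, swapping loc for at (or iloc for iat) is usually a drop-in change as long as both the row and the column key are scalars.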

 

2. Per-cell DataFrame updates → replacing row data with a NumPy array

import pandas as pd
import numpy as np
import time

# Set the data size
rows = 50000
cols = 1000

# Create the initial DataFrame and the NumPy array of new values
df = pd.DataFrame(np.random.uniform(0, 100, size=(rows, cols)))
new_values = np.random.uniform(0, 100, size=(rows, cols))

# Set how many rows to pick at random, and pick their indices
num_rows_to_update = 5000
random_indices = np.random.choice(rows, num_rows_to_update, replace=False)

# Update the DataFrame cell by cell (using at)
start_time_df_update_at = time.perf_counter_ns()
for i in random_indices:
    for j in range(cols):
        df.at[i, j] = new_values[i, j]
end_time_df_update_at = time.perf_counter_ns()

# Replace whole rows with a NumPy array (using iloc)
start_time_df_update_iloc = time.perf_counter_ns()
for i in random_indices:
    new_array = np.array(new_values[i])
    df.iloc[i] = new_array
end_time_df_update_iloc = time.perf_counter_ns()

# Compute and print the results
total_time_df_update_at = end_time_df_update_at - start_time_df_update_at
total_time_df_update_iloc = end_time_df_update_iloc - start_time_df_update_iloc

average_time_at = total_time_df_update_at / num_rows_to_update
average_time_iloc = total_time_df_update_iloc / num_rows_to_update
relative_time_percentage = (average_time_at / average_time_iloc) * 100

print("Average time per updating Pandas (cell) 'at'   method: {} ns".format(average_time_at))
print("Average time per updating Pandas (row)  'iloc' method: {} ns".format(average_time_iloc))
print("Relative time of 'at' method to 'iloc' method: {:.2f}%".format(relative_time_percentage))


 

Results

                   Average time (per updated row)   Relative percent
Row update         10277 ns                         100%
Per-cell update    4333883 ns                       42166.98%

 

It will vary with the shape of the data, but if you have a performance problem, try building the row as a NumPy array and replacing it instead of updating cell by cell.
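As a small sanity check that the two update styles are interchangeable, the sketch below applies both to copies of the same frame and compares the result. The sizes, the seed, and the target_rows list are assumptions chosen to keep the example tiny; they are not the benchmark settings.

```python
import numpy as np
import pandas as pd

rows, cols = 100, 8  # small illustrative sizes, not the benchmark sizes
rng = np.random.default_rng(0)
df_cell = pd.DataFrame(rng.uniform(0, 100, size=(rows, cols)))
df_row = df_cell.copy()
new_values = rng.uniform(0, 100, size=(rows, cols))
target_rows = [3, 42, 77]  # hypothetical rows to overwrite

# Cell by cell: one call into pandas per cell.
for i in target_rows:
    for j in range(cols):
        df_cell.at[i, j] = new_values[i, j]

# Row by row: one assignment per row, passing a whole NumPy array.
for i in target_rows:
    df_row.iloc[i] = new_values[i]

# Both approaches should leave the frames with identical contents.
same = df_cell.equals(df_row)
print(same)
```

The row version makes one pandas call per row instead of one per cell, so the number of Python-level calls drops by a factor equal to the column count.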

3. Performance difference between Pandas DataFrame and NumPy ndarray

import pandas as pd
import numpy as np
import time

# Set the data size
rows = 500000
cols = 1000

# Initialize the NumPy ndarray and the Pandas DataFrame
np_array = np.random.uniform(0, 100, size=(rows, cols))
df = pd.DataFrame(np_array)

# NumPy ndarray read performance test (single total over all elements)
start_time_np_read = time.perf_counter()
_ = np_array.sum()
end_time_np_read = time.perf_counter()

# Pandas DataFrame read performance test (df.sum() returns per-column sums)
start_time_df_read = time.perf_counter()
_ = df.sum()
end_time_df_read = time.perf_counter()

# NumPy ndarray write performance test
new_values_np = np.random.uniform(0, 100, size=(rows, cols))
start_time_np_write = time.perf_counter()
np_array[:] = new_values_np
end_time_np_write = time.perf_counter()

# Pandas DataFrame write performance test
new_values_df = np.random.uniform(0, 100, size=(rows, cols))
start_time_df_write = time.perf_counter()
df.iloc[:] = new_values_df
end_time_df_write = time.perf_counter()

# Print the results
np_read_time = (end_time_np_read - start_time_np_read) * 1e9  # convert to ns
df_read_time = (end_time_df_read - start_time_df_read) * 1e9
np_write_time = (end_time_np_write - start_time_np_write) * 1e9
df_write_time = (end_time_df_write - start_time_df_write) * 1e9

read_time_percentage = (df_read_time / np_read_time) * 100
write_time_percentage = (df_write_time / np_write_time) * 100

print("NumPy read time: {:.0f} ns".format(np_read_time))
print("DataFrame read time: {:.0f} ns ({:.2f}% of NumPy)".format(df_read_time, read_time_percentage))
print("NumPy write time: {:.0f} ns".format(np_write_time))
print("DataFrame write time: {:.0f} ns ({:.2f}% of NumPy)".format(df_write_time, write_time_percentage))

 

Conclusion

In my case, this method gave the largest performance improvement.

Rows: 500K, columns: 1K

                Total time       Percent relative to NumPy
NumPy read      0.181773 sec     100.00%
Pandas read     0.522033 sec     287.19%
NumPy write     1.233984 sec     100.00%
Pandas write    35.092410 sec    2843.83%

 

Rows: 5K, columns: 1K

                Total time      Percent relative to NumPy
NumPy read      0.001558 sec    100.00%
Pandas read     0.005912 sec    379.38%
NumPy write     0.001462 sec    100.00%
Pandas write    0.001526 sec    104.37%
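One way to apply this in existing Pandas code is to pull the values out into an ndarray once, do the frequent reads and writes there, and wrap the result back into a DataFrame only at the end. The following is a minimal sketch under that assumption; the toy frame and the specific updates are hypothetical and are not the project's actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the real data.
df = pd.DataFrame(
    np.arange(12, dtype=float).reshape(4, 3),
    columns=["a", "b", "c"],
)

# Pull the values out once; copy=True leaves the original frame untouched.
arr = df.to_numpy(copy=True)

# Do the frequent reads and writes on the ndarray with plain integer indexing.
arr[2, 1] = -1.0       # single-cell write
arr[0] = arr[0] * 10   # whole-row write
total = arr.sum()      # read over all elements

# Wrap the result back into a DataFrame only when labels are needed again.
result = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(total)
print(result)
```

This only works cleanly when every column shares one numeric dtype; with mixed column types, to_numpy() falls back to an object array and most of the speed advantage is lost.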

 

 

In the actual project

1. Replacing iloc and loc with at

→ 15% performance improvement

2. Per-cell updates → per-row updates with numpy.array

→ an additional 10% performance improvement

3. Replacing the Pandas DataFrame with a NumPy ndarray

→ 70% performance improvement overall