At work I process large log files with Pandas and NumPy. This post describes how I fixed abnormally long response times in that pipeline.
3-line summary
- Among the Pandas index accessors, at is the fastest.
- Building a row as a NumPy array and replacing the whole row is faster than updating a Pandas DataFrame cell by cell.
- Replacing a Pandas DataFrame with a NumPy ndarray can be faster still.
Pandas vs NumPy
When you work with matrix data in Python, Pandas is usually the first thing that turns up in a search, and examples are plentiful.
However, if reads and writes are frequent, NumPy's ndarray performs better than Pandas' DataFrame.
https://www.geeksforgeeks.org/difference-between-pandas-vs-numpy/
I translated and summarized the table from the article above.
| | Pandas | NumPy |
|---|---|---|
| Data type | Tabular data | Numeric data |
| Main tools | DataFrame, Series | ndarray, array |
| Memory usage | Consumes more memory | Memory efficient |
| Performance | Better with 500K or more rows | Better with 50K or fewer rows |
| Indexing speed | Very slow | Very fast |
| Data structure | 2D table object | Multidimensional arrays |
The performance gap shows up especially with frequent indexing and frequent data reads and writes.
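To make the memory and indexing rows of the table concrete, here is a minimal sketch that reads the same cell through a DataFrame and through its underlying ndarray, and compares the reported memory footprint. The frame size, the seed, and the names `arr`, `array_bytes`, and `frame_bytes` are illustrative assumptions, not values from the original measurements.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the size is arbitrary, not the benchmark size.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 100, size=(1000, 50)))

# For a frame holding a single numeric dtype, to_numpy() returns the values as a 2-D ndarray.
arr = df.to_numpy()

# Same cell, two access paths: plain ndarray indexing vs. DataFrame.iat.
row, col = 123, 7
value_from_array = arr[row, col]
value_from_frame = df.iat[row, col]

# Bytes used by the raw values vs. the DataFrame, which also stores its index.
array_bytes = arr.nbytes
frame_bytes = int(df.memory_usage(index=True).sum())

print(value_from_array == value_from_frame)
print(array_bytes, frame_bytes)
```

On an all-float frame like this one the memory difference is small; the gap in the table is more likely to matter for mixed or object dtypes, which this sketch does not cover.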
Methods tried
1. Performance difference between Pandas cell accessors
```python
import pandas as pd
import numpy as np
import time

# Create a 500,000 x 10,000 DataFrame of floating-point numbers.
# Note: as float64 this needs about 40 GB of RAM; shrink the shape to run it on a smaller machine.
data = np.random.uniform(0, 100, size=(500000, 10000))
df = pd.DataFrame(data)

# Generate random indices to use with each accessor
num_accesses = 1000  # number of accesses
random_rows = np.random.randint(0, 500000, size=num_accesses)
random_cols = np.random.randint(0, 10000, size=num_accesses)

# Measure the time for each method
methods = ['loc', 'iloc', 'at', 'iat']
results = {}
for method in methods:
    start_time = time.perf_counter_ns()
    if method == 'loc':
        for i in range(num_accesses):
            value = df.loc[random_rows[i], random_cols[i]]
    elif method == 'iloc':
        for i in range(num_accesses):
            value = df.iloc[random_rows[i], random_cols[i]]
    elif method == 'at':
        for i in range(num_accesses):
            value = df.at[random_rows[i], random_cols[i]]
    elif method == 'iat':
        for i in range(num_accesses):
            value = df.iat[random_rows[i], random_cols[i]]
    end_time = time.perf_counter_ns()
    total_time_ns = end_time - start_time
    avg_time_ns = total_time_ns / num_accesses
    results[method] = {'total_time_ns': total_time_ns, 'avg_time_ns': avg_time_ns}

# Compute each method's time as a percentage of the 'at' method
at_total_time = results['at']['total_time_ns']
relative_performance = {method: (results[method]['total_time_ns'] / at_total_time) * 100 for method in methods}

# Print the results (the original "{method: }" format spec raises ValueError for strings)
for method, times in results.items():
    print(f"{method} method total time: {times['total_time_ns']} ns, average time: {times['avg_time_ns']} ns")
for method, performance in relative_performance.items():
    print(f"{method} performance relative to 'at': {performance:.2f}%")
```
Results
| | Average time | Relative to at |
|---|---|---|
| at | 1593.416 ns | 100% |
| iat | 5735.292 ns | 360% |
| iloc | 7973.834 ns | 500% |
| loc | 195609.041 ns | 12276% |
The measured order, fastest to slowest, was at > iat > iloc >>> loc.
The gap between at and loc was especially large.
It will depend on the shape of the data, but if performance is an issue, use at and iat instead of iloc and loc.
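The accessor comparison can also be tried on a much smaller frame with the standard-library `timeit` module, which repeats each access many times and so smooths out one-off noise. This is a sketch under assumed sizes (1000 x 100, 2000 repetitions), not a rerun of the measurement above; the ordering it prints may differ by machine, pandas version, and data shape.

```python
import timeit
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 100, size=(1000, 100)))  # small, illustrative size

# With the default RangeIndex, labels and positions coincide,
# so the same (row, col) pair is valid for all four accessors.
row, col = 500, 42

# All four accessors return the same scalar for a single cell.
values = {name: getattr(df, name)[row, col] for name in ("loc", "iloc", "at", "iat")}

# Time each accessor; the default argument pins the accessor name for each lambda.
repeats = 2000
timings = {
    name: timeit.timeit(lambda name=name: getattr(df, name)[row, col], number=repeats)
    for name in values
}

# Print from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds / repeats * 1e9:.0f} ns per access")
```

at and iat only accept a single row/column pair, whereas loc and iloc also handle slices, lists, and boolean masks, so the swap only applies to scalar access.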
2. Per-cell DataFrame updates → replacing row data with a NumPy array
```python
import pandas as pd
import numpy as np
import time

# Data size settings
rows = 50000
cols = 1000

# Create the initial DataFrame and the NumPy array of new values
df = pd.DataFrame(np.random.uniform(0, 100, size=(rows, cols)))
new_values = np.random.uniform(0, 100, size=(rows, cols))

# Randomly choose the indices of the rows to modify
num_rows_to_update = 5000
random_indices = np.random.choice(rows, num_rows_to_update, replace=False)

# Update the DataFrame cell by cell (using at)
start_time_df_update_at = time.perf_counter_ns()
for i in random_indices:
    for j in range(cols):
        df.at[i, j] = new_values[i, j]
end_time_df_update_at = time.perf_counter_ns()

# Replace each row with a NumPy array (using iloc)
start_time_df_update_iloc = time.perf_counter_ns()
for i in random_indices:
    new_array = np.array(new_values[i])
    df.iloc[i] = new_array
end_time_df_update_iloc = time.perf_counter_ns()

# Compute and print the results (averages are per updated row)
total_time_df_update_at = end_time_df_update_at - start_time_df_update_at
total_time_df_update_iloc = end_time_df_update_iloc - start_time_df_update_iloc
average_time_at = total_time_df_update_at / num_rows_to_update
average_time_iloc = total_time_df_update_iloc / num_rows_to_update
relative_time_percentage = (average_time_at / average_time_iloc) * 100
print("Average time per updating Pandas (cell) 'at' method: {} ns".format(average_time_at))
print("Average time per updating Pandas (row) 'iloc' method: {} ns".format(average_time_iloc))
print("Relative time of 'at' method to 'iloc' method: {:.2f}%".format(relative_time_percentage))
```
Results
| | Average time | Relative percentage |
|---|---|---|
| Row update | 10277 ns | 100% |
| Per-cell update | 4333883 ns | 42166.98% |
It will depend on the shape of the data, but if performance is an issue, try building a NumPy array and replacing the row with it instead of updating cell by cell.
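A possible next step, which the measurements above did not cover, is to replace all selected rows in a single assignment instead of looping row by row. The sketch below assumes a small frame (2000 x 50) and checks only that the batched form produces the same result as the per-row loop; it makes no claim about how much faster it is.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rows, cols = 2000, 50  # small illustrative size, not the benchmark size
df = pd.DataFrame(rng.uniform(0, 100, size=(rows, cols)))
new_values = rng.uniform(0, 100, size=(rows, cols))
random_indices = rng.choice(rows, size=200, replace=False)

# Row-by-row replacement, mirroring the iloc loop in the benchmark.
df_rowwise = df.copy()
for i in random_indices:
    df_rowwise.iloc[i] = new_values[i]

# Assumption: the same update written as one assignment over all selected rows.
df_batched = df.copy()
df_batched.iloc[random_indices] = new_values[random_indices]

# Both approaches should leave the frame in the same state.
print(df_rowwise.equals(df_batched))
```

The batched form crosses the Pandas indexing layer once rather than once per row, which is the same idea as moving from per-cell to per-row updates.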
3. Pandas DataFrame vs NumPy ndarray performance
```python
import pandas as pd
import numpy as np
import time

# Data size settings
rows = 500000
cols = 1000

# Initialize the NumPy ndarray and the Pandas DataFrame
np_array = np.random.uniform(0, 100, size=(rows, cols))
df = pd.DataFrame(np_array)

# NumPy ndarray read performance test
start_time_np_read = time.perf_counter()
_ = np_array.sum()
end_time_np_read = time.perf_counter()

# Pandas DataFrame read performance test
start_time_df_read = time.perf_counter()
_ = df.sum()
end_time_df_read = time.perf_counter()

# NumPy ndarray write performance test
new_values_np = np.random.uniform(0, 100, size=(rows, cols))
start_time_np_write = time.perf_counter()
np_array[:] = new_values_np
end_time_np_write = time.perf_counter()

# Pandas DataFrame write performance test
new_values_df = np.random.uniform(0, 100, size=(rows, cols))
start_time_df_write = time.perf_counter()
df.iloc[:] = new_values_df
end_time_df_write = time.perf_counter()

# Print the results
np_read_time = (end_time_np_read - start_time_np_read) * 1e9  # convert to ns
df_read_time = (end_time_df_read - start_time_df_read) * 1e9
np_write_time = (end_time_np_write - start_time_np_write) * 1e9
df_write_time = (end_time_df_write - start_time_df_write) * 1e9
read_time_percentage = (df_read_time / np_read_time) * 100
write_time_percentage = (df_write_time / np_write_time) * 100
print("NumPy read time: {:.0f} ns".format(np_read_time))
print("DataFrame read time: {:.0f} ns ({:.2f}% of NumPy)".format(df_read_time, read_time_percentage))
print("NumPy write time: {:.0f} ns".format(np_write_time))
print("DataFrame write time: {:.0f} ns ({:.2f}% of NumPy)".format(df_write_time, write_time_percentage))
```
Conclusion
In my case, this approach gave the largest performance gain.
Rows: 500K, Cols: 1K
| | Total time | Relative to NumPy |
|---|---|---|
| NumPy read | 0.181773 sec | 100.00% |
| Pandas read | 0.522033 sec | 287.19% |
| NumPy write | 1.233984 sec | 100.00% |
| Pandas write | 35.092410 sec | 2843.83% |
Rows: 5K, Cols: 1K
| | Total time | Relative to NumPy |
|---|---|---|
| NumPy read | 0.001558 | 100.00% |
| Pandas read | 0.005912 | 379.38% |
| NumPy write | 0.001462 | 100.00% |
| Pandas write | 0.001526 | 104.37% |
In the actual project
1. Replacing iloc and loc with at
→ 15% performance improvement
2. Per-cell updates → per-row updates with numpy.array
→ an additional 10% performance improvement
3. Replacing the Pandas DataFrame with a NumPy ndarray
→ 70% performance improvement overall
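One way the third step might look in practice is sketched below: extract the values once, keep label-to-position lookups in plain dicts, do the reads and writes on the ndarray, and rebuild a DataFrame only where labels are needed again. The frame contents, the labels, and the names `row_pos` and `col_pos` are hypothetical; the project's actual code is not shown in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical labeled frame standing in for the log data.
df = pd.DataFrame(
    np.arange(12, dtype=float).reshape(4, 3),
    index=["a", "b", "c", "d"],
    columns=["x", "y", "z"],
)

# Extract the values once; copy=True keeps the array independent of the frame.
arr = df.to_numpy(copy=True)

# Plain dicts map labels to integer positions, standing in for loc/at lookups.
row_pos = {label: i for i, label in enumerate(df.index)}
col_pos = {label: j for j, label in enumerate(df.columns)}

# Read and write through the ndarray in the hot path.
arr[row_pos["c"], col_pos["y"]] += 100.0

# Rebuild a DataFrame only at the boundary where labels are needed again.
result = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(result.at["c", "y"])
```

This only works cleanly when every column shares one numeric dtype; with mixed dtypes, to_numpy() falls back to an object array and most of the benefit is lost.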