[Python] Improving Pandas and NumPy Performance (feat. Pandas vs NumPy)

Language/Python · 2024. 5. 6. 16:31 · @PyTong

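This post describes how I fixed abnormally long response times at work when processing large log files with Pandas and NumPy.

3-line summary up front

Among the Pandas index accessors, at is the fastest.
Instead of updating a Pandas DataFrame cell by cell, building the row as a NumPy array and replacing the whole row is faster.
Replacing a Pandas DataFrame with a NumPy ndarray can be faster.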

Pandas vs Numpy

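When you process matrix data in Python, Pandas is usually the first thing that turns up in searches, and it has plenty of examples.

However, if reads and writes are frequent, NumPy's ndarray performs better than the Pandas DataFrame.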

https://www.geeksforgeeks.org/difference-between-pandas-vs-numpy/

Difference between Pandas VS NumPy - GeeksforGeeks

I translated and summarized the table from the article above.

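	Pandas	NumPy
Data type	Tabular data	Numerical data
Main tools	DataFrame, Series	ndarray, array
Memory usage	Consumes more memory	Memory efficient
Performance	Better with 500K rows or more	Better with 50K rows or fewer
Indexing speed	Very slow	Very fast
Data structure	2D table object	Multidimensional array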

The performance gap shows up especially with frequent indexing and frequent data reads and writes.
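To get a feel for the indexing gap on a machine with modest memory, a scaled-down sketch like the following may help. The array shape, the chosen cell, and the repetition count are arbitrary assumptions made for illustration, and the absolute numbers will vary by pandas version and hardware.

```python
import timeit

import numpy as np
import pandas as pd

# Small illustrative data set (sizes are arbitrary, chosen to run quickly).
rng = np.random.default_rng(0)
arr = rng.uniform(0, 100, size=(1_000, 50))
df = pd.DataFrame(arr.copy())

row, col = 123, 7

# Both containers hold the same value at the same position.
assert df.iat[row, col] == arr[row, col]

# Time 10,000 scalar reads from each container.
n = 10_000
t_df = timeit.timeit(lambda: df.iat[row, col], number=n)
t_np = timeit.timeit(lambda: arr[row, col], number=n)

print(f"DataFrame.iat : {t_df / n * 1e9:.0f} ns per read")
print(f"ndarray index : {t_np / n * 1e9:.0f} ns per read")
```

The sketch only reads a single fixed cell, so it isolates per-access overhead rather than memory access patterns.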

Approaches tried

1. Performance differences between Pandas cell accessors

import pandas as pd
import numpy as np
import time

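# Create a 500,000 x 10,000 DataFrame of floating-point numbers
# (float64 at this shape needs about 40 GB of RAM; shrink it to run on a typical machine)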
data = np.random.uniform(0, 100, size=(500000, 10000))
df = pd.DataFrame(data)

# Generate random indices to use with each accessor
num_accesses = 1000  # number of accesses
random_rows = np.random.randint(0, 500000, size=num_accesses)
random_cols = np.random.randint(0, 10000, size=num_accesses)

# Measure the time for each method
methods = ['loc', 'iloc', 'at', 'iat']
results = {}

for method in methods:
    start_time = time.perf_counter_ns()
    if method == 'loc':
        for i in range(num_accesses):
            value = df.loc[random_rows[i], random_cols[i]]
    elif method == 'iloc':
        for i in range(num_accesses):
            value = df.iloc[random_rows[i], random_cols[i]]
    elif method == 'at':
        for i in range(num_accesses):
            value = df.at[random_rows[i], random_cols[i]]
    elif method == 'iat':
        for i in range(num_accesses):
            value = df.iat[random_rows[i], random_cols[i]]
    end_time = time.perf_counter_ns()
    total_time_ns = end_time - start_time
    avg_time_ns = total_time_ns / num_accesses
    results[method] = {'total_time_ns': total_time_ns, 'avg_time_ns': avg_time_ns}

# Compute each method's time as a percentage of the 'at' method's time
at_total_time = results['at']['total_time_ns']
relative_performance = {method: (results[method]['total_time_ns'] / at_total_time) * 100 for method in methods}

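# Print results
for method, times in results.items():
    # A bare space is not a valid format spec for a string, so pad with an explicit width instead
    print(f"{method:<4} method total time: {times['total_time_ns']} ns, average time: {times['avg_time_ns']} ns")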

for method, performance in relative_performance.items():
    print(f"{method} performance relative to 'at': {performance:.2f}%")

Results

	Average time	Time relative to at
at	1593.416 ns	100%
iat	5735.292 ns	360%
iloc	7973.834 ns	500%
loc	195609.041 ns	12276%

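The measured ranking, fastest to slowest, was at > iat > iloc >>> loc.

The gap between at and loc is especially large: loc took roughly 123 times as long as at.

Results will vary with the shape of the data, but if you have a performance problem, use at and iat rather than iloc and loc.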
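The benchmark above uses the default integer index, where labels and positions coincide. The sketch below shows how the four accessors relate once the index carries real labels: at and loc take labels, iat and iloc take integer positions, and at and iat accept only a single cell. The table contents and the label names are hypothetical, made up purely for illustration.

```python
import pandas as pd

# Hypothetical log table with string row labels (not from the original benchmark).
df = pd.DataFrame(
    {"status": [200, 404, 500], "latency_ms": [12.5, 3.1, 250.0]},
    index=["req-a", "req-b", "req-c"],
)

# Label-based access: row label and column name.
by_loc = df.loc["req-b", "latency_ms"]
by_at = df.at["req-b", "latency_ms"]

# Position-based access: row 1, column 1.
by_iloc = df.iloc[1, 1]
by_iat = df.iat[1, 1]

# All four expressions address the same cell.
assert by_loc == by_at == by_iloc == by_iat == 3.1

# at and iat also support scalar assignment.
df.at["req-b", "status"] = 200
df.iat[2, 1] = 99.0
```

Because at and iat handle only one scalar at a time, they are a drop-in replacement for loc and iloc only where the code reads or writes a single cell.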

2. Per-cell DataFrame updates → replacing row data with a NumPy array

import pandas as pd
import numpy as np
import time

# Data size settings
rows = 50000
cols = 1000

# Create the initial DataFrame and the NumPy array of new values
df = pd.DataFrame(np.random.uniform(0, 100, size=(rows, cols)))
new_values = np.random.uniform(0, 100, size=(rows, cols))

# Number of rows to pick at random, and their row indices
num_rows_to_update = 5000
random_indices = np.random.choice(rows, num_rows_to_update, replace=False)

# Update the DataFrame cell by cell (using at)
start_time_df_update_at = time.perf_counter_ns()
for i in random_indices:
    for j in range(cols):
        df.at[i, j] = new_values[i, j]
end_time_df_update_at = time.perf_counter_ns()

# Replace whole rows with a NumPy array (using iloc)
start_time_df_update_iloc = time.perf_counter_ns()
for i in random_indices:
    new_array = np.array(new_values[i])
    df.iloc[i] = new_array
end_time_df_update_iloc = time.perf_counter_ns()

# Compute and print results
total_time_df_update_at = end_time_df_update_at - start_time_df_update_at
total_time_df_update_iloc = end_time_df_update_iloc - start_time_df_update_iloc

average_time_at = total_time_df_update_at / num_rows_to_update
average_time_iloc = total_time_df_update_iloc / num_rows_to_update
relative_time_percentage = (average_time_at / average_time_iloc) * 100

print("Average time per updating Pandas (cell) 'at'   method: {} ns".format(average_time_at))
print("Average time per updating Pandas (row)  'iloc' method: {} ns".format(average_time_iloc))
print("Relative time of 'at' method to 'iloc' method: {:.2f}%".format(relative_time_percentage))

Results

	Average time	Relative percentage
Row update	10277 ns	100%
Per-cell update	4333883 ns	42166.98%

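Results will vary with the shape of the data, but if you have a performance problem, try building a NumPy array and replacing the row with it instead of updating cell by cell.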
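The same idea can be pushed one step further by replacing all selected rows in a single assignment instead of looping row by row. This variant was not part of the measurements above, so treat it as an untested-for-speed sketch; the sizes are deliberately small and arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rows, cols = 1_000, 20  # deliberately small; sizes are illustrative

df = pd.DataFrame(rng.uniform(0, 100, size=(rows, cols)))
new_values = rng.uniform(0, 100, size=(rows, cols))
before = df.to_numpy().copy()  # snapshot to check untouched rows later

# Pick 100 distinct rows to overwrite.
random_indices = rng.choice(rows, size=100, replace=False)

# One assignment replaces all selected rows; the right-hand side is a
# (100, cols) ndarray whose rows line up with random_indices.
df.iloc[random_indices] = new_values[random_indices]

# Selected rows now hold the new values.
assert np.array_equal(df.to_numpy()[random_indices], new_values[random_indices])
```

Whether this beats the row-by-row loop in a given workload depends on how many rows are touched at once, so it is worth measuring on your own data.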

3. Pandas DataFrame vs NumPy ndarray performance

import pandas as pd
import numpy as np
import time

# Data size settings
rows = 500000
cols = 1000

# Initialize the NumPy ndarray and the Pandas DataFrame
np_array = np.random.uniform(0, 100, size=(rows, cols))
df = pd.DataFrame(np_array)

# NumPy ndarray read performance test
start_time_np_read = time.perf_counter()
_ = np_array.sum()
end_time_np_read = time.perf_counter()

# Pandas DataFrame read performance test
start_time_df_read = time.perf_counter()
_ = df.sum()
end_time_df_read = time.perf_counter()

# NumPy ndarray write performance test
new_values_np = np.random.uniform(0, 100, size=(rows, cols))
start_time_np_write = time.perf_counter()
np_array[:] = new_values_np
end_time_np_write = time.perf_counter()

# Pandas DataFrame write performance test
new_values_df = np.random.uniform(0, 100, size=(rows, cols))
start_time_df_write = time.perf_counter()
df.iloc[:] = new_values_df
end_time_df_write = time.perf_counter()

# Print results
np_read_time = (end_time_np_read - start_time_np_read) * 1e9  # convert seconds to ns
df_read_time = (end_time_df_read - start_time_df_read) * 1e9
np_write_time = (end_time_np_write - start_time_np_write) * 1e9
df_write_time = (end_time_df_write - start_time_df_write) * 1e9

read_time_percentage = (df_read_time / np_read_time) * 100
write_time_percentage = (df_write_time / np_write_time) * 100

print("NumPy read time: {:.0f} ns".format(np_read_time))
print("DataFrame read time: {:.0f} ns ({:.2f}% of NumPy)".format(df_read_time, read_time_percentage))
print("NumPy write time: {:.0f} ns".format(np_write_time))
print("DataFrame write time: {:.0f} ns ({:.2f}% of NumPy)".format(df_write_time, write_time_percentage))

Conclusion

For me, this approach gave the largest performance gain.

Rows: 500K, columns: 1K

	Total time	Relative to NumPy
NumPy read	0.181773 sec	100.00%
Pandas read	0.522033 sec	287.19%
NumPy write	1.233984 sec	100.00%
Pandas write	35.092410 sec	2843.83%

Rows: 5K, columns: 1K

	Total time	Relative to NumPy
NumPy read	0.001558 sec	100.00%
Pandas read	0.005912 sec	379.38%
NumPy write	0.001462 sec	100.00%
Pandas write	0.001526 sec	104.37%
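One thing to keep in mind when reading the read rows: in the benchmark code, np_array.sum() reduces the whole array to a single scalar, while df.sum() returns one sum per column, so the two timings do not measure exactly the same operation. A minimal sketch of the difference, on a tiny array chosen only for illustration:

```python
import numpy as np
import pandas as pd

arr = np.arange(6, dtype=float).reshape(3, 2)  # [[0, 1], [2, 3], [4, 5]]
df = pd.DataFrame(arr)

total_np = arr.sum()          # single scalar over all elements
per_col_df = df.sum()         # Series with one sum per column
per_col_np = arr.sum(axis=0)  # ndarray with one sum per column

assert total_np == 15.0
assert np.array_equal(per_col_df.to_numpy(), per_col_np)  # column sums: 6.0 and 9.0
assert df.to_numpy().sum() == total_np
```

For a like-for-like comparison, one option is to time arr.sum(axis=0) against df.sum().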

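In the actual project

1. Replacing iloc and loc with at

→ 15% performance improvement

2. Per-cell updates → row-wise updates with numpy.array

→ an additional 10% performance improvement

3. Replacing the Pandas DataFrame with a NumPy ndarray

→ 70% performance improvement overall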
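The third step can be sketched as follows. This is only one possible shape for the change, not the actual project code: the table, its labels, and the update applied in the loop are hypothetical. The idea is to copy the values into an ndarray once, translate labels to integer positions up front, do the frequent reads and writes on the ndarray, and rebuild a DataFrame only when pandas features are needed again.

```python
import numpy as np
import pandas as pd

# Hypothetical labeled table standing in for parsed log data.
df = pd.DataFrame(
    np.arange(12, dtype=float).reshape(4, 3),
    index=["r0", "r1", "r2", "r3"],
    columns=["a", "b", "c"],
)

# Copy the values out once; copy=True guarantees arr does not alias df.
arr = df.to_numpy(copy=True)

# Translate labels to integer positions once, outside the hot loop.
row_pos = {label: i for i, label in enumerate(df.index)}
col_pos = {label: j for j, label in enumerate(df.columns)}

# Hot loop: plain ndarray indexing instead of DataFrame accessors.
for label in ["r1", "r3"]:
    arr[row_pos[label], col_pos["b"]] += 100.0

# Rebuild a DataFrame with the original labels when pandas features are needed.
result = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(result)
```

This pattern fits best when every column shares one numeric dtype; with mixed column types, to_numpy() falls back to a common dtype (often object), which can erase much of the benefit.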
