
[KT Aivle 3rd Cohort AI] Day 5. (2) Combining DataFrames

0. Overview

KT Aivle School 3rd Cohort AI, Day 5

  • Topic : DataFrame modification
  • Instructor : 한기영
  • Content : Continuing from Day 4, we covered how to work with Pandas DataFrames. This post summarizes the part on combining DataFrames.

1. Combining DataFrames

1) Two ways to combine DataFrames

(1) pd.concat()

  • Mapping basis : index (rows) and column names (columns)
  • The frames should share the same structure to combine cleanly.
  • Choosing a direction
    • axis=0 : stack vertically (by rows) (default)
    • axis=1 : join horizontally (by columns)
  • Choosing a method
    • join='outer' : keep all rows and columns (default)
    • join='inner' : keep only the rows and columns that match
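A small supplementary sketch (toy frames, not the course data) of two details implied above: concat aligns by column name, not position, and `keys=` can tag which frame each row came from.

```python
import pandas as pd

# Two toy frames (made-up data, not the course dataset)
a = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
b = pd.DataFrame({'B': [7, 8], 'A': [5, 6]})  # same columns, different order

# concat aligns by column NAME, not position, so the differing order is fine
stacked = pd.concat([a, b], axis=0)

# keys= adds an outer index level recording which frame each row came from
labeled = pd.concat([a, b], keys=['first', 'second'])
print(stacked.columns.tolist())   # ['A', 'B']
```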

(2) pd.merge() (used more often than concat; important!)

  • Mapping basis : combines on the values of a specific key column
  • Works like a database table join
  • merge always joins side by side (adds columns)
  • Supports inner/outer/left/right joins
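One case the lecture examples below don't cover: when the key column is named differently in each table, merge takes `left_on`/`right_on` instead of `on`. A minimal sketch with made-up tables:

```python
import pandas as pd

# Toy tables whose key columns have different names (hypothetical data)
orders = pd.DataFrame({'prod_id': [1, 2, 2], 'qty': [10, 20, 5]})
prods = pd.DataFrame({'id': [1, 2], 'name': ['pen', 'cup']})

# When the key is named differently on each side, use left_on / right_on
m = pd.merge(orders, prods, left_on='prod_id', right_on='id', how='inner')
print(m['name'].tolist())   # ['pen', 'cup', 'cup']
```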

2) pivot

  • Use the pivot function to reshape aggregated data
  • `df.pivot(index, columns, values)`
  • pivot does not combine DataFrames; it reshapes a DataFrame's structure.
    • Aggregate with groupby first, then pivot(index, columns, values)
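As a side note not covered in the lecture: `pivot_table` can do the groupby-then-pivot flow in a single call, since it aggregates and reshapes at once. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy long-format data (made-up values)
df = pd.DataFrame({
    'Date': ['d1', 'd1', 'd2', 'd2', 'd2'],
    'Category': ['Food', 'Drink', 'Food', 'Food', 'Drink'],
    'Qty': [1, 2, 3, 4, 5],
})

# pivot_table sums Qty per (Category, Date) and spreads Date across columns,
# replacing the separate groupby + pivot steps
pt = df.pivot_table(index='Category', columns='Date', values='Qty', aggfunc='sum')
print(pt.loc['Food', 'd2'])   # 7
```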

2. Practice

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```
```python
sales = pd.read_csv("https://raw.githubusercontent.com/DA4BAM/dataset/master/ts_sales_simple.csv")

sales1 = sales.loc[sales['Store_ID']==1]
sales2 = sales.loc[sales['Store_ID']==2]
sales5 = sales.loc[sales['Store_ID']==5]
sales11 = sales.loc[sales['Store_ID']==11]

products = pd.read_csv("https://raw.githubusercontent.com/DA4BAM/dataset/master/ts_product_master.csv")
stores = pd.read_csv("https://raw.githubusercontent.com/DA4BAM/dataset/master/ts_store_master.csv")
```
```python
sales1.head()
```

|   | Order_ID | Date | Store_ID | Qty | Product_ID |
|---|---|---|---|---|---|
| 0 | 3 | 2013-01-01 | 1 | 0.0 | 3 |
| 1 | 5 | 2013-01-01 | 1 | 0.0 | 5 |
| 2 | 7 | 2013-01-01 | 1 | 0.0 | 7 |
| 3 | 8 | 2013-01-01 | 1 | 0.0 | 8 |
| 4 | 10 | 2013-01-01 | 1 | 0.0 | 10 |
```python
products.head()
```

|   | Product_ID | Product_Code | SubCategory | Category | LeadTime | Price |
|---|---|---|---|---|---|---|
| 0 | 20 | HG001 | Gardening Tools | Household Goods | 2 | 50 |
| 1 | 27 | HH001 | Home Appliances | Household Goods | 2 | 150 |
| 2 | 16 | HK001 | Kitchen | Household Goods | 2 | 23 |
| 3 | 15 | HK002 | Kitchen | Household Goods | 2 | 41 |
| 4 | 32 | GS001 | Seafood | Grocery | 3 | 34 |
```python
stores.head()
```

|   | Store_ID | City | State | Store_Type |
|---|---|---|---|---|
| 0 | 1 | Saint Paul | Minnesota | 4 |
| 1 | 2 | Saint Paul | Minnesota | 4 |
| 2 | 3 | Saint Paul | Minnesota | 4 |
| 3 | 4 | Saint Paul | Minnesota | 4 |
| 4 | 5 | Oklahoma City | Oklahoma | 4 |

1) Combining 1 : appending with concat()

  • Let's create some data.
```python
df1 = pd.DataFrame({'A':[10,25], 'B':[15,30]})
df2 = pd.DataFrame({'A':[20,30, 50], 'C':[35,30, 40]})
df2.drop([1], inplace = True)

display(df1)
display(df2)
```

|   | A | B |
|---|---|---|
| 0 | 10 | 15 |
| 1 | 25 | 30 |

|   | A | C |
|---|---|---|
| 0 | 20 | 35 |
| 2 | 50 | 40 |

(1) Stacking vertically : axis=0

  • join = 'inner'

```python
pd.concat([df1, df2], axis = 0, join = 'inner')
```

|   | A |
|---|---|
| 0 | 10 |
| 1 | 25 |
| 0 | 20 |
| 2 | 50 |
  • join = 'outer'

```python
pd.concat([df1, df2], axis = 0, join = 'outer')
```

|   | A | B | C |
|---|---|---|---|
| 0 | 10 | 15.0 | NaN |
| 1 | 25 | 30.0 | NaN |
| 0 | 20 | NaN | 35.0 |
| 2 | 50 | NaN | 40.0 |

(2) Joining horizontally : axis=1

  • join = 'inner'

```python
pd.concat([df1, df2], axis = 1, join = 'inner')
```

|   | A | B | A | C |
|---|---|---|---|---|
| 0 | 10 | 15 | 20 | 35 |
  • join = 'outer'

```python
pd.concat([df1, df2], axis = 1, join = 'outer')
```

|   | A | B | A | C |
|---|---|---|---|---|
| 0 | 10.0 | 15.0 | 20.0 | 35.0 |
| 1 | 25.0 | 30.0 | NaN | NaN |
| 2 | NaN | NaN | 50.0 | 40.0 |

(3) Example

[Q1] Combine the sales1, 2, 5, and 11 DataFrames into one. What is the best way to combine them?

```python
sales0 = pd.concat([sales1, sales2, sales5, sales11], axis=0, join='outer')
sales0
```

|   | Order_ID | Date | Store_ID | Qty | Product_ID |
|---|---|---|---|---|---|
| 0 | 3 | 2013-01-01 | 1 | 0.000 | 3 |
| 1 | 5 | 2013-01-01 | 1 | 0.000 | 5 |
| 2 | 7 | 2013-01-01 | 1 | 0.000 | 7 |
| 3 | 8 | 2013-01-01 | 1 | 0.000 | 8 |
| 4 | 10 | 2013-01-01 | 1 | 0.000 | 10 |
| ... | ... | ... | ... | ... | ... |
| 24340 | 53548 | 2013-01-31 | 11 | 10.000 | 22 |
| 24341 | 53550 | 2013-01-31 | 11 | 641.507 | 24 |
| 24342 | 53553 | 2013-01-31 | 11 | 0.000 | 27 |
| 24343 | 53555 | 2013-01-31 | 11 | 27.000 | 29 |
| 24344 | 53558 | 2013-01-31 | 11 | 13.000 | 32 |

1860 rows × 5 columns
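A quick sanity check worth doing after a row-wise concat like the one above: the combined length should equal the sum of the parts, and the row labels (kept from the original slices) can be reset. Sketched here with toy stand-ins for the store slices:

```python
import pandas as pd

# Toy stand-ins for the per-store slices (hypothetical values)
s1 = pd.DataFrame({'Store_ID': [1, 1, 1], 'Qty': [1.0, 2.0, 3.0]})
s2 = pd.DataFrame({'Store_ID': [2, 2], 'Qty': [4.0, 5.0]})

combined = pd.concat([s1, s2], axis=0)

# Row-wise concat should neither lose nor duplicate rows: lengths must add up
assert len(combined) == len(s1) + len(s2)

# Each slice keeps its original row labels (0,1,2,0,1); reset for a clean index
combined = combined.reset_index(drop=True)
print(combined.index.tolist())   # [0, 1, 2, 3, 4]
```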

2) Combining 2 : joining with merge()

  • Performs an inner join by default
```python
df1 = pd.DataFrame({'A':[1,2], 'B':[15,30], 'C':[20, 25]})
df2 = pd.DataFrame({'A':[2,3], 'D':[20, 35]})

display(df1)
display(df2)
```

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 15 | 20 |
| 1 | 2 | 30 | 25 |

|   | A | D |
|---|---|---|
| 0 | 2 | 20 |
| 1 | 3 | 35 |

(1) inner merge

```python
pd.merge(df1, df2, how = 'inner', on = 'A')
```

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 2 | 30 | 25 | 20 |

```python
pd.merge(df1, df2, how = 'inner')
```

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 2 | 30 | 25 | 20 |

```python
pd.merge(df1, df2)
```

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 2 | 30 | 25 | 20 |

(2) left merge

```python
pd.merge(df1, df2, how = 'left')
```

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 15 | 20 | NaN |
| 1 | 2 | 30 | 25 | 20.0 |

(3) right merge

```python
pd.merge(df1, df2, how = 'right')
```

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 2 | 30.0 | 25.0 | 20 |
| 1 | 3 | NaN | NaN | 35 |

3) Pivot

  • pivot : not a form of combining, but often used to reshape an aggregated DataFrame for easier viewing
  • Steps
      1. groupby
      2. pivot
```python
# 1) Aggregate store 1's daily sales quantity by category

temp = pd.merge(sales1, products)
temp2 = temp.groupby(['Date', 'Category'], as_index = False)['Qty'].sum()
temp2
```

|   | Date | Category | Qty |
|---|---|---|---|
| 0 | 2013-01-01 | Drink | 0.000 |
| 1 | 2013-01-01 | Food | 0.000 |
| 2 | 2013-01-01 | Grocery | 0.000 |
| 3 | 2013-01-01 | Household Goods | 0.000 |
| 4 | 2013-01-02 | Drink | 1158.000 |
| ... | ... | ... | ... |
| 119 | 2013-01-30 | Household Goods | 932.000 |
| 120 | 2013-01-31 | Drink | 971.000 |
| 121 | 2013-01-31 | Food | 751.766 |
| 122 | 2013-01-31 | Grocery | 1907.797 |
| 123 | 2013-01-31 | Household Goods | 711.000 |

124 rows × 3 columns

```python
# 2) pivot (since pandas 2.0, index/columns/values must be passed as keywords)
temp3 = temp2.pivot(index='Category', columns='Date', values='Qty')
temp3
```

| Date | 2013-01-01 | 2013-01-02 | 2013-01-03 | ... | 2013-01-30 | 2013-01-31 |
|---|---|---|---|---|---|---|
| **Category** | | | | | | |
| Drink | 0.0 | 1158.000 | 985.000 | ... | 1140.000 | 971.000 |
| Food | 0.0 | 1227.652 | 913.699 | ... | 900.092 | 751.766 |
| Grocery | 0.0 | 3305.130 | 2613.685 | ... | 2662.485 | 1907.797 |
| Household Goods | 0.0 | 1070.000 | 836.000 | ... | 932.000 | 711.000 |

4 rows × 31 columns

```python
# This can also be visualized with sns.heatmap!

plt.figure(figsize = (20, 6))
sns.heatmap(temp3)
plt.show()
```

(Figure: heatmap of temp3 — dates on the x-axis, categories on the y-axis, color = daily sales quantity)

4) Examples

```python
sales = pd.read_csv("https://raw.githubusercontent.com/DA4BAM/dataset/master/ts_sales_simple.csv")
products = pd.read_csv("https://raw.githubusercontent.com/DA4BAM/dataset/master/ts_product_master.csv")
stores = pd.read_csv("https://raw.githubusercontent.com/DA4BAM/dataset/master/ts_store_master.csv")
```
```python
sales.head(3)
```

|   | Order_ID | Date | Store_ID | Qty | Product_ID |
|---|---|---|---|---|---|
| 0 | 3 | 2013-01-01 | 1 | 0.0 | 3 |
| 1 | 5 | 2013-01-01 | 1 | 0.0 | 5 |
| 2 | 7 | 2013-01-01 | 1 | 0.0 | 7 |
```python
products.head(3)
```

|   | Product_ID | Product_Code | SubCategory | Category | LeadTime | Price |
|---|---|---|---|---|---|---|
| 0 | 20 | HG001 | Gardening Tools | Household Goods | 2 | 50 |
| 1 | 27 | HH001 | Home Appliances | Household Goods | 2 | 150 |
| 2 | 16 | HK001 | Kitchen | Household Goods | 2 | 23 |
```python
stores.head(3)
```

|   | Store_ID | City | State | Store_Type |
|---|---|---|---|---|
| 0 | 1 | Saint Paul | Minnesota | 4 |
| 1 | 2 | Saint Paul | Minnesota | 4 |
| 2 | 3 | Saint Paul | Minnesota | 4 |

(1) Add an amount (Amt) variable to sales.

  • Amt = Qty * Price
```python
# Merge using only the Product_ID and Price columns
temp = pd.merge(sales, products.loc[:, ['Product_ID', 'Price']], how='inner', on='Product_ID')
temp.head(3)
```

|   | Order_ID | Date | Store_ID | Qty | Product_ID | Price |
|---|---|---|---|---|---|---|
| 0 | 3 | 2013-01-01 | 1 | 0.0 | 3 | 8 |
| 1 | 36 | 2013-01-01 | 10 | 0.0 | 3 | 8 |
| 2 | 69 | 2013-01-01 | 11 | 0.0 | 3 | 8 |
```python
# Merge the full tables
sales2 = pd.merge(sales, products, how='inner', on='Product_ID')
sales2.head(3)
```

|   | Order_ID | Date | Store_ID | Qty | Product_ID | Product_Code | SubCategory | Category | LeadTime | Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2013-01-01 | 1 | 0.0 | 3 | DB001 | Beverage | Drink | 2 | 8 |
| 1 | 36 | 2013-01-01 | 10 | 0.0 | 3 | DB001 | Beverage | Drink | 2 | 8 |
| 2 | 69 | 2013-01-01 | 11 | 0.0 | 3 | DB001 | Beverage | Drink | 2 | 8 |
```python
sales2['Amt'] = sales2['Qty'] * sales2['Price']
sales2.head(3)
```

|   | Order_ID | Date | Store_ID | Qty | Product_ID | Product_Code | SubCategory | Category | LeadTime | Price | Amt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2013-01-01 | 1 | 0.0 | 3 | DB001 | Beverage | Drink | 2 | 8 | 0.0 |
| 1 | 36 | 2013-01-01 | 10 | 0.0 | 3 | DB001 | Beverage | Drink | 2 | 8 | 0.0 |
| 2 | 69 | 2013-01-01 | 11 | 0.0 | 3 | DB001 | Beverage | Drink | 2 | 8 | 0.0 |

(2) Revenue by product (Product_Code)

```python
sales2.groupby(['Product_Code'], as_index=False)[['Amt']].sum()
```

|   | Product_Code | Amt |
|---|---|---|
| 0 | DA001 | 1.062816e+06 |
| 1 | DB001 | 1.336522e+07 |
| 2 | FB001 | 3.298673e+06 |
| 3 | FF001 | 3.423657e+06 |
| 4 | FM001 | 2.094948e+06 |
| 5 | FP001 | 2.456785e+06 |
| 6 | GA001 | 2.419344e+07 |
| 7 | GE001 | 1.017210e+06 |
| 8 | GM001 | 9.063631e+06 |
| 9 | GS001 | 1.188430e+06 |
| 10 | HC001 | 2.173276e+07 |
| 11 | HG001 | 1.904000e+05 |
| 12 | HH001 | 0.000000e+00 |
| 13 | HK001 | 0.000000e+00 |
| 14 | HK002 | 0.000000e+00 |

(3) Revenue by city and category

```python
# 1. Merge sales2 with stores
sales3 = pd.merge(sales2, stores, how='inner', on='Store_ID')

# 2. groupby
sales3.groupby(['City', 'Category'], as_index=False)[['Amt']].sum()
```

|   | City | Category | Amt |
|---|---|---|---|
| 0 | Albany | Drink | 2.367080e+05 |
| 1 | Albany | Food | 6.998561e+04 |
| 2 | Albany | Grocery | 3.676481e+05 |
| 3 | Albany | Household Goods | 2.875500e+05 |
| 4 | Annapolis | Drink | 3.507200e+05 |
| ... | ... | ... | ... |
| 79 | Saint Paul | Household Goods | 1.081224e+07 |
| 80 | Wilmington | Drink | 0.000000e+00 |
| 81 | Wilmington | Food | 0.000000e+00 |
| 82 | Wilmington | Grocery | 0.000000e+00 |
| 83 | Wilmington | Household Goods | 0.000000e+00 |

84 rows × 3 columns

(4) Top 3 states by revenue

```python
sales3.groupby(['State'], as_index=False)[['Amt']].sum().sort_values('Amt', ascending=False).head(3)
```

|   | State | Amt |
|---|---|---|
| 6 | Minnesota | 4.733903e+07 |
| 8 | North Carolina | 1.133845e+07 |
| 0 | Arizona | 4.192750e+06 |
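As an aside, the sum-sort-head chain for a top-N can also be written with `nlargest`; a minimal sketch with made-up totals:

```python
import pandas as pd

# Made-up state totals standing in for the real groupby result
totals = pd.DataFrame({'State': ['Minnesota', 'Arizona', 'North Carolina', 'Ohio'],
                       'Amt': [47.3, 4.2, 11.3, 1.0]})

# nlargest(3, 'Amt') is equivalent to sort_values('Amt', ascending=False).head(3)
top3 = totals.nlargest(3, 'Amt')
print(top3['State'].tolist())   # ['Minnesota', 'North Carolina', 'Arizona']
```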
This post is licensed under CC BY 4.0 by the author.
