
[KT Aivle 3rd Cohort AI] Day 5. (2) Combining DataFrames

0. Overview

KT Aivle School 3rd Cohort AI, Day 5

  • Topic : DataFrame modification
  • Instructor : 한기영
  • Content : Continuing from Day 4, we covered how to work with Pandas DataFrames. This post summarizes the part on combining DataFrames.

1. Combining DataFrames

1) Two ways to combine DataFrames

(1) pd.concat()

  • Mapping basis : index (rows) and column names (columns)
  • The frames should share the same structure to combine cleanly.
  • Choosing a direction
    • axis=0 : stack vertically (by rows) (default)
    • axis=1 : join horizontally (by columns)
  • Choosing a method
    • join='outer' : keep all rows and columns (default)
    • join='inner' : keep only the rows and columns that match
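A small supplementary sketch (toy frames, not the course data) of two details implied above: concat aligns by column name, not position, and `keys=` can tag which frame each row came from.

```python
import pandas as pd

# Two toy frames (made-up data, not the course dataset)
a = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
b = pd.DataFrame({'B': [7, 8], 'A': [5, 6]})  # same columns, different order

# concat aligns by column NAME, not position, so the differing order is fine
stacked = pd.concat([a, b], axis=0)

# keys= adds an outer index level recording which frame each row came from
labeled = pd.concat([a, b], keys=['first', 'second'])
print(stacked.columns.tolist())   # ['A', 'B']
```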

(2) pd.merge() (used more often than concat; important!)

  • Mapping basis : combines on the values of a specific key column
  • Works like a database table join
  • merge always joins side by side (adds columns)
  • Supports inner/outer/left/right joins
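One case the lecture examples below don't cover: when the key column is named differently in each table, merge takes `left_on`/`right_on` instead of `on`. A minimal sketch with made-up tables:

```python
import pandas as pd

# Toy tables whose key columns have different names (hypothetical data)
orders = pd.DataFrame({'prod_id': [1, 2, 2], 'qty': [10, 20, 5]})
prods = pd.DataFrame({'id': [1, 2], 'name': ['pen', 'cup']})

# When the key is named differently on each side, use left_on / right_on
m = pd.merge(orders, prods, left_on='prod_id', right_on='id', how='inner')
print(m['name'].tolist())   # ['pen', 'cup', 'cup']
```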

2) pivot

  • Use the pivot function to reshape aggregated data
  • `df.pivot(index, columns, values)`
  • pivot does not combine DataFrames; it reshapes a DataFrame's structure.
    • Aggregate with groupby first, then pivot(index, columns, values)
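As a side note not covered in the lecture: `pivot_table` can do the groupby-then-pivot flow in a single call, since it aggregates and reshapes at once. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy long-format data (made-up values)
df = pd.DataFrame({
    'Date': ['d1', 'd1', 'd2', 'd2', 'd2'],
    'Category': ['Food', 'Drink', 'Food', 'Food', 'Drink'],
    'Qty': [1, 2, 3, 4, 5],
})

# pivot_table sums Qty per (Category, Date) and spreads Date across columns,
# replacing the separate groupby + pivot steps
pt = df.pivot_table(index='Category', columns='Date', values='Qty', aggfunc='sum')
print(pt.loc['Food', 'd2'])   # 7
```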

2. Practice

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```
```python
sales = pd.read_csv("https://raw.githubusercontent.com/DA4BAM/dataset/master/ts_sales_simple.csv")

sales1 = sales.loc[sales['Store_ID']==1]
sales2 = sales.loc[sales['Store_ID']==2]
sales5 = sales.loc[sales['Store_ID']==5]
sales11 = sales.loc[sales['Store_ID']==11]

products = pd.read_csv("https://raw.githubusercontent.com/DA4BAM/dataset/master/ts_product_master.csv")
stores = pd.read_csv("https://raw.githubusercontent.com/DA4BAM/dataset/master/ts_store_master.csv")
```
```python
sales1.head()
```

|   | Order_ID | Date | Store_ID | Qty | Product_ID |
|---|---|---|---|---|---|
| 0 | 3 | 2013-01-01 | 1 | 0.0 | 3 |
| 1 | 5 | 2013-01-01 | 1 | 0.0 | 5 |
| 2 | 7 | 2013-01-01 | 1 | 0.0 | 7 |
| 3 | 8 | 2013-01-01 | 1 | 0.0 | 8 |
| 4 | 10 | 2013-01-01 | 1 | 0.0 | 10 |
```python
products.head()
```

|   | Product_ID | Product_Code | SubCategory | Category | LeadTime | Price |
|---|---|---|---|---|---|---|
| 0 | 20 | HG001 | Gardening Tools | Household Goods | 2 | 50 |
| 1 | 27 | HH001 | Home Appliances | Household Goods | 2 | 150 |
| 2 | 16 | HK001 | Kitchen | Household Goods | 2 | 23 |
| 3 | 15 | HK002 | Kitchen | Household Goods | 2 | 41 |
| 4 | 32 | GS001 | Seafood | Grocery | 3 | 34 |
```python
stores.head()
```

|   | Store_ID | City | State | Store_Type |
|---|---|---|---|---|
| 0 | 1 | Saint Paul | Minnesota | 4 |
| 1 | 2 | Saint Paul | Minnesota | 4 |
| 2 | 3 | Saint Paul | Minnesota | 4 |
| 3 | 4 | Saint Paul | Minnesota | 4 |
| 4 | 5 | Oklahoma City | Oklahoma | 4 |

1) Combining 1 : appending with concat()

  • Let's create some data.
```python
df1 = pd.DataFrame({'A':[10,25], 'B':[15,30]})
df2 = pd.DataFrame({'A':[20,30, 50], 'C':[35,30, 40]})
df2.drop([1], inplace = True)

display(df1)
display(df2)
```

|   | A | B |
|---|---|---|
| 0 | 10 | 15 |
| 1 | 25 | 30 |

|   | A | C |
|---|---|---|
| 0 | 20 | 35 |
| 2 | 50 | 40 |

(1) Stacking vertically : axis=0

  • join = 'inner'

```python
pd.concat([df1, df2], axis = 0, join = 'inner')
```

|   | A |
|---|---|
| 0 | 10 |
| 1 | 25 |
| 0 | 20 |
| 2 | 50 |
  • join = 'outer'

```python
pd.concat([df1, df2], axis = 0, join = 'outer')
```

|   | A | B | C |
|---|---|---|---|
| 0 | 10 | 15.0 | NaN |
| 1 | 25 | 30.0 | NaN |
| 0 | 20 | NaN | 35.0 |
| 2 | 50 | NaN | 40.0 |

(2) Joining horizontally : axis=1

  • join = 'inner'

```python
pd.concat([df1, df2], axis = 1, join = 'inner')
```

|   | A | B | A | C |
|---|---|---|---|---|
| 0 | 10 | 15 | 20 | 35 |
  • join = 'outer'

```python
pd.concat([df1, df2], axis = 1, join = 'outer')
```

|   | A | B | A | C |
|---|---|---|---|---|
| 0 | 10.0 | 15.0 | 20.0 | 35.0 |
| 1 | 25.0 | 30.0 | NaN | NaN |
| 2 | NaN | NaN | 50.0 | 40.0 |

(3) Example

[Q1] Combine the sales1, 2, 5, and 11 DataFrames into one. What is the best way to combine them?

```python
sales0 = pd.concat([sales1, sales2, sales5, sales11], axis=0, join='outer')
sales0
```

|   | Order_ID | Date | Store_ID | Qty | Product_ID |
|---|---|---|---|---|---|
| 0 | 3 | 2013-01-01 | 1 | 0.000 | 3 |
| 1 | 5 | 2013-01-01 | 1 | 0.000 | 5 |
| 2 | 7 | 2013-01-01 | 1 | 0.000 | 7 |
| 3 | 8 | 2013-01-01 | 1 | 0.000 | 8 |
| 4 | 10 | 2013-01-01 | 1 | 0.000 | 10 |
| ... | ... | ... | ... | ... | ... |
| 24340 | 53548 | 2013-01-31 | 11 | 10.000 | 22 |
| 24341 | 53550 | 2013-01-31 | 11 | 641.507 | 24 |
| 24342 | 53553 | 2013-01-31 | 11 | 0.000 | 27 |
| 24343 | 53555 | 2013-01-31 | 11 | 27.000 | 29 |
| 24344 | 53558 | 2013-01-31 | 11 | 13.000 | 32 |

1860 rows × 5 columns
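A quick sanity check worth doing after a row-wise concat like the one above: the combined length should equal the sum of the parts, and the row labels (kept from the original slices) can be reset. Sketched here with toy stand-ins for the store slices:

```python
import pandas as pd

# Toy stand-ins for the per-store slices (hypothetical values)
s1 = pd.DataFrame({'Store_ID': [1, 1, 1], 'Qty': [1.0, 2.0, 3.0]})
s2 = pd.DataFrame({'Store_ID': [2, 2], 'Qty': [4.0, 5.0]})

combined = pd.concat([s1, s2], axis=0)

# Row-wise concat should neither lose nor duplicate rows: lengths must add up
assert len(combined) == len(s1) + len(s2)

# Each slice keeps its original row labels (0,1,2,0,1); reset for a clean index
combined = combined.reset_index(drop=True)
print(combined.index.tolist())   # [0, 1, 2, 3, 4]
```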

2) Combining 2 : joining with merge()

  • Performs an inner join by default
```python
df1 = pd.DataFrame({'A':[1,2], 'B':[15,30], 'C':[20, 25]})
df2 = pd.DataFrame({'A':[2,3], 'D':[20, 35]})

display(df1)
display(df2)
```

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 15 | 20 |
| 1 | 2 | 30 | 25 |

|   | A | D |
|---|---|---|
| 0 | 2 | 20 |
| 1 | 3 | 35 |

(1) inner merge

```python
pd.merge(df1, df2, how = 'inner', on = 'A')
```

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 2 | 30 | 25 | 20 |

```python
pd.merge(df1, df2, how = 'inner')
```

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 2 | 30 | 25 | 20 |

```python
pd.merge(df1, df2)
```

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 2 | 30 | 25 | 20 |

(2) left merge

```python
pd.merge(df1, df2, how = 'left')
```

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 15 | 20 | NaN |
| 1 | 2 | 30 | 25 | 20.0 |

(3) right merge

```python
pd.merge(df1, df2, how = 'right')
```

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 2 | 30.0 | 25.0 | 20 |
| 1 | 3 | NaN | NaN | 35 |

3) Pivot

  • pivot : not a form of combining, but often used to reshape an aggregated DataFrame for easier viewing
  • Steps
      1. groupby
      2. pivot
```python
# 1) Aggregate store 1's daily sales quantity by category

temp = pd.merge(sales1, products)
temp2 = temp.groupby(['Date', 'Category'], as_index = False)['Qty'].sum()
temp2
```

|   | Date | Category | Qty |
|---|---|---|---|
| 0 | 2013-01-01 | Drink | 0.000 |
| 1 | 2013-01-01 | Food | 0.000 |
| 2 | 2013-01-01 | Grocery | 0.000 |
| 3 | 2013-01-01 | Household Goods | 0.000 |
| 4 | 2013-01-02 | Drink | 1158.000 |
| ... | ... | ... | ... |
| 119 | 2013-01-30 | Household Goods | 932.000 |
| 120 | 2013-01-31 | Drink | 971.000 |
| 121 | 2013-01-31 | Food | 751.766 |
| 122 | 2013-01-31 | Grocery | 1907.797 |
| 123 | 2013-01-31 | Household Goods | 711.000 |

124 rows × 3 columns

```python
# 2) pivot (since pandas 2.0, index/columns/values must be passed as keywords)
temp3 = temp2.pivot(index='Category', columns='Date', values='Qty')
temp3
```

| Date | 2013-01-01 | 2013-01-02 | 2013-01-03 | ... | 2013-01-30 | 2013-01-31 |
|---|---|---|---|---|---|---|
| **Category** | | | | | | |
| Drink | 0.0 | 1158.000 | 985.000 | ... | 1140.000 | 971.000 |
| Food | 0.0 | 1227.652 | 913.699 | ... | 900.092 | 751.766 |
| Grocery | 0.0 | 3305.130 | 2613.685 | ... | 2662.485 | 1907.797 |
| Household Goods | 0.0 | 1070.000 | 836.000 | ... | 932.000 | 711.000 |

4 rows × 31 columns

```python
# This can also be visualized with sns.heatmap!

plt.figure(figsize = (20, 6))
sns.heatmap(temp3)
plt.show()
```

(Figure: heatmap of temp3 — dates on the x-axis, categories on the y-axis, color = daily sales quantity)

4) Examples

```python
sales = pd.read_csv("https://raw.githubusercontent.com/DA4BAM/dataset/master/ts_sales_simple.csv")
products = pd.read_csv("https://raw.githubusercontent.com/DA4BAM/dataset/master/ts_product_master.csv")
stores = pd.read_csv("https://raw.githubusercontent.com/DA4BAM/dataset/master/ts_store_master.csv")
```
```python
sales.head(3)
```

|   | Order_ID | Date | Store_ID | Qty | Product_ID |
|---|---|---|---|---|---|
| 0 | 3 | 2013-01-01 | 1 | 0.0 | 3 |
| 1 | 5 | 2013-01-01 | 1 | 0.0 | 5 |
| 2 | 7 | 2013-01-01 | 1 | 0.0 | 7 |
```python
products.head(3)
```

|   | Product_ID | Product_Code | SubCategory | Category | LeadTime | Price |
|---|---|---|---|---|---|---|
| 0 | 20 | HG001 | Gardening Tools | Household Goods | 2 | 50 |
| 1 | 27 | HH001 | Home Appliances | Household Goods | 2 | 150 |
| 2 | 16 | HK001 | Kitchen | Household Goods | 2 | 23 |
```python
stores.head(3)
```

|   | Store_ID | City | State | Store_Type |
|---|---|---|---|---|
| 0 | 1 | Saint Paul | Minnesota | 4 |
| 1 | 2 | Saint Paul | Minnesota | 4 |
| 2 | 3 | Saint Paul | Minnesota | 4 |

(1) Add an amount (Amt) variable to sales.

  • Amt = Qty * Price
```python
# Merge using only the Product_ID and Price columns
temp = pd.merge(sales, products.loc[:, ['Product_ID', 'Price']], how='inner', on='Product_ID')
temp.head(3)
```

|   | Order_ID | Date | Store_ID | Qty | Product_ID | Price |
|---|---|---|---|---|---|---|
| 0 | 3 | 2013-01-01 | 1 | 0.0 | 3 | 8 |
| 1 | 36 | 2013-01-01 | 10 | 0.0 | 3 | 8 |
| 2 | 69 | 2013-01-01 | 11 | 0.0 | 3 | 8 |
```python
# Merge the full tables
sales2 = pd.merge(sales, products, how='inner', on='Product_ID')
sales2.head(3)
```

|   | Order_ID | Date | Store_ID | Qty | Product_ID | Product_Code | SubCategory | Category | LeadTime | Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2013-01-01 | 1 | 0.0 | 3 | DB001 | Beverage | Drink | 2 | 8 |
| 1 | 36 | 2013-01-01 | 10 | 0.0 | 3 | DB001 | Beverage | Drink | 2 | 8 |
| 2 | 69 | 2013-01-01 | 11 | 0.0 | 3 | DB001 | Beverage | Drink | 2 | 8 |
```python
sales2['Amt'] = sales2['Qty'] * sales2['Price']
sales2.head(3)
```

|   | Order_ID | Date | Store_ID | Qty | Product_ID | Product_Code | SubCategory | Category | LeadTime | Price | Amt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2013-01-01 | 1 | 0.0 | 3 | DB001 | Beverage | Drink | 2 | 8 | 0.0 |
| 1 | 36 | 2013-01-01 | 10 | 0.0 | 3 | DB001 | Beverage | Drink | 2 | 8 | 0.0 |
| 2 | 69 | 2013-01-01 | 11 | 0.0 | 3 | DB001 | Beverage | Drink | 2 | 8 | 0.0 |

(2) Revenue by product (Product_Code)

```python
sales2.groupby(['Product_Code'], as_index=False)[['Amt']].sum()
```

|   | Product_Code | Amt |
|---|---|---|
| 0 | DA001 | 1.062816e+06 |
| 1 | DB001 | 1.336522e+07 |
| 2 | FB001 | 3.298673e+06 |
| 3 | FF001 | 3.423657e+06 |
| 4 | FM001 | 2.094948e+06 |
| 5 | FP001 | 2.456785e+06 |
| 6 | GA001 | 2.419344e+07 |
| 7 | GE001 | 1.017210e+06 |
| 8 | GM001 | 9.063631e+06 |
| 9 | GS001 | 1.188430e+06 |
| 10 | HC001 | 2.173276e+07 |
| 11 | HG001 | 1.904000e+05 |
| 12 | HH001 | 0.000000e+00 |
| 13 | HK001 | 0.000000e+00 |
| 14 | HK002 | 0.000000e+00 |

(3) Revenue by city and category

```python
# 1. Merge sales2 with stores
sales3 = pd.merge(sales2, stores, how='inner', on='Store_ID')

# 2. groupby
sales3.groupby(['City', 'Category'], as_index=False)[['Amt']].sum()
```

|   | City | Category | Amt |
|---|---|---|---|
| 0 | Albany | Drink | 2.367080e+05 |
| 1 | Albany | Food | 6.998561e+04 |
| 2 | Albany | Grocery | 3.676481e+05 |
| 3 | Albany | Household Goods | 2.875500e+05 |
| 4 | Annapolis | Drink | 3.507200e+05 |
| ... | ... | ... | ... |
| 79 | Saint Paul | Household Goods | 1.081224e+07 |
| 80 | Wilmington | Drink | 0.000000e+00 |
| 81 | Wilmington | Food | 0.000000e+00 |
| 82 | Wilmington | Grocery | 0.000000e+00 |
| 83 | Wilmington | Household Goods | 0.000000e+00 |

84 rows × 3 columns

(4) Top 3 states by revenue

```python
sales3.groupby(['State'], as_index=False)[['Amt']].sum().sort_values('Amt', ascending=False).head(3)
```

|   | State | Amt |
|---|---|---|
| 6 | Minnesota | 4.733903e+07 |
| 8 | North Carolina | 1.133845e+07 |
| 0 | Arizona | 4.192750e+06 |
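As an aside, the sum-sort-head chain for a top-N can also be written with `nlargest`; a minimal sketch with made-up totals:

```python
import pandas as pd

# Made-up state totals standing in for the real groupby result
totals = pd.DataFrame({'State': ['Minnesota', 'Arizona', 'North Carolina', 'Ohio'],
                       'Amt': [47.3, 4.2, 11.3, 1.0]})

# nlargest(3, 'Amt') is equivalent to sort_values('Amt', ascending=False).head(3)
top3 = totals.nlargest(3, 'Amt')
print(top3['State'].tolist())   # ['Minnesota', 'North Carolina', 'Arizona']
```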
This post is licensed under CC BY 4.0 by the author.
