[Pre-study] 3.1 Handling Missing Data

Posted Jan 27, 2023

By JIHWAN PARK 17 min read

3. Data Preprocessing: Concepts and Practice

3.1 Data Cleaning: Handling Missing Data

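Missing Value

A missing value means that data was not collected or was lost, so the information (value) does not exist
=> Missing values must be handled before training a model

Causes of missing values

Most missing values arise during data collection and management
- Not collected: the value was never entered, and the record was collected and stored without it
- System error: the value was lost to an error before the record was collected and stored
- New field: the field has only recently started being collected

Approaches to handling missing values

- Removal: the simplest approach, but it can discard a large share of the data
- Replacement: uses as much of the data as possible, but can introduce bias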
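
Before choosing between removal and replacement, it usually helps to measure how much is missing. The sketch below uses a small made-up DataFrame (the column names `a`, `b`, `c` and all values are assumptions for illustration, not the data used later in this post) and counts missing values per column and per row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the WDBC data used in the practice section.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, "z"],
    "c": [np.nan, np.nan, 0.5, 0.7],
})

missing_per_column = df.isna().sum()    # number of missing values in each column
missing_ratio = df.isna().mean()        # share of missing values in each column
rows_with_missing = int(df.isna().any(axis=1).sum())  # rows that removal would drop

print(missing_per_column)
print(missing_ratio)
print("rows with at least one missing value:", rows_with_missing)
```

If only a few rows are affected, removal costs little; if a large share of rows would be dropped, replacement is usually worth considering.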

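Working with missing values

Removing missing values
- Listwise: delete every row that contains a missing value
- Pairwise: in this post, delete only the rows in which every variable is missing (strictly speaking, pairwise deletion in statistics excludes a record only from the calculations that involve its missing variable)
Replacing missing values
- Prevents information loss, but can distort variable characteristics (mean, correlation, etc.)
- Constant-value replacement: replace missing values with a pre-specified value, such as each variable's mean
- Linear replacement: replace missing values using a linear function of the preceding and following observations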
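
As a rough illustration of how replacement can affect variable characteristics, the sketch below fills a made-up series with its own mean. The values are assumptions chosen for easy arithmetic. The mean is preserved, but the standard deviation shrinks because the filled values sit exactly at the center.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen only for illustration.
s = pd.Series([10.0, 12.0, np.nan, 14.0, np.nan, 16.0])

filled = s.fillna(s.mean())   # replace missing values with the observed mean

mean_before, mean_after = s.mean(), filled.mean()
std_before, std_after = s.std(), filled.std()

print("mean:", mean_before, "->", mean_after)   # the mean does not change
print("std :", std_before, "->", std_after)     # the spread becomes smaller
```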

Practice

Loading the data

  
import numpy as np
import pandas as pd

  
cancer = pd.read_csv("./data/wdbc.data", header=None)
cancer

	0	1	2	3	4	5	6	7	8	9	...	22	23	24	25	26	27	28	29	30	31
0	842302	M	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.30010	0.14710	...	25.380	17.33	184.60	2019.0	0.16220	0.66560	0.7119	0.2654	0.4601	0.11890
1	842517	M	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.08690	0.07017	...	24.990	23.41	158.80	1956.0	0.12380	0.18660	0.2416	0.1860	0.2750	0.08902
2	84300903	M	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.19740	0.12790	...	23.570	25.53	152.50	1709.0	0.14440	0.42450	0.4504	0.2430	0.3613	0.08758
3	84348301	M	11.42	20.38	77.58	386.1	0.14250	0.28390	0.24140	0.10520	...	14.910	26.50	98.87	567.7	0.20980	0.86630	0.6869	0.2575	0.6638	0.17300
4	84358402	M	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.19800	0.10430	...	22.540	16.67	152.20	1575.0	0.13740	0.20500	0.4000	0.1625	0.2364	0.07678
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
564	926424	M	21.56	22.39	142.00	1479.0	0.11100	0.11590	0.24390	0.13890	...	25.450	26.40	166.10	2027.0	0.14100	0.21130	0.4107	0.2216	0.2060	0.07115
565	926682	M	20.13	28.25	131.20	1261.0	0.09780	0.10340	0.14400	0.09791	...	23.690	38.25	155.00	1731.0	0.11660	0.19220	0.3215	0.1628	0.2572	0.06637
566	926954	M	16.60	28.08	108.30	858.1	0.08455	0.10230	0.09251	0.05302	...	18.980	34.12	126.70	1124.0	0.11390	0.30940	0.3403	0.1418	0.2218	0.07820
567	927241	M	20.60	29.33	140.10	1265.0	0.11780	0.27700	0.35140	0.15200	...	25.740	39.42	184.60	1821.0	0.16500	0.86810	0.9387	0.2650	0.4087	0.12400
568	92751	B	7.76	24.54	47.92	181.0	0.05263	0.04362	0.00000	0.00000	...	9.456	30.37	59.16	268.6	0.08996	0.06444	0.0000	0.0000	0.2871	0.07039

569 rows × 32 columns

The dataset can be downloaded from the link

  
# Assign column names
cancer.columns = [
    "id", "diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean", "compactness_mean", 
    "concavity_mean", "concave_points_mean", "symmetry_mean", "fractal_dimension_mean", "radius_se", "texture_se",
    "perimeter_se", "area_se", "smoothness_se", "compactness_se", "concavity_se", "concave_points_se", "symmetry_se",
    "fractal_dimension_se", "radius_worst", "texture_worst", "perimeter_worst", "area_worst", "smoothness_worst", "compactness_worst", 
    "concavity_worst", "concave_points_worst", "symmetry_worst", "fractal_dimension_worst"
]

# Use the ID as the index
cancer = cancer.set_index('id')
cancer

	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave_points_mean	symmetry_mean	...	radius_worst	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave_points_worst	symmetry_worst	fractal_dimension_worst
id
842302	M	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.30010	0.14710	0.2419	...	25.380	17.33	184.60	2019.0	0.16220	0.66560	0.7119	0.2654	0.4601	0.11890
842517	M	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.08690	0.07017	0.1812	...	24.990	23.41	158.80	1956.0	0.12380	0.18660	0.2416	0.1860	0.2750	0.08902
84300903	M	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.19740	0.12790	0.2069	...	23.570	25.53	152.50	1709.0	0.14440	0.42450	0.4504	0.2430	0.3613	0.08758
84348301	M	11.42	20.38	77.58	386.1	0.14250	0.28390	0.24140	0.10520	0.2597	...	14.910	26.50	98.87	567.7	0.20980	0.86630	0.6869	0.2575	0.6638	0.17300
84358402	M	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.19800	0.10430	0.1809	...	22.540	16.67	152.20	1575.0	0.13740	0.20500	0.4000	0.1625	0.2364	0.07678
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
926424	M	21.56	22.39	142.00	1479.0	0.11100	0.11590	0.24390	0.13890	0.1726	...	25.450	26.40	166.10	2027.0	0.14100	0.21130	0.4107	0.2216	0.2060	0.07115
926682	M	20.13	28.25	131.20	1261.0	0.09780	0.10340	0.14400	0.09791	0.1752	...	23.690	38.25	155.00	1731.0	0.11660	0.19220	0.3215	0.1628	0.2572	0.06637
926954	M	16.60	28.08	108.30	858.1	0.08455	0.10230	0.09251	0.05302	0.1590	...	18.980	34.12	126.70	1124.0	0.11390	0.30940	0.3403	0.1418	0.2218	0.07820
927241	M	20.60	29.33	140.10	1265.0	0.11780	0.27700	0.35140	0.15200	0.2397	...	25.740	39.42	184.60	1821.0	0.16500	0.86810	0.9387	0.2650	0.4087	0.12400
92751	B	7.76	24.54	47.92	181.0	0.05263	0.04362	0.00000	0.00000	0.1587	...	9.456	30.37	59.16	268.6	0.08996	0.06444	0.0000	0.0000	0.2871	0.07039

569 rows × 31 columns

Creating missing values

  
# Copy the data
cancer_data = cancer.copy()

# Create missing values in the data
# Select a subset of the data for this exercise
cancer_data = cancer_data[0:30]
cancer_data = cancer_data[['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean']]

# Create missing values
# Missing values are created in 6 records
cancer_data.iloc[2, :] = np.nan     # row 3: every column set to missing
cancer_data.iloc[5, 0] = np.nan     # row 6: column 1 set to missing
cancer_data.iloc[10, [3, 4]] = np.nan   # row 11: columns 4 and 5 set to missing
cancer_data.iloc[12, 2:4] = np.nan   # row 13: columns 3 and 4 set to missing
cancer_data.iloc[15, [0, 3]] = np.nan   # row 16: columns 1 and 4 set to missing
cancer_data.iloc[24, 4] = np.nan   # row 25: column 5 set to missing

cancer_data

	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean
id
842302	M	17.990	10.38	122.80	1001.0
842517	M	20.570	17.77	132.90	1326.0
84300903	NaN	NaN	NaN	NaN	NaN
84348301	M	11.420	20.38	77.58	386.1
84358402	M	20.290	14.34	135.10	1297.0
843786	NaN	12.450	15.70	82.57	477.1
844359	M	18.250	19.98	119.60	1040.0
84458202	M	13.710	20.83	90.20	577.9
844981	M	13.000	21.82	87.50	519.8
84501001	M	12.460	24.04	83.97	475.9
845636	M	16.020	23.24	NaN	NaN
84610002	M	15.780	17.89	103.60	781.0
846226	M	19.170	NaN	NaN	1123.0
846381	M	15.850	23.95	103.70	782.7
84667401	M	13.730	22.61	93.60	578.3
84799002	NaN	14.540	27.54	NaN	658.8
848406	M	14.680	20.13	94.74	684.5
84862001	M	16.130	20.68	108.10	798.8
849014	M	19.810	22.15	130.00	1260.0
8510426	B	13.540	14.36	87.46	566.3
8510653	B	13.080	15.71	85.63	520.0
8510824	B	9.504	12.44	60.34	273.9
8511133	M	15.340	14.26	102.50	704.4
851509	M	21.160	23.04	137.20	1404.0
852552	M	16.650	21.38	110.00	NaN
852631	M	17.140	16.40	116.00	912.7
852763	M	14.580	21.53	97.41	644.8
852781	M	18.610	20.25	122.10	1094.0
852973	M	15.300	25.27	102.40	732.4
853201	M	17.570	15.05	115.00	955.1

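Removing missing values

Approaches to removing missing values
1. listwise deletion: remove a row if at least one of its variables is N/A (missing)
2. pairwise deletion: as used here, remove a row only if all of its variables are N/A (missing)
Removal guarantees that only complete records are used, but it loses data
listwise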

  
# Data overview
cancer_data.info()
# Missing values exist in 6 records in total

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 842302 to 853201
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   diagnosis       27 non-null     object 
 1   radius_mean     29 non-null     float64
 2   texture_mean    28 non-null     float64
 3   perimeter_mean  26 non-null     float64
 4   area_mean       27 non-null     float64
dtypes: float64(4), object(1)
memory usage: 1.4+ KB

  
# Perform listwise deletion
# 6 of the 30 records contain missing values
cancer_copy = cancer_data.copy()
cancer_copy = cancer_copy.dropna()

# Data summary: of the 30 records, the 6 that contain at least one missing value are deleted
print(cancer_copy.info())

# Check the data dimensions
print("Data dimensions:", np.shape(cancer_copy))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 842302 to 853201
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   diagnosis       24 non-null     object 
 1   radius_mean     24 non-null     float64
 2   texture_mean    24 non-null     float64
 3   perimeter_mean  24 non-null     float64
 4   area_mean       24 non-null     float64
dtypes: float64(4), object(1)
memory usage: 1.1+ KB
None
Data dimensions: (24, 5)

cancer_copy

	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean
id
842302	M	17.990	10.38	122.80	1001.0
842517	M	20.570	17.77	132.90	1326.0
84348301	M	11.420	20.38	77.58	386.1
84358402	M	20.290	14.34	135.10	1297.0
844359	M	18.250	19.98	119.60	1040.0
84458202	M	13.710	20.83	90.20	577.9
844981	M	13.000	21.82	87.50	519.8
84501001	M	12.460	24.04	83.97	475.9
84610002	M	15.780	17.89	103.60	781.0
846381	M	15.850	23.95	103.70	782.7
84667401	M	13.730	22.61	93.60	578.3
848406	M	14.680	20.13	94.74	684.5
84862001	M	16.130	20.68	108.10	798.8
849014	M	19.810	22.15	130.00	1260.0
8510426	B	13.540	14.36	87.46	566.3
8510653	B	13.080	15.71	85.63	520.0
8510824	B	9.504	12.44	60.34	273.9
8511133	M	15.340	14.26	102.50	704.4
851509	M	21.160	23.04	137.20	1404.0
852631	M	17.140	16.40	116.00	912.7
852763	M	14.580	21.53	97.41	644.8
852781	M	18.610	20.25	122.10	1094.0
852973	M	15.300	25.27	102.40	732.4
853201	M	17.570	15.05	115.00	955.1

pairwise

  
# Perform pairwise deletion (as defined in this post)
# 1 of the 30 records is missing every variable
# Delete only the records in which all values are missing
cancer_copy = cancer_data.copy()
cancer_copy = cancer_copy.dropna(how='all')

# Data summary: 1 of the 30 records is deleted
print(cancer_copy.info())

# Check the data dimensions
print("Data dimensions:", np.shape(cancer_copy))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29 entries, 842302 to 853201
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   diagnosis       27 non-null     object 
 1   radius_mean     29 non-null     float64
 2   texture_mean    28 non-null     float64
 3   perimeter_mean  26 non-null     float64
 4   area_mean       27 non-null     float64
dtypes: float64(4), object(1)
memory usage: 1.4+ KB
None
Data dimensions: (29, 5)

cancer_copy

	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean
id
842302	M	17.990	10.38	122.80	1001.0
842517	M	20.570	17.77	132.90	1326.0
84348301	M	11.420	20.38	77.58	386.1
84358402	M	20.290	14.34	135.10	1297.0
843786	NaN	12.450	15.70	82.57	477.1
844359	M	18.250	19.98	119.60	1040.0
84458202	M	13.710	20.83	90.20	577.9
844981	M	13.000	21.82	87.50	519.8
84501001	M	12.460	24.04	83.97	475.9
845636	M	16.020	23.24	NaN	NaN
84610002	M	15.780	17.89	103.60	781.0
846226	M	19.170	NaN	NaN	1123.0
846381	M	15.850	23.95	103.70	782.7
84667401	M	13.730	22.61	93.60	578.3
84799002	NaN	14.540	27.54	NaN	658.8
848406	M	14.680	20.13	94.74	684.5
84862001	M	16.130	20.68	108.10	798.8
849014	M	19.810	22.15	130.00	1260.0
8510426	B	13.540	14.36	87.46	566.3
8510653	B	13.080	15.71	85.63	520.0
8510824	B	9.504	12.44	60.34	273.9
8511133	M	15.340	14.26	102.50	704.4
851509	M	21.160	23.04	137.20	1404.0
852552	M	16.650	21.38	110.00	NaN
852631	M	17.140	16.40	116.00	912.7
852763	M	14.580	21.53	97.41	644.8
852781	M	18.610	20.25	122.10	1094.0
852973	M	15.300	25.27	102.40	732.4
853201	M	17.570	15.05	115.00	955.1
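
Between dropping every incomplete row and dropping only fully empty rows, `dropna` also accepts `subset` and `thresh`. The following is a minimal sketch on a made-up frame; the column names echo this post, but the values are assumptions. It compares the four variants side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
df = pd.DataFrame(
    {
        "diagnosis": ["M", None, "B", None],
        "radius_mean": [17.9, 12.4, np.nan, np.nan],
        "area_mean": [1001.0, 477.1, 520.0, np.nan],
    }
)

listwise = df.dropna()                          # drop rows with any missing value
all_missing = df.dropna(how="all")              # drop rows where every value is missing
by_subset = df.dropna(subset=["radius_mean"])   # only require radius_mean to be present
by_thresh = df.dropna(thresh=2)                 # keep rows with at least 2 non-missing values

print(len(listwise), len(all_missing), len(by_subset), len(by_thresh))  # 1 3 2 3
```

`subset` is handy when only the columns a model actually uses need to be complete, and `thresh` when a row is still useful as long as most of it is present.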

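Replacing missing values

Approaches to replacing missing values
1. Constant-value replacement: replace missing values with a pre-specified value
2. Linear replacement: replace missing values using a linear function of the preceding and following records
Replacement is useful because it keeps as much data as possible, but the filled-in values may differ from the true values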

  
# Check the data that contains missing values
cancer_copy = cancer_data.copy()
cancer_copy

	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean
id
842302	M	17.990	10.38	122.80	1001.0
842517	M	20.570	17.77	132.90	1326.0
84300903	NaN	NaN	NaN	NaN	NaN
84348301	M	11.420	20.38	77.58	386.1
84358402	M	20.290	14.34	135.10	1297.0
843786	NaN	12.450	15.70	82.57	477.1
844359	M	18.250	19.98	119.60	1040.0
84458202	M	13.710	20.83	90.20	577.9
844981	M	13.000	21.82	87.50	519.8
84501001	M	12.460	24.04	83.97	475.9
845636	M	16.020	23.24	NaN	NaN
84610002	M	15.780	17.89	103.60	781.0
846226	M	19.170	NaN	NaN	1123.0
846381	M	15.850	23.95	103.70	782.7
84667401	M	13.730	22.61	93.60	578.3
84799002	NaN	14.540	27.54	NaN	658.8
848406	M	14.680	20.13	94.74	684.5
84862001	M	16.130	20.68	108.10	798.8
849014	M	19.810	22.15	130.00	1260.0
8510426	B	13.540	14.36	87.46	566.3
8510653	B	13.080	15.71	85.63	520.0
8510824	B	9.504	12.44	60.34	273.9
8511133	M	15.340	14.26	102.50	704.4
851509	M	21.160	23.04	137.20	1404.0
852552	M	16.650	21.38	110.00	NaN
852631	M	17.140	16.40	116.00	912.7
852763	M	14.580	21.53	97.41	644.8
852781	M	18.610	20.25	122.10	1094.0
852973	M	15.300	25.27	102.40	732.4
853201	M	17.570	15.05	115.00	955.1

Constant-value replacement

Categorical data

  
# Replace every missing value in the diagnosis column with the category value 'C'
cancer_copy['diagnosis'] = cancer_copy['diagnosis'].fillna('C')
cancer_copy.head(10)

	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean
id
842302	M	17.99	10.38	122.80	1001.0
842517	M	20.57	17.77	132.90	1326.0
84300903	C	NaN	NaN	NaN	NaN
84348301	M	11.42	20.38	77.58	386.1
84358402	M	20.29	14.34	135.10	1297.0
843786	C	12.45	15.70	82.57	477.1
844359	M	18.25	19.98	119.60	1040.0
84458202	M	13.71	20.83	90.20	577.9
844981	M	13.00	21.82	87.50	519.8
84501001	M	12.46	24.04	83.97	475.9

Numeric data

  
# Replace missing values in the numeric radius_mean column with the constant 65
cancer_copy['radius_mean'] = cancer_copy['radius_mean'].fillna(65)
cancer_copy.head(10)

	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean
id
842302	M	17.99	10.38	122.80	1001.0
842517	M	20.57	17.77	132.90	1326.0
84300903	C	65.00	NaN	NaN	NaN
84348301	M	11.42	20.38	77.58	386.1
84358402	M	20.29	14.34	135.10	1297.0
843786	C	12.45	15.70	82.57	477.1
844359	M	18.25	19.98	119.60	1040.0
84458202	M	13.71	20.83	90.20	577.9
844981	M	13.00	21.82	87.50	519.8
84501001	M	12.46	24.04	83.97	475.9

  
# Check the data overview
cancer_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 842302 to 853201
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   diagnosis       30 non-null     object 
 1   radius_mean     30 non-null     float64
 2   texture_mean    28 non-null     float64
 3   perimeter_mean  26 non-null     float64
 4   area_mean       27 non-null     float64
dtypes: float64(4), object(1)
memory usage: 1.4+ KB
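
The two constant fills above could also be written as one call by passing a dict of per-column values to `DataFrame.fillna`. This is a minimal sketch on a made-up two-row frame (the values are assumptions); columns not named in the dict are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column names as the post.
df = pd.DataFrame(
    {
        "diagnosis": ["M", None],
        "radius_mean": [17.99, np.nan],
        "texture_mean": [np.nan, 15.7],
    }
)

# One constant per column; texture_mean is not listed, so its NaN remains.
filled = df.fillna({"diagnosis": "C", "radius_mean": 65})
print(filled)
```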

Mean-value replacement

  
# Instead of a fixed constant, replace with a column statistic (mean, median, minimum, maximum, etc.)
# Replace missing values in texture_mean with the mean of texture_mean
cancer_copy['texture_mean'] = cancer_copy['texture_mean'].replace(np.nan, cancer_copy['texture_mean'].mean())

## Same result
## cancer_copy['texture_mean'] = cancer_copy['texture_mean'].fillna(cancer_copy['texture_mean'].mean())

# Compare the replaced value with the mean of the texture_mean column
# Check the 3rd record, id 84300903
print(cancer_copy['texture_mean'].mean())
cancer_copy.head(10)

19.397142857142853

	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean
id
842302	M	17.99	10.380000	122.80	1001.0
842517	M	20.57	17.770000	132.90	1326.0
84300903	C	65.00	19.397143	NaN	NaN
84348301	M	11.42	20.380000	77.58	386.1
84358402	M	20.29	14.340000	135.10	1297.0
843786	C	12.45	15.700000	82.57	477.1
844359	M	18.25	19.980000	119.60	1040.0
84458202	M	13.71	20.830000	90.20	577.9
844981	M	13.00	21.820000	87.50	519.8
84501001	M	12.46	24.040000	83.97	475.9
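
The same idea can be applied to every numeric column at once. The sketch below uses a made-up frame (values are assumptions) and fills each numeric column with its own mean. For the categorical column it uses the most frequent value, which is a common alternative to the fixed 'C' used earlier and is not the method of this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are invented for easy arithmetic.
df = pd.DataFrame(
    {
        "diagnosis": ["M", "M", None, "B"],
        "radius_mean": [10.0, np.nan, 14.0, 18.0],
        "area_mean": [400.0, 500.0, np.nan, 900.0],
    }
)

filled = df.copy()

# Fill each numeric column with that column's own mean.
numeric_cols = filled.select_dtypes(include="number").columns
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

# Alternative for categorical data (an assumption, not from the post): the mode.
filled["diagnosis"] = filled["diagnosis"].fillna(filled["diagnosis"].mode()[0])

print(filled)
```

Swapping `.mean()` for `.median()` gives a fill value that is less sensitive to outliers.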

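Linear replacement

Replace missing values based on the values of the preceding and following records (linear interpolation)
Linear interpolation assumes that values change continuously from one record to the next, so it suits ordered data such as time series and should be used with care elsewhere

  
# Replace based on the linear relationship between neighboring records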
cancer_copy = cancer_data.copy()
cancer_copy.head()

	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean
id
842302	M	17.99	10.38	122.80	1001.0
842517	M	20.57	17.77	132.90	1326.0
84300903	NaN	NaN	NaN	NaN	NaN
84348301	M	11.42	20.38	77.58	386.1
84358402	M	20.29	14.34	135.10	1297.0

  
# Linear interpolation (the non-numeric diagnosis column keeps its NaN)
cancer_copy = cancer_data.interpolate()
cancer_copy.head()

	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean
id
842302	M	17.990	10.380	122.80	1001.00
842517	M	20.570	17.770	132.90	1326.00
84300903	NaN	15.995	19.075	105.24	856.05
84348301	M	11.420	20.380	77.58	386.10
84358402	M	20.290	14.340	135.10	1297.00

  
# Check: the interpolated radius_mean of the 3rd record is the midpoint of the 2nd and 4th records
print((cancer_data.iloc[1, 1] + cancer_data.iloc[3, 1]) / 2)

15.995000000000001
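
When several consecutive values are missing, linear interpolation spaces them evenly between the nearest known neighbors rather than using a single midpoint. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of two consecutive missing values.
s = pd.Series([10.0, np.nan, np.nan, 40.0])

interpolated = s.interpolate()   # default method is "linear"
print(interpolated.tolist())     # the gap is filled in equal steps between 10 and 40
```

Each filled value lies on the straight line between 10 and 40, so the two gaps become roughly 20 and 30.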

AI & Data Analysis, AIVLE SCHOOL

This post is licensed under CC BY 4.0 by the author.
