[Pre-study] 4.1 Univariate Non-Graphical Analysis

Posted Jan 27, 2023 Updated Jan 29, 2023

By JIHWAN PARK 14 min read

4.1 Univariate Non-Graphical Analysis

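Exploratory Data Analysis

EDA (Exploratory Data Analysis) is the process of examining data from multiple angles in order to understand it.
=> Techniques such as statistical summaries, distribution checks, and visualization give an intuitive picture of the data's characteristics.

EDA overview
Understand the phenomenon the data represents and identify its patterns
Attributes: understand the analysis goal and the properties of each variable
Relationships: examine relationships between variables and test hypotheses

Preliminary data exploration
Check data definitions: review the data against its specification > per-table variable lists, counts, descriptions, types, etc.
Check the actual data: review the overview, missing values, and shape of the real data > inspect with head, tail, and info; check each variable against its defined range and distribution > range and distribution of observed values, etc.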
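The check of actual data against defined ranges can be sketched as follows. This is a minimal illustration only: the toy table, its column names (`age_pct`, `rooms`), and the `defined_ranges` dictionary are hypothetical stand-ins for a real table and its data specification.

```python
import pandas as pd

# Hypothetical toy data standing in for a real table
df = pd.DataFrame({
    "age_pct": [65.2, 78.9, 61.1, 120.0],  # specification says 0 to 100
    "rooms": [6.575, 6.421, None, 6.998],  # specification says 1 to 15
})

# Hypothetical data dictionary: allowed (min, max) per variable
defined_ranges = {"age_pct": (0, 100), "rooms": (1, 15)}

# Missing values per column
missing_counts = df.isna().sum()

# Count non-missing values that fall outside the defined range
out_of_range = {}
for col, (low, high) in defined_ranges.items():
    values = df[col].dropna()
    out_of_range[col] = int((~values.between(low, high)).sum())

print(df.shape)
print(missing_counts)
print(out_of_range)
```

Missing values are counted separately from range violations so that the two kinds of problem are not mixed up.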

EDA types by factor
1. How many variables does the data have?
2. How will the results be examined (graphically or non-graphically)?
3. What type of data is it?

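	Univariate	Multivariate
Non-graphical	Frequency table, descriptive statistics	Cross-tabulation, correlation coefficient
Graphical	Pie chart, bar chart, histogram, box plot	Mosaic plot, box plot, parallel coordinates, scatter plot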

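Univariate non-graphical analysis

An exploration type in which the data under analysis consists of a single variable and is described with summary statistics, frequencies, and so on.
=> Because there is only one variable, it does not address cause and effect; it describes the data and its composition.

Categorical variables (non-graphical)

Frequency table (check the composition and proportions of categorical data)

The goal is to find the frequency of each category
Understand the composition from per-category counts, including how often values are missing
Understand each category's share of the total number of observations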
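A frequency table covering the three points above might look like the sketch below. The series `river` and its values are hypothetical. `value_counts(dropna=False)` is used so that missing values get their own row instead of being dropped silently.

```python
import pandas as pd

# Hypothetical categorical variable with one missing value
river = pd.Series(["no", "no", "yes", "no", None, "yes", "no"], name="near_river")

# Counts per category, keeping missing values as their own row
counts = river.value_counts(dropna=False)

# Share of each category relative to all observations (missing included)
shares = counts / len(river)

freq_table = pd.DataFrame({"count": counts, "share": shares})
n_missing = int(river.isna().sum())

print(freq_table)
print("missing:", n_missing)
```

Dividing by `len(river)` rather than by the number of non-missing values keeps the shares relative to the full dataset.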
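
Continuous variables (non-graphical)

Key statistical indicators (check descriptive statistics and key indicators of continuous data)

Check the representative characteristics of continuous data
1. Descriptive statistics such as mean and variance
2. Quartiles, including the median
3. Distribution-shape indicators such as skewness and kurtosis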
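The three groups of indicators listed above can be collected in one place. The sketch below uses a small hypothetical series `x` with one large value; only standard pandas `Series` methods are used.

```python
import pandas as pd

# Hypothetical continuous variable with one large value
x = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 20.0])

summary = {
    "mean": x.mean(),
    "variance": x.var(),      # sample variance (ddof=1)
    "median": x.median(),
    "q1": x.quantile(0.25),
    "q3": x.quantile(0.75),
    "skewness": x.skew(),
    "kurtosis": x.kurt(),     # excess kurtosis (normal distribution = 0)
}

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

In this toy series the single large value pulls the mean well above the median, which is the same pattern a positive skewness value signals.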

Hands-on practice

  
import numpy as np
import pandas as pd

  
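import warnings
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2
from sklearn.datasets import load_boston
with warnings.catch_warnings():
    # Suppress the deprecation warning emitted by load_boston
    warnings.filterwarnings("ignore")
    data = load_boston()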

  
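# Features and target as separate DataFrames
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=['MEDV'])

# Join features and target on the row index into a single table
housing = pd.merge(X, y, left_index=True, right_index=True, how='inner')
housing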

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
0	0.00632	18.0	2.31	0.0	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.0	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.0	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.0	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.0	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
501	0.06263	0.0	11.93	0.0	0.573	6.593	69.1	2.4786	1.0	273.0	21.0	391.99	9.67	22.4
502	0.04527	0.0	11.93	0.0	0.573	6.120	76.7	2.2875	1.0	273.0	21.0	396.90	9.08	20.6
503	0.06076	0.0	11.93	0.0	0.573	6.976	91.0	2.1675	1.0	273.0	21.0	396.90	5.64	23.9
504	0.10959	0.0	11.93	0.0	0.573	6.794	89.3	2.3889	1.0	273.0	21.0	393.45	6.48	22.0
505	0.04741	0.0	11.93	0.0	0.573	6.030	80.8	2.5050	1.0	273.0	21.0	396.90	7.88	11.9

506 rows × 14 columns

  
print(data.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

  
housing_data = housing.copy()

housing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB

  
housing_data.head()

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
0	0.00632	18.0	2.31	0.0	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.0	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.0	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.0	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.0	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2

  
housing_data.tail()

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
501	0.06263	0.0	11.93	0.0	0.573	6.593	69.1	2.4786	1.0	273.0	21.0	391.99	9.67	22.4
502	0.04527	0.0	11.93	0.0	0.573	6.120	76.7	2.2875	1.0	273.0	21.0	396.90	9.08	20.6
503	0.06076	0.0	11.93	0.0	0.573	6.976	91.0	2.1675	1.0	273.0	21.0	396.90	5.64	23.9
504	0.10959	0.0	11.93	0.0	0.573	6.794	89.3	2.3889	1.0	273.0	21.0	393.45	6.48	22.0
505	0.04741	0.0	11.93	0.0	0.573	6.030	80.8	2.5050	1.0	273.0	21.0	396.90	7.88	11.9

  
print('CRIM_min :', housing_data.CRIM.min())
print('AGE_min :', housing_data.AGE.min())
print('MEDV_min :', housing_data.MEDV.min())

CRIM_min : 0.00632
AGE_min : 2.9
MEDV_min : 5.0

Check the observed ranges against the dataset description

Univariate non-graphical analysis of a categorical variable

  
housing_data = housing_data.astype({'CHAS':'object'})
housing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    object 
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(13), object(1)
memory usage: 55.5+ KB

  
pd.crosstab(housing_data.CHAS, columns='count')

col_0	count
CHAS
0.0	471
1.0	35

  
# Convert counts to proportions
pd.crosstab(housing_data.CHAS, columns='count', normalize=True)

col_0	count
CHAS
0.0	0.93083
1.0	0.06917

  
# Add totals (margins)
pd.crosstab(housing_data.CHAS, columns='count', margins=True)

col_0	count	All
CHAS
0.0	471	471
1.0	35	35
All	506	506

  
pd.crosstab(housing_data.CHAS, columns='count', normalize=True, margins=True)

col_0	count	All
CHAS
0.0	0.93083	0.93083
1.0	0.06917	0.06917
All	1.00000	1.00000

Univariate non-graphical analysis of continuous variables

  
housing_data = housing.copy()

  
housing_data = housing_data.astype({'CHAS':'object'})
housing_data.describe()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
count	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000
mean	3.613524	11.363636	11.136779	0.554695	6.284634	68.574901	3.795043	9.549407	408.237154	18.455534	356.674032	12.653063	22.532806
std	8.601545	23.322453	6.860353	0.115878	0.702617	28.148861	2.105710	8.707259	168.537116	2.164946	91.294864	7.141062	9.197104
min	0.006320	0.000000	0.460000	0.385000	3.561000	2.900000	1.129600	1.000000	187.000000	12.600000	0.320000	1.730000	5.000000
25%	0.082045	0.000000	5.190000	0.449000	5.885500	45.025000	2.100175	4.000000	279.000000	17.400000	375.377500	6.950000	17.025000
50%	0.256510	0.000000	9.690000	0.538000	6.208500	77.500000	3.207450	5.000000	330.000000	19.050000	391.440000	11.360000	21.200000
75%	3.677083	12.500000	18.100000	0.624000	6.623500	94.075000	5.188425	24.000000	666.000000	20.200000	396.225000	16.955000	25.000000
max	88.976200	100.000000	27.740000	0.871000	8.780000	100.000000	12.126500	24.000000	711.000000	22.000000	396.900000	37.970000	50.000000

  
q1 = housing_data['CRIM'].quantile(0.25)
q3 = housing_data['CRIM'].quantile(0.75)
iqr = q3 - q1
print(f'q1 : {q1}')
print(f'q3 : {q3}')
print(f'iqr : {iqr}')

q1 : 0.08204499999999999
q3 : 3.6770825
iqr : 3.5950375
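One common use of the IQR, not part of the original walkthrough, is the 1.5 × IQR rule for flagging potential outliers. The sketch below assumes that rule and uses a small hypothetical series rather than the CRIM column.

```python
import pandas as pd

# Hypothetical right-skewed values (not the actual CRIM data)
values = pd.Series([0.01, 0.03, 0.08, 0.25, 0.30, 3.70, 9.00, 89.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences from the 1.5 * IQR rule (an assumed convention, not a fixed law)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations outside the fences are candidates for closer inspection
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)
```

Flagged values are only candidates for review; in a heavily skewed variable many of them can be legitimate observations.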

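Checking skewness and kurtosis

Skewness: a measure of the asymmetry of a distribution, i.e. how far it departs from symmetry
- When skewness is greater than 0, the bulk of the values sits on the left and the distribution has a long right tail (right-skewed)
Kurtosis: a measure of how peaked a distribution is, i.e. how strongly observations concentrate around the mean
- pandas' kurt() returns excess kurtosis, for which the normal distribution scores 0; a value greater than 0 indicates a sharper peak and heavier tails than the normal distribution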
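The sign conventions above can be checked on simulated data. The sketch below is an illustration with synthetic samples (a normal and an exponential distribution, both chosen here as assumptions), not with the housing data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Symmetric reference sample and a right-skewed sample
symmetric = pd.Series(rng.normal(size=10_000))
right_skewed = pd.Series(rng.exponential(size=10_000))

for name, s in [("normal", symmetric), ("exponential", right_skewed)]:
    # skew() near 0 means symmetric; kurt() is excess kurtosis (normal = 0)
    print(name, "skew:", round(s.skew(), 2), "kurt:", round(s.kurt(), 2))
```

The normal sample should give values near 0 for both measures, while the exponential sample gives clearly positive skewness and excess kurtosis.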

  
print('Skewness :', round(housing_data['CRIM'].skew(), 4))
print('Kurtosis :', round(housing_data['CRIM'].kurt(), 4))

Skewness : 5.2231
Kurtosis : 37.1305

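The skewness and kurtosis of the CRIM column show that the variable is strongly right-skewed (values concentrated on the left with a long right tail) and far more peaked than a normal distribution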

  
print('Skewness :', round(housing_data['MEDV'].skew(), 4))
print('Kurtosis :', round(housing_data['MEDV'].kurt(), 4))

Skewness : 1.1081
Kurtosis : 1.4952

  
print('Skewness :', round(np.log(housing_data['CRIM']).skew(), 4))
print('Kurtosis :', round(np.log(housing_data['CRIM']).kurt(), 4))

Skewness : 0.4059
Kurtosis : -1.0097

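After the log transform, both skewness and kurtosis are much closer to 0
Rather than feeding raw data with an extreme distribution to an algorithm, applying a transform such as the logarithm first can give more reliable results, particularly for methods that are sensitive to skewed inputs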
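The effect of the log transform can be reproduced on synthetic data. The sketch below draws a hypothetical log-normal sample (an assumption made for illustration, not a claim about CRIM) and compares skewness before and after the transform. `np.log1p` is shown as an option for variables that contain zeros, where `np.log` would return `-inf`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical strictly positive, heavily right-skewed sample
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.5, size=5_000))

logged = np.log(raw)       # requires strictly positive values
logged1p = np.log1p(raw)   # log(1 + x), also defined at x = 0

print("raw   skew:", round(raw.skew(), 2))
print("log   skew:", round(logged.skew(), 2))
print("log1p skew:", round(logged1p.skew(), 2))
```

Because the log of a log-normal variable is normally distributed, the skewness of `logged` should be close to 0 here; real data will rarely behave this cleanly.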
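
Introducing the Pandas-Profiling package

Pandas-Profiling is a package that automates basic EDA and generates a report
Data head, tail, and missing values
Key statistics for categorical and continuous variables
Visualizations are included as well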

  
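# Importing pandas_profiling adds the profile_report() method to DataFrame
# Note: newer releases of this package are published as ydata-profiling
import pandas_profiling
from pandas_profiling import ProfileReport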

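Overview: check a summary of the data
- Number of columns, number of observations, number of missing values, duplicate rows, count of columns per type, etc.
Variables: check the type and details of each column
- Number of distinct values
- For continuous variables: mean, maximum, minimum, quantile statistics, skewness, kurtosis, and other key statistics
- For categorical variables: frequency and proportion of each category
Interactions, Correlations: check the relationship between pairs of variables
Missing Values: check missing values
Sample: check the head and tail of the data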

  
housing_data.profile_report()


  
housing_data.profile_report().to_file('./housing_data_pr_report.html')


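Summary of Pandas Profiling's characteristics

A few lines of code reveal most of the information about a dataset
Reports for decision-making are generated automatically and can be used directly at work
However, on large datasets, summarizing the data and generating the report take a long time
When considering it for large data, one option is to run it on an appropriately drawn sample and look only at the overall picture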
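Drawing such a sample can be sketched with `DataFrame.sample`. The frame `big`, its columns, and the sample size of 5,000 are hypothetical choices for illustration; a suitable size depends on the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical large table
big = pd.DataFrame({
    "value": rng.normal(size=100_000),
    "group": rng.integers(0, 3, size=100_000),
})

# Simple random sample without replacement; random_state makes it repeatable
sample = big.sample(n=5_000, random_state=0)

# Quick sanity check that the sample resembles the full data
print("full mean  :", round(big["value"].mean(), 4))
print("sample mean:", round(sample["value"].mean(), 4))

# The sample can then be profiled in place of the full frame,
# for example with sample.profile_report()
```

A simple random sample can under-represent rare categories, so it is worth comparing a few key statistics between the sample and the full data before relying on the report.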

AI & Data Analysis, AIVLE SCHOOL

This post is licensed under CC BY 4.0 by the author.
