[Pre-study] 4.3 Multivariate Non-Graphical Exploration

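Multivariate non-graphical exploration

A type of data exploration that examines the relationships among two or more variables through cross tabulations, correlation coefficients, and similar summaries.
=> The goal is to understand how the given variables relate to one another using numerical and statistical measures.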

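| Data combination | Non-graphical method | Purpose |
|---|---|---|
| Categorical - categorical | Cross tabulation | Examine the association between the categories of two categorical variables and their composition |
| Categorical - continuous | Per-category statistics | Compare representative statistics across categories |
| Continuous - continuous | Correlation coefficient | Gauge the strength of the relationship between two continuous variables |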

Cross tabulation

Identifies the association between a categorical - categorical pair of variables.

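Per-category summary statistics

Compares representative values across categories for a categorical - continuous pair of variables.

  • Mean, median, and so on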

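Correlation coefficient

Measures the strength of the relationship between a continuous - continuous pair of variables.

High correlation coefficient

Variables that are closely related and provide similar information. A common rule of thumb for the magnitude of the coefficient:

  • 1 ~ 0.7 : strong correlation
  • 0.7 ~ 0.3 : correlation exists
  • 0.3 ~ 0.1 : weak correlation
  • 0.1 ~ 0 : negligible relationship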
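As a rough illustration of the rule of thumb above, the sketch below computes a Pearson coefficient by hand and maps it to one of the four bands. The function names `pearson_r` and `strength`, the label "moderate" for the "correlation exists" band, and the sample values are assumptions made for this example. Applying the bands to the absolute value of the coefficient is also an assumption, since the list only shows positive ranges.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: sum of co-deviations divided by the product of the deviation norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

def strength(r):
    """Map |r| to the rule-of-thumb bands; a boundary value goes to the stronger band."""
    a = abs(r)
    if a >= 0.7:
        return "strong"
    if a >= 0.3:
        return "moderate"
    if a >= 0.1:
        return "weak"
    return "negligible"

# Hypothetical sample values
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4), strength(r))
```

Because the sign is ignored, a coefficient of -0.74 falls in the same band as +0.74; the sign only indicates the direction of the relationship.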

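  • In regression analysis, strong correlation among independent variables leads to multicollinearity
  • This violates the regression assumption that the independent variables are independent of one another
  • The regression coefficients become unstable and cannot correctly describe each variable's effect on the dependent variable, which undermines the stability of the model

Correlation results obtained during data exploration should therefore be taken into account before modeling.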
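One way to quantify multicollinearity before modeling is the variance inflation factor (VIF), which the text above does not cover. The sketch below is a minimal version that relies on the identity that the VIF of each variable equals the corresponding diagonal entry of the inverse correlation matrix. The function name `vif_from_corr` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def vif_from_corr(X):
    """VIF per column: diagonal of the inverse correlation matrix, i.e. 1 / (1 - R_j^2)."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic example data (not from the housing dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)   # nearly a copy of a, so a and b are collinear
c = rng.normal(size=200)                  # generated independently of a and b
vif = vif_from_corr(np.column_stack([a, b, c]))
print(np.round(vif, 1))
```

A frequently cited rule of thumb treats a VIF above about 10 as a sign of problematic collinearity, although the exact cutoff is a convention rather than a fixed rule.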

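Practice

import numpy as np
import pandas as pd

import warnings
# load_boston was removed in scikit-learn 1.2, so this cell needs an older scikit-learn release
from sklearn.datasets import load_boston
with warnings.catch_warnings():
    # Suppress the deprecation warning that load_boston emits
    warnings.filterwarnings("ignore")
    data = load_boston()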

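# Features and target (MEDV) as DataFrames
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=['MEDV'])

# Join features and target on the row index
housing = pd.merge(X, y, left_index=True, right_index=True, how='inner')

# Work on a copy so the merged frame stays untouched
housing_data = housing.copy()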

Categorical - categorical multivariate non-graphical exploration

Cross tabulation

housing_data.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
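# CHAS is a 0/1 indicator, so treat it as a categorical (object) column
housing_data = housing_data.astype({'CHAS' : 'object'})

# Split MEDV at its mean into two groups: (0, mean] -> cheap, (mean, max] -> expensive
medv_bins = [0, np.mean(housing_data['MEDV']), np.max(housing_data['MEDV'])]
medv_names = ['cheap', 'expensive']
housing_data['MEDV_G'] = pd.cut(housing_data['MEDV'], medv_bins, labels=medv_names)
housing_data

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | MEDV_G |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 | expensive |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 | cheap |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 | expensive |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 | expensive |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 | expensive |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 | cheap |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 | cheap |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 | expensive |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 | cheap |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 | cheap |

506 rows × 15 columns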
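One detail of `pd.cut` worth keeping in mind with bin edges like `[0, mean, max]`: by default each interval is open on the left and closed on the right, so a value exactly equal to the lowest edge falls outside every bin and becomes NaN. The sketch below uses made-up values (not the housing data) to show this and how `include_lowest=True` changes it.

```python
import pandas as pd

values = pd.Series([0.0, 5.0, 10.0, 20.0, 50.0])   # hypothetical values; their mean is 17.0
edges = [0, values.mean(), values.max()]           # same pattern as medv_bins
labels = ['cheap', 'expensive']

default = pd.cut(values, edges, labels=labels)                       # bins (0, 17], (17, 50]
closed = pd.cut(values, edges, labels=labels, include_lowest=True)   # first bin also takes 0

print(default.tolist())
print(closed.tolist())
```

In the housing data the minimum MEDV is 5.0, so no row is lost with the default setting; the distinction would matter for a column such as ZN, whose minimum is 0.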

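# Counts of cheap / expensive homes by CHAS, with row and column totals
rst_CHAS = pd.crosstab(housing_data['CHAS'], housing_data['MEDV_G'], margins=True)
rst_CHAS

| CHAS | cheap | expensive | All |
|---|---|---|---|
| 0.0 | 282 | 189 | 471 |
| 1.0 | 15 | 20 | 35 |
| All | 297 | 209 | 506 |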
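# Split INDUS at its mean into two groups
# (the label 'INUDS_LOW' is spelled as in the original run, so it matches the output below)
indus_bins = [0, np.mean(housing_data['INDUS']), np.max(housing_data['INDUS'])]
indus_names = ['INUDS_LOW', 'INDUS_HIGH']
housing_data['INDUS_G'] = pd.cut(housing_data['INDUS'], indus_bins, labels=indus_names)
housing_data

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | MEDV_G | INDUS_G |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 | expensive | INUDS_LOW |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 | cheap | INUDS_LOW |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 | expensive | INUDS_LOW |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 | expensive | INUDS_LOW |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 | expensive | INUDS_LOW |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 | cheap | INDUS_HIGH |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 | cheap | INDUS_HIGH |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 | expensive | INDUS_HIGH |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 | cheap | INDUS_HIGH |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 | cheap | INDUS_HIGH |

506 rows × 16 columns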

rst_INDUS = pd.crosstab(housing_data['INDUS_G'], housing_data['MEDV_G'], margins=True)

# Share of cheap / expensive homes within each INDUS group, in percent
rst_INDUS['ratio_cheap'] = np.round((rst_INDUS['cheap'] / rst_INDUS['All'])*100, 2)
rst_INDUS['ratio_expensive'] = np.round((rst_INDUS['expensive'] / rst_INDUS['All'])*100, 2)
rst_INDUS

| INDUS_G | cheap | expensive | All | ratio_cheap | ratio_expensive |
|---|---|---|---|---|---|
| INUDS_LOW | 126 | 168 | 294 | 42.86 | 57.14 |
| INDUS_HIGH | 171 | 41 | 212 | 80.66 | 19.34 |
| All | 297 | 209 | 506 | 58.70 | 41.30 |
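The row-wise percentages computed by hand above can also be obtained directly from `pd.crosstab` through its `normalize` parameter. The sketch below is one possible variant on a small made-up frame; the column names `group` and `price` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical data: three rows in the 'low' group, two in the 'high' group
df = pd.DataFrame({
    'group': ['low', 'low', 'low', 'high', 'high'],
    'price': ['cheap', 'expensive', 'expensive', 'cheap', 'cheap'],
})

# normalize='index' divides each row by its row total, giving within-group shares
ratio = pd.crosstab(df['group'], df['price'], normalize='index').mul(100).round(2)
print(ratio)
```

Unlike the manual version, this returns only the percentages, so the raw counts need a separate `pd.crosstab` call if both are wanted side by side.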
# Split RAD at its mean into two groups
# (the low group carries the label 'INUDS_LOW' in the original run; it denotes low RAD here)
RAD_bins = [0, np.mean(housing_data['RAD']), np.max(housing_data['RAD'])]
RAD_names = ['INUDS_LOW', 'RAD_HIGH']
housing_data['RAD_G'] = pd.cut(housing_data['RAD'], RAD_bins, labels=RAD_names)

rst_RAD = pd.crosstab(housing_data['RAD_G'], housing_data['MEDV_G'], margins=True)

# Share of cheap / expensive homes within each RAD group, in percent
rst_RAD['ratio_cheap'] = np.round((rst_RAD['cheap'] / rst_RAD['All'])*100, 2)
rst_RAD['ratio_expensive'] = np.round((rst_RAD['expensive'] / rst_RAD['All'])*100, 2)
rst_RAD

| RAD_G | cheap | expensive | All | ratio_cheap | ratio_expensive |
|---|---|---|---|---|---|
| INUDS_LOW | 183 | 191 | 374 | 48.93 | 51.07 |
| RAD_HIGH | 114 | 18 | 132 | 86.36 | 13.64 |
| All | 297 | 209 | 506 | 58.70 | 41.30 |
# Cross tabulation with two row variables (RAD group, then INDUS group)
rst_df = pd.crosstab([housing_data['RAD_G'], housing_data['INDUS_G']], housing_data['MEDV_G'], margins=True)
rst_df['ratio_cheap'] = np.round((rst_df['cheap'] / rst_df['All'])*100, 2)
rst_df['ratio_expensive'] = np.round((rst_df['expensive'] / rst_df['All'])*100, 2)
rst_df

| RAD_G | INDUS_G | cheap | expensive | All | ratio_cheap | ratio_expensive |
|---|---|---|---|---|---|---|
| INUDS_LOW | INUDS_LOW | 126 | 168 | 294 | 42.86 | 57.14 |
| | INDUS_HIGH | 57 | 23 | 80 | 71.25 | 28.75 |
| RAD_HIGH | INDUS_HIGH | 114 | 18 | 132 | 86.36 | 13.64 |
| All | | 297 | 209 | 506 | 58.70 | 41.30 |
# Re-bin INDUS into four equal-width groups between 0 and its maximum
# (the label 'INUDS_G1' is spelled as in the original run, so it matches the output below)
re_indus_bins = [0,
            np.max(housing_data['INDUS'])/4*1,
            np.max(housing_data['INDUS'])/4*2,
            np.max(housing_data['INDUS'])/4*3,
            np.max(housing_data['INDUS'])]
re_indus_names = ['INUDS_G1', 'INDUS_G2', 'INDUS_G3', 'INDUS_G4']
housing_data['RE_INDUS_G'] = pd.cut(housing_data['INDUS'], re_indus_bins, labels=re_indus_names)

rst_RE_INDUS = pd.crosstab(housing_data['RE_INDUS_G'], housing_data['MEDV_G'], margins=True)

rst_RE_INDUS['ratio_cheap'] = np.round((rst_RE_INDUS['cheap'] / rst_RE_INDUS['All'])*100, 2)
rst_RE_INDUS['ratio_expensive'] = np.round((rst_RE_INDUS['expensive'] / rst_RE_INDUS['All'])*100, 2)
rst_RE_INDUS

| RE_INDUS_G | cheap | expensive | All | ratio_cheap | ratio_expensive |
|---|---|---|---|---|---|
| INUDS_G1 | 56 | 139 | 195 | 28.72 | 71.28 |
| INDUS_G2 | 79 | 31 | 110 | 71.82 | 28.18 |
| INDUS_G3 | 136 | 38 | 174 | 78.16 | 21.84 |
| INDUS_G4 | 26 | 1 | 27 | 96.30 | 3.70 |
| All | 297 | 209 | 506 | 58.70 | 41.30 |
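The four equal-width bin edges written out by hand above can be generated more compactly with `np.linspace`. This is a small sketch that hard-codes the INDUS maximum shown in the `describe()` output rather than loading the data.

```python
import numpy as np

indus_max = 27.74   # maximum of INDUS, taken from the describe() output

# Edges written out one by one, as in the cell above
manual = [0, indus_max / 4 * 1, indus_max / 4 * 2, indus_max / 4 * 3, indus_max]

# Five evenly spaced edges from 0 to the maximum define four equal-width bins
edges = np.linspace(0, indus_max, 5)
print(edges)
```

Note that `pd.cut(series, bins=4)` is not equivalent: it spreads the bins between the column's minimum and maximum instead of starting at 0.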
# Cross tabulation of RAD group and the four-level INDUS group against MEDV group
rst_df = pd.crosstab([housing_data['RAD_G'], housing_data['RE_INDUS_G']], housing_data['MEDV_G'], margins=True)

rst_df['ratio_cheap'] = np.round((rst_df['cheap'] / rst_df['All'])*100, 2)
rst_df['ratio_expensive'] = np.round((rst_df['expensive'] / rst_df['All'])*100, 2)
rst_df

| RAD_G | RE_INDUS_G | cheap | expensive | All | ratio_cheap | ratio_expensive |
|---|---|---|---|---|---|---|
| INUDS_LOW | INUDS_G1 | 56 | 139 | 195 | 28.72 | 71.28 |
| | INDUS_G2 | 79 | 31 | 110 | 71.82 | 28.18 |
| | INDUS_G3 | 22 | 20 | 42 | 52.38 | 47.62 |
| | INDUS_G4 | 26 | 1 | 27 | 96.30 | 3.70 |
| RAD_HIGH | INDUS_G3 | 114 | 18 | 132 | 86.36 | 13.64 |
| All | | 297 | 209 | 506 | 58.70 | 41.30 |

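Categorical - continuous multivariate non-graphical exploration

# Mean of INDUS within each MEDV group
pd.DataFrame(housing_data.groupby(['MEDV_G'])['INDUS'].mean())

| MEDV_G | INDUS |
|---|---|
| cheap | 13.813266 |
| expensive | 7.333349 |

# Median of INDUS within each MEDV group
pd.DataFrame(housing_data.groupby(['MEDV_G'])['INDUS'].median())

| MEDV_G | INDUS |
|---|---|
| cheap | 18.10 |
| expensive | 5.86 |

# Mean of AGE within each MEDV group
pd.DataFrame(housing_data.groupby(['MEDV_G'])['AGE'].mean())

| MEDV_G | AGE |
|---|---|
| cheap | 79.009764 |
| expensive | 53.746411 |

# Median of AGE within each MEDV group
pd.DataFrame(housing_data.groupby(['MEDV_G'])['AGE'].median())

| MEDV_G | AGE |
|---|---|
| cheap | 88.6 |
| expensive | 52.6 |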
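The four separate group-by calls above could be condensed into a single table with `agg`. The sketch below shows the idea on a handful of made-up rows; the values are hypothetical and only the column names mirror the housing data.

```python
import pandas as pd

# Hypothetical rows standing in for the housing data
df = pd.DataFrame({
    'MEDV_G': ['cheap', 'cheap', 'cheap', 'expensive', 'expensive'],
    'INDUS': [18.1, 19.6, 8.1, 2.3, 5.9],
    'AGE': [90.0, 85.0, 70.0, 45.0, 60.0],
})

# One row per group, one (variable, statistic) column per combination
summary = df.groupby('MEDV_G')[['INDUS', 'AGE']].agg(['mean', 'median'])
print(summary.round(2))
```

The result has two-level column labels such as `('INDUS', 'median')`, which keeps the mean and median of each variable next to each other for comparison.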

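Continuous - continuous multivariate non-graphical exploration

# Start again from the merged frame, without the grouping columns added above
housing_data = housing.copy()

housing_data = housing_data.astype({'CHAS' : 'object'})

# Correlation of every numeric column with MEDV, sorted from most negative to most positive
# CHAS is object dtype and is absent from the output below; recent pandas versions may need numeric_only=True to exclude it
np.round(housing_data.corrwith(housing_data['MEDV']), 2).sort_values()

LSTAT     -0.74
PTRATIO   -0.51
INDUS     -0.48
TAX       -0.47
NOX       -0.43
CRIM      -0.39
AGE       -0.38
RAD       -0.38
DIS        0.25
B          0.33
ZN         0.36
RM         0.70
MEDV       1.00
dtype: float64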
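# Pairwise correlation matrix of the numeric columns
np.round(housing_data.corr(), 2)

| | CRIM | ZN | INDUS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRIM | 1.00 | -0.20 | 0.41 | 0.42 | -0.22 | 0.35 | -0.38 | 0.63 | 0.58 | 0.29 | -0.39 | 0.46 | -0.39 |
| ZN | -0.20 | 1.00 | -0.53 | -0.52 | 0.31 | -0.57 | 0.66 | -0.31 | -0.31 | -0.39 | 0.18 | -0.41 | 0.36 |
| INDUS | 0.41 | -0.53 | 1.00 | 0.76 | -0.39 | 0.64 | -0.71 | 0.60 | 0.72 | 0.38 | -0.36 | 0.60 | -0.48 |
| NOX | 0.42 | -0.52 | 0.76 | 1.00 | -0.30 | 0.73 | -0.77 | 0.61 | 0.67 | 0.19 | -0.38 | 0.59 | -0.43 |
| RM | -0.22 | 0.31 | -0.39 | -0.30 | 1.00 | -0.24 | 0.21 | -0.21 | -0.29 | -0.36 | 0.13 | -0.61 | 0.70 |
| AGE | 0.35 | -0.57 | 0.64 | 0.73 | -0.24 | 1.00 | -0.75 | 0.46 | 0.51 | 0.26 | -0.27 | 0.60 | -0.38 |
| DIS | -0.38 | 0.66 | -0.71 | -0.77 | 0.21 | -0.75 | 1.00 | -0.49 | -0.53 | -0.23 | 0.29 | -0.50 | 0.25 |
| RAD | 0.63 | -0.31 | 0.60 | 0.61 | -0.21 | 0.46 | -0.49 | 1.00 | 0.91 | 0.46 | -0.44 | 0.49 | -0.38 |
| TAX | 0.58 | -0.31 | 0.72 | 0.67 | -0.29 | 0.51 | -0.53 | 0.91 | 1.00 | 0.46 | -0.44 | 0.54 | -0.47 |
| PTRATIO | 0.29 | -0.39 | 0.38 | 0.19 | -0.36 | 0.26 | -0.23 | 0.46 | 0.46 | 1.00 | -0.18 | 0.37 | -0.51 |
| B | -0.39 | 0.18 | -0.36 | -0.38 | 0.13 | -0.27 | 0.29 | -0.44 | -0.44 | -0.18 | 1.00 | -0.37 | 0.33 |
| LSTAT | 0.46 | -0.41 | 0.60 | 0.59 | -0.61 | 0.60 | -0.50 | 0.49 | 0.54 | 0.37 | -0.37 | 1.00 | -0.74 |
| MEDV | -0.39 | 0.36 | -0.48 | -0.43 | 0.70 | -0.38 | 0.25 | -0.38 | -0.47 | -0.51 | 0.33 | -0.74 | 1.00 |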
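With a matrix of this size it helps to list the strongly correlated pairs programmatically rather than scan for them. The sketch below is one way to do that; the helper name `strong_pairs` and the 0.7 threshold are assumptions for illustration, and the small matrix only copies four variables from the rounded output above.

```python
import pandas as pd

def strong_pairs(corr, threshold=0.7):
    """Return (var_a, var_b, r) for each pair above the diagonal with |r| >= threshold."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, so each pair appears once
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest relationships first, regardless of sign
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Values copied from the rounded correlation matrix for four of the variables
names = ['RAD', 'TAX', 'INDUS', 'DIS']
corr = pd.DataFrame(
    [[1.00, 0.91, 0.60, -0.49],
     [0.91, 1.00, 0.72, -0.53],
     [0.60, 0.72, 1.00, -0.71],
     [-0.49, -0.53, -0.71, 1.00]],
    index=names, columns=names,
)
pairs = strong_pairs(corr)
print(pairs)
```

Pairs of independent variables that show up here are candidates to review for multicollinearity before fitting a regression model.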
import scipy.stats as stats
# Pearson correlation between TAX and RAD, with the p-value for the null hypothesis of no correlation
stats.pearsonr(housing_data.TAX, housing_data.RAD)

PearsonRResult(statistic=0.9102281885331875, pvalue=4.129920119396259e-195)
This post is licensed under CC BY 4.0 by the author.
