Byeol Lo

[ADP] Chapter 4 Data Analysis - Classification Analysis

Unknown user 2024. 8. 3. 14:36

https://seonghun120614.tistory.com/323

Logistic Regression for Single-class

For the theory, see the post linked above.

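When logistic regression is used for classification, a threshold (cutoff) is set on the sigmoid output: an observation whose predicted probability exceeds the threshold is assigned to the Y=1 group, and otherwise to Y=0. The threshold is usually chosen by applying a loss function, or by weighing accuracy, sensitivity, and specificity together.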
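To make the threshold rule concrete, here is a minimal sketch that scores a few candidate thresholds on two species from R's built-in iris data. The helper `metrics_at` and the thresholds 0.3, 0.5, and 0.7 are assumptions chosen for illustration, not a recommended procedure.

```r
# Sketch: compare classification metrics across several thresholds.
data(iris)
two <- droplevels(subset(iris, Species != "virginica"))
fit <- glm(Species ~ Sepal.Length, data = two, family = binomial)

# Predicted probability of the second factor level ("versicolor", i.e. Y = 1)
p_hat <- predict(fit, type = "response")
y <- as.integer(two$Species == "versicolor")

# Hypothetical helper: accuracy, sensitivity and specificity at one threshold
metrics_at <- function(threshold) {
  pred <- as.integer(p_hat > threshold)
  c(threshold   = threshold,
    accuracy    = mean(pred == y),
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}

# Candidate thresholds are arbitrary choices for illustration
result <- t(sapply(c(0.3, 0.5, 0.7), metrics_at))
print(round(result, 3))
```

Raising the threshold can only lower sensitivity and raise specificity, so the choice depends on which kind of error costs more.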

Logistic Regression in R

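data(iris)
# Keep only two species so that the response is binary
a <- subset(iris, Species == "setosa" | Species == "versicolor")
# Re-create the factor to drop the unused "virginica" level
a$Species <- factor(a$Species)
str(a)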

> str(a)
'data.frame':	100 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 2 levels "setosa","versicolor": 1 1 1 1 1 1 1 1 1 1 ...

Fitting the model gives:

b <- glm(Species~Sepal.Length, data=a, family = binomial)
summary(b)

Call:
glm(formula = Species ~ Sepal.Length, family = binomial, data = a)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -27.831      5.434  -5.122 3.02e-07 ***
Sepal.Length    5.140      1.007   5.107 3.28e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.629  on 99  degrees of freedom
Residual deviance:  64.211  on 98  degrees of freedom
AIC: 68.211

Number of Fisher Scoring iterations: 6

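Interpreting the output: Sepal.Length has a significant effect on distinguishing the species (***). As for the deviances, the Null deviance is the deviance of the model that contains only an intercept, and the Residual deviance is the deviance of the fitted model after the predictor Sepal.Length is added. A null deviance of 138.629 on 99 degrees of freedom gives a p-value of about 0.005 under the chi-square distribution, which indicates lack of fit: the intercept-only model does not fit the observed data well. The residual deviance of 64.211 on 98 degrees of freedom gives a p-value of about 0.997, so the null hypothesis is not rejected and the fitted values can be regarded as fitting the observed data well.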
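The p-values quoted above can be reproduced with `pchisq`. The sketch below refits the same model so it runs on its own. The last quantity, the drop in deviance tested on 1 degree of freedom, is an addition for comparison and not part of the original output. Note that the chi-square approximation for a deviance computed from ungrouped 0/1 responses is rough, so these goodness-of-fit p-values are best read as indicative.

```r
data(iris)
a <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
b <- glm(Species ~ Sepal.Length, data = a, family = binomial)

# Upper-tail chi-square p-values for the two deviances
p_null  <- pchisq(b$null.deviance, df = b$df.null, lower.tail = FALSE)
p_resid <- pchisq(deviance(b), df = df.residual(b), lower.tail = FALSE)

# Drop in deviance from adding Sepal.Length, tested on 1 degree of freedom
drop_dev <- b$null.deviance - deviance(b)
p_drop   <- pchisq(drop_dev, df = b$df.null - df.residual(b), lower.tail = FALSE)

print(c(p_null = p_null, p_resid = p_resid, p_drop = p_drop))
```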

> coef(b)
 (Intercept) Sepal.Length 
  -27.831451     5.140336 
> exp(coef(b)["Sepal.Length"])
Sepal.Length 
    170.7732 
> confint(b, parm = "Sepal.Length")
Waiting for profiling to be done...
   2.5 %   97.5 % 
3.421613 7.415508 
> exp(confint(b, parm = "Sepal.Length"))
Waiting for profiling to be done...
     2.5 %     97.5 % 
  30.61878 1661.55385

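The output above gives the confidence intervals for the regression coefficient β and for exp(β), the factor by which the odds are multiplied per one-unit increase in Sepal.Length: the odds of versicolor are multiplied by about 170.8, with a 95% interval of roughly 30.6 to 1661.6.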
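As a quick check on what exp(β) means, the sketch below computes the odds at two sepal lengths one unit apart and compares their ratio with exp(β). The values 5 and 6 are arbitrary; any pair one unit apart should give the same ratio.

```r
data(iris)
a <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
b <- glm(Species ~ Sepal.Length, data = a, family = binomial)

# Predicted probabilities at two sepal lengths one unit apart (arbitrary values)
p <- predict(b, newdata = data.frame(Sepal.Length = c(5, 6)), type = "response")
odds <- p / (1 - p)

# The ratio of the two odds should equal exp(beta)
or_from_pred <- unname(odds[2] / odds[1])
or_from_coef <- unname(exp(coef(b)["Sepal.Length"]))
print(c(from_predictions = or_from_pred, from_coefficient = or_from_coef))
```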

Multiple Logistic Regression in R

glm.vs <- glm(vs~mpg+am, data=mtcars, family=binomial)
summary(glm.vs)

Call:
glm(formula = vs ~ mpg + am, family = binomial, data = mtcars)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -12.7051     4.6252  -2.747  0.00602 **
mpg           0.6809     0.2524   2.698  0.00697 **
am           -3.0073     1.5995  -1.880  0.06009 . 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.860  on 31  degrees of freedom
Residual deviance: 20.646  on 29  degrees of freedom
AIC: 26.646

Number of Fisher Scoring iterations: 6

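With both predictors in the model, let us read the estimated coefficients as odds ratios. Holding am fixed, a one-unit increase in mpg multiplies the odds that vs is 1 by exp(0.6809) ≈ 1.98, a 98% increase. Holding mpg fixed, am = 1 multiplies the odds by exp(-3.0073) ≈ 0.05, so the odds decrease.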
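The same odds ratios can be read off the fitted object directly. A minimal sketch, refitting the model so it is self-contained (the names `odds_ratio` and `pct_change` are illustrative):

```r
glm.vs <- glm(vs ~ mpg + am, data = mtcars, family = binomial)

# Exponentiate the coefficients to get multiplicative effects on the odds
odds_ratio <- exp(coef(glm.vs))

# Express each effect as a percentage change in the odds
pct_change <- (odds_ratio - 1) * 100
print(round(cbind(odds_ratio, pct_change), 3))
```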

Selecting the Best Regression Equation for Logistic Regression

As in linear regression, step can be used here to select variables (when no scope is supplied, direction defaults to "backward").

step.vs <- step(glm.vs, direction="backward")

Start:  AIC=26.65
vs ~ mpg + am

       Df Deviance    AIC
<none>      20.646 26.646
- am    1   25.533 29.533
- mpg   1   42.953 46.953
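The AIC column can be rebuilt by hand, which shows why step keeps both terms: dropping either one raises the AIC. The sketch below relies on the fact that for a 0/1 response the deviance equals -2 times the log-likelihood, so AIC = deviance + 2 × (number of coefficients). The object names `no_am` and `no_mpg` are illustrative.

```r
glm.vs <- glm(vs ~ mpg + am, data = mtcars, family = binomial)

# For a 0/1 response: AIC = deviance + 2 * (number of coefficients)
aic_full <- deviance(glm.vs) + 2 * length(coef(glm.vs))

# Refit without each term, mirroring the "- am" and "- mpg" rows
no_am  <- update(glm.vs, . ~ . - am)
no_mpg <- update(glm.vs, . ~ . - mpg)
aic_table <- c(full = AIC(glm.vs), drop_am = AIC(no_am), drop_mpg = AIC(no_mpg))
print(round(aic_table, 3))
```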

ANOVA for Logistic Regression

anova(glm.vs, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: vs

Terms added sequentially (first to last)


     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                    31     43.860              
mpg   1   18.327        30     25.533 1.861e-05 ***
am    1    4.887        29     20.646   0.02706 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
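Each row of the table is a likelihood-ratio test: the Deviance column is the drop in residual deviance when the term is added, compared against a chi-square distribution with Df degrees of freedom. A minimal sketch reproducing the am row by hand, assuming a second fit `glm.mpg` (a name introduced here) that contains mpg only:

```r
glm.vs  <- glm(vs ~ mpg + am, data = mtcars, family = binomial)
glm.mpg <- glm(vs ~ mpg, data = mtcars, family = binomial)

# Deviance drop from adding am after mpg, and its chi-square p-value on 1 df
dev_drop <- deviance(glm.mpg) - deviance(glm.vs)
p_am <- pchisq(dev_drop, df = 1, lower.tail = FALSE)

# Same comparison through anova() on the two nested fits
tab <- anova(glm.mpg, glm.vs, test = "Chisq")
print(c(dev_drop = dev_drop, p_am = p_am))
print(tab)
```

Note that this test gives p ≈ 0.027 for am, while the Wald z-test in the summary output gave 0.060; the two tests need not agree in small samples.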
