
[R/Practice] Classification Analysis of the Wisconsin Breast Cancer Data Using Random Forest

연정양 2022. 12. 20.

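Data definition

[R/Practice] Classification Analysis of the Wisconsin Breast Cancer Data Using an xgboost Model

Data definition - Data used: wisc_bc_data.csv. Column meanings: id = patient identification number; diagnosis = tumor type (M = malignant, B = benign). Information on each cell: radius = radius (mean of the distances from the center to the perimeter); texture = texture (Gr...

robinlovesyeon.tistory.com

See the link above for the data definition and the file download.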

Classification analysis using random forest

1) Load the library

library(randomForest)

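2) Data preprocessing

# raw_wis is assumed to be loaded already, as in the linked xgboost post
set.seed(1234)  # fix the random split so it is reproducible
idx <- sample(nrow(raw_wis), 0.7 * nrow(raw_wis))  # 70% of rows for training
train <- raw_wis[idx, -1]   # drop id (column 1)
test <- raw_wis[-idx, -1]

train_mat <- as.matrix(train[-c(1, 32)])  # drop columns 1 and 32 (presumably diagnosis and label)
train_lab <- train$label

The id column was judged unnecessary for the random forest model and was removed. The dependent variable was dropped before converting the predictors to a matrix, and it was stored separately in train_lab. Note that the code reads it from the label column, which presumably encodes diagnosis.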
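The split and column-dropping steps can be sketched on a small toy data frame. Everything here is a hypothetical stand-in for raw_wis (the toy columns, the value ranges, and the 0/1 coding of label are assumptions, not taken from the original data), so treat it as an illustration of the indexing only.

```r
# Toy stand-in for raw_wis: id, diagnosis, two features, numeric label (all hypothetical)
set.seed(1234)
n <- 10
toy <- data.frame(
  id        = 1:n,
  diagnosis = rep(c("M", "B"), length.out = n),
  radius    = runif(n, 6, 28),
  texture   = runif(n, 9, 40)
)
toy$label <- ifelse(toy$diagnosis == "M", 1, 0)  # assumed 0/1 coding

# Sample 70% of the row indices for training; the remaining rows form the test set
idx <- sample(nrow(toy), 0.7 * nrow(toy))
train <- toy[idx, -1]    # drop id (column 1)
test  <- toy[-idx, -1]

# After dropping id, diagnosis is column 1 and label is the last column
train_mat <- as.matrix(train[-c(1, ncol(train))])
train_lab <- train$label

dim(train_mat)
colnames(train_mat)
```

Using ncol(train) instead of a hard-coded 32 keeps the sketch valid if the number of feature columns changes.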

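3) Build the random forest model

RFmodel <- randomForest(train_lab ~ ., data = train_mat, ntree = 100, proximity = TRUE)
RFmodel

ntree is the number of trees to grow; the forest here was grown with 100 trees. randomForest fits a classification forest when the response is a factor and a regression forest when it is numeric. The IncNodePurity importance reported in step 5 is the regression measure, which indicates that train_lab was numeric in this run.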
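How the type of the response decides the kind of forest can be checked directly. The sketch below uses a two-class subset of the built-in iris data as a stand-in, since the original file is not included here; the variable names are illustrative only.

```r
library(randomForest)

set.seed(1234)
# Stand-in data: two-class subset of the built-in iris data (not the breast cancer file)
df <- droplevels(subset(iris, Species != "setosa"))
x <- df[, 1:4]
y_factor  <- df$Species                 # factor response
y_numeric <- as.integer(y_factor) - 1L  # same labels coded as 0/1

# A numeric response gives a regression forest; randomForest warns about
# the small number of unique values, so the warning is suppressed here
rf_reg <- suppressWarnings(randomForest(x = x, y = y_numeric, ntree = 100))

# A factor response gives a classification forest
rf_cls <- randomForest(x = x, y = y_factor, ntree = 100)

rf_reg$type
colnames(importance(rf_reg))
rf_cls$type
colnames(importance(rf_cls))
```

If classification is the goal, converting the response with as.factor() before fitting is one way to get it; the importance column is then MeanDecreaseGini rather than IncNodePurity.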

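4) Visualization

test_mat <- as.matrix(test[-c(1, 32)])  # test predictors, prepared the same way as train_mat
dim(test_mat)
test_lab <- test$label

plot(RFmodel)  # error against the number of trees

The plot shows that the error decreases as the number of trees increases.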
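The post prepares test_mat and test_lab but does not go on to use them. As a rough sketch of how a held-out set could be scored, and of the numbers behind the error curve, the example below again uses a two-class iris subset as a stand-in and fits a classification forest; it is not the author's evaluation.

```r
library(randomForest)

set.seed(1234)
# Stand-in data: two-class subset of the built-in iris data
df <- droplevels(subset(iris, Species != "setosa"))
idx <- sample(nrow(df), 0.7 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]

rf <- randomForest(Species ~ ., data = train, ntree = 100)

# Out-of-bag error after 1, 10, 50 and 100 trees;
# this OOB column is one of the curves that plot(rf) draws
oob <- rf$err.rate[, "OOB"]
oob[c(1, 10, 50, 100)]

# Score the held-out rows and tabulate actual against predicted classes
pred <- predict(rf, newdata = test)
conf <- table(actual = test$Species, predicted = pred)
conf

accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```

For a regression forest the per-tree error is stored in the mse component instead of err.rate.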

5) View the important variables

# Sort in base R so that the variable names (row names) are kept and dplyr is not required
imp <- as.data.frame(importance(RFmodel))
imp <- imp[order(-imp$IncNodePurity), , drop = FALSE]
head(imp, 8)

Column	IncNodePurity
perimeter_worst	19.8958356
points_worst	15.7742412
points_mean	13.5944334
area_worst	11.9207533
radius_worst	8.9063491
concavity_mean	5.0389156
perimeter_mean	2.9995902
radius_mean	2.2291575

6) Visualize the important variables

varImpPlot(RFmodel)

In the plot, perimeter_worst has the greatest influence, followed by points_worst, points_mean, area_worst, radius_worst, and concavity_mean.
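A ranking like the one above can also be produced as a sorted table. The sketch below is an optional variation, not part of the original analysis: it fits a classification forest on a two-class iris stand-in with importance = TRUE, which makes randomForest report the permutation-based MeanDecreaseAccuracy alongside MeanDecreaseGini.

```r
library(randomForest)

set.seed(1234)
# Stand-in data: two-class subset of the built-in iris data
df <- droplevels(subset(iris, Species != "setosa"))

# importance = TRUE additionally computes permutation-based importance
rf <- randomForest(Species ~ ., data = df, ntree = 100, importance = TRUE)

imp <- as.data.frame(importance(rf))
# Sort by Gini importance, largest first; row names carry the variable names
imp <- imp[order(-imp$MeanDecreaseGini), , drop = FALSE]
imp[, c("MeanDecreaseAccuracy", "MeanDecreaseGini")]
```

Comparing the two columns is a quick check on whether the ranking depends on the importance measure.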

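Source code

library(randomForest)

# raw_wis is assumed to be loaded already, as in the linked xgboost post
set.seed(1234)
idx <- sample(nrow(raw_wis), 0.7 * nrow(raw_wis))  # 70% of rows for training
train <- raw_wis[idx, -1]   # drop id (column 1)
test <- raw_wis[-idx, -1]

train_mat <- as.matrix(train[-c(1, 32)])  # predictors only
train_lab <- train$label

RFmodel <- randomForest(train_lab ~ ., data = train_mat, ntree = 100, proximity = TRUE)
RFmodel

test_mat <- as.matrix(test[-c(1, 32)])
dim(test_mat)
test_lab <- test$label

plot(RFmodel)  # error against the number of trees

# Sort in base R so that the variable names (row names) are kept and dplyr is not required
imp <- as.data.frame(importance(RFmodel))
imp <- imp[order(-imp$IncNodePurity), , drop = FALSE]
head(imp, 8)

varImpPlot(RFmodel)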
