Missing value handling


Boston data

[missing value generation]
1. deleting the observations
2. deleting the variable


[missing value handling]
3. imputation with mean/median/mode
4. prediction
 4.1 KNN imputation
 4.2 rpart : decision tree
5. evaluation verification
 5.1 mean imputation
 5.2 KNN imputation
 5.3 rpart


R Code 

### Missing Values  #######
# Load the data
install.packages("MASS")
library(MASS)
data(Boston)
dim(Boston)
original <- Boston  # backup original data
# Introduce missing values
set.seed(100)
Boston[sample(1:nrow(Boston), 40), "rad"] <- NA
Boston[sample(1:nrow(Boston), 40), "ptratio"] <- NA
head(Boston)
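# Sketch (not in the original post): confirm that the NAs were injected as intended.
# Each sample() call above picks 40 distinct rows, so rad and ptratio should each show 40 NAs.
na_counts <- colSums(is.na(Boston))
na_counts[na_counts > 0]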
# 1. Deleting the observations: rows that contain NA are dropped with na.omit.
# lm() uses na.omit by default, so the two calls below are equivalent.
# lm(medv ~ ptratio + rad, data=Boston, na.action=na.omit)
lm(medv ~ ptratio + rad, data=Boston)
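# Sketch (not in the original post): count how many rows lm() silently dropped.
# fit is an illustrative name; nobs() returns the number of rows actually used in the fit.
fit <- lm(medv ~ ptratio + rad, data = Boston)
nrow(Boston) - nobs(fit)                                       # rows dropped by na.omit
sum(!complete.cases(Boston[, c("medv", "ptratio", "rad")]))    # same count, computed directly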
## 2. Deleting the variable
## When a variable has many missing values, dropping it can be better than keeping it.
# It is a trade-off between the importance of the variable and the number of observations lost by keeping it.
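# Sketch (not in the original post): drop variables whose share of NAs exceeds a cutoff.
# drop_sparse_vars and the 0.5 cutoff are illustrative assumptions, not a recommended rule.
drop_sparse_vars <- function(df, max_na_frac = 0.5) {
  na_frac <- colMeans(is.na(df))              # share of missing values per column
  df[, na_frac <= max_na_frac, drop = FALSE]  # keep only sufficiently complete columns
}
round(colMeans(is.na(Boston)), 3)  # rad and ptratio are each 40/506 (about 8%) missing
names(drop_sparse_vars(Boston))    # so no variable is dropped at this cutoff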

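## 3. Imputation with mean / median / mode (mode for categorical variables)
# Replacing the missing values with the mean / median / mode is a crude way of treating missing values.
# Work on a copy so that Boston keeps its NAs for the prediction and evaluation steps below.
mean_imputed <- Boston
mean_imputed$ptratio[is.na(mean_imputed$ptratio)] <- round(mean(Boston$ptratio, na.rm = TRUE), digits = 1)
mean_imputed$ptratio
# Base R's mode() returns the storage mode ("numeric"), not the most frequent value,
# so take the most frequent value from a frequency table instead.
rad_mode <- as.numeric(names(which.max(table(Boston$rad))))
mean_imputed$rad[is.na(mean_imputed$rad)] <- rad_mode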

## 4. Prediction
## 4.1. kNN Imputation
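# DMwR has been archived on CRAN, so install.packages("DMwR") may fail on recent R versions;
# in that case it has to be installed from the CRAN archive.
install.packages("DMwR")
library(DMwR)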
# perform knn imputation. k=10 (default). In general, an odd k is preferable.
# exclude the response variable medv from the predictors
knnOutput <- knnImputation(Boston[, !names(Boston) %in% "medv"])
knnOutput$rad <- round(knnOutput$rad, digits = 0)
anyNA(knnOutput)
## 4.2 rpart: decision trees ##
install.packages("rpart")
library(rpart)
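# rad is stored as an integer in MASS::Boston, but it is an index with few distinct values,
# so it is treated as categorical here: method is "class"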
class_mod <- rpart(rad ~ . - medv, data=Boston[!is.na(Boston$rad), ],
                   method="class", na.action=na.omit) 
# since ptratio is numeric, method is "anova"
anova_mod <- rpart(ptratio ~ . - medv, data=Boston[!is.na(Boston$ptratio), ],
                   method="anova", na.action=na.omit) 
rad_pred <- predict(class_mod, Boston[is.na(Boston$rad), ])
ptratio_pred <- predict(anova_mod, Boston[is.na(Boston$ptratio), ])
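# Sketch (not in the original post): write the tree predictions back into a copy of the data.
# rpart_imputed is an illustrative name; Boston itself is left untouched for the evaluation below.
rpart_imputed <- Boston
rpart_imputed$ptratio[is.na(Boston$ptratio)] <- ptratio_pred
# rad_pred is a matrix of class probabilities; take the most probable class for each row
rpart_imputed$rad[is.na(Boston$rad)] <- as.numeric(colnames(rad_pred)[apply(rad_pred, 1, which.max)])
anyNA(rpart_imputed)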

## 5. Evaluation (accuracy)
library(DMwR)
## mean imputation
actuals <- original$ptratio[is.na(Boston$ptratio)]
predicteds <- rep(mean(Boston$ptratio, na.rm = TRUE), length(actuals))
regr.eval(actuals, predicteds)
## KNN imputation
actuals <- original$ptratio[is.na(Boston$ptratio)]
predicteds <- knnOutput[is.na(Boston$ptratio), "ptratio"]
regr.eval(actuals, predicteds)
# The mean absolute percentage error (mape) has improved by ~ 39% compared to the imputation by mean.
#  ((mean imputation's mape) - (knn's mape)) / (mean imputation's mape)
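# Sketch (not in the original post): compute the relative improvement described above.
# rel_improvement is an illustrative helper; at this point actuals/predicteds hold the kNN results.
rel_improvement <- function(baseline, new) (baseline - new) / baseline
mape_mean <- regr.eval(actuals, rep(mean(Boston$ptratio, na.rm = TRUE), length(actuals)))["mape"]
mape_knn  <- regr.eval(actuals, predicteds)["mape"]
rel_improvement(mape_mean, mape_knn)  # fraction by which kNN lowers the mape relative to mean imputation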
## rpart
# ptratio
actuals <- original$ptratio[is.na(Boston$ptratio)]
predicteds <- ptratio_pred
regr.eval(actuals, predicteds)
# The mean absolute percentage error (mape) has improved by another ~ 30% compared
# to knnImputation      ## ((knn's mape) - (rpart's mape)) / (rpart's mape)
# rad
actuals <- original$rad[is.na(Boston$rad)]
predicteds <- as.numeric(colnames(rad_pred)[apply(rad_pred, 1, which.max)])
mean(actuals != predicteds)  # compute misclass error.
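# Sketch (not in the original post): cross-tabulate actual vs. predicted rad
# to see which index values the tree confuses; off-diagonal cells are the misclassifications.
rad_confusion <- table(actual = actuals, predicted = predicteds)
rad_confusion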
