Machine Learning project submission by MK

Goal

The purpose of this project is to prepare a machine learning algorithm that will be able to predict what kind of exercise is done based on data gathered from accelerometers on the belt, forearm, arm, and dumbell. The data comes from the Human Activity Recognition study: Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. "Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements". More information along with the source can be found here: http://groupware.les.inf.puc-rio.br/har

Model prepared

-Model was calculated in R using mainly Caret package. -The data was divided into 60% for the training partition, and 40% for the testing one The following steps were taken:

The data was filtered to leave only variables describing accelerators (except the outcome variable).
The summarized data was analyzed.

>summary(training)

M<-abs(cor(training_sd[,-17]))
diag(M)<-0
which(M>0.8,arr.ind=TRUE)

Three things were noted:

For variables ‘var_total_accel_belt’, ‘var_accel_arm’, ‘var_accel_dumbbell’, ‘var_accel_forearm’ very high number of values are missing.
For variables "accel_belt_x","accel_belt_z", "accel_arm_x", "accel_arm_y","accel_arm_z", "accel_dumbbell_x","accel_dumbbell_z", "accel_forearm_x", "accel_forearm_z" standard deviation relatively higher than mean.
Variables "total_accel_belt", "accel_belt_y" and "accel_belt_z" are correlated with each other. Based on the above:
1. Variables with high number of NA were excluded from data set. This way the bias was reduced.
2. All the variables in data set were standarized. This way variance of the model was reduced.
3. Two principal components were calculated based on three correlated variables, and replaced them in the data set. This would simplify calculations and improve accuracy.
4. Several models relatively easy to compute were tried - i.e. linear discriminant analysis, CART and bagged CART. Bagged CART turned out to have the best accuracy while cross validated on the testing set - it was 0.9003.

> modFit_treebag
Bagged CART 

11776 samples
   15 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ... 
Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD   
  0.8762355  0.8433116  0.005550699  0.007022629

The out-of-sample error was measured by percentage of predictions were incorrect on the testing sample. This value was ~10%.

> print(sum(pred_treebag!=testPC$classe)/length(testPC$classe))
[1] 0.09966862