This document demonstrates some basic uses of recipes. First, some definitions are required:
- variables are the original (raw) data columns in a data frame or tibble. For example, in a traditional formula Y ~ A + B + A:B, the variables are A, B, and Y.
- roles define how variables will be used in the model. Examples are: predictor (independent variables), response, and case weight. This is meant to be open-ended and extensible.
- terms are columns in a design matrix such as A, B, and A:B. These can be other derived entities that are grouped, such as a set of principal components or a set of columns that define a basis function for a variable. These are synonymous with features in machine learning. Variables that have predictor roles would automatically be main effect terms.
The cell segmentation data will be used. It has 58 predictor columns, a factor variable Class (the outcome), and two extra labelling columns. Each of the predictors has a suffix for the optical channel ("Ch1"-"Ch4"). We will first separate the data into a training and a test set, then remove the two labelling columns:
library(recipes)
library(caret)
data(segmentationData)
seg_train <- segmentationData %>% 
  filter(Case == "Train") %>% 
  select(-Case, -Cell)
seg_test  <- segmentationData %>% 
  filter(Case == "Test")  %>% 
  select(-Case, -Cell)
The idea is that the preprocessing operations will all be created using the training set and then these steps will be applied to both the training and test set.
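The same idea can be sketched in base R before any recipe is involved. This is only an illustration: the built-in mtcars data, the three columns chosen, and the 20/12 row split are stand-ins that do not come from this document.

```r
# Stand-in data (assumption): mtcars ships with R; the split is arbitrary.
train <- mtcars[1:20, c("disp", "hp", "wt")]
test  <- mtcars[21:32, c("disp", "hp", "wt")]

# Estimate the statistics from the training set only
train_means <- colMeans(train)
train_sds   <- apply(train, 2, sd)

# Apply the same training-set estimates to both data sets
train_std <- scale(train, center = train_means, scale = train_sds)
test_std  <- scale(test,  center = train_means, scale = train_sds)

# Training columns are centered at 0 by construction;
# test columns generally are not, because their statistics were never used.
colMeans(train_std)
colMeans(test_std)
```

A recipe packages this estimate-then-apply pattern so that the statistics are stored once and reused for any new data set.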
For a first recipe, let’s plan on centering and scaling the predictors. First, we will create a recipe from the original data and then specify the processing steps.
Recipes can be created manually by sequentially adding roles to variables in a data set.
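As a rough sketch of the manual route, the example below assigns roles when the recipe is created, using the vars and roles arguments of recipe() instead of adding roles one at a time. The built-in iris data stands in for the segmentation data, and treating Species as the outcome is an assumption made purely for illustration.

```r
library(recipes)

# Stand-in data (assumption): iris, with Species treated as the outcome
col_names <- names(iris)
col_roles <- ifelse(col_names == "Species", "outcome", "predictor")

# Declare every column and its role without using a formula
manual_rec <- recipe(iris, vars = col_names, roles = col_roles)

# summary() on a recipe lists each variable together with its role
summary(manual_rec)
```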
If the analysis only requires outcomes and predictors, the easiest way to create the initial recipe is to use the standard formula method:
rec_obj <- recipe(Class ~ ., data = seg_train)
rec_obj
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         58
The data contained in the data argument need not be the training set; this data is only used to catalog the names of the variables and their types (e.g. numeric, etc.).
(Note that the formula method here is used to declare the variables and their roles and nothing else. If the formula contains inline functions (e.g. log), recipe will signal an error. These types of operations can be added later as steps.)
From here, preprocessing steps can be added sequentially in one of two ways:
rec_obj <- step_name(rec_obj, arguments)    ## or
rec_obj <- rec_obj %>% step_name(arguments)
step_center and the other functions will always return updated recipes.
One other important facet of the code is the method for specifying which variables should be used in different steps. The manual page ?selections has more details but dplyr-like selector functions can be used:
- basic variable names (e.g. x1, x2),
- dplyr functions for selecting variables: contains, ends_with, everything, matches, num_range, and starts_with,
- functions that subset on the role of the variables that have been specified so far: all_outcomes, all_predictors, has_role, or
- similar functions for the type of data: all_nominal, all_numeric, and has_type.
Note that the functions listed above are the only ones that can be used to select variables inside the steps. Also, minus signs can be used to deselect variables.
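A small sketch of how these selectors combine, assuming the built-in iris data as a stand-in for the segmentation data. The second argument of bake is passed by position here because its name may differ between package versions (this document uses newdata).

```r
library(recipes)

sel_rec <- recipe(Species ~ ., data = iris) %>%
  # role-based selector, with one variable deselected by a minus sign
  step_center(all_predictors(), -Sepal.Length) %>%
  # dplyr-style helper: only columns whose names start with "Petal"
  step_scale(starts_with("Petal"))

sel_trained <- prep(sel_rec, training = iris)
sel_baked   <- bake(sel_trained, iris)

# Sepal.Length keeps its original mean because it was deselected;
# Sepal.Width was centered, so its mean is (numerically) zero.
colMeans(sel_baked[, c("Sepal.Length", "Sepal.Width")])
```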
For our data, we can add the two operations for all of the predictors:
standardized <- rec_obj %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) 
standardized
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         58
#> 
#> Steps:
#> 
#> Centering for all_predictors()
#> Scaling for all_predictors()
It is important to realize that the specific variables have not been declared yet (in this example). In some preprocessing steps, variables will be added or removed from the current list of possible variables.
If these are the only preprocessing steps for the predictors, we can now estimate the means and standard deviations from the training set. The prep function is used with a recipe and a data set:
trained_rec <- prep(standardized, training = seg_train)
#> step 1 center training 
#> step 2 scale training
Now that the statistics have been estimated, the preprocessing can be applied to the training and test set:
train_data <- bake(trained_rec, newdata = seg_train)
test_data  <- bake(trained_rec, newdata = seg_test)
bake returns a tibble:
class(test_data)
#> [1] "tbl_df"     "tbl"        "data.frame"
test_data
#> # A tibble: 1,010 x 58
#>    AngleCh1 AreaCh1 AvgIntenCh1 AvgIntenCh2 AvgIntenCh3 AvgIntenCh4
#>       <dbl>   <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
#>  1   1.0656  -0.647      -0.684      -1.177      -0.926     -0.9238
#>  2  -1.8040  -0.185      -0.632      -0.479      -0.809     -0.6666
#>  3  -1.0300  -0.707       1.207       3.035       0.348      1.3864
#>  4   1.6935  -0.684       0.806       2.664       0.296      0.8934
#>  5   1.8129  -0.342      -0.668      -1.172      -0.843     -0.9282
#>  6  -1.4759   0.784      -0.682      -0.628      -0.881     -0.5939
#>  7   1.2702   0.272      -0.672      -0.625      -0.809     -0.5156
#>  8  -1.5837   0.457       0.283       1.320      -0.613     -0.0891
#>  9  -0.7957  -0.412      -0.669      -1.168      -0.845     -0.9258
#> 10   0.0363  -0.638      -0.535       0.182      -0.555     -0.0253
#> # ... with 1,000 more rows, and 52 more variables:
#> #   ConvexHullAreaRatioCh1 <dbl>, ConvexHullPerimRatioCh1 <dbl>,
#> #   DiffIntenDensityCh1 <dbl>, DiffIntenDensityCh3 <dbl>,
#> #   DiffIntenDensityCh4 <dbl>, EntropyIntenCh1 <dbl>,
#> #   EntropyIntenCh3 <dbl>, EntropyIntenCh4 <dbl>, EqCircDiamCh1 <dbl>,
#> #   EqEllipseLWRCh1 <dbl>, EqEllipseOblateVolCh1 <dbl>,
#> #   EqEllipseProlateVolCh1 <dbl>, EqSphereAreaCh1 <dbl>,
#> #   EqSphereVolCh1 <dbl>, FiberAlign2Ch3 <dbl>, FiberAlign2Ch4 <dbl>,
#> #   FiberLengthCh1 <dbl>, FiberWidthCh1 <dbl>, IntenCoocASMCh3 <dbl>,
#> #   IntenCoocASMCh4 <dbl>, IntenCoocContrastCh3 <dbl>,
#> #   IntenCoocContrastCh4 <dbl>, IntenCoocEntropyCh3 <dbl>,
#> #   IntenCoocEntropyCh4 <dbl>, IntenCoocMaxCh3 <dbl>,
#> #   IntenCoocMaxCh4 <dbl>, KurtIntenCh1 <dbl>, KurtIntenCh3 <dbl>,
#> #   KurtIntenCh4 <dbl>, LengthCh1 <dbl>, NeighborAvgDistCh1 <dbl>,
#> #   NeighborMinDistCh1 <dbl>, NeighborVarDistCh1 <dbl>, PerimCh1 <dbl>,
#> #   ShapeBFRCh1 <dbl>, ShapeLWRCh1 <dbl>, ShapeP2ACh1 <dbl>,
#> #   SkewIntenCh1 <dbl>, SkewIntenCh3 <dbl>, SkewIntenCh4 <dbl>,
#> #   SpotFiberCountCh3 <dbl>, SpotFiberCountCh4 <dbl>, TotalIntenCh1 <dbl>,
#> #   TotalIntenCh2 <dbl>, TotalIntenCh3 <dbl>, TotalIntenCh4 <dbl>,
#> #   VarIntenCh1 <dbl>, VarIntenCh3 <dbl>, VarIntenCh4 <dbl>,
#> #   WidthCh1 <dbl>, XCentroid <dbl>, YCentroid <dbl>
After exploring the data, more preprocessing might be required. Steps can be added to the trained recipe. Suppose that we need to create PCA components but only from the predictors from channel 1 and any predictors that are areas:
trained_rec <- trained_rec %>%
  step_pca(ends_with("Ch1"), contains("area"), num = 5)
trained_rec
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         58
#> 
#> Training data contained 1009 data points and no missing data.
#> 
#> Steps:
#> 
#> Centering for AngleCh1, AreaCh1, ... [trained]
#> Scaling for AngleCh1, AreaCh1, ... [trained]
#> PCA extraction with ends_with("Ch1"), contains("area")
Note that only the last step remains to be estimated; the first two were previously trained and that work is not duplicated. We can add the PCA estimates using prep again:
trained_rec <- prep(trained_rec, training = seg_train)
#> step 1 center [pre-trained]
#> step 2 scale [pre-trained]
#> step 3 pca training
bake can be reapplied to get the principal components in addition to the other variables:
test_data  <- bake(trained_rec, newdata = seg_test)
names(test_data)
#>  [1] "AvgIntenCh2"          "AvgIntenCh3"          "AvgIntenCh4"         
#>  [4] "DiffIntenDensityCh3"  "DiffIntenDensityCh4"  "EntropyIntenCh3"     
#>  [7] "EntropyIntenCh4"      "FiberAlign2Ch3"       "FiberAlign2Ch4"      
#> [10] "IntenCoocASMCh3"      "IntenCoocASMCh4"      "IntenCoocContrastCh3"
#> [13] "IntenCoocContrastCh4" "IntenCoocEntropyCh3"  "IntenCoocEntropyCh4" 
#> [16] "IntenCoocMaxCh3"      "IntenCoocMaxCh4"      "KurtIntenCh3"        
#> [19] "KurtIntenCh4"         "SkewIntenCh3"         "SkewIntenCh4"        
#> [22] "SpotFiberCountCh3"    "SpotFiberCountCh4"    "TotalIntenCh2"       
#> [25] "TotalIntenCh3"        "TotalIntenCh4"        "VarIntenCh3"         
#> [28] "VarIntenCh4"          "XCentroid"            "YCentroid"           
#> [31] "PC1"                  "PC2"                  "PC3"                 
#> [34] "PC4"                  "PC5"
Note that the PCA components have replaced the original variables that were from channel 1 or measured an area aspect of the cells.
There are a number of different steps included in the package:
steps <- apropos("^step_")
steps[!grepl("new$", steps)]
#>  [1] "step_BoxCox"       "step_YeoJohnson"   "step_bagimpute"   
#>  [4] "step_bin2factor"   "step_center"       "step_classdist"   
#>  [7] "step_corr"         "step_date"         "step_depth"       
#> [10] "step_discretize"   "step_dummy"        "step_holiday"     
#> [13] "step_hyperbolic"   "step_ica"          "step_interact"    
#> [16] "step_intercept"    "step_invlogit"     "step_isomap"      
#> [19] "step_knnimpute"    "step_kpca"         "step_lincomb"     
#> [22] "step_log"          "step_logit"        "step_meanimpute"  
#> [25] "step_modeimpute"   "step_ns"           "step_nzv"         
#> [28] "step_ordinalscore" "step_other"        "step_pca"         
#> [31] "step_percentile"   "step_poly"         "step_range"       
#> [34] "step_ratio"        "step_regex"        "step_rm"          
#> [37] "step_scale"        "step_shuffle"      "step_spatialsign" 
#> [40] "step_sqrt"         "step_window"