Learning Note: Regression with Keras on the Same Dataset

Updated: Sep 5, 2021

In the last note I promised to try Keras on the same dataset and reach a lower RMSLE. Let's do it!

I directly reused the data that was processed and feature-engineered last time, so the results can be compared with LGBM.

First.

I need to split the training set and make a validation set.

x0 <- xtrain[visit_date <= '2017-01-09' & visit_date > '2016-04-01']
x1 <- xtrain[visit_date <= '2017-03-09' & visit_date > '2017-01-09']
x2 <- xtrain[visit_date > '2017-03-09']
y0 <- log1p(x0$visitors)
y1 <- log1p(x1$visitors)
y2 <- log1p(x2$visitors)

# 0-train 1-validation 2-test

train_data <- x0
train_labels <- y0
test_data <- x1
test_labels <- y1
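
Because the labels are log1p-transformed, an RMSE computed on them is the same number as the RMSLE on the raw visitor counts. Here is a minimal sketch of that equivalence in base R; the helper names rmsle and rmse and the toy counts are my own assumptions for illustration, not part of the original pipeline.

```r
# RMSLE between raw (untransformed) actual and predicted counts.
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# RMSE between values that are already on the log1p scale.
rmse <- function(actual_log, predicted_log) {
  sqrt(mean((predicted_log - actual_log)^2))
}

# Hypothetical visitor counts, for illustration only.
visitors_true <- c(10, 25, 3, 40)
visitors_pred <- c(12, 20, 5, 38)

# Both routes give the same score.
score_raw <- rmsle(visitors_true, visitors_pred)
score_log <- rmse(log1p(visitors_true), log1p(visitors_pred))
print(all.equal(score_raw, score_log))
```

So a model trained with MSE loss on log1p(visitors) is directly optimizing the squared RMSLE.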

Normalization.

It’s recommended to normalize features that use different scales and ranges. Although the model might converge without feature normalization, it makes training more difficult, and it makes the resulting model more dependent on the choice of units used in the input.
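
To make the idea concrete, here is a rough sketch of standard (z-score) scaling in base R. It only illustrates the general technique, not the internal code of tfdatasets; the column names and values are hypothetical. The key point is that the mean and standard deviation are learned from the training rows only and then reused on the validation rows.

```r
# Hypothetical toy features on very different scales.
train_x <- data.frame(temperature  = c(10, 20, 30, 40),
                      reservations = c(100, 300, 500, 700))
valid_x <- data.frame(temperature  = c(25, 35),
                      reservations = c(400, 600))

# Learn the statistics from the training rows only.
mu    <- sapply(train_x, mean)
sigma <- sapply(train_x, sd)

# Apply (x - mean) / sd column by column with the training statistics.
standardize <- function(df, mu, sigma) {
  as.data.frame(scale(df, center = mu, scale = sigma))
}

train_z <- standardize(train_x, mu, sigma)
valid_z <- standardize(valid_x, mu, sigma)

print(round(colMeans(train_z), 6))  # means of the scaled training columns
print(sapply(train_z, sd))          # standard deviations of the scaled training columns
```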

I am going to use the feature_spec interface implemented in the tfdatasets package for normalization. The feature_columns interface allows for other common pre-processing operations on tabular data.

library(keras)
library(tfdatasets)  # all_numeric() and scaler_standard() come from here

spec <- tfdatasets::feature_spec(train_df, label ~ . ) %>% 
  tfdatasets::step_numeric_column(all_numeric(), normalizer_fn = scaler_standard()) %>% 
  fit()

The spec created with tfdatasets can be used together with layer_dense_features to perform pre-processing directly in the TensorFlow graph.

Modeling

Deep learning models tend to be good at fitting to the training data, but the real challenge is generalization, not fitting.

Unfortunately, there is no magical formula to determine the right size or architecture of your model (in terms of the number of layers, or the right size for each layer). You will have to experiment using a series of different architectures.

To find an appropriate model size, it’s best to start with relatively few layers and parameters, then begin increasing the size of the layers or adding new layers until you see diminishing returns on the validation loss.
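
One way to get a feel for "model size" is to count trainable parameters. The sketch below does that for a stack of dense layers in base R. The helper count_dense_params and the input width of 20 are assumptions for illustration (the real number of features depends on the engineered dataset); the layer sizes 16 and 512 are the ones used in this note.

```r
# Trainable parameters in a stack of dense layers:
# each layer has (inputs * units) weights plus (units) biases.
count_dense_params <- function(n_inputs, units) {
  sizes <- c(n_inputs, units)          # width entering each layer, then the last output
  sum(head(sizes, -1) * units + units)
}

n_features <- 20  # hypothetical input width, not taken from the dataset

small <- count_dense_params(n_features, c(16, 16, 1))
large <- count_dense_params(n_features, c(512, 512, 1))
print(c(small = small, large = large))
```

The 512-unit network has hundreds of times more parameters than the 16-unit one, which is why it can memorize the training data so much more easily.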

I’ll create a simple model using only dense layers, then a smaller version, and compare them.

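# Reference model. num_words (the input width) is not defined in this note.
# The sigmoid output and binary cross-entropy loss suit binary classification;
# the regression models below use a linear output and MSE instead.
baseline_model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = num_words) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

baseline_model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = list("accuracy")
)

summary(baseline_model)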

Let’s create a model with fewer hidden units to compare against the baseline model that we just created. Next, let’s add to this benchmark a network that has much more capacity, far more than the problem would warrant:

layer <- layer_dense_features(
  feature_columns = dense_features(spec), 
  dtype = tf$float32
)
layer(train_df)
input <- layer_input_from_dataset(train_df %>% select(-label))
output <- input %>% 
  layer_dense_features(dense_features(spec)) %>% 
  layer_dense(units = 512, activation = "relu") %>%
  layer_dense(units = 512, activation = "relu") %>%
  layer_dense(units = 1) 
model <- keras_model(input, output)
summary(model)
model %>% 
  compile(
    loss = "mse",
    optimizer = optimizer_rmsprop(),
    metrics = list('mean_squared_error')
  )
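# print_dot_callback is not defined elsewhere in this note; a minimal
# definition (an assumption) that prints one dot per finished epoch:
print_dot_callback <- callback_lambda(
  on_epoch_end = function(epoch, logs) {
    if (epoch %% 80 == 0) cat("\n")
    cat(".")
  }
)
history <- model %>% fit(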
  x = train_df %>% select(-label),
  y = train_df$label,
  epochs = 100,
  validation_split = 0.2,
  verbose = 2,
  callbacks = list(print_dot_callback)
)

Oh my goodness, what's going on? Why does the validation error rise first and then fall?

Add Weight Regularization & Callback

A “simple model” is a model where the distribution of parameter values has less entropy, or a model with fewer parameters altogether. Thus, a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to only take on small values, which makes the distribution of weight values more “regular”. This is called “weight regularization”, and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:

L1 regularization, where the cost added is proportional to the absolute value of the weight coefficients (i.e. to what is called the “L1 norm” of the weights).

L2 regularization, where the cost added is proportional to the square of the value of the weight coefficients (i.e. to the squared “L2 norm” of the weights). L2 regularization is also called weight decay in the context of neural networks. Don’t let the different name confuse you: for plain gradient descent, weight decay is mathematically equivalent to L2 regularization.
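
The two penalties can be sketched in a few lines of base R. The weight vectors and the data loss of 0.35 below are hypothetical; the factor 0.001 matches the value passed to regularizer_l2 later in this note. This is a toy calculation to show how the cost grows with weight size, not a piece of the training code.

```r
# Penalty terms added to the loss for a vector of weights w.
l1_penalty <- function(w, l = 0.001) l * sum(abs(w))  # scaled sum of absolute values
l2_penalty <- function(w, l = 0.001) l * sum(w^2)     # scaled sum of squares

w_small <- c(0.1, -0.2, 0.05)  # hypothetical small weights
w_large <- c(3.0, -4.0, 2.5)   # hypothetical large weights

data_loss <- 0.35              # hypothetical MSE on the training batch

# The regularized loss is the data loss plus the penalty.
loss_small <- data_loss + l2_penalty(w_small)
loss_large <- data_loss + l2_penalty(w_large)

print(c(l1_small = l1_penalty(w_small), l1_large = l1_penalty(w_large)))
print(c(loss_small = loss_small, loss_large = loss_large))
```

With the same data loss, the network with large weights ends up with the higher total loss, so the optimizer is pushed toward smaller weights.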

In Keras, weight regularization is added by passing weight regularizer instances to layers. Let’s add L2 weight regularization to the baseline model now.

This graph shows little improvement in the model after about 100 epochs. Let’s update the fit method to automatically stop training when the validation score doesn’t improve. We’ll use a callback that tests a training condition at every epoch. If a set number of epochs elapses without showing improvement, it automatically stops the training.
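
The stopping rule itself is simple enough to sketch in base R. The function early_stop_epoch and the loss values are made up for illustration; this mimics the patience idea only and is not how Keras implements its callback internally.

```r
# Return the epoch at which training would stop, given the validation
# loss after each epoch and a patience value.
early_stop_epoch <- function(val_loss, patience) {
  best <- Inf
  wait <- 0
  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best) {
      best <- val_loss[epoch]  # new best: reset the counter
      wait <- 0
    } else {
      wait <- wait + 1         # no improvement this epoch
      if (wait >= patience) return(epoch)
    }
  }
  length(val_loss)             # rule never triggered: all epochs ran
}

# Hypothetical validation losses for nine epochs.
val_loss <- c(0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.48, 0.49, 0.50)
print(early_stop_epoch(val_loss, patience = 3))
print(early_stop_epoch(val_loss, patience = 2))
```

A small patience stops at the first short plateau, while a larger one survives it and catches the later improvement, which is the reason for using a fairly generous value.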

So, set L2 regularization.

output <- input %>% 
  layer_dense_features(dense_features(spec)) %>% 
  layer_dense(units = 512, activation = "relu",kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dense(units = 512, activation = "relu",kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dense(units = 1) 

model <- keras_model(input, output)
model %>% 
  compile(
    loss = "mse",
    optimizer = optimizer_rmsprop(),
    metrics = list('mean_squared_error')
  )
# patience is the number of epochs without improvement to wait before stopping.
early_stop <- callback_early_stopping(monitor = "val_loss", patience = 20)

history <- model %>% fit(
  x = train_df %>% select(-label),
  y = train_df$label,
  epochs = 200,
  validation_split = 0.2,
  verbose = 2,
  callbacks = list(early_stop))

Now it looks better: RMSLE dropped from 0.5863 to 0.399353.

Advantages

Keras vs. LGBM!

Faster training speed and higher efficiency: LightGBM wins! It uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up the training procedure. Keras may need several attempts to get close to the optimal parameters.

Lower memory usage: LGBM wins! It replaces continuous values with discrete bins, which results in lower memory usage.

Better accuracy: On the data set of this experiment, Keras wins!

Compatibility with large datasets: Keras is capable of performing equally well on large datasets, with a significant reduction in training time compared to LGBM.

Both support parallel learning.
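
The histogram idea mentioned in the comparison above can be sketched in base R. This is a simplified illustration with equal-width bins and made-up values; LightGBM's actual binning strategy is more involved, so treat the bin count and the breaks as assumptions.

```r
# Hypothetical continuous feature values.
x <- c(0, 1.7, 2.2, 4.9, 5.3, 7.8, 9.4, 10)

# Split the observed range into a fixed number of equal-width bins.
n_bins <- 4
breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# Replace each value by the integer index of its bin (1..n_bins).
bin_id <- findInterval(x, breaks, rightmost.closed = TRUE)

print(bin_id)
print(table(bin_id))  # the histogram: how many values fall in each bin
```

A tree learner then only has to consider the few bin boundaries as split candidates instead of every distinct value, and small integers take less memory than doubles.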