library(tidyverse)
library(magrittr)

Introduction to Keras

  • There are quite a few deep learning frameworks around, from the older Caffe and Theano to Google’s TensorFlow and the newer PyTorch (which is increasingly trending in research).
  • However, during the rest of this course, 95% of our deep learning exercises will be done using Keras.
  • Keras is a deep-learning framework that provides a convenient way to define and train almost any kind of deep-learning model. Keras was initially developed for researchers, with the aim of enabling fast experimentation.

It has the following advantages:

  • User-friendly API which makes it easy to quickly prototype deep learning models.
  • Built-in support for convolutional networks (for computer vision), recurrent networks (for sequence processing), and any combination of both.
  • Supports arbitrary network architectures: multi-input or multi-output models, layer sharing, model sharing, etc. It is therefore appropriate for building essentially any deep learning model, from a memory network to a neural Turing machine.
  • Is capable of running on top of multiple back-ends including TensorFlow, CNTK, or Theano.
  • Allows the same code to run on CPU or on GPU, and has strong multi-GPU and distributed training support (Google Cloud, Spark, HDF5…)
  • Can easily be integrated in AI products (Apple CoreML, TensorFlow Android runtime, R or Python webapp backend such as a Shiny or Flask app)

It is widely adopted in academia and industry (Google, Netflix, Uber, CERN, Yelp, Square, etc.), and is also a popular framework on Kaggle, the machine-learning competition website, where almost every recent deep-learning competition has been won using Keras models. While Google’s TensorFlow is even more popular, keep in mind that Keras can use TensorFlow (and other popular DL frameworks) as a backend, and allows for less cumbersome and more high-level code.

  • So, after all, Keras represents a wonderful high-level starting point: fast and easy to implement, and in most cases flexible enough to do whatever you feel like.

Sidenote: The weird name (Keras) means horn in Greek, and is a reference to ancient Greek literature. E.g., in the Odyssey, supernatural dream spirits are divided between those who deceive men with false visions (arriving to Earth through a gate of ivory) and those who announce a future that will come to pass (arriving through a gate of horn). So, enough history lessons, let’s run our first deep learning model!

# Load our main tool
library(keras)

Our first deep learning model

Introduction

  • Well, it’s about time to get serious. We will dive straight in and use a simple deep learning model on the classic MNIST dataset.
  • This is the original data used by Yann LeCun and his team to fit an ANN that identifies handwritten digits for the US postal service.
  • It consists of a large number of samples of handwritten digits together with their correct labels. The handwritten digits here conveniently come as 28x28 greyscale matrices, making them a good starter to warm up. Let’s do that.

Load our data and get ready

# Load our data
mnist <- dataset_mnist()
mnist %>%
  glimpse()
List of 2
 $ train:List of 2
  ..$ x: int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ y: int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
 $ test :List of 2
  ..$ x: int [1:10000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ y: int [1:10000(1d)] 7 2 1 0 4 1 4 9 5 9 ...
# Separate into train and test
train_images <- mnist$train$x
train_labels <- mnist$train$y
test_images <- mnist$test$x
test_labels <- mnist$test$y
  • Let’s take a look at the structure.
glimpse(train_images)
 int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
glimpse(train_labels)
 int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
digit <- train_images[5,,]
digit[,8:20] # I crop it a bit, otherwise the columns don't fit on one page
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
 [1,]    0    0    0    0    0    0    0    0    0     0     0     0     0
 [2,]    0    0    0    0    0    0    0    0    0     0     0     0     0
 [3,]    0    0    0    0    0    0    0    0    0     0     0     0     0
 [4,]    0    0    0    0    0    0    0    0    0     0     0     0     0
 [5,]    0    0    0    0    0    0    0    0    0     0     0     0     0
 [6,]    0    0    0    0    0    0    0    0    0     0     0     0     0
 [7,]    0    0    0    0    0    0    0    0    0     0     0     0     0
 [8,]    0    0    0    0    0   55  148  210  253   253   113    87   148
 [9,]    0    0    0    0   87  232  252  253  189   210   252   252   253
[10,]    0    0    4   57  242  252  190   65    5    12   182   252   253
[11,]    0    0   96  252  252  183   14    0    0    92   252   252   225
[12,]    0  132  253  252  146   14    0    0    0   215   252   252    79
[13,]  126  253  247  176    9    0    0    8   78   245   253   129     0
[14,]  232  252  176    0    0    0   36  201  252   252   169    11     0
[15,]  252  252   30   22  119  197  241  253  252   251    77     0     0
[16,]  231  252  253  252  252  252  226  227  252   231     0     0     0
[17,]   55  235  253  217  138   42   24  192  252   143     0     0     0
[18,]    0    0    0    0    0    0   62  255  253   109     0     0     0
[19,]    0    0    0    0    0    0   71  253  252    21     0     0     0
[20,]    0    0    0    0    0    0    0  253  252    21     0     0     0
[21,]    0    0    0    0    0    0   71  253  252    21     0     0     0
[22,]    0    0    0    0    0    0  106  253  252    21     0     0     0
[23,]    0    0    0    0    0    0   45  255  253    21     0     0     0
[24,]    0    0    0    0    0    0    0  218  252    56     0     0     0
[25,]    0    0    0    0    0    0    0   96  252   189    42     0     0
[26,]    0    0    0    0    0    0    0   14  184   252   170    11     0
[27,]    0    0    0    0    0    0    0    0   14   147   252    42     0
[28,]    0    0    0    0    0    0    0    0    0     0     0     0     0

To make it more tangible, let’s plot one:

digit %>% as.raster(max = 255) %>% plot()

rm(digit)
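As a quick sanity check, the plotted digit should match its label. From the glimpse above, the fifth entry of train_labels is a 9:

train_labels[5]
[1] 9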

Define the Keras model

The workflow will be as follows:

  1. First, we’ll feed the neural network the training data, train_images and train_labels.
  2. The network will then learn to associate images and labels.
  3. Finally, we’ll ask the network to produce predictions for test_images, and we’ll verify whether these predictions match the labels from test_labels.

Let’s build the network - again, remember that you aren’t expected to understand everything about this example yet.

Building a model in Keras that can be fitted on your data involves two steps:

  1. Defining the network’s architecture in terms of layers and their shape.
  2. Compiling the model, thereby defining the loss function, evaluation metric, and optimizer.
network <- keras_model_sequential() %>% 
  layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%
  layer_dense(units = 10, activation = "softmax")

Notice that the layer stacking in R is done via the well-known %>%; in Python it is done with the . operator. That’s about the main difference between the two implementations.

  • The core building block of neural networks is the layer, a data-processing module that you can think of as a filter for data. Some data goes in, and it comes out in a more useful form.

  • Specifically, layers extract representations out of the data fed into them - hopefully, representations that are more meaningful for the problem at hand.

  • Most of deep learning consists of chaining together simple layers that will implement a form of progressive data distillation.

  • Here, our network consists of a sequence of two layers, which are densely connected (layer_dense) neural layers.

  • The second (and last) layer is a 10-way softmax layer, which means it will return an array of 10 probability scores (summing to 1).

  • Each score will be the probability that the current digit image belongs to one of our 10 digit classes. So, we defined a network with 1,306 cells overall, consisting of:

    1. Input layer: 28x28 = 784 cells
    2. Intermediate layer: 512 cells
    3. Output layer: 10 cells
  • To make the network ready for training, we need to pick three more things, as part of the compilation step:

    1. Loss function: How the network will be able to measure its performance on the training data, and thus how it will be able to steer itself in the right direction.
    2. Optimizer: The mechanism through which the network will update itself based on the data it sees and its loss function.
    3. Metrics: Here, we’ll only care about accuracy (the fraction of the images that were correctly classified).
  • While we are already familiar with defining metrics to optimize, defining an optimizer and loss function is new. We will dig into that later.

  • Notice that the compile() function modifies the network in place. We will talk about all of these components later in a bit more detail.

network %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)

Let’s inspect our final setup:

summary(network)
Model: "sequential_1"
___________________________________________________________________________________________________
Layer (type)                                Output Shape                            Param #        
===================================================================================================
dense_2 (Dense)                             (None, 512)                             401920         
___________________________________________________________________________________________________
dense_3 (Dense)                             (None, 10)                              5130           
===================================================================================================
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
___________________________________________________________________________________________________

Well, we see that a network of this size has quite a large number of trainable parameters (all edge weights plus biases: 784x512 + 512 = 401,920 for the hidden layer and 512x10 + 10 = 5,130 for the output layer).

Preprocess the data

  • Before training the model, preprocess the data by reshaping it into the shape the network expects and scaling it so that all values are in the [0, 1] interval.
  • Previously, our training images were stored in an 3d array of shape (60000, 28, 28) of type integer with values in the [0, 255] interval.
  • We transform it into a double array of shape (60000, 28 * 28) with values between 0 and 1.
train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images / 255 # To scale between 0 and 1

test_images <- array_reshape(test_images, c(10000, 28 * 28))
test_images <- test_images / 255 # To scale between 0 and 1
  • Note that we use the array_reshape() function rather than the dim() function to reshape the array. I explain why later, when we talk about tensor reshaping.
  • Lastly, we also need to categorically encode the labels.
train_labels <- to_categorical(train_labels)
test_labels <- to_categorical(test_labels)
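In case you wonder what to_categorical() does: it one-hot encodes a vector of integer class labels into a binary class matrix with one column per class. A tiny sketch:

to_categorical(c(0, 1, 2)) # 3 labels, 3 classes -> one column per class
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1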

Run the network

We’re now ready to train the network via Keras’ fit() function. We save the output in an object we call history.net.

set.seed(1337)
history.net <- network %>% fit(x = train_images, 
                               y = train_labels, 
                               epochs = 10, # How often shall we re-run the model on the whole sample
                               batch_size = 128, # How many observations should be included in every batch
                               validation_split = 0.25 # If we want to do a  cross-validation in the training
                               )
Epoch 1/10
352/352 [==============================] - 3s 8ms/step - loss: 0.2928 - accuracy: 0.9142 - val_loss: 0.1599 - val_accuracy: 0.9533
Epoch 2/10
352/352 [==============================] - 2s 7ms/step - loss: 0.1205 - accuracy: 0.9648 - val_loss: 0.1126 - val_accuracy: 0.9659
Epoch 3/10
352/352 [==============================] - 2s 6ms/step - loss: 0.0775 - accuracy: 0.9772 - val_loss: 0.1008 - val_accuracy: 0.9710
Epoch 4/10
352/352 [==============================] - 2s 6ms/step - loss: 0.0555 - accuracy: 0.9838 - val_loss: 0.0989 - val_accuracy: 0.9717
Epoch 5/10
352/352 [==============================] - 2s 6ms/step - loss: 0.0411 - accuracy: 0.9878 - val_loss: 0.0870 - val_accuracy: 0.9747
Epoch 6/10
352/352 [==============================] - 2s 5ms/step - loss: 0.0303 - accuracy: 0.9910 - val_loss: 0.1008 - val_accuracy: 0.9727
Epoch 7/10
352/352 [==============================] - 2s 5ms/step - loss: 0.0229 - accuracy: 0.9933 - val_loss: 0.0875 - val_accuracy: 0.9768
Epoch 8/10
352/352 [==============================] - 2s 6ms/step - loss: 0.0176 - accuracy: 0.9952 - val_loss: 0.0888 - val_accuracy: 0.9761
Epoch 9/10
352/352 [==============================] - 2s 6ms/step - loss: 0.0128 - accuracy: 0.9967 - val_loss: 0.0889 - val_accuracy: 0.9783
Epoch 10/10
352/352 [==============================] - 2s 6ms/step - loss: 0.0101 - accuracy: 0.9972 - val_loss: 0.0924 - val_accuracy: 0.9765
  • Two quantities are displayed in the log during training: the loss and the accuracy of the network over the training data during the subsequent epochs (new training runs after re-adjusting the weights).
  • Notice that the measures improve in every epoch. We quickly reach an accuracy of about 99.7% on the training data.
  • Notice that fit() adjusts the weights of the network without explicitly assigning it to a new object. history.net therefore only contains the history of the model’s prediction metrics through the different epochs, in case we would like to inspect it. Let’s take a look:
history.net

Final epoch (plot to see history):
        loss: 0.01011
    accuracy: 0.9972
    val_loss: 0.09245
val_accuracy: 0.9765 
history.net %>% glimpse()
List of 2
 $ params :List of 3
  ..$ verbose: int 1
  ..$ epochs : int 10
  ..$ steps  : int 352
 $ metrics:List of 4
  ..$ loss        : num [1:10] 0.2928 0.1205 0.0775 0.0555 0.0411 ...
  ..$ accuracy    : num [1:10] 0.914 0.965 0.977 0.984 0.988 ...
  ..$ val_loss    : num [1:10] 0.1599 0.1126 0.1008 0.0989 0.087 ...
  ..$ val_accuracy: num [1:10] 0.953 0.966 0.971 0.972 0.975 ...
 - attr(*, "class")= chr "keras_training_history"

We can also visualize these metrics through the epochs.

history.net %>% plot(smooth = TRUE)

  • Interestingly, we already see that our model overfits: while accuracy on the training set keeps increasing through the epochs, over time it starts to decrease on the validation set.
  • There are different ways to fight that, such as adding a layer_dropout, or telling the model to stop running further epochs as soon as the validation accuracy drops (see the sketch at the end of this subsection). However, we will for now just move on.
  • For now, let’s check if the model performs well out-of-sample on the test-set:
metrics <- network %>% evaluate(test_images, test_labels)

313/313 [==============================] - 0s 2ms/step - loss: 0.0692 - accuracy: 0.9821
metrics
      loss   accuracy 
0.06924154 0.98210001 
  • Ok, so far so good. I think that’s a decent accuracy for such an ad-hoc model. With a bit of tinkering, we surely could get it to 99%. But that’s a task for another time…
  • Let’s go back to basics and review a bit of what we have done so far.
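As mentioned above, two common remedies against overfitting are dropout and early stopping. Below is a minimal, illustrative sketch of both; the dropout rate, patience, and epoch count are arbitrary assumptions, not tuned values. callback_early_stopping() is the standard Keras callback for halting training once a monitored metric stops improving.

# A variant of our network with a dropout layer (sketch; sizes illustrative)
network_reg <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%
  layer_dropout(rate = 0.3) %>% # randomly zero out 30% of activations during training
  layer_dense(units = 10, activation = "softmax")

network_reg %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)

network_reg %>% fit(
  x = train_images,
  y = train_labels,
  epochs = 50, # an upper bound; early stopping usually ends training sooner
  batch_size = 128,
  validation_split = 0.25,
  # stop once val_accuracy has not improved for 2 consecutive epochs
  callbacks = list(callback_early_stopping(monitor = "val_accuracy", patience = 2))
)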

Data representations for neural networks

  • In the previous example, we started from data stored in multidimensional arrays, also called tensors.
  • In general, most current ML systems use tensors as their basic data structure. Tensors are fundamental to the field, so fundamental that Google’s TensorFlow was named after them. So what’s a tensor?
  • Tensors are a generalization of vectors and matrices to an arbitrary number of dimensions (note that in the context of tensors, a dimension is often called an axis).
  • In R, vectors are used to create and manipulate 1D tensors, and matrices are used for 2D tensors. For higher dimensions, array objects (which support any number of dimensions) are used.
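A minimal sketch of how these base R objects map onto tensor ranks:

x1 <- c(1, 2, 3)                     # 1D tensor (vector), rank 1
x2 <- matrix(1:6, nrow = 2)          # 2D tensor (matrix), rank 2
x3 <- array(1:24, dim = c(2, 3, 4))  # 3D tensor (array), rank 3
length(dim(x3)) # number of axes: 3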

Key tensor-attributes

A tensor is defined by three key attributes:

  1. Number of axes (rank): For instance, a 3D tensor has three axes, and a matrix has two axes.
  2. Shape: This is an integer vector that describes how many dimensions the tensor has along each axis.
  3. Data type: This is the type of the data contained in the tensor; for instance, a tensor’s type could be integer or double. On rare occasions, you may see a character tensor.
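We can read all three attributes off the original (not yet reshaped) MNIST training tensor:

dim(mnist$train$x)         # shape: 60000 28 28
length(dim(mnist$train$x)) # number of axes (rank): 3
typeof(mnist$train$x)      # data type: "integer"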

Tensor reshaping

Remember that earlier we did not use the dim() but the array_reshape() function to manipulate our input tensors.

train_images <- array_reshape(train_images, c(60000, 28 * 28))
str(train_images)
 num [1:60000, 1:784] 0 0 0 0 0 0 0 0 0 0 ...
dim(train_images)
[1] 60000   784
  • This is an R-specific detail: the data is reinterpreted using row-major semantics (as opposed to R’s default column-major semantics), which is in turn compatible with the way the numerical libraries called by Keras (NumPy, TensorFlow, and so on) interpret array dimensions.
  • You should always use the array_reshape() function when reshaping R arrays that will be passed to Keras.
  • Reshaping a tensor means rearranging its rows and columns to match a target shape.
  • Naturally, the reshaped tensor has the same total number of coefficients as the initial tensor. Let’s do a simple example:
x <- matrix(c(0:5),
            nrow = 3, ncol = 2, byrow = TRUE)
x
     [,1] [,2]
[1,]    0    1
[2,]    2    3
[3,]    4    5
x <- array_reshape(x, dim = c(3, 2))
x
     [,1] [,2]
[1,]    0    1
[2,]    2    3
[3,]    4    5
x <- array_reshape(x, dim = c(2, 3))
x
     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    3    4    5
  • A special case of reshaping that’s commonly encountered is transposition.
  • Transposing a matrix means exchanging its rows and its columns, so that x[i,] becomes x[, i]. The t() function can be used to transpose a matrix:
x <- t(x)
x
     [,1] [,2]
[1,]    0    3
[2,]    1    4
[3,]    2    5
rm(x)

Geometric interpretation of tensor operations

layer <- layer_dense(units = 32, input_shape = c(784))
  • We’re creating a layer that will only accept as input 2D tensors where the second dimension (the feature dimension) is 784; the first dimension, the batch dimension, is unspecified, and thus any value would be accepted. This layer will return a tensor where the feature dimension has been transformed to be 32.
  • Thus this layer can only be connected to a downstream layer that expects 32-dimensional vectors as its input. When using Keras, you don’t have to worry about compatibility, because the layers you add to your models are dynamically built to match the shape of the incoming layer.
  • For instance, suppose you write the following:
model <- keras_model_sequential() %>%
  layer_dense(units = 32, input_shape = c(784)) %>%
  layer_dense(units = 32)
model
Model
Model: "sequential_2"
___________________________________________________________________________________________________
Layer (type)                                Output Shape                            Param #        
===================================================================================================
dense_4 (Dense)                             (None, 32)                              25120          
___________________________________________________________________________________________________
dense_5 (Dense)                             (None, 32)                              1056           
===================================================================================================
Total params: 26,176
Trainable params: 26,176
Non-trainable params: 0
___________________________________________________________________________________________________
# devtools::install_github("andrie/deepviz")
library(deepviz)
plot_model(model)
  • The second layer didn’t receive an input shape argument; instead, it automatically inferred its input shape as being the output shape of the layer that came before.

  • Picking the right network architecture is more an art than a science; and although there are some best practices and principles you can rely on, only practice can help you become a proper neural-network architect.

  • Here, we will limit ourselves to a simple feed-forward network, where every layer is only connected to the following one. For now, there are three key architecture decisions to be made about such a stack of dense layers:

    1. How many layers to use?
    2. How many hidden units to choose for each layer?
    3. Which activation function to use?
rm(layer, model)

Activation functions

  • As we already saw, we can define an activation function for every layer. While we seldom switch between different activation functions within the intermediate layers, the one we define for the output layer critically depends on the shape of our desired output data.
  • A brief reminder: Activation functions transform the weighted inputs of a cell into its output. Without them, the dense layer would consist of two linear operations, a dot product and an addition: output = dot(W, input) + b.
  • In order to get access to a much richer hypothesis space that would benefit from deep representations, you need a non-linearity, or activation function.
    • relu is the most popular activation function in deep learning, but there are many other candidates, which all come with similarly strange names: prelu, elu, and so on. A relu (rectified linear unit) is a function meant to zero out negative values and is commonly used for intermediate layers (formerly, almost all layers were modelled with sigmoid, but nowadays it is well established that relu mostly works better for intermediate layers). A one-line sketch follows below.
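Since relu is just max(0, x) applied element-wise, it is easy to sketch in one line of R:

relu <- function(x) pmax(0, x) # zero out negative values, pass positives through
relu(c(-2, -0.5, 0, 1, 3))
[1] 0.0 0.0 0.0 1.0 3.0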

If our output layer had to model a binary choice (yes/no classification), we would commonly use a sigmoid function, which we already know from logistic regression models. It “squashes” arbitrary values into the [0, 1] interval, outputting something that can be interpreted as a probability.

However, since we have a multi-class prediction problem, we choose softmax, which squashes the output of each unit to be between 0 and 1, just like a sigmoid, but also divides the outputs such that they sum to 1. The output is equivalent to a categorical probability distribution: it tells you the probability that each of the classes is true.
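The softmax itself is easy to write down: exponentiate each raw output and normalize by the sum, so all values are positive and sum to 1. A minimal sketch:

softmax <- function(x) exp(x) / sum(exp(x)) # normalize exponentiated scores
round(softmax(c(1, 2, 3)), 3)
[1] 0.090 0.245 0.665
sum(softmax(c(1, 2, 3))) # always 1
[1] 1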

If you are interested in the different types of layers in Keras, check the reference site with all implemented layers. Furthermore, the types of activation functions are discussed HERE.

Loss functions and optimizers: keys to configuring the learning process

Once the network architecture is defined, you still have to choose two more things:

  • Loss function (objective function): The quantity that will be minimized during training. It represents a measure of success for the task at hand.

  • Optimizer: Determines how the network will be updated based on the loss function. It implements a specific variant of stochastic gradient descent (SGD).

  • Choosing the right objective function for the right problem is extremely important: your network will take any shortcut it can to minimize the loss; so if the objective doesn’t fully correlate with success for the task at hand, your network will end up doing things you may not have wanted.

  • HERE you find a brief overview on different loss functions. Fortunately, when it comes to common problems such as classification, regression, and sequence prediction, there are simple guidelines you can follow to choose the correct loss.

  • Take this rule-of-thumb table (last-layer activation and loss function per problem type) as a good starter:

    Problem type                      Last-layer activation   Loss function
    Binary classification             sigmoid                 binary_crossentropy
    Multiclass, single-label          softmax                 categorical_crossentropy
    Multiclass, multilabel            sigmoid                 binary_crossentropy
    Regression to arbitrary values    none                    mse
    Regression to values in [0, 1]    sigmoid                 mse or binary_crossentropy

  • With respect to the optimizer: We will cover that later. There are a bunch of different ones around, most of them variants of Stochastic Gradient Descent (SGD), Batch (vanilla) Gradient Descent, and Mini-Batch Gradient Descent.
  • HERE and HERE you find a nice summary for the interested reader who wants to know more. Currently (that might change soon, since everything in DL moves fast), it is common knowledge that if you have no strong reasons to do otherwise, RMSprop (an unpublished, adaptive learning rate method proposed by Geoff Hinton) with standard learning rates works just fine.
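As a side note, instead of the string shorthand "rmsprop" you can pass an optimizer object to compile(), which lets you set e.g. the learning rate explicitly. A minimal sketch (in older keras versions the argument is called lr instead of learning_rate):

network %>% compile(
  optimizer = optimizer_rmsprop(learning_rate = 0.001), # explicit object instead of "rmsprop"
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)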

Reviewing our initial example

Let’s go back to the first example and review each piece of it in the light of what we have learned up to now: This was the input data:

mnist <- dataset_mnist()

train_images <- mnist$train$x
train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images / 255

test_images <- mnist$test$x
test_images <- array_reshape(test_images, c(10000, 28 * 28))
test_images <- test_images / 255
  • Now you understand that the input images are stored in tensors of shape (60000, 784) (training data) and (10000, 784) (test data), respectively.
  • This was our network:
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28*28)) %>%
  layer_dense(units = 10, activation = "softmax")
  • Now you understand that this network consists of a chain of two dense layers, that each layer applies a few simple tensor operations to the input data, and that these operations involve weight tensors.
  • We know that layer_dense() creates fully connected layers, so there exists a weight between every element of one layer and every element of the following layer.
  • Weight tensors, which are attributes of the layers, are where the knowledge of the network persists. We know the 2nd layer has 512 cells, the final output layer 10 (equal to the number of classes to predict). Finally, we know that every cell also contains a non-linear activation function, such as relu, sigmoid, or softmax.
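If you want to inspect these weight tensors yourself, get_weights() returns them as a list of plain R arrays. Given the summary above, we expect a 784x512 kernel plus 512 biases for the hidden layer, and a 512x10 kernel plus 10 biases for the output layer:

network %>% get_weights() %>% str()
# expected: a list of 4 arrays
#  - 784 x 512 kernel and 512 biases (hidden dense layer)
#  - 512 x 10 kernel and 10 biases (output layer)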

This was the network-compilation step:

network %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
  )
  • Now you understand that categorical_crossentropy (a measure of how pure the predicted classes are) is a type of loss function that’s used as a feedback signal for learning the weight tensors, and which the training phase will attempt to minimize.
  • You also know that this reduction of the loss happens via mini-batch stochastic gradient descent. The exact rules governing a specific use of gradient descent are defined by the rmsprop optimizer passed as the first argument.

Finally, this was the training loop:

network %>% fit(x = train_images, 
                y = train_labels, 
                epochs = 10, 
                batch_size = 128)
  • Now you understand what happens when you call fit(): the network starts to iterate over the training data in mini-batches of 128 samples, 10 times over (each iteration over all the training data is called an epoch).
  • At each iteration, the network computes the gradients of the weights with regard to the loss on the batch, and updates the weights accordingly.
  • After these 10 epochs, the network will have performed 4,690 gradient updates (469 per epoch), and the loss of the network will be sufficiently low that the network will be capable of classifying handwritten digits with high accuracy. (See the quick sanity check below.)
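The arithmetic behind these numbers, as a quick sanity check:

# batches per epoch: training samples divided by batch size, rounded up
ceiling(60000 / 128) # 469
# total gradient updates over 10 epochs
ceiling(60000 / 128) * 10 # 4690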

At this point, you already know most of what there is to know about the basics of neural networks.

However, there is still some stuff to come, namely:

  1. How to use architectures other than the simple feed-forward one.
  2. How to fight overfitting
  3. How to specify training routines and parameter grid-search
  4. And some more…

But for that, there will be other sessions to come…

Example on tabular data

  • Just to make it a bit less abstract and to work with tabular data, let’s give it a shot with classifying penguins :)

Load data

library(tidymodels)
data <- read_csv("https://github.com/allisonhorst/palmerpenguins/raw/5b5891f01b52ae26ad8cb9755ec93672f49328a8/data/penguins_size.csv")
data %>% glimpse()
Rows: 344
Columns: 7
$ species_short     <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", …
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torg…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, …
$ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, 17.3, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 182, 191, 1…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, …
$ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "FEMALE", "MALE", NA…

Train-test split

data %<>%
  rename(y = species_short) %>%
  relocate(y) %>%
  drop_na()
data_split <- initial_split(data, prop = 0.75, strata = y)

data_train <- data_split  %>%  training()
data_test <- data_split %>% testing()
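
Since we stratified the split on y, the species shares should be roughly equal in both sets; a quick check, assuming the objects created above:

# Compare class shares across the stratified split
data_train %>% count(y) %>% mutate(share = n / sum(n))
data_test %>% count(y) %>% mutate(share = n / sum(n))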

Preprocessing recipe

data_recipe <- data_train %>%
  recipe(y ~.) %>%
  step_center(all_numeric(), -all_outcomes()) %>% # Centers all numeric variables to mean = 0
  step_scale(all_numeric(), -all_outcomes()) %>% # scales all numeric variables to sd = 1
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  prep()
x_train <- juice(data_recipe) %>% select(-starts_with('y')) %>% as.matrix()
x_test <- bake(data_recipe, new_data = data_test) %>% select(-starts_with('y')) %>% as.matrix()
y_train <- juice(data_recipe)  %>% select(starts_with('y')) %>% as.matrix()
y_test <- bake(data_recipe, new_data = data_test) %>% select(starts_with('y')) %>% as.matrix()
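
Before handing the matrices to Keras, it is worth confirming that predictors and one-hot labels line up. A small check; the exact dummy column names (y_Adelie etc.) are an assumption about step_dummy()'s naming, not something shown above:

# Sanity check: matching row counts, one label column per species
dim(x_train)      # rows = training penguins, cols = scaled numerics + island/sex dummies
dim(y_train)      # same number of rows, 3 columns (one per species)
colnames(y_train) # presumably y_Adelie, y_Chinstrap, y_Gentoo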

Define the network

model_keras <- keras_model_sequential()
model_keras %>% 
  # First hidden layer
  layer_dense(
    units              = 12, 
    activation         = "relu", 
    input_shape        = ncol(x_train)) %>% 
  # Dropout to prevent overfitting
  layer_dropout(rate = 0.1) %>%
  # Second hidden layer
  layer_dense(
    units              = 12, 
    activation         = "relu") %>% 
  # Dropout to prevent overfitting
  layer_dropout(rate = 0.1) %>%
  # Output layer
  layer_dense(
    units              = ncol(y_train), 
    activation         = "softmax") 
model_keras %>% 
  compile(
    optimizer = "adam",
    loss = "categorical_crossentropy",
    metrics = "accuracy"
  )
model_keras_hist <- model_keras  %>% fit(x = x_train, 
                                         y = y_train, 
                                         epochs = 10, # How often shall we re-run the model on the whole sample
                                         batch_size = 12, # How many observations should be included in every batch
                                         validation_split = 0.25 # Fraction of the training data held out for validation
                                         )
Epoch 1/10
16/16 [==============================] - 1s 47ms/step - loss: 1.2027 - accuracy: 0.3085 - val_loss: 1.3124 - val_accuracy: 0.0000e+00
Epoch 2/10
16/16 [==============================] - 0s 12ms/step - loss: 1.1194 - accuracy: 0.3830 - val_loss: 1.2318 - val_accuracy: 0.0000e+00
Epoch 3/10
16/16 [==============================] - 0s 12ms/step - loss: 1.0029 - accuracy: 0.5532 - val_loss: 1.1818 - val_accuracy: 0.0159
Epoch 4/10
16/16 [==============================] - 0s 13ms/step - loss: 0.9466 - accuracy: 0.6170 - val_loss: 1.1483 - val_accuracy: 0.0794
Epoch 5/10
16/16 [==============================] - 0s 13ms/step - loss: 0.8769 - accuracy: 0.6755 - val_loss: 1.1026 - val_accuracy: 0.2063
Epoch 6/10
16/16 [==============================] - 0s 12ms/step - loss: 0.8050 - accuracy: 0.7660 - val_loss: 1.0505 - val_accuracy: 0.5873
Epoch 7/10
16/16 [==============================] - 0s 13ms/step - loss: 0.7296 - accuracy: 0.8298 - val_loss: 0.9910 - val_accuracy: 0.6825
Epoch 8/10
16/16 [==============================] - 0s 12ms/step - loss: 0.6871 - accuracy: 0.8404 - val_loss: 0.9175 - val_accuracy: 0.7460
Epoch 9/10
16/16 [==============================] - 0s 15ms/step - loss: 0.6088 - accuracy: 0.8617 - val_loss: 0.8506 - val_accuracy: 0.7937
Epoch 10/10
16/16 [==============================] - 0s 12ms/step - loss: 0.5699 - accuracy: 0.8777 - val_loss: 0.7716 - val_accuracy: 0.8571
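
As with MNIST, the final check is out-of-sample performance. A minimal sketch on the held-out penguins from above:

# Out-of-sample evaluation on the test set
model_keras %>% evaluate(x_test, y_test)

# Predicted class probabilities for the first few test penguins
model_keras %>% predict(x_test) %>% head() %>% round(3)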

Your Turn

Using the Keras API and the reference manuals (Python: https://keras.io/, R: https://keras.rstudio.com/), construct additional simple models (with appropriate metrics):

  • Regression example (a starting sketch follows after this list)
  • Multi-label example

Also consider the further resources mentioned below.

Endnotes

More info

You can find more info about:

  • keras (https://keras.rstudio.com/): excellent documentation, tutorials, and resources regarding keras, maintained by RStudio

Online Resources

Datacamp

  • Introduction to TensorFlow in R (https://learn.datacamp.com/courses/introduction-to-tensorflow-in-r): a bit low-level, but a good intro for starters
  • Also follow the Python intros; they might still be helpful for you.

Others

  • RStudio AI blog (https://blogs.rstudio.com/ai/): excellent source for frequent torch/keras exercises and announcements within the R ecosystem
  • R Markdown Notebooks for "Deep Learning with R" (https://github.com/skeydan/deep-learning-with-r-notebooks): a collection of exercises, tutorials, and demos from the Deep Learning with R book; good illustrations of different types of ML problems and their solutions

Books

  • François Chollet & J. J. Allaire (2018). Deep Learning with R, Manning Publications: good book, but not for free; in case of interest, find it at https://www.manning.com/books/deep-learning-with-r

Session info

sessionInfo()