Automated Feature Extraction with Machine Learning and Image Processing

PD Stefan Bosse

University of Siegen - Dept. Maschinenbau
University of Bremen - Dept. Mathematics and Computer Science

1 / 56


Training and Validation of data-driven Models

Adapting dynamic parameters of a functional network is an iterative optimization problem

Commonly the solution space is infinite, i.e., there is no one valid solution of the optimization problem.

Basic training is demonstrated for an Artificial Neural Network

4 / 56


A simple Artificial Neuron

A simple neuron (perceptron) is a mapping function f (a model) that maps an n-dimensional input vector x onto a scalar output value u:

f(x,w,b) = g\left(\sum_{i=1}^{n} w_i x_i + b\right)

Here w is the weight vector and b an offset (bias); these are the dynamic parameters. The function g is called the transfer or activation function and is normally not parametrized.

5 / 56


A simple Artificial Neuron

A single neuron with a single input p and an output o. w is a weighting factor (a weight for incoming p) and b is a bias (offset)

6 / 56


A Multi-input Artificial Neuron

A single neuron with an input vector p and a scalar output o. w is a weighting factor vector (a weight for incoming p) and b is a bias (offset)

7 / 56


Artificial Neural Network

An ANN is a function graph consisting of interconnected neurons. It is a graph G(V,E) with a set of nodes V (the neurons) and edges E connecting the nodes.

Commonly the neurons are arranged and grouped in layers, but this is not mandatory. There is always one input and one output layer. Hidden layers lie between the input and output layers.

8 / 56


Artificial Neural Network

  • The input layer (commonly) consists of n neurons for n input variables (attributes).

  • The output layer (commonly) consists of m neurons for m output variables (regression) or m target classes (classification)

  • Commonly, but not mandatory, each neuron of a layer i is connected with the outputs of all neurons of the previous layer i-1

9 / 56


Artificial Neural Network

Neural network with neurons arranged in one layer

10 / 56


Loss and Error Functions

Assume there is a set of data samples D, where each sample contains an input feature vector x and an output target feature vector y.

  • The goal of model training is to find a model function that maps x onto y with minimal error over all instances (at least on average)

  • The loss or error function defines the mismatch between a training or test sample and the output of the function f (here for one scalar output y):

y = f(x), \quad \mathrm{MAE}(y,y_0) = |y_0 - y|, \quad \mathrm{MBE}(y,y_0) = y_0 - y, \quad \mathrm{MSE}(y,y_0) = (y_0 - y)^2

11 / 56


Loss and Error Functions

  • For multiple outputs (y) we get:

\mathbf{y} = f(\mathbf{x}), \quad \mathrm{MAE}(\mathbf{y},\mathbf{y}_0) = \frac{\sum_{i=1}^{n} |y_i - y_{0,i}|}{n}, \quad \mathrm{MBE}(\mathbf{y},\mathbf{y}_0) = \frac{\sum_{i=1}^{n} (y_i - y_{0,i})}{n}, \quad \mathrm{MSE}(\mathbf{y},\mathbf{y}_0) = \frac{\sum_{i=1}^{n} (y_i - y_{0,i})^2}{n}
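A minimal sketch in R of the three loss functions for vector-valued outputs (y: model output, y0: target), matching the formulas above:

mae = function(y, y0) mean(abs(y - y0))   # mean absolute error
mbe = function(y, y0) mean(y - y0)        # mean bias error (signed, errors can cancel)
mse = function(y, y0) mean((y - y0)^2)    # mean squared error

mae(c(0.2, 0.8), c(0, 1))                 # 0.2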

12 / 56


Training by Error Backpropagation

Most of the CNN layers involve parameters which are required to be tuned appropriately for a given computer vision task (e.g., image classification and object detection).

  • Assume again a single perceptron neuron with only two inputs a and b.

  • Then we can change the respective weight parameter w just by computing the "forward" application error and subtracting the error, multiplied with the current input value, from the weight w (a rough approximation!):

w'_i = w_i - \alpha\,(y - y_0)\,x_i

13 / 56



Example

data = {
  {x1=0, x2=0, y=0},
  {x1=1, x2=0, y=0.3},
  {x1=0, x2=1, y=0.5},
  {x1=1, x2=1, y=1},
}
function sigmoid(x) {
  1/(1+exp(-x))
}
function neuron(x1,x2,w,b) {
  accu = x1*w[1]+x2*w[2]   # weighted product sum of the inputs
  sigmoid(accu+b)          # activation of the biased sum
}
Some training data and the implementation of the sigmoid (logistic) activation function and a neuron function with two inputs
15 / 56


Example

w = [0,0]
b = 0
samples = 1:4
rate = 0.01
for (run in 1:1000) {
  set = sample(samples,1)        # pick one training sample at random
  row = data[[set]]
  y   = neuron(row$x1,row$x2,w,b)
  err = y-row$y                  # forward application error
  w[1] = w[1]-rate*err*row$x1
  w[2] = w[2]-rate*err*row$x2
  b    = b-rate*err
}
print(w)
print(b)

Training with randomly selected sample instances

16 / 56


Example

for (index in 1:4) {
  row = data[[index]]
  y   = neuron(row$x1,row$x2,w,b)
  print(paste('Index',index,'Predicted',y,'Error',y-row$y))
}

Test with sample instances

17 / 56


Gradient Descent Method

  • Indeed, the gradient of the output error with respect to the weight parameter wi is computed and subtracted from the current weight parameter value:

w'_i = w_i - \alpha\,\frac{\partial (y - y_0)}{\partial w_i}

  • That means the weight parameter is corrected by a term that corresponds to how much the error changes when the weight is changed by a small delta value (as sketched below).
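A minimal sketch in R of this update with the full chain rule for the sigmoid neuron (it reuses the neuron, data row, w, b, and rate objects from the earlier example); compared to the rough rule above, the additional factor y*(1-y) is the derivative of the sigmoid activation:

y    = neuron(row$x1, row$x2, w, b)
err  = y - row$y               # dE/dy for E = 0.5*(y - y0)^2
grad = err * y * (1 - y)       # multiplied by dsigmoid/dnet = y*(1-y)
w[1] = w[1] - rate * grad * row$x1
w[2] = w[2] - rate * grad * row$x2
b    = b    - rate * grad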
18 / 56


The learning rate α determines the steps to be taken along the slope to achieve the goal. Too large steps could result in jumping over or missing the point of the global minimum (also known as overshooting), and too small steps result in a very slow process of achieving the goal. This is a hyperparameter that needs to be tuned. In practice, people often start with 0.01, and either decrease or increase accordingly. (Aminah Mardiyyah Rufai)

19 / 56


Learning Rate

  • But: We have a lot of different training samples, and if we change the parameter only based on the error from the current sample we will not converge to an average!

  • Therefore, only a small fraction given by the learning rate parameter α is used!
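A small sketch in R (assuming a purely illustrative 1-D quadratic loss f(w) = w^2) of the effect of the learning rate: a too large α overshoots and diverges, a too small α converges very slowly.

descend = function(alpha, w=1, steps=50) {
  for (i in 1:steps) w = w - alpha * 2*w   # gradient of w^2 is 2w
  w
}
descend(0.1)     # close to 0: converged
descend(0.001)   # still near 0.9: far too slow
descend(1.1)     # huge magnitude: overshooting/divergence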

20 / 56


Error backpropagation in layered Networks

  • Up to here we considered only one functional node (one neuron).

  • If parameters of functions of previous nodes/layers must be adapted, the process is a little bit more complicated, although the same principle is applied, i.e., in general the derivative of the error function with respect to the respective weight/parameter to be adjusted must be computed:

\frac{\partial E}{\partial w_i}

21 / 56


Matt Mazur, https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/. Example network with two input nodes, two inner (hidden) nodes, and two output nodes

22 / 56


  • Let us now consider one node with an input vector x, a product-sum result net(x,w) to which the transfer function f is applied, and a resulting output out; then we can write, using a simple chain rule:

\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial out_i} \cdot \frac{\partial out_i}{\partial net_i} \cdot \frac{\partial net_i}{\partial w_i}

23 / 56


  • In the hidden (inner) layer, we start with the same formula, slightly modified to account for the fact that the output of each hidden layer neuron contributes to the output (and therefore to the error) of multiple output neurons.

  • We know that out_h1 affects both out_o1 and out_o2, therefore the gradient needs to take its effect on both output neurons into consideration:

\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial out_{h1}} \cdot \frac{\partial out_{h1}}{\partial net_{h1}} \cdot \frac{\partial net_{h1}}{\partial w_1}, \qquad \frac{\partial E}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial out_{h1}} + \frac{\partial E_{o2}}{\partial out_{h1}}
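The following minimal sketch in R puts the chain rule together for a small 2-2-2 network with sigmoid activations and a squared-error loss; the numeric values mirror the referenced Mazur example, and all variable names are illustrative.

sigmoid  = function(x) 1/(1+exp(-x))
dsigmoid = function(y) y*(1-y)            # derivative expressed via the output y

x  = c(0.05, 0.10)                        # input vector
y0 = c(0.01, 0.99)                        # target vector
Wh = matrix(c(0.15,0.20, 0.25,0.30), 2, 2, byrow=TRUE)   # input -> hidden weights
Wo = matrix(c(0.40,0.45, 0.50,0.55), 2, 2, byrow=TRUE)   # hidden -> output weights
bh = 0.35; bo = 0.60; alpha = 0.5

# Forward pass
net_h = as.vector(Wh %*% x) + bh;     out_h = sigmoid(net_h)
net_o = as.vector(Wo %*% out_h) + bo; out_o = sigmoid(net_o)

# Output layer: dE/dWo = dE/dout_o * dout_o/dnet_o * dnet_o/dWo
delta_o = (out_o - y0) * dsigmoid(out_o)
grad_Wo = delta_o %o% out_h               # outer product gives the gradient matrix

# Hidden layer: accumulate the error contributions of both output neurons
dE_dout_h = as.vector(t(Wo) %*% delta_o)  # = dE_o1/dout_h + dE_o2/dout_h
delta_h   = dE_dout_h * dsigmoid(out_h)
grad_Wh   = delta_h %o% x

# Gradient descent step
Wo = Wo - alpha * grad_Wo
Wh = Wh - alpha * grad_Wh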

24 / 56


Error backpropagation from output to inner layer nodes must consider error accumulation by multiple nodes

25 / 56


Weight Initialization

A correct weight initialization is the key to stably train very deep networks. An ill-suited initialization can lead to the vanishing or exploding gradient problem during error back-propagation.

Gaussian Random Initialization

A common approach to weight initialization in CNNs is the Gaussian random initialization technique. This approach initializes the convolutional and the fully connected layers using random matrices whose elements are sampled from a Gaussian distribution with zero mean and a small standard deviation (e.g., 0.1 and 0.01).
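A minimal sketch in R of Gaussian random initialization for one fully connected layer (the layer sizes n_in and n_out are assumed values):

n_in = 64; n_out = 32
W = matrix(rnorm(n_out*n_in, mean=0, sd=0.01), n_out, n_in)   # small zero-mean Gaussian weights
b = rep(0, n_out)                                             # biases usually start at zero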

26 / 56


Uniform Random Initialization

The uniform random initialization approach initializes the convolutional and the fully connected layers using random matrices whose elements are sampled from a uniform distribution (instead of a normal distribution as in the earlier case) with a zero mean and a small standard deviation (e.g., 0.1 and 0.01).

  • The uniform and normal random initializations generally perform identically.
  • However, the training of very deep networks can become a problem with a random initialization of weights from a uniform or normal distribution.
    • The reason is that the forward and backward propagated activations can either diminish or explode when the network is very deep.
27 / 56


Xavier Initialization

A random initialization of a neuron makes the variance of its output directly proportional to the number of its incoming connections (a neuron’s fan-in measure).

  • To alleviate this problem, Glorot and Bengio [2010] proposed to randomly initialize the weights with a variance measure that is dependent on the number of incoming and outgoing connections (n_fin and n_fout, respectively) of a neuron,

\mathrm{Var}(w) = \frac{2}{n_{fin} + n_{fout}}

where w are network weights. Note that the fan-out measure is used in the variance above to balance the back-propagated signal as well. Xavier initialization works quite well in practice and leads to better convergence rates.
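A minimal sketch in R of the Xavier initialization (Gaussian variant) using the variance formula above; n_in and n_out denote fan-in and fan-out:

xavier_init = function(n_in, n_out)
  matrix(rnorm(n_out*n_in, 0, sqrt(2/(n_in+n_out))), n_out, n_in)
W = xavier_init(64, 32)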

28 / 56


ReLU-scaled Initialization

Neurons (or filters with transfer functions) with a ReLU non-linearity do not follow the assumptions made for the Xavier initialization.

  • Precisely, since the ReLU activation sets nearly half of the inputs to zero, the variance of the distribution from which the initial weights are randomly sampled should be

\mathrm{Var}(w) = \frac{2}{n_{fin}}

  • The ReLU-aware scaled initialization works better than the Xavier initialization for recent architectures which are based on the ReLU non-linearity.
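The corresponding ReLU-scaled (He) initialization differs only in the variance term; a minimal sketch in R:

he_init = function(n_in, n_out)
  matrix(rnorm(n_out*n_in, 0, sqrt(2/n_in)), n_out, n_in)   # variance 2/n_fin
W = he_init(64, 32)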
29 / 56


Pre-training

One approach to avoid the gradient diminishing or exploding problem is to use layer-wise pre-training in an unsupervised fashion.

  • The unsupervised pre-training can be followed by a supervised fine-tuning stage to make use of any available annotations.

  • However, due to the new hyper-parameters, the considerable amount of effort involved in such an approach, and the availability of better initialization techniques, layer-wise pre-training is seldom used now to enable the training of very deep CNN-based networks.

(not a good idea)

30 / 56


Supervised Pre-Training

In practical scenarios, it is desirable to train very deep networks, but we do not have a large amount of annotated data available for many problem settings.

  • A very successful practice in such cases is to first train the neural network on a related but different problem, where a large amount of training data is already available.

  • Afterward, the learned model can be “adapted” to the new task by initializing with weights pre-trained on the larger dataset.

This process is called “fine-tuning” and is a simple, yet effective, way to transfer learning from one task to another.

31 / 56


Training and Validation (Test)

The set of data samples is commonly split into two sub-sets:

  1. Training data samples only used to compute model errors for model parameter optimization;
  2. Test (or validation) data samples only used to check and assess the current model accuracy.
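A minimal sketch in R of such a random split (assuming a data frame D with one sample per row; 80% training, 20% test):

set.seed(42)
n = nrow(D)
idx     = sample(1:n, size = round(0.8*n))
D_train = D[idx, ]     # used only for parameter optimization
D_test  = D[-idx, ]    # used only for accuracy assessment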

For gradient error back-propagation, commonly linear error functions are used. For the validation, higher-order functions (like the MSE) can be used.

32 / 56


Regularization

Since deep convolutional and neural networks have a large number of parameters, they tend to over-fit on the training data during the learning process.

  • Over-fitting means that the model performs really well on the training data but fails to generalize well to unseen data.
  • It therefore results in an inferior performance on new data (usually the test set).

Regularization approaches aim to avoid this problem using several intuitive ideas.

33 / 56


Regularization

We can categorize common regularization approaches into the following classes, based on their central idea:

  • approaches which regularize the network using data level techniques (e.g., data augmentation);
  • approaches which introduce stochastic behavior in the neural activations (e.g., dropout and drop connect);
  • approaches which align parameters of "saturated" nodes to bring them back into the non-saturation range;
  • approaches which normalize batch statistics in the feature activations (e.g., batch normalization);
  • approaches which use decision level fusion to avoid over-fitting (e.g., ensemble model averaging);
  • approaches which introduce constraints on the network weights (e.g., ℓ1 norm, ℓ2 norm, max-norm, and elastic net constraints); and
  • approaches which use guidance from a validation set to halt the learning process (e.g., early stopping).
34 / 56


Data Augmentation

Data augmentation is the easiest, and often a very effective way of enhancing the generalization power of CNN models. Especially for cases where the number of training examples is relatively low, data augmentation can enlarge the dataset (by factors of 16x, 32x, 64x, or even more) to allow a more robust training of large-scale models.

Data augmentation is performed by making several copies from a single image using straightforward operations such as rotations, cropping, flipping, scaling, translations, and shearing. These operations can be performed separately or in combination, e.g., to form copies which are both flipped and cropped.
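A minimal sketch in R of some of these operations on a grey-value image stored as a numeric matrix img (a hypothetical input, assumed larger than the crop margin):

flip_h = img[, ncol(img):1]           # horizontal flip (mirror columns)
flip_v = img[nrow(img):1, ]           # vertical flip (mirror rows)
rot90  = t(apply(img, 2, rev))        # rotate 90 degrees clockwise
crop   = img[5:(nrow(img)-4), 5:(ncol(img)-4)]   # central crop
augmented = list(img, flip_h, flip_v, rot90, crop)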

35 / 56


Khan, 2018 Examples of data augmentation using image cropping, flipping, and rotation

36 / 56


Drop Out

One of the most popular approaches for neural network regularization is the dropout technique.

  • During network training, each neuron is activated with a fixed probability (usually 0.5 or set using a validation set).

  • This random sampling of a sub-network within the full-scale network introduces an ensemble effect during the testing phase, where the full network is used to perform prediction.

  • Activation dropout works really well for regularization purposes and gives a significant boost in performance on unseen data in the test phase.

37 / 56


A random dropout layer generates a mask m ∈ B^m, where each element m_i is independently sampled from a Bernoulli distribution with a probability p of being on (or 1-p of being off).

  • This mask is used to modify the output activations from the previous layer, i.e.:

a^l = m \circ f(W\,a^{l-1} + b^l)

Here, a ∈ ℝ^n and b ∈ ℝ^m denote the activations and biases, respectively, W ∈ ℝ^(m×n) is the weight matrix, and f the transfer function.
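A minimal sketch in R of a dropout mask applied to the output of one layer (the sizes n, m and the probability p are assumed values):

n = 4; m = 3; p = 0.5
a_prev = runif(n)                           # activations of layer l-1
W      = matrix(rnorm(m*n, 0, 0.1), m, n)   # weight matrix (m x n)
b      = rep(0, m)
f      = function(x) 1/(1+exp(-x))          # sigmoid transfer function
mask   = rbinom(m, 1, p)                    # Bernoulli mask, each element on with probability p
a      = mask * f(as.vector(W %*% a_prev) + b)   # element-wise masking of the layer output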

38 / 56


Ensemble Model Averaging

The ensemble averaging approach is another simple, but effective, technique where a number of models are learned instead of just a single model.

  • Each model has different parameters due to different random initializations, different hyper-parameter choices (e.g., architecture, learning rate) and/or different sets of training inputs.

  • The output from these multiple models is then combined to generate a final prediction score.

39 / 56


Ensemble Model Averaging

  • The prediction combination approach can be a simple output averaging, a majority voting scheme, or a weighted combination of all predictions (see the sketch below).
    • The final prediction is more accurate and less prone to over-fitting compared to each individual model in the ensemble.
    • The committee of experts (ensemble) acts as an effective regularization mechanism which enhances the generalization power of the overall system.
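A minimal sketch in R of decision-level fusion (assuming each column of P holds one model's class probabilities for a single input):

P = cbind(m1 = c(0.7, 0.2, 0.1),
          m2 = c(0.6, 0.3, 0.1),
          m3 = c(0.2, 0.5, 0.3))
avg_pred   = rowMeans(P)                   # simple output averaging
class_avg  = which.max(avg_pred)           # fused class decision
votes      = apply(P, 2, which.max)        # each model votes for its top class
class_vote = as.integer(names(which.max(table(votes))))   # majority voting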
40 / 56


Early Stopping

The overfitting problem occurs when a model performs very well on the training set but behaves poorly on unseen data.

  • Early stopping is applied to avoid overfitting in the iterative gradient-based algorithms.

  • This is achieved by evaluating the performance on a held-out validation set at different iterations during the training process.

    • The training algorithm can continue to improve on the training set as long as the performance on the validation set also improves.
    • Once there is a drop in the generalization ability of the learned model, the learning process can be stopped or slowed down.
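A minimal sketch in R of an early stopping loop (train_one_epoch and validation_error are hypothetical helper functions, and model is assumed to be initialized beforehand):

best_err = Inf; patience = 5; wait = 0
for (epoch in 1:200) {
  model = train_one_epoch(model)
  err   = validation_error(model)           # performance on the held-out validation set
  if (err < best_err) {                     # generalization still improving
    best_err = err; best_model = model; wait = 0
  } else {                                  # validation error no longer improves
    wait = wait + 1
    if (wait >= patience) break             # stop the learning process
  }
}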
41 / 56


Khan, 2018 An illustration of the early stopping approach during network training using the validation error for decision making instead of a pre-defined training error threshold.

42 / 56


Gradient-based CNN Learning

The CNN learning process tunes the parameters of the network such that the input space is correctly mapped to the output space.

  • At each training step, the current estimate of the output variables is matched with the desired output (often termed the “ground-truth” or the “label space”).
  • This matching function serves as an objective function during the CNN training and it is usually called the loss function or the error function.
  • The CNN training process involves the optimization of its parameters such that the loss function is minimized.
43 / 56


Each iteration which updates the parameters using the complete training set is called a “training epoch”.

Each training iteration at time t modifies the parameters using the following update equation (the same holds for linear filter mask weights as well as for non-linear neuronal functions):

\theta^t = \theta^{t-1} - \alpha\,\delta^t, \qquad \delta^t = \nabla_\theta F(\theta^t)

44 / 56


But in contrast to a neuron with fixed input data for a given data sample, the filter mask of a convolution operation is moved as a window over the entire input matrix!

Let's say we have a 3x3 image I and a 2x2 filter W. Sliding this filter over the image will produce a 2x2 output (no padding).

  • The four elements of this output would be:

O_{11} = I_{11}W_{11} + I_{12}W_{12} + I_{21}W_{21} + I_{22}W_{22}
O_{12} = I_{12}W_{11} + I_{13}W_{12} + I_{22}W_{21} + I_{23}W_{22}
O_{21} = I_{21}W_{11} + I_{22}W_{12} + I_{31}W_{21} + I_{32}W_{22}
O_{22} = I_{22}W_{11} + I_{23}W_{12} + I_{32}W_{21} + I_{33}W_{22}

45 / 56


  • The next layer can be pooling and then this output can be fed into a dense layer, after flattening if necessary. For example, if it's average pooling with 2x2 pool size, we have a single output:

o = \frac{O_{11} + O_{12} + O_{21} + O_{22}}{4}

  • If L is the loss function, then we get:

\frac{\partial L}{\partial W} = \begin{bmatrix} \frac{\partial L}{\partial W_{11}} & \frac{\partial L}{\partial W_{12}} \\ \frac{\partial L}{\partial W_{21}} & \frac{\partial L}{\partial W_{22}} \end{bmatrix}

The error gradient must be computed and accumulated over all positions of the filter window on the input image!
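A minimal sketch in R of this forward pass and of the accumulated weight gradient for the 3x3 image, 2x2 filter, and 2x2 average pooling example above (the loss is assumed to be L = 0.5*(o - target)^2; all values are illustrative):

I = matrix(1:9/10, 3, 3, byrow=TRUE)        # 3x3 input image
W = matrix(c(0.1,0.2, 0.3,0.4), 2, 2, byrow=TRUE)
target = 1                                  # target value

O = matrix(0, 2, 2)                         # valid convolution (no padding)
for (i in 1:2) {
  for (j in 1:2) {
    O[i,j] = sum(I[i:(i+1), j:(j+1)] * W)
  }
}

o     = mean(O)                             # 2x2 average pooling
dL_do = o - target                          # dL/do
dL_dO = matrix(dL_do/4, 2, 2)               # do/dO[i,j] = 1/4

dL_dW = matrix(0, 2, 2)                     # accumulate over all output positions
for (i in 1:2) {
  for (j in 1:2) {
    dL_dW = dL_dW + dL_dO[i,j] * I[i:(i+1), j:(j+1)]
  }
}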

46 / 56


Batch Gradient-Descent

  • Gradient descent algorithms work by computing the gradient of the objective function with respect to the network parameters, followed by a parameter update in the direction of the steepest descent.

  • The basic version of the gradient descent, termed “batch gradient descent,” computes this gradient on the entire training set.

    • It is guaranteed to converge to the global minimum for the case of convex problems.
    • For non-convex problems, it can still attain a local minimum.
  • However, the training sets can be very large in computer vision problems, and therefore learning via the batch gradient descent can be prohibitively slow because for each parameter update, it needs to compute the gradient on the complete training set.

47 / 56


Stochastic Gradient-Descent

Stochastic Gradient Descent (SGD) performs a parameter update for each set of input and output that are present in the training set.

  • As a result, it converges much faster compared to the batch gradient descent. Furthermore, it is able to learn in an “online manner”, where the parameters can be tuned in the presence of new training examples.
  • The only problem is that its convergence behavior is usually unstable, especially for relatively larger learning rates and when the training datasets contain diverse examples.
  • When the learning rate is appropriately set, the SGD generally achieves a similar convergence behavior, compared to the batch gradient descent, for both the convex and non-convex problems.
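A conceptual sketch in R contrasting the two update schemes (grad, theta, alpha, X, and Y are hypothetical: grad(theta, X, Y) is assumed to return the loss gradient for the given samples, with one sample per row of X):

# Batch gradient descent: one update per epoch, gradient over the full training set
for (epoch in 1:100) {
  theta = theta - alpha * grad(theta, X, Y)
}

# Stochastic gradient descent: one update per (randomly ordered) sample
for (epoch in 1:100) {
  for (i in sample(1:nrow(X))) {
    theta = theta - alpha * grad(theta, X[i, , drop=FALSE], Y[i])
  }
}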
48 / 56


Gradient Computation

  • A gradient ∇ can be approximated by small difference terms:

\nabla = \frac{\partial u}{\partial v} \approx \frac{\Delta u}{\Delta v} = \frac{u_i - u_{i-1}}{v_i - v_{i-1}}

  • But such a difference formula tends to be very inaccurate for large gradients (not known in advance and dynamic). So analytical differentiation (of a node function) is preferred if possible.

  • On the other hand, analytically deriving the derivatives of complex expressions is time-consuming and laborious. Furthermore, it is necessary to model the layer operation as a closed-form mathematical expression. However, it provides an accurate value for the derivative at each point.

49 / 56


Gradients of functions f can be computed by:

  1. Numerical differentiation (approximation from samples)

\frac{\Delta f}{\Delta x} = \frac{f(x+h) - f(x)}{h}

  2. Analytical differentiation (for simple functions)

  3. Symbolic differentiation (for complex functions)

  4. Programmed (automatic) differentiation
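A minimal sketch in R comparing approaches 1 and 2 for the sigmoid function used earlier:

f      = function(x) 1/(1+exp(-x))
df_ana = function(x) f(x)*(1-f(x))               # analytical derivative
df_num = function(x, h=1e-5) (f(x+h)-f(x))/h     # forward difference approximation

df_ana(0.3)    # exact value
df_num(0.3)    # close approximation; the accuracy depends on h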

50 / 56


Every computer program is implemented using a programming language, which only supports a set of basic functions (e.g., addition, multiplication, exponentiation, logarithm, and trigonometric functions). Automatic differentiation uses this modular nature of computer programs to break them into simpler elementary functions. The derivatives of these simple functions are computed symbolically and the chain rule is then applied repeatedly to compute any order of derivatives of complex programs.
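A minimal sketch in R of this idea using forward-mode automatic differentiation with dual numbers (value v, derivative d); the elementary operations carry their symbolic derivatives, and the chain rule combines them (all names are illustrative):

dual  = function(v, d=0) list(v=v, d=d)
d_add = function(a,b) dual(a$v+b$v, a$d+b$d)
d_mul = function(a,b) dual(a$v*b$v, a$d*b$v + a$v*b$d)   # product rule
d_exp = function(a)   dual(exp(a$v), exp(a$v)*a$d)       # chain rule for exp
d_sigmoid = function(a) {                                # composed from elementary ops
  e = d_exp(dual(-a$v, -a$d))
  dual(1/(1+e$v), -e$d/(1+e$v)^2)
}

x = dual(0.5, 1)                      # seed derivative dx/dx = 1
y = d_sigmoid(d_mul(dual(2), x))      # y = sigmoid(2*x)
y$d                                   # = 2 * sigmoid(2*x) * (1 - sigmoid(2*x))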

51 / 56


Khan, 2018 Relationships between different differentiation methods

52 / 56


Summary

Error backpropagation requires a previous forward computation to get the error and to compute the error gradients (Bazaga et al., 2019).

53 / 56


Understanding CNN by Visualization

  • Convolutional networks are large-scale models with a huge number of parameters that are learned in a data driven fashion.
    • Plotting an error curve and objective function on the training and validation sets against the training iterations is one way to track the overall training progress.
    • However, this approach does not give an insight into the actual parameters and activations of the CNN layers.
    • It is often useful to visualize what CNNs have learned during or after the completion of the training process.

The visualization can be categorized into three types depending on the network signal that is used to obtain the visualization, i.e., weights, activations, and gradients. We summarize some of these visualization methods below.

54 / 56


Relevant Regions (ROIs)

Visualization of regions which are important for the correct prediction of a deep network.

This is an iterative method to obtain either a heatmap of regions showing their contribution to a classification result, or to mask out irrelevant regions.
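A minimal sketch in R of such an occlusion sweep (predict_prob is a hypothetical classifier returning the probability of the correct class for an image given as a numeric matrix):

occlusion_map = function(img, predict_prob, patch=8, value=0.5) {
  H = nrow(img); W = ncol(img)
  heat = matrix(0, H, W)
  p0 = predict_prob(img)                           # reference probability
  for (i in seq(1, H-patch+1, by=patch)) {
    for (j in seq(1, W-patch+1, by=patch)) {
      occ = img
      occ[i:(i+patch-1), j:(j+patch-1)] = value    # grey out one region
      drop = p0 - predict_prob(occ)                # probability drop = importance
      heat[i:(i+patch-1), j:(j+patch-1)] = drop
    }
  }
  heat
}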

55 / 56


(a) The grey regions in the input images are sequentially occluded and the output probability of the correct class is plotted as a heat map (blue regions indicate high importance for correct classification). (b) Segmented regions in an image are occluded until only the minimal image details that are required for correct scene class prediction are left

56 / 56