Machine Learning
≈ Looking for a function
General Guidance
Regression
The function outputs a scalar.
Classification
Given options (classes), the function outputs the correct one.
Structured Learning
Create something with structure (image, document).
How to find a function?
- Function with Unknown Parameters
- Define Loss from Training Data
- Loss is a function of the parameters: L(b, w)
- Loss: how good a set of parameter values is.
- Optimization
- Gradient Descent
- Local minima & global minima
- Gradient Descent
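As a concrete illustration of these three steps (a model with unknown parameters, a loss L(b, w), and gradient-descent optimization), here is a minimal numpy sketch; the toy data and the learning rate are assumptions for illustration only, not from the lecture.

```python
import numpy as np

# Hypothetical toy data for a linear model y = b + w * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def loss(b, w):
    # L(b, w): mean squared error between the labels y and predictions b + w*x
    return np.mean((y - (b + w * x)) ** 2)

b, w = 0.0, 0.0          # initial values of the unknown parameters
eta = 0.01               # learning rate (a hyperparameter)
for step in range(1000):
    e = y - (b + w * x)              # prediction error
    grad_b = -2 * np.mean(e)         # dL/db
    grad_w = -2 * np.mean(e * x)     # dL/dw
    b -= eta * grad_b                # gradient descent update
    w -= eta * grad_w
print(b, w, loss(b, w))
```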
Backpropagation
Hyperparameters
- learning rate
- number of sigmoid / ReLU units
- batch size
- number of hidden layers
Linear models
Linear models have a severe limitation: model bias.
Activation function
Sigmoid Function
Rectified Linear Unit (ReLU)
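A small sketch of the two activation functions, assuming the plain textbook forms (the lecture's sigmoid also carries constants c, b, w):

```python
import numpy as np

def sigmoid(x):
    # Smooth S-shaped activation: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Rectified Linear Unit: max(0, x); two ReLUs can compose one hard sigmoid
    return np.maximum(0.0, x)
```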
Update & Epoch
Split the N training examples into batches. Use batch 1 to compute L1, compute the gradient from L1, and use that gradient to update the parameters.
Then take the next batch 2, compute L2, compute the gradient from L2, and update the parameters again.
Then take the next batch, compute L3, compute the gradient from L3, and update again.
Each time the parameters are updated once is called one update.
Going through all the batches once is called one epoch.
1 epoch = see all the batches once
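A rough numpy sketch of this update/epoch bookkeeping; `grad_on_batch` is a hypothetical placeholder for the gradient computed from one batch's loss L_i:

```python
import numpy as np

N, B = 1000, 32                          # N training examples, batch size B
theta = np.zeros(10)                     # model parameters (placeholder)
eta = 0.1
updates = 0

def grad_on_batch(theta, batch_idx):
    # Placeholder for the gradient computed from the loss L_i of one batch.
    return np.zeros_like(theta)

for epoch in range(3):                   # one epoch = see all the batches once
    for start in range(0, N, B):
        batch_idx = np.arange(start, min(start + B, N))
        theta = theta - eta * grad_on_batch(theta, batch_idx)  # one update
        updates += 1
print(updates)                           # ceil(N / B) updates per epoch
```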
Fully Connected Feedforward Network
Convolutional Neural Network (CNN)
When Gradient Is Small: Local Minimum and Saddle Point
Optimization Fails because ……
critical point
- local minima
- saddle point
which one?
Taylor Series Approximation
Around θ': L(θ) ≈ L(θ') + (θ - θ')^T g + 1/2 (θ - θ')^T H (θ - θ')
- Gradient g is a vector
- Hessian H is a matrix
Hessian
At a critical point the gradient g is zero, so the sign of (θ - θ')^T H (θ - θ') tells us the local shape:
For all v: v^T H v > 0 → Local minima
= H is positive definite = all eigenvalues are positive.
For all v: v^T H v < 0 → Local maxima
= H is negative definite = all eigenvalues are negative.
Sometimes v^T H v > 0, sometimes v^T H v < 0 → Saddle point
= some eigenvalues are positive, and some are negative.
Don’t be afraid of saddle points?
H may tell us the parameter update direction!
You can escape the saddle point and decrease the loss.
(this method is seldom used in practice)
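A minimal sketch of this idea on an assumed toy loss L(w1, w2) = w1 · w2, whose critical point at the origin is a saddle point; the eigenvalues of H classify the point, and an eigenvector with a negative eigenvalue gives a direction that decreases the loss:

```python
import numpy as np

# Hypothetical loss L(w1, w2) = w1 * w2 has a critical point at the origin.
H = np.array([[0.0, 1.0],
              [1.0, 0.0]])                 # Hessian at the critical point

eigvals, eigvecs = np.linalg.eigh(H)       # symmetric Hessian -> real eigenvalues
print(eigvals)                             # [-1.  1.] -> mixed signs: saddle point

# Escape direction: move along an eigenvector u whose eigenvalue is negative,
# since the second-order term (1/2) * u^T H u is then negative and the loss decreases.
u = eigvecs[:, np.argmin(eigvals)]
theta = np.zeros(2) + 0.1 * u
```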
Saddle Point v.s. Local Minima
Saddle point in higher dimension?
When you have lots of parameters, perhaps local minima are rare?
Tips for Training: Batch and Momentum
Optimization with Batch
Shuffle
1 epoch = see all the batches once → shuffle after each epoch
In the first epoch the data is split into batches one way;
in the second epoch the data is re-split into batches.
So which examples end up in the same batch differs from epoch to epoch; this is what "Shuffle" means.
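A tiny sketch of the re-shuffling, with assumed N and batch size:

```python
import numpy as np

N, B = 10, 4
for epoch in range(2):
    order = np.random.permutation(N)                   # re-shuffle before each epoch
    batches = [order[i:i + B].tolist() for i in range(0, N, B)]
    print(f"epoch {epoch}: {batches}")                 # batch membership differs per epoch
```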
Small Batch v.s. Large Batch
- Larger batch size does not require longer time to compute the gradient (unless the batch size is too large)
- Smaller batch size requires longer time for one epoch (longer time for seeing all data once)
- Smaller batch size has better performance
- “Noisy” update is better for training
- Smaller batch is better on testing data?
Batch size is a hyperparameter you have to decide.
Have both fish and bear's paw (i.e., get the best of both worlds)?
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (https://arxiv.org/abs/1904.00962)
- Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes (https://arxiv.org/abs/1711.04325)
- Stochastic Weight Averaging in Parallel: Large- Batch Training That Generalizes Well (https://arxiv.org/abs/2001.02312)
- Large Batch Training of Convolutional Networks (https://arxiv.org/abs/1708.03888)
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (https://arxiv.org/abs/1706.02677)
Momentum
(Vanilla) Gradient Descent
Gradient Descent + Momentum
- Movement: movement of last step minus gradient at present
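A minimal sketch of gradient descent with momentum on an assumed toy loss L = θ²; the movement is the previous movement (weighted by λ) minus the current gradient:

```python
import numpy as np

theta = np.array([3.0])
eta, lam = 0.1, 0.9           # learning rate and momentum weight (hyperparameters)
m = np.zeros_like(theta)      # movement of the previous step

def grad(theta):
    return 2 * theta          # gradient of the toy loss L = theta^2

for step in range(100):
    g = grad(theta)
    m = lam * m - eta * g     # movement = last movement minus current gradient
    theta = theta + m         # parameters keep some "inertia" from past gradients
```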
Concluding Remarks
- Critical points have zero gradients.
Critical points can be either saddle points or local minima.
- Can be determined by the Hessian matrix.
- It is possible to escape saddle points along the direction of eigenvectors of the Hessian matrix.
- Local minima may be rare.
Smaller batch size and momentum help escape critical points.
Tips for Training: Adaptive Learning Rate
Training can be difficult even without critical points.
Learning rate cannot be one-size-fits-all
Different parameters need different learning rates.
Root Mean Square
Used in Adagrad
Learning rate adapts dynamically
RMSProp
Adam
Adam: RMSProp + Momentum
Original paper: https://arxiv.org/pdf/1412.6980.pdf
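A sketch of one Adam step following the update rule of the original paper (momentum term m plus an RMSProp-style running square v, with bias correction); the toy loss in the usage lines is an assumption:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update: momentum (m) + RMSProp-style running square (v).
    m = beta1 * m + (1 - beta1) * g          # first moment (direction)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (magnitude only)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage on a toy quadratic loss L = theta^2
theta = np.array([3.0]); m = np.zeros(1); v = np.zeros(1)
for t in range(1, 501):
    g = 2 * theta
    theta, m, v = adam_step(theta, g, m, v, t, eta=0.1)
```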
Learning Rate Scheduling
Learning Rate Decay
As training goes on, we get closer to the destination, so we reduce the learning rate.
Warm Up
Increase the learning rate and then decrease it?
Residual Network
Transformer
Please refer to RAdam https://arxiv.org/abs/1908.03265
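A sketch of a warm-up schedule: the learning rate increases linearly and then decays. The cosine decay shape and all the constants here are assumptions for illustration, not something the lecture prescribes.

```python
import numpy as np

def lr_schedule(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000):
    # Warm up: increase linearly, then decay (cosine decay chosen here as one common option).
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return base_lr * 0.5 * (1 + np.cos(np.pi * progress))

lrs = [lr_schedule(s) for s in range(10000)]   # learning rate at every training step
```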
Summary of Optimization
(Vanilla) Gradient Descent
Various Improvements
- Momentum: weighted sum of the previous gradients (considers direction)
- Adaptive learning rate: root mean square of the gradients (considers magnitude only)
- Learning rate scheduling
Loss Function: Classification
Classification as Regression ?
- Regression
- Classification as regression ?
Class as one-hot vector
Regression
Classification
Softmax
Loss of Classification
Mean Square Error (MSE)
Cross-entropy
Minimizing cross-entropy is equivalent to maximizing likelihood.
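A small numpy sketch of softmax followed by cross-entropy against a one-hot label; the logits are made-up numbers:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_onehot, y_prime):
    # e = -sum_i y_hat_i * ln(y'_i), with y' = softmax(logits)
    return -np.sum(y_onehot * np.log(y_prime + 1e-12))

logits = np.array([3.0, 1.0, -2.0])      # raw network outputs for 3 classes
y_hat = np.array([1.0, 0.0, 0.0])        # one-hot label for class 1
print(cross_entropy(y_hat, softmax(logits)))
```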
Quick Introduction to Batch Normalization
Changing Landscape
Feature Normalization
Considering Deep Learning
Consider a batch
Batch Normalization
Applicable when the batch size is relatively large.
Batch normalization
Original paper:https://arxiv.org/abs/1502.03167
Batch Normalization - Testing
We do not always have a batch at the testing stage.
Compute the moving averages of the batch statistics (μ and σ) during training and use them at testing time.
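A rough sketch of this training/testing difference: during training the batch's own statistics are used and moving averages are accumulated; at testing time the moving averages replace them. The function names and the momentum value are assumptions.

```python
import numpy as np

def batch_norm_train(z, gamma, beta, running, momentum=0.9, eps=1e-8):
    # z: (batch, features). Normalize with the batch's own mu and sigma,
    # then scale/shift with learnable gamma, beta.
    mu, var = z.mean(axis=0), z.var(axis=0)
    z_hat = (z - mu) / np.sqrt(var + eps)
    # Moving averages of the batch statistics are kept for the testing stage.
    running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return gamma * z_hat + beta

def batch_norm_test(z, gamma, beta, running, eps=1e-8):
    # No batch statistics at test time: use the moving averages instead.
    z_hat = (z - running["mu"]) / np.sqrt(running["var"] + eps)
    return gamma * z_hat + beta

# Usage
running = {"mu": np.zeros(3), "var": np.ones(3)}
gamma, beta = np.ones(3), np.zeros(3)
out = batch_norm_train(np.random.randn(16, 3), gamma, beta, running)
```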
Internal Covariate Shift ?
How Does Batch Normalization Help Optimization?
https://arxiv.org/abs/1805.11604
Experimental results (and theoretical analysis) support that batch normalization changes the landscape of the error surface.
To learn more…
- Batch Renormalization
- Layer Normalization
- Instance Normalization
- Group Normalization
- Weight Normalization
- Spectral Normalization
Convolutional Neural Network (CNN)
Network Architecture designed for Image
Image Classification
Do we really need "fully connected" in image processing?
Observation 1
Identifying some critical patterns
Some patterns are much smaller than the whole image.
A neuron does not have to see the whole image.
Simplification 1
Receptive fields can overlap.
- Can different neurons have different sizes of receptive field ?
- Cover only some channels ?
- Not square receptive field ?
Simplification 1 - Typical Setting
- Look at all channels
- Kernel size (e.g., 3x3)
- Each receptive field has a set of neurons (e.g., 64 neurons).
- Stride (a hyperparameter); neighboring receptive fields overlap
- Padding when a receptive field goes beyond the image
- The receptive fields cover the whole image.
Observation 2
- The same patterns appear in different regions.
Simplification 2
parameter sharing
Two neurons with the same receptive field would not share parameters.
Simplification 2 - Typical Setting
Each receptive field has a set of neurons (e.g., 64 neurons).
The corresponding neurons of different receptive fields share the same set of parameters.
Benefit of Convolutional Layer
Larger model bias (which suits images)
Convolutional Layer
Another story based on filter
Comparison of Two Stories
The neurons with different receptive fields share the parameters.
Each filter convolves over the input image.
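A minimal numpy sketch of the filter story: one shared kernel slides over the image with a stride and produces a feature map (no padding in this sketch; the shapes are made up):

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    # image: (H, W), kernel: (k, k). Each output value is the inner product of
    # the kernel with one receptive field; the same kernel (shared parameters)
    # slides over the whole image.
    H, W = image.shape
    k = kernel.shape[0]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)             # a 3x3 filter
print(conv2d_single(image, kernel).shape) # (4, 4) feature map
```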
Observation 3
- Subsampling the pixels will not change the object
Pooling - Max Pooling
The whole CNN
Image → Convolution → Pooling → Convolution → Pooling → Flatten → Fully Connected Layers → Softmax → Output
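A small sketch of max pooling, which implements the subsampling step between the convolution stages above:

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    # Keep the maximum in each size x size region; subsampling does not change
    # what object is in the image, but it shrinks the feature map.
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    fm = feature_map[:H2 * size, :W2 * size]
    return fm.reshape(H2, size, W2, size).max(axis=(1, 3))

fm = np.random.rand(4, 4)
print(max_pool2d(fm))   # 2x2 pooled map; after repeated conv+pool: flatten -> FC -> softmax
```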
Application: Playing Go
Why use a CNN for playing Go?
More Applications
Speech
https://dl.acm.org/doi/10.1109/TASLP.2014.2339736
Natural Language Processing
https://www.aclweb.org/anthology/S15-2079/
To learn more…
- CNN is not invariant to scaling and rotation (we need data augmentation).
Spatial Transformer Layer https://youtu.be/SoCywZ1hZak (in Mandarin)
Self-attention
Sophisticated Input
- Input is a vector
- Input is a set of vectors
Vector Set as Input
One-hot Encoding
Word Embedding
- Graph is also a set of vectors (consider each node as a vector)
What is the output ?
- Each vector has a label. (Sequence Labeling)
- The whole sequence has a label.
- Model decides the number of labels itself.
self-attention
Attention is all you need. https://arxiv.org/abs/1706.03762
Dot-product
Additive
The query q and the key k produce the attention score α.
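A compact numpy sketch of dot-product self-attention: queries, keys and values come from the same input, the scores are softmax-normalized, and the output is the weighted sum of the values. The scaling by √d and the random shapes are assumptions for illustration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model). q = X Wq, k = X Wk, v = X Wv.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # dot-product attention scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)       # softmax over each row (alpha')
    return A @ V                                # weighted sum of the values

seq_len, d = 4, 8
X = np.random.randn(seq_len, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (4, 8)
```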
Multi-head Self-attention
Different types of relevance
Positional Encoding
- No position information in self-attention.
- Each position has a unique positional vector
- hand-crafted
- learned from data
https://arxiv.org/abs/2003.09229
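As one hand-crafted example, a sketch of the sinusoidal positional encoding from "Attention Is All You Need"; learned positional vectors are an equally valid choice.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Even dimensions use sin, odd dimensions use cos, with wavelengths that
    # grow geometrically; the result is added to the input vectors.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))   # (seq_len, d_model)

print(sinusoidal_positional_encoding(6, 8).shape)
```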
Many applications …
Transformer https://arxiv.org/abs/1706.03762
BERT https://arxiv.org/abs/1810.04805
Widely used in Natural Language Processing (NLP)!
Self-attention for Speech
Truncated Self-attention
Self-attention for Image
Self-Attention GAN https://arxiv.org/abs/1805.08318
Detection Transformer (DETR) https://arxiv.org/abs/2005.12872
Self-attention for Graph
Consider edges: attend only to connected nodes.
This is one type of Graph Neural Network (GNN).
Self-attention v.s. CNN
CNN: self-attention that can attend only within a receptive field
- CNN is simplified self-attention.
Self-attention: CNN with a learnable receptive field
- Self-attention is the complex version of CNN.
On the Relationship between Self-Attention and Convolutional Layers https://arxiv.org/abs/1911.03584
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale https://arxiv.org/pdf/2010.11929.pdf
Self-attention v.s. RNN
Recurrent Neural Network (RNN)
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention https://arxiv.org/abs/2006.16236
To Learn More …
Long Range Arena: A Benchmark for Efficient Transformers https://arxiv.org/abs/2011.04006
Efficient Transformers: A Survey https://arxiv.org/abs/2009.06732
Transformer
Sequence-to-sequence (Seq2seq)
Input a sequence, output a sequence.
The output length is determined by the model.
- Speech Recognition
- Machine Translation
- Speech Translation
Text-to-Speech (TTS) Synthesis
Seq2seq for Chatbot
Most Natural Language Processing applications …
Question Answering (QA)
Seq2seq for Syntactic Parsing
Grammar as a Foreign Language https://arxiv.org/abs/1412.7449
Seq2seq for Multi-label Classification
c.f. Multi-class Classification
An object can belong to multiple classes.
https://arxiv.org/abs/1909.03434
https://arxiv.org/abs/1707.05495
Seq2seq for Object Detection
https://arxiv.org/abs/2005.12872
Seq2seq
Encoder + Decoder
To learn more ……
On Layer Normalization in the Transformer Architecture
PowerNorm: Rethinking Batch Normalization in Transformers
Autoregressive (Speech Recognition as an example)
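A schematic sketch of autoregressive (greedy) decoding: the decoder consumes BOS plus its own previous outputs and stops when it emits EOS, so the model itself decides the output length. `step_fn`, `bos_id` and `eos_id` are hypothetical placeholders for the actual decoder and vocabulary.

```python
import numpy as np

def greedy_decode(encoder_out, step_fn, bos_id=0, eos_id=1, max_len=50):
    # step_fn(encoder_out, tokens) is a placeholder for the decoder forward pass;
    # it should return a probability distribution over the vocabulary.
    tokens = [bos_id]
    for _ in range(max_len):
        probs = step_fn(encoder_out, tokens)
        next_id = int(np.argmax(probs))        # take the most likely token (greedy)
        tokens.append(next_id)
        if next_id == eos_id:                  # output length is decided by the model
            break
    return tokens
```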