Machine Learning
≈ Looking for a function
General Guidance
Regression
The function outputs a scalar.
Classification
Given options (classes), the function outputs the correct one.
Structured Learning
Create something with structure (image, document).
How to find a function?
- Function with Unknown Parameters
- Define Loss from Training Data
- Loss is a function of the parameters: L(b, w)
- Loss: how good a set of parameter values is.
- Optimization
- Gradient Descent
- Local minima & global minima
- Gradient Descent
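As a concrete illustration of these three steps (a model with unknown parameters, a loss L(b, w), and gradient-descent optimization), here is a minimal numpy sketch; the toy data and the learning rate are assumptions for illustration only, not from the lecture.

```python
import numpy as np

# Hypothetical toy data for a linear model y = b + w * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def loss(b, w):
    # L(b, w): mean squared error between the labels y and predictions b + w*x
    return np.mean((y - (b + w * x)) ** 2)

b, w = 0.0, 0.0          # initial values of the unknown parameters
eta = 0.01               # learning rate (a hyperparameter)
for step in range(1000):
    e = y - (b + w * x)              # prediction error
    grad_b = -2 * np.mean(e)         # dL/db
    grad_w = -2 * np.mean(e * x)     # dL/dw
    b -= eta * grad_b                # gradient descent update
    w -= eta * grad_w
print(b, w, loss(b, w))
```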
Backpropagation
Hyperparameters
- learning rate
- number of sigmoid / ReLU units
- batch size
- number of hidden layers
Linear models
Linear models have a severe limitation: model bias.
Activation function
Sigmoid Function
Rectified Linear Unit (ReLU)
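A small sketch of the two activation functions, assuming the plain textbook forms (the lecture's sigmoid also carries constants c, b, w):

```python
import numpy as np

def sigmoid(x):
    # Smooth S-shaped activation: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Rectified Linear Unit: max(0, x); two ReLUs can compose one hard sigmoid
    return np.maximum(0.0, x)
```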
Update & Epoch
Split the N training examples into batches. Use batch 1 to compute L1, compute the gradient from L1, and use that gradient to update the parameters.
Then take the next batch 2, compute L2, compute the gradient from L2, and update the parameters again.
Then take the next batch, compute L3, compute the gradient from L3, and update again.
Each time the parameters are updated once is called one update.
Going through all the batches once is called one epoch.
1 epoch = see all the batches once
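A rough numpy sketch of this update/epoch bookkeeping; `grad_on_batch` is a hypothetical placeholder for the gradient computed from one batch's loss L_i:

```python
import numpy as np

N, B = 1000, 32                          # N training examples, batch size B
theta = np.zeros(10)                     # model parameters (placeholder)
eta = 0.1
updates = 0

def grad_on_batch(theta, batch_idx):
    # Placeholder for the gradient computed from the loss L_i of one batch.
    return np.zeros_like(theta)

for epoch in range(3):                   # one epoch = see all the batches once
    for start in range(0, N, B):
        batch_idx = np.arange(start, min(start + B, N))
        theta = theta - eta * grad_on_batch(theta, batch_idx)  # one update
        updates += 1
print(updates)                           # ceil(N / B) updates per epoch
```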
Fully Connected Feedforward Network
Convolutional Neural Network (CNN)
When Gradient Is Small: Local Minimum and Saddle Point
Optimization Fails because ……
critical point
- local minima
- saddle point
which one?
Taylor Series Approximation
Around θ': L(θ) ≈ L(θ') + (θ - θ')^T g + 1/2 (θ - θ')^T H (θ - θ')
- Gradient g is a vector
- Hessian H is a matrix
Hessian
At a critical point the gradient g is zero, so the sign of (θ - θ')^T H (θ - θ') tells us the local shape:
For all v: v^T H v > 0 → Local minima
= H is positive definite = all eigenvalues are positive.
For all v: v^T H v < 0 → Local maxima
= H is negative definite = all eigenvalues are negative.
Sometimes v^T H v > 0, sometimes v^T H v < 0 → Saddle point
= some eigenvalues are positive, and some are negative.
Don’t be afraid of saddle points?
H may tell us the parameter update direction!
You can escape the saddle point and decrease the loss.
(this method is seldom used in practice)
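A minimal sketch of this idea on an assumed toy loss L(w1, w2) = w1 · w2, whose critical point at the origin is a saddle point; the eigenvalues of H classify the point, and an eigenvector with a negative eigenvalue gives a direction that decreases the loss:

```python
import numpy as np

# Hypothetical loss L(w1, w2) = w1 * w2 has a critical point at the origin.
H = np.array([[0.0, 1.0],
              [1.0, 0.0]])                 # Hessian at the critical point

eigvals, eigvecs = np.linalg.eigh(H)       # symmetric Hessian -> real eigenvalues
print(eigvals)                             # [-1.  1.] -> mixed signs: saddle point

# Escape direction: move along an eigenvector u whose eigenvalue is negative,
# since the second-order term (1/2) * u^T H u is then negative and the loss decreases.
u = eigvecs[:, np.argmin(eigvals)]
theta = np.zeros(2) + 0.1 * u
```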
Saddle Point v.s. Local Minima
Saddle point in higher dimension?
When you have lots of parameters, perhaps local minima are rare?
Tips for Training: Batch and Momentum
Optimization with Batch
Shuffle
1 epoch = see all the batches once → shuffle after each epoch
In the first epoch the data is split into batches one way;
in the second epoch the data is re-split into batches.
So which examples end up in the same batch differs from epoch to epoch; this is what "Shuffle" means.
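A tiny sketch of the re-shuffling, with assumed N and batch size:

```python
import numpy as np

N, B = 10, 4
for epoch in range(2):
    order = np.random.permutation(N)                   # re-shuffle before each epoch
    batches = [order[i:i + B].tolist() for i in range(0, N, B)]
    print(f"epoch {epoch}: {batches}")                 # batch membership differs per epoch
```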
Small Batch v.s. Large Batch
- Larger batch size does not require longer time to compute the gradient (unless the batch size is too large)
- Smaller batch size requires longer time for one epoch (longer time for seeing all data once)
- Smaller batch size has better performance
- “Noisy” update is better for training
- Smaller batch is better on testing data?
Batch size is a hyperparameter you have to decide.
Have both fish and bear's paw (i.e., get the best of both worlds)?
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (https://arxiv.org/abs/1904.00962)
- Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes (https://arxiv.org/abs/1711.04325)
- Stochastic Weight Averaging in Parallel: Large- Batch Training That Generalizes Well (https://arxiv.org/abs/2001.02312)
- Large Batch Training of Convolutional Networks (https://arxiv.org/abs/1708.03888)
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (https://arxiv.org/abs/1706.02677)
Momentum
(Vanilla) Gradient Descent
Gradient Descent + Momentum
- Movement: movement of last step minus gradient at present
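A minimal sketch of gradient descent with momentum on an assumed toy loss L = θ²; the movement is the previous movement (weighted by λ) minus the current gradient:

```python
import numpy as np

theta = np.array([3.0])
eta, lam = 0.1, 0.9           # learning rate and momentum weight (hyperparameters)
m = np.zeros_like(theta)      # movement of the previous step

def grad(theta):
    return 2 * theta          # gradient of the toy loss L = theta^2

for step in range(100):
    g = grad(theta)
    m = lam * m - eta * g     # movement = last movement minus current gradient
    theta = theta + m         # parameters keep some "inertia" from past gradients
```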
Concluding Remarks
- Critical points have zero gradients.
Critical points can be either saddle points or local minima.
- Can be determined by the Hessian matrix.
- It is possible to escape saddle points along the direction of eigenvectors of the Hessian matrix.
- Local minima may be rare.
Smaller batch size and momentum help escape critical points.
Tips for Training: Adaptive Learning Rate
Training can be difficult even without critical points.
Learning rate cannot be one-size-fits-all
Different parameters need different learning rates.
Root Mean Square
Used in Adagrad
Learning rate adapts dynamically
RMSProp
Adam
Adam: RMSProp + Momentum
Original paper: https://arxiv.org/pdf/1412.6980.pdf
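A sketch of one Adam step following the update rule of the original paper (momentum term m plus an RMSProp-style running square v, with bias correction); the toy loss in the usage lines is an assumption:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update: momentum (m) + RMSProp-style running square (v).
    m = beta1 * m + (1 - beta1) * g          # first moment (direction)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (magnitude only)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage on a toy quadratic loss L = theta^2
theta = np.array([3.0]); m = np.zeros(1); v = np.zeros(1)
for t in range(1, 501):
    g = 2 * theta
    theta, m, v = adam_step(theta, g, m, v, t, eta=0.1)
```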
Learning Rate Scheduling
Learning Rate Decay
As training goes on, we get closer to the destination, so we reduce the learning rate.
Warm Up
Increase the learning rate and then decrease it?
Residual Network
Transformer
Please refer to RAdam https://arxiv.org/abs/1908.03265
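A sketch of a warm-up schedule: the learning rate increases linearly and then decays. The cosine decay shape and all the constants here are assumptions for illustration, not something the lecture prescribes.

```python
import numpy as np

def lr_schedule(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000):
    # Warm up: increase linearly, then decay (cosine decay chosen here as one common option).
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return base_lr * 0.5 * (1 + np.cos(np.pi * progress))

lrs = [lr_schedule(s) for s in range(10000)]   # learning rate at every training step
```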
Summary of Optimization
(Vanilla) Gradient Descent
Various Improvements
- Momentum: weighted sum of the previous gradients (considers direction)
- Adaptive learning rate: root mean square of the gradients (considers magnitude only)
- Learning rate scheduling
Loss Function: Classification
Classification as Regression ?
- Regression
- Classification as regression ?
Class as one-hot vector
Regression
Classification
Softmax
Loss of Classification
Mean Square Error (MSE)
Cross-entropy
Minimizing cross-entropy is equivalent to maximizing likelihood.
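A small numpy sketch of softmax followed by cross-entropy against a one-hot label; the logits are made-up numbers:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_onehot, y_prime):
    # e = -sum_i y_hat_i * ln(y'_i), with y' = softmax(logits)
    return -np.sum(y_onehot * np.log(y_prime + 1e-12))

logits = np.array([3.0, 1.0, -2.0])      # raw network outputs for 3 classes
y_hat = np.array([1.0, 0.0, 0.0])        # one-hot label for class 1
print(cross_entropy(y_hat, softmax(logits)))
```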
Quick Introduction to Batch Normalization
Changing Landscape
Feature Normalization
Considering Deep Learning
Consider a batch
Batch Normalization
Applicable when the batch size is relatively large.
Batch normalization
Original paper:https://arxiv.org/abs/1502.03167
Batch Normalization - Testing
We do not always have a batch at the testing stage.
Compute the moving averages of the batch statistics (μ and σ) during training and use them at testing time.
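A rough sketch of this training/testing difference: during training the batch's own statistics are used and moving averages are accumulated; at testing time the moving averages replace them. The function names and the momentum value are assumptions.

```python
import numpy as np

def batch_norm_train(z, gamma, beta, running, momentum=0.9, eps=1e-8):
    # z: (batch, features). Normalize with the batch's own mu and sigma,
    # then scale/shift with learnable gamma, beta.
    mu, var = z.mean(axis=0), z.var(axis=0)
    z_hat = (z - mu) / np.sqrt(var + eps)
    # Moving averages of the batch statistics are kept for the testing stage.
    running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return gamma * z_hat + beta

def batch_norm_test(z, gamma, beta, running, eps=1e-8):
    # No batch statistics at test time: use the moving averages instead.
    z_hat = (z - running["mu"]) / np.sqrt(running["var"] + eps)
    return gamma * z_hat + beta

# Usage
running = {"mu": np.zeros(3), "var": np.ones(3)}
gamma, beta = np.ones(3), np.zeros(3)
out = batch_norm_train(np.random.randn(16, 3), gamma, beta, running)
```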
Internal Covariate Shift ?
How Does Batch Normalization Help Optimization?
https://arxiv.org/abs/1805.11604
Experimental results (and theoretical analysis) support that batch normalization changes the landscape of the error surface.
To learn more…
- Batch Renormalization
- Layer Normalization
- Instance Normalization
- Group Normalization
- Weight Normalization
- Spectral Normalization
Convolutional Neural Network (CNN)
Network Architecture designed for Image
Image Classification
Do we really need "fully connected" in image processing?
Observation 1
Identifying some critical patterns
Some patterns are much smaller than the whole image.
A neuron does not have to see the whole image.
Simplification 1
Receptive fields can overlap.
- Can different neurons have different sizes of receptive field ?
- Cover only some channels ?
- Not square receptive field ?
Simplification 1 - Typical Setting
- Look at all channels
- Kernel size (e.g., 3x3)
- Each receptive field has a set of neurons (e.g., 64 neurons).
- Stride (a hyperparameter); neighboring receptive fields overlap
- Padding when a receptive field goes beyond the image
- The receptive fields cover the whole image.
Observation 2
- The same patterns appear in different regions.
Simplification 2
parameter sharing
Two neurons with the same receptive field would not share parameters.
Simplification 2 - Typical Setting
Each receptive field has a set of neurons (e.g., 64 neurons).
The corresponding neurons of different receptive fields share the same set of parameters.
Benefit of Convolutional Layer
Larger model bias (which suits images)
Convolutional Layer
Another story based on filter
Comparison of Two Stories
The neurons with different receptive fields share the parameters.
Each filter convolves over the input image.
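A minimal numpy sketch of the filter story: one shared kernel slides over the image with a stride and produces a feature map (no padding in this sketch; the shapes are made up):

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    # image: (H, W), kernel: (k, k). Each output value is the inner product of
    # the kernel with one receptive field; the same kernel (shared parameters)
    # slides over the whole image.
    H, W = image.shape
    k = kernel.shape[0]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)             # a 3x3 filter
print(conv2d_single(image, kernel).shape) # (4, 4) feature map
```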
Observation 3
- Subsampling the pixels will not change the object
Pooling - Max Pooling
The whole CNN
Image → Convolution → Pooling → Convolution → Pooling → Flatten → Fully Connected Layers → Softmax → Output
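A small sketch of max pooling, which implements the subsampling step between the convolution stages above:

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    # Keep the maximum in each size x size region; subsampling does not change
    # what object is in the image, but it shrinks the feature map.
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    fm = feature_map[:H2 * size, :W2 * size]
    return fm.reshape(H2, size, W2, size).max(axis=(1, 3))

fm = np.random.rand(4, 4)
print(max_pool2d(fm))   # 2x2 pooled map; after repeated conv+pool: flatten -> FC -> softmax
```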
Application: Playing Go
Why use a CNN for playing Go?
More Applications
Speech
https://dl.acm.org/doi/10.1109/TASLP.2014.2339736
Natural Language Processing
https://www.aclweb.org/anthology/S15-2079/
To learn more…
- CNN is not invariant to scaling and rotation (we need data augmentation).
Spatial Transformer Layer https://youtu.be/SoCywZ1hZak (in Mandarin)
Self-attention
Sophisticated Input
- Input is a vector
- Input is a set of vectors
Vector Set as Input
One-hot Encoding
Word Embedding
- Graph is also a set of vectors (consider each node as a vector)
What is the output ?
- Each vector has a label. (Sequence Labeling)
- The whole sequence has a label.
- Model decides the number of labels itself.
self-attention
Attention is all you need. https://arxiv.org/abs/1706.03762
Dot-product
Additive
The query q and the key k produce the attention score α.
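A compact numpy sketch of dot-product self-attention: queries, keys and values come from the same input, the scores are softmax-normalized, and the output is the weighted sum of the values. The scaling by √d and the random shapes are assumptions for illustration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model). q = X Wq, k = X Wk, v = X Wv.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # dot-product attention scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)       # softmax over each row (alpha')
    return A @ V                                # weighted sum of the values

seq_len, d = 4, 8
X = np.random.randn(seq_len, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (4, 8)
```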
Multi-head Self-attention
Different types of relevance
Positional Encoding
- No position information in self-attention.
- Each position has a unique positional vector
- hand-crafted
- learned from data
https://arxiv.org/abs/2003.09229
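As one hand-crafted example, a sketch of the sinusoidal positional encoding from "Attention Is All You Need"; learned positional vectors are an equally valid choice.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Even dimensions use sin, odd dimensions use cos, with wavelengths that
    # grow geometrically; the result is added to the input vectors.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))   # (seq_len, d_model)

print(sinusoidal_positional_encoding(6, 8).shape)
```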
Many applications …
Transformer https://arxiv.org/abs/1706.03762
BERT https://arxiv.org/abs/1810.04805
Widely used in Natural Language Processing (NLP)!
Self-attention for Speech
Truncated Self-attention
Self-attention for Image
Self-Attention GAN https://arxiv.org/abs/1805.08318
Detection Transformer (DETR) https://arxiv.org/abs/2005.12872
Self-attention for Graph
Consider edges: attend only to connected nodes.
This is one type of Graph Neural Network (GNN).
Self-attention v.s. CNN
CNN: self-attention that can attend only within a receptive field
- CNN is simplified self-attention.
Self-attention: CNN with a learnable receptive field
- Self-attention is the complex version of CNN.
On the Relationship between Self-Attention and Convolutional Layers https://arxiv.org/abs/1911.03584
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale https://arxiv.org/pdf/2010.11929.pdf
Self-attention v.s. RNN
Recurrent Neural Network (RNN)
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention https://arxiv.org/abs/2006.16236
To Learn More …
Long Range Arena: A Benchmark for Efficient Transformers https://arxiv.org/abs/2011.04006
Efficient Transformers: A Survey https://arxiv.org/abs/2009.06732
Transformer
Sequence-to-sequence (Seq2seq)
Input a sequence, output a sequence.
The output length is determined by the model.
- Speech Recognition
- Machine Translation
- Speech Translation
Text-to-Speech (TTS) Synthesis
Seq2seq for Chatbot
Most Natural Language Processing applications …
Question Answering (QA)
Seq2seq for Syntactic Parsing
Grammar as a Foreign Language https://arxiv.org/abs/1412.7449
Seq2seq for Multi-label Classification
c.f. Multi-class Classification
An object can belong to multiple classes.
https://arxiv.org/abs/1909.03434
https://arxiv.org/abs/1707.05495
Seq2seq for Object Detection
https://arxiv.org/abs/2005.12872
Seq2seq
Encoder + Decoder
To learn more ……
On Layer Normalization in the Transformer Architecture
PowerNorm: Rethinking Batch Normalization in Transformers
Autoregressive (Speech Recognition as an example)
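A schematic sketch of autoregressive (greedy) decoding: the decoder consumes BOS plus its own previous outputs and stops when it emits EOS, so the model itself decides the output length. `step_fn`, `bos_id` and `eos_id` are hypothetical placeholders for the actual decoder and vocabulary.

```python
import numpy as np

def greedy_decode(encoder_out, step_fn, bos_id=0, eos_id=1, max_len=50):
    # step_fn(encoder_out, tokens) is a placeholder for the decoder forward pass;
    # it should return a probability distribution over the vocabulary.
    tokens = [bos_id]
    for _ in range(max_len):
        probs = step_fn(encoder_out, tokens)
        next_id = int(np.argmax(probs))        # take the most likely token (greedy)
        tokens.append(next_id)
        if next_id == eos_id:                  # output length is decided by the model
            break
    return tokens
```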