
[Study Notes] Machine Learning 2021 Spring, 台湾大学 李宏毅 (National Taiwan University, Hung-yi Lee)

Machine Learning

≈ Looking for Function

General Guidance

Regression

The function outputs a scalar.

Classification

Given options (classes), the function outputs the correct one.

Structured Learning

Create something with structure (image, document).

How to find a function?

  1. Function with Unknown Parameters
  2. Define Loss from Training Data
    • Loss is a function of parameters L(b,w)
    • Loss: how good a set of values is.
  3. Optimization
    • Gradient Descent
      • Local minima & global minima
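
As a concrete illustration of the three steps above, here is a minimal NumPy sketch (the data, learning rate, and step count are made up for this example) that fits the linear model y = b + w·x by gradient descent on the loss L(b, w):

```python
import numpy as np

# Toy data (hypothetical): inputs x and targets y roughly following y = 3x + 7
x = np.arange(1, 11, dtype=float)
y = 3.0 * x + 7.0 + np.random.randn(10) * 0.5

# Step 1: function with unknown parameters, y = b + w * x
b, w = 0.0, 0.0

# Step 2: loss L(b, w) = mean squared error over the training data
def loss(b, w):
    return np.mean((y - (b + w * x)) ** 2)

# Step 3: optimization by gradient descent
lr = 0.01                          # learning rate (a hyperparameter)
for step in range(5000):
    e = y - (b + w * x)            # residuals
    grad_b = -2 * np.mean(e)       # dL/db
    grad_w = -2 * np.mean(e * x)   # dL/dw
    b -= lr * grad_b
    w -= lr * grad_w

print(b, w, loss(b, w))            # b, w approach roughly 7 and 3
```
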
Backpropagation
Hyperparameters
  • learning rate
  • number of Sigmoid / ReLU units
  • Batch Size
  • number of hidden layers
Linear models

Linear models have severe limitations: model bias.

Activation function
  • Sigmoid Function

  • Rectified Linear Unit (ReLU)

Update & Epoch

Split the N training examples into batches. Use batch 1 to compute L1, compute the gradient from L1, and use that gradient to update the parameters.

Then take the next batch 2, compute L2, compute the gradient from L2, and update the parameters again.

Then take the next batch, compute L3, compute the gradient from L3, and update the parameters again.

Each time the parameters are updated is called one update.

Going through all the batches once is called one epoch.

1 epoch = see all the batches once
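
A minimal sketch of this batching loop (toy data, illustrative batch size and learning rate), counting updates and epochs:

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = 3.0 * x + 7.0
b, w, lr, batch_size = 0.0, 0.0, 0.005, 2

n_updates = 0
for epoch in range(10):                      # one epoch = see all the batches once
    idx = np.random.permutation(len(x))      # shuffle: re-split the batches each epoch
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        e = yb - (b + w * xb)
        b -= lr * (-2 * np.mean(e))          # one update per batch
        w -= lr * (-2 * np.mean(e * xb))
        n_updates += 1

print(n_updates)                             # 10 epochs x 5 batches = 50 updates
```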

Fully Connected Feedforward Network
Convolutional Neural Network (CNN)

When Gradient Is Small: Local Minimum and Saddle Point

Optimization Fails because ……

critical point
  • local minima
  • saddle point
which one?

Taylor Series Approximation

  • Gradient is a vector
  • Hessian is a matrix
Hessian

At a critical point the gradient $g$ is zero, so the local behaviour is governed by the Hessian term of the Taylor approximation $L(\theta) \approx L(\theta') + (\theta-\theta')^T g + \frac{1}{2}(\theta-\theta')^T H (\theta-\theta')$. Writing $v = \theta - \theta'$:

For all $v$: $v^T H v > 0$ → Local minima

= $H$ is positive definite = all eigenvalues are positive.

For all $v$: $v^T H v < 0$ → Local maxima

= $H$ is negative definite = all eigenvalues are negative.

Sometimes $v^T H v > 0$, sometimes $v^T H v < 0$ → Saddle point

= some eigenvalues are positive, and some are negative.

Don’t be afraid of saddle points?

H may tell us the parameter update direction!

You can escape the saddle point and decrease the loss.

(this method is seldom used in practice)
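
A small NumPy sketch of this Hessian check (the matrix H here is a made-up example): the signs of the eigenvalues classify the critical point, and the eigenvector of a negative eigenvalue gives the escape direction mentioned above.

```python
import numpy as np

# Hypothetical Hessian H at a critical point (gradient = 0)
H = np.array([[2.0,  0.0],
              [0.0, -1.0]])

eigvals, eigvecs = np.linalg.eigh(H)   # H is symmetric, so eigh applies

if np.all(eigvals > 0):
    print("local minimum")             # H positive definite
elif np.all(eigvals < 0):
    print("local maximum")             # H negative definite
else:
    print("saddle point")
    # Escape direction: the eigenvector of a negative eigenvalue.
    # Moving theta along +/- this direction decreases the loss.
    u = eigvecs[:, np.argmin(eigvals)]
    print("update direction:", u)
```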

Saddle Point vs. Local Minima

Saddle point in higher dimension?

When you have lots of parameters, perhaps local minima are rare?

Tips for training: Batch and Momentum

Optimization with Batch

Shuffle

1 epoch = see all the batches once → shuffle after each epoch

The first epoch splits the data into batches one way.

The second epoch re-splits the data into batches again.

So which examples fall into the same batch changes from epoch to epoch; this is what Shuffle means.

Small Batch vs. Large Batch
  • Larger batch size does not require longer time to compute gradient (unless batch size is too large)
  • Smaller batch size requires longer time for one epoch (longer time for seeing all data once)
  • Smaller batch size has better performance
  • “Noisy” update is better for training
  • Smaller batch is better on testing data?

Batch size is a hyperparameter you have to decide.

Have both fish and the bear’s paw (i.e., the best of both worlds)?

Momentum

(Vanilla) Gradient Descent

Gradient Descent + Momentum

  • Movement: movement of last step minus gradient at present
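
A minimal sketch of the momentum update on a toy quadratic loss (the learning rate and momentum coefficient are illustrative):

```python
import numpy as np

def grad(theta):
    # Hypothetical gradient of some loss; here L(theta) = theta**2
    return 2 * theta

theta, m = 4.0, 0.0
lr, lam = 0.1, 0.9          # learning rate and momentum coefficient

for step in range(200):
    g = grad(theta)
    m = lam * m - lr * g    # movement = last movement minus gradient at present
    theta = theta + m       # gradient descent + momentum

print(theta)                # close to the minimum at 0
```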

Concluding Remarks

  • Critical points have zero gradients.
  • Critical points can be either saddle points or local minima.

    • Can be determined by the Hessian matrix.
    • It is possible to escape saddle points along the direction of eigenvectors of the Hessian matrix.
    • Local minima may be rare.
  • Smaller batch size and momentum help escape critical points.

Tips for training: Adaptive Learning Rate

Training can be difficult even without critical points.

Learning rate cannot be one-size-fits-all

Different parameters need different learning rates.

Root Mean Square

Used in Adagrad

Learning rate adapts dynamically
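
A hedged sketch of this Adagrad-style update (toy gradients with very different scales): σ is the root mean square of all past gradients, so each parameter gets its own effective learning rate.

```python
import numpy as np

def grad(theta):
    # Hypothetical gradient; the two parameters have very different scales
    return np.array([2.0 * theta[0], 200.0 * theta[1]])

theta = np.array([10.0, 10.0])
lr = 1.0
sum_sq = np.zeros_like(theta)           # running sum of squared gradients

for t in range(100):
    g = grad(theta)
    sum_sq += g ** 2
    sigma = np.sqrt(sum_sq / (t + 1))   # root mean square of past gradients
    theta = theta - lr / sigma * g      # per-parameter effective learning rate

print(theta)    # both parameters head toward 0 despite very different gradient scales
```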

RMSProp
Adam

Adam: RMSProp + Momentum

Original paper: https://arxiv.org/pdf/1412.6980.pdf
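
A minimal sketch of the Adam update described in the paper above: a moving average of gradients (the momentum part) plus a moving average of squared gradients (the RMSProp part), with bias correction. The loss and hyperparameter values here are illustrative.

```python
import numpy as np

def grad(theta):
    return 2 * theta            # hypothetical gradient of L(theta) = theta**2

theta = 5.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0                 # momentum term and RMS term

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # momentum: moving average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2     # RMSProp: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(theta)                    # approaches the minimum at 0
```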

Learning Rate Scheduling

Learning Rate Decay

As training goes on, we get closer to the destination, so we reduce the learning rate.

Warm Up

Increase and then decrease?

Please refer to RAdam https://arxiv.org/abs/1908.03265
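
Two illustrative schedules sketched as plain functions: a simple decay shape and a simple warm-up-then-decay shape. The exact formulas vary across papers; these are just assumptions for illustration, not the RAdam schedule.

```python
import math

base_lr, warmup_steps = 1e-3, 100

def lr_decay(step):
    # Learning rate decay: shrink the learning rate as training proceeds
    return base_lr / math.sqrt(step + 1)

def lr_warmup(step):
    # Warm up: increase linearly first, then decay
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr * math.sqrt(warmup_steps / (step + 1))

for step in (0, 50, 100, 500, 999):
    print(step, round(lr_decay(step), 6), round(lr_warmup(step), 6))
```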

Summary of Optimization

(Vanilla) Gradient Descent
Various Improvements
  • Momentum: weighted sum of the previous gradients (considers direction)
  • Adaptive learning rate: root mean square of the gradients (considers only magnitude)
  • Learning rate scheduling

Loss Function: Classification

Classification as Regression?

  • Regression
  • Classification as regression?

Class as one-hot vector
Regression
Classification
Softmax

Loss of Classification

Mean Square Error (MSE)
Cross-entropy

Minimizing cross-entropy is equivalent to maximizing likelihood.
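
A minimal sketch of softmax plus cross-entropy on made-up logits, showing that the cross-entropy against a one-hot target is just the negative log-likelihood of the correct class (MSE is computed only for comparison):

```python
import numpy as np

logits = np.array([3.0, 1.0, -2.0])      # network outputs y before softmax
target = np.array([1.0, 0.0, 0.0])       # the class as a one-hot vector

# Softmax: y'_i = exp(y_i) / sum_j exp(y_j)
exp = np.exp(logits - logits.max())      # subtract the max for numerical stability
y_prime = exp / exp.sum()

# Cross-entropy: e = -sum_i t_i * log(y'_i)
cross_entropy = -np.sum(target * np.log(y_prime))

# Equal to the negative log-likelihood of the correct class
print(cross_entropy, -np.log(y_prime[0]))

# MSE for comparison (works, but harder to optimize for classification)
print(np.mean((y_prime - target) ** 2))
```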

Quick Introduction of Batch Normalization

Changing Landscape

Feature Normalization

Considering Deep Learning

Consider a batch

Batch Normalization

This works when the batch size is large.

Batch normalization

Original paper: https://arxiv.org/abs/1502.03167

Batch normalization-Testing

We do not always have batch at testing stage.

Compute the moving averages of μ and σ over the batches during training, and use them at test time.
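
A minimal NumPy sketch of this behaviour (γ, β, and the moving-average momentum are illustrative scalars here; a real layer such as the one in the paper above learns γ and β per feature):

```python
import numpy as np

gamma, beta = 1.0, 0.0                 # learnable scale and shift
momentum = 0.9
running_mu, running_var = 0.0, 1.0     # moving averages kept during training

def batch_norm_train(z):
    global running_mu, running_var
    mu, var = z.mean(axis=0), z.var(axis=0)     # statistics of this batch
    z_hat = (z - mu) / np.sqrt(var + 1e-5)      # normalize within the batch
    # keep moving averages of mu and sigma for use at test time
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return gamma * z_hat + beta

def batch_norm_test(z):
    # no batch at testing time: normalize with the moving averages instead
    return gamma * (z - running_mu) / np.sqrt(running_var + 1e-5) + beta

batch = np.random.randn(32, 4) * 5 + 10         # 32 examples, 4 features
out = batch_norm_train(batch)
print(out.mean(axis=0), out.std(axis=0))        # roughly 0 and 1 per feature
```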

Internal Covariate Shift?

How Does Batch Normalization Help Optimization?

https://arxiv.org/abs/1805.11604

Experimental results (and theoretical analysis) support that batch normalization changes the landscape of the error surface.

To learn more…

Convolutional Neural Network (CNN)

Network Architecture designed for Image

Image Classification

Do we really need “fully connected” in image processing?

Observation 1

Identifying some critical patterns

Some patterns are much smaller than the whole image.

A neuron does not have to see the whole image.

Simplification 1

Receptive fields can overlap.

  • Can different neurons have different sizes of receptive field?
  • Cover only some channels?
  • Not square receptive field?
Simplification 1 - Typical Setting

  • The receptive field covers all channels.
  • Kernel size (e.g., 3x3).
  • Each receptive field has a set of neurons (e.g., 64 neurons).
  • Stride is a hyperparameter.
  • Neighbouring receptive fields overlap; zero padding is used when a field goes beyond the image.
  • The receptive fields cover the whole image.
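
Under this typical setting, the number of receptive-field positions along one dimension follows the standard output-size formula; a small helper to check it (the kernel size, stride, and padding values are just the examples above):

```python
def n_positions(image_size, kernel_size=3, stride=1, padding=1):
    # Number of receptive-field positions along one spatial dimension
    return (image_size + 2 * padding - kernel_size) // stride + 1

# 100x100 image, 3x3 kernel, stride 1, padding 1: the receptive fields
# cover the whole image and the output keeps the 100x100 spatial size
print(n_positions(100))             # 100
print(n_positions(100, stride=2))   # 50
```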

Observation 2
  • The same patterns appear in different regions.
Simplification 2

parameter sharing

Two neurons with the same receptive field would not share parameters.

Simplification 2 - Typical Setting

Each receptive field has a set of neurons (e.g., 64 neurons).

Each receptive field has the neurons with the same set of parameters.

Benefit of Convolutional Layer

Larger model bias (for image)

Convolutional Layer

Another story based on filter

Comparison of Two Stories

The neurons with different receptive fields share the parameters.

Each filter convolves over the input image.

Observation 3
  • Subsampling the pixels will not change the object
Pooling - Max Pooling
The whole CNN

Image → Convolution → Pooling → Convolution → Pooling → Flatten → Fully Connected Layers → Softmax → output
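
A hedged PyTorch sketch of this pipeline (assuming PyTorch; the channel counts, image size, and number of classes are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

# A minimal sketch of the pipeline above; sizes are illustrative
cnn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),    # convolution (64 filters, 3x3)
    nn.ReLU(),
    nn.MaxPool2d(2),                               # pooling (subsampling)
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # convolution
    nn.ReLU(),
    nn.MaxPool2d(2),                               # pooling
    nn.Flatten(),                                  # flatten
    nn.Linear(128 * 8 * 8, 10),                    # fully connected layer (32x32 -> 16x16 -> 8x8)
)

x = torch.randn(1, 3, 32, 32)                      # one 32x32 RGB image
logits = cnn(x)
probs = torch.softmax(logits, dim=1)               # softmax -> class probabilities
print(probs.shape)                                 # torch.Size([1, 10])
```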

Application: Playing Go

Why CNN for Go playing?

More Applications

Speech

https://dl.acm.org/doi/10.1109/TASLP.2014.2339736

Natural Language Processing

https://www.aclweb.org/anthology/S15-2079/

To learn more…

  • CNN is not invariant to scaling and rotation (we need data augmentation).

Spatial Transformer Layer https://youtu.be/SoCywZ1hZak (in Mandarin)

Self-attention

Sophisticated Input

  • Input is a vector
  • Input is a set of vectors

Vector Set as Input

One-hot Encoding

Word Embedding

  • Graph is also a set of vectors (consider each node as a vector)

What is the output?

  • Each vector has a label. (Sequence Labeling)
  • The whole sequence has a label.
  • The model decides the number of labels itself.

Self-attention

Attention Is All You Need: https://arxiv.org/abs/1706.03762

Computing the attention score (α) from a query and a key:

  • Dot-product
  • Additive
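
A minimal NumPy sketch of single-head dot-product self-attention (dimensions and weight matrices are random placeholders): each input vector produces a query, key, and value; dot-product scores go through a softmax; each output is the score-weighted sum of the values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_in, d_k, d_v = 4, 8, 16, 16       # 4 input vectors of dimension 8

X = rng.standard_normal((n, d_in))     # the set of input vectors
W_q = rng.standard_normal((d_in, d_k))
W_k = rng.standard_normal((d_in, d_k))
W_v = rng.standard_normal((d_in, d_v))

Q, K, V = X @ W_q, X @ W_k, X @ W_v            # query, key, value
A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # dot-product attention scores
out = A @ V                                    # each output attends over the whole set

print(A.shape, out.shape)                      # (4, 4) (4, 16)
```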

Multi-head Self-attention

Different types of relevance

Positional Encoding
  • No position information in self-attention.
  • Each position has a unique positional vector e^i.
  • hand-crafted
  • learned from data

https://arxiv.org/abs/2003.09229
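
A sketch of the hand-crafted sinusoidal positional vectors from "Attention Is All You Need" (the learned-from-data alternative would instead be a trainable embedding table):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    # Hand-crafted positional vectors: sin/cos at geometrically spaced frequencies
    pos = np.arange(n_positions)[:, None]       # positions 0 .. n-1
    i = np.arange(0, d_model, 2)[None, :]       # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dims: sine
    pe[:, 1::2] = np.cos(angle)                 # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(50, 16)
print(pe.shape)                                 # (50, 16): one unique vector per position
```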

Many applications …

Transformer https://arxiv.org/abs/1706.03762

BERT https://arxiv.org/abs/1810.04805

Widely used in Natural Language Processing (NLP)!

Self-attention for Speech

Truncated Self-attention

Self-attention for Image

Self-Attention GAN https://arxiv.org/abs/1805.08318

Detection Transformer (DETR) https://arxiv.org/abs/2005.12872

Self-attention for Graph

Consider edges: only attend to connected nodes.

This is one type of Graph Neural Network (GNN).

Self-attention vs. CNN

CNN: self-attention that can only attend within a receptive field

  • CNN is simplified self-attention.

Self-attention: CNN with a learnable receptive field

  • Self-attention is the complex version of CNN.

On the Relationship between Self-Attention and Convolutional Layers https://arxiv.org/abs/1911.03584

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale https://arxiv.org/pdf/2010.11929.pdf

Self-attention vs. RNN

Recurrent Neural Network (RNN)

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention https://arxiv.org/abs/2006.16236

To Learn More …

Long Range Arena: A Benchmark for Efficient Transformers https://arxiv.org/abs/2011.04006

Efficient Transformers: A Survey https://arxiv.org/abs/2009.06732

Transformer

Sequence-to-sequence (Seq2seq)

Input a sequence, output a sequence

The output length is determined by the model.

  • Speech Recognition
  • Machine Translation
  • Speech Translation
Text-to-Speech (TTS) Synthesis
Seq2seq for Chatbot
Most Natural Language Processing applications …

Question Answering (QA)

Seq2seq for Syntactic Parsing

Grammar as a Foreign Language https://arxiv.org/abs/1412.7449

Seq2seq for Multi-label Classification

cf. Multi-class Classification

An object can belong to multiple classes.

https://arxiv.org/abs/1909.03434

https://arxiv.org/abs/1707.05495

Seq2seq for Object Detection

https://arxiv.org/abs/2005.12872

Seq2seq

Encoder Decoder

To learn more ……

Autoregressive (Speech Recognition as an example)
