Class 1

Recommendation: L02-L03, L09

Class 2

Recommendation: L01-L04

cs231

Paper 1

Deep Learning

Norm

$$
\|\mathbf{w}\|^{2}=\mathbf{w}^{\top} \mathbf{w}
$$

$$
\|\mathbf{x}\|_{1}=\sum_{i=1}^{n}\left|x_{i}\right|
$$

$$
\|\mathbf{x}\|_{p}:=\left(\sum_{i=1}^{n}\left|x_{i}\right|^{p}\right)^{1 / p}
$$

$$
\|\mathbf{x}\|_{\infty}:=\max_{i}\left|x_{i}\right|
$$
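
As a quick sanity check, these norms are easy to compute in PyTorch; the sketch below uses a made-up vector and compares the direct formulas with `torch.linalg.norm`.

```python
import torch

x = torch.tensor([3.0, -4.0, 0.0])

l2   = torch.sqrt((x * x).sum())   # ||x||_2  = 5
l1   = x.abs().sum()               # ||x||_1  = 7
linf = x.abs().max()               # ||x||_inf = 4

print(l2.item(), l1.item(), linf.item())
print(torch.linalg.norm(x),                        # defaults to the L2 norm
      torch.linalg.norm(x, ord=1),
      torch.linalg.norm(x, ord=float('inf')))
```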

Weight decay

L2 norm penalty term
$$
L(\mathbf{w}, b)=\frac{1}{n} \sum_{i=1}^{n} \frac{1}{2}\left(\mathbf{w}^{\top} \mathbf{x}^{(i)}+b-y^{(i)}\right)^{2}
$$
Recall that x(i) are the features of example i, y(i) is its label, and (w, b) are the weight and bias parameters. To penalize the size of the weight vector, we must somehow add $\|\mathbf{w}\|^{2}$ to the loss function, but how should the model trade off the original loss against this additional penalty? In practice, we characterize the trade-off via the regularization constant λ, a non-negative hyperparameter that we fit using validation data:
$$
L(\mathbf{w}, b)+\frac{\lambda}{2}\|\mathbf{w}\|^{2}
$$
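
Taking gradients of the penalized objective shows why this is called weight decay: with learning rate η and minibatch B, each SGD step first shrinks the weights by a factor (1 − ηλ) before applying the usual gradient update,
$$
\mathbf{w} \leftarrow(1-\eta \lambda) \mathbf{w}-\frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)}\left(\mathbf{w}^{\top} \mathbf{x}^{(i)}+b-y^{(i)}\right)
$$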

Ignore Regularization

Now run this code with lambd = 0 to disable weight decay. Note that the training error decreases while the test error does not, which means severe overfitting has occurred.

![image-20220502124712314](/Users/zjuchy/Library/Application Support/typora-user-images/image-20220502124712314.png)

In the code below, we specify the weight decay hyperparameter directly via weight_decay when instantiating the optimizer. By default, PyTorch decays both weights and biases; here we set weight_decay only for the weights, so the bias parameter b is not decayed.

If λ is too large, both the training and the test loss become large (underfitting); if λ is too small, the training loss is low but the test loss stays high (overfitting).

```python
import torch
from torch import nn
from d2l import torch as d2l

# num_inputs, train_iter and test_iter are assumed to come from the data setup earlier
def train_concise(wd):
    net = nn.Sequential(nn.Linear(num_inputs, 1))
    for param in net.parameters():
        param.data.normal_()
    loss = nn.MSELoss(reduction='none')
    num_epochs, lr = 100, 0.003
    # The bias parameter does not decay
    trainer = torch.optim.SGD([
        {"params": net[0].weight, 'weight_decay': wd},
        {"params": net[0].bias}], lr=lr)
    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                            xlim=[5, num_epochs], legend=['train', 'test'])
    for epoch in range(num_epochs):
        for X, y in train_iter:
            trainer.zero_grad()
            l = loss(net(X), y)
            l.mean().backward()
            trainer.step()
        if (epoch + 1) % 5 == 0:
            animator.add(epoch + 1,
                         (d2l.evaluate_loss(net, train_iter, loss),
                          d2l.evaluate_loss(net, test_iter, loss)))
    print('L2 norm of w:', net[0].weight.norm().item())
```
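
The same effect can be obtained without the optimizer's weight_decay argument by adding the penalty to the loss by hand; a minimal sketch (with lambd standing in for λ):

```python
def l2_penalty(w):
    # Squared L2 norm of the weights; dividing by 2 gives the cleaner gradient lambd * w
    return torch.sum(w.pow(2)) / 2

# Inside the training loop, add the penalty to the data loss, e.g.:
#   l = loss(net(X), y) + lambd * l2_penalty(net[0].weight)
```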

Dropout

The expected value remains unchanged

![image-20220502141916069](/Users/zjuchy/Library/Application Support/typora-user-images/image-20220502141916069.png)
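
With dropout probability p, each activation h is zeroed out with probability p and scaled up by 1/(1 − p) otherwise, so the expectation is preserved:
$$
h^{\prime}=\begin{cases}0 & \text{with probability } p \\ \frac{h}{1-p} & \text{otherwise}\end{cases}, \qquad E\left[h^{\prime}\right]=h
$$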

With h2 and h5 removed, the computation of the output no longer depends on h2 or h5, and their respective gradients also vanish during backpropagation. In this way, the output layer cannot rely excessively on any single element of h1, …, h5.

![image-20220502142041423](/Users/zjuchy/Library/Application Support/typora-user-images/image-20220502142041423.png)

Dropout is applied during training but not at test time.

```python
import torch

def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    # In this case, all elements are dropped
    if dropout == 1:
        return torch.zeros_like(X)
    # In this case, all elements are kept
    if dropout == 0:
        return X
    mask = (torch.rand(X.shape) > dropout).float()
    return mask * X / (1.0 - dropout)

X = torch.arange(16, dtype=torch.float32).reshape((2, 8))
print(X)
print(dropout_layer(X, 0.))
print(dropout_layer(X, 0.5))
print(dropout_layer(X, 1.))
```
```
tensor([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11., 12., 13., 14., 15.]])
tensor([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11., 12., 13., 14., 15.]])
tensor([[ 0.,  2.,  4.,  0.,  8., 10.,  0.,  0.],
        [ 0., 18.,  0.,  0.,  0.,  0., 28., 30.]])
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])
```
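
In a full model, the built-in nn.Dropout layer handles this automatically: it masks activations in training mode and becomes the identity after calling net.eval(). A minimal sketch (the hidden size 256 and dropout probability 0.5 are assumed for illustration):

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(8, 256), nn.ReLU(),
                    nn.Dropout(0.5),          # active only in training mode
                    nn.Linear(256, 1))

X = torch.randn(2, 8)
net.train()      # dropout randomly zeroes (and rescales) hidden activations
y_train = net(X)
net.eval()       # dropout is a no-op, so the output is deterministic
y_test = net(X)
```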

Universal Approximation Theorem

Network Depth

Increasing the number of parameters is not as effective as increasing depth

Deeper networks generalize better

Space Folding Intuition

![image-20220502133430113](/Users/zjuchy/Library/Application Support/typora-user-images/image-20220502133430113.png)

Geometric explanation of the exponential advantage of deeper networks

Mirror axis of symmetry given by the hyperplane (defined by weights and bias)

Complex functions arise as mirrored images of simpler patterns

In this case, the points can be separated by a simple curve.

KL Divergence

A way of quantifying the difference between two probability distributions P and Q; also known as relative entropy.

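For discrete distributions, it is defined as
$$
D_{\mathrm{KL}}(P \| Q)=\sum_{x} P(x) \log \frac{P(x)}{Q(x)}
$$
It is non-negative and equals zero only when P = Q, but it is not symmetric in P and Q.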

Loss function

For 2 classes, we use the binary cross-entropy loss (BCE)

For C > 2 classes, we use the cross-entropy loss (CE)

maximum likelihood principle

Minimizing the squared loss (= L2 loss) corresponds to maximum likelihood under Gaussian noise; the L1 loss corresponds to Laplace noise.

Laplace Distribution

Gaussian Distribution

Bernoulli Distribution

$$
p(y)=\mu^{y}(1-\mu)^{(1-y)}
$$

binary cross-entropy (BCE)
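
Taking the negative log-likelihood of the Bernoulli distribution gives exactly the binary cross-entropy:
$$
-\log p(y)=-\left[y \log \mu+(1-y) \log (1-\mu)\right]
$$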

softmax

$$
\operatorname{softmax}(\mathbf{x})=\left(\frac{\exp \left(x_{1}\right)}{\sum_{k=1}^{C} \exp \left(x_{k}\right)}, \cdots, \frac{\exp \left(x_{C}\right)}{\sum_{k=1}^{C} \exp \left(x_{k}\right)}\right)
$$

The softmax is a multi-class generalization of the sigmoid function

For 2 classes, we can predict 1 value and use a sigmoid, or 2 values with softmax

For C > 2 classes we typically predict C scores and use a softmax non-linearity
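
A minimal sketch of the two-class case: a softmax over the scores (s, 0) gives the same probability as a sigmoid over the single score s.

```python
import torch

s = torch.tensor([1.7])                  # one predicted score for the positive class
scores = torch.tensor([1.7, 0.0])        # two scores, one per class

print(torch.sigmoid(s))                  # P(class 1) via sigmoid
print(torch.softmax(scores, dim=0)[0])   # identical probability via softmax
```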

Numerical Differentiation

$$
\frac{\partial f(x)}{\partial x}=\lim _{h \rightarrow 0} \frac{f(x+h)-f(x-h)}{2 h}
$$
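
In practice the central difference is mainly used to check analytic gradients rather than to train; a minimal sketch comparing it against autograd:

```python
import torch

def numerical_grad(f, x, h=1e-5):
    # Central (symmetric) difference approximation of df/dx at x
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 3
x = torch.tensor(2.0, requires_grad=True)
f(x).backward()                       # analytic gradient via autograd: 3 * x^2
print(x.grad.item())                  # 12.0
print(numerical_grad(f, 2.0))         # ~12.0
```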

Zero-center

Normalization

![image-20220502194635518](/Users/zjuchy/Library/Application Support/typora-user-images/image-20220502194635518.png)

AlexNet: Subtract mean image (mean image: W × H × 3 numbers)

VGGNet: Subtract per-channel mean (mean along each channel: 3 numbers)

ResNet: Subtract per-channel mean and divide by per-channel std. dev. (mean along each channel: 3 numbers)

Whitening is less common
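
For example, ResNet-style per-channel normalization is commonly written with torchvision transforms; the mean/std values below are the widely used ImageNet statistics, assumed here only for illustration.

```python
import torchvision.transforms as T

preprocess = T.Compose([
    T.ToTensor(),                               # HWC uint8 image -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],     # subtract per-channel mean (3 numbers)
                std=[0.229, 0.224, 0.225]),     # divide by per-channel std. dev. (3 numbers)
])
```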

k-fold cross-validation

The training set is split into k subsamples. One subsample is held out as validation data, and the remaining k − 1 are used for training. The procedure is repeated k times so that each subsample is used for validation exactly once; the k results are then averaged (or otherwise combined) into a single estimate.
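
A minimal sketch with scikit-learn's KFold; the trivial mean predictor only stands in for a real model so the loop is runnable.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.randn(100, 5)            # toy features
y = np.random.randn(100)               # toy targets

scores = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pred = y[train_idx].mean()                           # "train" on k-1 folds
    scores.append(np.mean((y[valid_idx] - pred) ** 2))   # validate on the held-out fold

print('average validation MSE over 5 folds:', np.mean(scores))
```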

Computer Vision

Homogeneous vectors (homogeneous coordinates)
$$
\tilde{\mathbf{x}}=\left(\begin{array}{c}
\tilde{x} \\
\tilde{y} \\
\tilde{w}
\end{array}\right) \in \mathbb{P}^{2}
$$
(1, 1, 1) and (2, 2, 2) represent the same point: homogeneous vectors that differ only by scale are equivalent.
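
To convert back to ordinary (inhomogeneous) coordinates, divide by the last component (assuming $\tilde{w} \neq 0$):
$$
\mathbf{x}=\left(\frac{\tilde{x}}{\tilde{w}}, \frac{\tilde{y}}{\tilde{w}}\right)^{\top}
$$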