Computer-Vision
Recommendation : L02-L03,L09
Recommendation : L01-L04
Deep Learning
Norm
$$
\|\mathbf{w}\|^{2}=\mathbf{w}^{\top} \mathbf{w}
$$
$$
\|\mathbf{x}\|_{1}=\sum_{i=1}^{n}\left|x_{i}\right|
$$
$$
\|\mathbf{x}\|_{p}:=\left(\sum_{i=1}^{n}\left|x_{i}\right|^{p}\right)^{1 / p}
$$
$$
\|\mathbf{x}\|_{\infty}:=\max _{i}\left|x_{i}\right|
$$
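As a quick check of the norm definitions, here is a minimal sketch in plain Python. The helper names `lp_norm` and `linf_norm` are made up for this note and are not from any library.

```python
def lp_norm(x, p):
    """L_p norm of a sequence of numbers, for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def linf_norm(x):
    """L-infinity norm: the largest absolute entry."""
    return max(abs(v) for v in x)


x = [3.0, -4.0]
l1 = lp_norm(x, 1)                   # sum of absolute values
l2 = lp_norm(x, 2)                   # Euclidean length
linf = linf_norm(x)                  # largest absolute entry
l2_squared = sum(v * v for v in x)   # x^T x, the squared L2 norm
```

For this vector the squared L2 norm equals `l2 ** 2`, which is the identity in the first formula.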
Weight decay
L2 norm penalty term
$$
L(\mathbf{w}, b)=\frac{1}{n} \sum_{i=1}^{n} \frac{1}{2}\left(\mathbf{w}^{\top} \mathbf{x}^{(i)}+b-y^{(i)}\right)^{2}
$$
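Recall that x^(i) are the features of sample i, y^(i) is the label of sample i, and (w, b) are the weight and bias parameters. To penalize the size of the weight vector, we must somehow add ‖w‖² to the loss function, but how should the model trade off the original loss against this new additive penalty? In practice, we describe this trade-off with the regularization constant λ, a non-negative hyperparameter that we fit using validation data: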
$$
L(\mathbf{w}, b)+\frac{\lambda}{2}\|\mathbf{w}\|^{2}
$$
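The effect of the penalty on a gradient step can be sketched as follows. Differentiating the penalized loss adds λw to the weight gradient, so each SGD step also shrinks the weights by a factor (1 − ηλ). This is a plain-Python illustration, not the PyTorch code these notes refer to; the function name, `lr` (the learning rate η) and the toy values are assumptions.

```python
def sgd_step_with_weight_decay(w, b, grad_w, grad_b, lr, lambd):
    """One SGD step on L(w, b) + (lambd / 2) * ||w||^2.

    The penalty contributes lambd * w to the weight gradient.
    The bias is not penalized, so it gets a plain SGD update.
    """
    new_w = [wi - lr * (gi + lambd * wi) for wi, gi in zip(w, grad_w)]
    new_b = b - lr * grad_b
    return new_w, new_b


w, b = [1.0, -2.0], 0.5
# With a zero data gradient, the step only scales the weights by (1 - lr * lambd).
new_w, new_b = sgd_step_with_weight_decay(w, b, [0.0, 0.0], 0.0, lr=0.1, lambd=3.0)
```

Setting `lambd=0` recovers ordinary SGD, which is the "no regularization" case.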
Ignore Regularization
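Now run this code with lambd = 0, which disables weight decay. Note that the training error decreases but the test error does not, which indicates severe overfitting.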

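In the code below, we specify the weight decay hyperparameter directly through weight_decay when instantiating the optimizer. By default, PyTorch decays both weights and biases. Here we set weight_decay only for the weights, so the bias parameter b is not decayed.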
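If λ is too large, both the train and test losses become large (underfitting); if λ is too small, the train loss is low but the test loss is high (overfitting).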
def train_concise(wd):
Dropout
The expected value of each activation remains unchanged, because the retained activations are rescaled.

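When h2 and h5 are dropped, the output computation no longer depends on h2 or h5, and their gradients also vanish during backpropagation. As a result, the output layer cannot rely too heavily on any single element of h1, …, h5.
Dropout is applied during training but not at test time.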
import torch
tensor([[ 0., 1., 2., 3., 4., 5., 6., 7.],
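Since the original code is truncated here, the following is a minimal plain-Python sketch of inverted dropout rather than a reconstruction of it. The function name, the `training` flag and the `rng` parameter are assumptions. Each element is zeroed with probability p and survivors are scaled by 1 / (1 − p), which keeps the expected value unchanged.

```python
import random


def dropout_layer(h, p, training=True, rng=random):
    """Inverted dropout on a list of activations.

    Each element is zeroed with probability p; survivors are divided by (1 - p).
    At test time (training=False) the input is returned unchanged.
    """
    assert 0.0 <= p <= 1.0
    if not training or p == 0.0:
        return list(h)
    if p == 1.0:
        return [0.0 for _ in h]
    # rng.random() is uniform on [0, 1), so an element is kept with probability 1 - p.
    return [v / (1.0 - p) if rng.random() > p else 0.0 for v in h]


rng = random.Random(0)                 # fixed seed so the run is repeatable
h = [1.0] * 100000
out = dropout_layer(h, 0.5, rng=rng)
mean_out = sum(out) / len(out)         # stays close to the input mean of 1.0
test_out = dropout_layer(h[:3], 0.5, training=False)
```

With p = 0.5 every output is either 0 or 2, and the sample mean stays near 1.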
Universal Approximation Theorem
Network Depth
Increasing the number of parameters is not as effective as increasing depth
Deeper networks generalize better
Space Folding Intuition

Geometric explanation of the exponential advantage of deeper networks
Mirror axis of symmetry given by the hyperplane (defined by weights and bias)
Complex functions arise as mirrored images of simpler patterns
In this case, the points can be divided by a simple curve.
KL Divergence
A way to quantify the difference between two probability distributions P and Q; also called relative entropy.
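For discrete distributions the standard definition is D_KL(P‖Q) = Σ_i p_i log(p_i / q_i). A minimal sketch in plain Python follows. The function name is made up for this note, and the code assumes q_i > 0 wherever p_i > 0.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats.

    Terms with p_i == 0 contribute nothing and are skipped.
    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)   # zero when the two distributions are identical
```

The two directions give different values, so KL divergence is not symmetric and is not a distance metric.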


Loss function
For 2 classes, we use the binary cross-entropy loss (BCE)
For C > 2 classes, we use the cross-entropy loss (CE)
maximum likelihood principle
minimize the squared loss (= L2 loss)
Laplace Distribution
Gaussian Distribution
Bernoulli Distribution
$$
p(y)=\mu^{y}(1-\mu)^{(1-y)}
$$
binary cross-entropy (BCE)
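Taking the negative logarithm of the Bernoulli likelihood above gives the BCE loss for a single sample with label y ∈ {0, 1}. Here μ is read as the predicted probability of class 1, which is an interpretation assumed for this note:

$$
-\log p(y)=-y \log \mu-(1-y) \log (1-\mu)
$$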
softmax
$$
\operatorname{softmax}(\mathbf{x})=\left(\frac{\exp \left(x_{1}\right)}{\sum_{k=1}^{C} \exp \left(x_{k}\right)}, \cdots, \frac{\exp \left(x_{C}\right)}{\sum_{k=1}^{C} \exp \left(x_{k}\right)}\right)
$$
The softmax is a multi-class generalization of the sigmoid function
For 2 classes, we can predict 1 value and use a sigmoid, or 2 values with softmax
For C > 2 classes we typically predict C scores and use a softmax non-linearity
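A minimal sketch of the softmax formula and its relation to the sigmoid, in plain Python. Subtracting the maximum score before exponentiating is a common numerical-stability trick added here; it is not in the formula above, but it does not change the result.

```python
import math


def softmax(x):
    """Softmax over a list of scores.

    Subtracting max(x) avoids overflow in exp without changing the output.
    """
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


probs = softmax([2.0, 1.0, 0.1])       # C = 3 scores -> 3 probabilities
# Two-class case: softmax over [z, 0] gives sigmoid(z) for the first class.
two_class = softmax([1.5, 0.0])[0]
large = softmax([1000.0, 1000.0])      # would overflow without the max shift
```

This is why one sigmoid output is enough for 2 classes: only the difference between the two scores matters.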
Numerical Differentiation
$$
\frac{\partial f(x)}{\partial x}=\lim _{h \rightarrow 0} \frac{f(x+h)-f(x-h)}{2 h}
$$
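In practice the limit is replaced by a small finite h. A minimal sketch of the central difference follows; the step size `h=1e-5` is an assumed default, not a value from these notes.

```python
def central_difference(f, x, h=1e-5):
    """Approximate df/dx at x with the symmetric (central) difference quotient."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# f(x) = x^3 has the analytic derivative 3 * x^2, which is 12 at x = 2.
approx = central_difference(lambda x: x ** 3, 2.0)
```

A typical use is gradient checking: compare this estimate against an analytic or backpropagated gradient.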
Zero-center
Normalization

AlexNet: Subtract mean image (mean image: W × H × 3 numbers)
VGGNet: Subtract per-channel mean (mean along each channel: 3 numbers)
ResNet: Subtract per-channel mean and divide by per-channel std. dev. (mean along each channel: 3 numbers)
Whitening is less common
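The per-channel variant (3 mean values, 3 standard deviations) can be sketched in plain Python as below. The function names and the nested-list image layout are assumptions for illustration. In practice the statistics are computed over the whole training set rather than one image, and the code assumes no channel has zero standard deviation.

```python
def per_channel_stats(image):
    """image: list of rows, each row a list of (r, g, b) pixels.

    Returns (means, stds), each a list of 3 numbers (one per channel).
    """
    pixels = [px for row in image for px in row]
    n = len(pixels)
    means = [sum(px[c] for px in pixels) / n for c in range(3)]
    stds = [(sum((px[c] - means[c]) ** 2 for px in pixels) / n) ** 0.5
            for c in range(3)]
    return means, stds


def normalize(image, means, stds):
    """Subtract the per-channel mean and divide by the per-channel std. dev."""
    return [[tuple((px[c] - means[c]) / stds[c] for c in range(3)) for px in row]
            for row in image]


image = [[(0, 10, 100), (2, 20, 200)],
         [(4, 30, 300), (6, 40, 400)]]
means, stds = per_channel_stats(image)
normalized = normalize(image, means, stds)
```

Dropping the division by `stds` gives the per-channel mean subtraction alone.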
k-fold cross-validation
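Split the training set into k subsamples. One subsample is held out as validation data, and the other k − 1 subsamples are used for training. Cross-validation is repeated k times so that each subsample is used for validation exactly once; the k results are then averaged (or combined in some other way) to produce a single estimate.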
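The splitting step can be sketched as an index generator in plain Python. The function name is hypothetical. This version uses contiguous folds without shuffling; shuffling the indices first is common but omitted here.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation over n samples.

    Folds are contiguous and differ in size by at most one.
    Each sample index appears in exactly one validation fold.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
# A model would be trained on train_idx and scored on val_idx for each fold;
# the k scores are then averaged into a single estimate.
```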
Computer Vision
Homogeneous vectors (homogeneous coordinates)
$$
\tilde{\mathbf{x}}=\left(\begin{array}{c}
\tilde{x} \\
\tilde{y} \\
\tilde{w}
\end{array}\right) \in \mathbb{P}^{2}
$$
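(1, 1, 1) and (2, 2, 2) represent the same point: homogeneous vectors that differ only by a non-zero scale factor are equivalent.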
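The scale equivalence can be checked with a short plain-Python sketch. The function names are made up for this note. The test uses the fact that two non-zero 3-vectors are parallel exactly when their cross product is zero, and it assumes neither input is the zero vector, which is not a valid homogeneous vector.

```python
def to_inhomogeneous(xh):
    """Map homogeneous (x, y, w) with w != 0 to the 2D point (x / w, y / w)."""
    x, y, w = xh
    if w == 0:
        raise ValueError("a point at infinity has no inhomogeneous representation")
    return (x / w, y / w)


def same_point(a, b, tol=1e-12):
    """True if non-zero homogeneous vectors a and b differ only by scale.

    Checks that the cross product of a and b is (numerically) zero.
    """
    cross = (a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0])
    return all(abs(c) <= tol for c in cross)


equal = same_point((1, 1, 1), (2, 2, 2))       # same 2D point
different = same_point((1, 1, 1), (1, 2, 1))   # different 2D points
point = to_inhomogeneous((2, 2, 2))
```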


