Object Recognition
ImageNet
2012 CNN
Image classification
object detection
CNN
2012 AlexNet: 7-8 layers
1998: LeCun's LeNet, the CNN is born
Why did it take off in 2012?
- More computation -> larger networks
- Better data: high-quality labeled datasets
Truly deep understanding of a picture is still a long way off.
An image in a computer is a giant grid of numbers -> the problem: the Semantic Gap
challenges: Viewpoint variation/ Illumination/ Deformation/ Occlusion/ Background Clutter/ Intraclass variation
An image classifier cannot be written down as an explicit algorithm.
Maybe find edges, then find corners...
- Collect a dataset of images and labels
- Use ML to train a classifier
- Evaluate the classifier on new images
Two functions:

def train(images, labels):
    # Machine Learning!
    return model

def predict(model, test_images):
    # use model to predict labels
    return test_labels
Dataset: CIFAR10
- Distance metric to compare images
- L1 distance: $d_1(I_1,I_2)=\sum_p\left|I_1^p-I_2^p\right|$
import numpy as np

class NearestNeighbor:
    def __init__(self):
        pass

    def train(self, X, y):
        """ X is N x D where each row is an example. y is 1-d of size N """
        # the nearest neighbor classifier simply remembers all the training data
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        """ X is N x D where each row is an example we wish to predict a label for """
        num_test = X.shape[0]
        # make sure that the output type matches the input type
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        # loop over all test rows
        for i in range(num_test):
            # find the nearest training image to the i-th test image
            # using the L1 distance (sum of absolute value differences)
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)  # broadcasting
            min_index = np.argmin(distances)  # get the index with smallest distance
            Ypred[i] = self.ytr[min_index]    # predict the label of the nearest example
        return Ypred
- Train: O(1); predict: O(N)
- We want fast at prediction; slow for training is ok
- K-Nearest Neighbors: take a majority vote from the K closest points
- Choosing the distance metric:
  - L1 (Manhattan) distance
  - L2 (Euclidean) distance: $d_2(I_1,I_2)=\sqrt{\sum_p(I_1^p-I_2^p)^2}$
  - L1 distance depends on your choice of coordinate frame, but L2 does not
  - That means if the individual dimensions of the data have specific meanings, it may be better to use L1 distance
Such K and distance are hyperparameters: cannot learn from data, depend on the problem
- It is wrong to choose hyperparameters that work best on the training data or on the test data: evaluate on the test set only a single time, at the very end
- Split data into train, validation, and test sets; choose hyperparameters on the validation set and evaluate on the test set
- The splits people tend to use are between 50% and 90% of the training data for training, with the rest for validation
- Cross-validation: computationally expensive, and not usually used in deep learning
- Typical number of folds you can see in practice would be 3-fold, 5-fold or 10-fold cross-validation.
- If the number of hyperparameters is large you may prefer to use bigger validation splits. If the number of examples in the validation set is small (perhaps only a few hundred or so), it is safer to use cross-validation.
# assume we have Xtr_rows, Ytr, Xte_rows, Yte as before
# recall Xtr_rows is the 50,000 x 3072 matrix
Xval_rows = Xtr_rows[:1000, :]  # take first 1000 for validation
Yval = Ytr[:1000]
Xtr_rows = Xtr_rows[1000:, :]   # keep last 49,000 for training
Ytr = Ytr[1000:]

# find hyperparameters that work best on the validation set
validation_accuracies = []
for k in [1, 3, 5, 10, 20, 50, 100]:
    # use a particular value of k and evaluate on validation data
    nn = NearestNeighbor()
    nn.train(Xtr_rows, Ytr)
    # here we assume a modified NearestNeighbor class that can take a k as input
    Yval_predict = nn.predict(Xval_rows, k=k)
    acc = np.mean(Yval_predict == Yval)
    print('accuracy: %f' % (acc,))
    # keep track of what works on the validation set
    validation_accuracies.append((k, acc))
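A minimal sketch of such a k-aware predict method (assuming the same self.Xtr / self.ytr attributes as the NearestNeighbor class above, and integer labels so that np.bincount works):

def predict(self, X, k=1):
    """ Predict by majority vote over the k nearest training examples. """
    num_test = X.shape[0]
    Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
    for i in range(num_test):
        distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
        knn_idx = np.argsort(distances)[:k]                 # indices of the k closest rows
        Ypred[i] = np.bincount(self.ytr[knn_idx]).argmax()  # majority vote (integer labels assumed)
    return Ypred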
Problems:
- Very slow at test time: we often care about test-time efficiency much more than training-time efficiency
- Distance metrics on pixels are not informative
- Curse of dimensionality
- images are high-dimensional objects (i.e. they often contain many pixels), and distances over high-dimensional spaces can be very counter-intuitive
- the pixel-wise distance does not correspond at all to perceptual or semantic similarity.
Parametric Approach:
f(x,W) = Wx + b: x is the input data; W is the parameters (weights); b is the bias vector
- Shapes: $x: 32\times32\times3 = 3072\times1$; $W: 10\times3072$; $b: 10\times1$; $f(x,W): 10\times1$
- Problem: a linear classifier can only learn one template per category; hard cases: multimodal data, the odd-even problem
- We can interpret a linear classifier as template matching: each row of W corresponds to a template
-
A commonly used trick is to combine the two sets of parameters W,b into a single matrix that holds both of them by extending the vector x with one additional dimension that always holds the constant 1- a default bias dimension. With the extra dimension, the new score function will simplify to a single matrix multiply: f(x,W) = Wx
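A small NumPy sketch of the bias trick (shapes as in the example above):

import numpy as np

W = np.random.randn(10, 3072)           # weights
b = np.random.randn(10, 1)              # biases
x = np.random.randn(3072, 1)            # one image stretched into a column

Wb = np.hstack([W, b])                  # 10 x 3073: bias folded into the last column
xb = np.vstack([x, [[1.0]]])            # 3073 x 1: constant-1 bias dimension

assert np.allclose(W @ x + b, Wb @ xb)  # a single matrix multiply gives the same scores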
-
Image data preprocessing: change the raw pixel values from [0,255] to [-1,1]
- Quantify how good the model is: a loss function
- Loss over the dataset: $L=\frac{1}{N}\sum_iL_i(f(x_i,W),y_i)$
Multiclass SVM loss:
- Given an example $(x_i,y_i)$, where $x_i$ is the image and $y_i$ is the (integer) label
- Using the shorthand for the scores vector: $s=f(x_i,W)$
- SVM loss: $$L_i=\sum_{j\neq y_i}\begin{cases} 0 & \text{if } s_{y_i}\ge s_j+1 \\ s_j-s_{y_i}+1 & \text{otherwise}\end{cases}=\sum_{j\neq y_i}\max(0,s_j-s_{y_i}+1)$$
- Hinge loss (-> squared hinge loss)
- The SVM loss function wants the score of the correct class $y_i$ to be larger than the incorrect class scores by at least Δ (delta). If this is not the case, we accumulate loss.
- Here we set the safety margin to 1. When we start training, the scores are all approximately zero, so the expected loss equals (number of classes - 1)
- There is no difference between using sum or mean (scaling does not matter)
- W is not unique: 2W, 3W, ... give the same (zero) loss

def L_i_vectorized(x, y, W):
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    loss_i = np.sum(margins)
    return loss_i
- Adding a regularization loss: $L=\frac{1}{N}\sum_iL_i(f(x_i,W),y_i)+\lambda R(W)$
- The model should be simple, to avoid over-fitting (not just fitting the training data): the more complex the model, the greater the penalty
- Occam's razor
- $\lambda$ is an important hyperparameter: the regularization strength. When combined with gradient descent, it becomes Weight Decay
- Why regularize?
  - Express preferences over weights
  - Make the model simple so it works on test data
  - Improve optimization by adding curvature
- In common use:
  - L2 regularization: $R(W)=\sum_k\sum_l W_{k,l}^2$
    - penalizes the Euclidean norm of the weight vector
    - spreads the influence across all elements of x, rather than depending only on a few of them (overfitting)
  - L1 regularization: $R(W)=\sum_k\sum_l |W_{k,l}|$
    - encourages sparsity in the matrix W
  - Elastic net (L1 + L2): $R(W)=\sum_k\sum_l (\beta W_{k,l}^2+|W_{k,l}|)$
Softmax Classifier (Multinomial Logistic Regression)
- Probability distribution: $P(Y=k|X=x_i)=\frac{e^{s_k}}{\sum_je^{s_j}}$
- Minimizing the negative log likelihood of the correct class -- Maximum Likelihood Estimation: $L_i=-\log P(Y=y_i|X=x_i)=-s_{y_i}+\log\sum_je^{s_j}$
- We can also interpret the regularization term $R(W)$ in the full loss function as coming from a Gaussian prior over the weight matrix $W$, where instead of MLE we are performing Maximum a Posteriori (MAP) estimation
- Min loss is 0; max loss is infinity (neither is actually reached)
- When we start training, the scores are approximately zero, so the expected loss equals $\log(C)$
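A quick numeric check of that claim (10 classes, scores all zero):

import numpy as np

scores = np.zeros(10)                       # scores approximately zero at init
p = np.exp(scores) / np.sum(np.exp(scores)) # every class gets probability 1/10
loss = -np.log(p[0])
print(loss, np.log(10))                     # both approximately 2.3026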
Practical issue: numerical stability
- The intermediate terms $e^{s_k}$ and $\sum_je^{s_j}$ can be very large, and dividing large numbers can be numerically unstable
- So we use a normalization trick: $\frac{e^{s_k}}{\sum_je^{s_j}}=\frac{Ce^{s_k}}{C\sum_je^{s_j}}=\frac{e^{s_k+\log C}}{\sum_je^{s_j+\log C}}$
- A common choice for $C$ is $\log C=-\max_js_j$ (shifting the values inside the vector $s$ so that the highest value is zero)
f = np.array([123, 456, 789])      # example with 3 classes and each having large scores
p = np.exp(f) / np.sum(np.exp(f))  # Bad: numeric problem, potential blowup
# instead: first shift the values of f so that the highest number is 0:
f -= np.max(f)                     # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f))  # safe to do, gives the correct answer
- Cross-entropy loss:
  - The cross-entropy between a "true" distribution p and an estimated distribution q is defined as: $H(p,q)=-\sum_xp(x)\log q(x)$
  - The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ($q=\frac{e^{s_k}}{\sum_je^{s_j}}$) and the "true" distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. $p=[0,…,1,…,0]$ contains a single 1 at the $y_i$-th position)
Supervised learning
- Convex optimization: https://stanford.edu/~boyd/cvxbook/
- Random search
- Gradient: the direction of steepest descent is the negative gradient
- Finite differences approximation (numerical gradient): super slow

def eval_numerical_gradient(f, x):
    """
    a naive implementation of numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """
    fx = f(x)  # evaluate function value at original point
    grad = np.zeros(x.shape)
    h = 0.00001

    # iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        # evaluate function at x + h
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h  # increment by h
        fxh = f(x)             # evaluate f(x + h)
        x[ix] = old_value      # restore to previous value (very important!)

        # compute the partial derivative
        grad[ix] = (fxh - fx) / h  # the slope
        it.iternext()              # step to next dimension
    return grad
- Analytic gradient: exact, fast, error-prone
- Numerical gradient: centered formula $\frac{df(x)}{dx}=\frac{f(x+h)-f(x-h)}{2h}$
- **Gradient check**:
  - Use the analytic gradient, but check the implementation against the numerical gradient
  - Relative error: $\frac{\left|f'_a-f'_n\right|}{\max(\left|f'_a\right|,\left|f'_n\right|)}$ (max or add)
  - relative error > 1e-2 usually means the gradient is probably wrong
  - 1e-2 > relative error > 1e-4 should make you feel uncomfortable
  - 1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks (e.g. use of tanh nonlinearities and softmax), then 1e-4 is too high
  - relative error of 1e-7 and less: you should be happy
  - The deeper the network, the higher the relative errors will be
  - Use double rather than float precision; if the gradients or the step size h are too small (< 1e-10), there will be numerical issues
  - One source of inaccuracy to be aware of during gradient checking is the problem of *kinks*. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU (max(0,x)), the SVM loss, Maxout neurons, etc.
  - Use only a few datapoints to mitigate the kink problem
  - It is best to use a short **burn-in** period during which the network is allowed to learn, and perform the gradient check after the loss starts to go down
  - The regularization loss may overwhelm the data loss, in which case the gradients come primarily from the regularization term. It is recommended to turn off regularization and check the data loss alone first, then the regularization term second and independently
  - Remember to turn off dropout/augmentations
  - Make sure to gradient check a few dimensions for every separate parameter
Gradient descent: $\nabla_W L(W)=\frac{1}{N}\sum^N_{i=1}\nabla_W L_i(x_i,y_i,W)+\lambda \nabla_W R(W)$

# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad  # perform parameter update
- Step size or learning rate is an important hyper-parameter
- In practice, $N$ can be quite large, so we often use a minibatch of examples (32/64/128): Stochastic Gradient Descent (SGD)

# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad  # perform parameter update
- Using a proper feature representation works much better than feeding raw pixels into the classifier
- Color histogram
- Histogram of Oriented Gradients
- Bag of Words
- ConvNets: learn the features from the data
Back Propagation: recursive use of the chain rule
- At each node, we compute the local gradient and multiply it by the numerical gradient values coming from upstream
- We can group nodes together to form more complex nodes
- Gradients add up at forks. The forward expression involves the variables x, y multiple times, so when we perform backpropagation we must be careful to use `+=` instead of `=` to accumulate the gradient on these variables (otherwise we would overwrite it). This follows the multivariable chain rule in Calculus, which states that if a variable branches out to different parts of the circuit, then the gradients that flow back to it will add.
- $\sigma(x)=\frac{1}{1+e^{-x}}$ is called the sigmoid function, and $\frac{d\sigma(x)}{dx}=(1-\sigma(x))\sigma(x)$
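As a tiny example, backprop through a single sigmoid gate using that shortcut (the upstream gradient value here is made up):

import numpy as np

x = 0.5
y = 1.0 / (1.0 + np.exp(-x))  # forward pass: sigmoid
dL_dy = 2.0                   # upstream gradient (assumed given)
dL_dx = dL_dy * (1 - y) * y   # local gradient uses only the forward output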
patterns in backward flow:
- Add gate: gradient distributor
- Max gate: gradient router
- Mul gate: gradient switcher
- Copy gate: gradient adder
- Gradients add at branches
- Gradients for vectorized code: the Jacobian matrix
- In practice, the Jacobian matrix can be very large; for element-wise operations it is diagonal
- The gradient with respect to a variable should have the same shape as the variable
- $Y=XW$: $\frac{\partial L}{\partial X}=\frac{\partial L}{\partial Y}W^T$, $\frac{\partial L}{\partial W}=X^T\frac{\partial L}{\partial Y}$
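A quick NumPy check of those formulas and shapes (dL_dY is a stand-in upstream gradient):

import numpy as np

X = np.random.randn(4, 3)      # N x D inputs
W = np.random.randn(3, 5)      # D x M weights
Y = X @ W                      # N x M outputs
dL_dY = np.random.randn(4, 5)  # stand-in upstream gradient

dL_dX = dL_dY @ W.T            # N x D: same shape as X
dL_dW = X.T @ dL_dY            # D x M: same shape as W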
Modularized implementation: forward / backward API

class ComputationalGraph(object):
    # ...
    def forward(self, inputs):
        # 1. pass inputs to input gates...
        # 2. forward the computational graph:
        for gate in self.graph.nodes_topologically_sorted():
            gate.forward()
        return loss  # the final gate in the graph outputs the loss

    def backward(self):
        for gate in reversed(self.graph.nodes_topologically_sorted()):
            gate.backward()
        return inputs_gradients
- A neural network stacks simple functions hierarchically to build a more complex non-linear function: $s=W_2\max(0,W_1x)$
- Non-linear layers are important: ReLU, ...
- Each row of $W_1$ is a template, and $W_2$ computes a weighted sum of the intermediate scores ($h$, taken after the non-linearity)
- The abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)
- The forward pass of a fully-connected layer corresponds to one matrix multiplication followed by a bias offset and an activation function
- A neural network can approximate any continuous function
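A minimal sketch of that two-layer computation in NumPy (the layer sizes are illustrative assumptions):

import numpy as np

x = np.random.randn(3072)        # input vector
W1 = np.random.randn(100, 3072)  # first layer: 100 hidden units
W2 = np.random.randn(10, 100)    # second layer: 10 class scores

h = np.maximum(0, W1 @ x)        # ReLU non-linearity
s = W2 @ h                       # s = W2 max(0, W1 x)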
- Convolution layer: preserves spatial structure
- Filters always extend the full depth of the input volume, and each filter produces one activation map, so the output depth equals the number of filters
- Convolution: taking a dot product between the filter and a small chunk of the image
- In practice, we stretch the filter out into a long vector to do the dot product
- Filters at the early layers: low-level features like edges; filters at the mid-level layers: more complex features like corners and blobs; then high-level features...
- Output size: (N - F) / stride + 1
- In practice it is common to zero-pad the border to maintain the input size: (N - F + 2P) / stride + 1
- Don't forget the bias term
- Number of params: F * F * D * K weights and K biases
- A 5x5 filter = a 5x5 receptive field
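Plugging numbers into those formulas, as a quick check (illustrative layer sizes):

# Input: 32 x 32 x 3; ten 5x5 filters; stride 1; pad 2
N, F, stride, P, D, K = 32, 5, 1, 2, 3, 10
out = (N - F + 2 * P) // stride + 1  # 32: zero-padding preserves the spatial size
params = F * F * D * K + K           # 5*5*3*10 weights + 10 biases = 760
print(out, params)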
- Pooling layer:
  - Makes the representations smaller and more manageable
  - Downsamples spatially (the depth dimension remains unchanged)
  - In practice, we often set the stride to avoid overlapping (one value represents one region)
  - Max pooling: more intuitive
  - Same equation: (N - F) / stride + 1
  - People don't use padding in pooling
-
Fully connected layer
- Typical architectures: [(CONV - RELU)*N - POOL]*M - (FC - RELU)*K, SOFTMAX
  - N is usually up to 5, M is large, 0 <= K <= 2
- Sigmoid: $\sigma(x)=1/(1+e^{-x})$
  - Squashes numbers to the range [0,1]
  - Can be interpreted as a saturating firing rate of a neuron
  - Problems:
    - "kills" the gradients (when x is far from 0)
    - its outputs are not zero-centered: when the input to a neuron is always positive, the gradients on $w$ are all positive or all negative, causing inefficient gradient updates
    - exp() is computationally expensive
- Tanh(x)
  - Squashes numbers to [-1,1]
  - Tanh is a scaled sigmoid: $\tanh(x)=2\sigma(2x)-1$
  - Zero-centered
  - Kills gradients when saturated
- ReLU (Rectified Linear Unit), the default choice: $f(x)=\max(0,x)$
  - Does not saturate (in the + region)
  - Very computationally efficient
  - Greatly accelerates the convergence of stochastic gradient descent compared to sigmoid/tanh in practice
  - More biologically plausible than sigmoid
  - Problems:
    - Not zero-centered output
    - Kills the gradient when x <= 0: dead ReLU. So people like to initialize ReLU neurons with slightly positive biases
  - Be careful with the learning rate!
- Leaky ReLU (good): $f(x)=\max(0.1x, x)$
  - Will not die
- Parametric ReLU: $f(x)=\max(\alpha x,x)$
  - Backprop into $\alpha$
- ELU (Exponential Linear Unit): $f(x)=\begin{cases}x & \text{if } x>0\\ \alpha(e^x-1) & \text{if } x\leq0\end{cases}$
  - All the benefits of ReLU
  - Closer to zero-mean outputs
  - Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
-
SELU (Scaled Exponential Linear Units)
-
Maxout Neuron (good): $f=\max(w_1^Tx+b_1,w_2^Tx+b_2)$
- Generalizes ReLU and Leaky ReLU
- Linear regime: does not saturate or die
- Problem: doubles the number of parameters
- Zero-centered data: X -= np.mean(X, axis=0)
  - Subtract the mean image: AlexNet
  - Subtract the per-channel mean: VGGNet
- Normalized data: X /= np.std(X, axis=0)
-
Any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data
-
PCA and Whitening
# Assume input data matrix X of size [N x D]
X -= np.mean(X, axis=0)            # zero-center the data (important)
cov = np.dot(X.T, X) / X.shape[0]  # get the data covariance matrix
U, S, V = np.linalg.svd(cov)       # SVD factorization
Xrot = np.dot(X, U)                # decorrelate the data
# Using PCA: Xrot_reduced = np.dot(X, U[:, :100])  # Xrot_reduced becomes [N x 100]
# whiten the data:
# divide by the eigenvalues (which are square roots of the singular values)
Xwhite = Xrot / np.sqrt(S + 1e-5)
-
-
Weight Initialization
- If we initialize all weights with the same value, their gradients will all be the same and the units will behave identically (no symmetry breaking)
- Initialization too small: activations go to zero, gradients also zero, no learning
- Initialization too big: activations saturate, gradients zero, no learning
- Initialization just right: nice distribution of activations at all layers, learning proceeds nicely
- Small random numbers (but this does not work for deep networks)
- Xavier initialization: W = np.random.randn(Din, Dout) / np.sqrt(Din)
- Kaiming initialization: because ReLU kills half of the activations, W = np.random.randn(Din, Dout) / np.sqrt(Din / 2)
- It is possible and common to initialize the biases to zero, since asymmetry breaking is provided by the small random numbers in the weights
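A small experiment sketch showing why the scaling matters (toy setup: five tanh layers; with Xavier scaling the activation spread stays healthy instead of collapsing toward zero):

import numpy as np

x = np.random.randn(1000, 512)
for layer in range(5):
    Din, Dout = 512, 512
    W = np.random.randn(Din, Dout) / np.sqrt(Din)  # Xavier scaling
    x = np.tanh(x @ W)
    print(layer, x.std())  # stays in a healthy range across layers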
Batch Normalization: forcing the inputs to be unit Gaussian
- Compute the empirical mean and variance independently for each dimension (element-wise), for each mini-batch
- Normalization: $\hat{x}^{(k)}=\frac{x^{(k)}-E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$
- Scaling and shifting: $\hat{y}=\gamma^{(k)}\hat{x}^{(k)}+\beta^{(k)}$; the network can learn $\gamma^{(k)}=\sqrt{Var[x^{(k)}]},\ \beta^{(k)}=E[x^{(k)}]$
- Insert after FC or Conv layers and before the nonlinearity
- At test time, use the averages of the values seen during training
- Benefits:
  - Improves gradient flow through the network
  - Allows higher learning rates: faster convergence
  - Reduces the strong dependence on initialization
  - Acts as a form of regularization
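A minimal training-time batchnorm forward pass (a sketch; gamma and beta are the learnable per-feature scale and shift):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) minibatch; mean/var computed per feature (element-wise)
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to unit Gaussian
    return gamma * x_hat + beta            # learnable scale and shift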
-
Preprocess the data
-
Choose the architecture
-
Before optimization:
- Look for correct loss at chance performance. Make sure you’re getting the loss you expect when you initialize with small parameters. It’s best to first check the data loss alone (so set regularization strength to zero)
- As a second sanity check, increasing the regularization strength should increase the loss
- Overfit a tiny subset of data. Lastly and most importantly, before training on the full dataset try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost. For this experiment it’s also best to set regularization to zero, otherwise this can prevent you from getting zero cost
-
Start training
- Quantities that should be monitored: loss, train/val accuracy, ratio of the update magnitudes, the first-layer weights (in ConvNets)
-
Start with small regularization and find a learning rate that makes the loss go down: try the range [1e-5, 1e-3]
- If the loss is not going down, the learning rate is too low; if the loss is NaN, the learning rate is too high
- If there is a big gap between training accuracy and validation accuracy, the model is overfitting (increase regularization or get more data); if there is no gap, you can increase the model capacity
- If accuracy is still going up, you need to train longer
-
Early stop: when validation loss starts going up
-
Hyperparameter Optimization
-
Learning rate, regularization, learning rate decay, model size
-
Prefer one validation fold to cross-validation
-
Cross-validation in stages:
-
Coarse to fine search:
- First stage: only a few epochs to get rough idea of what params work (wide range; 1-5 epochs)
- Second stage: longer running time, fine search (repeat as necessary; 10-20 epochs)
- It is best to optimize in log space: learning_rate = 10 ** uniform(-6, 1)
- But some parameters are instead searched on the original scale: dropout = uniform(0, 1)
- Random Search is sometimes better than Grid Search
- Make sure that the best value is not at the edge of your search interval
- Track the ratio of weight updates to weight magnitudes: should be around 0.001
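One way to compute that ratio (a sketch; W, dW, and learning_rate are stand-in names for one parameter array, its gradient, and the step size):

param_scale = np.linalg.norm(W.ravel())
update = -learning_rate * dW  # the actual step taken
update_scale = np.linalg.norm(update.ravel())
print(update_scale / param_scale)  # want this to be around 1e-3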
Problems with SGD:
- The loss can change quickly in one direction and slowly in another (the ratio between the largest and smallest gradient directions can be large, especially in high dimensions), causing a zig-zag path
- Local minima / saddle points (much more common in high dimensions): zero gradient, so gradient descent gets stuck
- Our gradients come from minibatches, so they can be noisy
-
SGD + Momentum
- $v_{t+1}=\rho v_t+\nabla f(x_t)$, $x_{t+1}=x_t-\alpha v_{t+1}$

vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx
    x -= learning_rate * vx

- Velocity is like a running mean of gradients; rho gives "the coefficient of friction" (typically 0.5, 0.9, 0.95, or 0.99)
-
- The step is a weighted sum of velocity and gradient (vector composition)
- This helps with all of the SGD problems above
-
SGD + Nesterov Momentum:
- $v_{t+1}=\rho v_t-\alpha \nabla f(x_t+\rho v_t)$, $x_{t+1}=x_t+v_{t+1}$
- With the change of variables $\widetilde x_t=x_t+\rho v_t$: $v_{t+1}=\rho v_t-\alpha \nabla f(\widetilde x_t)$, $\widetilde x_{t+1}=\widetilde x_t+v_{t+1}+\rho (v_{t+1}-v_t)$

dx = compute_gradient(x)
old_v = v
v = rho * v - learning_rate * dx
x += -rho * old_v + (1 + rho) * v
AdaGrad

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

- Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension
- The step size gets smaller and smaller: good for convex problems
-
RMSProp

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
Adam

first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):  # t goes from 1 upward
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx         # Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx  # RMSProp
    first_unbias = first_moment / (1 - beta1 ** t)                 # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)               # bias correction
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-8)  # AdaGrad/RMSProp-style step

- Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models
- Adam combines momentum with an exponentially weighted average (EWA) of the squared gradients
Learning rate
- Learning rate decay over time:
  - Step decay: decay the learning rate by half every few epochs. One heuristic you may see in practice is to watch the validation error while training with a fixed learning rate, and reduce the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving
  - Exponential decay: $\alpha = \alpha_0e^{-kt}$, where $\alpha_0, k$ are hyperparameters and $t$ is the iteration number (but you can also use units of epochs)
  - Cosine: $\alpha_t=\frac{1}{2}\alpha_0(1+\cos(t\pi /T))$
  - Linear: $\alpha_t=\alpha_0(1-t/T)$
  - Inverse sqrt: $\alpha_t=\alpha_0/\sqrt t$
  - 1/t decay: $\alpha = \alpha_0/(1+kt)$
  - Start with no decay, and observe the loss curve to find where decay is needed
- Warm up: high initial learning rates can make the loss explode; linearly increasing the learning rate from 0 over the first ~5000 iterations can prevent this
- If you increase the batch size by N, also scale the initial learning rate by N
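As a concrete sketch, a cosine schedule with linear warmup (the function name and constants are illustrative assumptions):

import math

def lr_at(t, total_iters, base_lr=1e-3, warmup_iters=5000):
    if t < warmup_iters:  # linear warmup from 0
        return base_lr * t / warmup_iters
    progress = (t - warmup_iters) / (total_iters - warmup_iters)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))  # cosine decay to 0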
- First-Order Optimization: use the gradient to form a linear approximation, and step to minimize the approximation
- Second-Order Optimization: use the gradient and Hessian to form a quadratic approximation, and step to the minimum of the approximation: $\theta^* = \theta_0-H^{-1}\nabla_{\theta}J(\theta_0)$
  - No learning rate
  - Computing the Hessian is very costly
  - Quasi-Newton methods (BFGS most popular) seek to approximate the inverse Hessian
  - L-BFGS: does not form/store the full inverse Hessian; works well in full batch (less stochastic) settings
-
Model Ensembles
- Train multiple independent models and average their results at test time
  - Same model, different initializations
  - Top models discovered during cross-validation
  - Multiple snapshots of a single model during training
  - Using the moving average of the parameter vector at test time
Regularization: L1, L2, Max norm
Dropout
- In each forward pass, randomly set some neurons to zero
- The probability of dropping is a hyperparameter: 0.5 is common

p = 0.5  # probability of keeping a unit active

def train_step(X):
    # forward pass for an example 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p  # first dropout mask
    H1 *= U1                            # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p  # second dropout mask
    H2 *= U2                            # drop!
    out = np.dot(W3, H2) + b3

def predict(X):
    # ensemble forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p   # scale at test time
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p  # scale at test time
    out = np.dot(W3, H2) + b3
- Can be interpreted as doing a model ensemble within a single model: each binary mask is one model ($2^N$ models)
- Dropout makes the output random at training time: $y=f_W(x,z)$
- We want to average out the randomness at test time: $y=f(x)=E_z[f(x,z)]=\int p(z)f(x,z)dz$
  - To approximate the integral: at test time, multiply by the dropout probability
-
Inverted dropout: scale at training time instead, to make test time faster

p = 0.5

def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # first dropout mask; note the /p
    H1 *= U1                                  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p  # second dropout mask; note the /p
    H2 *= U2                                  # drop!
    out = np.dot(W3, H2) + b3

def predict(X):
    # test-time forward pass is unchanged: no scaling needed
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
-
-
Batch Normalization acts the same way: add noise or stochasticity at training time, and average it out at test time
-
Data Augmentation (transform the data without changing its label)
- Random crops and scales
  - Training: sample random crops / scales
  - Testing: average over a fixed set of crops
- Color jitter (randomize contrast and brightness)
- Random mixes/combinations of: translation, rotation, stretching, shearing, lens distortion, ...
-
DropConnect: zero out some of the values of the weight matrix
Fractional Max Pooling
Stochastic Depth
Cutout / Mixup for small classification datasets
- One reason for overfitting is not having enough data
Transfer learning with a pretrained model:

| | Very similar dataset | Very different dataset |
| --- | --- | --- |
| Very little data | Use Linear Classifier on top layer | Difficult... Try linear classifier from different stages |
| Quite a lot of data | Finetune a few layers | Finetune a larger number of layers |
- GPUs compared to CPUs: more cores, but each core is much slower; good at parallel tasks like matrix multiplication
- CUDA
- Streaming multiprocessors: FP32 cores, Tensor Cores (4x4 matrix operations)
Tensor
Autograd
- requires_grad=True
- loss.backward()
- Gradients are accumulated into w1.grad and w2.grad, and the graph is destroyed
- Set gradients to zero:
    w1.grad.zero_()
    w2.grad.zero_()
- Tell PyTorch not to build a graph for these operations:
    with torch.no_grad():
        # gradient descent...
        for param in model.parameters():
            param -= learning_rate * param.grad
- Use an optimizer to update params and zero gradients:
    optimizer.step()
    optimizer.zero_grad()
- You can define new functions, but PyTorch still builds the graph operation by operation (which can be numerically unstable)
- Define new autograd operators by subclassing Function and defining forward and backward:
    class Sigmoid(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            # ...
        @staticmethod
        def backward(ctx, grad_output):
            # ...
    def sigmoid(x):
        return Sigmoid.apply(x)
  - This adds only one node to the graph
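Filled in, a working version of that operator might look like this (a sketch using the standard torch.autograd.Function API; the saved-tensor bookkeeping is the only addition):

import torch

class Sigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = 1.0 / (1.0 + (-x).exp())
        ctx.save_for_backward(y)            # stash the output for the backward pass
        return y

    @staticmethod
    def backward(ctx, grad_output):
        y, = ctx.saved_tensors
        return grad_output * (1.0 - y) * y  # d(sigma)/dx = (1 - sigma) * sigma

def sigmoid(x):
    return Sigmoid.apply(x)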
Module
- Define models as a torch.nn.Module subclass:
    class TwoLayerNet(torch.nn.Module):
        def __init__(self, D_in, H, D_out):
            super(TwoLayerNet, self).__init__()
            # ...
        def forward(self, x):
            # ...
        # no need to define backward - autograd will handle it
- Stack multiple instances of a component in a Sequential: model = torch.nn.Sequential(...)
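A filled-in sketch of the two patterns above (dimensions are illustrative):

import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h = self.linear1(x).clamp(min=0)  # ReLU
        return self.linear2(h)

# or the same network as a Sequential stack:
model = torch.nn.Sequential(
    torch.nn.Linear(3072, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)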
-
-
Pretrained models: torchvision
-
tensorboard
-
Dynamic Computation Graphs:
- Building the graph and computing the graph happen at the same time
- Lets you use regular Python control flow during the forward pass
- Applications: model structures that depend on the input (RNNs)
-
-
Static Computation Graph:
- Build computational graph describing our computation
- Reuse the same graph on every iteration
@torch.jit.script # python function compiled to a graph when it is defined
| | Static | Dynamic |
| --- | --- | --- |
| Optimization | Framework can optimize the graph before it runs | None |
| Serialization | Once the graph is built, can serialize it and run it without the code (train the model in Python, deploy in C++) | Graph building and execution are intertwined, so you always need to keep the code around |
| Debugging | Lots of indirection - can be hard to debug, benchmark, etc. | The code you write is the code that runs. Easy to reason about and debug |
- Data parallelism
  - nn.DataParallel
  - nn.DistributedDataParallel
- TF 1.0 (Static Graphs)
  - Define the computational graph
  - Run the graph many times
- TF 2.0 (Dynamic Graphs)
  - Static: @tf.function
  - Keras
-
Some useful quantities:
- Memory: KB = (number of output elements) * (bytes per element) / 1024 = (C' * H' * W') * 4 [for 32-bit floating point] / 1024
- Params: number of weights = (weight shape) + (bias shape) = C' * C * H * W + C'
- Number of floating point operations (flop = multiply + add) = (number of output elements) * (ops per output element) = (C' * H' * W') * (C * H * W)
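Plugging an example conv layer into those formulas (illustrative numbers):

# conv layer: C=64 input channels, C'=128 filters of size 3x3, output 56x56
C, Cp, K, Hp, Wp = 64, 128, 3, 56, 56
memory_kb = Cp * Hp * Wp * 4 / 1024  # output stored as 32-bit floats
params = Cp * C * K * K + Cp         # weights + biases
flops = (Cp * Hp * Wp) * (C * K * K) # multiply-adds per forward pass
print(memory_kb, params, flops)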
AlexNet:
- Model split over two GPUs
- Most of the memory usage is in the early convolution layers
- Nearly all of the parameters are in the FC layers
- Most floating-point ops occur in the convolution layers
- The architectures of both AlexNet and ZFNet were set by trial and error (hand-designed)
VGG design rules:
- All conv are 3x3 stride 1 pad 1
- All max pool are 2x2 stride 2
- After pool, double the number of channels
- Five conv stages:
  - conv-conv-pool
  - conv-conv-pool
  - conv-conv-pool
  - conv-conv-conv-[conv]-pool
  - conv-conv-conv-[conv]-pool (VGG19 has 4 convs in stages 4 and 5)
- Two 3x3 convs have the same receptive field as a single 5x5 conv, but have fewer parameters and take less computation. Plus, two 3x3 convs can add more non-linearity by inserting a ReLU between them
- Conv layers at each spatial resolution take the same amount of computation, by doubling the channels while halving the spatial resolution
GoogLeNet: innovations for efficiency
- Stem network at the start aggressively downsamples the input
- Inception module: a local unit with parallel branches, repeated many times throughout the network
  - Tries all kernel sizes instead of setting the size as a hyperparameter
  - Uses 1x1 bottleneck layers to reduce the channel dimension before expensive convs
- Uses Global Average Pooling to collapse the spatial dimensions, and one linear layer to produce class scores
- Auxiliary classifiers: no longer needed after Batch Normalization
ResNet
- Problem: deeper plain networks perform worse than shallow ones
- Hypothesis: deeper networks are harder to optimize, and don't learn identity functions to emulate shallow models
- In principle, one could copy the learned layers from the shallower model and set the additional layers to identity mappings
- Solution: change the network so that learning identity functions with extra layers is easy
- Residual block:
  - Basic block: two 3x3 convs
  - Bottleneck block: 1x1 conv + 3x3 conv + 1x1 conv
  - Pre-activation block: ReLU inside the residual branch -> can learn a true identity function by setting the conv weights to zero
- The network is divided into stages like VGG: the first block of each stage halves the resolution (with a stride-2 conv) and doubles the number of channels
- Uses the same aggressive stem as GoogLeNet to downsample the input 4x before applying residual blocks
- Uses Global Average Pooling
Improving ResNets: ResNeXt (grouped convolution)
SENet (2017)
Densely Connected Neural Networks (DenseNet)
MobileNets, ShuffleNet
Neural Architecture Search
- RNNs work on both sequential and non-sequential data
- RNNs have an internal state that is updated as a sequence is processed: $h_t=f_W(h_{t-1},x_t)$
- Sharing weights: use the same weight matrix W at every timestep
- $h_t=\tanh(W_{hh}h_{t-1}+W_{xh}x_t)=\tanh\left(W\begin{pmatrix}h_{t-1}\\x_t\end{pmatrix}\right)$
- $y_t=W_{hy}h_t$
- $h_0$ is either set to all zeros, or learned
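A vanilla RNN step in NumPy (a direct sketch of the recurrence above):

import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why):
    h = np.tanh(Whh @ h_prev + Wxh @ x)  # update the hidden state
    y = Why @ h                          # output at this timestep
    return h, y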
Language Modeling
- Encode inputs as one-hot vectors
- Embedding layer
- GPT-2
- Forward the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient
- Truncated Backpropagation Through Time
- Image Captioning: $h = \tanh(W_{xh}x+W_{hh}h+W_{ih}v)$
Vanilla RNN Gradient Flow
- Computing the gradient back to $h_0$ involves many factors of W and repeated tanh (almost always < 1): exploding / vanishing gradients
- Exploding gradients: gradient clipping -> scale the gradient if its norm is too big

grad_norm = np.sum(grad * grad)
if grad_norm > threshold:
    grad *= (threshold / grad_norm)

- Vanishing gradients: change the RNN architecture
LSTM
- $\begin{pmatrix}i\\f\\o\\g\end{pmatrix}=\begin{pmatrix}\sigma\\\sigma\\\sigma\\\tanh\end{pmatrix}\left(W\begin{pmatrix}h_{t-1}\\x_t\end{pmatrix}+b_h\right)$
- $c_t=f\odot c_{t-1}+i\odot g$, $h_t=o\odot \tanh(c_t)$
- i: input gate, whether to write to the cell
- f: forget gate, whether to erase the cell
- o: output gate, how much to reveal the cell
- g: gate gate, how much to write to the cell
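An LSTM step sketched in NumPy (H is the hidden size; W packs all four gates, shape 4H x (H + D)):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b, H):
    z = W @ np.concatenate([h_prev, x]) + b  # all four gates computed at once
    i = sigmoid(z[0*H:1*H])                  # input gate
    f = sigmoid(z[1*H:2*H])                  # forget gate
    o = sigmoid(z[2*H:3*H])                  # output gate
    g = np.tanh(z[3*H:4*H])                  # gate gate
    c = f * c_prev + i * g                   # cell state update
    h = o * np.tanh(c)                       # hidden state
    return h, c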
-
Gradient Flow:
- Backpropagation from $c_t$ to $c_{t-1}$ involves only elementwise multiplication by $f$ (a sigmoid output, in (0,1))
- Uninterrupted gradient flow (similar to ResNet)
- Allows better control of gradient values, via suitable parameter updates of the forget gate
- Adding through the f, i, g, o gates: better balancing of gradient values
- LSTM makes it easier for an RNN to preserve information over many time steps (if f = 1 and i = 0, the information of that cell is preserved indefinitely)
- LSTM doesn't guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies
-
Multilayer RNNs
Gated Recurrent Unit (GRU)
Sequence-to-sequence:
- Many to one (Encoder): $h_t=f_W(x_t,h_{t-1})$
  - Initial decoder state $s_0$
  - Context vector $c$ (often $c=h_T$)
- One to many (Decoder): $s_t=g_U(y_{t-1},s_{t-1},c)$
  - Problem: the input sequence is bottlenecked through a fixed-sized vector
Attention:
- Alignment scores: $e_{t,i}=f_{att}(s_{t-1},h_i)$ ($f_{att}$ is an MLP)
- Normalize alignment scores to get attention weights (using softmax): $\sum_ia_{t,i}=1,\ 0<a_{t,i}<1$
- Compute the context vector as a linear combination of hidden states: $c_t=\sum_ia_{t,i}h_i$
- Use the context vector in the decoder: $s_t=g_U(y_{t-1},c_t,s_{t-1})$
- Image captioning with RNNs and Attention: each timestep of the decoder uses a different context vector that looks at different parts of the input image
- The decoder doesn't use the fact that the $h_i$ form an ordered sequence - it just treats them as an unordered set $\{h_i\}$
- X, Attend, and Y: Show, attend, and tell / read
Attention Layer:
- Inputs:
  - Query vectors: $Q$ (Shape: $N_Q \times D_Q$)
  - Input vectors: $X$ (Shape: $N_X\times D_X$)
  - Similarity function: scaled dot product instead of $f_{att}$ (large similarities would cause the softmax to saturate and give vanishing gradients)
- Computation:
  - Similarities: $E=QX^T$ (Shape: $N_Q\times N_X$), $E_{i,j}=Q_i\cdot X_j/\sqrt{D_Q}$
  - Attention weights: $A=\text{softmax}(E, \text{dim}=1)$ (Shape: $N_Q\times N_X$)
  - Output vectors: $Y=AX$ (Shape: $N_Q\times D_X$), $Y_i=\sum_jA_{i,j}X_j$
- Changes: separate key and value
  - Key matrix: $W_K$ (Shape: $D_X\times D_Q$)
  - Value matrix: $W_V$ (Shape: $D_X \times D_V$)
  - Key vectors: $K=XW_K$ (Shape: $N_X\times D_Q$)
  - Value vectors: $V=XW_V$ (Shape: $N_X\times D_V$)
  - Similarities: $E=QK^T$ (Shape: $N_Q\times N_X$), $E_{i,j}=Q_i\cdot K_j/\sqrt{D_Q}$
  - Output vectors: $Y=AV$ (Shape: $N_Q\times D_V$), $Y_i=\sum_jA_{i,j}V_j$
-
Self-Attention Layer: one query per input vector
- Inputs:
  - Input vectors: $X$ (Shape: $N_X\times D_X$)
  - Key matrix: $W_K$ (Shape: $D_X\times D_Q$)
  - Value matrix: $W_V$ (Shape: $D_X \times D_V$)
  - Query matrix: $W_Q$ (Shape: $D_X \times D_Q$)
- Computation:
  - Query vectors: $Q=XW_Q$ (Shape: $N_X\times D_Q$)
  - Key vectors: $K=XW_K$ (Shape: $N_X\times D_Q$)
  - Value vectors: $V=XW_V$ (Shape: $N_X\times D_V$)
  - Similarities: $E=QK^T$ (Shape: $N_X\times N_X$), $E_{i,j}=Q_i\cdot K_j/\sqrt{D_Q}$
  - Attention weights: $A=\text{softmax}(E, \text{dim}=1)$ (Shape: $N_X\times N_X$)
  - Output vectors: $Y=AV$ (Shape: $N_X\times D_V$), $Y_i=\sum_jA_{i,j}V_j$
- The Self-Attention layer is Permutation Equivariant: $f(s(x))=s(f(x))$
- Self-Attention works on a set of vectors, comparing them with each other, and outputs a set of vectors (it does not care about the order of the vectors it processes)
- To make processing position-aware, concatenate the input with a positional encoding
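A compact NumPy sketch of one self-attention layer (shapes follow the list above; the softmax is over each row of E):

import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    E = Q @ K.T / np.sqrt(W_Q.shape[1])           # scaled dot-product similarities
    A = np.exp(E - E.max(axis=1, keepdims=True))  # numerically stable row-wise softmax
    A /= A.sum(axis=1, keepdims=True)
    return A @ V                                  # weighted sum of value vectors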
- Masked Self-Attention Layer:
  - Don't let vectors look ahead in the sequence
  - Used for language modeling (predict the next word)
  - Set $E_{i,j}$ to $-\infty$ where $j>i$
- Multihead Self-Attention Layer:
  - Use $H$ independent attention heads in parallel
  - Hyperparameters: query dimension $D_Q$, number of heads $H$
- "Self-Attention Generative Adversarial Networks"
Three Ways of Processing Sequences
- RNN
  - Works on ordered sequences
  - Good at long sequences: after one RNN layer, $h_T$ sees the whole sequence
  - Not parallelizable: need to compute hidden states sequentially
- 1D Convolution
  - Works on multidimensional grids
  - Highly parallel
  - Bad at long sequences: need to stack many conv layers for outputs to see the whole sequence
- Self-Attention
  - Works on sets of vectors
  - Good at long sequences: after one self-attention layer, each output sees all inputs
  - Highly parallel: each output can be computed in parallel
  - Very memory intensive
The Transformer
- "Attention is all you need"
- Transformer block:
  - input + Self-Attention layer + residual connection + layer normalization + MLP (FC layers applied independently to each vector) + residual connection + layer normalization = output
  - Self-attention is the only interaction between vectors
- A Transformer is a sequence of transformer blocks
- Transfer learning:
  - Pretrain a giant Transformer model for language modeling
  - Finetune the Transformer on your own task
- Scaling up Transformers
- Machine learning models can be attacked by an adversary
- An example attack: input perturbation
  - Adversary goal: find a small perturbation $\delta_i$ such that $f(x_i+\delta_i)\neq f(x_i)$ (non-targeted)
  - The perturbation is small enough that it may not be noticeable to human eyes
- Adversarial input attacks
  - White-box attacks
  - Black-box attacks: the adversary can query the model, but has no access to its internals (weights etc.)
  - For a targeted attack, we want to reduce the distance to the target label and move opposite to the gradient
- Data poisoning attacks during training
- Evasion attacks
Visualizing what networks learn:
- First layer: edges, colors...
- Last layer:
  - Nearest neighbors
  - Dimensionality reduction: PCA, t-SNE
- Visualizing activations: maximally activating patches
- Saliency maps
  - A saliency map tells us the degree to which each pixel in the image affects the classification score for that image
  - Saliency via occlusion: mask part of the image before feeding it to the CNN, and check how much the predicted probabilities change
  - Saliency via backprop
- Intermediate features via (guided) backprop
- Class Activation Mapping
- Gradient ascent
  - Generate a synthetic image that maximally activates a neuron: $I^*=\arg\max_If(I)+R(I)$
  - Simple regularizer: $R(I)=\lambda\parallel I\parallel^2_2$ to make images more natural
- Feature inversion
  - Given a CNN feature vector for an image, find a new image that:
    - matches the given feature vector
    - looks natural (image prior regularization)
- DeepDream: amplify existing features
  - Choose an image and a layer in a CNN; repeat:
    - Forward: compute activations at the chosen layer
    - Set the gradient of the chosen layer equal to its activation
    - Backward: compute the gradient on the image
    - Update the image: $I^*=\arg\max_I\sum_if_i(I)^2$
- Neural Texture Synthesis: Gram matrix
- Neural Style Transfer: feature reconstruction (high-level) + Gram reconstruction (low-level)
Object Detection:
- Input: single RGB image
- Output: a set of detected objects; for each object predict:
  - Category label (from a fixed, known set of categories)
  - Bounding box (four numbers: x, y, width, height)
- Challenges: multiple outputs; multiple types of output ("what" and "where"); large images (detection needs higher resolution)
- Correct label: Softmax loss
- Correct box: L2 loss
- Multitask loss: weighted sum
- Object or background
- Region proposals: find a small set of boxes that are likely to cover all objects
- Regions of interest from a proposal method
- Categorize each region proposal as positive, negative, or neutral based on overlap with ground-truth boxes (using IoU)
- Warped image regions (224x224)
- Forward each region through a ConvNet
- Classify each region:
  - Positive boxes predict class and box offset
  - Negative boxes just predict the background class
- Bounding box regression: predict a "transform" to correct the region of interest
  - Input: ($p_x, p_y, p_h, p_w$)
  - Transform: ($t_x, t_y, t_h, t_w$)
  - Output: ($b_x, b_y, b_h, b_w$)
  - Translate relative to box size: $b_x=p_x+p_wt_x$, $b_y=p_y+p_ht_y$
  - Log-space scale transform: $b_w=p_w\exp(t_w)$, $b_h=p_h\exp(t_h)$
- Use scores to select a subset of region proposals to output (threshold on background / per-category / take the top K proposals per image)
- Compare with ground-truth boxes
- Intersection over Union (IoU): $\frac{Area\ of\ Intersection}{Area\ of\ Union}$
  - IoU > 0.5 is decent; IoU > 0.7 is pretty good; IoU > 0.9 is almost perfect
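A small IoU sketch (assuming boxes stored as (x1, y1, x2, y2) corners rather than the (x, y, width, height) form above):

def iou(a, b):
    # a, b: (x1, y1, x2, y2) with x1 < x2 and y1 < y2
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union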
- Overlapping boxes: Non-Max Suppression (NMS)
  - Problem: object detectors often output many overlapping detections
  - Solution: post-process raw detections using NMS
    1. Select the next highest-scoring box
    2. Eliminate lower-scoring boxes with IoU > threshold (e.g. 0.7)
    3. If any boxes remain, GOTO 1
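The NMS loop above, sketched (boxes and scores are parallel lists; iou is the helper from the previous sketch):

def nms(boxes, scores, thresh=0.7):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)  # 1. select the next highest-scoring box
        keep.append(i)
        # 2. eliminate lower-scoring boxes with IoU > threshold
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep           # 3. repeat until no boxes remain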
- Evaluating object detectors: Mean Average Precision (mAP)
  - Run the object detector on all test images (with NMS)
  - For each category, compute Average Precision (AP) = area under the Precision vs Recall curve
    - For each detection (from highest score to lowest score):
      - If it matches some ground-truth box with IoU > 0.5, mark it as positive and eliminate that ground-truth box
      - Otherwise mark it as negative
      - Plot a point on the PR curve
  - mAP = average of AP over categories
  - For "COCO mAP": compute mAP@thresh for each IoU threshold (0.5, 0.55, 0.6, ..., 0.95) and take the average
- Add a backbone network before region proposals
- Cropping and resizing features: RoI Pool / RoI Align (bilinear interpolation)
  - Must be differentiable for backprop
- The per-region network is relatively lightweight
- Problem: runtime is dominated by region proposals
Insert a Region Proposal Network (RPN) to predict proposals from features
- Imagine an anchor box of fixed size at each point in the feature map
- At each point, predict whether the corresponding anchor contains an object (per-cell logistic regression; predict scores with a conv layer)
- For positive boxes, also predict a box transform to regress from the anchor box to the object box
- Problem: an anchor box may have the wrong size / shape
  - Solution: use K different anchor boxes at each point
- Jointly train with 4 losses:
  - RPN classification
  - RPN regression
  - Object classification
  - Object regression
- Ignore overlapping proposals with non-max suppression
Two-stage object detector
- First stage: run once per image
  - Backbone network
  - Region proposal network
- Second stage: run once per region
  - Crop features: RoI pool / align
  - Predict object class
  - Predict bbox offset
Single-stage object detection
- Classify each anchor as one of C categories or background
- YOLO / SSD / RetinaNet
Feature Pyramid Networks
Fully Convolutional Networks
- Problems:
  - The effective receptive field size is linear in the number of conv layers: with 3x3 conv layers, the receptive field is 1 + 2L
  - Convolution on high-resolution images is expensive
  -> Downsampling and upsampling inside the network
- In-network upsampling: "unpooling"
  - Bed of Nails: copy each value to the top-left corner, pad with 0
  - Nearest Neighbor: pad with the input value
  - Bilinear interpolation
  - Bicubic interpolation
  - Max unpooling: place each max value into its remembered position, pad with 0
- Learnable upsampling: transposed convolution
  - Weight the filter by the input value and copy it to the output
  - The stride gives the ratio between movement in the output and the input
  - Sum where outputs overlap
Mask R-CNN
- For each pixel, the depth map gives the distance from the camera to the object in the world at that pixel
- RGB image + depth image = RGB-D image (2.5D)
- Predicting depth maps with an FCN
  - A fundamental problem in 3D vision: scale / depth ambiguity
  - Scale-invariant loss
- For each pixel, the surface normal gives the normal vector of the object surface in the world at that pixel
- Predicting normals
  - Per-pixel loss: $(x\cdot y)/(|x|\,|y|)$
- Voxels need high spatial resolution to capture fine structures
- Processing voxel input: 3D convolution
- Generating voxel shapes:
  - 3D convolution
  - Voxel tubes
- Voxel problems: computationally expensive, large memory usage
  - Oct-Trees, Nested Shape Layers
- Learn a function to classify arbitrary 3D points as inside / outside the shape: $\mathbb{R}^3\rightarrow\{0,1\}$
  - The surface of the 3D object is the level set $\{x:o(x)=0.5\}$
  - Signed distance function
- Processing point cloud inputs: PointNet
  - The order of points doesn't matter (a set)
- Generating point cloud outputs
  - Loss function for predicting point clouds: Chamfer distance
- Predicting meshes: Pixel2Mesh
  - Iterative mesh refinement
  - Graph convolution: $f'_i=W_0f_i+\sum_{j\in N(i)}W_1f_j$
  - Vertex-aligned features
  - Loss function: Chamfer distance between predicted samples and ground-truth samples
- Metrics:
  - Voxel IoU
  - Chamfer distance
  - F1 score:
    - Precision@t = fraction of predicted points within t of some ground-truth point
    - Recall@t = fraction of ground-truth points within t of some predicted point
    - F1@t = $2\cdot\frac{Precision@t\cdot Recall@t}{Precision@t+Recall@t}$
    - Robust to outliers
- Canonical coordinates vs view coordinates
- ShapeNet: 3D CAD models
- Pix3D: 3D models of IKEA furniture
Mesh R-CNN
- Input image -> 2D object recognition -> 3D object voxels -> 3D object meshes
- Chamfer loss + L2 regularization
- Video classification: UCF101, Sports-1M, YouTube-8M
- Atomic actions: Charades, Atomic Visual Actions (AVA), Moments in Time (MIT)
- Video retrieval (movie querying): M-VAD, MPII-MD, LSMDC
Challenges:
- Computationally expensive (size of video >> image datasets, redundancy in adjacent frames)
- Lower quality (resolution, motion blur, occlusion)
- Requires a lot of training data
- Sequence modeling
- Temporal reasoning (receptive field)
- Focus on action recognition (a representative task for video understanding)
Features:
- Local features (hand-crafted): HOG + HOF
- Trajectory-based:
  - Motion Boundary Histogram (MBH)
  - (Improved) dense trajectories: good performance, but computationally intensive
Ways to aggregate features:
- Bag of Visual Words
- Fisher vectors
Measuring motion: Optical Flow
- Highlights local motion
- Gives a displacement field F between images $I_t$ and $I_{t+1}$: $F(x,y)=(dx,dy)$
- Trajectory stacking
- CNN + RNN: video understanding as sequence modeling
- 3D convolution: add the temporal dimension to the CNN
- Two-stream networks: explicit model of motion
Problem: raw videos are big
Solution:
- Train on short clips: low fps and low spatial resolution
- At test time, run the model on different clips and average the predictions
Single-frame CNN:
- Train a normal 2D CNN to classify video frames independently
- Average the predictions at test time
- A strong baseline for video classification
Late fusion:
- Intuition: get the high-level appearance of each frame, and combine them
- Run a 2D CNN on each frame, concatenate the features, and feed them to an MLP; or pool the features and feed them to a Linear layer
- Problem: hard to compare low-level motion between frames
Early fusion:
- Intuition: compare frames in the very first conv layer; after that, a normal 2D CNN
- Input: T x 3 x H x W; reshape to 3T x H x W
- First conv layer output: D x H x W (collapses all temporal information)
- Problems:
  - One layer of temporal processing may not be enough
  - No temporal shift-invariance -> needs to learn separate filters for the same motion at different times in the clip
3D CNN (slow fusion):
- Intuition: use 3D convolution and pooling to slowly fuse temporal information over the course of the network
- Builds slowly in space and time
- Temporal shift-invariant, since each filter slides over time
- C3D: the VGG of 3D CNNs
- I3D: inflating 2D networks to 3D
  - Input: 2D conv / pool kernel $K_h\times K_w$
  - Output: 3D conv / pool kernel $K_t\times K_h\times K_w$, made by copying the 2D kernel $K_t$ times along time and dividing by $K_t$
- Video = appearance + motion (complementary)
  - Separate motion (multi-frame -> optical flow) from static appearance (single frame)
  - Motion = external + camera motion -> mean subtraction to compensate for camera motion
  - Problems:
    - The appearance and motion streams are not aligned -> spatial fusion
    - Lacking modeling of temporal evolution -> temporal fusion
Modeling long-term temporal structure
- Extract features with a CNN (2D or 3D): each value is a function of a fixed temporal window (local temporal structure)
- Process the local features with an RNN (e.g. LSTM): each vector is a function of all previous vectors (global temporal structure)
- Sometimes don't backprop into the CNN to save memory; pretrain and use the CNN as a feature extractor
- Multi-layer RNN:
  - Features from layer $L$, timestep $t-1$
  - Features from layer $L-1$, timestep $t$
  - Use different weights at each layer, share weights across time
- Recurrent CNN: replace the dot product in the RNN with convolution
- Nonlocal block (spatio-temporal self-attention):
  - Interpret the features as a set of THW vectors of dim C
  - Trick: initialize the last conv to 0, so the entire block computes the identity; the nonlocal block can then be inserted into existing 3D CNNs
- Treating time and space differently
- Supervised learning: learn a function to map x -> y (classification, regression, object detection, semantic segmentation, image captioning)
- Unsupervised learning: learn some underlying hidden structure of the data (clustering, dimensionality reduction, feature learning, density estimation)
- Discriminative model:
  - Learns a probability distribution $p(y|x)$
  - The possible labels for each input compete for probability mass
  - Cannot handle unreasonable inputs: it must give label distributions for all images
  - Applications:
    - Assign labels to data
    - Feature learning (with labels)
- Generative model:
  - Learns a probability distribution $p(x)$
  - All possible images compete with each other for probability mass
  - The model can reject unreasonable inputs by assigning them small probability
  - Applications:
    - Detect outliers
    - Feature learning (without labels)
    - Sample to generate new data
- Conditional generative model: learns $p(x|y)$
  - Each possible label induces a competition among all images
  - $P(x|y)=\frac{P(y|x)}{P(y)}P(x)$
  - Applications:
    - Assign labels, while rejecting outliers
    - Generate new data conditioned on input labels
- Given training data, generate new samples from the same distribution
- Formulate as density estimation problems:
  - Explicit density estimation: explicitly define and solve for $p_{model}(x)$
  - Implicit density estimation: learn a model that can sample from $p_{model}(x)$ without explicitly defining it
- https://deepgenerativemodels.github.io/notes/
Autoregressive models
- Goal: write down an explicit function for $p(x)=f(x,W)$
- Fully visible belief network
- Train by maximizing the probability of the training data:
  - $W^*=\arg\max_W\prod_ip(x^{(i)})$ (maximum likelihood estimation)
  - $W^*=\arg\max_W\sum_i\log p(x^{(i)})$ (log trick to exchange the product for a sum)
  - $W^*=\arg\max_W\sum_i\log f(x^{(i)},W)$ (train with gradient descent)
- $p(x)=\prod^T_{t=1}p(x_t|x_1, ...,x_{t-1})$ -> RNN
- PixelRNN
  - Generate image pixels one at a time, starting at the upper-left corner
  - $h_{x,y}=f(h_{x-1,y},h_{x,y-1},W)$
  - Drawback: sequential generation is slow at both training and inference
- PixelCNN
Variational Autoencoders: intractable density -> optimize a lower bound
- (Regular) autoencoders:
  - Use the features to reconstruct the input data with a decoder
  - Loss function: $||\hat x-x||^2_2$
  - Features need to be lower dimensional than the input data: to capture meaningful factors of variation in the data
  - The encoder can be used to initialize a supervised model
  - No way to sample new data from the learned model
- Probabilistic spin:
  - Learn latent features z from raw data
  - Sample from the model to generate new data
  - Assume a simple prior p(z), e.g. Gaussian
  - Represent p(x|z) with a neural network (similar to the decoder)
  - Independence assumption: pixels are conditionally independent given the latent features
- Training:
  - Maximize the likelihood of the data: $p_\theta(x)=\frac{p_\theta(x|z)p_\theta(z)}{p_\theta(z|x)}$
  - Train a decoder that takes a latent code z and gives a distribution over data x: $p_\theta(x|z)=N(\mu_{x|z},\Sigma_{x|z})$
  - Train another network (the encoder) that takes data x and gives a distribution over latent codes z: $q_\phi(z|x)=N(\mu_{z|x},\Sigma_{z|x})$
  - If we can ensure that $q_\phi(z|x)\approx p_\theta(z|x)$, then we can approximate $p_\theta(x)\approx\frac{p_\theta(x|z)p_\theta(z)}{q_\phi(z|x)}$
  - Jointly train both encoder and decoder to maximize the variational lower bound on the data likelihood: $\log p_\theta(x)\geq E_{z\sim q_\phi(z|x)}[\log p_\theta(x|z)]-D_{KL}(q_\phi(z|x),p(z))$
    1. Run the input data through the encoder to get a distribution over latent codes
    2. The encoder output should match the prior p(z)
    3. Sample a code z from the encoder output
    4. Run the sampled code through the decoder to get a distribution over data samples
    5. The original input data should be likely under the distribution output from (4)
- Generating data:
  1. Sample z from the prior p(z)
  2. Run the sampled z through the decoder to get a distribution over data x
  3. Sample from the distribution in (2) to generate data
- Edit images: modify some dimensions of the sampled z
- Combining VAE + autoregressive models: VQ-VAE2
Generative Adversarial Networks
- Idea: introduce a latent variable $z$ with a simple prior $p(z)$. Sample $z \sim p(z)$ and pass it to a Generator Network $x = G(z)$; then $x$ is a sample from the generator distribution $p_G$. We want $p_G=p_{data}$
  - Train the Generator Network G to convert z into fake data x sampled from $p_G$ by fooling the discriminator D
  - Train the Discriminator Network D to classify data as real or fake (0/1)
- Jointly train the generator G and discriminator D with a minimax game: $\min_G\max_D\big(E_{x\sim p_{data}}[\log D(x)]+E_{z\sim p(z)}[\log(1-D(G(z)))]\big)=\min_G\max_D V(G,D)$
- Use alternating gradient updates. For t in 1, ..., T:
  - Update D: $D=D+\alpha_D\frac{\partial V}{\partial D}$ (gradient ascent)
  - Update G: $G=G-\alpha_G\frac{\partial V}{\partial G}$ (gradient descent)
- Problem: vanishing gradients for G at the beginning
  - Solution: train G to maximize $-\log(D(G(z)))$ instead (maximize the likelihood of the discriminator being wrong)
- The minimax game achieves its global minimum when $p_G=p_{data}$
Reinforcement Learning
- The agent sees a state, which may be noisy or incomplete
- The agent takes an action based on what it sees
- A reward tells the agent how well it is doing
- The action causes a change to the environment
- The agent updates
Why RL is hard:
- Stochasticity: rewards and state transitions may be random
- Credit assignment: reward $r_t$ may not directly depend on action $a_t$
- Nondifferentiable: can't backprop through the world; can't compute $dr_t/da_t$
- Nonstationary: what the agent experiences depends on how it acts
Markov Decision Process (MDP): a tuple $(S,A,R,P,\gamma)$
- $S$: set of possible states
- $A$: set of possible actions
- $R$: distribution of reward given a (state, action) pair
- $P$: transition probability: distribution over the next state given a (state, action) pair
- $\gamma$: discount factor (tradeoff between future and present rewards)
- The agent executes a policy $\pi$ giving a distribution over actions conditioned on states (a function from S to A that specifies what action to take in each state)
- Objective: find the policy $\pi^*$ that maximizes the cumulative discounted reward $\sum_t\gamma^tr_t$
- Interaction loop:
  - At time step t = 0, the environment samples the initial state $s_0\sim p(s_0)$
  - Then, for t = 0 until done:
    - The agent selects an action $a_t\sim \pi(a|s_t)$
    - The environment samples a reward $r_t\sim R(r|s_t, a_t)$
    - The environment samples the next state $s_{t+1}\sim P(s|s_t, a_t)$
    - The agent receives the reward $r_t$ and the next state $s_{t+1}$
- GridWorld
- Problem: lots of randomness
  - Solution: maximize the expected sum of rewards: $\pi^*=\arg\max_\pi E[\sum_{t\geq0}\gamma^tr_t|\pi]$
- Value function and Q function:
  - Value function: $V^\pi(s)=E[\sum_{t\geq0}\gamma^tr_t|s_0=s,\pi]$
  - Q function: $Q^\pi(s,a)=E[\sum_{t\geq0}\gamma^tr_t|s_0=s,a_0=a,\pi]$
- Optimal Q-function: $Q^*(s,a)=\max_\pi E[\sum_{t\geq0}\gamma^tr_t|s_0=s,a_0=a,\pi]$
  - $Q^*$ encodes the optimal policy: $\pi^*(s)=\arg\max_{a'}Q^*(s,a')$
  - Bellman equation: $Q^*(s,a)=E_{r,s'}[r+\gamma \max_{a'}Q^*(s',a')]$, where $r\sim R(s,a),\ s'\sim P(s,a)$
- If we find a function $Q(s,a)$ that satisfies the Bellman equation, then it must be $Q^*$
- Start with a random Q, and use the Bellman equation as an update rule: $Q_{i+1}(s,a)=E_{r,s'}[r+\gamma \max_{a'}Q_i(s',a')]$, where $r\sim R(s,a),\ s'\sim P(s,a)$
  - $Q_i$ converges to $Q^*$ as $i\rightarrow\infty$
  - Problem: not scalable; must compute $Q(s,a)$ for every state-action pair
  - Solution: use a function approximator to estimate $Q(s,a)$ -> a neural network
- Train a neural network (with weights $\theta$) to approximate $Q^*$: $Q^*(s,a)\approx Q(s,a;\theta)$
  - $y_{s,a,\theta}=E_{r,s'}[r+\gamma \max_{a'}Q^*(s',a';\theta)]$, where $r\sim R(s,a),\ s'\sim P(s,a)$
  - Loss for training $Q$: $L(s,a)=(Q(s,a;\theta)-y_{s,a,\theta})^2$
- Policy gradients: train a network $\pi_\theta(a|s)$ that takes a state as input and gives a distribution over which action to take in that state
- Objective function: $J(\theta)=E_{x\sim p_\theta}[\sum_{t\geq0}\gamma^tr_t]$
- Find the optimal policy by maximizing $\theta^*=\arg\max_\theta J(\theta)$ (using gradient ascent)
- Problem: nondifferentiable, and the gradient estimator $\frac{\partial J}{\partial \theta}=E_{x\sim p_\theta}\left[\left(\sum_{t\geq0}r_t\right)\sum_{t\geq 0}\frac{\partial}{\partial \theta}\log\pi_\theta(a_t|s_t)\right]$ suffers from high variance, so it requires a lot of samples
- SOTA: Proximal Policy Optimization
(Soft) Actor-Critic, Model-Based RL, Imitation Learning, Inverse Reinforcement Learning, Adversarial Learning, Stochastic Computation Graphs
- We will discover interesting new types of deep models
  - Neural ODEs: $\frac{dh}{dt}=f(h(t),t,\theta)$
- Deep learning will find new applications
  - Deep learning for Computer Science / Mathematics
- Deep learning will use more data and compute
  - New hardware for deep learning
- Models are biased
- We need new theory
- Deep learning needs a lot of labeled training data (self-supervised learning)
- Deep learning doesn't "understand" the world