Split the dataset in 3 parts. Train on the training set, choose the model based on the dev set, and do a final check on the test set.
The dev and test sets just need to be big enough to give high confidence on model performance.
Training on different distributions: for example, training with images downloaded from the web (which may be professionally taken) vs evaluating on images submitted by users (amateur photographers). In this case, make sure the dev and test sets come from the amateur-taken pictures, so at least our metric is trustworthy.
- High bias (training data performance)?
  - Bigger network
  - Train longer
  - NN architecture
- High variance (dev set performance)?
  - More data
  - Regularization
  - NN architecture
# L2 penalty: sum of squared weights of every layer, scaled by lambda / (2m)
L2_regularization_cost = lambd * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) / (2 * m)
lambda is a hyperparameter, chosen based on dev set performance (named lambd in code because lambda is a reserved word in Python)
keep_prob = 0.8                                             # probability of keeping a unit
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # dropout mask for layer 3
a3 = a3 * d3                                                # shut down ~20% of the units
a3 = a3 / keep_prob                                         # scale up to keep the expected value (inverted dropout)
- Random flip (horizontal / vertical)
- Random crop
- Random zoom
- Random shift
- Random distortions
- Purple = no momentum
- Blue = with momentum
- Red = with momentum and a higher beta
Beta is usually 0.9
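A minimal sketch of the momentum update for one parameter (W, dW and the hyperparameter values are placeholders; v_dW starts at zero):

```python
def update_with_momentum(W, dW, v_dW, beta=0.9, alpha=0.01):
    # exponentially weighted average of the gradients
    v_dW = beta * v_dW + (1 - beta) * dW
    # step in the direction of the smoothed gradient instead of the raw gradient
    W = W - alpha * v_dW
    return W, v_dW
```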
- Blue = no decay
- Green = with decay
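A sketch of one common decay schedule (alpha0 and decay_rate are example values, not taken from the notes):

```python
alpha0 = 0.2       # initial learning rate (example value)
decay_rate = 1.0   # example value

def decayed_lr(epoch_num):
    # learning rate shrinks as 1 / (1 + decay_rate * epoch) over the epochs
    return alpha0 / (1 + decay_rate * epoch_num)
```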
- Old intuition: getting stuck in bad local optima
- Actual (in high-dimensional spaces):
  - Unlikely to get stuck in a bad local optimum; most zero-gradient points are saddle points
  - Plateaus can make learning slow
- Use random search
- Do not use grid search
For instance:
Suppose hyperparameter 2 is not as relevant as hyperparameter 1. With a grid search we only try 5 distinct values of hyperparameter 1, whereas with random search we try several more.
- Not all hyperparameters should be sampled from a uniform distribution
For instance:
In the case of the learning rate, a reasonable value could be in the range [0.0001, 1). If we were to sample uniformly, 90% of our values would be in the range [0.1, 1). Instead, we sample on a log scale.
r = -4 * np.random.rand()  # r is uniform in [-4, 0]
lr = 10 ** r               # lr is uniform on a log scale in [10^-4, 1]; note ** (^ is XOR in Python)
- Speeds up learning
- Like input feature mean and std normalization, but applied to intermediate values (usually the pre-activations z[l], sometimes the activations a[l])
- Gamma and beta are learnable parameters. They allow the normalized values to have a non-zero mean and non-unit variance, and even allow z_tilde to recover the identity mapping if need be.
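A minimal numpy sketch of the batch norm computation for one layer's pre-activations Z of shape (units, batch size); gamma and beta are the learnable parameters mentioned above, eps is a small constant for numerical stability:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = np.mean(Z, axis=1, keepdims=True)   # per-unit mean over the mini-batch
    var = np.var(Z, axis=1, keepdims=True)   # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * Z_norm + beta             # Z_tilde: learnable scale and shift
```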
Orthogonalization (or orthogonality) is a system design property that ensures that modifying an instruction or a component of an algorithm does not create or propagate side effects to other components of the system. It becomes easier to verify the algorithms independently from one another, which reduces testing and development time.
When a supervised learning system is designed, these are the 4 assumptions that need to be true and orthogonal.
- Fit training set well in cost function (~ Human level performance)
- If it doesn’t fit well, the use of a bigger neural network or switching to a better optimization algorithm might help.
- Fit dev set well on cost function
- If it doesn't fit well, regularization or using a bigger training set might help.
- Fit test set well on cost function
- If it doesn't fit well, the use of a bigger development set might help.
- Performs well in real world
- If it doesn't perform well, the dev/test set is not set correctly or the cost function is not evaluating the right thing.
- Early Stopping is not orthogonal; it interferes with all the steps above.
To choose a classifier, a well-defined development set and an evaluation metric speed up the iteration process.
- Precision: of all the images we predicted y=1, what fraction actually have cats?
Algorithm | US | China | India | Other | Average |
---|---|---|---|---|---|
A | 3% | 7% | 5% | 9% | 6% |
B | 5% | 6% | 5% | 10% | 6.5% |
C | 2% | 3% | 4% | 5% | 3.5% |
D | 5% | 8% | 7% | 2% | 5.5% |
E | 4% | 5% | 2% | 4% | 3.75% |
F | 7% | 11% | 8% | 12% | 9.5% |
There are different metrics to evaluate the performance of a classifier; they are called evaluation metrics. They can be categorized as satisficing and optimizing metrics. It is important to note that these evaluation metrics must be evaluated on a training set, a development set or on the test set.
Classifier | Accuracy | Running Time |
---|---|---|
A | 90% | 80ms |
B | 92% | 95ms |
C | 95% | 1500ms |
In this case, accuracy and running time are the evaluation metrics. Accuracy is the optimizing metric, because you want the classifier to detect a cat image as accurately as possible. The running time, which is set to be under 100 ms in this example, is the satisficing metric, which means that the metric just has to meet the expectation set.
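A tiny sketch of this selection rule in code (the tuples mirror the table above):

```python
# (name, accuracy, running time in ms)
classifiers = [("A", 0.90, 80), ("B", 0.92, 95), ("C", 0.95, 1500)]

# satisficing metric: running time must stay under 100 ms
feasible = [c for c in classifiers if c[2] <= 100]

# optimizing metric: among the feasible classifiers, pick the most accurate
best = max(feasible, key=lambda c: c[1])
print(best)  # ('B', 0.92, 95)
```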
Setting up the training, development and test sets has a huge impact on productivity. It is important to choose the development and test sets from the same distribution, and they must be taken randomly from all the data. Choose a development set and test set that reflect the data you expect to get in the future and consider important to do well on.
- Old way of splitting data
We had smaller data sets, therefore we had to use a greater percentage of the data to develop and test ideas and models.
- Modern era – Big data
Now, because a large amount of data is available, we don't have to compromise as much and can use a greater portion to train the model.
- Set up the size of the test set to give a high confidence in the overall performance of the system.
- The test set helps evaluate the performance of the final classifier; it could be less than 30% of the whole data set.
- The development set has to be big enough to evaluate different ideas.
A cat classifier tries to find a large number of cat images to show to cat-loving users. The evaluation metric used is classification error.
Algorithm | Classification Error (%) |
---|---|
A | 3% |
B | 5% |
It seems that Algorithm A is better than Algorithm B since it has only a 3% error; however, for some reason, Algorithm A is letting through a lot of pornographic images. Algorithm B has 5% error, so it misclassifies more images, but it doesn't let through pornographic images. From a company's point of view, as well as from a user acceptance point of view, Algorithm B is actually the better algorithm. The evaluation metric fails to correctly rank order preferences between algorithms, so the evaluation metric (or the development/test set) should be changed.
The misclassification error metric can be written as follows:

Error = (1 / m_dev) * sum_i 1{ y_pred(i) != y(i) }

This function counts up the number of misclassified examples. The problem with this evaluation metric is that it treats pornographic and non-pornographic images equally. One way to change this evaluation metric is to add a weight term w(i), e.g. w(i) = 1 for non-pornographic images and w(i) = 10 for pornographic ones. The function becomes:

Error = (1 / sum_i w(i)) * sum_i w(i) * 1{ y_pred(i) != y(i) }
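A minimal numpy sketch of the weighted error (y_hat, y and w are assumed to be arrays over the dev set):

```python
import numpy as np

def weighted_error(y_hat, y, w):
    # w(i) = 1 for normal images and, e.g., 10 for pornographic images
    mistakes = (y_hat != y).astype(float)
    return np.sum(w * mistakes) / np.sum(w)
```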
- Correctly define an evaluation metric that helps better rank order classifiers
- Bayes optimal error is defined as the best possible error. In other words, no function mapping from x to y can surpass a certain level of accuracy.
Also, when the performance of machine learning is worse than the performance of humans, you can improve it with different tools. They are harder to use once it surpasses human-level performance.
These tools are:
- Get labeled data from humans
- Gain insight from manual error analysis: Why did the person get this right?
- Better analysis of bias/variance
By knowing what the human-level performance is, it is possible to tell whether the model is fitting the training set well or not.
Error | Scenario A | Scenario B |
---|---|---|
Human error | 1% | 7.5% |
Training error | 8% | 8% |
Development error | 10% | 10% |
- Scenario A - Use bias reduction
- Scenario B - Use variance reduction
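A tiny sketch of the comparison, using scenario A's numbers:

```python
human, train, dev = 0.01, 0.08, 0.10   # scenario A
avoidable_bias = train - human         # 0.07 -> large, so focus on bias reduction
variance = dev - train                 # 0.02
```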
Evaluate multiple ideas in parallel
Ideas for cat detection:
- Fix picture of dogs being classified as cats
- Fix great cats (lions, panthers...) being classified as non cat
- Improve performance on blurry images
Get some of the misclassified examples and manually check them, so as to have an idea of what we might do next and to get insight into why the errors might have happened.
Images | Dog | Great cats | Blurry | Comments |
---|---|---|---|---|
1 | X | | | Pitbull |
2 | X | X | | |
3 | | | | |
n | | | | |
% | 8% | 40% | 50% | |
- Deep Learning is robust to random data mislabeling, but not systematic mislabeling (e.g. all white dogs annotated as cats)
- Use manual error analysis with a table as described above where one column is for mislabeled samples.
- Check % of errors due to mislabeled samples.
In a case in which you have two distributions, one big and one small, but you actually care about the performance on the small one (e.g. it could be the data provided by your users), the best thing to do is to build the dev/test sets from the small dataset and train with the big one. This way you at least make sure you are not measuring performance "biased" towards the big dataset.
You could also add some of the small dataset into the training set. But dev/test sets should still consist of the small set and have a reasonable size.
Let's say you are taking the approach above and get
- Training set error = 1%
- Dev set error = 10%
This could be because of two reasons:
- High variance (overfitting)
- The data in the dev set is harder; the images, for example, may not be as clear as the ones in the training set.
One way to identify which of these two possible causes is the actual cause is to create a training-dev set: data that comes from the same distribution as the training set but is not used during training. It can be used for this kind of "pre-validation".
Now you could have:
- Training set error = 1%
- Training-dev set error = 9%
- Dev set error = 10%
The issue is high variance (overfitting)
- Training set error = 1%
- Training-dev set error = 1%
- Dev set error = 10%
The dev set is "harder"
Human error <-Avoidable Bias-> Training set error <-Variance-> Training-dev set error <-Data mismatch-> Dev error <-Overfitting to dev set-> Test error
- If you have data mismatch, carry out manual error analysis.
- Make training set more similar to dev/test
- Artificial data synthesis (be careful not to synthesize data that covers only a subset of all possible examples)
You want to train on Task B, using a pre-trained model trained with Task A
Transfer Learning makes sense when
- Task A and B have similar input (images, audio...)
- You have more data for Task A than B (train on A, fine tune on B)
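A minimal Keras sketch of this idea (the base model, input size, head and optimizer are example choices, not prescribed by the notes):

```python
import tensorflow as tf

# features pre-trained on Task A (here: ImageNet classification)
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # freeze the Task A features

# new head for Task B (example: binary classification)
inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(task_b_images, task_b_labels, epochs=5)  # fine-tune on the (smaller) Task B data
```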
Merge several tasks into one. For instance, in the context of autonomous vehicles, we might have several tasks based on the same data (an image): detect people, cars, signs and traffic lights. We can merge these tasks by having our labels be quadruples, one value for each "class".
The loss function could be the average of losses for each output.
With the loss defined this way, you can also train the model on examples where the quadruple contains unlabeled entries, such as (1, 0, ?, 1).
You just make the loss sum only over the entries that are not "?" (see the sketch below).
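A minimal numpy sketch of that masked loss for one example, using np.nan to stand in for "?" (y_hat are the 4 sigmoid outputs):

```python
import numpy as np

def multitask_loss(y_hat, y):
    # y: array of 0/1 labels with np.nan where the label is unknown ("?")
    mask = ~np.isnan(y)
    y, y_hat = y[mask], y_hat[mask]
    # sum of per-task logistic losses over the labeled entries only
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```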
Multi-task learning makes sense when
- The tasks can benefit from same low level features
- Amount of data for each task is similar
Convolutional output dimension is given by n - f + 1 (n x n input, f x f filter, no padding, stride 1).
Solves two issues:
- Shrinking output
- Throwing away information from the edges, because edge pixels are used in fewer computations than the central ones
- No padding: n - f + 1
- With padding p: n + 2p - f + 1
- With padding p and stride s: floor((n + 2p - f) / s) + 1
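A small helper capturing the general formula (integer division implements the floor):

```python
def conv_output_dim(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_dim(6, 3))            # 4  (no padding, stride 1)
print(conv_output_dim(6, 3, p=1))       # 6  ("same" padding)
print(conv_output_dim(7, 3, p=0, s=2))  # 3
```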
- Each of the 27 elements of the 3x3x3 filter is multiplied with its corresponding element in the current 3x3x3 slice of the input and summed over, resulting in one pixel of the resulting 2D image.
- The number of channels (depth) of our image and of the filter volume always has to be the same.

The above image contains 2 filters, which applied to a single RGB image result in a 4x4x2 volume. The subsequent convolutions applied to this output volume should use 2-channel filters.

- Each filter W also has a bias term, which is added channel-wise to the output of the convolution operation.
- An activation function is applied to each of these elements.
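A minimal numpy sketch of one such step (a_slice is the current 3x3x3 slice of the input, W the filter, b its bias):

```python
import numpy as np

def conv_single_step(a_slice, W, b):
    # elementwise product over the whole volume, summed to a single number, plus the bias
    return np.sum(a_slice * W) + float(b)
```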
- The intuition behind Max Pooling is that if some feature was detected in some quadrant, that quadrant will have a high value, so this information should be preserved in the output volume.
- Max Pooling over a volume produces an output with the same number of channels as the input (max pooling is taken channel-wise).
- Sometimes it is used deep in the net to collapse a dimension such as 7x7x1000 down to 1x1x1000
- Dimension goes down
- Filters go up
- Conv -> Pool -> Conv -> Pool
- Bigger
- ReLU
- Uniform
- Dimension goes down
- Filters go up
"Plain Network":
- ResNets are robust against Vanishing/Exploding Gradients, allowing us to create deeper networks.
If the parameters of layer l+2 are small (e.g. because of regularization), then because of the skip connection of the residual block it is easy for the block's "function" to become the identity function, meaning a[l+2] == a[l]. Therefore, the additional residual block doesn't hurt performance in the worst case.
- To make sure that z[l+2] and a[l] can be added, "same" convolutions are used throughout.
When these two tensors' shapes are different, a[l] is multiplied by a matrix Ws:
- Ws could be a fixed matrix that effectively zero-pads a[l] to the right dimension
- Ws could be learned parameters
- An alternative approach is discussed in 2.2 of the assignment for the 2nd week
- The extra 1x1 convolutions are used to reduce the volume depth to reduce the computation cost.
- Object "Localization" - at most, a single object
- Object "Detection" - multiple objects
- Target label sample:
Pc indicates if there is a class present.
If there isn't an object present (Pc = 0), we "don't care" about the rest of the output values, meaning their loss is not computed.
You can use different loss functions for each output value, e.g. log likelihood for the class and boolean values and MSE for the bounding box.
- Sliding Windows
- Train a convnet to be able to find the object
- Use the convnet in sliding windows over the image with reasonable stride
- Do it again, only now with a bigger window
- Same as above with even bigger window
- Hopefully these windows would have detected the objects if any
Sliding windows has high compute cost
- Sliding windows normal convnet topology
- Sliding windows fully convolutional topology (mathematically same as above)
Each one of the 4 output rows corresponds to the respective input corner crop as illustrated.
Max Pool stride is same as sliding window stride.
Combines the idea of 4.11.3 with the target labels encoding of 4.11.2 to output a single volume where each row outputs the result for each image's region.
- bx, by, bh, bw encoding
- bx - [0, 1] relative to the size of the subimage (horizontal center)
- by - [0, 1] relative to the size of the subimage (vertical center)
- bh - relative to the size of the subimage (height; can be > 1)
- bw - relative to the size of the subimage (width; can be > 1)
Yolo paper: https://arxiv.org/pdf/1506.02640v5.pdf
- Sometimes the same object may be detected multiple times
- Non-max suppression: keep the detection with the highest confidence and erase the nearby bounding boxes that have a high IoU with it
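A minimal sketch of the IoU computation used by non-max suppression (boxes given as (x1, y1, x2, y2) corners):

```python
def iou(box1, box2):
    # intersection rectangle
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)

    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)
```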
- How to make a grid cell detect multiple objects
- You pre-define 2 different shapes (or more)
- Anchor box 1
- Anchor box 2
- Your target y now turns out to be
The first 8 are associated with Anchor box 1 and the latter 8 with Anchor box 2.
- Each object in a training image is assigned to the grid cell that contains that object's midpoint
- Learn from one image or a small number.
- Run the same DNN on different inputs and compare the results.
- The output is an n-dimensional vector which is an embedding of the input image.
- 3 images (anchor, positive, negative)
- Alpha (the margin) is used to make sure the negative image has a "different score" by at least some margin. Otherwise the DNN could output f(img) = 0 for every image and the condition would be satisfied.
- Relevant paper: https://arxiv.org/abs/1503.03832
- Choose triplets that are "hard" (not random), otherwise gradient descent won't do anything, because the DNN would be getting everything right and it would be trivial to satisfy the condition <= 0.
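A minimal numpy sketch of the triplet loss for one triplet (f_a, f_p, f_n are the embeddings of the anchor, positive and negative images; alpha is the margin):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    pos_dist = np.sum((f_a - f_p) ** 2)  # squared distance anchor-positive
    neg_dist = np.sum((f_a - f_n) ** 2)  # squared distance anchor-negative
    return max(pos_dist - neg_dist + alpha, 0.0)
```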
- Use Logistic Regression to output whether the two images are the same. The input features would be the absolute differences between the elements of the two images' embeddings.
- G - Generated
- S - Style
- C - Content
- Is based on a hidden layer L
- Use pre-trained ConvNet
- Let a[l][C] and a[l][G] be the activation of layer l on the images
- If a[l][C] and a[l][G] are similar, both images have similar content
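A minimal numpy sketch of the content cost, using the 1/(4 * n_H * n_W * n_C) normalization from the course assignment (a_C and a_G are the layer-l activations, shape (n_H, n_W, n_C)):

```python
import numpy as np

def content_cost(a_C, a_G):
    n_H, n_W, n_C = a_G.shape
    return np.sum((a_C - a_G) ** 2) / (4 * n_H * n_W * n_C)
```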
- Is based on a hidden layer l
- Measures how correlated the channels of a given layer's activations are (the Gram / style matrix), and compares these correlations between the two images.
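A minimal numpy sketch of the style (Gram) matrix and the per-layer style cost (a_S and a_G have shape (n_H, n_W, n_C); the normalization follows the course assignment):

```python
import numpy as np

def gram_matrix(a):
    n_H, n_W, n_C = a.shape
    A = a.reshape(n_H * n_W, n_C)  # unroll the spatial dimensions
    return A.T @ A                 # (n_C, n_C) channel-to-channel correlations

def style_cost_layer(a_S, a_G):
    n_H, n_W, n_C = a_G.shape
    GS, GG = gram_matrix(a_S), gram_matrix(a_G)
    return np.sum((GS - GG) ** 2) / (4 * (n_C ** 2) * (n_H * n_W) ** 2)
```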
- a<0> is usually initialized with a zero vector.
- To compute the hidden state we usually use tanh
- To compute y_hat we use sigmoid for binary classification or softmax for multi-class classification, or any other activation that makes sense for the kind of output we are working with.
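A minimal numpy sketch of one RNN time step under these choices (Waa, Wax, Wya, ba, by are the parameters; softmax is used for the output):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    a_next = np.tanh(Waa @ a_prev + Wax @ x_t + ba)  # hidden state update (tanh)
    y_hat = softmax(Wya @ a_next + by)               # multi-class output at this time step
    return a_next, y_hat
```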
Simplifying the computations
For One2Seq type architectures, the output of the previous time step is given as input to the next time step.
Seq2Seq with different input/output lengths
- There are 2 parts: an encoder and a decoder.
- The final hidden state of the encoder acts as a vector representation of the input sequence, which is then decoded by the decoder.
- This is useful to generate embeddings for sequences.