Have we fully understood deep learning?
Something to ponder: how do we distinguish a neural network that generalizes well from one that doesn't?
Paper authors: Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals
laid out a premise for rethinking generalization in neural networks, one that could help us build robust models from scratch by choosing the right model complexity for the data available. It prompts the question: "What distinguishes a neural network that generalizes well from one that doesn't?"
Answering this question would pave the way toward interpreting neural networks and toward mathematical explanations for the gap between training and test error. In a typical neural network, the number of parameters is far greater than the number of training samples. This gives the network enough capacity for brute-force memorization, large enough to shatter the training data. The surprising fact is that, despite this huge memorization capacity, deep neural networks generalize quite well. The traditional bias-variance convention fails to explain how networks in this over-parameterized regime still generalize to test data. Another surprise: as parametrization increases (that is, as model complexity grows), the divergence between training and test error often shrinks. The authors therefore concluded that the number of parameters is not the right way to measure model complexity. What, then, is the right measure?
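To make the over-parameterization point concrete, here is a small back-of-the-envelope sketch (our illustration, not from the paper; the layer widths are arbitrary) counting the parameters of a modest fully connected network on CIFAR-10-sized inputs and comparing that to the 50,000 training images CIFAR-10 provides:

```python
# Illustrative sketch: parameter count vs. sample count for a small
# fully connected network on CIFAR-10-shaped data. Layer widths are
# hypothetical choices, not taken from the paper.

def mlp_param_count(layer_sizes):
    """Total weights + biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# CIFAR-10: 32x32x3 inputs, 10 classes, 50,000 training images.
params = mlp_param_count([32 * 32 * 3, 512, 512, 10])
samples = 50_000

print(params)            # 1841162 parameters
print(params / samples)  # roughly 37 parameters per training sample
```

Even this small network has dozens of parameters per training example, which is why counting parameters alone says little about whether it will generalize.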
To understand this further, the authors used a variant of the well-known randomization test from non-parametric statistics as the core of their methodology. They wanted to study what kind of network is required to memorize a finite sample. They ran experiments on the ImageNet and CIFAR-10 datasets, introducing randomness at the input level (corrupting the input images), the algorithmic level (varying the architecture), and the output level (shuffling the labels). These experiments revealed something surprising: even with randomized labels and inputs, the network fits the training set perfectly and converges without any change to the learning-rate schedule. A convergence slowdown was expected after randomizing the labels, but none was observed. The experiments point to the conclusion that "deep learning can fit random labels": the network's capacity is sufficient to memorize the entire dataset, and it can reach 100% training accuracy regardless of whether the data has any structure. The authors do not claim this as a universal fact, which opens up a new space of research. They drew the implication that explicit regularization may improve generalization, but it is neither necessary nor by itself sufficient.
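The output-level randomization can be sketched in a few lines. This is a minimal stdlib-only illustration of the label-corruption step (the function and variable names are ours, not the paper's): a `corruption` fraction of labels is replaced by uniformly random classes, with `corruption=1.0` giving the fully random-label setting the paper studies.

```python
import random

def corrupt_labels(labels, num_classes, corruption, seed=0):
    """Replace each label with a uniform random class with prob. `corruption`."""
    rng = random.Random(seed)
    out = []
    for y in labels:
        if rng.random() < corruption:
            out.append(rng.randrange(num_classes))  # random class
        else:
            out.append(y)                           # keep the true label
    return out

# Toy dataset: 1000 examples over 10 classes.
true_labels = [i % 10 for i in range(1000)]
random_labels = corrupt_labels(true_labels, num_classes=10, corruption=1.0)

# With fully random labels, any relationship between inputs and labels is
# destroyed; agreement with the true labels drops to chance level (~0.1),
# yet the paper finds networks still fit such labels perfectly.
agreement = sum(t == r for t, r in zip(true_labels, random_labels)) / len(true_labels)
print(agreement)
```

Partial corruption (e.g. `corruption=0.5`) lets one interpolate between the clean and fully random settings, which is how the paper traces the transition in generalization error.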
We believe that early stopping might also help generalization, something the authors did not investigate in depth. Based on the article's conclusion, generalization depends more on the model architecture itself (e.g., it fails on AlexNet), and regularizers may help marginally but are not the fundamental reason for achieving it. Learning and optimization need not correlate with each other. The authors also observe that SGD by itself performs a form of implicit regularization. They further explored the finite-sample expressivity of neural networks. These results show that a precise, formal understanding of generalization is yet to be discovered.
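For readers unfamiliar with the early-stopping idea mentioned above, here is a minimal sketch (our own illustration; the paper prescribes no such algorithm): training halts once the validation loss has failed to improve for `patience` consecutive evaluations, and the model from the best epoch is kept.

```python
# Minimal early-stopping sketch over a recorded validation-loss curve.
# `patience` is the number of non-improving evaluations tolerated
# before stopping; the returned value is the epoch of the best model.

def early_stop_epoch(val_losses, patience=3):
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch  # stop early; best model was here
    return best_epoch

# Validation loss improves, then plateaus and rises (overfitting begins).
losses = [1.0, 0.7, 0.5, 0.45, 0.46, 0.48, 0.50, 0.55]
print(early_stop_epoch(losses))  # → 3
```

In practice the same loop would checkpoint the model weights at each new best epoch, but the stopping logic is exactly this comparison against the best loss seen so far.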