Seminar in probability theory: Levent Sagun (EPFL)
Asking why the stochastic gradient descent (SGD) algorithm works well for training deep neural networks leads to considerations about the geometry of the underlying loss function. Recently, we have gained a lot of insight into how tuning SGD leads to better or worse generalization properties on a given model and task. Furthermore, we have a reasonably large set of observations suggesting that more parameters typically lead to better accuracy, as long as the training process is not hampered. In this talk, I will speculatively argue that as long as the model is over-parameterized (OP), all solutions are equivalent up to finite-size fluctuations.
We will start by reviewing some of the recent literature on the geometry of the loss function, and how SGD navigates the landscape in the OP regime. Then we will see how to define OP by finding a sharp transition in the model's ability to fit its training set. Finally, we will discuss how this critical threshold is connected to the generalization properties of the model, and argue that life beyond this threshold is (more or less) as good as it gets.