1 Introduction
Deep Neural Networks (DNNs) are often massively overparameterized (Zhang et al., 2016), yet show state-of-the-art generalization performance on a wide variety of tasks when trained with Stochastic Gradient Descent (SGD). While understanding the generalization capability of DNNs remains an open challenge, it has been hypothesized that SGD acts as an implicit regularizer, limiting the complexity of the found solution (Poggio et al., 2017; Advani and Saxe, 2017; Wilson et al., 2017; Jastrzębski et al., 2017).
Various links between the curvature of the final minima reached by SGD and generalization have been studied (Murata et al., 1994; Neyshabur et al., 2017). In particular, it is a popular view that models corresponding to wide minima of the loss in the parameter space generalize better than those corresponding to sharp minima (Hochreiter and Schmidhuber, 1997; Keskar et al., 2016; Jastrzębski et al., 2017; Wang et al., 2018). The existence of this empirical correlation between the curvature of the final minima and generalization motivates our study.
Our work aims at understanding the interaction between SGD and the sharpest directions of the loss surface, i.e. those corresponding to the largest eigenvalues of the Hessian. In contrast to studies such as those by Keskar et al. (2016) and Jastrzębski et al. (2017) our analysis focuses on the whole training trajectory of SGD rather than just on the endpoint. We will show in Sec. 3.1 that the evolution of the largest eigenvalues of the Hessian follows a consistent pattern for the different networks and datasets that we explore. Initially, SGD is in a region of broad curvature, and as the loss decreases, SGD visits regions in which the top eigenvalues of the Hessian are increasingly large, reaching a peak value with a magnitude influenced by both learning rate and batch size. After that point in training, we typically observe a decrease or stabilization of the largest eigenvalues.
To further understand this phenomenon, we study the dynamics of SGD in relation to the sharpest directions in Sec. 3.2 and Sec. 3.3. Projecting onto the sharpest directions (that is, considering $(g^T e_i) e_i$ for different $i$, where $g$ is the gradient and $e_i$ is the normalized eigenvector corresponding to the $i$-th largest eigenvalue of the Hessian), we see that the regions visited in the beginning resemble bowls with curvatures such that an SGD step is typically too large, in the sense that an SGD step cannot get near the minimum of this bowl-like subspace; rather it steps from one side of the bowl to the other, see Fig. 1 for an illustration.
Finally in Sec. 4 we study further practical consequences of our observations and investigate an SGD variant which uses a reduced and fixed learning rate along the sharpest directions. In most cases we find this variant optimizes faster and leads to a sharper region, which generalizes the same or better compared to vanilla SGD with the same (small) learning rate. While we are not proposing a practical optimizer, these results may open a new avenue for constructing effective optimizers tailored to the DNNs’ loss surface in the future.
On the whole this paper exposes and analyses SGD dynamics in the subspace of the sharpest directions. In particular, we argue that the SGD dynamics along the sharpest directions influence the regions that SGD steers to (where larger learning rate or smaller batch size result in wider regions visited), the training speed, and the final generalization capability.
2 Experimental setup and notation
We perform experiments mainly on Resnet32 (in Resnet32 we omit BatchNormalization layers due to their interaction with the loss surface curvature (Bjorck et al., 2018) and use initialization scaled by the depth of the network (Taki, 2017); additional results on BatchNormalization are presented in the Appendix) and a simple convolutional neural network, which we refer to as SimpleCNN (details in Appendix D), and the CIFAR10 dataset (Krizhevsky et al.). SimpleCNN is a 4-layer CNN. For training both of the models we use standard data augmentation on CIFAR10 and, for Resnet32, L2 regularization. We additionally investigate the training of VGG11 (Simonyan and Zisserman, 2014) on the CIFAR10 dataset (we adapted the final classification layers for its 10 classes) and of a bidirectional Long Short-Term Memory (LSTM) model (following the "small" architecture employed by
Zaremba et al. (2014), with added dropout regularization) on the Penn Tree Bank (PTB) dataset. All models are trained using SGD, without momentum, if not stated otherwise.

The notation and terminology we use in this paper are now described. We will use $t$ (time) to refer to epoch or iteration, depending on the context. By $\eta$ and $S$ we denote the SGD learning rate and batch size, respectively. $\mathbf{H}$ is the Hessian of the empirical loss at the current $D$-dimensional parameter value $\theta$ evaluated on the training set, and its eigenvalues are denoted as $\lambda_i$, $i = 1, \ldots, D$ (ordered by decreasing absolute magnitudes). $\lambda_1$ is the maximum eigenvalue, which is equivalent to the spectral norm of $\mathbf{H}$. The top $K$ eigenvectors of $\mathbf{H}$ are denoted by $e_i$, for $i \in \{1, \ldots, K\}$, and referred to in short as the sharpest directions. We will refer to the mini-batch gradient calculated based on a batch of size $S$ as $g(t)$ and to $\eta g(t)$ as the SGD step. We will often consider the projection of this gradient onto one of the top eigenvectors, given by $\tilde{g}^{(i)}(t) = (g(t)^T e_i) e_i$, where $i \in \{1, \ldots, K\}$. Computing the full spectrum of the Hessian for reasonably large models is computationally infeasible. Therefore, we approximate the top $K$ eigenvalues using the Lanczos algorithm (Lanczos, 1950; Dauphin et al., 2014), an extension of the power method, on a subset of the training data (using more data was not beneficial). When regularization was applied during training (such as dropout, L2 or data augmentation), we apply the same regularization when estimating the Hessian. This ensures that the Hessian reflects the loss surface accessible to SGD. The code for the project is made available at
https://github.com/kudkudak/dnn_sharpest_directions.

3 A study of the Hessian along the SGD path
In this section, we study the eigenvalues of the Hessian of the training loss along the SGD optimization trajectory, and the SGD dynamics in the subspace corresponding to the largest eigenvalues. We highlight that SGD steers from the beginning towards increasingly sharp regions until some maximum is reached; at this peak the SGD step length is large compared to the curvature along the sharpest directions (see Fig. 1 for an illustration). Moreover, SGD visits flatter regions for a larger learning rate or a smaller batch size.
3.1 Largest eigenvalues of the Hessian along the SGD path




We first investigate the training loss curvature in the sharpest directions, along the training trajectory for both the SimpleCNN and Resnet32 models.
Largest eigenvalues of the Hessian grow initially.
In the first experiment we train SimpleCNN and Resnet32 using SGD with a fixed learning rate $\eta$ and batch size $S$, and estimate the largest eigenvalues of the Hessian throughout training. As shown in Fig. 2 (top), the spectral norm (which corresponds to the largest eigenvalue), as well as the other tracked eigenvalues, grows in the first epochs up to a maximum value. After reaching this maximum value, we observe a relatively steady decrease of the largest eigenvalues in the following epochs.
To investigate the evolution of the curvature in the first epochs more closely, we track the eigenvalues at each iteration during the beginning of training, see Fig. 2 (bottom). We observe that initially the magnitudes of the largest eigenvalues grow rapidly. After this initial growth, the eigenvalues alternate between decreasing and increasing; this behavior is also reflected in the evolution of the accuracy. This suggests SGD is initially driven to regions that are difficult to optimize due to large curvature.
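The eigenvalue tracking used throughout this section relies on the Lanczos approximation described in Sec. 2. The following is a minimal, self-contained sketch on a toy quadratic loss whose Hessian is an explicit symmetric matrix; all names and sizes here are illustrative assumptions, and for a real network the matrix-vector product would be a Hessian-vector product obtained by automatic differentiation, never the full $D \times D$ Hessian:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

# Toy stand-in for the training loss: L(theta) = 0.5 * theta^T A theta,
# so the Hessian is the constant symmetric matrix A.
rng = np.random.default_rng(0)
D = 200
M = rng.standard_normal((D, D))
A = (M + M.T) / 2  # symmetric "Hessian"

def hvp(v):
    # Hessian-vector product; for a DNN this would come from autodiff.
    return A @ v

H_op = LinearOperator((D, D), matvec=hvp, dtype=np.float64)

# The Lanczos method (eigsh) approximates the K eigenvalues of largest
# magnitude using only Hessian-vector products.
K = 5
top_eigvals = eigsh(H_op, k=K, which="LM", return_eigenvectors=False)
spectral_norm = float(np.max(np.abs(top_eigvals)))
```

Repeating this estimate at each iteration (or epoch) of training yields the curves tracked in Fig. 2.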
To study this further we look at full-batch gradient descent training of Resnet32 (to avoid memory limitations we preselected a fixed subset of the CIFAR10 training examples to simulate full-batch gradient training for this experiment). This experiment is also motivated by the instability of large-batch training reported in the literature, as for example by Goyal et al. (2017). In the case of Resnet32 (without BatchNormalization) we can clearly see that the magnitude of the largest eigenvalues of the Hessian grows initially, which is followed by a sharp drop in accuracy, suggesting instability of the optimization, see Fig. 3. We also observed that the instability is partially resolved through the use of BatchNormalization layers, consistent with the findings of Bjorck et al. (2018), see Fig. 9 in the Appendix. Finally, we report some additional results on the late phase of training, e.g. the impact of learning rate schedules, in Fig. 12 in the Appendix.




Learning rate and batch size limit the maximum spectral norm.
Next, we investigate how the choice of learning rate and batch size impacts the SGD path in terms of its curvature. Fig. 4 shows the evolution of the two largest eigenvalues of the Hessian during training of the SimpleCNN and Resnet32 on CIFAR10, and an LSTM on PTB, for different values of $\eta$ and $S$. We observe in this figure that a larger learning rate or a smaller batch size correlates with a smaller and earlier peak of the spectral norm and the second largest eigenvalue. Note that the phase in which the curvature grows can take many epochs for low learning rates or large batch sizes. Additionally, momentum has an analogous effect: using a larger momentum leads to a smaller peak of the spectral norm, see Fig. 13 in the Appendix. Similar observations hold for VGG11 and Resnet32 using BatchNormalization, see Appendix A.1.
Summary.
These results show that the learning rate and batch size not only influence the curvature at the endpoint of SGD, but also impact the whole SGD trajectory. A high learning rate or a small batch size limits the maximum spectral norm along the path taken by SGD from the beginning of training. While this behavior was observed in all settings examined (see also the Appendix), future work could focus on a theoretical analysis, helping to establish the generality of these results.
3.2 Sharpest direction and SGD step
The training dynamics (which, as we will see later, affect the speed and generalization capability of learning) are significantly affected by the evolution of the largest eigenvalues discussed in Sec. 3.1. To demonstrate this we study the relationship between the SGD step and the shape of the loss surface in the sharpest directions. As we will show, the SGD dynamics are largely coupled with the shape of the loss surface in the sharpest direction, in the sense that, when projected onto the sharpest direction, the typical step taken by SGD is too large compared to the curvature to enable it to reduce the loss. We study the same SimpleCNN and Resnet32 models as in the previous experiment, in the first epochs of training with SGD with $\eta = 0.01$ and a fixed batch size $S$.
The sharpest direction and the SGD step.
First, we investigate the relation between the SGD step and the sharpest direction by looking at how the loss value changes on average when moving from the current parameters by taking a step only along the sharpest direction, see Fig. 6 (left). For all training iterations, we compute $\mathbb{E}[L(\theta(t) - \alpha \eta \tilde{g}^{(1)}(t))]$ for a range of step-length multipliers $\alpha$; the expectation is approximated by an average over different mini-batch gradients. For both SimpleCNN and Resnet32, we find that the expected loss increases relative to $L(\theta(t))$ for the larger values of $\alpha$ and decreases for the smaller ones. The observation that the SGD step does not minimize the loss along the sharpest direction suggests that optimization is ineffective along this direction. This is also consistent with the observation that the learning rate and batch size limit the maximum spectral norm of the Hessian (as both impact the SGD step length).
These dynamics are important for the overall training due to a high alignment of the SGD step with the sharpest directions. We compute the average cosine similarity between the mini-batch gradient $g(t)$ and the top sharpest directions $e_i$. We find the gradient to be highly aligned with the sharpest direction; depending on $\eta$ and the model, the maximum average cosine is large compared to the alignment expected for a random direction. Full results are presented in Fig. 5.
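This alignment measurement can be illustrated with a small sketch on synthetic gradients; the eigenvector, noise scale, and batch of simulated gradients below are assumptions for illustration, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 1000
e1 = np.zeros(D)
e1[0] = 1.0  # assumed unit-norm top eigenvector of the Hessian

def avg_abs_cosine(grads, direction):
    # Average |cos| between each mini-batch gradient and a fixed direction.
    dots = np.abs(grads @ direction)
    norms = np.linalg.norm(grads, axis=1) * np.linalg.norm(direction)
    return float(np.mean(dots / norms))

# Simulated mini-batch gradients: a shared component along e1 plus noise.
grads = 3.0 * e1 + 0.5 * rng.standard_normal((64, D))
alignment = avg_abs_cosine(grads, e1)

# For a random direction in D dimensions the expected |cos| is only
# about sqrt(2 / (pi * D)) -- the baseline shown in Fig. 5.
random_baseline = float(np.sqrt(2.0 / (np.pi * D)))
```

Even a modest shared component along $e_1$ produces an alignment far above the random baseline, which is the qualitative effect reported here.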
[Fig. 5 caption: average cosine similarity (color) between the mini-batch gradient and the top sharpest directions, evaluated at different levels of accuracy during training (x axis). For comparison, the horizontal purple line is the alignment with a random vector in the parameter space. Curves were smoothed for clarity.]
Qualitatively, the SGD step crosses the minimum along the sharpest direction.
Next, we qualitatively visualize the loss surface along the sharpest direction in the first few epochs of training, see Fig. 6 (right). To better reflect the relation between the sharpest direction and the SGD step, we scale the visualization by the expected norm of the SGD step, $\mathbb{E}[\|\eta g(t)\|]$, where the expectation is over mini-batch gradients. Specifically, we evaluate $L(\theta(t) + \alpha \, \mathbb{E}[\|\eta g(t)\|] \, e_1)$, where $\theta(t)$ is the current parameter vector and $\alpha$ is an interpolation parameter. For both the SimpleCNN and Resnet32 models, we observe that the loss on the scale of $\mathbb{E}[\|\eta g(t)\|]$ starts to show a bowl-like structure in the direction of the largest eigenvalue after six epochs. This further corroborates the previous result that the SGD step length is large compared to the curvature in the sharpest direction.

Summary.
We infer that SGD steers toward a region in which the SGD step is highly aligned with the sharpest directions and would, on average, increase the loss along the sharpest directions if restricted to them. This suggests in particular that optimization is ineffective along the sharpest directions, which we study further in Sec. 4.
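The interpolation along the sharpest direction described in this subsection can be sketched on a toy quadratic loss; the curvature spectrum, parameter point, and step scale below are assumptions for illustration:

```python
import numpy as np

# Quadratic stand-in for the loss with a distinguished sharpest direction.
D = 50
curvatures = np.linspace(0.1, 20.0, D)
H = np.diag(curvatures)
e1 = np.eye(D)[-1]  # eigenvector of the largest eigenvalue (20.0)

def loss(theta):
    return 0.5 * theta @ H @ theta

rng = np.random.default_rng(2)
theta = 0.1 * rng.standard_normal(D)
eta = 0.1
step_scale = eta * np.linalg.norm(H @ theta)  # proxy for E[||eta * g||]

# Loss profile along e1, on the scale of the expected SGD step length.
alphas = np.linspace(-1.0, 1.0, 21)
profile = np.array([loss(theta + a * step_scale * e1) for a in alphas])
```

Plotting `profile` against `alphas` reproduces the bowl-like shape seen in Fig. 6 (right): on the scale of a typical SGD step, the loss along the sharpest direction is strongly convex.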
3.3 How SGD steers to sharp regions in the beginning
Here we discuss the dynamics around the initial growth of the spectral norm of the Hessian. We will look at some variants of SGD which change how the sharpest directions get optimized.
Experiment.
We used the same SimpleCNN model initialized with the parameters reached by SGD in the previous experiment at the end of an early epoch. The parameter updates used by the three SGD variants, which are compared to vanilla SGD (blue), are based on the mini-batch gradient $g$ as follows: variant 1 (SGD top, orange) only updates the parameters based on the projection of the gradient onto the top eigenvector direction, i.e. $\Delta\theta = -\eta (g^T e_1) e_1$; variant 2 (SGD constant top, green) performs updates along the constant direction given by the top eigenvector $e_1(0)$ of the Hessian in the first iteration, i.e. $\Delta\theta = -\eta (g^T e_1(0)) e_1(0)$; variant 3 (SGD no top, red) removes the gradient component in the direction of the top eigenvector, i.e. $\Delta\theta = -\eta (g - (g^T e_1) e_1)$. We show results in the left two plots of Fig. 7. We observe that if we only follow the top eigenvector, we get to wider regions but do not reach lower loss values, and conversely, if we ignore the top eigenvector we reach lower loss values but sharper regions.
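The three update rules can be sketched as follows; the quadratic toy loss and parameter values are assumptions for illustration (in the experiment, $e_1$ is the top Hessian eigenvector of the network, and the "constant top" variant simply passes the eigenvector computed at the first iteration instead of the current one):

```python
import numpy as np

def sgd_variant_step(theta, g, e1, eta, variant):
    # e1 is assumed to be unit norm.
    proj = (g @ e1) * e1
    if variant == "vanilla":
        return theta - eta * g
    if variant == "top":     # update only along the sharpest direction
        return theta - eta * proj
    if variant == "no_top":  # remove the sharpest-direction component
        return theta - eta * (g - proj)
    raise ValueError(variant)

# Tiny example on a 2-D quadratic loss with Hessian diag(10, 1).
H = np.diag([10.0, 1.0])
theta = np.array([1.0, 1.0])
g = H @ theta              # gradient of 0.5 * theta^T H theta
e1 = np.array([1.0, 0.0])  # sharpest direction
```

Calling `sgd_variant_step` with `variant="top"` moves only the first coordinate, while `"no_top"` moves only the second, mirroring the decomposition used in Fig. 7.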
Summary.
The take-home message is that SGD updates in the top eigenvector direction strongly influence the overall path taken in the early phase of training. Based on these results, we will study a related variant of SGD throughout training in the next section.
4 Optimizing faster while finding a good sharp region
In this final section we study how the convergence speed of SGD and the generalization of the resulting model depend on the dynamics along the sharpest directions.
Our starting point is the set of results presented in Sec. 3, which show that while the SGD step can be highly aligned with the sharpest directions, on average it fails to minimize the loss if restricted to these directions. This suggests that reducing the alignment of the SGD update direction with the sharpest directions might be beneficial, which we investigate here via a variant of SGD that we call Nudged-SGD (NSGD). Our aim here is not to build a practical optimizer, but instead to see if our insights from the previous section can be utilized in an optimizer. NSGD is implemented as follows: instead of using the standard SGD update $\Delta\theta(t) = -\eta g(t)$, NSGD uses a different learning rate, $\gamma\eta$, along just the top $K$ eigenvectors, while following the normal SGD gradient along all the other directions (while NSGD can be seen as a second-order method, in contrast to typical second-order methods it does not adapt the learning rate to be in some sense optimal given the curvature; we include in Appendix E a discussion of the differences between NSGD and the Newton method). In particular we will study NSGD with a low base learning rate $\eta$, which will allow us to capture any implicit regularization effects NSGD might have. We ran experiments with Resnet32 and SimpleCNN on CIFAR10. Note that these are not state-of-the-art models, which we leave for future investigation.
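The NSGD update can be sketched as follows, assuming the top $K$ eigenvectors are available as the orthonormal columns of a matrix $E$; all names here are illustrative:

```python
import numpy as np

def nsgd_step(theta, g, E, eta, gamma):
    # E: (D, K) matrix whose columns are the top K Hessian eigenvectors
    # (assumed orthonormal). The sharpest subspace gets learning rate
    # gamma * eta; every other direction gets the plain rate eta.
    g_sharp = E @ (E.T @ g)  # component of g inside the sharpest subspace
    g_rest = g - g_sharp
    return theta - gamma * eta * g_sharp - eta * g_rest
```

With `gamma = 1` this reduces to vanilla SGD; with `gamma < 1` progress along the sharpest directions is slowed down, which is the regime studied below.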
We investigated NSGD for a varying number $K$ of sharpest eigenvectors and for different values of the rescaling factor $\gamma$. The top eigenvectors are recomputed at the beginning of each epoch (in these experiments each epoch of NSGD takes moderately longer compared to vanilla SGD; this overhead depends on the number of iterations needed for the Lanczos algorithm, used for computing the top eigenvectors, to converge). We compare the sharpness of the reached endpoint by computing both the Frobenius norm (approximated from the top eigenvalues) and the spectral norm of the Hessian. The learning rate is decayed when the validation loss has not improved for a number of epochs. Experiments are averaged over two random seeds. When talking about generalization we will refer to the test accuracy at the epoch of best validation accuracy. Results for Resnet32 are summarized in Fig. 8 and Tab. 1; for full results on SimpleCNN we refer the reader to Tab. 2 in the Appendix. In the following we highlight the two main conclusions we can draw from the experimental data.
[Tab. 1: Test acc., Val. acc. (50), Loss, and Dist. for NSGD with different settings of $K$ and $\gamma$, and for the baselines SGD(0.005) and SGD(0.1).]
NSGD optimizes faster, whilst traversing a sharper region.
First, we observe that in the early phase of training NSGD optimizes significantly faster than the baseline, whilst traversing a region which is an order of magnitude sharper. We start by looking at the impact of $K$, which controls the number of eigenvectors with an adapted learning rate; we test several values of $K$ with a fixed $\gamma$. On the whole, increasing $K$ correlates with a significantly improved training speed and with visiting much sharper regions (see Fig. 8). We highlight that NSGD with the largest tested $K$ reaches a substantially higher accuracy than baseline SGD in the early phase, and retains an advantage in validation accuracy even after many epochs of training (see Tab. 1).
NSGD can improve the final generalization performance, while finding a sharper final region.
Next, we turn our attention to the results on the final generalization and sharpness. We observe from Tab. 1 that using a rescaling factor $\gamma < 1$ can result in finding a significantly sharper endpoint exhibiting a slightly improved generalization performance compared to baseline SGD using the same $\eta$. On the contrary, using $\gamma > 1$ led to a wider endpoint and worse generalization, perhaps due to the added instability. Finally, using a larger $K$ generally correlates with an improved generalization performance (see Fig. 8, right).
More specifically, baseline SGD using the same learning rate reached a lower test accuracy, with a smaller Frobenius norm of the Hessian, than NSGD with $\gamma < 1$, for both Resnet32 and SimpleCNN. Finally, note that in the case of Resnet32 the improved generalization of NSGD closes the gap to SGD using a larger learning rate. We note that the runs generally converge at a low final cross-entropy loss and a high training accuracy.
As discussed in Sagun et al. (2017), the structure of the Hessian can be highly dataset dependent; thus the demonstrated behavior of NSGD, and in particular its impact on the final generalization, could be dataset dependent as well. In Appendix C and Appendix F we include results on the CIFAR100, Fashion-MNIST (Xiao et al., 2017) and IMDB (Maas et al., 2011) datasets, but studies on more diverse datasets are left for future work. In these cases we observed similar behavior regarding faster optimization and steering towards a sharper region, while the generalization of the final region was not always improved. Finally, we refer to Appendix C for additional studies using a high base learning rate and momentum.
Summary.
We have investigated what happens if SGD uses a reduced learning rate along the sharpest directions. We show that this variant of SGD, i.e. NSGD, steers towards sharper regions from the beginning. Furthermore, NSGD is capable of optimizing faster and finding well-generalizing sharp minima, i.e. regions of the loss surface at convergence which are sharper compared to those found by vanilla SGD using the same low learning rate, while exhibiting better generalization performance. Note that in contrast to Dinh et al. (2017), the sharp regions that we investigate here are the endpoints of an optimization procedure, rather than the result of a mathematical reparametrization.
5 Related work
Tracking the Hessian: The largest eigenvalues of the Hessian of the loss of DNNs were investigated previously, but mostly in the late phase of training. Some notable exceptions are: LeCun et al. (1998), who first tracked the Hessian spectral norm, and in whose work the initial growth is reported (though not commented on); Sagun et al. (2016), who report that the spectral norm of the Hessian reduces towards the end of training; and Keskar et al. (2016), who observe that a sharpness metric grows initially for large batch size, but only decays for small batch size. Our observations concern the eigenvalues and eigenvectors of the Hessian, which follow a consistent pattern, as discussed in Sec. 3.1. Finally, Yao et al. (2018) study the relation between the Hessian and adversarial robustness at the endpoint of training.
Wider minima generalize better: Hochreiter and Schmidhuber (1997) argued that wide minima should generalize well. Keskar et al. (2016) provided empirical evidence that the width of the endpoint minima found by SGD relates to generalization and to the batch size used. Jastrzębski et al. (2017) extended this by finding a correlation of the width with the learning-rate-to-batch-size ratio. Dinh et al. (2017) demonstrated the existence of reparametrizations of networks which keep the loss value and generalization performance constant while increasing the sharpness of the associated minimum, implying that it is not just the width of a minimum which determines generalization. Recent work has further explored the importance of curvature for generalization (Wen et al., 2018; Wang et al., 2018).
Stochastic gradient descent dynamics: Our work is related to studies on SGD dynamics such as Goodfellow et al. (2014); Chaudhari and Soatto (2017); Xing et al. (2018); Zhu et al. (2018). In particular, Zhu et al. (2018) recently investigated the importance of noise along the top eigenvectors for escaping sharp minima, by comparing SGD with other optimizer variants at the final minima. In contrast, we show that from the beginning of training SGD visits regions in which the SGD step is too large compared to the curvature. Concurrently with this work, Xing et al. (2018), by interpolating the loss between parameter values at consecutive iterations, show that the loss is roughly convex, whereas we show a related phenomenon by investigating the loss in the subspace of the sharpest directions of the Hessian.
6 Conclusions
The somewhat puzzling empirical correlation between the curvature of the endpoint reached in DNN training and its generalization properties motivated our study. Our main contribution is exposing the relation between SGD dynamics and the sharpest directions, and investigating its importance for training. SGD steers from the beginning towards increasingly sharp regions of the loss surface, up to a level dependent on the learning rate and the batch size. Furthermore, the SGD step is large compared to the curvature along the sharpest directions, and highly aligned with them.
Our experiments suggest that understanding the behavior of optimization along the sharpest directions is a promising avenue for studying generalization properties of neural networks. Additionally, results such as those showing the impact of the SGD step length on the regions visited (as characterized by their curvature) may help design novel optimizers tailored to the loss surface of neural networks.
Acknowledgements
SJ was supported by Grant No. DI 2014/016644 from Ministry of Science and Higher Education, Poland and No. 2017/25/B/ST6/01271 from National Science Center, Poland. Work at Mila was funded by NSERC, CIFAR, and Canada Research Chairs. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732204 (Bonseyes). This work is supported by the Swiss State Secretariat for Education‚ Research and Innovation (SERI) under contract number 16.0159.
References
 Advani and Saxe (2017) Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
 Bjorck et al. (2018) J. Bjorck, C. Gomes, and B. Selman. Understanding Batch Normalization. ArXiv e-prints, May 2018.
 Chaudhari and Soatto (2017) P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. ArXiv e-prints, October 2017.
 Chollet et al. (2015) François Chollet et al. Keras. https://github.com/kerasteam/keras, 2015.
 Dauphin et al. (2014) Yann Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. CoRR, abs/1406.2572, 2014. URL http://arxiv.org/abs/1406.2572.
 Dinh et al. (2017) Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.
 Goodfellow et al. (2014) I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. ArXiv e-prints, December 2014.
 Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
 Jastrzębski et al. (2017) Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.

 Keskar et al. (2016) N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ArXiv e-prints, September 2016.
 Krizhevsky et al. Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
 Lanczos (1950) Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J. Res. Natl. Bur. Stand. B, 45:255–282, 1950. doi: 10.6028/jres.045.026.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, pages 9–50, London, UK, 1998. Springer-Verlag. ISBN 3540653112. URL http://dl.acm.org/citation.cfm?id=645754.668382.

 Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 142–150, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. ISBN 9781932432879. URL http://dl.acm.org/citation.cfm?id=2002472.2002491.
 Murata et al. (1994) Noboru Murata, Shuji Yoshizawa, and Shun-ichi Amari. Network information criterion - determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6):865–872, 1994.
 Neyshabur et al. (2017) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. CoRR, abs/1706.08947, 2017. URL http://arxiv.org/abs/1706.08947.
 Poggio et al. (2017) Tomaso Poggio, Kenji Kawaguchi, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Xavier Boix, Jack Hidary, and Hrushikesh Mhaskar. Theory of deep learning iii: explaining the nonoverfitting puzzle. arXiv preprint arXiv:1801.00173, 2017.
 Sagun et al. (2016) Levent Sagun, Léon Bottou, and Yann LeCun. Singularity of the hessian in deep learning. arXiv preprint arXiv:1611.07476, 2016.
 Sagun et al. (2017) Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of overparametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.
 Taki (2017) Masato Taki. Deep residual networks and weight initialization. CoRR, abs/1709.02956, 2017. URL http://arxiv.org/abs/1709.02956.
 Wang et al. (2018) H. Wang, N. Keskar, C. Xiong, and R. Socher. Identifying Generalization Properties in Neural Networks. ArXiv e-prints, September 2018.
 Wen et al. (2018) W. Wen, Y. Wang, F. Yan, C. Xu, C. Wu, Y. Chen, and H. Li. SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning. ArXiv e-prints, May 2018.

 Wilson et al. (2017) Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. 2017.
 Xiao et al. (2017) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. ArXiv e-prints, August 2017.
 Xing et al. (2018) C. Xing, D. Arpit, C. Tsirigotis, and Y. Bengio. A Walk with SGD. ArXiv e-prints, February 2018.
 Yao et al. (2018) Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W. Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. arXiv preprint arXiv:1802.08241, 2018.
 Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization, 2014. URL https://arxiv.org/abs/1409.2329.
 Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
 Zhu et al. (2018) Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma. The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent. ArXiv e-prints, February 2018.
Appendix A Additional results for Sec. 3.1
A.1 Additional results on the evolution of the largest eigenvalues of the Hessian
First, we show that the instability in the early phase of full-batch training is partially resolved through the use of BatchNormalization layers, consistent with the results reported by Bjorck et al. [2018]; results are shown in Fig. 9.


Next, we extend results of Sec. 3.1 to VGG11 and BatchNormalized Resnet32 models, see Fig. 11 and Fig. 10. Importantly, we evaluated the Hessian in the inference mode, which resulted in large absolute magnitudes of the eigenvalues on Resnet32.
A.2 Impact of learning rate schedule
In the paper we focused mostly on SGD using a constant learning rate and batch size. We report here the evolution of the spectral and Frobenius norms of the Hessian when using a simple learning rate schedule in which we vary the length $T$ of the first stage; we use a higher learning rate for the first $T$ epochs and drop it to a lower value afterwards, testing several values of $T$. Results are reported in Fig. 12. The main conclusion is that, depending on the learning rate schedule, in the later stages of training the curvature along the sharpest directions (measured either by the spectral norm or by the Frobenius norm) can either decay or grow. Training for a shorter time with the high learning rate (a lower $T$) led to a growth of the curvature (in terms of the Frobenius and spectral norms) after the learning rate drop, and to a lower final validation accuracy.
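The two-stage schedule used in this experiment can be sketched as follows; the rate values are placeholder assumptions, not the experiment's exact settings:

```python
def two_stage_lr(epoch, t_first, eta_high=0.1, eta_low=0.01):
    # High learning rate for the first t_first epochs, then a single drop.
    # eta_high and eta_low are assumed placeholder values.
    return eta_high if epoch < t_first else eta_low
```

Varying `t_first` reproduces the family of schedules compared in Fig. 12.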
A.3 Impact of using momentum
In the paper we focused mostly on experiments using plain SGD, without momentum. In this section we report that large momentum, similarly to a large learning rate, leads to a reduction of the spectral norm of the Hessian; see Fig. 13 for results with the VGG11 network on CIFAR-10 and CIFAR-100.
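The analogy between momentum and the learning rate can be made concrete with a standard heavy-ball calculation: under a constant gradient, the accumulated velocity converges to the gradient scaled by lr / (1 - momentum), i.e. momentum acts like an enlarged effective learning rate. A minimal sketch (numeric values hypothetical):

```python
def asymptotic_step(lr, momentum, grad=1.0, n_steps=500):
    """Heavy-ball velocity under a constant gradient converges in
    magnitude to lr * grad / (1 - momentum): momentum rescales the
    effective learning rate by 1 / (1 - momentum)."""
    v = 0.0
    for _ in range(n_steps):
        v = momentum * v - lr * grad
    return -v  # magnitude of the asymptotic per-step displacement
```

With `lr=0.01` and `momentum=0.9` the asymptotic step is ten times the plain-SGD step, which is consistent with momentum and learning rate having qualitatively similar effects on the reachable curvature.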
Appendix B Additional results for Sec. 3.2
In Fig. 14 we report the corresponding training and validation curves for the experiments depicted in Fig. 6. Next, we plot visualizations analogous to Fig. 6 for the 3rd and 5th eigenvectors; see Fig. 15 and Fig. 16, respectively. To ensure that the results do not depend on the learning rate, we rerun the Resnet32 and SimpleCNN experiments with a different learning rate; see Fig. 17.
Finally, we run the same experiment as in Sec. 3.2, but instead of focusing on the early phase we replot Fig. 6 over a longer initial stretch of training; see Fig. 18.
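The visualizations above rely on projecting the gradient onto the top eigenvectors of the Hessian, i.e. computing the coefficients <g, e_i>. The sketch below uses an explicit toy Hessian so the eigenvectors can be obtained directly; in practice they would be computed iteratively. All numeric values are hypothetical.

```python
import numpy as np

def project_on_sharpest(grad, hessian, k=1):
    """Return the coefficients <grad, e_i> of the gradient along the
    top-k Hessian eigenvectors (the 'sharpest directions')."""
    eigvals, eigvecs = np.linalg.eigh(hessian)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # re-sort descending
    top = eigvecs[:, order[:k]]                  # columns e_1, ..., e_k
    return top.T @ grad

# Toy example: diagonal Hessian, so the sharpest directions are axes.
H = np.diag([5.0, 2.0, 1.0])
g = np.array([3.0, 4.0, 12.0])
coeffs = project_on_sharpest(g, H, k=2)
```

Note that eigenvector signs are arbitrary, so only the magnitudes of the coefficients are meaningful.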
Appendix C Additional results for Sec. 4
Here we report additional results for Sec. 4. Most importantly, in Tab. 2 we report the full results for the SimpleCNN model. Next, we rerun the same experiment using the Resnet32 model on the CIFAR-100 dataset (see Tab. 5) and on the Fashion-MNIST dataset (see Tab. 6). In the case of CIFAR-100 we observe that the conclusions carry over fully. In the case of Fashion-MNIST the final generalization in the two settings is similar. Therefore, as discussed in the main text, the behavior indeed appears to be dataset dependent.
In Sec. 4 we deliberately explored NSGD in the context of a suboptimally chosen learning rate, which allowed us to test whether NSGD has any implicit regularization effect. In the next two experiments we explored how NSGD performs when using either a large base learning rate or momentum. Results are reported in Tab. 3 (large learning rate) and Tab. 4 (momentum). In both cases we observe that NSGD improves training speed and initially reaches a significantly sharper region; see Fig. 22. However, the curvature of the final region in both cases, and the final generalization when using momentum, are not significantly affected. Further study of NSGD with a high learning rate or momentum is left for future work.
[Tab. 2: columns Test acc., Val. acc. (50), Loss, and Dist., comparing NSGD variants against the SGD(0.005) and SGD(0.05) baselines; numeric entries missing.]
[Table: columns Test acc., Val. acc. (50), Loss, and Dist.; numeric entries missing.]
[Table: columns Test acc., Val. acc. (50), Loss, and Dist.; numeric entries missing.]
[Table: columns name, Test acc., Val. acc. (50), Loss, and Dist.; numeric entries missing.]
[Table: columns name, Test acc., Val. acc. (50), Loss, and Dist.; numeric entries missing.]
Appendix D SimpleCNN Model
The SimpleCNN used in this paper has four convolutional layers. All convolutional layers use the same square kernel window with 'same' padding, and each is followed by a ReLU activation function. Max-pooling is used after the second and fourth convolutional layers. After the convolutional layers there are two linear layers; a ReLU activation follows the first, and a softmax is applied after the final linear layer. The specific filter counts, kernel and pool sizes, and layer widths are given in the provided code.
Appendix E Comparing NSGD to Newton's Method
Nudged SGD (NSGD) is a second-order method in the sense that it leverages the curvature of the loss surface. In this section we argue that it is nevertheless significantly different from Newton's method, a representative second-order method.
The key reason is that, similarly to SGD, NSGD is driven to a region in which the curvature is too large compared to its typical step. In other words, NSGD does not use a learning rate matched to the curvature, which is the key design principle of second-order methods. This is visible in Fig. 20, where we report the results of a study similar to that of Sec. 3.2, but for NSGD. The loss surface appears sharper in this plot because reducing the gradients along the top sharpest directions allows optimization to proceed over significantly sharper regions.
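The "nudging" referred to above, i.e. reducing the gradient components along the top sharpest directions, can be sketched as follows. This is an illustrative sketch only: the parameter name `scale` and the toy vectors are hypothetical, and the exact update rule is the one defined in Sec. 4.

```python
import numpy as np

def nudged_gradient(grad, sharpest_dirs, scale):
    """Shrink the gradient components along the given orthonormal
    'sharpest directions' by the factor `scale`, leaving the
    orthogonal complement untouched (hypothetical parameterization)."""
    g = grad.astype(float).copy()
    for e in sharpest_dirs:          # each e: unit Hessian eigenvector
        g = g + (scale - 1.0) * (e @ g) * e
    return g

e1 = np.array([1.0, 0.0, 0.0])       # toy sharpest direction
g = np.array([4.0, 1.0, 1.0])
g_nudged = nudged_gradient(g, [e1], scale=0.25)
```

Because the directions are orthonormal, shrinking one component does not perturb the coefficients along the others.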
As discussed, the key difference stems from SGD's early bias toward maximally sharp regions. It is therefore expected that in the case of a quadratic loss surface the Newton and NSGD updates can be very similar. In the following we construct such an example. First, recall that the update in Newton's method is typically computed as:
$\Delta\theta = -\alpha H^{-1} \nabla_\theta L(\theta)$  (1)
where $\alpha$ is a scalar step size. Now, if we assume that $H$ is diagonal, with one curvature shared by the sharpest directions and another by the remaining ones, it can be seen that the NSGD update, with a suitably chosen learning rate and rescaling factor, is equivalent to that of Newton's method.
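A numerical instance of the equivalence discussed above, with hypothetical curvatures: on a diagonal quadratic with two curvature scales, a plain gradient step whose learning rate is matched to the flat curvature, and whose component along the sharpest direction is shrunk by the curvature ratio, reproduces the damped Newton step exactly.

```python
import numpy as np

# Diagonal quadratic loss L(t) = 0.5 * t @ H @ t with two curvature
# scales (all numbers hypothetical): one sharp direction with
# curvature lam_big and two flat ones with curvature lam_small.
lam_big, lam_small, alpha = 8.0, 2.0, 1.0
H = np.diag([lam_big, lam_small, lam_small])
theta = np.array([1.0, -2.0, 0.5])
grad = H @ theta

# Damped Newton step with scalar step size alpha.
newton_step = -alpha * np.linalg.inv(H) @ grad

# NSGD-style step: gradient step with lr matched to the flat
# curvature, with the component along the sharpest direction shrunk
# by the curvature ratio lam_small / lam_big.
lr = alpha / lam_small
g = grad.copy()
g[0] *= lam_small / lam_big          # sharpest direction is axis 0
nsgd_step = -lr * g
```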
Appendix F Experiments on a sentiment analysis dataset
Most of the experiments in the paper focused on image classification datasets (except for the language modeling experiments in Sec. 3.1). The goal of this section is to extend some of the experiments to the text domain. We experiment with the IMDB [Maas et al., 2011] binary sentiment classification dataset and use the simple CNN model from the Keras [Chollet et al., 2015] example repository (available at https://github.com/keras-team/keras/blob/master/examples/imdb_cnn.py).
First, we examine the impact of the learning rate and batch size on the Hessian along the training trajectory, as in Sec. 3.1, testing several values of each. As in Sec. 3.1, we observe that the learning rate and the batch size limit the maximum curvature along the training trajectory. In this experiment the phase in which the curvature grows took many epochs, in contrast to the CIFAR-10 experiments. The results are summarized in Fig. 21.
Next, we tested Nudged SGD on this dataset. To encourage overfitting, we increased the number of parameters of the base model by increasing the number of filters in the first convolutional layer and the number of neurons in the dense layer. Experiments were repeated over multiple seeds. We observe that in this setting NSGD optimizes significantly faster and initially finds a sharper region. At the same time, it does not result in finding a better generalizing region. The results are summarized in Tab. 7 and Fig. 22.
[Tab. 7: columns Test acc., Val. acc. (10), Loss, and Dist. for the three NSGD settings on IMDB; most numeric entries missing (one surviving column reads 86.79, 84.52, 78.32).]