UDC 517.2+519.977.58+519.8
V.I. Norkin
V.M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, Kyiv, Ukraine, and National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, Ukraine
vladimir.norkin@gmail.com
STOCHASTIC GENERALIZED GRADIENT METHODS FOR TRAINING
NONCONVEX NONSMOOTH NEURAL NETWORKS
Abstract. The paper observes a similarity between the stochastic optimal control of discrete dynamical systems and the training of multilayer neural networks. It focuses on contemporary deep networks with nonconvex nonsmooth loss and activation functions. Machine learning problems are treated as nonconvex nonsmooth stochastic optimization problems. The so-called generalized-differentiable functions are used as a model of nonsmooth nonconvex dependencies. The backpropagation method for calculating stochastic generalized gradients of the learning quality functional of such systems is substantiated on the basis of the Hamilton–Pontryagin formalism. Stochastic generalized gradient learning algorithms are extended to the training of nonconvex nonsmooth neural networks. The performance of a stochastic generalized gradient algorithm is illustrated on a linear multiclass classification problem.
Keywords: machine learning, deep learning, multilayer neural networks, nonsmooth nonconvex optimization, stochastic optimization, stochastic generalized gradient.
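To make the abstract's reference to the Hamilton–Pontryagin formalism concrete, the following sketch writes a feedforward network as a discrete dynamical system and states the adjoint recursion behind backpropagation. The notation ($x^l$, $W_l$, $f_l$, $\psi^l$, $\varphi$) is illustrative and not taken from the paper's full text.

A network with weights $W = (W_0, \dots, W_{L-1})$ propagates an input $x^0$ through the state equations
\[
  x^{l+1} = f_l(x^l, W_l), \qquad l = 0, \dots, L-1,
\]
and the learning quality functional is $J(W) = \mathbb{E}_{(x^0, y)}\, \varphi(x^L, y)$. Introducing the Hamiltonians
\[
  H_l(x^l, W_l, \psi^{l+1}) = \langle \psi^{l+1}, f_l(x^l, W_l) \rangle,
\]
the backward (adjoint) recursion
\[
  \psi^L \in \partial_x \varphi(x^L, y), \qquad
  \psi^l \in \partial_{x^l} H_l, \qquad
  g_l \in \partial_{W_l} H_l, \qquad l = L-1, \dots, 0,
\]
reproduces backpropagation. When $f_l$ and $\varphi$ are generalized-differentiable rather than smooth, the sets $\partial(\cdot)$ above are understood as the corresponding generalized gradient sets, and the vector $g = (g_0, \dots, g_{L-1})$ computed for a sampled pair $(x^0, y)$ serves as a stochastic generalized gradient of $J$ at $W$.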
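The abstract also states that the algorithm's performance is illustrated on a linear multiclass classification problem. The Python sketch below is a hedged illustration of what one stochastic generalized (sub)gradient iteration could look like in that setting, not the paper's experimental code; the nonsmooth multiclass hinge loss, the function names, and the $1/\sqrt{t}$ step rule are assumptions.

import numpy as np

def multiclass_hinge_subgradient(W, x, y):
    # Loss max_j (s_j - s_y + [j != y]) for one example (x, y);
    # W has shape (n_classes, n_features), x has shape (n_features,).
    scores = W @ x
    margins = scores - scores[y] + 1.0
    margins[y] = 0.0
    j = int(np.argmax(margins))        # most violating class
    loss = max(margins[j], 0.0)
    G = np.zeros_like(W)
    if loss > 0.0:                     # at a kink, any element of the subdifferential works
        G[j] += x
        G[y] -= x
    return loss, G

def sgg_train(X, labels, n_classes, n_iters=10_000, seed=0):
    # Stochastic generalized (sub)gradient descent with diminishing steps.
    rng = np.random.default_rng(seed)
    W = np.zeros((n_classes, X.shape[1]))
    for t in range(1, n_iters + 1):
        i = rng.integers(len(X))                                   # sample one training example
        _, G = multiclass_hinge_subgradient(W, X[i], labels[i])
        W -= G / np.sqrt(t)                                        # classical diminishing step size
    return W

Predictions for a new feature vector x are obtained as np.argmax(W @ x); averaging the iterates W over t is a common variant for nonsmooth problems.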