Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions

Authors: Mengdi Wang, Ethan X. Fang, Han Liu

DOI: 10.1007/S10107-016-1017-3

Abstract: Classical stochastic gradient methods are well suited for minimizing expected-value objective functions. However, they do not apply to the minimization of a nonlinear function involving expected values, i.e., the composition of two expected-value functions: the problem $\min_x \mathbf{E}_v\left[ f_v\big(\mathbf{E}_w[g_w(x)]\big)\right]$. In order to solve this problem, we propose a class of stochastic compositional gradient descent (SCGD) algorithms that can be viewed as stochastic versions of the quasi-gradient method. SCGD update the solutions based on noisy sample gradients of $f_v, g_w$ and use an auxiliary variable to track the unknown quantity $\mathbf{E}_w\left[g_w(x)\right]$. We prove that the SCGD iterates converge almost surely to an optimal solution for convex optimization problems, as long as such a solution exists. The convergence analysis involves the interplay of two iterations with different time scales. For nonsmooth convex problems, SCGD achieves a convergence rate of $\mathcal{O}(k^{-1/4})$ in the general case and $\mathcal{O}(k^{-2/3})$ in the strongly convex case, after taking $k$ samples. For smooth convex problems, SCGD can be accelerated to converge at a rate of $\mathcal{O}(k^{-2/7})$ in the general case and $\mathcal{O}(k^{-4/5})$ in the strongly convex case. For nonconvex problems, we prove that any limit point generated by SCGD is a stationary point, for which we also provide a rate analysis. Indeed, the stochastic setting where one wants to optimize compositions of expected-value functions is very common in practice. The proposed SCGD methods find wide applications in learning, estimation, dynamic programming, etc.
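To make the two-time-scale idea concrete, below is a minimal Python sketch of the basic SCGD iteration on a toy problem. The update rule follows the abstract's description (an auxiliary variable $y$ tracks $\mathbf{E}_w[g_w(x)]$ via a running weighted average, while $x$ takes quasi-gradient steps through $y$); the function names (`g_sample`, `jac_g_sample`, `grad_f_sample`), the toy objective, and the specific step-size exponents are illustrative assumptions, not code from the paper.

```python
import numpy as np

def scgd(x0, grad_f_sample, g_sample, jac_g_sample, n_iters=50000):
    """Basic SCGD sketch with two time scales:
    y tracks the unknown inner expectation E_w[g_w(x)];
    x is updated with a noisy quasi-gradient computed through y."""
    x = np.asarray(x0, dtype=float)
    y = g_sample(x)  # initialize the tracking variable with one sample
    for k in range(1, n_iters + 1):
        alpha = k ** -0.75  # outer step size (decays faster)
        beta = k ** -0.5    # inner tracking step size (decays slower)
        # inner update: running weighted average tracking E_w[g_w(x)]
        y = (1.0 - beta) * y + beta * g_sample(x)
        # outer update: chain rule with a sampled Jacobian of g and gradient of f
        x = x - alpha * jac_g_sample(x).T @ grad_f_sample(y)
    return x

# toy problem: min_x (E_w[x + w])^2 with E[w] = 0, so the optimum is x = 0
rng = np.random.default_rng(0)
g_sample = lambda x: x + rng.normal(size=x.shape)  # g_w(x) = x + w
jac_g_sample = lambda x: np.eye(x.size)            # Jacobian of g_w is I
grad_f_sample = lambda y: 2.0 * y                  # f(u) = ||u||^2

print(scgd(np.array([5.0]), grad_f_sample, g_sample, jac_g_sample))  # ~0
```

The key design point is that the tracking step size `beta` decays more slowly than `alpha`, so the auxiliary variable is updated on a faster time scale than the solution iterate; this is the interplay of two time scales the abstract refers to.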
