Authors: Christopher De Sa, Yuyang Wang, Dean Foster, Youngsuk Park, Yucheng Lu
DOI:
Keywords:
Abstract: In real-world applications of large-scale time series, one often encounters the situation where the temporal patterns of time series, while drifting over time, differ from one another in the same dataset. In this paper, we provably show that under such heterogeneity, training a forecasting model with commonly used stochastic optimizers (e.g., SGD) potentially suffers from large gradient variance and thus requires long training. To alleviate this issue, we propose a sampling strategy named Subgroup Sampling, which mitigates gradient variance by sampling from pre-grouped time series. We further introduce SCott, a variance reduced SGD-style optimizer that co-designs subgroup sampling with a control variate method. In theory, we provide a convergence guarantee for SCott on smooth non-convex objectives. Empirically, we evaluate SCott and other baseline optimizers on both synthetic and real-world time series forecasting problems, and show that SCott converges faster with respect to both iterations and wall clock time. Additionally, we show two SCott variants that can speed up Adam and Adagrad without compromising the generalization of forecasting models.
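The abstract describes combining subgroup (pre-grouped) sampling with a control variate correction. Below is a minimal, hedged sketch of that idea on a toy objective: an SVRG-style update where the anchor gradient is built from per-group gradients and mini-batches are drawn within a sampled group. The data, loss, and helper names (`groups`, `grad_fn`) are hypothetical illustrations, not the authors' implementation of SCott.

```python
import numpy as np

# Sketch (assumption: series are pre-grouped so members of a group share
# similar temporal patterns). Toy least-squares loss stands in for a
# forecasting model's training loss.

rng = np.random.default_rng(0)

# Toy data: K subgroups of (X, y) pairs with different patterns per group.
K, n_per_group, d = 4, 64, 8
groups = []
for k in range(K):
    X = rng.normal(size=(n_per_group, d)) + k          # pattern shift per group
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n_per_group)
    groups.append((X, y))

def grad_fn(w, X, y):
    """Gradient of the mean squared error 0.5 * ||Xw - y||^2 / n."""
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(d)
lr, n_outer, n_inner = 0.05, 20, 10

for _ in range(n_outer):
    # Anchor point and its per-group full gradients (the control variate).
    w_anchor = w.copy()
    anchor_grads = [grad_fn(w_anchor, X, y) for X, y in groups]
    mean_anchor_grad = np.mean(anchor_grads, axis=0)   # equal-sized groups

    for _ in range(n_inner):
        # Subgroup sampling: pick a group, then a mini-batch inside it.
        k = rng.integers(K)
        X, y = groups[k]
        idx = rng.choice(n_per_group, size=8, replace=False)

        # Control-variate correction: stochastic gradient at w, minus the same
        # mini-batch gradient at the anchor, plus the mean anchor gradient.
        g = (grad_fn(w, X[idx], y[idx])
             - grad_fn(w_anchor, X[idx], y[idx])
             + mean_anchor_grad)
        w -= lr * g
```

The same corrected gradient `g` could, in principle, be fed to Adam or Adagrad in place of the plain stochastic gradient, which is the spirit of the two variants mentioned at the end of the abstract.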