Publications
Research papers and preprints.
2026
- Atlas: Adaptive Stepsizes from First PrinciplesKonstantin Mishchenko, Yuan Gao, and Sebastian U. Stich2026Manuscript
- The Dual Averaging Power-Prox Method with Application to Heavy-Tail Incremental GradientYuan Gao, Jeremy Rack, and Sebastian U. Stich2026Submitted to NeurIPS 2026
- Composite Optimization with Error Feedback: the Dual Averaging ApproachIn International Conference on Learning Representations, poster , 2026
Communication efficiency is a central challenge in distributed machine learning training, and message compression is a widely used solution. However, standard Error Feedback (EF) methods (Seide et al., 2014), though effective for smooth unconstrained optimization with compression (Karimireddy et al., 2019), fail in the broader and practically important setting of composite optimization, which captures, e.g., objectives consisting of a smooth loss combined with a non-smooth regularizer or constraints. The theoretical foundation and behavior of EF in the context of the general composite setting remain largely unexplored. In this work, we consider composite optimization with EF. We point out that the basic EF mechanism and its analysis no longer stand when a composite part is involved. We argue that this is because of a fundamental limitation in the method and its analysis technique. We propose a novel method that combines Dual Averaging with EControl (Gao et al., 2024), a state-of-the-art variant of the EF mechanism, and achieves for the first time a strong convergence analysis for composite optimization with error feedback. Along with our new algorithm, we also provide a new and novel analysis template for inexact dual averaging method, which might be of independent interest. We also provide experimental results to complement our theoretical findings.
@inproceedings{gao2026composite, title = {Composite Optimization with Error Feedback: the Dual Averaging Approach}, author = {Gao, Yuan and Rodomanov, Anton and Rack, Jeremy and Stich, Sebastian U.}, booktitle = {International Conference on Learning Representations}, year = {2026}, url = {https://openreview.net/forum?id=PSmakC4sw5}, } - Accelerated Distributed Optimization with Compression and Error FeedbackIn International Conference on Artificial Intelligence and Statistics, poster , 2026
Modern machine learning tasks often involve massive datasets and models, necessitating distributed optimization algorithms with reduced communication overhead. Communication compression, where clients transmit compressed updates to a central server, has emerged as a key technique to mitigate communication bottlenecks. However, the theoretical understanding of stochastic distributed optimization with contractive compression remains limited, particularly in conjunction with Nesterov acceleration – a cornerstone for achieving faster convergence in optimization. In this paper, we propose a novel algorithm, ADEF (Accelerated Distributed Error Feedback), which integrates Nesterov acceleration, contractive compression, error feedback, and gradient difference compression. We prove that ADEF achieves the first accelerated convergence rate for stochastic distributed optimization with contractive compression in the general convex regime. Numerical experiments validate our theoretical findings and demonstrate the practical efficacy of ADEF in reducing communication costs while maintaining fast convergence.
@inproceedings{gao2026accelerated, title = {Accelerated Distributed Optimization with Compression and Error Feedback}, author = {Gao, Yuan and Rodomanov, Anton and Rack, Jeremy and Stich, Sebastian U.}, booktitle = {International Conference on Artificial Intelligence and Statistics}, year = {2026}, url = {https://openreview.net/forum?id=JdX5bJOJLC}, }
2025
- A Bias Correction Mechanism for Distributed Asynchronous OptimizationYuan Gao, Yuki Takezawa, and Sebastian U. StichTransactions on Machine Learning Research, Oct 2025
We develop an asynchronous gradient method for training Machine Learning models with asynchronous distributed workers, each with its own communication and computation pace, and its own local data distribution. In the modern distributed machine learning training process, local data distribution across workers is often heterogeneous (a.k.a. client bias), which is a significant limiting factor in the analysis of most existing distributed asynchronous optimization methods. In this work, we propose AsyncBC, a distributed asynchronous variant of the SARAH method, and show that this is an effective Bias Correction mechanism for distributed asynchronous optimization. We show that AsyncBC can effectively manage arbitrary data heterogeneity, as well as handle gradient updates that arrive in an uncoordinated manner and with delays. As a byproduct of our analysis, we also provide a deeper understanding of the impacts of different stochasticity models on the convergence of the SARAH method.
@article{gao2025bias, title = {A Bias Correction Mechanism for Distributed Asynchronous Optimization}, author = {Gao, Yuan and Takezawa, Yuki and Stich, Sebastian U.}, journal = {Transactions on Machine Learning Research}, year = {2025}, month = oct, url = {https://openreview.net/forum?id=8doMbaah0s} } - Towards Faster Decentralized Stochastic Optimization with Communication CompressionRustem Islamov, Yuan Gao, and Sebastian U. StichIn International Conference on Learning Representations, poster , 2025
Communication efficiency has garnered significant attention as it is considered the main bottleneck for large-scale decentralized Machine Learning applications in distributed and federated settings. In this regime, clients are restricted to transmitting small amounts of quantized information to their neighbors over a communication graph. Numerous endeavors have been made to address this challenging problem by developing algorithms with compressed communication for decentralized non-convex optimization problems. Despite considerable efforts, the current results suffer from various issues such as non-scalability with the number of clients, requirements for large batches, or bounded gradient assumption. In this paper, we introduce MoTEF, a novel approach that integrates communication compression with Momentum Tracking and Error Feedback. Our analysis demonstrates that MoTEF achieves most of the desired properties, and significantly outperforms existing methods under arbitrary data heterogeneity. We provide numerical experiments to validate our theoretical findings and confirm the practical superiority of MoTEF.
@inproceedings{islamov2025faster, title = {Towards Faster Decentralized Stochastic Optimization with Communication Compression}, author = {Islamov, Rustem and Gao, Yuan and Stich, Sebastian U.}, booktitle = {International Conference on Learning Representations}, year = {2025}, url = {https://openreview.net/forum?id=CMMpcs9prj}, } - AC(k): Robust Solution of Laplacian Equations by Randomized Approximate Cholesky FactorizationYuan Gao, Rasmus Kyng, and Daniel A. SpielmanSIAM Journal on Scientific Computing, 2025To appear
We introduce a new algorithm and software for solving linear equations in symmetric diagonally dominant matrices with non-positive off-diagonal entries (SDDM matrices), including Laplacian matrices. We use pre-conditioned conjugate gradient (PCG) to solve the system of linear equations. Our preconditioner is a variant of the Approximate Cholesky factorization of Kyng and Sachdeva (FOCS 2016). Our factorization approach is simple: we eliminate matrix rows/columns one at a time and update the remaining matrix using sampling to approximate the outcome of complete Cholesky factorization. Unlike earlier approaches, our sampling always maintains a connectivity in the remaining non-zero structure. Our algorithm comes with a tuning parameter that upper bounds the number of samples made per original entry. We implement our algorithm in Julia, providing two versions, AC and AC2, that respectively use 1 and 2 samples per original entry. We compare their single-threaded performance to that of current state-of-the-art solvers Combinatorial Multigrid (CMG), BoomerAMG-preconditioned Krylov solvers from HyPre and PETSc, Lean Algebraic Multigrid (LAMG), and MATLAB with Incomplete Cholesky Factorization (ICC). Our evaluation uses a broad class of problems, including all large SDDM matrices from the SuiteSparse collection and diverse programmatically generated instances. Our experiments suggest that our algorithm attains a level of robustness and reliability not seen before in SDDM solvers, while retaining good performance across all instances. Our code and data are public, and we provide a tutorial on how to replicate our tests. We hope that others will adopt this suite of tests as a benchmark, which we refer to as SDDM2023.
@article{gao2025ack, title = {{AC(k)}: Robust Solution of Laplacian Equations by Randomized Approximate Cholesky Factorization}, author = {Gao, Yuan and Kyng, Rasmus and Spielman, Daniel A.}, journal = {SIAM Journal on Scientific Computing}, year = {2025}, note = {To appear}, url = {https://rasmuskyng.com/papers/GKS25.pdf}, }
2024
- Non-convex Stochastic Composite Optimization with Polyak MomentumYuan Gao, Anton Rodomanov, and Sebastian U. StichIn Proceedings of the 41st International Conference on Machine Learning, Jul 2024
The stochastic proximal gradient method is a powerful generalization of the widely used stochastic gradient descent (SGD) method and has found numerous applications in Machine Learning. However, it is notoriously known that this method fails to converge in non-convex settings where the stochastic noise is significant (i.e. when only small or bounded batch sizes are used). In this paper, we focus on the stochastic proximal gradient method with Polyak momentum. We prove this method attains an optimal convergence rate for non-convex composite optimization problems, regardless of batch size. Additionally, we rigorously analyze the variance reduction effect of the Polyak momentum in the composite optimization setting and we show the method also converges when the proximal step can only be solved inexactly. Finally, we provide numerical experiments to validate our theoretical results.
@inproceedings{gao2024polyak, title = {Non-convex Stochastic Composite Optimization with Polyak Momentum}, author = {Gao, Yuan and Rodomanov, Anton and Stich, Sebastian U.}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {14826--14843}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = jul, publisher = {PMLR}, url = {https://proceedings.mlr.press/v235/gao24l.html}, } - EControl: Fast Distributed Optimization with Compression and Error ControlYuan Gao, Rustem Islamov, and Sebastian U. StichIn International Conference on Learning Representations, poster , 2024
Modern distributed training relies heavily on communication compression to reduce the communication overhead. In this work, we study algorithms employing a popular class of contractive compressors in order to reduce communication overhead. However, the naive implementation often leads to unstable convergence or even exponential divergence due to the compression bias. Error Compensation (EC) is an extremely popular mechanism to mitigate the aforementioned issues during the training of models enhanced by contractive compression operators. Compared to the effectiveness of EC in the data homogeneous regime, the understanding of the practicality and theoretical foundations of EC in the data heterogeneous regime is limited. Existing convergence analyses typically rely on strong assumptions such as bounded gradients, bounded data heterogeneity, or large batch accesses, which are often infeasible in modern machine learning applications. We resolve the majority of current issues by proposing EControl, a novel mechanism that can regulate error compensation by controlling the strength of the feedback signal. We prove fast convergence for EControl in standard strongly convex, general convex, and nonconvex settings without any additional assumptions on the problem or data heterogeneity. We conduct extensive numerical evaluations to illustrate the efficacy of our method and support our theoretical findings.
@inproceedings{gao2024econtrol, title = {EControl: Fast Distributed Optimization with Compression and Error Control}, author = {Gao, Yuan and Islamov, Rustem and Stich, Sebastian U.}, booktitle = {International Conference on Learning Representations}, year = {2024}, url = {https://openreview.net/forum?id=lsvlvWB9vz}, }
2022
- Physarum Inspired Dynamics to Solve Semi-Definite ProgramsYuan Gao, Hamidreza Kamkari, Andreas Karrenbauer, and 2 more authors2022arXiv preprint
Physarum Polycephalum is a slime mold that can solve shortest path problems. A mathematical model based on Physarum’s behavior, known as the Physarum Directed Dynamics, can solve positive linear programs. In this paper, we present a family of Physarum-based dynamics extending the previous work and introduce a new algorithm to solve positive Semi-Definite Programs (SDP). The Physarum dynamics are governed by orthogonal projections (w.r.t. time-dependent scalar products) on the affine subspace defined by the linear constraints. We present a natural generalization of the scalar products used in the LP case to the matrix space for SDPs, which boils down to the linear case when all matrices in the SDP are diagonal, thus, representing an LP. We investigate the behavior of the induced dynamics theoretically and experimentally, highlight challenges arising from the non-commutative nature of matrix products, and prove soundness and convergence under mild conditions. Moreover, we consider a more abstract view on the dynamics that suggests a slight variation to guarantee unconditional soundness and convergence-to-optimality. By simulating these dynamics using suitable discretizations, one obtains numerical algorithms for solving positive SDPs, which have applications in discrete optimization, e.g., for computing the Goemans-Williamson approximation for MaxCut or the Lovasz theta number for determining the clique/chromatic number in perfect graphs.
@misc{gao2022physarum, title = {Physarum Inspired Dynamics to Solve Semi-Definite Programs}, author = {Gao, Yuan and Kamkari, Hamidreza and Karrenbauer, Andreas and Mehlhorn, Kurt and Sharifi, Mohammadamin}, year = {2022}, url = {https://arxiv.org/abs/2111.02291}, note = {arXiv preprint} }
2020
- A New Combinatorial Property of Geometric Unique Sink OrientationsYuan Gao, Bernd Gärtner, and Jourdain Lamperski2020arXiv preprint
A unique sink orientation (USO) is an orientation of the hypercube graph with the property that every face has a unique sink. A number of well-studied problems reduce in strongly polynomial time to finding the global sink of a USO; most notably, linear programming (LP) and the P-matrix linear complementarity problem (P-LCP). The former is not known to have a strongly polynomial-time algorithm, while the latter is not known to even have a polynomial-time algorithm, motivating the problem to find the global sink of a USO. Although, every known class of geometric USOs, arising from a concrete problem such as LP, is exponentially small, relative to the class of all USOs. Accordingly, geometric USOs exhibit additional properties that set them apart from general USOs, and it may be advantageous, if not necessary, to leverage these properties to find the global sink of a USO faster. Only a few such properties are known. In this paper, we establish a new combinatorial property of the USOs that arise from symmetric P-LCP, which includes the USOs that arise from linear and simple convex quadratic programming.
@misc{gao2020uso, title = {A New Combinatorial Property of Geometric Unique Sink Orientations}, author = {Gao, Yuan and G{\"a}rtner, Bernd and Lamperski, Jourdain}, year = {2020}, url = {https://arxiv.org/abs/2008.08992}, note = {arXiv preprint} }