
> This post covers one appealing way to constrain the weight matrices of a neural network—by keeping the tensors constrained to submanifolds at each layer. This opens the door to re-thinking optimization, as we can co-design optimization algorithms with these manifold constraints. As an example, we propose a manifold version of the Muon optimizer whose weights are constrained to the Stiefel manifold: the manifold of matrices with unit condition number. We conclude the post by defining the idea of a modular manifold, which is a composable manifold that attempts to make it easier to scale up and train large networks.
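
The "unit condition number" characterization is quick to check numerically: a point on the Stiefel manifold has orthonormal columns, so every singular value equals 1. A minimal sketch in NumPy (my own illustration, not code from the post):

    import numpy as np

    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((128, 64)))  # Q has orthonormal columns

    s = np.linalg.svd(Q, compute_uv=False)
    print(s.min(), s.max())   # both ~1.0: all singular values are 1
    print(np.linalg.cond(Q))  # 2-norm condition number ~1.0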

Very good presentation. Projected gradient methods were popular during the convex optimization craze two decades ago. The ideas advanced here have precedent and seem sensible to me. My concern is whether it helps much. The test accuracy in figure 6b shows a marginal increase, and a gentler transition to the overfitting regime, suggesting the regularization is working. The higher LR did not translate to a speed-up: "Manifold Muon increased the wall clock time per step compared to AdamW..."
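
For concreteness, the projected-gradient recipe being referred to: take an ordinary gradient step, then map the iterate back onto the constraint set. A minimal sketch with the Stiefel manifold as the constraint and a toy trace objective (my own illustration; not the post's manifold Muon update):

    import numpy as np

    def retract(W):
        # Nearest matrix with orthonormal columns: keep the singular vectors
        # and set every singular value to 1 (the polar factor of W).
        U, _, Vt = np.linalg.svd(W, full_matrices=False)
        return U @ Vt

    rng = np.random.default_rng(0)
    M = rng.standard_normal((64, 64))
    A = M.T @ M                                # symmetric PSD
    W = retract(rng.standard_normal((64, 8)))  # start on the manifold

    lr = 1e-3
    for _ in range(500):
        grad = 2 * A @ W                       # gradient of trace(W.T @ A @ W)
        W = retract(W + lr * grad)             # ascent step, then retraction

    print(np.trace(W.T @ A @ W))               # approaches the sum of the top 8
    print(np.linalg.eigvalsh(A)[-8:].sum())    # eigenvalues of A from below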

More fundamentally, I am a bit skeptical that low test error is the right goal for LLMs, because statistical learning theory does not adequately model the macro-behavior of very large models.



> statistical learning theory does not adequately model the macro-behavior of very large models.

Might you please elaborate on this? I recognize that "artificial neural networks are lossy de/compression algorithms" does not enumerate the nuances of these structures, but am curious whether anything in particular is both interesting and missing from SLT.


SLT typically starts from empirical risk minimization, which leads to the bias-variance decomposition and a single optimum as the monotonically decreasing bias supposedly trades off against the monotonically increasing variance. We now know this does not accurately model overparameterized models, which exhibit double descent and other phenomena such as grokking. To explain them you have to look past classical statistics to statistical mechanics.
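
For reference, the decomposition being invoked, for squared error at a fixed input x, with true function f, a model \hat f fit on a random training set, and noise variance \sigma^2:

    \mathbb{E}\big[(y - \hat f(x))^2\big]
      = \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{bias}^2}
      + \underbrace{\operatorname{Var}\big[\hat f(x)\big]}_{\text{variance}}
      + \underbrace{\sigma^2}_{\text{irreducible noise}}

Double descent is the observation that test error can fall again as capacity grows past the interpolation threshold, which the single U-shaped curve implied by this trade-off does not predict.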


> The test accuracy in figure 6b shows a marginal increase, and a gentler transition to the overfitting regime, suggesting the regularization is working.

Sounds like it might help for online RL training regimes, as those are naturally quite vulnerable to overfitting.


> The test accuracy in figure 6b shows a marginal increase, and a gentler transition to the overfitting regime, suggesting the regularization is working.

Higher LR does not mean there’s overfitting.



