Mathematician uncovers way to shrink sampling errors in large-dimensional data

godelski · on April 24, 2023

paper: James–Stein for the leading eigenvector

https://www.pnas.org/doi/10.1073/pnas.2207046120

YossarianFrPrez · on April 24, 2023

So basically, if you want to estimate more than three averages at once, there is a better way to go than just using each sample average on its own: shrink them all toward their common (grand) mean. Sort of like how when computing the standard deviation of a sample you divide by n-1 instead of n. Anyways, that's the James-Stein estimator, and the authors apply it to PCA dimensionality reduction; they de-bias and improve the leading eigenvector with their technique. Ctrl-f for "fig. 3" to see their results in action.
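For anyone who wants to poke at the classical James-Stein idea, here is a minimal sketch in Python. It is the textbook positive-part estimator that shrinks toward the grand mean, not the eigenvector version from the paper. The function names, the batting-average-like true means, and the assumption that every sample average has the same known variance `sigma2` are all mine, chosen purely for illustration.

```python
import numpy as np

def james_stein_grand_mean(sample_means, sigma2):
    """Positive-part James-Stein shrinkage toward the grand mean.

    Assumes each sample mean has the same, known variance sigma2
    (an illustrative simplification) and that there are at least 4 of them.
    """
    x = np.asarray(sample_means, dtype=float)
    p = x.size
    if p < 4:
        raise ValueError("shrinking toward the grand mean needs at least 4 means")
    grand = x.mean()
    spread = np.sum((x - grand) ** 2)
    if spread == 0.0:
        return x.copy()  # all means identical, nothing to shrink
    # Shrinkage factor in [0, 1]; clipped at 0 (the "positive-part" variant).
    shrink = max(0.0, 1.0 - (p - 3) * sigma2 / spread)
    return grand + shrink * (x - grand)

def compare_risk(trials=2000, seed=0):
    """Average total squared error of raw means vs. James-Stein, by simulation."""
    rng = np.random.default_rng(seed)
    true_means = np.linspace(0.20, 0.40, 10)  # hypothetical "true" averages
    sigma = 0.05                              # hypothetical noise level
    err_raw = 0.0
    err_js = 0.0
    for _ in range(trials):
        x = true_means + sigma * rng.standard_normal(true_means.size)
        est = james_stein_grand_mean(x, sigma ** 2)
        err_raw += np.sum((x - true_means) ** 2)
        err_js += np.sum((est - true_means) ** 2)
    return err_raw / trials, err_js / trials
```

Calling `compare_risk()` returns the two average errors, and the shrunken estimates should come out ahead in total squared error, even though any individual estimate may be worse than its raw average.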

This looks like it might affect a wide number of fields. I appreciate the concrete examples (batting averages, finance / Markowitz) as well.

geokon · on April 24, 2023

PCA is an iterative algo. Once you build the leading vector, you subtract it from your data and then build the next.
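To make the "build one vector, subtract, repeat" picture concrete, here is a rough sketch using power iteration plus deflation of the sample covariance matrix. This is the classical procedure only; it does not include the paper's James-Stein correction, and the function names and iteration count are my own illustrative choices.

```python
import numpy as np

def leading_eigvec(cov, iters=500, seed=0):
    """Approximate the leading eigenvector of a symmetric PSD matrix by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(cov.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = cov @ v
        norm = np.linalg.norm(w)
        if norm == 0.0:
            break  # v lies in the null space; nothing more to extract
        v = w / norm
    return v

def pca_by_deflation(X, k):
    """Return the first k principal directions and their variances.

    Each round finds the leading eigenvector of the current covariance,
    then removes that component (deflation) before the next round.
    """
    Xc = X - X.mean(axis=0)                  # center the columns
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)      # sample covariance
    components, variances = [], []
    for _ in range(k):
        v = leading_eigvec(cov)
        lam = v @ cov @ v                    # variance along v (Rayleigh quotient)
        components.append(v)
        variances.append(lam)
        cov = cov - lam * np.outer(v, v)     # subtract the component just found
    return np.array(components), np.array(variances)
```

In practice most libraries compute all components in one shot via an SVD or a symmetric eigensolver rather than deflating, so treat this as a picture of the idea rather than how PCA is usually implemented.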

Do you know by any chance why you can't use this new method recursively for building the whole SVD basis? (I haven't read the paper carefully yet. It's a bit at the limit of my understanding.)

_eojb · on April 24, 2023

Suggest changing the link to the actual paper, which suggests a method to reduce estimation error of the leading eigenvector

mjhay · on April 24, 2023

Yeah, there's literally no actual information in that article pertinent to the method. It's not even like this is a particularly arcane thing.

asplake · on April 24, 2023

> a method to reduce estimation error of the leading eigenvector

Might this have application to principal component analysis or other ML techniques?

hackandthink · on April 24, 2023

The Stein Paradox is quite interesting. A slogan could be: Estimators do not compose.

https://en.wikipedia.org/wiki/Stein%27s_example

https://cs.nyu.edu/~roweis/csc2515-2006/readings/stein_parad...

AnnaPali · on April 24, 2023

So many things like this come out but I've rarely found a way to actually implement any of these things, so they don't become actual learnings or takeaways. Skimming over the paper it just doesn't seem applicable.

Perhaps someone has a list of interesting innovations from recent research we can actually apply at work with data, programming or such?

justeleblanc · on April 24, 2023

It's called research. Someone has to do the R&D afterward to implement it concretely.

RayVR · on April 28, 2023

Very cool to see researchers moving the needle on something as fundamental as PCA.

A few caveats for the "impact" in finance: PCA is an oft-used tool during the initial research phase. However, once a model, whether on the alpha or risk side, reaches production, marginal improvements to the leading eigenvector will likely be a rounding error compared to other confounding issues. And that's assuming PCA survives past the basic first-pass approximation phase of model building at all.

There are a number of better and more analytically useful models which I would expect to be used instead.

I can already hear some people from less quantitative firms or groups arguing that PCA is used extensively. However, those same firms and groups rarely adhere strictly to model outputs, layering on discretionary traders' views, which are often not quantified.

I was never on the sell-side responsible for derivatives pricing, but they would almost certainly never include PCA in anything they do, unless perhaps it was a client request?

derbOac · on April 24, 2023

This is interesting and important to know about but increasingly I feel like stochastic uncertainty isn't the main source of "random/error variance", it's more like real but uninteresting heterogeneity. I can see using this in practice but it's probably a minor part of the whole picture. I don't mean to sound dismissive; I think it's more that I wish there was more focus on other types of uncertainty as well.