
This looked suspect to me at first as well, but on reflection it doesn't seem so bad. The important aspect here is that they have some large dataset that a model will converge on in less than one epoch. The benefit of generating it from cifar-10 is just that they already have multiple reasonable models to compare with, and they already have the hyperparameters for them.

I haven't read the paper, but my guess is that the 50K images in the real-world epoch are not just the real images from the cifar-10 dataset; they're 50K random images from cifar-5m. I'm also guessing they never compare performance between a model trained on cifar-10 vs. cifar-5m; they only compare the performance of real vs. ideal. So in effect, you can ignore the cifar-10 dataset.
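If that reading is right, the only difference between the two worlds is where the sampler draws from, with the step count held fixed. A minimal PyTorch-style sketch of what I mean (the dataset objects and hyperparameters here are placeholders, not the paper's actual code):

    import torch
    from torch.utils.data import DataLoader, RandomSampler

    def train(model, dataset, total_steps, batch_size=128):
        # "Real world": dataset is the 50K subset, so the sampler revisits the
        # same images many times. "Ideal world": dataset is the 5M synthetic
        # set, so with total_steps * batch_size << 5M nearly every batch is fresh.
        sampler = RandomSampler(dataset, replacement=True,
                                num_samples=total_steps * batch_size)
        loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
        opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
        loss_fn = torch.nn.CrossEntropyLoss()
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        return model

Same training function either way; only the dataset argument changes.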



I've had the opportunity to train on an 88 billion example dataset, and while you do see diminishing returns with each additional sample, it is still important to try to cover as much of the dataset as possible. The limiting factor when training on datasets that large seems to be some sort of hysteresis-like overfitting on the earlier samples that prevents the model from generalizing to the entire dataset. Normally it goes unnoticed, but a small amount of overfitting still accumulates the longer you train the model, even if every sample is new.
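This isn't my actual pipeline, but here's roughly how I'd watch for that effect: keep a fixed slice of batches from the very start of the stream plus a held-out set, and periodically compare losses on the two while the model trains on fresh data. (Sketch assumes PyTorch-style loaders; the names are mine.)

    import torch

    @torch.no_grad()
    def mean_loss(model, loader, loss_fn=torch.nn.CrossEntropyLoss()):
        model.eval()
        losses = [loss_fn(model(x), y).item() for x, y in loader]
        model.train()
        return sum(losses) / len(losses)

    def train_with_drift_check(model, opt, stream_loader, early_loader,
                               heldout_loader, eval_every=10_000):
        # early_loader: fixed batches from the very start of the stream
        # heldout_loader: samples the model never trains on
        loss_fn = torch.nn.CrossEntropyLoss()
        gaps = []
        for step, (x, y) in enumerate(stream_loader):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            if step % eval_every == 0:
                gap = mean_loss(model, heldout_loader) - mean_loss(model, early_loader)
                gaps.append(gap)  # a widening gap suggests creeping overfitting
        return gaps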


Thanks, this insight actually helped me understand an issue I'm currently having with training!


n = 88 billion is larger than anything I can think of in the DL research I've read. Can you (ahem) enlarge on that?


It's an ultra high resolution sales forecast model for a national retailer. It's made somewhat more manageable by the fact that the data is 1D instead of 2D though.


I wonder if that's an important difference? Sales data seems far noisier and less informative than any of the scaling-research datasets I can think of, like images or text. It could also just be that some minor flaw in your approach caused underfitting, like using too-small models that aren't compute-optimal, or too much regularization (a critical part of scaling papers is removing standard regularizers like dropout to avoid hobbling the model).
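The kind of ablation I'd try first is the same model and data budget with and without the usual regularizers, since those can mask the data-scaling effect. A toy sketch (the knobs here are assumptions about a typical setup, not your pipeline):

    import torch.nn as nn

    def make_mlp(in_dim, hidden, out_dim, dropout_p=0.0):
        layers = [nn.Linear(in_dim, hidden), nn.ReLU()]
        if dropout_p > 0:
            layers.append(nn.Dropout(dropout_p))  # drop this for the scaling run
        layers.append(nn.Linear(hidden, out_dim))
        return nn.Sequential(*layers)

    baseline = make_mlp(256, 1024, 1, dropout_p=0.3)
    scaling_run = make_mlp(256, 1024, 1, dropout_p=0.0)  # no dropout; rely on the data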


I've found that it's surprisingly insensitive to the amount of regularization applied; however, it definitely warrants further experimentation. Also, I expect that one other distinction vs. image data is the much higher correlation between samples in the dataset.
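To make that concrete, a quick way to see the correlation I mean on a 1D series: the lag autocorrelation of daily sales is typically far higher than any correlation between two randomly drawn natural images. (series here is a placeholder numpy array, not real data.)

    import numpy as np

    def lag_autocorr(series: np.ndarray, lag: int = 1) -> float:
        # correlation between the series and itself shifted by `lag` steps
        a, b = series[:-lag], series[lag:]
        return float(np.corrcoef(a, b)[0, 1])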


But maybe with 5 million real examples sampled from the same distribution CIFAR-10 was sampled from, they would in fact see a difference. Maybe the generative model captures only a limited slice of the diversity that the ideal model would really see.

It seems like they should have downsampled from an actually large dataset rather than generatively upsampled from a small dataset. Unless I'm missing something?


I think what you're saying is plausible, but I would expect the different models to diverge at different rates in that case. So, for example, if resnet-real and resnet-ideal stayed within 1% and the other models showed a bigger range, I would be more suspicious that the generative model was simply creating a dataset that was easy for some architectures to learn.
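A toy version of the check I have in mind: compute each architecture's real-vs-ideal test-error gap and look at how much the gaps spread (the dict would hold whatever the paper reports; nothing here is actual data):

    def gap_spread(errors: dict[str, tuple[float, float]]) -> float:
        # errors maps architecture -> (real-world test error, ideal-world test error)
        gaps = {arch: real - ideal for arch, (real, ideal) in errors.items()}
        # a large spread would point to an architecture-dependent artifact
        return max(gaps.values()) - min(gaps.values())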

That being said, I think it would have been much better if they had also compared some non-convolutional architectures, just as a sanity check.

Edit: after I wrote this, I checked, and ViT-b/4 is actually a transformer architecture, not a CNN. So they did do this! And it stayed within roughly the same error range from ideal as the CNNs. I am much more confident now that what they did is fine.




