
This looked suspect to me at first as well, but on reflection it doesn't seem so bad. The important aspect here is that they have some large dataset that a model will converge on in less than one epoch. The benefit of generating it from cifar-10 is just that they already have multiple reasonable models to compare with, and they already have the hyperparameters for them.

I haven't read the paper, but my guess is that the 50K images in the real-world epoch are not just the real images from the cifar-10 dataset; they're 50K random images from cifar-5m. I'm also guessing they never compare performance between a model trained on cifar-10 vs. cifar-5m; they only compare the performance of real vs. ideal. So in effect, you can ignore the cifar-10 dataset.
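If that reading is right, the only difference between the two worlds is where the sampler draws from, with the step count held fixed. A minimal PyTorch-style sketch of what I mean (the dataset objects and hyperparameters here are placeholders, not the paper's actual code):

    import torch
    from torch.utils.data import DataLoader, RandomSampler

    def train(model, dataset, total_steps, batch_size=128):
        # "Real world": dataset is the 50K subset, so the sampler revisits the
        # same images many times. "Ideal world": dataset is the 5M synthetic
        # set, so with total_steps * batch_size << 5M nearly every batch is fresh.
        sampler = RandomSampler(dataset, replacement=True,
                                num_samples=total_steps * batch_size)
        loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
        opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
        loss_fn = torch.nn.CrossEntropyLoss()
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        return model

Same training function either way; only the dataset argument changes.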



I've had the opportunity to train on an 88 billion example dataset, and while you do see diminishing returns with each additional sample, it is still important to try to cover as much of the dataset as possible. The limiting factor when training on datasets that large seems to be some sort of hysteresis-like overfitting on the earlier samples that prevents the model from generalizing to the entire dataset. Normally it goes unnoticed, but a small amount of overfitting still accumulates the longer you train the model, even if every sample is new.
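This isn't my actual pipeline, but here's roughly how I'd watch for that effect: keep a fixed slice of batches from the very start of the stream plus a held-out set, and periodically compare losses on the two while the model trains on fresh data. (Sketch assumes PyTorch-style loaders; the names are mine.)

    import torch

    @torch.no_grad()
    def mean_loss(model, loader, loss_fn=torch.nn.CrossEntropyLoss()):
        model.eval()
        losses = [loss_fn(model(x), y).item() for x, y in loader]
        model.train()
        return sum(losses) / len(losses)

    def train_with_drift_check(model, opt, stream_loader, early_loader,
                               heldout_loader, eval_every=10_000):
        # early_loader: fixed batches from the very start of the stream
        # heldout_loader: samples the model never trains on
        loss_fn = torch.nn.CrossEntropyLoss()
        gaps = []
        for step, (x, y) in enumerate(stream_loader):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            if step % eval_every == 0:
                gap = mean_loss(model, heldout_loader) - mean_loss(model, early_loader)
                gaps.append(gap)  # a widening gap suggests creeping overfitting
        return gaps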


Thanks, this insight actually helped me understand an issue I'm currently having with training!


n = 88 billion is larger than anything I can think of in the DL research I've read. Can you (ahem) enlarge on that?


It's an ultra high resolution sales forecast model for a national retailer. It's made somewhat more manageable by the fact that the data is 1D instead of 2D though.


I wonder if that's an important difference? Sales data seems far noisier and less informative than any of the scaling-research datasets I can think of, like images or text. It could also just be that some minor flaw in your approach caused underfitting, like using too-small models that aren't compute-optimal, or too much regularization (a critical part of scaling papers is removing standard regularizers like dropout to avoid hobbling the model).
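The kind of ablation I'd try first is the same model and data budget with and without the usual regularizers, since those can mask the data-scaling effect. A toy sketch (the knobs here are assumptions about a typical setup, not your pipeline):

    import torch.nn as nn

    def make_mlp(in_dim, hidden, out_dim, dropout_p=0.0):
        layers = [nn.Linear(in_dim, hidden), nn.ReLU()]
        if dropout_p > 0:
            layers.append(nn.Dropout(dropout_p))  # drop this for the scaling run
        layers.append(nn.Linear(hidden, out_dim))
        return nn.Sequential(*layers)

    baseline = make_mlp(256, 1024, 1, dropout_p=0.3)
    scaling_run = make_mlp(256, 1024, 1, dropout_p=0.0)  # no dropout; rely on the data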


I've found that it's surprisingly insensitive to the amount of regularization applied; however, it definitely warrants further experimentation. Also, I expect that one other distinction vs. image data is the much higher correlation between samples in the dataset.
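To make that concrete, a quick way to see the correlation I mean on a 1D series: the lag autocorrelation of daily sales is typically far higher than any correlation between two randomly drawn natural images. (series here is a placeholder numpy array, not real data.)

    import numpy as np

    def lag_autocorr(series: np.ndarray, lag: int = 1) -> float:
        # correlation between the series and itself shifted by `lag` steps
        a, b = series[:-lag], series[lag:]
        return float(np.corrcoef(a, b)[0, 1])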


But maybe with 5 million real examples sampled from the same distribution CIFAR-10 was sampled from, they would in fact see a difference. Maybe the generative model captures only a limited slice of the diversity that the ideal model would really see.

It seems like they should have downsampled from an actually large dataset rather than generatively upsampled from a small dataset. Unless I'm missing something?


I think what you're saying is plausible, but I would expect the different models to diverge at different rates in that case. So, for example, if resnet-real and resnet-ideal stayed within 1% and the other models showed a bigger range, I would be more suspicious that the generative model was simply creating a dataset that was easy for some architectures to learn.
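A toy version of the check I have in mind: compute each architecture's real-vs-ideal test-error gap and look at how much the gaps spread (the dict would hold whatever the paper reports; nothing here is actual data):

    def gap_spread(errors: dict[str, tuple[float, float]]) -> float:
        # errors maps architecture -> (real-world test error, ideal-world test error)
        gaps = {arch: real - ideal for arch, (real, ideal) in errors.items()}
        # a large spread would point to an architecture-dependent artifact
        return max(gaps.values()) - min(gaps.values())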

That being said, I think it would have been much better if they had also compared some non-convolutional architectures, just as a sanity check.

Edit: after I wrote this, I checked, and ViT-b/4 is actually a transformer architecture, not a CNN. So they did do this! And it stayed within roughly the same error range from ideal as the CNNs. I am much more confident now that what they did is fine.




