Great to see this here. We used the TinyStories dataset to train small models (as small as 20M params) and test out knowledge addition, and published a paper based on it [1]. We could get coherent outputs at sizes as low as 20M-25M params (not as good as large LLMs, but still decent).
[1] Blog + paper: https://medium.com/@ankit_94177/expanding-knowledge-in-large... (the paper is titled "Cross-Domain Content Generation with Domain-Specific Small Language Models")