
Hello! Totally agree that token counts are model-dependent. We chose to calculate tokens using the GPT-2 tokenizer because that is a common metric used by other datasets like FineWeb, so it should give you a rough sense of how large the data is compared to others. We also report other metrics, like the number of documents and the number of images.
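For anyone curious what a count like that looks like in practice, here is a minimal sketch of tallying GPT-2 tokens over a dataset's text fields while counting documents and images separately. It assumes the `tiktoken` library and a hypothetical per-document structure with "text" and "images" keys; it is not the dataset's actual processing code.

```python
# Sketch: GPT-2 token / document / image counts for a multimodal corpus.
# The document schema ("text", "images") is hypothetical; only text is tokenized.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE, the vocabulary used for FineWeb-style token counts


def dataset_stats(documents):
    """Return (token_count, document_count, image_count) for an iterable of docs."""
    tokens = docs = images = 0
    for doc in documents:
        docs += 1
        tokens += len(enc.encode_ordinary(doc.get("text", "")))  # ignore special tokens
        images += len(doc.get("images", []))
    return tokens, docs, images


# Toy usage example
example = [
    {"text": "A short caption.", "images": ["img_0001.jpg"]},
    {"text": "Another document with no images.", "images": []},
]
print(dataset_stats(example))  # e.g. (11, 2, 1)
```

Because only the text fields pass through the tokenizer, the token count describes the textual portion of the data; images are reported as a separate count rather than being tokenized.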


How does the GPT-2 tokenizer handle non-text input? This dataset is multimodal, but I thought GPT-2 was text-only.



