
Hello! Totally agree that token counts are model-dependent. We chose to calculate tokens using the GPT-2 tokenizer because that is a common metric used by other datasets like FineWeb, so it should give you a rough sense of how large the data is compared to others. We also report other metrics, like the number of documents and the number of images.
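For anyone curious what a count like that looks like in practice, here is a minimal sketch of tallying GPT-2 tokens over a dataset's text fields while counting documents and images separately. It assumes the `tiktoken` library and a hypothetical per-document structure with "text" and "images" keys; it is not the dataset's actual processing code.

```python
# Sketch: GPT-2 token / document / image counts for a multimodal corpus.
# The document schema ("text", "images") is hypothetical; only text is tokenized.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE, the vocabulary used for FineWeb-style token counts


def dataset_stats(documents):
    """Return (token_count, document_count, image_count) for an iterable of docs."""
    tokens = docs = images = 0
    for doc in documents:
        docs += 1
        tokens += len(enc.encode_ordinary(doc.get("text", "")))  # ignore special tokens
        images += len(doc.get("images", []))
    return tokens, docs, images


# Toy usage example
example = [
    {"text": "A short caption.", "images": ["img_0001.jpg"]},
    {"text": "Another document with no images.", "images": []},
]
print(dataset_stats(example))  # e.g. (11, 2, 1)
```

Because only the text fields pass through the tokenizer, the token count describes the textual portion of the data; images are reported as a separate count rather than being tokenized.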


How does the GPT-2 tokenizer handle non-text input? This dataset is multimodal, but I thought GPT-2 was text-only.



