Tokens, n-grams, and bag-of-words models (2023)

politelemon · on April 6, 2024

I found this useful, so thanks for sharing it; ngrams and bag-of-words are terms I've encountered in the past but skipped without thinking about htem.

It's making me wonder, why are models usually in Python? Could these models be implemented in say, Scala, Kotlin, or NodeJS, and have there been attempts to do so?

marginalia_nu · on April 6, 2024

N-grams are unfortunately an overloaded term.

It's used in the context of multiple "words" as in this article, but also tuples of characters. The confusion stems from the fact that "token" may refer to words as is used in this article (this is common in modern NLP) but older sources often use the word token to refer to graphemes ("letters") and various other breakdowns. This wikipedia article[1] is a good example of such usage.

[1] https://en.wikipedia.org/wiki/Bigram

nostrebored · on April 6, 2024

There’s no strict definition of a token, it’s just a unit of breaking down text / arbitrary data. You could create a sentence tokenizer, word tokenizer, character tokenizer, etc.

Afaik this is pretty canonical.

3abiton · on April 6, 2024

Does it relate to tokenization?

m3at · on April 7, 2024

Two of the most popular libraries for token creation (tokenization) are in fact in rust, with an interface in python:

https://github.com/huggingface/tokenizers

https://github.com/openai/tiktoken

bobbylarrybobby · on April 7, 2024

The models themselves are usually in C (for performance and portability), and Python is currently the best interface to C code.

Going forward, I assume we'll see more Python interfacing with more Rust as well. (Polars is one prominent example.) Python is just a fantastic UI over “close to the metal” languages.

gentleman11 · on April 6, 2024

There’s some libraries, numpy, pandas, and sklearn, that are written in python and highly optimized. People use python because those libraries, plus a few other tools, are python. Plus python is nice and easy to use on any platform

fbdab103 · on April 6, 2024

Don't forget the Jupyter notebook ecosystem. Having a nice REPL environment where you can quickly iterate on code is a huge boon compared to a language which might have a slower feedback cycle.

This is incredibly important in an environment where you are exploring datasets and generating new hypotheses.

jjtheblunt · on April 6, 2024

I think you're mixing distinct things, because numpy, pandas, sklearn are NOT written primarily in Python, but generally in C, which is why they are highly optimized, relying on this underlying C code, for example, but with Python bindings for convenient ergonomic use from Python.

PaulHoule · on April 6, 2024

People aren't going to do this in Java which has a xenophobic attitude about libraries. I mean you could, but people wouldn't.

jjtheblunt · on April 7, 2024

I did quite a lot of scientific computation using Apache Spark (thus, jvm) for a while, and the lack of more specific numeric types, like machine hardware supported types, in the JVM was a veritable pain in the ass to work around.

I don't know if that is a handicap to the idea you mention, but it may be...perhaps dropping into C for hardware performance and then back up to the JVM is just too annoying since things representable in hardware had no non-super-tedious representation back in the JVM (Scala or Java in my case).

politelemon · on April 6, 2024

Oh that bit does make sense, there's a natural gravitation towards it and the more people use it the better it gets. Have there been attempts to recreate a similar ecosystem in other languages though.

rabbits77 · on April 6, 2024

Yes, of course. Things like numpy are far from new and in many cases are easy to use wrappers around the real computational workhorses written in, say, Fortran or C. For example, check out lapack https://hpc.llnl.gov/software/mathematical-software/lapack which is still, as far as I know, the gold standard.

Python is widely adapted not really because the language itself is any good or particularly performant (it's not at all), but because it presents easy to use wrapper APIs to developers who may have a poor background in Computer Science, but are rather stronger in statistics, general data analysis, or applied fields like economics.

pedrosorio · on April 6, 2024

Before Python, I believe Fortran (one of the first programming languages) was for many years a key language in scientific computing.

MATLAB is a proprietary computing platform with its own language that was very widely used (and probably still the standard in some fields of engineering). The fact that it is proprietary and the language is not great as a general programming language were significant drawbacks to wide adoption.

As far as ML is concerned, the deep learning revolution happened when Python was the dominant language for scientific computing (mostly due to NumPy and SciPy), so naturally a lot of the ecosystem was built to have Python as the main (scripting) language. The rest is history.

As far as "attempts to recreate similar ecosystem":

PyTorch (currently the most popular deep learning framework), was originally Torch (initial release: 2002 - long before "Deep Learning" was a thing) with Lua as the scripting language. Python's momentum in 2010's meant that eventually it was rewritten in Python thus becoming PyTorch.

Julia language is a famous somewhat recent example (first release 2012, stable 1.0 in 2018) of a language that was built partially to address some of Python's shortcomings and to "replace" it as the default for scientific computing. It didn't succeed - it's hard to move people away from an ecosystem with so much head start and momentum as Python had in the 2010s.

atoav · on April 7, 2024

Python had one of the first popular natural language processing libraries. Additionally my personal opinion is that Python is a good choice when it comes to string manipulation (and has been a decade ago).

Buttons840 · on April 6, 2024

> Could these models be implemented in say, Scala, Kotlin, or NodeJS, and have there been attempts to do so?

Of course. Python itself could be reimplemented in Scala or Kotlin.

I'm not trying to be snarky, but my teenage self didn't realize all languages could do all things when I was young, so I'm speaking to anyone else out there who might not have learned this.

It's just a matter of ease. Python is easy for short programs (although maintainability can suffer on large programs), and, for better or worse, Python has become the most popular language in the world, especially for numerical computing.