
A first-order Markov-chain text generator is not complex at all; it's a one-line shell script, reformatted here onto three lines for clarity:

    perl -ane 'push@{$m{$a}},$a=$_ for@F;END{
      print$a=$m{$a}[rand@{$m{$a}}]," "for 1..100_000
    }' <<<'fish fish for fish.'
Given the King James Bible instead of "fish fish for fish." as input, this program produces output like this:

25:13 As the hands upon his disciple; but make great matter wisely in their calamity; 1:14 Sanctify unto the arches thereof unto them: 28:14 Happy is our judges ruled, that believe. 19:36 Thus saith the first being one another's feet.

This leans pretty hard on Perl DWIM; a Python version is many times longer:

    import random, sys

    model, last, word = {None: []}, None, None
    for word in (word for line in sys.stdin for word in line.split()):
        model.setdefault(last, []).append(last := word)

    sys.stdout.writelines(str(word := random.choice(model.get(word) or words)) + " "
                          for words in [list(model.keys())] for i in range(100_000))
(If you write code like that at a job, you might get fired. I would reject your pull request.)
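
A more conventional sketch of the same first-order chain in Python might look like the following; the variable names and the restart-on-dead-end behaviour are my own choices here, not a faithful translation of the one-liner:

    import random
    import sys

    # Build a first-order model: each word maps to the list of words that
    # followed it in the input. Duplicates are kept, so choosing uniformly
    # at random reproduces the observed transition frequencies.
    model = {}
    prev = None
    for line in sys.stdin:
        for word in line.split():
            model.setdefault(prev, []).append(word)
            prev = word

    # Walk the chain for 100,000 words; if we hit a word with no recorded
    # successor (the last word of the input), restart from a random state.
    state = None
    output = []
    for _ in range(100_000):
        successors = model.get(state)
        if not successors:
            state = random.choice(list(model.keys()))
            successors = model[state]
        state = random.choice(successors)
        output.append(state)

    print(" ".join(output))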

It's probably worth mentioning:

- Markov-chain states don't have to be words. For text modeling, a common thing to do is to use N-grams (of either words or characters) instead of single words for your states (see the sketch after this list).

- It's good to assign some nonzero probability to transitions that haven't occurred in the training set; this is especially important if you use a larger universe of states, because each state will occur fewer times in the training set. "Laplacian smoothing" is one way to do this (also shown in the sketch after the list).

- Usually (and in my code examples above) the probability distribution of transitions we take in our text generator is the probability distribution we inferred from the input. Potter's approach of multiplying his next-state-occupancy vector Ms⃗ by a random diagonal matrix R and then taking the index of the highest value in the resulting vector produces somewhat similar results, but they are not the same; for example

    sum(random.random() * 0.75 > random.random() * 0.25 for i in range(1_000_000))
is roughly 833000, not roughly 750000. It's 5× as common instead of 3×. But I imagine that it still produces perfectly cromulent text.
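
To make the first two bullets above concrete, here's a rough sketch of a word-bigram model with add-one (Laplace) smoothing; the helper names and the toy corpus are made up for illustration:

    import random
    from collections import Counter, defaultdict

    def train_bigram_model(words):
        """States are (word, word) pairs; counts[state][next] counts transitions."""
        counts = defaultdict(Counter)
        for a, b, c in zip(words, words[1:], words[2:]):
            counts[a, b][c] += 1
        return counts

    def sample_next(counts, state, vocab, alpha=1.0):
        """Add-alpha smoothing: every word in the vocabulary gets weight
        count + alpha, so transitions never seen in training still have
        nonzero probability."""
        seen = counts.get(state, Counter())
        weights = [seen[w] + alpha for w in vocab]
        return random.choices(vocab, weights=weights)[0]

    words = "fish fish for fish .".split()   # toy corpus
    counts = train_bigram_model(words)
    vocab = sorted(set(words))

    state = tuple(words[:2])
    generated = list(state)
    for _ in range(20):
        nxt = sample_next(counts, state, vocab)
        generated.append(nxt)
        state = (state[1], nxt)
    print(" ".join(generated))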

Interestingly, modern compilers derive from Chomsky's theory of linguistics, which he formulated while working on an AI project in the 01950s in order to demolish the dominant theory of psychology at the time, behaviorism. Behaviorism essentially held that human minds were Markov chains, and Chomsky showed that Markov chains couldn't produce context-free languages.



This is actually sick! I have to admit, my implementation is far more (likely unnecessarily) complex.


Thanks! I think! Your implementation might be easier to understand and modify, too, though perhaps not because it's unnecessarily complex.


I had a brush with MCMC decades ago, and your code is truly beautiful


MCMC is considerably more complicated, though!
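
For contrast, even the simplest MCMC algorithm, random-walk Metropolis, needs a target density, a proposal distribution, and an accept/reject step. A toy sketch, with an arbitrary standard-normal target just for illustration:

    import math
    import random

    def metropolis(log_target, x0, steps, scale=1.0):
        """Random-walk Metropolis: propose x' = x + noise, accept with
        probability min(1, target(x') / target(x))."""
        x, lp = x0, log_target(x0)
        samples = []
        for _ in range(steps):
            proposal = x + random.gauss(0.0, scale)
            lp_new = log_target(proposal)
            if lp_new >= lp or random.random() < math.exp(lp_new - lp):
                x, lp = proposal, lp_new
            samples.append(x)
        return samples

    # Unnormalized log density of a standard normal; the mean of the
    # samples should come out near 0.
    samples = metropolis(lambda x: -0.5 * x * x, x0=0.0, steps=10_000)
    print(sum(samples) / len(samples))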


What????????? Given that, how was it not obvious for decades that this evolves into something like the current LLMs? I gave it only the first few paragraphs of a random man page and it already passes for a human trying to form nonsense sentences, with mostly (but not always) correct grammar. LLMs now feel not very technically advanced. The code isolation in their interface seems more impressive than this.


If there were an array programming language with automatic differentiation and a focus on writing neural networks, I'm sure you could write a transformer-based LLM, training code included, on a napkin.


I think reverse-mode automatic differentiation is significantly harder to implement than a string-indexed hash table and delimiter splitting, but maybe that's not important when those are such a small part of an interpreter for Perl or even Awk?

How big is a Transformer in TensorFlow in Python?


(I guess I should have pointed out that forward-mode automatic differentiation was only about 100–150 lines of code when I implemented it in http://canonical.org/~kragen/sw/81hacks/autodiffgraph/, depending on where you draw the lines. But gradient descent isn't practical with forward-mode autodiff unless it's with a very small number of independent variables. Also even 100–150 lines of code is significantly bigger than a hash table and delimiter splitting.)
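
For a sense of scale, the dual-number core of forward-mode autodiff fits in a couple dozen lines of Python; this is a generic sketch, not the code at the link above:

    class Dual:
        """Dual number a + b*eps with eps**2 == 0; the eps coefficient
        carries the derivative alongside the value."""
        def __init__(self, value, deriv=0.0):
            self.value, self.deriv = value, deriv

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value + other.value, self.deriv + other.deriv)
        __radd__ = __add__

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value * other.value,
                        self.value * other.deriv + self.deriv * other.value)
        __rmul__ = __mul__

    def derivative(f, x):
        """Evaluate f at x + eps and read off the eps coefficient."""
        return f(Dual(x, 1.0)).deriv

    # d/dx (x**3 + 2x) at x = 3 is 3*9 + 2 = 29
    print(derivative(lambda x: x * x * x + 2 * x, 3.0))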


If you only give it a few paragraphs, most words only occur once, so it ends up copying whole phrases from the input until it hits a common word. Computers are very good at sounding like human writing when they're just reproducing things that humans in fact did write. LLMs mostly aren't doing that, though.
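
You can check this directly by counting how many states have only one distinct observed successor. Assuming a model dict shaped like the one in the Python example above (word -> list of observed next words):

    singletons = sum(1 for nexts in model.values() if len(set(nexts)) == 1)
    print(f"{singletons}/{len(model)} states have exactly one distinct successor")

With only a few paragraphs of input that ratio is close to 1, which is why the output reads like spliced-together quotations.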


You're right, I was a bit too enthusiastic. I've fed it random C code (both normal and already preprocessed) and it produces random garbage, not remotely valid as syntax. Also, it seems to really like emitting parts of copyright messages, since those occur verbatim everywhere.

Happy to be part of today's 10,000.


Welcome!



