>that they need to rig their elections against themselves to get dissenting voices
I don't believe this is true. If you're talking about Non-Constituency Members of Parliament, they are consolation prizes given to best losers, and there are many things they cannot vote on. Moreover, the ruling party almost never lifts the party whip, i.e. members of the party CANNOT vote against the party line (without being kicked out of the party, which results in them being kicked out of parliament). In other words, since the ruling party already has a majority, any opposing votes literally do not matter.
If you aren't talking about the NCMP scheme, then I do not know what you're talking about, as the ruling party does institute policies that benefit itself as the incumbent.
GPT-1 wasn't used as a zero-shot text generator; that wasn't why it was impressive. The way GPT-1 was used was as a base model to be fine-tuned on downstream tasks. It was the first case of a (fine-tuned) base Transformer model just trivially blowing everything else out of the water. Before this, people were coming up with bespoke systems for different tasks (a simple example is that for SQuAD, a passage question-answering task, people would have an LSTM to read the passage and another LSTM to read the question, because of course those are different sub-tasks with different requirements and should have different sub-models). Once GPT-1 came out, you just dumped all the text into the context, YOLO fine-tuned it, and trivially got state of the art on the task. On EVERY NLP task.
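To make the contrast concrete, here's a minimal sketch of that recipe: one pretrained backbone, everything concatenated into a single sequence, plus a tiny task head. The stand-in backbone, sizes, and delimiter token are all assumptions for illustration, not GPT-1's actual configuration.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only.
VOCAB, HIDDEN, DELIM_ID = 40_000, 768, 3

class BackboneStandIn(nn.Module):
    """Stand-in for the pretrained Transformer (GPT-1 itself is a causal decoder)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, ids):
        return self.encoder(self.embed(ids))  # (batch, seq, hidden)

class FineTunedQA(nn.Module):
    """Pretrained backbone + tiny task-specific head; no per-task architecture."""
    def __init__(self, backbone, n_choices=2):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(HIDDEN, n_choices)

    def forward(self, ids):
        h = self.backbone(ids)
        return self.head(h[:, -1, :])  # predict from the final position

# Passage + delimiter + question all go into ONE sequence; no separate
# passage-LSTM and question-LSTM sub-models.
ids = torch.randint(4, VOCAB, (1, 128))
ids[0, 64] = DELIM_ID
logits = FineTunedQA(BackboneStandIn())(ids)
print(logits.shape)  # torch.Size([1, 2])
```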
Overnight, GPT-1 single-handedly upset the whole field. It was somewhat overshadowed by the BERT and T5 models that came out very shortly after, which tended to perform even better in the pretrain-and-fine-tune format. Nevertheless, the success of GPT-1 already warranted scaling up the approach.
A better question is how OpenAI decided to scale GPT-2 up to GPT-3. GPT-2 was an awkward in-between model. It generated better text for sure, but the zero-shot performance reported in the paper, while neat, was not great at all. On the flip side, its fine-tuned task performance paled compared to much smaller encoder-only Transformers. (The answer is: scaling laws allowed for predictable increases in performance.)
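For context on that last point, the Kaplan et al. (2020) scaling laws express loss as a power law in parameter count, so performance at the next scale can be extrapolated before training. A rough sketch; the constants are only approximately the published fits and the parameter counts are the usual ballpark figures, so treat all of them as assumptions:

```python
# Rough illustration of a parameter-count scaling law, L(N) = (N_c / N) ** alpha.
# Constants are approximate values from Kaplan et al. (2020), used here purely
# to show how "predictable increases in performance" works.
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for name, n in [("GPT-1 (~117M)", 1.17e8), ("GPT-2 (~1.5B)", 1.5e9), ("GPT-3 (~175B)", 1.75e11)]:
    print(f"{name}: predicted loss ~{predicted_loss(n):.2f}")
```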
> Transformer model just trivially blowing everything else out of the water
No, this is the winners rewriting history. Transformer-style encoders are now applied to lots and lots of disciplines, but they do not "trivially" do anything. The hype re-telling is obscuring the facts of history. Specifically in human-language text translation, the "Attention is All You Need" Transformers did "blow others out of the water", yes, for that application.
>a (fine-tuned) base Transformer model just trivially blowing everything else out of the water
"Attention is All You Need" was a Transformer model trained specifically for translation, blowing all other translation models out of the water. It was not fine-tuned for tasks other than what the model was trained from scratch for.
GPT-1/BERT were significant because they showed that you can pretrain one base model and use it for "everything".
Because the author is artificially shrinking the scope of one thing (prompt engineering) to make its replacement look better (context engineering).
Never mind that prompt engineering goes back to pure LLMs before ChatGPT was released (i.e. before the conversation paradigm was even the dominant one for LLMs), and covers everything from few-shot prompting (including question-answer pairs) and providing tool definitions and examples, to retrieval-augmented generation and conversation history manipulation. In academic writing, LLMs are often defined as a distribution P(y|x), where x is not infrequently referred to as the prompt. In other words, anything that comes before the output is considered the prompt.
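As a toy illustration of that framing: all of those techniques just end up concatenated into the single conditioning string x. The template and contents below are invented for illustration, not any particular API.

```python
# Everything that conditions the model is part of the "prompt" x in P(y|x).
# The template and contents here are made up purely for illustration.
tool_defs = "You may call search(query) to look things up.\n"          # tool definitions
few_shot  = "Q: 2+2?\nA: 4\nQ: 3+5?\nA: 8\n"                           # few-shot QA pairs
retrieved = "[doc] The Eiffel Tower is 330 m tall.\n"                   # RAG context
history   = "User: hi\nAssistant: hello!\n"                             # conversation history
user_turn = "User: how tall is the Eiffel Tower?\nAssistant:"

x = tool_defs + few_shot + retrieved + history + user_turn
# The model samples y ~ P(y|x); it only ever sees one long conditioning string.
print(x)
```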
But if you narrow the definition of "prompt" down to "user instruction", then you get to ignore all the work that's come before and talk up the new thing.
1) It's a small island, but it's also a major trading port. Which means its whole economy is already geared towards importing food from neighboring countries.
2) On the other hand: no domestic industry to disrupt! No domestic farming groups lobbying against meat substitutes, which may push research/distribution further along.
They're also big on future-proofing and environmental awareness in general, as they have a very stable, long-term-minded government that looks 10-100 years ahead.
Long story short: you are technically correct, but in practice things are a little different. There are 2 factors to consider here:
1. Model Capability
You are right that mechanically, input and output tokens in a standard decoder Transformer are "the same". A 32K context should mean you can have 1 input token and 32K output tokens (you actually get 1 bonus token), or 32K input tokens and 1 output token.
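A minimal sketch of that budget arithmetic, with an assumed 32K window (the "+1" reflects that the final generated token never has to be fed back into the window):

```python
# Illustrative only: how a fixed context window splits between input and output.
CONTEXT_WINDOW = 32_768  # assumed window size

def max_output_tokens(n_input_tokens: int) -> int:
    # The last generated token never has to fit back into the window,
    # hence the "1 bonus token" mentioned above.
    return CONTEXT_WINDOW - n_input_tokens + 1

print(max_output_tokens(1))        # 32768 -> essentially "32K output tokens"
print(max_output_tokens(32_768))   # 1
```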
However, if you feed an LM "too much" of its own output (read: have too long an output length), it starts to go off the rails, empirically. The phrase "too much" is doing some work here: it's a balance of both (1) LLM labs having data that covers that many output tokens in an example and (2) LLM labs having empirical tests to be confident that the model won't go off the rails within some output limit. (Note, this isn't pretraining but the instruction tuning/RLHF after, so you don't just get examples for free.)
In short, labs will often train a model targeting an output context length, and put out an offering based on that.
2. Infrastructure
While mathematically having the model read external input and its own output are the same, the infrastructure is wildly different. This is one of the first things you learn when deploying these models: you basically have a different stack for "encoding" and "decoding" (using those terms loosely; this is after all still a decoder-only model). This means you need to set max lengths for encoding and decoding separately.
So, after a long time of optimizing both the implementation and the length hyperparameters (or just winging it), the lab will decide "we have a good implementation for up to 31K input and 1K output" and go from there. If they wanted to change that, there's a bunch of infrastructure work involved. And because of the economies of batching, you want many inputs to have lengths as close to each other as possible, so you want to offer fewer configurations (some of this bucketing may be performed hidden from the user). Anyway, this is why it may become uneconomical to offer a model at a given length configuration (input or output) after some time.
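To make the separation concrete, here's a hypothetical serving config; the names and numbers are invented, not any real library's API. The point is only that the "encoding" (prefill) and "decoding" limits are chosen, optimized, and bucketed separately:

```python
from dataclasses import dataclass

# Hypothetical serving config (invented names/numbers, not a real library's API).
@dataclass
class ServingConfig:
    max_prefill_tokens: int = 31_000   # longest input the "encoding" path handles
    max_decode_tokens: int = 1_000     # longest generation the "decoding" path handles
    prefill_buckets: tuple = (2_000, 8_000, 31_000)  # pad inputs up to these lengths
                                                     # so batches stay near-uniform

cfg = ServingConfig()
# Changing either limit means re-tuning kernels, memory planning, and batching:
# the "bunch of infrastructure work" mentioned above.
assert cfg.max_prefill_tokens + cfg.max_decode_tokens <= 32_768
```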
You could easily make the other argument: As a professor of ethics she studies many different ethical systems, including ones that are not mainstream. This means that she can more easily find some ethical system under which a given action is considered ethical.
The "ethics expert = more ethical" connection has never held up and mainly serves as a gotcha.
It's a good thing I never claimed "ethics expert = more ethical", then. What I'm saying is that I agree there's an irony here.
It's true that, as you say, she could use her knowledge of ethics to be less ethical. But that would just be a different kind of irony for somebody who teaches on law and ethics.
I tried to find where I heard that Radford was inspired by that blog post, but the closest thing I found is that in the "Sentiment Neuron" paper (Learning to Generate Reviews and Discovering Sentiment: https://arxiv.org/pdf/1704.01444.pdf), in the "Discussion and Future Work" section, they mention this Karpathy paper from 2015: Visualizing and Understanding Recurrent Networks https://arxiv.org/abs/1506.02078