The Bitter Lesson (2019)

civilized · on April 2, 2022

> Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

This assessment is a bit off.

First, convolution and invariance are definitely not the only things you need. Modern DL architectures use lots of very clever gadgets inspired by decades of interdisciplinary research.

Second, architecture still matters a lot in neural networks, and domain experts still make architectural decisions heavily informed by domain insights into what their goals are and what tools might make progress towards these goals. For example, convolution + max-pooling makes sense as a combination because of historically successful techniques in computer vision. It wasn't something randomly tried or brute forced.

The role of domain expertise has not gone away. You just have to leverage it in ways that are lower-level, less obvious, less explicitly connected to the goal in a way that a human would expect based on high-level conceptual reasoning.

From what I've heard, the author's thesis is most true for chess. The game tree for chess isn't so huge as Go, so it's more amenable to brute forcing. The breakthrough in Go was not from Moore's Law, it was from innovative DL/RL techniques.

Computation may enable more compute-heavy techniques, but it doesn't mean it's obvious what these techniques are or that they are well-characterized as simpler or more "brute force" than past approaches.

a-dub · on April 2, 2022

> First, convolution and invariance are definitely not the only things you need. Modern DL architectures use lots of very clever gadgets inspired by decades of interdisciplinary research.

i have noticed this. rather than replacing feature engineering, it seems that you find some of those ideas from psychophysics just manually built into the networks.

civilized · on April 2, 2022

Curious what you're referring to? My knowledge of this area is not that broad or deep at all.

MauranKilom · on April 3, 2022

The weight patterns that convolutional neural networks develop are pretty familiar in many ways. For example, the first layer will generally end up with small-scale feature detectors, such as borders, gradients/color pairs and certain textures, at various scales and angles.

Try an image search for "imagenet first layer" to see examples.

I took the comment to mean "we have ourselves discovered certain filters being useful (e.g. https://en.wikipedia.org/wiki/Gabor_filter), and the networks now also discover this same information".

a-dub · on April 3, 2022

it is true that the promise of dl succeeds at finding some handcrafted features. it is also true that (at least the last time i checked), the state of the art is still making use of handcoded transforms that are derived from results in psychophysics.

a-dub · on April 2, 2022

mel warped cepstra still see use and improve performance for nns, is one example.

fxtentacle · on April 2, 2022

I used to agree, but now I disagree. You don't need to look any further than Google's ubiquitous mobilenet v3 architecture. It needs a lot less compute but outperforms v1 and v2 in almost every way. It also outperforms most other image recognition encoders at 1% the FLOPS.

And if you read the paper, there's experienced professionals explaining why they made which change. It's a deliberate handcrafted design. Sure, they used parameter sweeps, too, but that's more the AI equivalent of using Excel over paper tables.

vegesm · on April 2, 2022

Actually, MobileNetV3 is a supporting example of the bitter lesson and not the other way round. The point of Sutton's essay is that it isn't worth adding inductive biases (specific loss functions, handcrafted features, special architectures) to our algorithm. Having lots of data, just put that into a generic architecture and it eventually outperforms manually tuned ones.

MobileNetV3 uses architecture search, which is a prime example of the above: even the architecture hyperparameters are derived from data. The handcrafted optimizations just concern speed and do not include any inductive biases.

fxtentacle · on April 2, 2022

"The handcrafted optimizations just concern speed"

That is the goal here. Efficient execution on mobile hardware. Mobilenet v1 and v2 did similar parameter sweeps, but perform much worse. The main novel thing about v3 is precisely the handcrafted changes. I'd treat that as an indication that those handcrafted changes in v3 far exceed what could be achieved with lots of compute in v1 and v2.

Also, I don't think any amount of compute can come up with new efficient non-linearity formulas like hswish in v3.

fnbr · on April 2, 2022

Right, but that’s not a counterexample. The bitter lesson suggests that, eventually, it’ll be difficult to outperform a learning system manually. It doesn’t say that this is always true. DeepBlue _was_ better than all other chess players at the time. But now, AlphaZero is better.

I believe the same is true for neural network architecture search: at some point, learning systems will be better than all humans. Maybe that’s not true today, but I wouldn’t bet on that _always_ being false.

fxtentacle · on April 2, 2022

The article says:

"We have to learn the bitter lesson that building in how we think we think does not work in the long run."

And I would argue: It saves at least 100x in compute time. So by hand-designing relevant areas, I can build an AI today which otherwise would become possible due to Moore's law in about 7 years. Those 7 years are the reason to do it. That's plenty of time to create a startup and cash out.

bee_rider · on April 2, 2022

I think the "we" in this case is researchers and scientists trying to advance human knowledge, not startup folks. Startups of course expend lots of effort on doing things that don't end up helping humanity in the long run.

sitkack · on April 2, 2022

Sutton is talking about a long term trend. Would Google have been able to achieve this w/o a lot of computation? I don't think it refutes the essay in any way. If anything, model compression takes even more computation. We can't scale heuristics, we can scale computation.

koeng · on April 2, 2022

Link to the paper?

fxtentacle · on April 2, 2022

https://arxiv.org/abs/1905.02244v5

tejohnso · on April 2, 2022

This reminds me of George Hotz's Comma.ai end to end reinforcement learning approach vs Tesla's feature-engineering based approach described in an article I read.

Hotz feels that "not only will comma outpace Tesla, but that Tesla will eventually adopt comma’s method."[1]

[1]: https://return.life/2022/03/07/george-hotz-comma-ride-or-die

Previous discussion on the article: https://news.ycombinator.com/item?id=30738763

fnbr · on April 2, 2022

I think an end-to-end RL approach will eventually work- but _eventually_ could be in a really long time. It’s also a question of scale: even if Comma’s approach is fundamentally better, how much better is it? If Tesla has 1000x more cars, and their approach is 10x worse, they’ll still improve 100x faster than Comma.

rahimiali · on April 3, 2022

Many of the comments seem to interpret this to mean "don't try to craft insights into your models. instead use dumb generic models and a lot of data." That's not how I read the article.

Not all convolutional neural nets do a good job at image classification, regardless of how much data you feed them. Dense neural nets are plain just plain bad at it. Not all brute force searches over chess plays win. Not all recurrent models translate text well.

The art of incorporating insights into model is still alive and well. Without introducing "structural biases" into models, none of our current methods would work. The problem is that we've managed to introduce a lot of structural biases without understanding them well. And we should do a better job of understanding these bits of side knowledge we introduce into models.

I also agree with the main point of the article that if you're at a crossroads between introducing a very blunt structural bias (like invariance under lighting for image classifiers), you're probably better off instead throwing compute at the problem (by, say, synthetically augmenting your dataset by 10x).

rzk · on April 2, 2022

Previous discussions:

2019: https://news.ycombinator.com/item?id=19393432

2020: https://news.ycombinator.com/item?id=23781400

dang · on April 2, 2022

Thanks! Macroexpanded:

The Bitter Lesson (2019) - https://news.ycombinator.com/item?id=23781400 - July 2020 (85 comments)

The Bitter Lesson - https://news.ycombinator.com/item?id=19393432 - March 2019 (53 comments)

GlenTheMachine · on April 3, 2022

I’m currently attempting to NOT do reinforcement learning research, and the points made in this article are why, in the negative.

In naive RL, we model every process as having unknown dynamics and an unknown optimal policy, and furthermore the dynamics are assumed to be Markov decision processes. An MDP is, as far as I know, the least restrictive (hence most general) setting in which a dynamic process can be considered a “process”; any more general and the only “dynamics” we would be left with would be pure random noise.

But for nearly all physical systems of interest, the dynamics can be described perfectly well as differential equations. We might not know the structure of the equations, but simply stating that the dynamics are in fact differential equations, and building a learning algorithm around that assumption, actually provides a significant amount of structure vice MDPs. So I am researching differential equation-based learning techniques, on the assumption that they should be much more data efficient.

The arguments put forth here seem to indicate that I should stop, and just wait for computing to catch up to extremely high dimensional MDPs.

Harrumph.

kjksf · on April 2, 2022

And this is why I'm much less pessimistic than most about robotaxis.

Waymo has a working robotaxi in a limited area and they got there with a fleet of 600 cars and mere millions of driving data.

Now imagine they trained on 100x cars i.e. 60k cars and billions of driving data.

Guess what, Tesla already has FSD running, under human supervision, in 60k cars and that fleet is driving billions of miles.

They are colleting 100x data sets as I write this.

We also continue to significantly improve hardware for both NN inference (Nvidia Drive, Tesla FSD chip) and training (Nvidia GPUs, Tesla Dojo, Google TPU and 26 other startups working on AI Hardware https://www.ai-startups.org/top/hardware/)

If the bitter lesson extends to problem of self-driving, we're doing everything right to solve it.

It's just a matter of time to collect enough training data, have enough compute to train the neural network and enough compute to run the network in the car.

fxtentacle · on April 2, 2022

You're not wrong, but I believe you're so far off on the necessary scale that it'll never solve the problem.

For an AI to learn to play Bomberman at an acceptable level, you need to run 2-3 billion training steps with RL learning where the AI is free to explore new actions to collect data about how well they work. I'm part of team CloudGamepad and we'll compete in the Bomberland AI challenge finals tomorrow, so I do have some practical experience there. Before I looked at things in detail, I also vastly overestimated reinforcement learning's capabilities.

For an AI to learn useful policy without the ability to confirm what an action does, you need exponentially more data. There's great papers by DeepMind and OpenAI that try to ease the pain a bit, but as-is, I don't think even a trillion miles driven would be enough data. Letting the AI try out things, of course, is dangerous, as we have seen in the past.

But the truly nasty part about AI and RL in particular is that the AI will act as if anything that it didn't see often enough during training simply doesn't exist. If it never sees a pink truck from the side, no "virtual neurons" will grow to detect this. AIs in general don't generalize. So if your driving dataset lacks enough examples of 0.1% black swan events, you can be sure that your AI is going to go totally haywire when they happen. Like "I've never seen a truck sideways before => it doesn't exist => boom."

gwern · on April 2, 2022

> But the truly nasty part about AI and RL in particular is that the AI will act as if anything that it didn't see often enough during training simply doesn't exist. If it never sees a pink truck from the side, no "virtual neurons" will grow to detect this. AIs in general don't generalize. So if your driving dataset lacks enough examples of 0.1% black swan events, you can be sure that your AI is going to go totally haywire when they happen. Like "I've never seen a truck sideways before => it doesn't exist => boom."

Let's not overstate the problem here. There are plenty of AI things which would work well to recognize a sideways truck. Look at CLIP, which can also be plugged into DRL agents (per the cake); find an image of your pink truck and text prompt CLIP with "a photograph of a pink truck" and a bunch of random prompts, and I bet you it'll pick the correct one. Small-scale DRL trained solely on a single task is extremely brittle, yes, but trained over a diversity of tasks and you start seeing transfer to new tasks and composition of behaviors and flexibility (look at, say, Hide-and-Seek or XLAND).

These are all in line with the bitter hypothesis that much of what is wrong with them is not some fundamental problem that will require special hand-designed "generalization modules" bolted onto them by generations of grad students laboring in the math mines, but simply that they are still trained on too undiverse problems for too short a time with too little data using too little models, and that just as we already see strikingly better results in terms of generalization & composition & rare datapoints from past scaling, we'll see more in the future.

What goes wrong with Tesla cars specifically, I don't know, but I will point out that Waymo manages to kill many fewer people and so we shouldn't consider Tesla performance to even be SOTA on the self-driving task, much less tell us anything about fundamental limits to self-driving cars and/or NNs.

mattnewton · on April 2, 2022

> What goes wrong with Tesla cars specifically, I don't know, but I will point out that Waymo manages to kill many fewer people and so we shouldn't consider Tesla performance to even be SOTA on the self-driving task, much less tell us anything about fundamental limits to self-driving cars and/or NNs.

Side note, but I think Waymo is treating this more like a JPL "moon landing" style problem and Tesla is trying to sell cars today. It's very different to start with making it possible and then scaling it down vs trying to build something working backwards from the sensors and compute economical to ship today.

fxtentacle · on April 2, 2022

My theory is that gradient descent won't work for AGI and we'll need a completely new approach before we get safe general driving. But that's just my theory.

"Let's not overstate the problem here. There are plenty of AI things which would work well to recognize a sideways truck"

Yes, but nothing reliable enough to trust your life with. Clip also has some pretty weird bugs.

https://aiweirdness.tumblr.com/post/660687015733559296/galle...

gwern · on April 3, 2022

Gradient descent has ushered out many challengers who were sure they could do things gradient descent could never do even on bigger compute. I'm not worried. (My own belief is that gradient descent is so useful that any better optimization approach will simply evolve gradient descent as an intermediate phase for bootstrapping problem-specific learning. It's a tower of optimizers all the way down/up.)

You can't call that a 'CLIP bug' because using CLIP for gradient ascent on a diffusion model is not remotely what it was trained or intended to do, and is not much like your use-case of detecting real world objects. It's basically adversarial pixel-wise hacking, which is not what real world pink trucks are like. Also, that post was 7 months ago, and the AI art community advances really fast. (However bad you think those samples are, I assure you the first BigSleep samples, back in February 2021 when CLIP had just come out, would have been far worse.) 'Unicorn cake' may not have worked 7 months ago, but maybe it does now... Check out the Midjourney samples all over AI art Twitter the past month.

shadowgovt · on April 2, 2022

The sensors self-driving cars use are far less sensitive to color than human eyes.

You can generalize your concept to the other sensors, but sensor fusion compensates somewhat... The odds of an input being something never seen across all sensor modalities become pretty low.

(And when it did see something weird, it can generally handle it the way humans do... Drive defensively).

mr_toad · on April 3, 2022

> with RL learning where the AI is free to explore new actions to collect data about how well they work

Self driving cars aren’t free to explore new actions. That would be frightening. Self driving cars use a limited form of AI to recognise the world around them, but the rules that decide what they do with that information are simple algorithms.

naveen99 · on April 2, 2022

What were the new data augmentation methods for optical flow you referred to on a previous comment on this topic ?

fxtentacle · on April 3, 2022

I'm not quite sure what you mean, but I do remember writing this last time there were AI approach discussions: https://news.ycombinator.com/item?id=29898425

And https://news.ycombinator.com/item?id=29911293 was about my Sintel approach.

But that was a while ago, so nowadays new submissions have pushed me further down, but archive.org confirms that I was leading in "d0-10" and "EPE matched" on 25 Sep 2020:

https://web.archive.org/web/20200925202839/http://sintel.is....

On a completely unrelated note: I'm so looking forward to the Bomberland finals in 15 minutes :)

Animats · on April 2, 2022

Waymo is not a raw neural network. Waymo has an explicit geometric world model, and you can look at it.

VHRanger · on April 2, 2022

More data doesn't help if the additional data points don't add information to the dataset.

At some point it's better to add features than simply more rows of observations.

Arguably text and images are special cases here because we do self supervised learning (which you cant do for self driving for obvious reasons).

What TSLA should have done a long time ago is keep investing in additional sensors to enrich data points, rather than blindly collecting more of the same.

YeGoblynQueenne · on April 2, 2022

And a Better Lesson:

https://rodneybrooks.com/a-better-lesson/

jasfi · on April 2, 2022

So don't build your AI/AGI approach too high-level. But you still need to represent common sense somehow.

douglee650 · on April 3, 2022

Never underestimate the power of brute force

antiquark · on April 2, 2022

The author is applying the "past performance guarantees future results" fallacy.