ikhatri's comments | Hacker News

When I was in college my friends and I did something similar with all of Donald Trump’s tweets as a funny hackathon project for PennApps. The site isn’t up anymore (RIP free heroku hosting) but the code is still up on GitHub: https://github.com/ikhatri/trumpitter


This is a super common occurrence in training loops for ML models because PyTorch uses multiprocessing for its dataloader workers. If you want to read more, see the discussion in this issue: https://github.com/pytorch/pytorch/issues/13246#issuecomment...

As you’ve pointed out, fork() isn’t ideal for a number of reasons, and in general it’s preferred to use torch tensors directly instead of numpy arrays so that you’re not forced into using fork().
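
To make that concrete, here’s a minimal sketch (my own toy example, not from the linked issue) of the pattern in question: keep the dataset contents in one contiguous torch tensor rather than a long Python list of per-item objects, so forked workers aren’t constantly touching Python refcounts and copying pages.

    import torch
    from torch.utils.data import Dataset, DataLoader

    class ListDataset(Dataset):
        """Problematic pattern: a big Python list of per-item arrays. Every access
        bumps Python refcounts, so forked workers gradually copy the parent's pages."""
        def __init__(self, n=100_000, dim=128):
            self.items = [torch.randn(dim).numpy() for _ in range(n)]
        def __len__(self):
            return len(self.items)
        def __getitem__(self, idx):
            return torch.from_numpy(self.items[idx])

    class ContiguousTensorDataset(Dataset):
        """Preferred pattern: one contiguous torch tensor; indexing returns a view
        and doesn't walk a list of Python objects."""
        def __init__(self, n=100_000, dim=128):
            self.data = torch.randn(n, dim)
        def __len__(self):
            return self.data.shape[0]
        def __getitem__(self, idx):
            return self.data[idx]

    if __name__ == "__main__":
        loader = DataLoader(ContiguousTensorDataset(), batch_size=64, num_workers=4)
        for batch in loader:
            pass  # training step would go here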

There’s also this write up which I found to be quite useful for details: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multip...


Egocentric means that the sensor’s frame of reference is the same as the ego (self). The person walking around is wearing the sensors/glasses themselves, and their pose is given as “ego pose”.

This dataset is clearly targeted for research on AR/XR/VR applications.


+1

I’ll “yes, and” here… beyond AR/VR, a more powerful use case is multi-modal learning (with RL), which is an area where Meta is probably the leader IMO.

Example paper here: “Towards Continual Egocentric Activity Recognition: A Multi-modal Egocentric Activity Dataset for Continual Learning”

https://arxiv.org/abs/2301.10931

This IMO is the pathway to AGI, as it combines all sense-plan-do data into a time coordinated stream and mimics how humans transfer learning to children via demonstration recording and behavior authoring.

If we can create robotics with locomotion and dexterous manipulation, egocentric exploration, and a behavior authoring loop that uses human behavior demonstration and trajectory reinforcement - well, we’ll have the AI we’ve all been talking about.

Probably the most exciting area of research that most people don’t know or care about.

That’s why head-mounted, all-day egocentric AR is so important - it gives eyes, ears, and sense perception to our learning systems, with human-directed egocentric behaviors guiding the whole thing. Just like pushing your kid down the street in the stroller.


Just to make sure I understand your excitement: we need guinea-pigs ahem people to wear 'head-mounted, all-day egocentric AR' with who knows how many integrated sensors for long stretches on end, so we can finally get to our fabled A.G.I.?

That is some B.F. Skinner level future we're aiming for--only this time around, humans become the fully surveilled 'teaching machine'.


Well no...not guinea pigs. But correct conceptually - if it's opt-in only and perfectly transparent to everyone what is happening, which in this specific case of Aria it absolutely is.

If we want to make machines with capacity equivalent to or better than humans, we have to transfer the process for scientific discovery, including the sum of our cognitive capacity and knowledge, to them.

If you quantify human adult-infant interactions, it boils down to human adults introducing learning trajectories, labeling input data, and biasing weights with reinforcing behaviors for new reinforcement agents. If we can re-build the infrastructure to do precisely that, where the agent is in the place of the infant and society is in the place of the "human adult", then we will have re-built at scale the process for human development.

The best way we know how to do this today is implementing transfer learning approaches from the basic human developmental research. I started down this road back in 2010 trying to follow the work of Frank Guerin out of the University of Aberdeen [1] [2].

[1] https://www.surrey.ac.uk/people/frank-guerin

[2] https://scholar.google.co.uk/citations?view_op=view_citation...


But what about observer effects? People act differently when recorded, and rarely do we catch humans acting naturally when knowingly observed (some of the early 24h/day Twitch streamers come to mind). And what happens once trials are done? How would people feel about their actions becoming part of a technology potentially able to replace them?

Even when this barrier can be overcome (i.e. people become accustomed to wearing these devices), I worry about the opt-in nature of it. We've yet to see a disruptive technology adhering to this principle through-and-through, and if current learning efforts are anything to go by, training data is not something companies want to willingly let go or lose out on.

Taken together, this path has the potential to be quite coercive if no strong guarantees or safeties can be upheld, especially if early exciting trials generate an interest-boom similar to the one we're seeing right now in the LM-space.


This is a great point, and it's why I advocate so vociferously that all of these systems and future organizations that are going in this direction should be cooperatively owned, based on mutual, voluntary, democratic principles, rather than owned by a small subset of wealthy individuals in your standard business construct.


That would be a welcome future, indeed. And hopefully, not just upheld in some regions of the world, but everywhere where AR-backed AGI gets off the ground. And this governing structure would need to work for some decades at least. Which would be quite a feat.

That still leaves my first question regarding observer effects and how people would respond to such a technology on an individual level. It would have the capacity to reshape behaviour towards preferential and/or optimal interactions, would it not? Seeing how we do not want to reinforce models with 'erroneous' interactions?


TBH I don't know, and I think there's a real chance that there are going to be actual changes in how people behave as a result - which, if it's integrated like many other social changes, will become another layer in the fabric of society, displacing another layer. For better or worse I think it's just an exposure thing.

You are persistently surveilled in London and Shanghai and New York City - yet people act just as unhinged as they did before the cameras were installed.

I'm not sure what other data acquisition/technology arc is possible though, and open to ideas.


> You are persistently surveilled in London and Shanghai and New York City - yet people act just as unhinged as they did before the cameras were installed.

Unhinged people do, but ordinary people? I'd be willing to bet that normal people who are in areas where they are aware they're on camera don't behave as their normal selves. It's hard to see how it could be otherwise.


Is that model (parents giving labeled input and affecting some weights in the child’s head with reinforcement) really a good fit for the reality of how people learn to do things?

It’s my understanding (though I haven’t looked at the primary sources myself) that one of the facts that inspired Chomsky’s language theories and work for instance, was that when you quantify the information communicated by parents to language learning children, there’s actually not very much of it. Not nearly enough to support that what’s going on is anything like the kind of learning embodied by machine learning models.

If that’s true, and there is something of how to act intelligently / humanly already encoded in children (maybe genetically?) and not communicated by this sort of training, wouldn’t ignoring that and trying to get to it purely in this machine learning way be.. at least not at all informed by evidence / examples of it working in nature?


So this is extremely complicated and nuanced with respect to intelligence acquisition, and I don’t think there’s a definitive right or wrong answer.

So this is extremely complicated. I certainly acknowledge my own bias with this. However, with respect to what Chomsky discusses, I make the distinction that most of the “code/data/information” that you need in order for the language capacity to develop is actually embedded in our biological mechanical systems. That is to say, if you were to take a human infant and never expose it to another human with respect to generating sounds for language, the infant would still develop some sort of sound-based communication system. We see this with feral children, mute children, deaf children. They still have a verbal function, even if it’s not connected to any semblance of coherency.

So in that sense it’s like you’re given all of the building blocks for language out of the gate biologically, and then the people who are around you tell you how to assemble them into something that is functional. This is why different languages have different rules, yet language acquisition is consistent across cultures.

This is why I am insistent on holistically understanding the computing infrastructure and systems, because the sensors, processors, etc. are the equivalent of our cells, genes, muscles, bones, etc. Most people don’t think about computing systems and generally intelligent systems this way.

If you go back and look at the work of Wiener and early cybernetics, it does discuss a lot of this. However, after cybernetics was absorbed into artificial intelligence, which was in turn absorbed into computer science, the field doesn’t really look holistically at systems of systems anymore, unfortunately, in the general case.

And I would argue that all of machine learning is currently moving in the direction that I am describing, where it is exposure to the frequency of correlated data that gives you your effective understanding of the world and the ability to predict its future state. That’s what I mean when I say multi-modal is “sequential and consistent in time” with respect to causal action.


As with most technology, there are plusses and minuses.

If used correctly ("if" is doing lots of heavy lifting here) this type of system - eye gaze, IMU & microphones - would provide much, much better hearing aids than the current state of the art, at a much cheaper price (go look up the price of hearing aids, it's _extortion_).

Using gait analysis, it would be possible to predict when someone is prone to falls, allowing much longer independence for older people.

Assuming that it's possible to understand who you are talking to and what they said, you could mitigate and support dementia much more than we can now.

However.

You also have a vast network of headsets with highly accurate, always-on location, able to see what you are looking at, who you talk to, what you say, and in some cases what you feel about things.

Add in some basic object/facial recognition and you have an authoritarian's wet dream.

Now is the time to regulate, but alas, that won't happen.


+1

Applications of embodied AI are very interesting. Additionally, a lot of hard problems are increasingly being solved in simulation like this. See Wayve's GAIA world model.


> a behavior authoring loop

A behaviour that authors loops?


I would read it as “a loop that authors behaviors”.


I was going to post Basic Pitch from Spotify but it looks like billconan beat me to it. That said, I can give you a bit more advice. The Spotify Basic Pitch model isn't too good at multi-track input. It's capable of it, but you may actually get better results if you separate out the tracks first and then run them individually through the Basic Pitch model.

In order to do this you can use a source/stem separation model like spleeter (https://github.com/deezer/spleeter) and then run the Basic Pitch model (or any other MIDI transcription model). There are others you can try which may yield better results, for example omnizart (https://github.com/Music-and-Culture-Technology-Lab/omnizart).
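
If it helps, here's a rough sketch of that two-step pipeline in Python. The exact function names and return values can differ between library versions, so treat this as an outline rather than copy-paste code; the file paths are placeholders.

    from spleeter.separator import Separator
    from basic_pitch.inference import predict

    # Step 1: split the mix into stems (vocals / drums / bass / other).
    separator = Separator("spleeter:4stems")
    separator.separate_to_file("song.mp3", "stems/")

    # Step 2: transcribe one stem to MIDI with Basic Pitch.
    # predict() returns the raw model output, a pretty_midi object, and note events.
    model_output, midi_data, note_events = predict("stems/song/other.wav")
    midi_data.write("other.mid")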

Either way, the key words you want to be looking for are "midi transcription" and "stem separation", which should help you find more models to try for both steps. Good luck! :)

EDIT: Oh it looks like there's even a stem separation leaderboard on papers with code, neat: https://paperswithcode.com/task/music-source-separation


That is insane(ly cool). Do you know how the separation models work, in principle?


Yup! It's a U-Net style convolutional neural network that runs on the spectrograms. There's a short 2 page abstract here: https://archives.ismir.net/ismir2019/latebreaking/000036.pdf and they cite this prior work as where they got the architecture from: https://arxiv.org/abs/1906.02618
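
If you're curious what that looks like in code, here's a heavily stripped-down toy version (mine, far smaller and simpler than the real spleeter model): an encoder/decoder with a skip connection predicts a soft mask over the magnitude spectrogram, and the masked spectrogram is the estimated source.

    import torch
    import torch.nn as nn

    class TinyUNet(nn.Module):
        """Toy U-Net-style separator: downsample, upsample, one skip connection,
        then predict a 0-1 mask over the input magnitude spectrogram."""
        def __init__(self):
            super().__init__()
            self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU())
            self.dec = nn.Sequential(nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1), nn.ReLU())
            self.out = nn.Conv2d(16 + 1, 1, 1)  # 1x1 conv after concatenating the skip

        def forward(self, spec):                  # spec: (batch, 1, freq, time)
            h = self.enc(spec)
            h = self.dec(h)
            h = torch.cat([h, spec], dim=1)       # skip connection from the input
            mask = torch.sigmoid(self.out(h))     # per-bin soft mask in [0, 1]
            return mask * spec                    # estimated source spectrogram

    mix = torch.rand(1, 1, 512, 256)  # fake magnitude spectrogram (freq x time)
    est = TinyUNet()(mix)
    print(est.shape)                  # torch.Size([1, 1, 512, 256])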


As someone who works in the industry (disclaimer: these are my own views and don't reflect those of my employer), something about the framing of this article rubs me the wrong way despite the fact that it's mostly on point. Yes, it is true that different companies are choosing different sensing solutions based on cost and the ODD in which they must operate. But I think this last sentence just left a sour taste in my mouth: "But the verdict is still out as to which is safer".

It is not an open question and I hate it when writers frame it this way. Camera-only (specifically, monocular-camera-only) systems literally cannot be safer than ones with sensor fusion right now. This may change at some point in the future, but right now it's not a question, it's a fact.

Setting aside comparisons to humans for a second (will get back to this), monocular cameras can only provide relative depth. You can guess the absolute depth with your neural net but the estimates are pretty garbage. Unfortunately, robots can't work/plan with this input. The way any typical robotics stack works is that it relies on an absolute/measured understanding of the world in order to make its plans.

That isn't to say that we'll never be able to use mono (relative) depth with sufficiently powerful ML and better representations. People argue that humans don't really use our stereoscopic depth past ~10m or so and that's a fair point. But we also don't plan the way robots do. We don't require accurate measurements of distance and size. When you're squeezing your car into a parking spot you don't measure your car and then measure the spot to know if it'll fit. You just know. You just do it. And it's a guesstimate (so sometimes humans make mistakes and we hit stuff). Robots don't work this way (for now), so their sensors cannot work this way either (for now).


Self driving isn't a sensor problem, it's a software problem.

From how humans drive, it's pretty clear that there exists some latent space representation of immediate surroundings inside our brains that doesn't require a lot of data. If you had a driving sim wheel and 4 monitors for each direction + 3 smaller ones for rear view mirror, connected to a real world car with sufficiently high definition cameras, you could probably drive the car remotely as well as you could in real life, all because the images would map to the same latent space.

But the advantage that humans have is that we have an innate understanding of basic physics from experience interacting with the world, which we can deduce from something as simple as a 2D representation, and that is very much a big part of that latent space. You wouldn't be able to drive a car if you didn't have some "understanding" of things like velocity, acceleration, object collision, etc.

So my bet is that, just like with LLMs, there will be research published at some point showing that, given certain frames in a video, a model can extrapolate the physical interactions that will occur, including things like collisions, relative distances, and so on. Once that is in place, self driving systems will get MASSIVELY better.


It's both. Your eyes have much better dynamic range and FPS than modern self driving systems & cameras. If you can reduce the amount of guessing your robot does (e.g. laser says _with certainty_ that you'll collide with an object ahead), you should do it.

Self-driving is still a robotics problem, and robots are probabilistic operators with many component dependencies. If you have three 99%-reliable systems strung together running 24 hours a day, that's about 43 minutes a day during which the combined system will be unreliable ((1 - .99^3) * 1440). Multi-modality allows your systems to provide redundancy for one another and reduce the accumulating correlated errors.
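
The arithmetic, spelled out (toy numbers, assuming independent failures):

    per_system = 0.99
    combined = per_system ** 3                 # three 99%-reliable systems in series
    unreliable_minutes = (1 - combined) * 24 * 60
    print(round(unreliable_minutes, 1))        # ~42.8 minutes per day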


> Your eyes have much better dynamic range and FPS than modern self driving systems & cameras.

Eh, kind of...

https://youtu.be/HU6LfXNeQM4?t=1987

Check out this NOVA video on how limited your acute vision actually is. It is only by rapidly moving our eyes around that we have high quality vision. In the places you are not looking your brain is computing what it thinks is happening, not actually watching it.


I should have said eyes+brain in combination have much better dynamic range and FPS perception than self driving systems. The point remains unchanged -- what sensor you use is tied to the computation you need to do. What you see is the sum of computation+sensor, so it's impossible for the sensor not to matter.

Tangential: event cameras work more like our eyes but aren't ready for AVs yet.


It's only "kind of" if they compensate for the reduced specs. As the root commenter said, they don't compensate yet. It's just less safe in those situations.

Whether it's fine to be less safe in certain situations because it's safer overall is a different question.


> In the places you are not looking your brain is computing what it thinks is happening, not actually watching it.

The existence of peripheral vision disputes that pretty definitively, though.


I do recommend that you stop and watch the video first to understand better what's going on there....


I tried but it's not available in my country, sadly.


> Your eyes have much better dynamic range and FPS than modern self driving systems & cameras. If you can reduce the amount of guessing your robot does (e.g. laser says _with certainty_ that you'll collide with an object ahead), you should do it.

You could drive fine at 30fps on a regular monitor (SDR). More fps would help with aggressive/sporty driving of course.


> You could drive fine at 30fps on a regular monitor (SDR). More fps would help with aggressive/sporty driving of course.

What? This is preposterous.

Have you tried playing a shooter video game at 30 FPS? It's atrocious, you get rekt. There is a reason all gamers are getting 120 FPS and up.

30 FPS means 33 ms of latency. Driving on a highway, the car moves over a meter before the camera even detects an obstacle. The display has its own input lag, and so does the operating system. Your total latency is going to be over 100ms, so the car will have travelled several meters. If a motorcyclist in front of you falls, you will feel the car crashing into his body before the image even appears on the screen.


There are plenty of FPS racing games that you can play just fine at 30FPS. Obviously more FPS is a better experience, but it's not like it becomes impossible to drive.

Also, if you truly are only a few meters behind a motorcyclist when driving at highway speeds, by definition you are being unsafe. The rule I learned in driving school was roughly 1 car length per 10mph of space, so you should be ~90 feet (~30 meters) away.

Finally, the average reaction time for people driving in real life is something like 3/4 of a second. 750ms to transition from accelerating to braking. A self-driving car being able to make decisions in the 100ms time frame is FAR superior.


I agree this is preposterous but one nit to pick: event loops on self driving cars are really that slow, and they must use very good behavior prediction + speculative reasoning to deal with scenarios like the one you described.


oh dear


Have you tried doing this in the dark? Have you tried spotting the little arrow in the green traffic light that says you can turn left, consistently, in your video feed even facing a low sun?


Only if that monitor was hooked up to a camera that could dynamically adjust its gain to achieve best possible image contrast in everything from bright sunlight to moonlit night.

You’d also lose depth perception entirely, which can’t be good for your driving.


You can test this pretty easily, it's not like that model doesn't exist. Play your average driving videogame at 30fps in first-person mode. Crank up the brightness until you can barely see if you like. We do it just fine because the model exists in our head, not because there's some inherent perfection in our immediate sensing abilities.


Yeah. I mean you're right and wrong at the same time imo. I won't hypothesize about how humans drive. I think for the most part it's a futile exercise and I'll leave that to the people who have better understanding of neuroscience. (I hate when ML/CS people pretend to be experts at everything).

That being said, this idea of a latent space representation of the world is the right tree to be barking up (imo). The problem with "scale it like an LLM" right now is that 3D scene understanding (currently) requires labels. And LLMs scale the way they do because they don't require labels. They structure the problem as next token prediction and can scale up unsupervised (their state space/vocabulary is also much smaller). And without going into too much detail, myself (and others I know in this field) are actively doing research to resolve these issues so perhaps we really will get there someday.

Until then however. Sensors are king, and anyone selling you "self-driving" without them is lying to you :)


Correction: anyone selling you "self driving" is lying to you.

We're at least a decade away from it... (and yes, I've seen the current batch of FSD videos).


The only one literally selling it is Mercedes. What is wrong with it? Don't you consider it "self driving"?

https://media.mbusa.com/releases/mercedes-benz-worlds-first-...


I think you may be over-indexing on the word "selling". I didn't mean it literally as in for sale to you (the customer) directly. That is what Tesla FSD is claiming and I agree with you that we're some indeterminate amount of time away from it.

However Waymo, Cruise and others do exist. If you haven't already, check out JJRicks videos on YouTube. I think you might be changing the number of years in your estimation ;)


Each time I see functional FSD it is in a very specific and limited scope. Ultra-precise maps, low speed, good roads, a suitable climate, and a system that can just bail and stop the car are common themes. I would also be interested to hear whether places with Waymo have traffic rules where pedestrians/cyclists have priority without relying on traffic signs.


> if you had a driving sim wheel and 4 monitors for each direction + 3 smaller ones for rear view mirror, connected to a real world car with sufficiently high definition cameras, you could probably drive the car remotely as well as you could in real life, all because the images would map to the same latent space.

I disagree. When in a car, we are using more than our eyes. We have sound as well, of course, something that provides feedback even in the quietest cars. We also have the ability to feel vibration, gravity and acceleration. Sitting in a sim without at least some of these additional forms of feedback would be a different skill.


There was an event where they took the top iRacing sim driver and put him in a real F1 car and he was able to do VERY well in terms of lap times.

There was another event where they took another sim driver and put him in a real drift car, and he was able to drift very well.

Both vids are on youtube. Yes, real world driving has more variables, and yes, the racing drivers had force feedback wheels, but in general, if a person is able to control a car so well as to put the virtual wheel in the right square foot of the virtual track to take a corner optimally, it's probably likely that most people could drive very well solely from visual feedback. Sound and IMUs can provide additional corrective information, but the key point remains that whatever software runs has to deduce physics from visual images.


Would you say your examples are moving the sim driver from fewer sensors (an abstraction of driving) to more sensors (the real world)?


Driving sims obviously have sound, and also feedback through the steering wheel (sometimes also the seat).

Self driving cars obviously have microphones and accelerometers too.


https://youtu.be/HU6LfXNeQM4

I recommend watching this NOVA video on human perception. When doing any number of tasks, especially ones we do commonly, we're using a ton of unconscious perception and prediction based upon our internal representation of physics and human modeling.

For example, when I was younger I noticed that I was commonly aware that a car was going to get over before it did so. I kept an eye out trying to determine why this was the case and I noticed two things. One is people commonly turn their head and check the mirrors before they even signal to get over. The other is they'll make a slight jerk of the wheel in the direction before making the lane change.


The assertion that "self driving isn't a sensor problem, it's a software problem" is hard to support today. Your human vision analogy leaves out a lot of both sensor and processing differences between what we call machine vision and human vision.

Even if parity with human vision can be attained, humans kill 42,000 other American humans each year on the roads. If human driven cars were invented today, and pitched as killing only 42,000 people per year, the inventor would get thrown into a special prison for supervillains.


> Self driving isn't a sensor problem, it's a software problem.

Taking things to the extreme, perhaps it’s actually a networking problem.

Cars should have the ability to send signals and hear signals from the cars around it.

Imagine if Car A could improve its own understanding of the environment using inputs/sensor data from nearby Car B.


Not much would change. The idiotic idea of removing traffic lights in favor of self driving cars zipping past each other forgets about those pesky pedestrians we should be designing cities for.


When I wrote the comment, I was envisioning the current world, but with some bluetooth type protocol that cars could use to send beacons to help other cars near it.

The most basic example of how this could be helpful is if the car ahead of you turns a sharp corner and crashes into a truck stopped in the road. Without car-to-car networking, you won't brake until the crash is in your line of sight.

Have you ever seen those youtube videos of massive car pile ups on highways caused by a crash, and then a cascade of additional crashes afterwards? E.g. icy conditions or dense fog. What if the original crash could be communicated to the cars behind it? Wouldn't that be helpful if the crash isn't yet in the driver's (or car's) line of sight?

I agree "not much would change" overnight. It's just another input for the car's software to have at its disposal.

With the current hardware on the roads, I don't think it's technically possible for autos to achieve legitimate self-driving (if that's even the goal anymore?) - there are way too many edge cases that are way too difficult to solve for with just software.


pedestrians, cyclists, skateboarders, and all the other road users that the US car-centric society has determined are "hazards" to driving.


And what happens if there is a child on the road? Or are we going to need implanted transmitter chips in the future, so we can safely go outside and not get run over by "smart" cars?

Even if every car is required to be part of the network, there may be badly maintained cars that don’t work properly, or even malicious cars, that send wrong data on purpose.


It would create a better model, but this is not necessary. Cars are already "networked" through things like turn signals and brake lights.


Something more is necessary if "self-driving" is going to actually live up to its name at some point in the future, and I don't think the answer is 100% software.

At this point it's all about edge cases. Certain edge cases are impossible to overcome with just software + cameras alone.


Most humans can drive fairly well in a heavy downpour, solely from the brake lights of the car ahead and occasional glimpses of road markings. That's almost equivalent to a very poor sensor suite.


For this to work, either (1) the network has to be reliable, and all cars have to be trustworthy (both from a security and fault tolerance perspective), or (2) the cars have to be safe even when disconnected from the network, such as during an evacuation.

We already know for sure that we can’t solve (1), which means we have to solve (2). Therefore, car-to-car communication is, at best, a value add, not the enabling technology.


> Imagine if Car A could improve its own understanding of the environment using inputs/sensor data from nearby Car B.

You can't rely on this in real time because urban canyons make it hard to get consistent cell signal (for one thing), but you can definitely improve your models on this data once the data's been uploaded to your offline systems, and some SDC companies do this.


A system of this sort could use some local area networking (think infrared, RF, or even lasers) to create an adhoc mesh network. It's how I imagine cars in the future to be networked at least.


I'd suggest giving Car Wars by Cory Doctorow a read. https://doctorow.medium.com/car-wars-a01718a27e9e

It involves a situation with networked self driving cars.


A total security nightmare.


Monocular cameras are a strange strawman. Is anyone seriously considering them?

Binocular cameras provide absolute depth information, and are an order of magnitude cheaper sensors than the other options.

Since this technology is clearly computationally limited, you should subtract the budget for the sensors from the budget for the computation.

According to the article, the non-camera sensors are in the $1000’s per car range, so the question becomes whether a camera system with an extra $2000 of custom asic / gpu / tpu compute is safer than a computationally-lighter system with a higher bandwidth sensor feed.

I’m guessing camera systems will be the safest economically-viable option, at least until the compute price drops to under a few hundred dollars.

So, assuming multi-camera setups really are first to market, the question then is whether the exotic sensors will ever be able to justify their cost (vs the safety win from adding more cameras and making the computer smarter).


It is not a strawman. Tesla FSD, in all forms, exclusively uses monocular cameras.

As seen on their website [1], and confirmed numerous times, they have monocular vision around the car and, though having three front-facing cameras, they each have different focal lengths and are located next to each other and thus can not operate as binocular vision.

[1] https://www.tesla.com/en_AE/autopilot/


Wow. I had no idea they were so far behind. I guess it should have been obvious from their miles between disengagement stats.


Tesla is so expensive and they can't install another camera. why...


At the risk of stating the obvious, stereovision in practice has a few interesting challenges. Yes, the main formula is deceptively simple: d = b*f / D (d - depth, D - disparity, b - baseline, f - focal length), but in practice, all 3 terms on the right require some thinking. The most difficult is D - disparity; it usually comes from some sort of feature matching algorithm, whether traditional or ML-based. Such algorithms usually require surfaces to have some texture to work properly, so if the surface does not have "enough" texture (an example would be a gray truck in front of the cameras), then the feature matching will work poorly. In CV research there are other simplifying assumptions being made so that epipolar constraints make the task simpler. Examples of these assumptions are coplanar image planes, epipolar lines being parallel to the line connecting the focal points, and so on. In practice, these assumptions are usually wrong, so you need, for example, to rectify the images, which is an interesting task by itself. Additionally, the baseline b can drift due to changes in temperature and mechanical vibrations. So can the focal length f, which is why automatic camera calibration is required (not trivial).

Don't forget some interesting scenarios like dust particles or mud on one of the cameras (or windshield if cameras are located behind the windshield) or rain beading and distorting the image thus breaking the feature matcher and resulting disparity estimates.

Next, to "see" further, a stereo rig needs to have a decent baseline. For example, in a classic KITTI dataset, the baseline is approximately 0.54m which is much larger than, for example, human eyes (0.065m). Such baseline, 54cm, together with focal length, which, if I remember correctly, is about 720px in case of KITTI vehicle cameras, would give about 388m in the ideal case of being able to detect 1 pixel disparity. But detecting 1px of D is very difficult in practice - don't forget you will be running your algo on a car with limited compute resources. Say, you can have around 5px of D, that means max depth of around 77m - comparable to older Velodyne LiDARs.

Some of the issues I mentioned are not specific to stereovision (e.g. you need to calibrate monocular cameras as well and so on), just wanted to point out that stereovision does not magically enable depth perception. The solution would likely be a combination of monocular and stereo cameras, combined with SfM (Structure from Motion) and depth-from-stereo algorithms.
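
To put numbers on the KITTI example above, here is the same back-of-the-envelope calculation as a snippet (idealized pinhole stereo, ignoring calibration error and matching noise):

    def stereo_depth(baseline_m, focal_px, disparity_px):
        # d = b * f / D, the idealized formula from above
        return baseline_m * focal_px / disparity_px

    print(stereo_depth(0.54, 720, 1))    # ~389 m with 1 px of disparity (unrealistic)
    print(stereo_depth(0.54, 720, 5))    # ~78 m at a more realistic 5 px
    print(stereo_depth(0.065, 720, 1))   # ~47 m for a human-eye baseline at the same focal length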


Isn't binocular information only useful for objects 10m ahead or closer? At least according to Hacker News, the most reliable source of information on the internet: https://news.ycombinator.com/item?id=36182151


This paper suggests that human vision maintains stereopsis much further out than many researchers have thought: “Binocular depth discrimination and estimation beyond interaction space” https://jov.arvojournals.org/article.aspx?articleid=2122030

They measured out to 18m & point out that the typical measured limits of angular resolution of the human eye mean that we could extract stereo image information out to 200m or more.

This paper claims to demonstrate stereopsis out to 250m, which is roughly the limit you’d expect from typical human visual acuity: “Stereoscopic perception of real depths at large distances” https://jov.arvojournals.org/article.aspx?articleid=2191614

This paper suggests that steropsis occurs out to somewhere between 20m & 65m before other cues dominate 3D depth perception: “The Role of Binocular Cues in Human Pilot Landing Control” https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30...

It seems that the claim that stereo vision only occurs in the near-field case is probably wrong? Human stereo vision is much more capable than that, and if it reaches out to significantly more than 20m it is surely being used when driving?


I think binocular depth resolution is roughly proportional to the space between the cameras. A car hood is much wider than a human head. I’m not sure how far you can push that without hitting issues with close up stuff.


Depends on the angular resolution of the sensor as well.


Yeah; proportional, assuming the sensor stays constant and hand waving about the FOV of the lens.


> According to the article, the non-camera sensors are in the $1000’s per car range.... I’m guessing camera systems will be the safest economically-viable option, at least until the compute price drops to under a few hundred dollars.

Human life is worth about $10 million (in the US); that's a bit more than the sensor costs. If one in 10,000 camera-only cars causes a death, then it's not economically viable.

A London bus costs about $300,000, and it is economically viable. So why is a $1,000 sensor a problem? It is definitely viable to install on buses and trucks. Maybe you need to get out of the mindset of personal cars. It is not a viable business model, and it is not a viable model for dealing with congestion either.


If the car fleet kills half as many people as a human with a $1000/car system, and zero people with a $100,000 system, then we should immediately put the $1000 system on 100 times as many cars as we could put the $100,000 system on. (I picked those numbers because that is the order of magnitude range I have heard self driving car companies quote.)

The $100,000 system would only ever make sense if the fleet was already entirely self driving, and money for other life saving stuff like the environment and health care also hit suitable diminishing returns. Of course, by then, the cheap systems will have improved.

This argument holds for any non-negative dollar value you place on human life.

It is also independent of who owns the vehicles. Money the bus fleet spends in expensive self driving pulls money away from bus stop upgrades, pollution controls, etc, etc.


As far as I know, humans can also safely drive with only one eye. It’s perfectly legal in most countries.

But I agree that current software (Tesla?) is not able to do that in the same way. So it may need more sensors until the software gets better.

In theory cameras should also be able to see more than humans. They can have a wider angle, higher contrast, higher resolution and better low-light vision than the human eye.


A human with one eye can use slight head and eye movements to gain a sense of depth. Perhaps the mono cameras need some kind of mount that allows them to not only look around but also move in 3 dimensions. That seems more complex than just having binocular cameras, though.


Yup! This kind of reconstruction is known as multi-view reconstruction. Though the cameras don't need to have a movable mount, they're already on a car which moves! The car moves and gives them a new "perspective" at every frame. That's how some monocular systems already work. Here's an example of one such system: https://github.com/nianticlabs/manydepth

That said, I think what you're referring to is more extreme perspective shifts in ways the car cannot drive, and you are correct that this would aid in reconstruction. This is how NeRF models do their 3D reconstruction (https://nerfies.github.io/).


like pigeons.

Can't cameras do this by just comparing frame 1 to frame 2?


Compare how?


Minor nitpick

> monocular cameras can only provide relative depth

While the environment awareness is nowhere near as good as two or more cameras would be, if you consider the output over time, you get valuable information about the change rate of the environment, i.e. how fast that big thing is getting bigger, which may indicate one should actuate the brakes.

Of course, I'm with the crowd that answers the question with a "how many can we have?" question. The more, the merrier. And the more types, the better - give me polarized light and lidar, sonar, radar, thermal, and whatever else that can be plugged in the car's brain to make it better aware of what happens (and correctly guess what's going to happen) outside it.


Can you elaborate on your reasoning? I’m shaky on some of logic here.

“Monocular” cameras > no “absolute” depth > less safe

The last leap is not well justified.

Also, cars with vision based driving have multiple cameras. Whats the difference between a “binocular” camera and two “monocular” cameras?

How does a “binocular” camera get better depth information?

Is using multiple cameras to drive sensor fusion?

Why is absolute depth a strict safety win? How do you know how the sensor details translate to the final safety of the full system?

If this is just a handwavey upper bound on safety, how do you know that such a system can’t be safe enough for its design goals?

If humans with only one eye are able to drive, why wouldn’t mono surround vision be at least as good as that?


I am just a hobbyist but I can answer some of these.

> Whats the difference between a “binocular” camera and two “monocular” cameras?

For the camera itself, nothing. They are probably referring to the implementation. You can have two cameras side by side but unless you are using homography to estimate depth from the two images, then your setup is monocular.

> How does a “binocular” camera get better depth information?

A pixel in two images (with known separation) will have a geometric relationship that can be used to extract depth information. This is a lot faster than alternative methods with a single camera and multiple images.

> Is using multiple cameras to drive sensor fusion?

This is really just a question of semantics.

> Why is absolute depth a strict safety win?

Why is it better to have two eyes than one? You can be more certain about what you are seeing.

> If this is just a handwavey upper bound on safety, how do you know that such a system can’t be safe enough for its design goals?

If you had a system with infinite compute you could probably do enough math to calculate absolute depth with 100% certainty. I believe you can already extract absolute depth with something called bundle adjustment-- but it requires multiple images since you are relying on parallax effects. It is also computationally expensive.

> If humans with only one eye are able to drive, why wouldn’t mono surround vision be at least as good as that?

Computers are not humans.


> Why is absolute depth a strict safety win? How do you know how the sensor details translate to the final safety of the full system?

If you can get reliable depth information, the algorithm needed to avoid hitting stationary and slow-moving objects is extremely simple.

Is the stationary object in our path, of nontrivial size, and about to enter our minimum stopping distance? If yes, do we have a swerve planned that will let us safely avoid it? If no, emergency stop.

Because this logic is simple and well defined you can audit the implementation to the high standards applied to things like aircraft autopilot systems.

And it'll work even if the stationary object is something that didn't appear in your training data - you know the algorithm will work the same even if that concrete barrier is painted with some cheery flowers, or if that fire truck is airport yellow instead of the normal red.
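
A toy version of that logic (my own sketch; the thresholds and the stopping-distance model are made up for illustration) really is just a handful of auditable lines:

    def stopping_distance_m(speed_mps, decel_mps2=6.0, latency_s=0.1):
        # reaction latency plus constant-deceleration braking distance
        return speed_mps * latency_s + speed_mps ** 2 / (2 * decel_mps2)

    def react(obstacle, ego_speed_mps, safe_swerve_available):
        """obstacle: dict with measured 'distance_m', 'in_path', 'size_m' fields."""
        if not obstacle["in_path"] or obstacle["size_m"] < 0.2:
            return "continue"
        if obstacle["distance_m"] > 1.5 * stopping_distance_m(ego_speed_mps):
            return "continue"            # not yet inside our (padded) stopping distance
        return "swerve" if safe_swerve_available else "emergency_stop"

    # stationary object 30 m ahead at ~100 km/h with no safe swerve available
    print(react({"distance_m": 30.0, "in_path": True, "size_m": 1.0}, 27.0, False))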

Of course, this relies on the assumption you can get reliable depth information. If your depth sensor gets confused by a cloud of dust while driving in the desert, or gets blinded by the light of the setting sun, or is unable to detect a barbed wire fence, things are no longer quite so simple....

> If this is just a handwavey upper bound on safety, how do you know that such a system can’t be safe enough for its design goals?

Personally I would say that in freeway driving, a self-driving car should be able to avoid 100% of collisions with clearly visible stationary objects in dry, well lit conditions when all system components are in normal working condition.


"But humans can do it with one eye closed and"

... and I want to grab the guy who says that by the collar and scream in their face "The whole point is to build something that can do better than a human."


> People argue that humans don't really use our stereoscopic depth past ~10m or so and that's a fair point.

The second paper I reference in this comment https://news.ycombinator.com/edit?id=36232198 claims that humans can maintain stereopsis out to 250m.

That’s a huge difference from 10m & if true suggests that human drivers might well use 3D vision when driving.


I think I said this already in one of my comments but I'm not a neuroscientist and I don't claim to be. That's why I think it's kind of pointless and silly for me (or any other engineer) to sit here and make arguments about what humans do and don't do in their brains.

IMO it's better for us to focus on what the robots can and cannot do right now, and focus on solving those problems :)

Thanks for the sources though, those papers are definitely neat and I'll be taking a look when I get a chance.


I am implementing Monocular vSLAM as a side project right now. I am working with some optimization libraries like GTSAM but having some issues. Do you know any good resources for troubleshooting this kind of stuff?

It's pretty easy to see, even as someone with very little experience, the benefits of stereo vision over monocular. In addition to the depth stuff it's a lot easier/faster to create your point clouds from disparity maps.


Adding in the car speed and direction information to the monocular camera images gets you an absolute/measured understanding of the world.


> You can guess the absolute depth with your neural net but the estimates are pretty garbage.

I'm not sure what kind of systems you're referring to with "monocular cameras", but if you look at the visualization in a Tesla with FSD Beta, it's actually really good at detecting the position of everything. And that's with pretty bad cameras and not a lot of compute.

Only rarely will you see Tesla's FSD mess up because of perception; the vast majority of times they mess up, it's just the software being dumb with planning.


Let’s say you are driving down the street in a suburban neighborhood. You see a kid throw a ball into the street. You see from how his body moved that it is a lightweight ball and that it doesn’t require drastic (or any) measures to avoid. Or you see that it is a very heavy object and requires evasive maneuvers.

How exactly does a certain type of sensor help with this? Isn’t the problem entirely based on a software model of the world?


> Setting aside comparisons to humans for a second (will get back to this), monocular cameras can only provide relative depth. You can guess the absolute depth with your neural net but the estimates are pretty garbage.

Stereoscopic vision in humans only works for nearby objects. The divergence for far away objects is not sufficient for this. You may think you can tell something is 50 or 55 meters away through stereoscopic vision, but you can't. That's your brain estimating based on effectively a single image.

That said reality is not a single image, it's a moving image, a video. Monocular video can still be used to estimate object distance in motion.

Eventually AI will be good enough to work better than humans with just a camera. The problem is we're not there yet, and what Tesla is doing is irresponsible. They should've added LIDAR and used that to train their camera-only models, until they're ready to take over.


You can use the ONNX CPU runtime in Python or C++ too. It doesn’t have to be Rust. And if you want GPU support you can even run models saved in the ONNX format on Nvidia GPUs with the TensorRT runtime.

Honestly, while ggml is super cool, it started as a hobby project and you probably shouldn’t use it in production. ONNX has been the de facto standard for ML inference for years. What it is missing (compared to ggml) is 2-6 bit inference, which is helpful for large scale transformers on edge devices (and is what helped ggml gain adoption so fast).
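
For reference, running an ONNX model on CPU from Python is only a few lines with onnxruntime (the model path, input name and shape below are placeholders that depend on how you exported your model):

    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # example input tensor
    outputs = sess.run(None, {input_name: x})                # None -> return all outputs
    print(outputs[0].shape)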


Intel OpenVINO is also quite punchy for CPU inference.


Yeah I've heard of it but never used it. Looks like they have a backend/runtime for ONNX models as well (https://pypi.org/project/onnxruntime-openvino/) neat!

ONNX really is the universal format. If you can get your model exported to ONNX, running it on various platforms becomes much easier.*

*as long as every hardware platform supports the ops you use in your network and you're not doing anything too fancy/custom :P


Yeah I've only used it with networks in ONNX format (converted from tensorflow or torch). I was looking for high perf low latency / real-time, the C or C++ APIs for OpenVINO are quite OK if you spend some time playing with it. I hope Intel keeps investing on it...

Edit: often if you go through the ONNX intermediate format, be prepared to perform some 'network surgery' to clean up some conversion cruft, but also to remove training-only stuff left in the network...


This is a compelling argument at the surface level (that roads are designed for humans with vision) that quickly breaks down when you examine how Tesla constructs their self-driving system.

Quick disclaimer that this doesn't reflect the views of my employer, nor does any of what I'm saying about self-driving software apply specifically to our system. Rather I am making broad generalizations about robotics systems in general, and about Tesla's system in particular based on their own Autonomy Day presentations.

When you drive on the road as a human, you rely a lot more on intuition and feel than exact measurements. This is exactly the opposite of how a self-driving car works. Modern robotics systems work by detecting every relevant actor in the scene (vehicles, cyclists, pedestrians etc.), measuring their exact size and velocity, predicting their future trajectories, and then making a centimeter level plan of where to move. And they do all of this 10s of times per second. It's this precision that we rely on when we make claims about how AVs are safer drivers than humans. To improve performance in a system like this, you need better more accurate measurements, better predictions and better plans. Every centimeter of accuracy is important.
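
As a cartoon of that loop (purely illustrative, nothing like a production stack): detect actors with measured positions and velocities, roll their trajectories forward, then plan against those predictions, tens of times per second.

    def detect(sensor_frame):
        # stand-in for perception: measured position, size, velocity per actor
        return [{"pos": (12.0, 1.5), "size": (4.5, 1.9), "vel": (8.0, 0.0)}]

    def predict(actors, horizon_s=3.0, dt=0.1):
        # stand-in for prediction: constant-velocity rollout of each actor
        steps = int(horizon_s / dt)
        return [[(a["pos"][0] + a["vel"][0] * i * dt,
                  a["pos"][1] + a["vel"][1] * i * dt) for i in range(steps)]
                for a in actors]

    def plan(ego_state, predicted_trajectories):
        # stand-in for planning: choose controls that keep clear of all predictions
        return {"steer": 0.0, "accel": 0.5}

    ego = {"pos": (0.0, 0.0), "speed": 10.0}
    for tick in range(3):                   # a real stack runs this tens of times per second
        actors = detect(sensor_frame=None)  # placeholder for camera/lidar/radar data
        trajectories = predict(actors)
        command = plan(ego, trajectories)
        print(command)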

By contrast, when you drive as a human it really is as simple as "images in, steering angle out". You just eyeball (pun intended) the rest. At no point in time can you look at the car in the lane next to you and tell its exact dimensions or velocity.

Now perhaps with millions of Nvidia A100s we could try to get to a system that's just "images in, steering angle out" but so far that has proven to be a pipe dream. The best research in the area doesn't even begin to approach the performance that we're able to get with our more classical robotics stack described above, and even Tesla isn't trying to end-to-end learn it all.

That isn't to say it's impossible (obviously, humans do it) but I think one could make a strong argument that "images in, steering angle out" is like epsilon close to just solving the problem of AGI, and perhaps even a million A100s wouldn't cut it ;)


That's not really true. Humans, at critical moments, do make implicit and even explicit plans of movement and follow them. We don't use literal velocity measurements for other objects, true, but in making those plans we do sometimes anticipate their locations at various points in the future, which is really what matters.

The best human drivers do this not at the centimeter, but at the millimeter level. Look at downhill (motor)bike racing, Formula 1, WRC, etc. These drivers can execute millimeter-level maneuvers that are planned well in advance at over 100km/h.


Yeah that's kind of what I was trying to say. You're right in that we predict the actions of others, but we don't do it in the same way. Even when we execute millimeter level maneuvers, we aren't explicitly measuring anything... Like if you were to ask a driver for instructions on how to repeat that maneuver they wouldn't be able to tell you, they just have a "feel" for it.

Basically humans are really really good at guesstimating with great accuracy (but poor reproducibility) and since we don't use basic measurements in the first place, having better measurement accuracy wouldn't really help us be better drivers on average (it does help for certain scenarios like parking though, where knowing the # of inches remaining to an obstacle can be very useful).

But for everyday driving at speed, we wouldn't even be able to process measurements in real time even if someone was providing them to us. AVs are different and that's basically the gist of what I was trying to say. Because they actually do use, rely on, and process measurements in real time, improving their measurement accuracy (ie. switching from camera based approximate depth, to cm level accurate depth from a LiDAR) can have a meaningful impact on the final performance of the system.


Yeah it's shocking to me how many people overlook this. Even if we pretended that the Tesla sensor suite was capable of FSD, it's not FSD if you have to disengage when the lens gets mud on it. Sensor cleaning is an integral part of actually being able to have driverless operation. When I worked at Argo we spent a lot of time making sure that we were designing failsafe methods for detecting and dealing with obstructions (https://www.axios.com/2021/12/15/self-driving-cars-clean-dir...).


So with the disclaimer that these views don't reflect those of my employer, as someone who works in the industry I think this article is basically spot on. The only point I would add is that the top line cost for all these vehicles is quite high right now, so scaling up the service alone isn't really a solution to the profitability problem. I won't get into specifics, but I think this blog post from Cruise summarizes the point pretty well (https://getcruise.com/news/blog/2023/av-compute-deploying-to...). The term "edge supercomputer" really is the best way to describe AV hardware deployment. And that doesn't even cover the sensor suite which is quite costly as well.

So if I was a betting man, I'd say that you can expect Cruise, Waymo & others to scale a little bit now, just to show investors that they can, but for them to save the bulk of the scaling (to hit that targeted figure of $1B/yr of revenue) until after they've found a way to get the costs down. That's going to come in the form of more bespoke vehicles that are better vertically integrated with custom hardware and sensing solutions (like the Cruise Origin).


To some extent Tesla is already going to need to resolve this hardware discrepancy issue. The new HW4 revision vehicles will have 4D imaging Radar and forward mounted side facing cameras (https://twitter.com/greentheonly/status/1625905220432671015?...) to address the blind spot that they've had since the original hardware (https://youtu.be/DlC2tpRocK8). HW3 vehicles cannot be retrofitted with HW4 (https://twitter.com/greentheonly/status/1625905186387505155?...) so this will likely be a large sore spot in the coming months as HW4 vehicles start hitting the roads and people start noticing the discrepancies.

Tesla of course claims that the HW3 cars will still get FSD at some point, but unless they somehow figure out how to bend light, that blind spot will continue to be an issue on older cars.


Tesla really needs to be made to pay billions for their FSD lies.


Turns out that the HW4 model Y doesn't have radar at all: https://www.notateslaapp.com/news/1456/tesla-s-model-y-with-...


That's not what the article you linked says at all. Some HW4 vehicles that are shipping now are shipping without Radar (likely due to supply chain or cost issues). But HW4 was designed with a Radar in mind, it's present in the code, registered with the FCC and there are physical connectors for it on the HW4 compute module. You can see photos of the interior of the Radar in the thread from Green that I linked earlier or in this article (https://www.teslarati.com/tesla-hardware-4-hd-radar-first-lo...)

The fact that some early HW4 units will ship with the updated camera suite but not the Radar only further adds to my point that some users are going to be left with inferior sensing systems despite having paid the same $10k for FSD as everyone else. The whole "radar will be used to train and improve the vision" argument is just nonsense made up by Elon & Tesla fans. A properly functioning radar-camera sensor fusion system will be superior in every way to a camera-only solution. And there will be zero Teslas that actually achieve "full self driving" (i.e. you going to sleep in the back seat and waking up at your destination) until Tesla adds things like a cleaning system to their existing camera solution, for example. The hardware is simply inadequate.


I'm not putting into question the fact that having a radar is significantly superior to vision only, that's obvious.

I'm saying that, as we see, HW4 and radar are two distinct hardware configurations: saying that there will be a large sore spot because HW4 cars have radar is objectively false as not all of them do.


I’m super confused… you’re agreeing that having a radar is superior but you’re disagreeing that it’s going to be a sore spot? How is that possible? Regardless of how you name the configurations (HW4, HW4.5, whatever - the naming is irrelevant), the point is that there will be multiple configurations of sensing suites, some of which will be objectively better than others. That’s the sore spot.

The fact that HW4 and radar are separate configurations by name is not important. HW3 is also included in the mix (you can buy FSD on it) and it has totally different camera placement. So radar aside there’s still a difference in the sensor suite.

People who paid $10k and were promised “FSD” and future hardware upgrades have every right to be pissed off about this.

