I did work on this as part of my thesis at university quite a few years back.
One other optimization would be to process the points in parallel.
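For example, chunking the cloud and handing the chunks to a process pool, just as a sketch; `color_chunk` here stands in for whatever the per-point projection/coloring work actually is:

```python
import numpy as np
from multiprocessing import Pool

def color_chunk(chunk):
    # Placeholder for the actual per-point projection/coloring work.
    return chunk

def process_in_parallel(points, workers=4):
    # Note: needs to run under an `if __name__ == "__main__":` guard on
    # platforms that spawn worker processes.
    chunks = np.array_split(points, workers)
    with Pool(workers) as pool:
        results = pool.map(color_chunk, chunks)
    return np.vstack(results)
```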
Regarding the coloring of each 3D point, it might be feasible not to use one camera image, but a weighted sum of all camera images that can see the same point in the scene. Each pixel color is then weighted with the scalar product of the point's normal and the viewing direction of the camera. This would also account for noise and specular reflections (which can mess up the original color).
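Roughly what I have in mind, as a sketch only; the camera objects and their visibility/color lookups are placeholders for whatever your pipeline already provides:

```python
import numpy as np

def blend_point_color(point, normal, cameras):
    """Blend the color of one 3D point over all cameras that see it.

    Each `cam` is assumed to expose:
      - cam.center          : 3D camera position
      - cam.sees(point)     : visibility test (frustum / occlusion check)
      - cam.color_at(point) : RGB sampled at the point's projection
    These are placeholders, not a real API.
    """
    total_weight = 0.0
    color_sum = np.zeros(3)
    for cam in cameras:
        if not cam.sees(point):
            continue
        view_dir = cam.center - point
        view_dir /= np.linalg.norm(view_dir)
        # Weight by how frontally this camera sees the surface.
        w = max(0.0, float(np.dot(normal, view_dir)))
        color_sum += w * cam.color_at(point)
        total_weight += w
    return color_sum / total_weight if total_weight > 0 else None
```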
Yes, I am working on using numpy to do the projection with matrices so we don't have to loop over each point and project it individually. That should be a big boost.
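Something along these lines (a sketch with a standard pinhole model; `K` is the 3x3 intrinsics and `T` the 4x4 lidar-to-camera extrinsic, both assumed given):

```python
import numpy as np

def project_all(points, K, T):
    """Project an (N, 3) array of lidar points in one shot.

    K: (3, 3) camera intrinsics, T: (4, 4) lidar-to-camera transform.
    Returns (N, 2) pixel coordinates and (N,) depths in the camera frame.
    """
    N = points.shape[0]
    homog = np.hstack([points, np.ones((N, 1))])   # (N, 4) homogeneous points
    P = K @ T[:3, :]                               # 3x4 projection matrix
    proj = homog @ P.T                             # (N, 3)
    depths = proj[:, 2]
    pixels = proj[:, :2] / depths[:, None]         # perspective divide
    return pixels, depths
```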
The way I handle the different camera images is to simply see which one provides a lower depth and use that one, with the idea that if the camera is closer, it would provide better information. But what you are suggesting is pretty interesting. I'm going to try that as well.
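In numpy terms the selection boils down to something like this (illustrative shapes only; it assumes the per-camera depths and sampled colors are already stacked):

```python
import numpy as np

def pick_closest_camera(depths_per_cam, colors_per_cam):
    """depths_per_cam: (num_cams, N), with np.inf where a point is not visible.
    colors_per_cam: (num_cams, N, 3) color sampled for each point per camera.
    Returns the (N, 3) colors taken from the closest camera per point."""
    best_cam = np.argmin(depths_per_cam, axis=0)   # (N,) index of closest camera
    point_idx = np.arange(depths_per_cam.shape[1])
    return colors_per_cam[best_cam, point_idx]
```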
Lidars are pretty powerful, but one big disadvantage of using point clouds for perception is that they are not colored. This makes identifying objects more difficult compared to camera images. However, by combining camera images with lidar data, we can enhance the point cloud by assigning colors to the points based on the corresponding camera image pixels. This makes visualizing and processing the point cloud much easier.
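For concreteness, once the points have been projected into a camera image, the color lookup for that camera is roughly this (a sketch only; nearest-pixel sampling, no occlusion handling):

```python
import numpy as np

def lookup_colors(pixels, depths, image):
    """pixels: (N, 2) projected pixel coordinates, depths: (N,) camera-frame
    depths, image: (H, W, 3) array. Points behind the camera or outside the
    image get NaN instead of a color."""
    H, W = image.shape[:2]
    u = np.round(pixels[:, 0]).astype(int)
    v = np.round(pixels[:, 1]).astype(int)
    valid = (depths > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    colors = np.full((len(pixels), 3), np.nan)
    colors[valid] = image[v[valid], u[valid]]
    return colors, valid
```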
Thanks, been trying to look into AI tools to generate point clouds from photos for a hobby robot. Crazy that a mediocre LIDAR costs more than every other part of the robot combined, maybe times 10.
It's not AI, but it is simple, and you can re-use a point cloud to re-localise against (i.e. once the map has been generated you can just localise rather than having to map at the same time).
Some places use ML to make a more robust descriptor (i.e. the thing that identifies a point in a point cloud), which is mostly practical. I've not yet seen a practical "deep" SLAM pipeline (but I haven't looked recently).
I have been doing something similar using image-to-image translation (XYZ rendered images to the RGB domain). Most of the information is contained in the Z-axis, which gives you the height information and, for example, helps to distinguish grass from building colors. However, I am skeptical about whether the X and Y channels are just noise and how much spatial information they contribute through the conv blocks. Anyone with previous experience on this?
As pointed out in my other comment, using a single image for point coloring is prone to errors due to noise, specular reflection and occlusion. I'd consider using a (normalized) cross-correlation approach with several images.
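For reference, the core of NCC between two patches is only a few lines (just the metric itself, not the multi-view matching around it):

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equally sized image patches.

    Returns a value in [-1, 1]; values near 1 mean the views agree, which
    can be used to reject views hit by specularities or occlusion."""
    a = patch_a.astype(float).ravel()
    b = patch_b.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0
```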
Isn't there some math that crosses over between what lidar is showing and what photogrammetry provides from overlapping photographs, i.e. depth-corrected/adjusted/ground-truthed images?
Mostly true, but the last part is incorrect. There are far more pixels in an image than points in the point cloud area covered by that image, so you get one pixel per point. In addition, there can be multiple points that map to the same pixel.