Both VisualSfM and OpenSfM use a similar methodology (incremental structure from motion) to reconstruct 3D scene points and camera positions from photos.
And yes, you can build your own tools (even commercially) with OpenSfM.
Exactly, the convnet is applied to 2D pixels rather than to 3D point clouds. If one has labelled 3D points (e.g. whether a point is part of a car or not), one can train a network for recognition directly on the 3D points.
Yes. We can apply different heuristics to remove outlier points. One interesting way to do it would be to train a neural network to refine the dense point cloud.
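To make the heuristic option concrete, here is a minimal sketch of one common approach, statistical outlier removal: drop points whose mean distance to their nearest neighbours is unusually large. This is not something OpenSfM is claimed to do. The function name `remove_outliers` and the parameters `k` and `std_ratio` are my own assumptions for illustration, and the brute-force neighbour search is only suitable for small clouds.

```python
import math
import statistics


def remove_outliers(points, k=3, std_ratio=1.0):
    """Hypothetical helper: keep points whose mean distance to their
    k nearest neighbours is within mean + std_ratio * std of the cloud."""
    if len(points) <= k:
        # Too few points to estimate neighbourhood statistics.
        return list(points)

    mean_knn = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point (O(n^2) overall).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(sum(dists[:k]) / k)

    # Points far above the typical neighbour distance are treated as outliers.
    threshold = statistics.mean(mean_knn) + std_ratio * statistics.pstdev(mean_knn)
    return [p for p, d in zip(points, mean_knn) if d <= threshold]


# Usage: a small planar grid plus one stray point far from the rest.
cloud = [(float(x), float(y), 0.0) for x in range(3) for y in range(3)]
cloud.append((50.0, 50.0, 50.0))
kept = remove_outliers(cloud, k=3, std_ratio=1.0)
print(len(cloud), len(kept))
```

For a real dense point cloud you would want a spatial index (a k-d tree, for example) instead of the quadratic loop, and `k` and `std_ratio` would need tuning per dataset.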