I wrote a USB host driver for the STM32F4 a couple of months ago, and the hardest part was the lack of good documentation: the peripheral had r/w registers marked as read-only in the SVD, there was a bit you had to set that wasn't mentioned anywhere in the documentation, and on top of that the USB bulk-only transport spec points to a SCSI spec that doesn't exist. I ended up finding a blog post stating that you basically have to copy what Windows does in order for USB drives to work.
Luckily I had an oscilloscope that could decode USB frames, which saved me a whole bunch of time figuring out why things weren't working.
USB seems to depend on a whole lot of tribal knowledge, which makes it impressive that it is so ubiquitous and works out of the box for the most part.
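For anyone wondering what the bulk-only transport part looks like on the wire, here is a rough sketch of the 31-byte Command Block Wrapper that carries a SCSI command to the drive. The field layout is from the Bulk-Only Transport spec; the helper names (`build_cbw`, `scsi_inquiry_cdb`) are made up for illustration, and this is not the exact sequence Windows sends, just how one command gets wrapped.

```python
import struct

CBW_SIGNATURE = 0x43425355  # reads as "USBC" when stored little-endian
CBW_FLAG_DATA_IN = 0x80     # bit 7 set: data phase is device-to-host


def build_cbw(tag, data_len, direction_in, lun, command_block):
    """Wrap a SCSI command block in a 31-byte CBW (hypothetical helper)."""
    if not 1 <= len(command_block) <= 16:
        raise ValueError("command block must be 1..16 bytes")
    flags = CBW_FLAG_DATA_IN if direction_in else 0x00
    # Signature, tag, transfer length (all 32-bit little-endian),
    # then flags, LUN, and command block length (one byte each).
    header = struct.pack(
        "<IIIBBB",
        CBW_SIGNATURE, tag, data_len, flags, lun & 0x0F, len(command_block),
    )
    # The command block field is always 16 bytes, zero-padded.
    return header + bytes(command_block).ljust(16, b"\x00")


def scsi_inquiry_cdb(alloc_len=36):
    """6-byte SCSI INQUIRY command (opcode 0x12), standard inquiry data."""
    # opcode, EVPD, page code, allocation length (big-endian), control
    return struct.pack(">BBBHB", 0x12, 0, 0, alloc_len, 0)


cbw = build_cbw(tag=1, data_len=36, direction_in=True, lun=0,
                command_block=scsi_inquiry_cdb())
print(cbw.hex(" "))
```

The host sends this on the bulk OUT endpoint, reads the data on bulk IN, and then reads a 13-byte status wrapper back. Note the mixed endianness: the wrapper fields are little-endian like the rest of USB, while the SCSI command inside is big-endian.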
Not OP, but for USB "full speed" devices (12 Mbps), you can use a Saleae logic analyzer clone ($12) with the open-source PulseView, which can decode USB frames and streams rather well.
Even with faster devices, you can usually force them down to 12 Mbps with a USB 1.1 hub for analysis and bugfixing of the driver/firmware, and then have the exact same code work fast without the hub.
On desktop, Wireshark can also monitor USB data transfers if you want a software-only approach.
In a way that makes it worse because while you might get it "working" sooner, you'll never be sure if it's because you got it right, or if you're skating across the ice of a "quirks mode" that you hope will work in the next silicon rev.
Then again, if you don't need to follow the spec for your device to work, maybe the spec is overly restrictive and a Vernacular Protocol is more effective.