12B is pretty small, so I doubt it’ll be anywhere close to InternVL2. However, Mistral does great work, and this model is likely still useful for on-device tasks.
>Qwen2-VL is the latest addition to the vision-language models in the Qwen series, building upon the capabilities of Qwen-VL. Compared to its predecessor, Qwen2-VL offers:
>State-of-the-Art Image Understanding
>Extended Video Comprehension
Besides, it'd have been pretty silly for them to mention it on their slides if it wasn't.