Hi!
MSCOCO and Flickr are the standard benchmark datasets for image retrieval, and the results published in most papers (including CLIP) are based on them. So we used exactly these datasets for evaluation.
The datasets we used are already fairly clean compared with LAION.
Still, we additionally filtered out images with text rendered on them and filtered pairs by CLIP scores.
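For context, the CLIP-score part of that filtering is roughly this kind of check (a minimal sketch assuming open_clip; the model, threshold, and file names are placeholders, not our exact pipeline):

```python
import torch
import open_clip
from PIL import Image

# Placeholder model/checkpoint choice, not necessarily what we used.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between an image and its caption."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

# Keep only pairs whose similarity clears a (hypothetical) threshold.
KEEP_THRESHOLD = 0.25
pairs = [("cat.jpg", "a cat sitting on a sofa")]  # placeholder data
clean_pairs = [(p, c) for p, c in pairs if clip_score(p, c) >= KEEP_THRESHOLD]
```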
Btw, huge thanks to the LAION and open_clip projects! They inspire us a lot.
Are the pretraining and training pipelines available anywhere under a FOSS license? I'd love to take a swing at training a mid-fusion model on data other than text and images (e.g., sound, neuron spike trains, etc.)
The difference is not only in the data sources but also in the pre-training tasks.
But you are right: models fine-tuned on human-annotated data are way better at image retrieval than zero-shot (just pre-trained) ones.
And that holds for CLIP, ALBEF, ViCHA, and UForm.
Yes, we are training text embedding models right now, and we also have plans to open-source some of them!
In addition, we are training encoders for other modalities for retrieval purposes, for example video.
Hi!
I am one of the contributors!
We were focused on image retrieval only. Almost all semantic search engines for images are based on CLIP today, and we are also building a semantic multimodal search engine as a DBMS component. That is why image retrieval, as well as inference performance, is so crucial for us.
Also, for semantic segmentation and detection, you probably use only the image encoder part of CLIP.
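For reference, here is a minimal sketch of what "only the image encoder" looks like in code. It assumes open_clip; the model name, checkpoint, and image path are placeholders.

```python
import torch
import open_clip
from PIL import Image

# Load a full CLIP model, then keep only the image tower for vision-only tasks.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
image_encoder = model.visual  # the text tower is never used

with torch.no_grad():
    image = preprocess(Image.open("example.jpg")).unsqueeze(0)
    features = image_encoder(image)  # pooled image embedding

print(features.shape)
```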
I think it’s fine that you’re focused on retrieval, but you should add that as a caveat to your results: "100 times better at retrieval."
As an ML researcher in grad school, here’s the use case that covers >80% of the CLIP usage I’ve seen:
1. Take a random image and a random set of text labels (they can just be categories separated by commas). CLIP will find the text that’s the best match to your image. CLIP is also incredibly robust at this: you can literally take an image with your phone and it will give you reasonable results. If you sped such a model up by 100x in inference or training, that would be a huge deal to the entire ML research community, and you could expect some best paper awards (maybe even VC capital, looking at Stable Diffusion) to come your way.
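For concreteness, that workflow is roughly the following (a minimal sketch assuming open_clip; the model, prompt template, category list, and image path are just placeholders):

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Categories could simply come from a comma-separated string.
categories = "dog, cat, bicycle, coffee cup, laptop".split(", ")
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
text = tokenizer([f"a photo of a {c}" for c in categories])

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)

# Pick the category whose text embedding best matches the image.
print("best match:", categories[probs.argmax().item()])
```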
Hi!
You are right that we should have clarified the "100 times better at retrieval" claim.
Btw, we have plans to tune models, evaluate them, and publish results on other tasks (zero-shot ImageNet classification, etc.).
Paper link: https://arxiv.org/abs/2305.04790