vov_or's comments | Hacker News

These guys trained a multi-modal chatbot with visual and language instructions, based on the open-source multi-modal model OpenFlamingo!

Paper link: https://arxiv.org/abs/2305.04790


Hi! The MSCOCO and Flickr datasets are the standard benchmarks for image retrieval. The results published in most papers (including CLIP) are based on them, so we used exactly these datasets for evaluation.
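In case it helps, the metric behind those numbers is just Recall@K over an embedding similarity matrix. A minimal sketch, assuming you already have L2-normalized caption and image embeddings aligned by index (a simplification of the MSCOCO/Flickr30k test setup):

```python
import numpy as np

def recall_at_k(text_emb: np.ndarray, image_emb: np.ndarray, k: int) -> float:
    """Text-to-image Recall@K for index-aligned caption/image pairs."""
    # With L2-normalized embeddings, cosine similarity is a plain dot product.
    sims = text_emb @ image_emb.T                        # (n_captions, n_images)
    # Rank images for every caption; the true match shares the caption's index.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == np.arange(len(text_emb))[:, None]).any(axis=1)
    return float(hits.mean())

# R@1, R@5, and R@10 are the figures usually reported in retrieval papers.
```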


More efficient - for sure!


The datasets we used are already pretty clean compared with LAION. But we also filtered out images with captions printed on them and filtered by CLIP scores. Btw, huge thanks to the LAION and OpenCLIP projects! They inspire us a lot.
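For anyone curious what CLIP-score filtering roughly looks like, here is a sketch (not our exact pipeline), assuming the open_clip package; the checkpoint name and threshold are illustrative:

```python
import torch
import open_clip
from PIL import Image

# Illustrative checkpoint; any CLIP-style model works for scoring.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between an image and its caption."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    img_emb = torch.nn.functional.normalize(model.encode_image(image), dim=-1)
    txt_emb = torch.nn.functional.normalize(model.encode_text(text), dim=-1)
    return (img_emb @ txt_emb.T).item()

# keep = [p for p in pairs if clip_score(p.image, p.caption) > 0.25]  # threshold is a guess
```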


Hi! Just added the Apache 2.0 license to the HF model cards. Thanks!


Are the pretraining and training pipelines available anywhere under a FOSS license? I'd love to take a swing at training a mid-fusion model on data other than text and images (e.g., sound, neuron spike trains, etc.)


Thanks! Seems like a typo. It will be fixed soon.


143M - English
206M - Multilingual


There is a difference not only in the data source but in the pre-training tasks as well. But you are right: models fine-tuned on human-annotated data are way better at image retrieval than zero-shot (just pre-trained) ones. This holds for CLIP, ALBEF, ViCHA, and UForm.


Any plans to document how to fine tune your models then?


It will take some time, but yes, we have this in our plans.


Yes, it is possible. The approaches our model is based on are capable of solving VQA and other similar tasks with SOTA results.


Do you know anyone working on a large text completion model based on it?


Do you have/plan to have a text embeddings model?


Yes, we are training text embedding models right now, and we also have plans to open-source some of them! In addition, we train encoders for other modalities for retrieval purposes, for example, video.


Hi! I am one of the contributors! We focused on image retrieval only. Almost all semantic search engines for images are based on CLIP today. We are also building a semantic multimodal search engine as a DBMS component, which is why image retrieval is so crucial for us, as is inference performance. Also, for semantic segmentation and detection you probably use only the image encoder part of CLIP.
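To illustrate what such a search engine boils down to, here is a sketch assuming the faiss library (a real DBMS component would differ in indexing and storage): images are embedded once offline, the text query is embedded at search time, and nearest neighbors are ranked by cosine similarity.

```python
import faiss
import numpy as np

# image_embs: (n_images, d) float32, L2-normalized image embeddings
# query_emb:  (d,)          float32, L2-normalized text embedding of the query

def build_index(image_embs: np.ndarray) -> faiss.IndexFlatIP:
    # Inner product equals cosine similarity for normalized vectors.
    index = faiss.IndexFlatIP(image_embs.shape[1])
    index.add(image_embs)
    return index

def search(index: faiss.IndexFlatIP, query_emb: np.ndarray, top_k: int = 10):
    scores, ids = index.search(query_emb[None, :], top_k)
    return ids[0], scores[0]  # positions in the image collection and their scores
```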


I think it’s fine that you’re focused on retrieval, but you should add that as a caveat to your results: 100 times better at retrieval. As an ML researcher in grad school, here is the use case that covers >80% of the CLIP usage I’ve seen: take a random image and a random set of text (it can just be categories separated by commas), and CLIP will find the text that best matches your image. CLIP is also incredibly robust at this; you can literally take an image with your phone and it will give you reasonable results. If you speed such a model up by 100x in inference or training, that would be a huge deal to the entire ML research community, and you can expect some best paper awards (maybe even VC capital, looking at Stable Diffusion) to come your way.
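For reference, that image-to-text matching step is only a few lines. A minimal sketch, assuming the open_clip package; the checkpoint name, image path, and categories are illustrative:

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Any free-form labels work; comma-separated categories are enough.
categories = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokenizer(categories))
    img_emb /= img_emb.norm(dim=-1, keepdim=True)
    txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
    # Softmax over image-text similarities gives per-category scores.
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)

print(categories[probs.argmax().item()], probs.max().item())
```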


Hi! You are right that we should have clarified the "100 times better at retrieval" claim. Btw, we have plans to fine-tune the models, evaluate them, and publish results on other tasks (zero-shot ImageNet classification, etc.).

