vov_or's comments | Hacker News

These guys trained a multi-modal chatbot with visual and language instructions, based on the open-source multi-modal model OpenFlamingo!

Paper link: https://arxiv.org/abs/2305.04790


Hi! The MSCOCO and Flickr datasets are the standard benchmarks for image retrieval. The results published in most papers (including CLIP) are based on them, so we used exactly these datasets for evaluation.
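In case it helps, the metric behind those numbers is just Recall@K over an embedding similarity matrix. A minimal sketch, assuming you already have L2-normalized caption and image embeddings aligned by index (a simplification of the MSCOCO/Flickr30k test setup):

```python
import numpy as np

def recall_at_k(text_emb: np.ndarray, image_emb: np.ndarray, k: int) -> float:
    """Text-to-image Recall@K for index-aligned caption/image pairs."""
    # With L2-normalized embeddings, cosine similarity is a plain dot product.
    sims = text_emb @ image_emb.T                        # (n_captions, n_images)
    # Rank images for every caption; the true match shares the caption's index.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == np.arange(len(text_emb))[:, None]).any(axis=1)
    return float(hits.mean())

# R@1, R@5, and R@10 are the figures usually reported in retrieval papers.
```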


More efficient - for sure!


The datasets we used are already pretty clean compared with LAION. But we also filtered out images with captions printed on them and filtered by CLIP scores. Btw, huge thanks to the LAION and OpenCLIP projects! They inspire us a lot.
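For anyone curious what CLIP-score filtering roughly looks like, here is a sketch (not our exact pipeline), assuming the open_clip package; the checkpoint name and threshold are illustrative:

```python
import torch
import open_clip
from PIL import Image

# Illustrative checkpoint; any CLIP-style model works for scoring.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between an image and its caption."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    img_emb = torch.nn.functional.normalize(model.encode_image(image), dim=-1)
    txt_emb = torch.nn.functional.normalize(model.encode_text(text), dim=-1)
    return (img_emb @ txt_emb.T).item()

# keep = [p for p in pairs if clip_score(p.image, p.caption) > 0.25]  # threshold is a guess
```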


Hi! Just added the Apache 2.0 license to the HF model cards. Thanks!


Are the pretraining and training pipelines available anywhere under a FOSS license? I'd love to take a swing at training a mid-fusion model on data other than text and images (e.g., sound, neuron spike trains, etc.)


Thanks! Seems like a typo. It will be fixed soon.


143M - English
206M - Multilingual


There is a difference not only in the data source but in the pre-training tasks as well. But you are right: models fine-tuned on human-annotated data are way better at image retrieval than zero-shot (just pre-trained) ones. This holds for CLIP, ALBEF, ViCHA, and UForm.


Any plans to document how to fine tune your models then?


It will take some time, but yes, we have this in our plans.


Yes, it is possible. The approaches our model is based on are capable of solving VQA and other similar tasks with SOTA results.


Do you know anyone working on a large text completion model based on it?


Do you have/plan to have a text embeddings model?


Yes, we are training text embedding models right now, and we also have plans to open-source some of them! In addition, we train encoders for other modalities for retrieval purposes, for example, video.


Hi! I am one of the contributors! We focused on image retrieval only. Almost all semantic search engines for images are based on CLIP today. We are also building a semantic multimodal search engine as a DBMS component, which is why image retrieval is so crucial for us, as is inference performance. Also, for semantic segmentation and detection you probably use only the image encoder part of CLIP.
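To illustrate what such a search engine boils down to, here is a sketch assuming the faiss library (a real DBMS component would differ in indexing and storage): images are embedded once offline, the text query is embedded at search time, and nearest neighbors are ranked by cosine similarity.

```python
import faiss
import numpy as np

# image_embs: (n_images, d) float32, L2-normalized image embeddings
# query_emb:  (d,)          float32, L2-normalized text embedding of the query

def build_index(image_embs: np.ndarray) -> faiss.IndexFlatIP:
    # Inner product equals cosine similarity for normalized vectors.
    index = faiss.IndexFlatIP(image_embs.shape[1])
    index.add(image_embs)
    return index

def search(index: faiss.IndexFlatIP, query_emb: np.ndarray, top_k: int = 10):
    scores, ids = index.search(query_emb[None, :], top_k)
    return ids[0], scores[0]  # positions in the image collection and their scores
```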


I think it’s fine that you’re focused on retrieval, but you should add that as a caveat to your results: 100 times better at retrieval. As an ML researcher in grad school, here is the use case that covers >80% of the CLIP usage I’ve seen: take a random image and a random set of text (it can just be categories separated by commas), and CLIP will find the text that best matches your image. CLIP is also incredibly robust at this; you can literally take an image with your phone and it will give you reasonable results. If you speed such a model up by 100x in inference or training, that would be a huge deal to the entire ML research community, and you can expect some best paper awards (maybe even VC capital, looking at Stable Diffusion) to come your way.
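For reference, that image-to-text matching step is only a few lines. A minimal sketch, assuming the open_clip package; the checkpoint name, image path, and categories are illustrative:

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Any free-form labels work; comma-separated categories are enough.
categories = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokenizer(categories))
    img_emb /= img_emb.norm(dim=-1, keepdim=True)
    txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
    # Softmax over image-text similarities gives per-category scores.
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)

print(categories[probs.argmax().item()], probs.max().item())
```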


Hi! You are right that we should have clarified the "100 times better at retrieval" claim. Btw, we have plans to fine-tune the models, evaluate them, and publish results on other tasks (zero-shot ImageNet classification, etc.).

