The sheer amount of data being generated seems overwhelming. For example, these researchers created a new family tree of just the grass species (very important agriculturally and industrially, of course):
> "The research team generated transcriptomes — DNA sequences of all of the genes expressed by an organism — for 342 grass species and whole-genome sequences for seven additional species."
https://www.psu.edu/news/eberly-college-science/story/new-mo...
This does allow analysis of very complex but desirable traits like drought tolerance, which involve a great many genes, but sorting through these huge volumes of data to ascertain which gene variants matter most is challenging at best.
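To give a sense of the scale, here is a minimal sketch (entirely made-up numbers, and a naive single-gene correlation scan, not anything the study actually did): a few hundred species against tens of thousands of genes, where the simple per-gene ranking below is exactly the kind of shortcut that breaks down for polygenic traits like drought tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_species, n_genes = 342, 30_000                      # scale comparable to the grass dataset
expression = rng.normal(size=(n_species, n_genes))    # hypothetical expression matrix
drought_score = rng.normal(size=n_species)            # hypothetical trait measurements

# Standardize, then compute every gene's Pearson correlation with the trait at once.
X = (expression - expression.mean(axis=0)) / expression.std(axis=0)
y = (drought_score - drought_score.mean()) / drought_score.std()
r = X.T @ y / n_species

# Rank genes by correlation strength; real traits involve many interacting genes,
# so a single-gene scan like this misses most of the signal.
top = np.argsort(-np.abs(r))[:20]
print("top candidate genes by |correlation|:", top)
```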
Everyone always calls it "next word prediction," but that's also a simplification.
If you go back to the original Transformer paper, the goal was to translate a document from one language to another. When a model is generating new tokens in real time it only sees "past" tokens, but in that original encoder-decoder setup the model can use both backward and forward context over the source text to determine the translation.
Just saying that the architecture, how a model is trained, and how it outputs tokens are less limited than you might think.
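To make that concrete, here is a small numpy sketch (my own illustration, not code from the paper): encoder-style attention lets every position look at the whole sequence, while a decoder-style causal mask restricts each position to "past" tokens only.

```python
import numpy as np

def attention_weights(scores, causal=False):
    # scores: (seq_len, seq_len) raw attention logits
    if causal:
        # Block attention to "future" positions, as in next-token generation.
        keep = np.tril(np.ones_like(scores, dtype=bool))
        scores = np.where(keep, scores, -np.inf)
    # Row-wise softmax.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 4))
print("encoder-style (bidirectional):\n", attention_weights(scores).round(2))
print("decoder-style (causal):\n", attention_weights(scores, causal=True).round(2))
```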
> "The research team generated transcriptomes — DNA sequences of all of the genes expressed by an organism — for 342 grass species and whole-genome sequences for seven additional species."
https://www.psu.edu/news/eberly-college-science/story/new-mo...
This does allow analysis of very complex but desirable traits like drought tolerance, which involve a great many genes, but sorting through these huge volumes of data to ascertain which gene variants are the most important is challenging at best.