The sheer amount of data being generated seems overwhelming. For example, these researchers created a new family tree of just the grass species (very important agriculturally and industrially, of course):
> "The research team generated transcriptomes — DNA sequences of all of the genes expressed by an organism — for 342 grass species and whole-genome sequences for seven additional species."
https://www.psu.edu/news/eberly-college-science/story/new-mo...
This does allow analysis of very complex but desirable traits like drought tolerance, which involve a great many genes, but sorting through these huge volumes of data to ascertain which gene variants matter most is challenging at best.
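To give a sense of the scale, here is a minimal sketch (entirely made-up numbers, and a naive single-gene correlation scan, not anything the study actually did): a few hundred species against tens of thousands of genes, where the simple per-gene ranking below is exactly the kind of shortcut that breaks down for polygenic traits like drought tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_species, n_genes = 342, 30_000                      # scale comparable to the grass dataset
expression = rng.normal(size=(n_species, n_genes))    # hypothetical expression matrix
drought_score = rng.normal(size=n_species)            # hypothetical trait measurements

# Standardize, then compute every gene's Pearson correlation with the trait at once.
X = (expression - expression.mean(axis=0)) / expression.std(axis=0)
y = (drought_score - drought_score.mean()) / drought_score.std()
r = X.T @ y / n_species

# Rank genes by correlation strength; real traits involve many interacting genes,
# so a single-gene scan like this misses most of the signal.
top = np.argsort(-np.abs(r))[:20]
print("top candidate genes by |correlation|:", top)
```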
Everyone always calls it "next word prediction," but that's also a simplification.
If you go back to the original Transformer paper, the goal was to translate a document from one language to another. When a model is generating new tokens in real time it only sees "past" tokens, but in that original encoder-decoder setup the model can use both backward and forward context over the source text to determine the translation.
Just saying that the architecture, how a model is trained, and how it outputs tokens are less limited than you might think.
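To make that concrete, here is a small numpy sketch (my own illustration, not code from the paper): encoder-style attention lets every position look at the whole sequence, while a decoder-style causal mask restricts each position to "past" tokens only.

```python
import numpy as np

def attention_weights(scores, causal=False):
    # scores: (seq_len, seq_len) raw attention logits
    if causal:
        # Block attention to "future" positions, as in next-token generation.
        keep = np.tril(np.ones_like(scores, dtype=bool))
        scores = np.where(keep, scores, -np.inf)
    # Row-wise softmax.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 4))
print("encoder-style (bidirectional):\n", attention_weights(scores).round(2))
print("decoder-style (causal):\n", attention_weights(scores, causal=True).round(2))
```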
> "The research team generated transcriptomes — DNA sequences of all of the genes expressed by an organism — for 342 grass species and whole-genome sequences for seven additional species."
https://www.psu.edu/news/eberly-college-science/story/new-mo...
This does allow analysis of very complex but desirable traits like drought tolerance, which involve a great many genes, but sorting through these huge volumes of data to ascertain which gene variants are the most important is challenging at best.