In my app, the best lexical search approaches completely broke my agent. For my RAG system, the LLM took on average 2.1 lexical searches to get the results it needed. That wasn't terrible, but it sometimes needed up to 5 searches to find them, which blew up user latency. Now that I have hybrid semantic + lexical search, it only takes 1.1 searches per result.
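To make that concrete, here's roughly the shape of a hybrid retriever like that. This is a minimal sketch, not my actual code: rank_bm25, sentence-transformers, the model name, and the toy documents are all stand-ins, and the fusion step is plain reciprocal rank fusion.

    import numpy as np
    from rank_bm25 import BM25Okapi                    # any BM25 implementation works
    from sentence_transformers import SentenceTransformer

    docs = [
        "bus stop on Main Street near the river",
        "pedestrian bridge crossing the canal",
        "parking lot behind the old mill",
    ]

    # Lexical side: BM25 over whitespace-tokenized documents.
    bm25 = BM25Okapi([d.lower().split() for d in docs])

    # Semantic side: dense embeddings of the same documents.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = model.encode(docs, normalize_embeddings=True)

    def hybrid_search(query: str, k: int = 3) -> list[str]:
        lex_scores = bm25.get_scores(query.lower().split())
        sem_scores = doc_vecs @ model.encode(query, normalize_embeddings=True)

        # Reciprocal rank fusion: rank each list separately, then merge, so the
        # two score scales never need to be calibrated against each other.
        fused = np.zeros(len(docs))
        for order in (np.argsort(-lex_scores), np.argsort(-sem_scores)):
            for rank, idx in enumerate(order):
                fused[idx] += 1.0 / (60 + rank)        # 60 is the conventional RRF constant
        return [docs[i] for i in np.argsort(-fused)[:k]]

    print(hybrid_search("where can I cross the canal on foot"))

The point is less the exact fusion method than that a lexically weak query still lands near the right documents semantically, which is what drives down searches per result.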
The problem is that you're not using parallel tool calling or returning a search array. We do this across large data sets and don't see much of a problem. It also means you can swap algorithms on the fly: building a BM25 index over a few thousand documents is not very expensive locally, and rg and grep are basically free. If you have information on folder contents, you can let your agent decide at execution time based on the information need.
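As a rough illustration of that setup: a BM25 index built over a folder at startup, next to an rg call, exposed as two tools the agent picks between per call. The folder layout, the *.txt glob, and the rank_bm25 dependency are illustrative assumptions, not a prescribed stack.

    import subprocess
    from pathlib import Path
    from rank_bm25 import BM25Okapi

    def build_bm25(folder: str):
        # Indexing a few thousand text files this way takes well under a second locally.
        paths = sorted(Path(folder).rglob("*.txt"))
        index = BM25Okapi([p.read_text(errors="ignore").lower().split() for p in paths])
        return paths, index

    def bm25_search(paths, index, query: str, k: int = 5) -> list[str]:
        scores = index.get_scores(query.lower().split())
        ranked = sorted(zip(scores, paths), key=lambda pair: -pair[0])
        return [str(p) for _, p in ranked[:k]]

    def grep_search(folder: str, pattern: str, k: int = 5) -> list[str]:
        # ripgrep, filenames only; exact-string queries are essentially free this way.
        out = subprocess.run(["rg", "-l", pattern, folder], capture_output=True, text=True)
        return out.stdout.splitlines()[:k]

    # Both go into the tool list alongside a short description of the folder's
    # contents, and the model chooses one per call based on what it needs.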
Embeddings just aren't the most interesting thing here if you're running a frontier foundation model.
Search arrays help, but parallel tool calling assumes you've solved two hard problems: generating diverse query variations, and verifying which result is correct. Most retrieval doesn't have clean verification. The better approach is making search good enough that you sidestep verification as much as possible, so that ideally the model only has to make a judgment call within its search array. In my case (OpenStreetMap data), lexical recall is unstable, but embeddings usually get it right if you narrow the search space enough, and a missed query is a stronger signal to the model that it's done something wrong.
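Here's the shape of that search-array tool as a hedged sketch: it takes several phrasings of the same information need, merges the hits, and leaves the model a small judgment call rather than a verification problem. The search callable stands in for whatever single-query retriever you already have (the hybrid sketch above would do); nothing here is OSM-specific.

    from typing import Callable

    def search_array(search: Callable[[str], list[str]],
                     queries: list[str], k: int = 5) -> list[str]:
        # Run every query variation and count how many variants surface each hit.
        seen: dict[str, int] = {}
        for q in queries:
            for hit in search(q):
                seen[hit] = seen.get(hit, 0) + 1
        # Hits surfaced by more variants float to the top. An empty list is itself
        # a useful signal: it tells the model its framing of the query was wrong.
        return sorted(seen, key=seen.get, reverse=True)[:k]

    # e.g. search_array(hybrid_search, ["footbridge across the canal",
    #                                   "pedestrian canal crossing",
    #                                   "walkway over the canal"])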
Besides, if you can reliably verify results, you've essentially built an RL harness, which is a lot harder to do than building an effective search system and probably worth more.