I'm not sure people in these comments are reading this paper correctly.
This seems to essentially disprove the whole idea of multi-agent setups like Chain-of-thought and LLM-Debate.
Because this paper introduces an alternative method that simply runs the same query multiple times on the same LLM, without any context shared across queries, then runs a similarity algorithm on the answers and picks the most common one. (Which makes sense to me: if an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will be similar to each other, while the hallucinations will hopefully be chaotic.)
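To make the idea concrete, here's a minimal sketch of that sampling-and-voting loop. The `ask_llm()` client is a hypothetical stand-in for whatever LLM API you use, and I'm using plain string similarity from the standard library; the paper's actual similarity metric may well differ.

```python
# Minimal sketch of sampling-and-voting, not the paper's exact code.
# `ask_llm` is a hypothetical stand-in for a real LLM client.
from difflib import SequenceMatcher

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug your LLM client in here")  # hypothetical

def majority_answer(prompt: str, n_samples: int = 10) -> str:
    # Run the exact same query n_samples times, with no shared context.
    answers = [ask_llm(prompt) for _ in range(n_samples)]

    # Pick the answer most similar to all the others: correct answers
    # should cluster together, hallucinations are (hopefully) scattered.
    def support(candidate: str) -> float:
        return sum(SequenceMatcher(None, candidate, other).ratio()
                   for other in answers)

    return max(answers, key=support)
```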
And this simple algorithm performs just as well as (and sometimes better than) all the other multi-agent algorithms.
This suggests that the other multi-agent schemes, with their clever prompts, aren't really doing anything special: their improved results come mostly from the fact that the LLM is run multiple times, not from the way the prompt asks the LLM to pick the best answer.
I looked into this project when it was first announced. The “in Rust” part seems more aspirational than real. For those who may not know, Knuth originally wrote TeX in a language called WEB, which is basically Pascal plus some preprocessors that make it usable and documentable. Later extensions to TeX, including eTeX, pdfTeX and XeTeX, have also been written in WEB. The existing TeX distributions (TeX Live, MikTeX, etc.), at their core, first translate this WEB/Pascal into (autogenerated and basically unreadable) C, then run it through a C compiler, etc.
What this project has done is take the auto-generated C translation of xetex.web and wrap this (unreadable) C code in Rust — an odd choice, to say the least. It seems (from Reddit comments by the author) that the reason is that the author of this project was, at the time, unaware of LuaTeX, which (starting from a manual, readable translation into C years ago) is actually written in C.
All these odd choices aside, and despite the somewhat misleading “in Rust” description (misleading for another reason too: the TeX/LaTeX ecosystem is mostly TeX macros rather than the core “engine” anyway), this project makes some good user-experience decisions. With a regular TeX distribution you'd achieve these with something like latexmk/rubber/arara, which are likewise wrappers around TeX, much like this project.
There is still room for someone to do a “real” rewrite of TeX (in Rust or whatever), but as someone on TeX.SE said, it is very easy to start a rewrite of TeX; the challenge is to finish it.
Disclaimer: I work on ANITA. Also, I need to go to bed, so I'm writing this really fast and it may not all make sense.
ANITA is a radio telescope attached to a balloon, looking for broadband impulsive radio emission in Antarctica.
The main purpose is to look for Askaryan emission from neutrinos interacting in the ice. Askaryan emission is just the coherent version of the same process (Cherenkov radiation) that produces the flashes of light in IceCube: at long wavelengths you can't resolve the individual charges in a cascade, so you effectively see a fast-moving current density (there's a net negative charge excess because positrons can annihilate with atomic electrons). To detect Askaryan emission, you need a dense dielectric material (if not dense, there's no target mass; if not dielectric, RF won't propagate). Antarctica happens to be both the place where you do long-duration ballooning (due to all-day sunlight and favorable wind patterns that keep you over land) and the place with the most ice.
However, the events discussed here were produced through another channel. ANITA can also see RF emission from cosmic-ray extensive air showers (EAS). That RF emission mostly comes from the separation of charges in the shower by the Earth's magnetic field. Because the magnetic field in Antarctica is approximately vertical, this produces horizontally polarized emission. Because ANITA is so high up (~40 km), EAS development from cosmic rays occurs below the payload, so the most common way for us to observe cosmic-ray EAS's is for the emission (which is very forward-beamed) to bounce off the ice. We can also see atmosphere-skimming showers that miss the ice entirely. As expected, the events that bounce off the ice have a polarity flip compared to the events that miss the ice.
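For a sense of the geometry (my own back-of-the-envelope numbers, not from the ANITA papers), the horizon from float altitude is several hundred kilometers away, which is why shower development and the reflection point are both far below the payload:

```python
import math

# Rough horizon distance from ANITA's ~40 km float altitude.
# Numbers are illustrative, not taken from the ANITA papers.
R_EARTH_KM = 6371.0   # mean Earth radius
ALTITUDE_KM = 40.0    # approximate balloon altitude

# For h << R, distance to the geometric horizon is ~ sqrt(2 * R * h).
horizon_km = math.sqrt(2 * R_EARTH_KM * ALTITUDE_KM)
print(f"horizon ~ {horizon_km:.0f} km away")  # roughly 700 km
```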
The strange events discussed here look like EAS's, but the RF emission clearly points back at the ice and there is no polarity flip from reflection, so they look like very energetic upward-going air showers. There's no good way to explain upward-going air showers in the Standard Model at these energies and observed angles (at lower energies or more grazing angles, tau neutrinos can make it through the Earth and produce taus that decay into upward-going air showers). So either there is something wrong with the measurement (we can't think of anything, but we're trying!), we got really unlucky with anthropogenic backgrounds (we think this is very unlikely), or there might be some new physics.
For this detection channel, there isn't much that's special about Antarctica, just that we're on a balloon looking down, so we can see stuff coming from below. The ice could potentially offer a slight enhancement compared to rock, but that's probably not so important. Other observatories looking for upward-going showers from tau neutrinos (e.g. Pierre Auger) only look at very grazing incidence. There are proposals using fluorescence instead of radio emission (e.g. JEM-EUSO and the SPB-EUSO balloon mission) that could do more or less the same thing.