Hacker News | cheeseblubber's comments

Yes, definitely. We were using our own budget, paying out of our own pocket, and these model runs were getting expensive. Claude cost us around 200-300 dollars for an 8-month run, for example. We want to scale it up and get more statistically significant results, but wanted to share something in the interim.

Got it. It is an interesting thing to explore.

Via the API you can turn off web search internally. We provided all the models with their own custom tools that only provided data up to the date of the backtest.

But Grok is internally training on Tweets etc. continuously.

We used the LLMs' APIs and provided custom tools, like a stock ticker tool that only gave stock price information up to the date of the backtest. We did the same for news APIs, technical indicator APIs, etc. It took quite a long time to make sure there wasn't any data leakage. The whole process took us about a month or two to build out.
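For anyone curious what that clamping looks like in practice, here is a minimal, hypothetical sketch of the idea (the class names and data below are made up for illustration, not the actual tool code): every data tool takes the backtest date as a hard upper bound, so the model can never see anything after the simulated "today".

```python
# Hypothetical sketch of the date-clamping idea described above; the class,
# names, and data are made up for illustration, not the actual tool code.
from dataclasses import dataclass
from datetime import date


@dataclass
class PriceBar:
    day: date
    close: float


class ClampedPriceTool:
    """Serves price history only up to the current backtest date."""

    def __init__(self, history: dict[str, list[PriceBar]], backtest_date: date):
        self.history = history
        self.backtest_date = backtest_date

    def get_prices(self, ticker: str, lookback_days: int = 30) -> list[PriceBar]:
        bars = self.history.get(ticker, [])
        # Hard cutoff: drop anything after the simulated "today" to avoid leakage.
        visible = [b for b in bars if b.day <= self.backtest_date]
        return visible[-lookback_days:]


# The same clamp would apply to the news and technical-indicator tools.
tool = ClampedPriceTool(
    {"NVDA": [PriceBar(date(2024, 1, 2), 481.7), PriceBar(date(2024, 6, 3), 1150.0)]},
    backtest_date=date(2024, 3, 1),
)
print(tool.get_prices("NVDA"))  # only the January bar is visible
```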

I have a hunch the Grok model cutoff is not accurate and that it somehow has updated weights. They still call it the same Grok model since the params and size are unchanged, but they are incrementally training it in the background. Of course I don't know this, but it's what I would do in their situation, since ongoing incremental training could be a neat trick to improve their results against competitors, even if marginally. I also wouldn't trust the models to honestly disclose their decision process either.

That said, this is a fascinating area of research, and I do think LLM-driven fundamental investing and trading has a future.


OP here. We realized there are a ton of limitations with backtesting and paper money, but we still wanted to run this experiment and share the results. By no means is this statistically significant evidence of whether or not these models can beat the market in the long term, but we wanted to give everyone a way to see how these models think about and interact with the financial markets.

You should redo this with human controls. By a weird coincidence, I have sufficient free time.

> Grok ended up performing the best while DeepSeek came close to second.

I think you mean "DeepSeek came in a close second".


OK, now it says:

> Grok ended up performing the best while DeepSeek came close second.

"came in a close second" is an idiom that only makes sense word-for-word.


You're not really giving them any money and it's not actually trading.

There's no market impact to any trading decision they make.


Cool experiment.

I have a PhD in capital markets research. It would be even more informative to report abnormal returns (market/factor-adjusted) so we can tell whether the LLMs generated true alpha rather than just loading on tech during a strong market.


What were the risk-adjusted returns? Without knowing that, this is all kind of meaningless. Being high-beta in a rising market doesn't equate to anything brilliant.
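For concreteness, here is a minimal sketch of that adjustment, assuming you have daily return series for a model's portfolio and a market benchmark (the function name and the data are illustrative placeholders, not results from the experiment):

```python
# Illustrative sketch: separate beta exposure from alpha with a CAPM-style
# regression, and report a Sharpe ratio instead of raw return. All data below
# is synthetic placeholder data, not results from the experiment.
import numpy as np


def risk_adjusted_summary(strategy: np.ndarray, market: np.ndarray,
                          daily_rf: float = 0.0) -> dict:
    strat_ex = strategy - daily_rf   # excess returns over the risk-free rate
    mkt_ex = market - daily_rf
    beta, alpha = np.polyfit(mkt_ex, strat_ex, 1)   # slope = beta, intercept = daily alpha
    sharpe = strat_ex.mean() / strat_ex.std() * np.sqrt(252)  # annualised
    return {"beta": beta, "daily_alpha": alpha, "sharpe": sharpe}


# A strategy that is just 1.3x the market plus noise shows beta ~1.3 and
# alpha ~0: higher raw returns in a rising market, but no real edge.
rng = np.random.default_rng(0)
market = rng.normal(0.0005, 0.01, 250)
strategy = 1.3 * market + rng.normal(0.0, 0.002, 250)
print(risk_adjusted_summary(strategy, market))
```

If the estimated alpha is indistinguishable from zero, the headline outperformance is mostly beta exposure in a rising market rather than genuine edge.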

I can almost guarantee you that these models will underperform the market in the long run, because they are simply not designed for this purpose. LLMs are designed to simulate a conversation, not predict forward returns of a time series. What's more, most of the widely disseminated knowledge out there on the topic is effectively worthless, because there is an entire cottage industry of fake trading gurus and grifters, and the LLMs have no ability to separate actual information from the BS.

If you really wanted to do this, you would have to train specialist models - not LLMs - for trading, which is what firms are doing, but those are strictly proprietary.

The only other option would be to train an LLM on actually correct information and then see if it can design the specialist model itself, but most of the information you would need for that purpose is effectively hidden and not found in public sources. It is also entirely possible that these trading firms have already been trying this: using their proprietary knowledge and data to attempt to train a model that can act as a quant researcher.


I think it would be interesting to see how it goes in a scenario where the market declines or where tech companies underperform the rest of the market. In recent history they've outperformed the market and that might bias the choices that the LLMs make - would they continue with these positive biases if they were performing badly?

These are LLMs - next token guessers. They don't think at all and I suggest that you don't try to get rich quick with one!

LLMs are handy tools, but no more. Even Qwen3-30B, heavily quantised, will make a passable effort at translating some Latin to English. It can whip up small games in a single prompt and much more, and with care it can deliver seriously decent results - but so can my drill driver! That model only needs a £500 second-hand GPU, which is impressive to me. Also GPT-OSS, etc.

Yes, you can dive in with the bigger models that need serious hardware, and they seem miraculous. A colleague recently had to "force" Claude to read some manuals until it realised it had made a mistake about something, and frankly I think "it" was only saying it had made a mistake. I must ask said colleague to grab the reasoning and analyse it.


> But wanted to give everyone a way to see how these models think…

Think? What exactly did “it” think about?


You can click into the chart and see the conversation, as well as the reasoning it gave for each trade.

A model can't tell you why it made the decision.

What it can do is inspect the decision it made and make up a reason a human might have given when making that decision.


"Pass the salt? You mean pass the sodium chloride?"

There is a transcript of the entire conversation if you scroll down a little

Made a site that lets LLMs trade stocks so you can watch their performance: https://www.aitradearena.com/


You should allow sorting by most recent in the conversations tab


Would be cool to see the Quetzal website itself internationalized, to see what it looks like in different languages.


We do have it translated into 13 languages right now. Which language is your browser set to, so we can add it?


I was also looking for that


! basically functions like @ in Slack and chat applications. The only difference in Linen is that we give you an @mention so that it doesn't necessarily need to send a push.


We actually tried shipping a desktop client with Tauri and Rust. It helped with initial bundle size, but performance and resource consumption were kind of similar. We also ran into issues with the APIs being a bit limiting. Most likely we'll revisit the desktop client with Tauri, but depending on how mature the API is we may have to resort to Electron.

Our bundle size has actually only gotten smaller since we shipped our first version. We know it will gradually grow, but we try to be disciplined about not using too many external dependencies.


> we try to be disciplined in not using too many external dependencies

Cheers to that. It’s unfortunately an uncommon belief.


Thanks! Linen is secretly email in disguise. We have played around with syncing with email clients which would let you manage all your messages and notifications from one client.


As someone who has officially given up on having any semblance of keeping up with the never-ending fire hose of totally unmanageable Slack threads, channels, DMs, etc., especially since Slack's update last year, I would sell my left kidney to just have one big list of things I need to look at and reply to/dismiss.


That sounds similar to Basecamp tbh.


I've actually REALLY wanted to find a project that would be suitable to give Basecamp a try with.

It DOES feel like it "could" be that holy grail between IM, project management, email, etc.

But I feel like you need a whole team all onboard for it to make sense.

