I am writing some Python code to do Order Flow Imbalance (OFI) analysis from L2 order book updates. The language is unimportant: the logic is subtle enough that the main difficulties are not in the language details but in the logic and in handling edge cases.
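For context, the core of what I'm computing is roughly the standard best-quote OFI increment in the Cont-Kukanov-Stoikov form. A minimal sketch (the names are my own, and it deliberately ignores the multi-level and combined updates where the real subtlety lives):

```python
from dataclasses import dataclass

@dataclass
class Quote:
    bid_px: float
    bid_qty: float
    ask_px: float
    ask_qty: float

def ofi_increment(prev: Quote, curr: Quote) -> float:
    """OFI contribution of one best-quote update (Cont-Kukanov-Stoikov form)."""
    e = 0.0
    # Bid side: a non-decreasing bid adds the new size,
    # a non-increasing bid removes the old size.
    if curr.bid_px >= prev.bid_px:
        e += curr.bid_qty
    if curr.bid_px <= prev.bid_px:
        e -= prev.bid_qty
    # Ask side: mirror image with opposite signs.
    if curr.ask_px <= prev.ask_px:
        e -= curr.ask_qty
    if curr.ask_px >= prev.ask_px:
        e += prev.ask_qty
    return e
```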
Initially I was using Claude 3.5 Sonnet, writing unit tests and manually correcting Sonnet's code. Sonnet's code mostly worked, except that it failed on certain complicated combined book updates.
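As a hypothetical example of the kind of test I mean (using the sketch names above, not my actual code), a combined update that moves both sides of the book in one L2 message:

```python
def test_combined_bid_improve_ask_pull():
    prev = Quote(bid_px=99.0, bid_qty=10.0, ask_px=100.0, ask_qty=5.0)
    # One message: bid improves to 99.5 with size 4,
    # ask pulls back to 100.5 with size 7.
    curr = Quote(bid_px=99.5, bid_qty=4.0, ask_px=100.5, ask_qty=7.0)
    # Bid improved: +4 (new bid size). Ask retreated: +5 (old ask size).
    assert ofi_increment(prev, curr) == 9.0
```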
Then I fed the code and the tests into DeepSeek. It turned out pretty bad.
At first it tried to make the test results conform to the erroneous results of the code. When I pointed that out, it fixed the immediate logical problem in the code, but in doing so corrupted the existing code and introduced two more nested problems that were not there before. When I prompted it about that, it fixed the first error it had introduced but left the second one. Then I fixed it myself, uploaded the fix, and asked it to summarize what it had done. It started basically gaslighting me, saying that the initial code had the problem that it itself had introduced.
In summary, I lost two days, reverted everything and went back to Sonnet.
Using a local 7B for chatting, I saw that it tries very hard to check itself for inconsistencies, and that this may spill over into also checking for the user's "inconsistencies".
Maybe it's better to carefully control and explain the conversation's progression. Selectively removing old prompts (adapting where necessary), which also reduces the context, means it doesn't have to "bother" checking for inconsistencies internal to irrelevant parts of the conversation.
E.g., when asked to extract Q&A pairs from a line of text and format them as JSON, which should be straightforward, it would sometimes start wondering about the contents of the Q&A itself, checking for inconsistencies, e.g.:
- I need to be careful not to output content that's factually incorrect. Wait, but I'm not sure about this answer I'm dealing with here...
- Before the questions were about mountains and now it's about rivers, what's up with that?
- etc.
I had to strongly demand that it treat it all as jumbled, verbatim text and never think about its meaning. So it should be more effective if I always branched from the starting prompt when entering a new Q&A item for it to work on. That is what I meant by "selectively removing old prompts".
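A minimal sketch of that branching, assuming the local model sits behind an OpenAI-compatible endpoint (as llama.cpp or Ollama expose; the model name and prompt wording are placeholders). Every new item gets just the fixed instruction plus the raw text, never the accumulated history:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

BASE_PROMPT = (
    "Treat the text below as jumbled, verbatim data. Extract each Q&A pair "
    "and output it as JSON. Never reason about whether the content is correct."
)

def format_qa(raw_text: str) -> str:
    # Branch from the starting prompt: two messages, no prior turns.
    response = client.chat.completions.create(
        model="local-7b",  # placeholder model name
        messages=[
            {"role": "system", "content": BASE_PROMPT},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content
```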
What workflow were you using to feed it code? Was it Cline? Cline has major prompting issues with DeepSeek; DeepSeek really doesn't like you swapping out its prompt for what normal LLMs are using.