Guiding LLMs in Competitive Programming

When does it go wrong?

Introduction

Hi there! My name is Ben, and I’m an author on the paper “Can Language Models Solve Olympiad Programming,” where we explored language model performance on competitive programming questions from the USA Computing Olympiad (USACO).

One interesting result was that for some bronze problems no model solved zero-shot, certain models could reach a correct solution when given pointed advice about where their initial generation went wrong. We coined this setup “human-in-the-loop,” and it showed that models with similar solve rates can exhibit very different plasticity to feedback: with assistance, some could solve problems well beyond their zero-shot capability, while others could not.

These results drew a lot of interest after the paper’s release, but the paper itself doesn’t dive very deeply into them. So in this article, I’ll aim to provide additional context: the set of original trajectories, new trajectories on more difficult problems, and analysis for each problem/trajectory pairing. Hopefully, by sharing failure modes at the individual-problem level, we can gain a deeper understanding of the improvements needed for stronger code reasoning in multi-turn interactions, something that is increasingly relevant as more agentic coding systems are released.

Trajectories can be viewed through the following website. Although the trajectories come preloaded in the examples in this blog, feel free to follow the instructions on the website to converse with the model and generate some trajectories of your own.

The trajectories are organized by difficulty, as well as by whether the problem was released before or after the cutoff date.

I hope this is a helpful read!

Bronze (Precutoff)

This is the set of problems we explored in the original paper release. Due to the sheer volume of trajectories, I’ll just be releasing a small representative subset here. See trajectories below to get a taste:

Problem Link 1: Bronze back_and_forth
Trajectory Link: Bronze back_and_forth

Problem Link 2: Bronze watching_mooloo
Trajectory Link: Bronze watching_mooloo

Problem Link 3: Bronze lemonade_line
Trajectory Link: Bronze lemonade_line

Problem Link 4: Bronze photoshoot_988
Trajectory Link: Bronze photoshoot_988

Problem Link 5: Bronze air_cownditioning
Trajectory Link: Bronze air_cownditioning

Problem Link 6: Bronze livestock_lineup
Trajectory Link: Bronze livestock_lineup

Analysis

Because most questions in this category contained only one or a few points of error, and the ground truth solutions were very simple and linear, most of the problems we tested could be solved with some amount of human intervention: the human helped the models by pointing out the general direction of where they might have gone wrong. In the paper we tested 13 bronze problems in this category, and 11 of the 13 were solved with human intervention in a very short conversation loop.

Bronze (Postcutoff)

In our original paper release, we found that none of the 9 bronze problems released after the training cutoff date could be solved zero-shot. We believe this is due to a combination of small sample size, adversarially constructed problems, and potential reasoning contamination in the pretraining data (which would inflate performance on precutoff problems).

This leaves an interesting question: what are the failure modes for these problems, and how do they differ from problems before the training cutoff? Can we guide models to overcome these shortcomings through verbal feedback, just like the previous bronze problems? We analyze 2 of these problems here.

Problem 1: Palindrome Game (2024)

Nature of Problem: A standard inductive game theory problem. A game is outlined, and we want to determine winning/losing positions for a given input. Normally you would solve this with some sort of DP; in this case, however, working through a few examples reveals that you only need to check whether the number is divisible by 10, leading to a very straightforward implementation.
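For reference, here is a minimal sketch of that check. Note the assumptions: I’m guessing the input is a count of test cases followed by one (possibly very large) number per case, that the first player wins exactly when the number is not divisible by 10, and that the expected output labels are “B”/“E”; none of these details are spelled out in this post.

```python
import sys

def main() -> None:
    data = sys.stdin.read().split()
    t = int(data[0])            # assumed: first token is the number of test cases
    out = []
    for s in data[1:1 + t]:
        # Divisibility by 10 depends only on the last digit, so the number
        # can be arbitrarily large and we never need to convert the string.
        first_player_wins = s[-1] != "0"
        out.append("B" if first_player_wins else "E")   # assumed output labels
    print("\n".join(out))

if __name__ == "__main__":
    main()
```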

Initial Error: Model attempts dynamic programming and exceeds time limit without realizing the heuristic.

Trajectory 1: Attempt 1 Fail (GPT-4o)
Trajectory 2: Attempt 2 Fail (GPT-4-turbo)
Trajectory 3: Attempt 3 Fail (GPT-4o)

Analysis:

Models fail equivalently at this problem, attempting non-functional DP solutions despite DP not being necessary. Even when explicitly instructed to look for a heuristic, the model keeps trying to find a dynamic programming solution, or overgeneralizes from false patterns.

Main error modes:

  1. Small problem-specific reasoning errors, such as not realizing that single-digit numbers are palindromes. Curiously, even when corrected verbally, the model fixes its logic for the next generation but falls back to its original misconceptions later in the conversation loop.
  2. Poor pattern generalization: models continually claim that winning and losing positions alternate even/odd, even when they had just demonstrated via example that they do not.
  3. Generations tend to invoke complex game-theoretic concepts that are completely unnecessary for solving the problem. GPT-4o also seems worse than GPT-4-turbo at following instructions and conversing with the user.
  4. GPT-4o sometimes spouts nonsensical things when given long contexts (see trajectory 3).

Problem 2: Milk Exchange (2024)

Nature of problem: This is a simulation problem, where the model needs to write code that emulates the algorithm described in the statement. However, simulating the algorithm exactly only passes 8/12 test cases. To pass the last 4, you have to realize that minute-by-minute simulation isn’t necessary: it suffices to find the cows that are overflowing and compute all the milk flowing into each overflowing cow.

Initial Error: Model interprets problem incorrectly and assumes that the net transfer per cow is constant, without accounting for overflow.

Trajectory 1: Attempt 1 Fail (GPT-4o)
Trajectory 2: Attempt 2 Fail (GPT-4-turbo)

Analysis:

Both models are actually capable of being tutored up to partial credit, meaning they can overcome the initial misunderstanding of not accounting for milk overflow. However, getting full credit is where the issue lies. When prompted to extend to the full solution, the model always reverts to the idea that it can just multiply the net transfer by the number of minutes, no matter how many times I tell it that it’s wrong: this is also the model’s original solution. It seems that model priors are hard to override.

Overall

When I see a new coding problem, usually before I even finish reading the statement, I already have some idea of what the problem is asking and of the canonical solution paired with it: some sort of reasoning-based retrieval in my head, I suppose. What these two problems have in common is that they contain some element that subverts expectations: for example, a problem that looks like it should be solved with dynamic programming when in actuality it just requires some careful reasoning. These types of problems thus require strong extrapolation over interpolation, a critical ability that postcutoff USACO problems especially test.

Silver (Precutoff)

We now tutor over some silver problems, which represent a step up in difficulty from bronze problems. They generally require some sort of algorithms knowledge, and solution programs are slightly more complex.

Problem 1: Rest Stops

Nature of problem: When I first read this, it sounded like a dynamic programming question, but once you realize that a greedy approach works (it is strictly inferior to stop at a rest stop that occurs before the global max of tastiness rate), the problem is quite simple.
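For concreteness, here is one way to implement that greedy under my reading of the problem; the walking-rate parameters and input representation below are assumptions on my part, not taken from the trajectories. The idea: keep only the rest stops whose tastiness beats every later stop, and at each such stop eat for the entire lead Bessie has built up since the previous one.

```python
def max_tastiness(rF, rB, stops):
    """Greedy sketch for Rest Stops under my reading of the problem.

    rF, rB: seconds per meter for Farmer John and Bessie (rF > rB, so Bessie is faster).
    stops:  list of (position, tastiness_rate) pairs, not necessarily sorted.
    """
    stops = sorted(stops)

    # Keep only the "good" stops: those strictly tastier than every later stop.
    good, best_later = [], 0
    for x, c in reversed(stops):
        if c > best_later:
            good.append((x, c))
            best_later = c
    good.reverse()

    # At each good stop, Bessie eats for the entire lead she has built
    # since leaving the previous good stop.
    total, prev_x = 0, 0
    for x, c in good:
        total += c * (x - prev_x) * (rF - rB)
        prev_x = x
    return total
```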

Initial Error: Sometimes the models get the greedy approach right but fail to understand that the gain should be multiplied by the previous max rather than the current max. Other times, the model gets the approach completely wrong and tries to stop at every single rest stop to eat grass.

Trajectory 1: Attempt 1 Fail (GPT-4o)
Trajectory 2: Attempt 2 Success (GPT-4-turbo)
Trajectory 3: Attempt 3 Fail (GPT-4o)

Analysis

  1. GPT-4o has a tough time responding to simple feedback. Even when its error is confined to a single line and I tell it exactly what is wrong, it generates a very long paragraph walking through the entire solution again: this is probably a reflection of its very, very strong instruction tuning. This is reminiscent of my human tutoring experiences with GPT-3.5-turbo.
  2. On the other hand, GPT-4-turbo can fix its solution given careful dissection of its reasoning and back-and-forth dialogue about it. It seems that GPT-4-turbo has more plasticity in responding to prompts, while GPT-4o is much more rigid, which makes it very hard to dissect its reasoning and converse with it in a human-oriented, effective way.
  3. Attempt 3 is very interesting: GPT-4o is capable of determining where it went wrong, understanding that it cannot multiply by the current tastiness and needs to use the previous one. Despite this, when prompted to generate code that incorporates this fix, the code it produces still contains the earlier mistake.

Side note: This problem is unequivocally much easier than the postcutoff bronze problems tested, suggesting that in the next iteration of a USACO benchmark we may need to look past the pure rank-based difficulty labels.

Problem 2: Loan Repayment

Nature of problem: A binary search + simulation problem. Direct simulation only passes the first few tests: you need to optimize the solution so it does not simulate every single day, giving a roughly sqrt(N)-time check for each candidate value.
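For context, here is a rough sketch of the batched check plus the outer binary search, based on my recollection of the problem rather than on the trajectories: I’m assuming the rule is that each day Bessie pays floor(g / X) gallons of the remaining g, but never less than M, and that we want the largest X that repays N gallons within K days (the article may use different variable names, and while the function name matches the “can_repay” mentioned below, its exact signature here is my own).

```python
def can_repay(N, K, M, X):
    """Return True if parameter X clears N gallons within K days (batched simulation)."""
    g, days = N, 0
    while g > 0:
        p = g // X
        if p < M:
            # From here on the payment is fixed at M per day; count those days in one step.
            days += (g + M - 1) // M
            break
        # The daily payment stays at p until g drops below p * X, so batch those days.
        d = (g - p * X) // p + 1
        days += d
        g -= d * p
    return days <= K

def largest_valid_X(N, K, M):
    # The check is monotone in X (larger X means slower repayment), so binary search.
    lo, hi, best = 1, N, 0     # hi = N is an assumed safe upper bound
    while lo <= hi:
        mid = (lo + hi) // 2
        if can_repay(N, K, M, mid):
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    return best
```

Batching all days that share the same payment value is what turns the per-check cost from one iteration per day into roughly sqrt(N) iterations, which is the optimization described above.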

Mode of error: Sometimes the models have a small bug in the verification function, but they generally fix this pretty easily. The main issue is that the model does not generate code efficient enough to pass the time requirements, as it does not implement the “batching” optimization of simulating multiple days at once.

Trajectory 1: Attempt 1 Success (GPT-4o)
Trajectory 2: Attempt 2 Fail (GPT-4-turbo)

Analysis:

Attempt 1 is a great example of the power of feedback, and of cases where the model only requires minimal steering to arrive at the right answer. At first, it only passes 1 test case due to a small bug in the “can_repay” function, which it fixes when told there is a bug in that function. Then, when told that its solution is not efficient enough, it arrives at a correct fix after several rounds of iteration with not too much oracle intervention (this is quite subjective, of course).

Notably, on my second attempt I try to just directly prompt GPT to generate fixes, without the intermediate step of verifying its reasoning, and it does not work nearly as well! It seems, at least with the examples I’m trying, that making sure the model has a correct plan in its prior context is pretty important!

Gold? Platinum? And Beyond?

USACO generally presents an extremely challenging task for models: the best model, GPT-4o, only achieves an 18.8% solve rate. The majority of solved problems remain in the “Bronze” suite of questions, with gold and platinum problems remaining a frontier for future models to overcome.

More interestingly, there seems to be a qualitative difference between tutoring on problems released before versus after the cutoff date: models have a harder time incorporating feedback on postcutoff problems. This leads me to hypothesize that feedback incorporation is a function of both a model’s inherent reasoning capabilities and its ability to map feedback onto previously seen tasks. Reshaping models so that feedback can override model priors will be a key necessity down the line.

I would love to do many more of these trajectories at harder difficulty levels such as Gold and Platinum, as they contain lots of interesting insights into potential areas of improvement. Nothing beats actually going through problems to learn about the capabilities of LLMs in code generation. However, these are truly time consuming for one person to do: I’ll add more trajectories to this article as I find the time. If you are interested in helping me catalog error modes, please reach out!