Advanced Prompting for Reasoning & Coding
I recently took the ‘Advanced LLM Agents’ class from UC Berkeley. The slides and recordings are shared with the public if you’re interested.
It was quite incredible to learn what the latest research is now able to accomplish, even earning a Bronze Medal at the 2024 International Mathematical Olympiad (which requires real skills, beyond ‘just applying formulas’). To give you an idea, the geometry problem was solved in a few seconds…
The techniques presented in this class make it possible to build smart systems that actually work in the real world.
The first lesson, ‘Inference-Time Techniques for LLM Reasoning’, introduces efficient prompting techniques to improve results. Research has shown that with more inference-time compute, accuracy can jump from <25% to 85% (o3), but it implies making the model go through a Chain-of-Thought (CoT) process, i.e. a series of hidden thoughts, each consuming inference time, to mimic a human’s thinking process.
There are four techniques to trigger CoT (a minimal prompt sketch for the first two follows the list):
• few-shot prompting (which the lecture focuses on)
• zero-shot prompting: elicit CoT with an explicit instruction (“let’s think step by step”)
• instruction fine-tuning
• Reinforcement Learning (RL)
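For concreteness, here is a minimal sketch of how the first two triggers look as prompts. The wording and the toy question are illustrative, not taken from the lecture:

```python
# Minimal sketch of the first two CoT triggers; the question and wording are illustrative.
question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Zero-shot CoT: a single instruction elicits the hidden reasoning steps.
zero_shot_cot = f"{question}\n\nLet's think step by step."

# Few-shot CoT: hand-written worked examples demonstrate the reasoning format.
few_shot_cot = (
    "Q: A car travels 60 km in 2 hours. What is its average speed?\n"
    "A: Speed = distance / time = 60 / 2 = 30 km/h. The answer is 30 km/h.\n\n"
    f"Q: {question}\nA:"
)
```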
Research has shown that CoT performance scales with model size, and that zero-shot CoT outperforms plain zero-shot prompting but still lags behind few-shot prompting. How can we get the best of both worlds?
Analogical prompting
This technique seems counter-intuitive because it asks the LLM to come up with the examples on its own. But in reality, it is similar to how humans use past experiences to solve new problems. And LLMs have so much knowledge that, to me, hand-picked few-shot examples can actually restrict the model from thinking freely. This is especially true with maths…
You will find examples in the slides. The figure below shows that analogical prompting beats few-shot learning by a significant margin. More performance, less work!
Another way to explain the phenomenon is that self-generated examples are more tailored to the LLM. I would add that, for a company application, it may require fine-tuning the model on domain knowledge so that the self-generated examples are also relevant.
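To make the recipe concrete, here is a rough sketch of an analogical prompt. The template paraphrases the general idea rather than reproducing the paper’s exact wording, and the geometry problem is just an illustration:

```python
# Analogical prompting: instead of hand-writing few-shot examples, the model is
# asked to recall relevant problems on its own before solving the new one.
problem = "Find the area of the square whose vertices are (0,0), (3,4), (7,1) and (4,-3)."

analogical_prompt = (
    f"Problem: {problem}\n\n"
    "Instructions:\n"
    "1. Recall three relevant and distinct problems, and describe how each one is solved.\n"
    "2. Then solve the original problem step by step.\n"
)
# The prompt is then sent to the model exactly like a few-shot prompt would be,
# except that the "examples" are generated by the model itself.
```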
But why stop there? In “Large Language Models as Optimizers” [1], researchers showed that LLMs can work as optimizers, generating intermediate (‘meta’) prompts and evaluating them at every step to improve the prompt’s final outcome. I refer you to the paper for the details, but, surprisingly, the best prompt was
“Take a deep breath and work on this problem step-by-step”
It beats the traditional “let’s think step by step” by 8%.
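In sketch form, the optimization loop looks like this. The `llm` and `score` functions are placeholders you would supply (a model client and an accuracy measure on a small training set); the loop structure follows the idea of [1], not its exact implementation:

```python
# OPRO-style loop: the LLM proposes new instruction candidates, each candidate is
# scored on a small training set, and the best-scoring prompts are fed back into
# the meta-prompt. `llm` and `score` are placeholders.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def score(instruction: str) -> float:
    raise NotImplementedError("e.g. accuracy of the instruction on held-out problems")

seed = "Let's think step by step."
history = [(seed, score(seed))]  # (instruction, score) pairs

for _ in range(20):
    meta_prompt = (
        "Here are previous instructions and their accuracy scores:\n"
        + "\n".join(f"{inst!r}: {s}" for inst, s in sorted(history, key=lambda x: x[1]))
        + "\n\nWrite a new instruction that is different from the ones above and scores higher."
    )
    candidate = llm(meta_prompt)
    history.append((candidate, score(candidate)))

best_instruction = max(history, key=lambda x: x[1])[0]
```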
Dynamic Least-to-Most prompting
Furthermore, it is possible to explicitly prescribe the desired reasoning strategy. Dynamic Least-to-Most prompting [2] involves several steps:
• dynamic problem reduction: decompose more complex problems into more sub-steps
• dynamic example selection: select the most relevant examples from a large pool
• adaptive prompt adjustment: modify the prompt in real time to optimize each sub-task
Essentially, it pushes the idea of structured, step-by-step reasoning further by making the process itself intelligent and adaptable, allowing the LLM to handle a broader range of complex tasks with higher accuracy and efficiency. It achieves 99.7% accuracy with examples being only 0.1% of the training set.
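Here is a minimal sketch of the least-to-most pattern: decompose, then answer the sub-questions in order, feeding each answer back into the context. The `llm` and `select_examples` helpers are placeholders; the “dynamic” part of [2] lies in how decomposition and example selection adapt to each input:

```python
# Least-to-most prompting in miniature: (1) decompose the problem into sub-questions,
# (2) answer them sequentially, reusing earlier answers as context.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def select_examples(question: str, pool: list[str], k: int = 4) -> list[str]:
    raise NotImplementedError("e.g. retrieve the k most similar solved examples")

def least_to_most(question: str, example_pool: list[str]) -> str:
    examples = "\n\n".join(select_examples(question, example_pool))

    # Step 1: dynamic problem reduction
    subquestions = llm(
        f"{examples}\n\nDecompose this problem into simpler sub-questions, one per line:\n{question}"
    ).splitlines()

    # Step 2: solve the sub-questions in order, feeding earlier answers back in
    context = question
    for sub in subquestions:
        answer = llm(f"{examples}\n\n{context}\n\nQ: {sub}\nA:")
        context += f"\nQ: {sub}\nA: {answer}"

    return llm(f"{context}\n\nSo the final answer to the original question is:")
```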
This strategy is well-suited for text-to-code applications [3].
Searching the solution space
So far so good. The previous chapter was essentially about spending more of the token budget on a single solution; this one is about increasing the width of the search to explore the solution space, and the next will be about increasing the depth to improve the final solution.
We should not limit the LLM to generating only one solution per problem. Instead, we should explore multiple branches, allowing the LLM to recover from mistakes made in a single generation, by
• Generating multiple candidate solutions per problem
• Generating multiple potential next reasoning steps at every intermediate thought
DeepMind proposes a solution called ‘Self-Consistency’ [4], which selects the response with the most consistent final answer.
Sampling diverse responses is crucial to self-consistency. There are various techniques, such as beam search, ensemble methods, or asking the LLM to select the final response (a method called Universal Self-Consistency).
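In code, self-consistency boils down to sampling several reasoning paths at a non-zero temperature and taking a majority vote over the extracted final answers. A sketch, with a placeholder sampler and a deliberately naive answer extractor:

```python
from collections import Counter

# Self-consistency [4] in miniature: sample several CoT responses at temperature > 0,
# extract each final answer, and return the most frequent one.
def sample_llm(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError("plug in a temperature-controlled model call here")

def extract_final_answer(response: str) -> str:
    # Naive extraction: assume the response ends with "The answer is ...".
    return response.rsplit("The answer is", 1)[-1].strip(" .")

def self_consistency(prompt: str, n_samples: int = 10) -> str:
    answers = [extract_final_answer(sample_llm(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```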
The next improvement comes from intervening before the final response is generated. Tree-of-Thoughts (ToT) is based on tree search, where the LLM explores the most promising partial solutions (without completing them all).
At each step, the LLM is prompted to (1) generate possible next thinking steps, and (2) evaluate the current states and keep the most promising ones. This makes it possible to explore more candidate solutions, which is why ToT with breadth-first search (BFS) scales better than standard prompting and CoT at a fixed token budget (fewer wasted attempts). More advanced methods use Monte Carlo Tree Search, which introduces randomized exploration of the tree.
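A compressed sketch of ToT with BFS: expand every kept partial solution into a few candidate next thoughts, score them, and keep only the best few at each depth. The `propose` and `evaluate` functions are placeholder LLM calls:

```python
# Tree-of-Thoughts with breadth-first search, compressed. `propose` and `evaluate`
# are placeholders for the two LLM prompts described above.
def propose(state: str, k: int = 3) -> list[str]:
    raise NotImplementedError("LLM call: propose k possible next reasoning steps")

def evaluate(state: str) -> float:
    raise NotImplementedError("LLM call: score how promising this partial solution is")

def tot_bfs(problem: str, depth: int = 4, beam_width: int = 5) -> str:
    frontier = [problem]
    for _ in range(depth):
        # Expand every kept state into several candidate next thoughts.
        candidates = [state + "\n" + step for state in frontier for step in propose(state)]
        # Keep only the most promising partial solutions; the rest are never completed.
        frontier = sorted(candidates, key=evaluate, reverse=True)[:beam_width]
    return frontier[0]
```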
Iterative self-improvement
The previous methods tend to produce the ‘best’ answer, but what if that answer is incorrect? After all, even humans rarely get it right on the first try; they refine their thinking.
One technique, ‘Reflexion and Self-Refine’ [5], makes the LLM refine its output after evaluating it itself. It works even better with external evaluation when available, as shown below.
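The loop itself is short. In sketch form, with `llm` as a placeholder model call and the external check optional:

```python
# Self-refine loop in miniature: draft, critique (by the model itself, or by an
# external checker when one exists), then revise, until the critique finds nothing to fix.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def external_check(answer: str) -> str:
    return ""  # e.g. unit tests or a verifier; empty string means "no external feedback"

def self_refine(task: str, max_rounds: int = 3) -> str:
    answer = llm(task)
    for _ in range(max_rounds):
        # Prefer external feedback when available, otherwise ask the model to critique itself.
        feedback = external_check(answer) or llm(
            f"Task: {task}\nAnswer: {answer}\nCritique this answer. Reply DONE if it is correct."
        )
        if "DONE" in feedback:
            break
        answer = llm(f"Task: {task}\nPrevious answer: {answer}\nFeedback: {feedback}\nRevise the answer.")
    return answer
```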
“Self-debugging” [6] is a natural and more complex workflow for code generation (the ultimate challenge).
As you can see, it uses various types of feedback: unit tests, execution traces, a line-by-line explanation of the code, execution results, and general feedback. The perfect debugger!
Research has shown that self-debugging significantly improves the performance of a coding agent.
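A minimal version of that loop, using unit-test output as the feedback signal. The `llm` call, the `solution.py` / `tests.py` file names and the prompt wording are assumptions for illustration; real systems also feed back execution traces and ask the model to explain its code line by line:

```python
import subprocess

# Self-debugging in miniature: generate code, run the unit tests, and feed the
# failure output back to the model until the tests pass.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def run_tests(code: str, test_file: str = "tests.py") -> subprocess.CompletedProcess:
    with open("solution.py", "w") as f:
        f.write(code)
    return subprocess.run(["python", "-m", "pytest", test_file], capture_output=True, text=True)

def self_debug(task: str, max_rounds: int = 5) -> str:
    code = llm(f"Write Python code for this task:\n{task}")
    for _ in range(max_rounds):
        result = run_tests(code)
        if result.returncode == 0:
            break  # all tests pass
        code = llm(
            f"Task: {task}\nCode:\n{code}\nThe tests failed with:\n{result.stdout}\n"
            "Explain the bug, then return a corrected version of the code."
        )
    return code
```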
Finally, it should be noted that, at a fixed budget, a smaller model allows you to sample more candidate solutions (exactly like in my previous article about fine-tuning).
References
[1] Yang et al., Large Language Models as Optimizers, ICLR 2024
[2] Zhou et al., Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, ICLR 2023
[4] Wang et al., Self-Consistency Improves Chain of Thought Reasoning in Language Models, ICLR 2023
[5] Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS 2023
[6] Chen et al., Teaching Large Language Models to Self-Debug, ICLR 2024