New UC Berkeley research maps how AI reasoning strategies affect accuracy, efficiency

As large language models (LLMs) move from research labs into classrooms, offices and engineering workflows, “prompt engineering” has become a central practice for shaping how LLMs reason and respond. A new study led by UC Berkeley researchers goes beyond the general understanding that prompts matter. It shows that the way models are instructed to reason can influence not only accuracy but also efficiency, reliability and cost, sometimes as much as the choice of the underlying AI model.

The research finds that no single reasoning strategy consistently produces the best results across tasks, models or computational budgets. Instead, the effectiveness of widely used prompting approaches depends on the type of problem being solved and the scale of the AI system. The findings offer practical guidance for developers and policymakers seeking to deploy AI tools that are both accurate and efficient.

The study was led by Junyu Guo, a doctoral student in UC Berkeley’s Department of Industrial Engineering and Operations Research, and conducted in collaboration with researchers from UC Berkeley and Virginia Tech. Javad Lavaei, a UC Berkeley professor of industrial engineering and operations research and Guo’s faculty advisor, is a senior author on the paper. The team developed a new benchmark, called StyleBench, to systematically evaluate how different prompting strategies shape a model’s problem-solving behavior.

“People often assume that more elaborate reasoning prompts automatically lead to better answers,” Lavaei said. “Our results show that this is not always true. In many cases, simpler and more concise strategies can be both faster and just as accurate, depending on the task and the model.”

Using StyleBench, the researchers evaluated five common reasoning styles, ranging from step-by-step explanations to approaches that encourage models to explore multiple solution paths or generate compact, symbolic responses. The styles were tested across five categories of reasoning tasks, including math problem solving, logical deduction and commonsense question answering, using 15 open-source language models spanning a wide range of sizes.

The analysis revealed clear trade-offs. Search-based strategies that explore many possible solutions performed well on open-ended problems, such as puzzle solving, but only when paired with very large models and at a significant computational cost. In contrast, more concise reasoning styles often delivered comparable accuracy on well-defined tasks while using far fewer computational resources.

The researchers also observed sharp differences in how models of different sizes behaved. Smaller models frequently produced confident but incorrect answers on difficult problems, even when given detailed prompting instructions. Larger models were more likely to follow instructions and generate coherent reasoning processes across a range of prompting styles.

“These differences matter in practice,” Guo said. “If you are deploying AI systems in resource-constrained environments, such as on edge devices or in classrooms, choosing the wrong reasoning strategy can waste computation without improving results.”

The study also challenges the idea that AI systems can easily be trained to automatically select the best reasoning strategy for a given problem. Experiments designed to teach models how to choose among prompting styles showed that current approaches tend to rely on shallow pattern matching rather than true strategic reasoning.

Taken together, the findings suggest that developers should move away from one-size-fits-all prompting approaches. Instead, the authors argue, reasoning strategies should be selected based on the task, the available computational budget and the capabilities of the model being used.
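
For developers, the selection rule the authors describe can be sketched in a few lines of Python. This is a minimal illustration, not code from StyleBench: the style labels, the accuracy and token figures, the `pick_style` helper and the 0.02 tolerance are all hypothetical, and average tokens generated is assumed here as a rough stand-in for computational cost. Given measurements for one task and one model, the sketch picks the cheapest style whose accuracy is close to the best that fits the budget.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StyleResult:
    style: str         # placeholder label, not an official StyleBench name
    accuracy: float    # fraction of problems answered correctly
    avg_tokens: float  # mean tokens generated per problem (assumed cost proxy)


def pick_style(
    results: Sequence[StyleResult],
    token_budget: float,
    tolerance: float = 0.02,
) -> Optional[StyleResult]:
    """Return the cheapest style within `tolerance` of the best affordable accuracy.

    Returns None when no style fits inside the token budget.
    """
    affordable = [r for r in results if r.avg_tokens <= token_budget]
    if not affordable:
        return None
    best_accuracy = max(r.accuracy for r in affordable)
    near_best = [r for r in affordable if r.accuracy >= best_accuracy - tolerance]
    return min(near_best, key=lambda r: r.avg_tokens)


# Hypothetical measurements for a single task and a single model.
results = [
    StyleResult("step_by_step", accuracy=0.82, avg_tokens=420.0),
    StyleResult("search_based", accuracy=0.83, avg_tokens=2600.0),
    StyleResult("concise", accuracy=0.79, avg_tokens=95.0),
]

choice = pick_style(results, token_budget=3000)
print(choice.style if choice else "no style fits the budget")
```

With these made-up numbers, the search-based style is the most accurate but costs roughly six times as many tokens as the step-by-step style for a one-point gain, so the sketch selects the step-by-step style. Loosening the tolerance or tightening the budget changes the choice, which mirrors the study's point that the right strategy depends on the task, the budget and the model.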

As large language models are increasingly integrated into decision-making and analytical workflows, the researchers say benchmarks like StyleBench can help ensure that these systems are deployed responsibly.

“Efficiency and reliability are becoming just as important as raw performance,” Lavaei said. “Understanding when and how AI systems should reason is a critical step toward making them trustworthy in real-world applications.”