Large language models lack true reasoning capabilities, researchers argue
Large language models function through sophisticated retrieval rather than genuine reasoning, according to research published across multiple studies in 2024 and 2025.

The debate surrounding artificial intelligence capabilities has intensified following key research findings published in March 2024 and follow-up commentary in April 2025. Subbarao Kambhampati, a professor at Arizona State University and former president of the Association for the Advancement of Artificial Intelligence, has challenged prevailing assumptions about large language models through extensive technical documentation.
According to Kambhampati's research, published as "Can Large Language Models Reason and Plan?" in March 2024, these systems excel at what he terms "universal approximate retrieval" rather than principled reasoning. "LLMs are trained to predict the distribution of the n-th token given n-1 previous tokens," the research states, explaining that current models function as sophisticated n-gram systems trained on web-scale language corpora.
The study examined GPT-3, GPT-3.5, and GPT-4 performance across planning instances derived from International Planning Competition domains, including the well-known Blocks World environment. While GPT-4 achieved 30% empirical accuracy in Blocks World tasks—an improvement over earlier versions—this performance collapsed when researchers obfuscated action and object names. Standard artificial intelligence planners experienced no difficulty with such modifications.
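The obfuscation test can be illustrated with a minimal sketch: renaming actions and objects to arbitrary symbols leaves the logical structure of a Blocks World problem intact, so a symbolic planner is unaffected, while surface-level retrieval loses its familiar cues. The renaming table and plan fragment below are illustrative only, not the study's actual benchmark materials.

```python
# Minimal sketch of the name-obfuscation idea: rename Blocks World actions and
# objects to meaningless symbols. The logical structure is untouched, so a symbolic
# planner solves the renamed problem just as easily, while surface-level pattern
# retrieval loses its familiar cues. The mapping below is hypothetical.
import re

plan_fragment = "(unstack block-a block-b) (put-down block-a) (pick-up block-c) (stack block-c block-a)"

obfuscate = {
    "unstack": "feper", "put-down": "wakil", "pick-up": "zubod", "stack": "dunor",
    "block-a": "obj-17", "block-b": "obj-42", "block-c": "obj-99",
}

pattern = re.compile("|".join(re.escape(name) for name in obfuscate))
obfuscated = pattern.sub(lambda m: obfuscate[m.group(0)], plan_fragment)

print(obfuscated)
# (feper obj-17 obj-42) (wakil obj-17) (zubod obj-99) (dunor obj-99 obj-17)
```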
Technical challenges expose fundamental limitations
Testing methodology revealed critical distinctions between pattern recognition and genuine reasoning capabilities. When researchers reduced the effectiveness of approximate retrieval through name obfuscation, model performance "plummeted precipitously," according to the findings. These results suggest that improved performance stems from enhanced retrieval over larger training corpora rather than actual planning abilities.
Yann LeCun, VP and Chief AI Scientist at Meta, supported this perspective through social media commentary on April 16, 2025. "To invent new knowledge and new artifacts, or simply to deal with new situations for which they have not been explicitly trained, AI systems need to learn mental models of the world," LeCun stated. "Manipulating our mental model of the world is what we call thinking."
The technical analysis differentiates between knowledge acquisition and reasoning application. According to Kambhampati's analysis, many research papers claiming planning abilities actually mistake the extraction of general planning knowledge for the generation of executable plans. When models are evaluated on abstract plans such as "wedding plans," with no requirement that the plans be executable, the distinction is easy for casual observers to miss.
This confusion stems from the fundamental difference between declarative knowledge about planning processes and procedural capability to execute those plans. Large language models excel at retrieving and synthesizing information about planning methodologies, step sequences, and best practices from their training data. However, this knowledge extraction differs significantly from the computational reasoning required to generate executable plans that account for resource constraints, temporal dependencies, and goal interactions. The research demonstrates that when models produce planning outputs, they often rely on pattern matching from similar scenarios in their training corpus rather than systematic reasoning through problem constraints.
The distinction becomes particularly evident in domains requiring precise execution sequences where subgoal interactions create complex dependencies. Abstract planning scenarios like event organization or project management may appear successful when models generate reasonable-sounding step lists that human evaluators can easily correct or adapt. However, when these same models face formal planning problems with explicit preconditions, effects, and goal states, their performance deteriorates substantially. According to the research findings, this degradation occurs because executable planning requires verification of logical consistency across action sequences, a capability that extends beyond the approximate retrieval mechanisms underlying current language model architectures.
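The kind of logical-consistency check that executable planning requires can be sketched in a few lines: each action's preconditions must hold in the state produced by the preceding actions, and the goal must hold at the end. The STRIPS-style action encoding below is a simplified, hypothetical example rather than the benchmark used in the research.

```python
# Minimal sketch of plan validation: walk the action sequence, check that each
# action's preconditions hold, apply its effects, and confirm the goal at the end.
def validate_plan(initial_state, goal, plan, actions):
    state = set(initial_state)
    for name in plan:
        preconditions, add_effects, delete_effects = actions[name]
        if not preconditions <= state:                 # precondition violated
            return False, f"precondition failed at '{name}'"
        state = (state - delete_effects) | add_effects # apply the action's effects
    return goal <= state, "goal check"

# Hypothetical two-block example: move b1 from the table onto b2.
actions = {
    "pickup-b1": ({"clear-b1", "on-table-b1", "hand-empty"},
                  {"holding-b1"},
                  {"clear-b1", "on-table-b1", "hand-empty"}),
    "stack-b1-on-b2": ({"holding-b1", "clear-b2"},
                       {"on-b1-b2", "clear-b1", "hand-empty"},
                       {"holding-b1", "clear-b2"}),
}
initial = {"clear-b1", "clear-b2", "on-table-b1", "on-table-b2", "hand-empty"}
goal = {"on-b1-b2"}

print(validate_plan(initial, goal, ["pickup-b1", "stack-b1-on-b2"], actions))
# (True, 'goal check')
print(validate_plan(initial, goal, ["stack-b1-on-b2", "pickup-b1"], actions))
# (False, "precondition failed at 'stack-b1-on-b2'")
```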
Marketing professionals encounter similar challenges when evaluating AI tools for campaign planning and optimization tasks. Systems may generate comprehensive marketing strategies that include audience segmentation, channel selection, and budget allocation recommendations based on extracted knowledge from successful campaigns in their training data. While these outputs may contain valuable insights and appear strategically sound, they often lack the precise logical reasoning necessary to ensure budget constraints are respected, timing dependencies are maintained, and conflicting objectives are properly balanced across campaign elements.
LLM-Modulo framework provides alternative approach
Research teams propose the LLM-Modulo framework as a constructive application of current capabilities. This approach leverages language models' idea generation abilities while maintaining external verification systems. "The cleanest approach—one we advocate—is to let an external model-based plan verifier do the back prompting and to certify the correctness of the final solution," the research documentation explains.
This framework acknowledges models' value as knowledge sources while avoiding attribution of autonomous reasoning capabilities. Similar to historical knowledge-based AI systems, LLMs effectively replace human knowledge engineers by providing problem-specific information, albeit with relaxed correctness requirements compared to traditional approaches.
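A minimal sketch of that generate-test loop, with hypothetical `query_llm` and `verify_plan` placeholders standing in for a language model call and an external model-based verifier, looks roughly as follows; it illustrates the idea rather than the authors' implementation.

```python
# Minimal sketch of the LLM-Modulo generate-test loop: the language model proposes
# candidate plans, an external model-based verifier checks them, and the verifier's
# critique is fed back as a new prompt. `query_llm` and `verify_plan` are
# hypothetical placeholders, not APIs named in the research.
def llm_modulo(problem, query_llm, verify_plan, max_rounds=5):
    prompt = f"Propose a plan for: {problem}"
    for _ in range(max_rounds):
        candidate = query_llm(prompt)                    # LLM as idea generator
        ok, critique = verify_plan(problem, candidate)   # external, sound verifier
        if ok:
            return candidate                             # correctness certified externally
        # Back-prompt with the verifier's critique rather than trusting self-checks.
        prompt = f"Propose a plan for: {problem}\nPrevious attempt failed: {critique}"
    return None                                          # no certified plan found
```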
Fine-tuning experiments showed limited improvement in planning performance. Such modifications essentially convert planning tasks into memory-based approximate retrieval, functioning more like compilation from System 2 to System 1 processing rather than demonstrating actual planning capability.
Self-verification claims lack empirical support
Recent studies from Kambhampati's laboratory challenge claims about self-improvement capabilities in large language models. Research on plan verification and constraint verification indicates that self-verification performance actually worsens due to false positives and false negatives during solution evaluation.
The assumption that models excel at verification compared to generation lacks justification for LLM systems, unlike established computational complexity principles. "While for many computational tasks (e.g. those in class NP), the verification is often of lower complexity than generation, that fact doesn't seem particularly relevant for LLMs which are generating (approximately retrieving) guesses," the research explains.
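One way to make such failures measurable is to audit a model's self-verification against a sound external checker and count false positives and false negatives. The sketch below is a generic illustration with hypothetical placeholder functions, not the laboratory's evaluation code.

```python
# Minimal sketch of auditing self-verification: compare the model's own verdicts
# with a ground-truth checker and count false positives (incorrect solutions
# accepted) and false negatives (correct solutions rejected). Both callables are
# hypothetical placeholders.
def audit_self_verification(candidates, llm_says_valid, ground_truth_valid):
    false_positives = false_negatives = 0
    for candidate in candidates:
        claimed = llm_says_valid(candidate)
        actual = ground_truth_valid(candidate)
        false_positives += claimed and not actual    # accepted an incorrect solution
        false_negatives += (not claimed) and actual  # rejected a correct solution
    return {"false_positives": false_positives,
            "false_negatives": false_negatives,
            "total": len(candidates)}
```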
Human-in-the-loop prompting presents additional complications through potential Clever Hans effects. In such scenarios, humans with knowledge of correct solutions unconsciously guide models toward accurate responses, making it difficult to attribute success to model capabilities versus human steering.
Knowledge closure limitations
The discussion extends beyond individual model limitations to broader epistemological constraints. According to Kambhampati's analysis shared on social media in April 2025, "Neither LLMs nor LRMs have the ability to go beyond the humanity's knowledge closure—which is needed for true discoveries."
Large Reasoning Models (LRMs) such as OpenAI's o1 and o3 series represent attempts to address reasoning limitations through verifier signal compilation. However, this approach essentially creates synthetic "reasoning webs" to supplement training data rather than enabling genuine knowledge discovery beyond human capabilities.
The research emphasizes that verifiers themselves remain products of human knowledge, creating inherent bounds on system capabilities. "LLMs/LRMs are great as force multipliers," Kambhampati noted. "But if you really want your AI agent to learn and go beyond the humanity's knowledge closure, you also need your agent to act in the world (not simulator), and learn from that."
Marketing implications for AI adoption
These findings carry significant implications for marketing professionals evaluating AI-powered solutions. Understanding limitations becomes crucial when assessing AI tools that claim advanced reasoning capabilities for campaign optimization and customer analysis.
Recent PPC Land research indicates that marketing teams increasingly rely on AI for complex decision-making processes, from campaign optimization to customer segmentation strategies. However, the fragility revealed in reasoning studies presents concerns for applications requiring logical consistency.
Marketing analytics applications face particular challenges, as minor changes in data presentation or problem formulation could yield dramatically different insights. This undermines confidence in AI-driven marketing intelligence platforms that depend on consistent logical processing.
The research suggests marketing professionals should distinguish between scenarios requiring genuine reasoning versus those suitable for approximate retrieval and pattern recognition. Many marketing applications may benefit from standard language models rather than more expensive reasoning-enhanced alternatives.
Current trends show 80% of companies have chosen to block LLM access to their websites, reflecting growing concerns about AI capabilities and limitations across industries. This statistic underscores the importance of understanding actual versus claimed AI capabilities when developing marketing technology strategies.
Technical architecture insights
Language model architecture provides additional context for understanding these limitations. Modern systems use transformer models that process text through attention mechanisms, learning patterns from vast training datasets to track context and generate responses. The development process includes training phases, continuous improvement through techniques such as Supervised Fine-Tuning and Reinforcement Learning from Human Feedback, and an inference phase that generates outputs based on user inputs.
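At the core of that architecture sits scaled dot-product attention, in which each position weighs every other position's representation before mixing their values. A minimal NumPy sketch, with illustrative shapes and random inputs, conveys the mechanism.

```python
# Minimal sketch of scaled dot-product attention, the building block of transformer
# models. Shapes and values are illustrative only.
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over key positions
    return weights @ V                                 # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                                # four tokens, eight dimensions
x = rng.normal(size=(seq_len, d_model))
print(attention(x, x, x).shape)                        # (4, 8): one vector per token
```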
These systems function as "giant non-veridical memories akin to an external System 1 for us all," according to the research characterization. This framing helps explain both capabilities and limitations observed across different evaluation scenarios.
Future development considerations
The research concludes that while LLMs demonstrate remarkable approximate retrieval abilities suitable for various applications, attributing reasoning or planning capabilities creates false expectations. Effective deployment requires understanding these systems as powerful idea generation tools requiring external verification rather than autonomous reasoning agents.
This perspective maintains the potential value of current technologies while establishing realistic boundaries for application development. As marketing professionals increasingly integrate AI tools into their workflows, these distinctions become essential for successful implementation and appropriate expectation management.
Timeline
- February 9, 2022: Early social media discussion between LeCun and Kambhampati about reinforcement learning applications
- March 8, 2024: Kambhampati's "Can Large Language Models Reason and Plan?" research published on arXiv
- April 16, 2025: Kambhampati presents research findings at Microsoft Research, later referenced by LeCun
- June 7, 2025: Apple study reveals similar limitations in reasoning models through puzzle tests
- June 10, 2025: Salesforce study shows enterprise AI agents achieve only 35% success in multi-turn business scenarios
- June 26, 2025: MIT research reveals "potemkin understanding" phenomenon in language models
- July 8, 2025: Berkeley researcher proposes collectivist approach to AI development
Key Terms Explained
Universal Approximate Retrieval: This technical concept describes how large language models function by probabilistically reconstructing text completions based on training patterns rather than executing logical reasoning processes. Unlike traditional databases that retrieve exact matches, LLMs generate responses through statistical approximations of likely word sequences. For marketing professionals, this distinction becomes crucial when evaluating AI tools for tasks requiring precise logical consistency, such as budget optimization algorithms or compliance-sensitive campaign management systems.
LLM-Modulo Framework: A hybrid approach that combines language model idea generation with external verification systems to ensure accuracy and reliability. This framework acknowledges that while LLMs excel at producing creative solutions and identifying patterns, they require human oversight or automated checking mechanisms to validate outputs. Marketing teams can apply this concept by using AI for initial content generation or audience insights while implementing review processes before campaign deployment, ensuring both innovation and accuracy in their strategies.
Knowledge Closure Limitations: The theoretical boundary representing the collective knowledge available to humanity, which constrains what AI systems trained on human-generated data can discover independently. Current language models cannot exceed this boundary because they learn exclusively from existing human knowledge rather than conducting novel research or experiments. This limitation affects marketing applications by defining realistic expectations for AI capabilities in market research, competitive analysis, and strategic planning where genuine innovation beyond existing knowledge would be required.
System 1 versus System 2 Processing: A cognitive psychology framework distinguishing between fast, automatic thinking (System 1) and slower, deliberate reasoning (System 2). Large language models primarily operate as external System 1 processors, providing rapid pattern-based responses without the careful deliberation characteristic of System 2 reasoning. Marketing professionals should recognize this distinction when deploying AI tools, understanding that current systems excel at quick content generation and trend identification but may struggle with complex strategic reasoning requiring careful analysis of multiple variables.
Transformer Architecture: The underlying neural network design powering modern language models, utilizing attention mechanisms to process and understand relationships between words in text sequences. This architecture enables models to maintain context across long passages and generate coherent responses by weighing the importance of different input elements. For marketing applications, understanding transformer limitations helps explain why AI tools may excel at content creation and sentiment analysis while struggling with tasks requiring logical reasoning or mathematical precision in campaign optimization scenarios.
N-gram Model Dynamics: Statistical language modeling technique that predicts subsequent words based on sequences of preceding words, forming the foundation of how large language models generate text. Current systems like GPT-4 function as sophisticated n-gram models trained on web-scale data, enabling them to produce contextually appropriate responses through pattern recognition rather than genuine understanding. Marketing teams using AI for content creation should understand that outputs reflect statistical patterns from training data rather than original thinking, influencing how they approach content review and brand alignment processes.
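The next-word-prediction idea behind this description can be shown with a toy bigram model: count which words follow which in a corpus, then predict the most frequent continuation. Real LLMs learn far richer conditional distributions over tokens, but the prediction target is analogous; the corpus below is illustrative only.

```python
# Minimal bigram sketch of next-word prediction: count continuations, then predict
# the most frequent one. The toy corpus is illustrative only.
from collections import Counter, defaultdict

corpus = "the campaign drove clicks and the campaign drove conversions".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1                      # how often `nxt` follows `prev`

def predict_next(word):
    following = counts[word]
    return following.most_common(1)[0][0] if following else None

print(predict_next("campaign"))                 # 'drove'
print(predict_next("drove"))                    # 'clicks' or 'conversions' (tied counts)
```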
Reinforcement Learning from Human Feedback: A training methodology that improves AI model performance by incorporating human evaluator preferences into the learning process, helping align model outputs with desired behaviors and quality standards. This technique addresses some limitations of pure statistical training by introducing human judgment about response quality, safety, and appropriateness. Marketing professionals should recognize that RLHF-trained models may better understand brand guidelines and audience preferences, but still require oversight for strategic decision-making and complex reasoning tasks.
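A common formulation of the reward-model step in such pipelines is a pairwise preference loss: the learned reward for the human-preferred response should exceed the reward for the rejected one. The sketch below is a generic illustration of that loss, not the training code of any particular model discussed here.

```python
# Minimal sketch of a pairwise preference loss commonly used to train reward models
# in RLHF-style pipelines: -log(sigmoid(r_chosen - r_rejected)) is small when the
# human-preferred response scores higher. Values are illustrative only.
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    margin = reward_chosen - reward_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

print(preference_loss(2.0, 0.5))   # low loss: ranking agrees with the human preference
print(preference_loss(0.5, 2.0))   # high loss: ranking contradicts the human preference
```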
False Positive and False Negative Verification: Error types occurring when AI systems incorrectly validate or invalidate their own outputs, undermining the reliability of self-checking mechanisms. False positives involve accepting incorrect solutions as valid, while false negatives reject correct answers as flawed. These verification failures significantly impact marketing applications where accuracy matters, such as budget allocation algorithms, audience targeting parameters, or compliance checking systems, necessitating external validation processes rather than relying solely on AI self-assessment capabilities.
Attention Mechanisms: Neural network components that enable models to focus selectively on relevant parts of input data when generating responses, similar to how humans concentrate on important information while filtering out distractions. These mechanisms allow language models to maintain coherence across long texts and understand contextual relationships between distant words or concepts. Marketing professionals benefit from understanding attention limitations when deploying AI for complex analysis tasks, as these systems may struggle to maintain focus across multiple competing priorities or conflicting data sources in comprehensive campaign strategies.
Synthetic Data Generation: The process by which AI systems create artificial training examples to supplement or replace real-world data, often used to improve model performance in specific domains or tasks. While synthetic data can address privacy concerns and data scarcity issues, it may also introduce biases or artifacts that affect model reliability. Marketing teams should understand synthetic data implications when evaluating AI tools for customer analysis, market research, or predictive modeling, as the artificial nature of training data may limit the applicability of insights to real-world consumer behavior and market dynamics.
Summary
Who: Subbarao Kambhampati, professor at Arizona State University and former AAAI president, alongside Yann LeCun, VP and Chief AI Scientist at Meta, leading research into large language model capabilities.
What: Technical research demonstrating that large language models perform sophisticated pattern matching and retrieval rather than genuine reasoning or planning, challenging industry claims about AI capabilities.
When: Key findings published March 2024 through ongoing research presentations in 2025, with supporting studies from Apple, MIT, and Salesforce throughout 2025.
Where: Research conducted at Arizona State University with broader implications for AI development across technology companies and marketing organizations worldwide.
Why: Growing concern over misattributed AI capabilities driving unrealistic expectations in business applications, particularly affecting marketing technology adoption and enterprise AI deployment strategies.