Introduction: Apple’s Bold Claim on AI Reasoning
In a move that’s turning heads across the AI industry, Apple has released research papers challenging a core assumption about artificial intelligence: that today’s models can reason. Despite the growing adoption of large language models (LLMs) such as GPT-4o, Claude, Gemini, and others, Apple asserts that these systems do not truly “reason.” Instead, they function more like sophisticated pattern matchers, creating what Apple calls an illusion of thinking.
Why This Matters
From search engines to smart assistants, code generation to autonomous vehicles, the perception that LLMs can “think” is foundational to their increasing role in society. If Apple is right, this could fundamentally shift our expectations — and strategies — for AI development.
Inside Apple’s Landmark Research

1. “The Illusion of Thinking” – June 2025
Apple’s primary research paper, “The Illusion of Thinking,” explores how LLMs and large reasoning models (LRMs) perform on a range of logical puzzles. The results? Shocking.
Experiment Highlights
- Controlled puzzle tasks: Tower of Hanoi, Checker Jumping, River Crossing, Blocks World.
- Transparent reasoning: Step-by-step tracing of each model’s thought process (a minimal checker sketch follows this list).
- No data contamination: Custom puzzles ensured zero overlap with training data.
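To make “step-by-step tracing” concrete, here is a minimal sketch of how a Tower of Hanoi transcript can be checked move by move. It is an illustrative harness under assumed conventions (a move is a `(source_peg, target_peg)` pair), not Apple’s actual evaluation code.

```python
# Minimal sketch of a step-by-step Tower of Hanoi checker (illustrative, not Apple's harness).
# A move is a (source_peg, target_peg) pair; pegs are indexed 0, 1, 2.

def check_hanoi_trace(n_discs, moves):
    """Replay a proposed move sequence and report the first rule violation, if any."""
    pegs = [list(range(n_discs, 0, -1)), [], []]  # peg 0 holds discs n..1, largest at the bottom
    for step, (src, dst) in enumerate(moves, start=1):
        if not pegs[src]:
            return False, f"step {step}: peg {src} is empty"
        disc = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disc:
            return False, f"step {step}: disc {disc} placed on smaller disc {pegs[dst][-1]}"
        pegs[dst].append(pegs[src].pop())
    solved = pegs[2] == list(range(n_discs, 0, -1))
    return solved, "solved" if solved else "all moves legal, but the puzzle is not solved"

# Example: the optimal 7-move solution for 3 discs.
ok, msg = check_hanoi_trace(3, [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)])
print(ok, msg)  # True solved
```

Because the whole move sequence is replayed, a grader can flag the first illegal step instead of only marking the final answer wrong, which is exactly the kind of chain-level evaluation discussed later in this piece.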
Key Findings
- Easy Tasks: LLMs often outperform LRMs, which tend to overthink simple problems.
- Medium Tasks: LRMs provide slight improvements through added reasoning.
- Hard Tasks: Both model types experience catastrophic accuracy collapse.
- Thinking Paradox: Beyond a certain difficulty, models produce shorter reasoning traces, not longer ones.
- Token Budget Limits: Failures often stem from hitting token caps rather than flawed logic (see the quick calculation below).
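A quick back-of-the-envelope calculation shows why output budgets matter: the shortest Tower of Hanoi solution for n discs takes 2^n − 1 moves, so spelling every move out in prose grows exponentially. The tokens-per-move figure below is purely an assumption for illustration.

```python
# Rough illustration: minimal Tower of Hanoi move counts vs. a plain-text output budget.
# The ~10 tokens-per-move estimate is an assumption, not a measured figure.
for n in (7, 10, 15, 20):
    moves = 2**n - 1
    print(f"{n} discs: {moves:>9,} moves  ≈ {10 * moves:>10,} tokens to write out")
```

At 15 discs the written-out transcript alone would already run to hundreds of thousands of tokens, well past typical generation limits.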
“We observed zero evidence of genuine logical reasoning in any model tested.” — Apple Research
2. “GSM-Symbolic” – October 2024
Apple extended its critique to LLMs’ performance on the widely used GSM8K math benchmark, exposing vulnerabilities in what many considered one of their strengths: numerical reasoning.
GSM-Symbolic Key Insights
- Modified Problem Sets: Changing names and values in math questions (e.g., Jimmy → John) led to 10–40% drops in accuracy (a toy version of this templating is sketched after this list).
- Irrelevant Info Confusion: Adding distractors (e.g., “five were smaller than average”) confused even top models like GPT-4o and Claude 3.
- Performance Drop Examples:
  - Claude 3.5: ~60% accuracy drop.
  - GPT-4o: Down by ~32%.
  - o1 Preview: ~17.5% drop after small prompt edits.
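The templating idea behind these perturbations can be sketched in a few lines. This is an illustrative reconstruction, not Apple’s released GSM-Symbolic code: the same word problem is re-instantiated with different names and numbers, the ground-truth answer is recomputed from the template variables, and an irrelevant clause can optionally be appended.

```python
import random

# Illustrative GSM-Symbolic-style template (not from Apple's released dataset).
TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "How many apples does {name} have in total?")
DISTRACTOR = " Note that {c} of the apples were smaller than average."

def make_variant(add_distractor=False, rng=random):
    name = rng.choice(["Jimmy", "John", "Priya", "Mei"])
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(name=name, a=a, b=b)
    if add_distractor:
        # The clause changes nothing mathematically; the answer is still a + b.
        question += DISTRACTOR.format(c=rng.randint(1, min(a, b)))
    return question, a + b  # ground truth recomputed from the template values

q, answer = make_variant(add_distractor=True)
print(q, "->", answer)
```

Because the correct answer is derived from the template values, a model that misuses the distractor clause is exposed immediately, even though the clause changes nothing about the arithmetic.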
Takeaway
Even state-of-the-art models falter when irrelevant but semantically plausible information is introduced — a red flag for real-world applications.
Strategic Implications for Apple
Apple’s research timing — right before WWDC — suggests a calculated move to temper AI expectations before launching new products. It aligns with Apple’s history of prioritizing usability and safety over AI hype. Unlike OpenAI or Google, Apple may aim for “reliable AI” rather than “reasoning AI.”
Broader Impact on the AI Industry
Scaling Isn’t the Solution
- Simply adding more parameters, data, or compute won’t overcome the reasoning barrier.
- Models fail at algorithmic generalization: the ability to apply learned rules to new tasks.
Time for New Architectures?
- Apple’s findings support critics like Gary Marcus, who argue for neurosymbolic AI, a hybrid approach combining neural networks with symbolic reasoning (a bare-bones sketch of the idea follows).
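What such a hybrid loop could look like at its simplest: a neural model proposes a candidate answer, and a symbolic verifier either accepts it or returns concrete feedback for a retry. Everything here is a hypothetical sketch; `fake_llm_propose` stands in for a real LLM call and the toy arithmetic task stands in for a real problem.

```python
# Hypothetical neurosymbolic loop: a neural "proposer" plus a symbolic verifier (sketch only).
# fake_llm_propose stands in for an LLM call; the verifier checks the arithmetic exactly.

def fake_llm_propose(problem, feedback=None):
    """Stand-in for an LLM: returns a guessed sum, deliberately wrong on the first attempt."""
    a, b = problem
    return a + b if feedback else a + b - 1  # first attempt is off by one

def verify(problem, candidate):
    """Symbolic check with a definite verdict and, on failure, concrete feedback."""
    a, b = problem
    if candidate == a + b:
        return True, None
    return False, f"expected {a} + {b} = {a + b}, got {candidate}"

def solve(problem, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        candidate = fake_llm_propose(problem, feedback)
        ok, feedback = verify(problem, candidate)
        if ok:
            return candidate  # accept only when the symbolic check passes
    return None  # give up after repeated failures

print(solve((17, 25)))  # 42, reached after one corrected round
```

The design point is that acceptance is decided by the symbolic check, which gives a definite verdict, rather than by the model’s own confidence.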
Redefining Evaluation
- Traditional benchmarks like GSM8K may misrepresent true model capabilities.
- Future evaluations must analyze reasoning chains, not just final answers.
Expert Commentary: What AI Leaders Are Saying
Gary Marcus: “LLMs Are Flawed by Design”
- Believes AI has overpromised and underdelivered.
- Recommends moving toward neurosymbolic systems.
- Warns against trusting LLMs in high-stakes use cases like healthcare or law.
Yann LeCun: “LLMs Will Become Obsolete”
- Meta’s Chief AI Scientist sees LLMs as a temporary bridge.
- Predicts the next paradigm will involve:
  - Persistent memory
  - World models
  - Planning + reasoning
- Believes LLMs lack essential abilities for human-level intelligence.
Ilya Sutskever: “The Brain is a Biological Computer”
- Still believes in LLMs’ potential to reach AGI with sufficient scale.
- Acknowledges their current unpredictability and reasoning gaps.
- Supports adding self-correction and goal-driven architectures to next-gen models.
Human Fallibility: Are LLMs Just Like Us?
Apple’s findings also mirror well-known human cognitive failures:
| Human Fallibility | LLM Equivalent |
|---|---|
| Anchoring Bias | Overreliance on first prompt token |
| Heuristic Shortcuts | Pattern completion from similar training examples |
| Irrelevant Info Influence | Performance degradation from non-mathematical clauses |
This raises philosophical questions: If LLMs make human-like mistakes, are they not reasoning — or are they reasoning too much like us?
Reframing the Debate: What Is Reasoning?
One surprising discovery: while LLMs failed logical puzzles in text, some succeeded via code generation.
Example: Claude 3.7 generated HTML+JS to simulate the Tower of Hanoi with 20 discs — something few humans could do manually.
This suggests an alternative form of reasoning: tool-use via language. Is reasoning strictly symbolic, or can executing functional outputs (like code) qualify?
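For context on why code generation sidesteps the length problem: a 20-disc puzzle has 2^20 − 1 = 1,048,575 moves, so enumerating them in prose is hopeless, while a few lines of generated code can compute them all. Below is a minimal Python analogue of what such a generated program does; the original Claude output was HTML+JS, so this is an assumed equivalent, not a reproduction.

```python
# A compact recursive Tower of Hanoi solver: the kind of program a model can emit
# instead of writing out every move in natural language. (Illustrative analogue of
# the HTML+JS Claude reportedly produced, not a reproduction of it.)

def hanoi(n, src="A", aux="B", dst="C"):
    """Yield the optimal move sequence for n discs from src to dst."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, dst, aux)   # park the n-1 smaller discs on the spare peg
    yield (src, dst)                          # move the largest disc
    yield from hanoi(n - 1, aux, src, dst)   # bring the smaller discs back on top

moves = list(hanoi(20))
print(len(moves))   # 1048575 == 2**20 - 1
print(moves[:3])    # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

The program is a dozen lines, yet it encodes the full million-move solution, which is what makes “reasoning via generated code” a meaningfully different capability from reasoning step by step in text.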
What’s Next for AI?
If Apple Is Right:
- LLMs will remain powerful assistants, not autonomous reasoners.
- Reliable AI may require symbolic logic + neural networks.
- Expectations around AGI may need a reset.
If Critics Are Right:
- Reasoning may emerge as LLMs grow in size and capability.
- Token-based models may evolve into multi-modal, goal-driven AI.
Final Thoughts: Don’t Mistake Fluency for Intelligence
Apple’s bold research reaffirms a critical principle: just because an AI can say something smart doesn’t mean it understands it. For now, LLMs remain incredibly useful, but not fully reliable.
TL;DR Summary
- Apple says LLMs don’t truly reason; they pattern match.
- Controlled experiments show LLMs collapse on hard problems.
- Math tests reveal fragility to minor changes.
- Experts agree AI needs new architecture for real thinking.
- LLMs ≠ AGI, but they’re great at structured tasks (like coding).
- The future? Hybrid systems, self-correction, and tool-using AIs.
Further Reading:
Apple’s Full Research on “The Illusion of Thinking”
What Do You Think?
Is Apple being too skeptical—or are we finally facing the limits of LLMs? Drop your thoughts in the comments 👇