The hardest critique of disembodied AI is older than modern deep learning: a system that only manipulates symbols never actually understands them. Whether that critique still bites in the era of multimodal models and robot policies is a live question.
The symbol grounding problem
Stevan Harnad's 1990 formulation: how do the symbols a system manipulates ever come to mean what they purport to mean, rather than being formal squiggles? A dictionary that defines words in terms of other words is circular; somewhere the chain has to bottom out in non-symbolic experience.
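The circularity can be made concrete with a toy sketch. It assumes an invented ten-word dictionary and a deliberately crude rule: a word counts as understood only once every word in its definition is understood. The dictionary entries and the function name `ungrounded_words` are illustrative assumptions, not anything from Harnad's paper.

```python
# Toy dictionary: every word is defined only by other words in the same dictionary.
# The entries are invented purely for illustration.
TOY_DICTIONARY = {
    "zebra": ["horse", "stripes"],
    "horse": ["animal", "mane"],
    "stripes": ["lines", "pattern"],
    "animal": ["creature"],
    "creature": ["animal"],
    "mane": ["hair"],
    "hair": ["strand"],
    "strand": ["hair"],
    "lines": ["pattern"],
    "pattern": ["lines"],
}


def ungrounded_words(dictionary, grounded):
    """Return the words whose meaning cannot be traced back to a grounded word.

    Crude assumption: a word is understood once all words in its definition are.
    """
    resolved = set(grounded)
    changed = True
    while changed:  # keep sweeping until no new word can be resolved
        changed = False
        for word, definition in dictionary.items():
            if word not in resolved and all(w in resolved for w in definition):
                resolved.add(word)
                changed = True
    return sorted(set(dictionary) - resolved)


# With no word tied to anything outside the dictionary, nothing ever resolves.
print(ungrounded_words(TOY_DICTIONARY, grounded=set()))

# Grounding a few words (say, through perception) lets the rest be built up.
print(ungrounded_words(TOY_DICTIONARY, grounded={"animal", "hair", "lines"}))
```

Under these assumptions the first call returns every word in the dictionary and the second returns an empty list, which is the shape of Harnad's argument: definitions can propagate meaning, but only from a base that is not itself defined in words.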
For classical symbolic AI this was a deep problem. For deep learning systems trained on raw pixels and audio, the answer is less clear: embeddings learned from sensory data are at least tied to something outside the symbol system, though whether that amounts to meaning is disputed.
Embodied cognition
The embodied cognition tradition (Varela, Thompson, Rosch; Andy Clark) argues that thinking is fundamentally shaped by having a body that acts in the world. Concepts like 'up,' 'heavy,' 'grasp,' or 'forward' are not abstract: they are scaffolded on bodily experience.
If embodied cognition is right in its strong form, no purely text-trained system can fully understand the words it manipulates. If it's right in a weaker form, multimodal training plus some action data is enough. The empirical jury is still out.
Vision-language-action models
Since 2023, vision-language-action (VLA) models (RT-2, the RT-X models trained on the Open X-Embodiment dataset, Octo, Gemini Robotics, GR00T-style foundation policies) have brought transformer-scale learning to robotics. In this approach a single model handles perception, language understanding, and motor control, in several cases across many robot embodiments.
These systems are still far from human dexterity, but they have changed what 'embodied AI' means in practice. Robotics is no longer treated as a wholly separate field; it is increasingly becoming the action interface for general models.
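One way a single network can emit motor commands through the same output channel as text is to discretize each continuous action dimension into a fixed number of bins and treat the bin indices as tokens, an approach described for RT-2-style models. The sketch below is a minimal illustration under stated assumptions: 256 bins, actions normalized to [-1, 1], and the hypothetical helper names `action_to_tokens` and `tokens_to_action`. It is not the code of any particular system.

```python
NUM_BINS = 256          # assumed number of bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range


def action_to_tokens(action):
    """Map each continuous action value to an integer bin index in [0, NUM_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)           # clamp out-of-range commands
        fraction = (clipped - LOW) / (HIGH - LOW)      # rescale to [0, 1]
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def tokens_to_action(tokens):
    """Map bin indices back to the centre of each bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (t + 0.5) * width for t in tokens]


# Hypothetical 4-dimensional command, e.g. three end-effector deltas and a gripper value.
command = [0.0, -1.0, 1.0, 0.5]
tokens = action_to_tokens(command)
recovered = tokens_to_action(tokens)
print(tokens, recovered)
```

The round trip loses at most half a bin width per dimension, which is the price paid for letting a language-model-style decoder predict actions as ordinary discrete tokens. Other systems use continuous action heads instead, so this is one design choice among several.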
What grounding does and doesn't fix
Adding cameras, microphones, and robot bodies clearly helps with the kinds of common-sense reasoning that text-only models tend to struggle with: object permanence, intuitive physics, tool use. It does not obviously fix abstract reasoning, mathematics, or social inference.
The best current bet is that grounding is necessary but not sufficient: AGI will need it, but it will also need everything language-only models have learned, plus better reasoning and planning machinery.
Key terms
- Symbol grounding: the problem of how formal symbols come to mean anything.
- Embodied cognition: the view that thinking is shaped by having a body that acts in the world.
- VLA model: vision-language-action model, a single network handling perception, language, and motor control.
- Sim-to-real: training a policy in simulation and transferring it to physical hardware.
- Foundation policy: a robot control policy trained on broad data, reusable across tasks and embodiments.
Connects to AGI
If AGI must be embodied, the bottleneck is robotics, data, and dexterity, not language models. If embodiment is helpful but optional, scale and reasoning are the bottleneck. As of 2026, most major labs appear to be hedging by pursuing both.