Language, Intelligence, and the Multimodal Convergence

Natural language understanding was long classified as “AI-complete” — as hard as general intelligence itself. The assumption was that you’d need to solve reasoning first, and language would follow. Instead, the field discovered that training on language prediction alone produced systems with broad reasoning, coding, math, and planning abilities. Language wasn’t the destination. It was the vehicle.
The biological evidence tells a similar story.
Language as a Catalyst for Intelligence
Across species, more complex communication systems correlate with more flexible cognition. Cetaceans, great apes, corvids, and elephants all pass the mirror self-recognition test, use tools, and show social reasoning, and all have relatively sophisticated signaling systems. Bottlenose dolphins were the first nonprimates shown to pass the mirror test, while most primate species never do.[1]
But correlation isn’t causation. The stronger evidence comes from intervention studies, experiments in which researchers actively train subjects in new skills and measure the resulting cognitive changes. Kanzi the bonobo acquired 348 lexigrams (symbols representing words) and comprehended novel English sentences, demonstrating planning, categorical reasoning, and concept combination that untrained bonobos don’t display.[2] Language doesn’t just reflect intelligence. It scaffolds it.[3]
Human development data reinforces this. Children’s cognitive growth tracks closely with language acquisition. Inner speech (self-directed language) is critical for executive function, working memory, and self-regulation. Deaf children without early language exposure show delays of roughly three years on false-belief tasks, even when general intelligence is unaffected.[4]
The evolutionary picture suggests a feedback loop. Dunbar’s social brain hypothesis[5] showed that primate neocortex size correlates with social group size across 38 primate genera: bigger groups require bigger brains. Tomasello’s shared intentionality framework[6] argues that cooperative communication is the foundation of uniquely human cognition. Both point to co-evolution. Larger social groups demanded better communication. Better communication enabled cooperation and cultural transmission, selecting for still larger groups. A ratchet effect, and one that AI followed independently.
The AI Mirror
Large language models learn to predict the next token in a sequence of text. That’s it. But language is a compressed representation of human knowledge, reasoning patterns, and world models. The text is a shadow of the world, and modeling the shadow requires modeling much of the world.
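To make “predict the next token” concrete, here is a minimal sketch of the training objective, assuming PyTorch. The embedding-plus-linear “model” is a toy stand-in (a real LLM puts a deep transformer between those two layers), but the shifted cross-entropy loss is the same one frontier models minimize.

```python
# Minimal sketch of the next-token prediction objective (assumes PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64

embed = nn.Embedding(vocab_size, d_model)  # toy stand-in for a deep transformer
head = nn.Linear(d_model, vocab_size)      # maps hidden states to vocabulary logits

tokens = torch.randint(0, vocab_size, (1, 16))  # one sequence of 16 token IDs
logits = head(embed(tokens))                    # shape: (1, 16, vocab_size)

# Predict token t+1 from the prefix ending at t: shift inputs and targets by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions at positions 0..14
    tokens[:, 1:].reshape(-1),               # targets are positions 1..15
)
print(loss.item())  # training minimizes this loss over a vast text corpus
```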
Language-trained AI systems exhibit emergent abilities that weren’t directly optimized for: chain-of-thought reasoning, analogical thinking, in-context learning. Several of these appear abruptly once models reach roughly the hundred-billion-parameter scale.[7] The symbolic structure of language scaffolds abstract reasoning in silicon just as it does in brains.
Having models “think out loud” step by step before answering improved PaLM 540B’s accuracy on the GSM8K math benchmark from 18% to 57%.[8] This parallels Vygotsky’s theory of inner speech—the idea that externalized language gets internalized as a thinking tool.[9] AI models reason better when they use language to structure their thinking, just as children do.
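Concretely, chain-of-thought prompting only changes the prompt. Here is a sketch using the canonical worked exemplar from Wei et al.[8]; `generate` is a hypothetical placeholder for any LLM completion call, not a real API.

```python
# Chain-of-thought prompting: prepend a worked example whose answer spells out
# its reasoning, so the model imitates the step-by-step style before answering.

question = (
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)

direct_prompt = f"Q: {question}\nA:"  # baseline: the model tends to answer immediately

cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    f"Q: {question}\nA:"
)

# answer = generate(cot_prompt)  # hypothetical call; the model now writes out
#                                # its reasoning before committing to an answer
```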
The limits mirror biological ones. Current language models struggle with tasks that don’t map well to linguistic representation: fine motor planning, continuous spatial reasoning, real-time sensory processing. These are areas where embodied experience matters more than language, just as octopus intelligence suggests non-linguistic routes to cognition exist.[10]
The Multimodal Convergence
The frontier of AI research is increasingly multimodal—integrating language, vision, audio, and action into unified systems. The key finding isn’t just that these systems can handle multiple formats. It’s that each modality improves performance in the others. A model that can see images reasons better about spatial language. A model that processes code writes better natural language explanations. The modalities are synergistic, not merely additive.
This addresses a central critique of pure language models: the symbol grounding problem.[11] Words in a language-only system are patterns of tokens without real-world reference. Multimodal training partially solves this. A model that has seen millions of images of dogs alongside the word “dog” has something closer to a grounded concept than one that only knows “dog” from its textual relationships to other words.
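One common way this partial grounding is operationalized is contrastive image-text training in the style of CLIP. Below is a minimal sketch, assuming PyTorch and using random vectors in place of real image and text encoders; only the loss structure is the point.

```python
# CLIP-style contrastive alignment sketch (assumes PyTorch; toy embeddings).
import torch
import torch.nn.functional as F

batch, dim = 8, 32
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # from an image encoder
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # from a text encoder

# Similarity of every image to every caption; matched pairs lie on the diagonal.
logits = image_emb @ text_emb.T / 0.07  # 0.07 is a typical temperature

targets = torch.arange(batch)
# Pull each image toward its own caption (and vice versa), push all others apart.
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```

Trained this way, the word “dog” and pictures of dogs end up near each other in a shared embedding space, which is what gives the textual symbol a perceptual anchor.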
The biological parallel is close. Human cognition is fundamentally multimodal. Our concepts aren’t stored as text—they’re distributed across sensory, motor, and linguistic representations.[12] The brain’s convergence zones integrate information across modalities into unified representations. Multimodal AI architectures are converging on the same design principle.
The next frontier is robotics. Teams at Google DeepMind, Figure, and others are connecting language models to robotic bodies.[13] Language provides a powerful planning and abstraction layer; embodied experience provides the grounding and physics understanding that language alone struggles with. A robot that can be told “pick up the fragile thing carefully” needs language comprehension, visual recognition, and motor control integrated seamlessly.
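A hypothetical sketch of that interface is below. `Action` and `plan_step` are invented for illustration; an actual vision-language-action model such as RT-2[13] maps camera images and instructions to low-level action tokens end to end, rather than keyword matching.

```python
# Hypothetical vision-language-action interface (placeholder names throughout).
from dataclasses import dataclass

@dataclass
class Action:
    dx: float          # end-effector displacement, metres
    dy: float
    dz: float
    grip_force: float  # newtons; "carefully" should map to a low value

def plan_step(instruction: str, image: object) -> Action:
    # Stand-in for a vision-language-action policy: a real model conditions on
    # the camera image and the instruction jointly. Here the language side is
    # faked with a keyword check purely to show the shape of the interface.
    gentle = any(w in instruction for w in ("careful", "fragile", "gently"))
    return Action(dx=0.0, dy=0.0, dz=-0.01, grip_force=0.5 if gentle else 5.0)

action = plan_step("pick up the fragile thing carefully", image=None)
print(action)  # Action(dx=0.0, dy=0.0, dz=-0.01, grip_force=0.5)
```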
Some capabilities emerge only when modalities combine: spatial reasoning that pure language models fail at, and generalization of instructions to novel visual scenarios never described in text.
The trajectory points toward unified world models — systems that maintain a shared internal representation updated by whatever modality is available. Your brain already works this way, integrating what you see, hear, feel, and know into a single coherent experience.[14] The pattern holds across biological and artificial intelligence: language is the most powerful single modality for abstract reasoning, but it reaches its full potential when grounded in other forms of experience.
Language isn’t intelligence. But it might be the closest thing to a universal catalyst for it.
References
1. Reiss, D., & Marino, L. (2001). “Mirror self-recognition in the bottlenose dolphin: A case of cognitive convergence.” Proceedings of the National Academy of Sciences, 98(10), 5937–5942. — First demonstration of mirror self-recognition in a nonprimate species.
2. Savage-Rumbaugh, S., & Lewin, R. (1994). Kanzi: The Ape at the Brink of the Human Mind. Wiley. — Documents Kanzi’s language acquisition and the emergent cognitive abilities observed in language-trained bonobos.
3. Lupyan, G., & Bergen, B. (2016). “How Language Programs the Mind.” Topics in Cognitive Science, 8(2), 408–424. — Reviews evidence for how language shapes perception, categorization, and memory across domains.
4. Peterson, C. C., & Siegal, M. (2000). “Insights into Theory of Mind from Deafness and Autism.” Mind & Language, 15(1), 123–145. — Demonstrates that deaf children without early language access show delays in theory of mind development.
5. Dunbar, R. I. M. (1998). “The Social Brain Hypothesis.” Evolutionary Anthropology, 6(5), 178–190. — Proposes that primate brain size evolved primarily to manage complex social relationships, with language as a key enabler.
6. Tomasello, M. (2008). Origins of Human Communication. MIT Press. — Argues that shared intentionality and cooperative communication are the foundations of uniquely human cognition.
7. Wei, J., et al. (2022). “Emergent Abilities of Large Language Models.” Transactions on Machine Learning Research. — Documents cognitive capabilities that appear suddenly at scale in language models without being explicitly trained.
8. Wei, J., et al. (2022). “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems, 35, 24824–24837. — Demonstrates that step-by-step verbal reasoning dramatically improves LLM performance on complex tasks.
9. Vygotsky, L. S. (1934/1986). Thought and Language. MIT Press. — The foundational work on inner speech as a cognitive tool, arguing that language transforms thinking rather than merely expressing it.
10. Mather, J. A., & Dickel, L. (2017). “Cephalopod Complex Cognition.” Current Opinion in Behavioral Sciences, 16, 131–137. — Reviews evidence for sophisticated intelligence in octopuses and cuttlefish despite minimal social communication.
11. Harnad, S. (1990). “The Symbol Grounding Problem.” Physica D, 42, 335–346. — Defines the problem of how symbols acquire meaning, a central challenge for both AI and cognitive science.
12. Barsalou, L. W. (1999). “Perceptual Symbol Systems.” Behavioral and Brain Sciences, 22(4), 577–660. — Proposes that cognition is grounded in simulated sensory-motor experience rather than amodal symbols.
13. Brohan, A., et al. (2023). “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” arXiv preprint arXiv:2307.15818. — Shows how multimodal language-vision models can be connected to robotic action for grounded intelligence.
14. Goyal, A., & Bengio, Y. (2022). “Inductive Biases for Deep Learning of Higher-Level Cognition.” Proceedings of the Royal Society A, 478(2266). — Discusses architectural principles for building AI systems that develop abstract reasoning, including the role of multimodal integration.