AI Doesn’t Dream in Tupi – But It Might Learn Nheengatu
Brazil's forgotten Indigenous language is finding new life through artificial intelligence. Can algorithms help revive what colonisation tried to erase?

When we talk about the future, we often talk about technology. Rarely do we talk about the past.
Even more rarely do we talk about the languages left behind—those that were never digitised, never archived, never taught to the machines now shaping our world.
One of them is Tupi, the once-dominant language of Brazil's Atlantic coast. Before Portuguese became official, Tupi was spoken across thousands of kilometres of forest, rivers, and villages. It was the first language taught to Jesuit missionaries, the foundation of Brazil’s earliest colonial communications—and one of the first to be legislated out of existence.
Today, Old Tupi is considered extinct. But a descendant lives on: Nheengatu, a contact language shaped by Tupi, Portuguese, and regional Indigenous speech.
Now, in an unlikely twist, Nheengatu is being taught to artificial intelligence.
The Language the State Forgot
Nheengatu (literally, “the good language”) was developed in the 16th century as a tool for communication in Brazil’s vast and linguistically diverse interior. For a time, it was even promoted by the colonial authorities, who found it useful for unifying disparate Indigenous groups under missionary control. But by the 18th century, that policy had been reversed.
The Marquis of Pombal, in a move designed to consolidate Portuguese colonial identity, banned the use of Tupi and its offshoots. The suppression of Indigenous languages continued well into the republican era. For generations, children were punished for speaking anything other than Portuguese at school.
Yet in the Amazon, especially around São Gabriel da Cachoeira, Nheengatu endured. It adapted, evolved, and survived orally, passed from grandparent to child in river villages and forest clearings.
In 2002, it was recognised as a co-official language in São Gabriel, a landmark for linguistic rights in Brazil. But that recognition didn’t guarantee digital survival.
The Algorithmic Divide
Artificial intelligence systems, such as those behind Google Translate, Siri, or ChatGPT, are trained on massive volumes of data: books, articles, Wikipedia pages, social media content. These training sets are overwhelmingly in English, Mandarin, Spanish, and Portuguese. Indigenous languages? Barely present.
And when a language isn’t in the dataset, it isn’t in the algorithm’s world. It doesn’t get translated. It doesn’t complete your sentences. It’s rendered invisible—a phenomenon some scholars are now calling “algorithmic extinction.”
But a group of Brazilian researchers is trying to change that.
At the University of São Paulo’s Center for Artificial Intelligence, in partnership with IBM Research, a team has been building AI tools in Nheengatu: not just translation engines, but educational and writing support systems. These tools allow users to translate, spellcheck, and even auto-complete in a language that, until recently, had little or no software support.
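To make the idea of a spellchecker concrete, here is a minimal sketch of how one could suggest corrections from a small word list. It is not the project's implementation, which the article does not describe; the function name `suggest` and the word list are hypothetical, and the words are English placeholders standing in for real Nheengatu vocabulary.

```python
from difflib import get_close_matches

# Placeholder word list: English stand-ins, NOT real Nheengatu vocabulary.
KNOWN_WORDS = ["river", "forest", "village", "language", "story"]


def suggest(word, known_words=KNOWN_WORDS, max_suggestions=3, cutoff=0.7):
    """Return known words that closely resemble `word`, best match first."""
    lowered = word.lower()
    if lowered in known_words:
        return [lowered]  # already matches a known spelling
    # difflib ranks candidates by string similarity (0.0 to 1.0)
    # and drops anything scoring below `cutoff`.
    return get_close_matches(lowered, known_words, n=max_suggestions, cutoff=cutoff)


print(suggest("rivr"))  # ['river']
print(suggest("xyz"))   # []
```

Even this toy version shows why a curated word list matters: the checker can only suggest words that someone has already written down.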
Crucially, the project has involved Nheengatu speakers and Indigenous educators in the development process. It’s not just about technical innovation—it’s about linguistic justice.
Small Data, Big Potential
Unlike English or Portuguese, there isn’t a vast corpus of digitised Nheengatu material. The team worked with a dataset of just 7,000 sentence pairs, drawing from oral recordings, religious texts, educational materials, and Indigenous stories.
In AI terms, that’s an extremely low-resource environment. But the researchers adapted their methods, using machine learning techniques that can make the most of small, carefully curated datasets. The result is a suite of tools that not only support the language, but also open up new possibilities for its preservation and revitalisation.
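The article does not say which techniques the researchers used, so the following is only an illustration of how a small corpus can still power a useful feature. It sketches auto-complete by counting which word follows which (a bigram model). The names `train_bigrams` and `complete` are hypothetical, and the corpus is an English placeholder for a small set of curated sentences.

```python
from collections import Counter, defaultdict

# Placeholder corpus: English stand-ins for a small curated set of sentences.
CORPUS = [
    "the river is wide",
    "the river is deep",
    "the forest is old",
]


def train_bigrams(sentences):
    """Count, for every word, which words follow it in the corpus."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def complete(model, word, k=2):
    """Return up to k of the most frequent next words seen after `word`."""
    counts = model.get(word.lower())
    if counts is None:
        return []  # word never seen in the corpus, so nothing to suggest
    return [candidate for candidate, _ in counts.most_common(k)]


model = train_bigrams(CORPUS)
print(complete(model, "the"))       # ['river', 'forest']
print(complete(model, "mountain"))  # []
```

The empty result for an unseen word is the small-data problem in miniature: every sentence added to the corpus widens what the tool can suggest.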
For young Nheengatu speakers, this is a game-changer. They can now type messages, write stories, or learn grammar in their ancestral language—on the same platforms where they interact with the rest of the world.
Beyond Words: Preserving Worldviews
To treat Nheengatu as just a “language” is to miss the point. Indigenous languages encode ways of seeing, being, and relating to the world that are often absent from Western frameworks.
In Tupi-derived languages, identity is not fixed but relational. Time is cyclical. Rivers and spirits, humans and animals, are part of the same conversation. Much of this nuance gets lost when translated into global tech standards—especially when those standards were never designed to accommodate Indigenous epistemologies.
So the question isn’t just whether AI can “speak” Nheengatu. It’s whether it can learn to listen differently.
A Blueprint for Linguistic Inclusion
Brazil is home to over 180 Indigenous languages, many of which are endangered. Nheengatu is relatively lucky: it has political recognition, growing community use, and now, its first set of digital tools.
But the methods developed for Nheengatu could be adapted elsewhere. Already, similar initiatives are underway for African languages like Yoruba and Igbo, and Indigenous North American tongues like Mohawk and Dakota.
What they all share is a belief that AI doesn’t have to reproduce the biases of the past. With the right design, it can help repair historical silence.
The Future Talks Back
AI doesn’t dream in Tupi. Not yet. But in a country where Indigenous languages were once silenced by decree, machines are beginning to speak words that were nearly lost.
It’s not just a technical achievement. It’s a symbolic one.
Because when a language like Nheengatu appears in a digital interface—corrected, translated, suggested—it tells its speakers: you belong here.
And that might be the most revolutionary sentence of all.