Where does a text come from? In the age of fake news, that is no trivial question.
In mid-February, the nonprofit organization OpenAI, which is largely funded by Silicon Valley mogul Elon Musk, released a remarkable paper. In it, the scientists described a new language model. The adaptive software, GPT-2, was trained to complete sentences and to continue texts plausibly. It can also answer simple questions about a given text.
If you feed the software two fictitious sentences, it even spins them into a complete story.
One prompt was:
“Scientists have discovered a herd of unicorns living in a remote, unexplored valley in the Andes, but even more surprising to the researchers was the fact that the unicorns spoke perfect English.”
“The four-horned, silver-white unicorns were previously unknown to science. … Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions were exploring the Andes when they found a small valley where no other animals or humans lived. Pérez noted that the valley had what appeared to be a natural fountain and was framed by two peaks of rock covered in silvery snow. Pérez and the others ventured further into the valley. ‘By the time we reached one of the peaks, the water looked blue, covered with crystals,’ said Pérez. Pérez and his friends were amazed to see the unicorn herd.”
While this is not high literature, it is remarkable for a number of reasons: the model invented the evolutionary biologist Pérez entirely on its own, although it was not explicitly programmed to do so. Content inconsistencies such as the “four-horned unicorns” and the occasional repetition of words are weaknesses the researchers still acknowledge in the model.
The OpenAI researchers achieved this progress not through a new method but through more resources. They gave the algorithm ten times more training data than before, which also required ten times more computing time for the training.
Technically, the software rests on the principle of distributional semantics. It assumes that words occurring in similar contexts also have similar meanings. For example, the words “cat” and “dog” are related in meaning because they are used in more or less the same way: you can feed a cat or a dog, but not an orange.
Algorithms based on distributional semantics use machine learning, essentially counting how often words appear in a text and which other words cluster around them. The resulting models can use the patterns thus learned to complete words, but also to construct whole sentences or even paragraphs. In recent years, some researchers have also modeled the distribution of character sequences instead of whole words, so that models can deal more flexibly with acronyms, punctuation, slang, and other deviations that do not appear in the dictionary.
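The counting idea behind distributional semantics can be sketched in a few lines. The toy corpus, window size, and similarity measure below are illustrative assumptions, not details from the article: we count which words appear near each other and compare the resulting count vectors.

```python
from collections import Counter
from math import sqrt

# Tiny illustrative corpus (an assumption for this sketch).
corpus = (
    "you can feed a cat . you can feed a dog . "
    "you can peel an orange . you can eat an orange ."
).split()

WINDOW = 2  # words within this many positions count as "context"

def context_vector(target, tokens, window=WINDOW):
    """Count the words occurring within `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

cat, dog, orange = (context_vector(w, corpus) for w in ("cat", "dog", "orange"))
# "cat" and "dog" share the contexts "feed a ..." in this corpus, so
# their similarity is higher than that of "cat" and "orange".
print(cosine(cat, dog) > cosine(cat, orange))  # → True
```

Real systems refine the raw counts (for example, weighting rare context words more heavily) and compress the vectors, but the underlying signal is the same co-occurrence pattern.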
What makes the story politically explosive, however, is that OpenAI did not, as usual, publish the training data, the source code, and the parameters of its model. These, the organization wrote, could be used for the automated mass production of misinformation. Some AI researchers sharply criticized the decision: “Any person can be misused to mislead other people and spread lies and conspiracy theories,” mocked deep learning pioneer Yann LeCun on Twitter. “Should we stop having babies?”
At least part of the research community is more interested in the technical possibilities of the software than in its diffuse dangers. Hendrik Strobelt and Sebastian Gehrmann of the MIT-IBM Watson AI Lab and Harvard University, for example, suggested using the technology to expose computer-generated language. Language models generate sentences by predicting the next word in a sequence; so if most words in a text are easily predictable, they argue, the text was probably written by a machine.
Strobelt and colleagues programmed a test tool called the “Giant Language Model Test Room” (GLTR) using the slimmed-down model OpenAI did release. It colors words that are easily predictable for an AI green; less likely words are colored yellow, red, or violet. Not surprisingly, it marked the unicorn text predominantly green.
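The coloring principle can be illustrated with a deliberately simple stand-in for GLTR's language model. The bigram model, training text, and rank thresholds below are assumptions for this sketch, not GLTR's actual setup: each word is bucketed by how high it ranks among the model's predicted continuations of the previous word.

```python
from collections import Counter, defaultdict

# Tiny illustrative training text (an assumption for this sketch).
training = (
    "the unicorns lived in a remote valley . "
    "the scientists found the valley in the andes . "
    "the unicorns spoke perfect english ."
).split()

# Count next-word frequencies for every word in the training text.
bigrams = defaultdict(Counter)
for prev, nxt in zip(training, training[1:]):
    bigrams[prev][nxt] += 1

def color(prev, word):
    """Bucket a word by its predictability rank after `prev`."""
    ranked = [w for w, _ in bigrams[prev].most_common()]
    if word not in ranked:
        return "violet"      # the model never saw this continuation
    rank = ranked.index(word)
    if rank == 0:
        return "green"       # the model's top prediction
    return "yellow" if rank < 3 else "red"

text = "the unicorns lived in the andes .".split()
print([color(p, w) for p, w in zip(text, text[1:])])
```

A text written by (a clone of) the model itself comes out mostly green; human text, or text from a different model, lands in the yellow-to-violet range far more often, which is exactly the signal GLTR visualizes.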
Janelle Shane, who runs the blog Letting Neural Networks Be Weird and trains neural networks to invent names for beers, for example, subjected the tool to a stricter test. Instead of feeding it only GPT-2-generated text, she also checked passages from other language models.
The result: the software could not predict many of the words in those texts. At least for now, machines cannot protect us from fake news written by machines.