Joscha Bach: https://twitter.com/Plinz/status/1529013919682994176
Bragging rights are in constant flux, it would seem. As to whether these multimodal AI models do anything to address the criticism around resource utilization and bias, little is known at this point, but based on what is known, the answers seem to be “probably not” and “sort of,” respectively. And what about the actual intelligence part? Let’s look under the hood for a moment.
OpenAI notes that “DALL·E 2 has learned the relationship between images and the text used to describe them. It uses a process called ‘diffusion,’ which starts with a pattern of random dots and gradually alters that pattern towards an image when it recognizes specific aspects of that image.”
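Stripped of the trained neural networks, the process OpenAI describes boils down to a start-from-noise, refine-step-by-step loop. Here is a minimal toy sketch of that structure in Python; the `toy_denoiser` and the fixed target pattern are illustrative stand-ins, not anything from DALL-E 2’s actual implementation.

```python
import numpy as np

# Toy illustration of diffusion-style sampling: start from random dots and
# repeatedly apply a "denoiser" that nudges the pattern toward an image.
# Real models use a trained neural network here; this target and schedule
# are made up purely to show the shape of the loop.

rng = np.random.default_rng(0)
target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0  # the "image" the toy denoiser knows how to recover

def toy_denoiser(x, t):
    # Predict a slightly less noisy image: blend toward the target,
    # more aggressively as the remaining noise level t decreases.
    return x + (1.0 - t) * 0.2 * (target - x)

x = rng.normal(size=(8, 8))   # a pattern of random dots
steps = 50
for i in range(steps):
    t = 1.0 - i / steps       # noise level goes from 1 down toward 0
    x = toy_denoiser(x, t)

print(np.round(x, 1))         # the random dots have drifted toward the target
```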
Google notes that their “key discovery is that generic LLMs (e.g. T5), pre-trained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model”.
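The recipe Google describes can be sketched with off-the-shelf tools: take a frozen, text-only pre-trained encoder and use its output embeddings as the conditioning signal for an image model. The sketch below uses Hugging Face’s T5 classes with the small `t5-small` checkpoint as a stand-in (Imagen used a far larger T5-XXL encoder); the diffusion model that would consume these embeddings is omitted.

```python
# Minimal sketch: encode a prompt with a frozen, text-only pre-trained LLM
# encoder, producing embeddings an image diffusion model would condition on.
# This is not Imagen's code; t5-small is a small stand-in checkpoint.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")
encoder.eval()  # frozen: the text encoder is not trained further

prompt = "a horse riding an astronaut"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state

# Shape: (1, sequence_length, hidden_size). An image diffusion model would
# cross-attend to these embeddings at every denoising step.
print(text_embeddings.shape)
```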
While Imagen seems to rely heavily on LLMs, the process is different for DALL-E 2. However, both OpenAI’s and Google’s people, as well as independent experts, claim that those models show a form of “understanding” that overlaps with human understanding. The MIT Technology Review went as far as to call the horse-riding astronaut, the image that has become iconic for DALL-E 2, a milestone in AI’s journey to make sense of the world.
Gary Marcus, however, remains unconvinced. Marcus, a scientist, best-selling author, and entrepreneur, is well known in AI circles for his critiques of a number of topics, including the nature of intelligence and what’s wrong with deep learning. He was quick to point out deficiencies in both DALL-E 2 and Imagen, and to engage in public dialogue, including with people from Google.
Marcus shares his insights in an essay aptly titled “Horse rides astronaut.” His conclusion is that expecting those models to be fully sensitive to semantics that depends on syntactic structure (an astronaut riding a horse is not a horse riding an astronaut) is wishful thinking, and that the inability to reason is a general failure point of modern machine learning methods and a key place to look for new ideas.
Last but not least, in May 2022, DeepMind announced Gato, a generalist AI model. As ZDNet’s own Tiernan Ray notes, Gato is a different kind of multimodal AI model. Gato can work with multiple kinds of data to perform multiple kinds of tasks, such as playing video games, chatting, writing compositions, captioning pictures, and controlling a robotic arm stacking blocks.
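What makes that possible, at a high level, is that Gato serializes every modality into tokens in a single flat sequence, so one sequence model can be trained on all tasks at once. The sketch below illustrates only that serialization idea; the vocabulary ranges and toy tokenizers are assumptions for illustration, not DeepMind’s actual scheme.

```python
# Toy sketch of the generalist-model idea: map text, image patches, and
# actions into disjoint token-id ranges so they can share one sequence.
# These ranges and tokenizers are illustrative assumptions, not Gato's.
from typing import List

TEXT_VOCAB = 32_000        # token ids 0..31_999 for text
IMAGE_PATCH_BASE = 32_000  # ids for discretized image patches
ACTION_BASE = 48_000       # ids for discretized game/robot actions

def tokenize_text(words: List[str]) -> List[int]:
    return [hash(w) % TEXT_VOCAB for w in words]  # toy stand-in tokenizer

def tokenize_image(patches: List[int]) -> List[int]:
    return [IMAGE_PATCH_BASE + p for p in patches]

def tokenize_actions(actions: List[int]) -> List[int]:
    return [ACTION_BASE + a for a in actions]

# One episode: a caption, an observation, and the actions taken all land in
# the same sequence, so a single model is trained across all the tasks.
sequence = (
    tokenize_text(["stack", "the", "red", "block"])
    + tokenize_image([17, 4, 203])
    + tokenize_actions([2, 2, 5])
)
print(sequence)
```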
As Ray also notes, Gato does a so-so job at a lot of things. However, that did not stop people on the DeepMind team that built Gato from exclaiming that “The Game is Over! It’s about making these models bigger, safer, compute efficient, faster at sampling, smarter memory, more modalities.”
Language, goals, and the market power of the few
So where does all of that leave us? Hype, metaphysical beliefs, and enthusiastic outbursts aside, the current state of AI should be examined with sobriety. While the models released in the last few months are truly impressive feats of engineering, sometimes capable of producing amazing results, the intelligence they point to is not really artificial.
Human intelligence is behind the impressive engineering that generates those models. It is human intelligence that has built models that are getting better and better at what Alan Turing’s foundational paper, “Computing Machinery and Intelligence,” called “the imitation game,” which has come to be known popularly as “the Turing test.”
As Emily Tucker, Executive Director of the Center on Privacy & Technology (CPT) at Georgetown Law, writes, Turing replaced the question “can machines think?” with the question of whether a human can mistake a computer for another human.
Turing does not offer the latter question in the spirit of a helpful heuristic for the former; he does not say that he thinks these two questions are versions of one another. Rather, he expresses the belief that the question “can machines think?” has no value, and appears to affirmatively hope for a near future in which it is in fact very difficult, if not impossible, for human beings to ask themselves the question at all.
In some ways, that future may be fast approaching. Models like Imagen and DALL-E break when presented with prompts that require human-like intelligence to process. For most intents and purposes, however, those may be considered edge cases. What the DALL-Es of the world are able to generate is on par with the work of the most skilled artists.
The question then is: what is the purpose of it all? As a goal in itself, spending the time and resources something like Imagen requires in order to generate cool images at will seems rather misplaced.
Seeing this as an intermediate goal towards the creation of “real” AI may be more justified, but only if we are willing to subscribe to the notion that doing the same thing at an ever-bigger scale will somehow lead to different outcomes.