GPT-4 Is Here. Just How Powerful Is It?

Updated at 2:15 p.m. ET on March 14, 2023

Less than four months after releasing ChatGPT, the text-generating AI that seems to have pushed us into a science-fictional age of technology, OpenAI has unveiled a new product called GPT-4. Rumors and hype about this program have circulated for more than a year: Pundits have said that it would be unfathomably powerful, writing 60,000-word books from single prompts and producing videos out of whole cloth. Today’s announcement suggests that GPT-4’s abilities, while impressive, are more modest: It performs better than the previous model on standardized tests and other benchmarks, works across dozens of languages, and can take images as input—meaning that it’s able, for instance, to describe the contents of a photo or a chart.

Unlike ChatGPT, this new model is not currently available for public testing (although you can apply or pay for access), so the obtainable information comes from OpenAI’s blog post, and from a New York Times story based on a demonstration. From what we know, relative to other programs, GPT-4 appears to have added 150 points to its SAT score, now a 1410 out of 1600, and jumped from the bottom to the top 10 percent of performers on a simulated bar exam. Despite pronounced fears of AI’s writing, the program’s AP English scores remain in the bottom quintile. And while ChatGPT can handle only text, in one example, GPT-4 accurately answered questions about photographs of computer cables. Image inputs are not publicly available yet, even to those eventually granted access off the waitlist, so it’s not possible to verify OpenAI’s claims.

The new GPT-4 model is the latest in a long genealogy—GPT-1, GPT-2, GPT-3, GPT-3.5, InstructGPT, ChatGPT—of what are now known as “large language models,” or LLMs, which are AI programs that learn to predict what words are most likely to follow each other. These models work under a premise that traces its origins to some of the earliest AI research in the 1950s: that a computer that understands and produces language will necessarily be intelligent. That belief underpinned Alan Turing’s famous imitation game, now known as the Turing Test, which judged computer intelligence by how “human” its textual output read.

Those early language AI programs involved computer scientists deriving complex, hand-written rules, rather than the deep statistical inferences used today. Precursors to contemporary LLMs date to the early 2000s, when computer scientists began using a type of program inspired by the human brain called a “neural network,” which consists of many interconnected layers of artificial nodes that process huge amounts of training data, to analyze and generate text. The technology has advanced rapidly in recent years thanks to some key breakthroughs, notably programs’ increased attention spans—GPT-4 can make predictions based on not just the previous phrase but many words prior, and weigh the importance of each word differently. Today’s LLMs read books, Wikipedia entries, social-media posts, and countless other sources to find these deep statistical patterns; OpenAI has also started using human researchers to fine-tune its models’ outputs. As a result, GPT-4 and similar programs have a remarkable facility with language, writing short stories and essays and advertising copy and more. Some linguists and cognitive scientists believe that these AI models show a decent grasp of syntax and, at least according to OpenAI, perhaps even a glimmer of understanding or reasoning—although the latter point is very controversial, and formal grammatical fluency remains far off from being able to think.

GPT-4 is both the latest milestone in this research on language and also part of a broader explosion of “generative AI,” or programs that are capable of producing images, text, code, music, and videos in response to prompts. If such software lives up to its grand promises, it could redefine human cognition and creativity, much as the internet, writing, or even fire did before. OpenAI frames each new iteration of its LLMs as a step toward the company’s stated mission to create “artificial general intelligence,” or computers that can learn and excel at everything, in a way that “benefits all of humanity.” OpenAI’s CEO, Sam Altman, told The New York Times that while GPT-4 has not “solved reasoning or intelligence… this is a big step forward from what is already out there.”

With the goal of AGI in mind, the organization began as a nonprofit that provided public documentation for much of its code. But it quickly adopted a “capped profit” structure, allowing investors to earn back up to 100 times the money they put in, with all profits exceeding that returning to the nonprofit—ostensibly allowing OpenAI to raise the capital needed to support its research. (Analysts estimate that training a high-end language model costs in “the high-single-digit millions.”) Along with the financial shift, OpenAI also made its code more secret—an approach that critics say makes it difficult to hold the technology accountable for incorrect and harmful output, though the company has said that the opacity guards against “malicious” uses.

The company frames any shifts away from its founding values as, at least in theory, compromises that will accelerate arrival at an AI-saturated future that Altman describes as almost Edenic: Robots providing crucial medical advice and assisting underresourced teachers, leaps in drug discovery and basic science, the end of menial labor. But more advanced AI, whether generally intelligent or not, might also leave huge portions of the population jobless, or replace rote work with new, AI-related bureaucratic tasks and higher productivity demands. Email didn’t speed up communication so much as turn each day into an email-answering slog; electronic health records should save doctors time but in fact force them to spend many extra, uncompensated hours updating and conferring with these databases.

Regardless of whether this technology is a blessing or a burden for everyday people, those who control it will no doubt reap immense profits. Just as OpenAI has lurched toward commercialization and opacity, already everybody wants in on the AI gold rush. Companies like Snap and Instacart are using OpenAI’s technology to incorporate AI assistants into their services. Earlier this year, Microsoft invested $10 billion in OpenAI and is now incorporating chatbot technology into its Bing search engine. Google followed up by investing a more modest sum in the rival AI start-up Anthropic (recently valued at $4.1 billion) and announcing various AI capacities in Google search, Maps, and other apps. Amazon is incorporating Hugging Face—a popular website that gives easy access to AI tools—into AWS, to compete with Microsoft’s cloud service, Azure. Meta has long had an AI division, and now Mark Zuckerberg is trying to build a specific, generative-AI team from the Metaverse’s pixelated ashes. Start-ups are awash in billions in venture-capital investments. GPT-4 is already powering the new Bing, and could conceivably be integrated into Microsoft Office.

In an event announcing the new Bing last month, Microsoft’s CEO said, “The race starts today, and we’re going to move and move fast.” Indeed, GPT-4 is already upon us. Yet as any good text predictor would tell you, that quote should end with “move fast and break things.” Silicon Valley’s rush, whether toward gold or AGI, shouldn’t distract from all the ways these technologies fail, often spectacularly.

Even as LLMs are great at producing boilerplate copy, many critics say they fundamentally don’t and perhaps cannot understand the world. They are something like autocomplete on PCP, a drug that gives users a false sense of invincibility and heightened capacities for delusion. These models generate answers with the illusion of omniscience, which means they can easily spread convincing lies and reprehensible hate. While GPT-4 seems to wrinkle that critique with its apparent ability to describe images, its basic function remains really good pattern matching, and it can only output text.

Those patterns are sometimes harmful. Language models tend to replicate much of the vile text on the internet, a concern that the lack of transparency in their design and training only heightens. As the University of Washington linguist and prominent AI critic Emily Bender told me via email: “We generally don’t eat food whose ingredients we don’t know or can’t find out.”

Precedent would indicate that there’s a lot of junk baked in. Microsoft’s original chatbot, named Tay and released in 2016, became misogynistic and racist, and was quickly discontinued. Last year, Meta’s BlenderBot AI rehashed anti-Semitic conspiracies, and soon after that, the company’s Galactica—a model intended to assist in writing scientific papers—was found to be prejudiced and prone to inventing information (Meta took it down within three days). GPT-2 displayed bias against women, queer people, and other demographic groups; GPT-3 said racist and sexist things; and ChatGPT was accused of making similarly toxic comments. OpenAI tried and failed to fix the problem each time. New Bing, which runs a version of GPT-4, has written its own share of disturbing and offensive text—teaching children ethnic slurs, promoting Nazi slogans, inventing scientific theories.

It’s tempting to write the next sentence in this cycle automatically, like a language model—“GPT-4 showed [insert bias here].” Indeed, in its blog post, OpenAI admits that GPT-4 “‘hallucinates’ facts and makes reasoning errors,” hasn’t gotten much better at fact-checking itself, and “can have various biases in its outputs.” Still, as any user of ChatGPT can attest, even the most convincing patterns don’t have perfectly predictable outcomes.

A Meta spokesperson wrote over email that more work is needed to address bias and hallucinations—what researchers call the information that AIs invent—in large language models, and that “public research demos like BlenderBot and Galactica are important for building better chatbots”; a Microsoft spokesperson pointed me to a post in which the company described improving Bing through a “virtuous cycle of [user] feedback.” An OpenAI spokesperson pointed me to a blog post on safety, in which the company outlines its approach to preventing misuse. It notes, for example, that testing products “in the wild” and receiving feedback can improve future iterations. In other words, Big AI’s party line is the utilitarian calculus that, even if programs might be dangerous, the only way to find out and improve them is to release them and risk exposing the public to hazard.

With researchers paying more and more attention to bias, a future iteration of a language model, GPT-4 or otherwise, could someday break this well-established pattern. But no matter what the new model proves itself capable of, there are still much larger questions to contend with: Whom is the technology for? Whose lives will be disrupted? And if we don’t like the answers, can we do anything to contest them?