On a Sunday morning in April, pathologist Thomas Montine ran one of the most surreal meetings of his life. In an online test interface for a system called the Virtual Lab, Montine constructed a team of six artificial-intelligence (AI) characters, all powered by a commercial large language model. He gave them specialities: he made a couple of them neuroscientists, one a neuropharmacologist and another a medicinal chemist. Then, he asked this virtual lab group to examine possible treatments for Alzheimer’s disease and discuss gaps in knowledge, barriers to progress and hypotheses to be tested — the same questions he has to consider in grant applications.
A few minutes later, he had a transcript of their conversation, which ran to more than 10,000 words. A virtual principal investigator had kicked things off: “Thank you all for joining this important meeting.”
Montine, who studies cognitive impairment at Stanford University in California, was testing an example of an emerging trend in AI-assisted science: using teams of chatbot specialists to develop a scientific idea as a lab team might do. The developers of these ‘co-scientist’ systems argue that such collaborative efforts can help researchers to think through research hypotheses rapidly, saving time and — in a more contentious assertion — producing new and important research ideas.
The most prominent team exploring this concept comprises researchers at the technology giant Google, who this February announced the results of early tests of their AI co-scientist with researchers1 (see also go.nature.com/3hmxuxm). The team has opened up the project to a group of trusted testers as it continues to develop the tool.
Google is not alone. A team including computational biologists at Stanford University announced its Virtual Lab system in November last year2 — a version of which Montine was playing with. And a group based at the Shanghai Artificial Intelligence Laboratory in China proposed a similar virtual-scientist system, called VirSci, last October3; the researchers are building it now.
Rick Stevens, a computer scientist at the University of Chicago and at Argonne National Laboratory, both in Illinois, says that he and other computationally adept researchers are creating their own such systems by setting up AI personas that then interact. “I mean, everybody can do it,” he says.
In many of these systems, the large language models (LLMs) involved don’t just bounce ideas off each other. They also search the Internet, execute code and interact with other software tools, making them part of ‘agentic AI’, a fuzzy term that refers to LLMs autonomously undertaking tasks, although in practice there is often a lot of human oversight. A group of AI agents can be woven into a larger system that can work on high-level problems for hours without getting distracted or confused, Stevens says.
“It’s not really fundamentally that different than having more colleagues, in some sense,” he says, “except that they don’t get tired, and they’ve been trained on everything.”
To explore what it’s like to work with a virtual team, Nature asked a few scientists to trial a version of the Stanford system and spoke to some who have used Google’s AI co-scientist. Does a network of chatbots talk like a room full of Nobel prizewinners or undergraduates? Are the ideas they generate nonsensical, boring and trivial, or smart, valuable and insightful?
Multiple personalities
All co-scientist systems assign roles or personalities to agents and get them to interact, but the details vary. The Virtual Lab, built by computer scientist Kyle Swanson and colleagues in James Zou’s group at Stanford University, comes with two default characters, both (for now) powered by the LLM GPT-4o from tech firm OpenAI in San Francisco, California. These characters are a principal investigator and a critic — an agent told to provide helpful feedback. The user (or the AI’s principal investigator) can then add as many agents as they like, of any kind, writing simple descriptions for each agent to guide the characters’ interactions. The user chooses how many turns the agents have to ‘speak’, and a meeting transcript is produced in minutes. The team is working on ways to train the agents on literature that is relevant to the characters’ described expertise (as others have done4), rather than merely telling them to assume specific roles.
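In outline, the loop driving such a meeting is simple enough to sketch in a few dozen lines of Python. The sketch below is not the Virtual Lab’s published code, only a minimal illustration of the pattern described above: a principal investigator, a critic and user-added specialists, each defined by a short role description, take turns over a fixed number of rounds while a shared transcript grows. The role descriptions, prompts and agenda are invented for the example; the GPT-4o model name follows the system described here.

```python
# Minimal sketch of a multi-agent 'lab meeting' loop (illustrative, not the
# Virtual Lab's actual implementation). Requires the openai package and an
# OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Role descriptions: a principal investigator and a critic come built in;
# the user adds as many specialists as they like.
agents = {
    "Principal Investigator": "You lead the meeting, set the agenda and summarize decisions.",
    "Critic": "You give constructive, sceptical feedback on the other agents' proposals.",
    "Neuroscientist": "You are an expert in neurodegeneration and cognitive impairment.",  # user-added
    "Medicinal Chemist": "You are an expert in small-molecule drug design.",               # user-added
}

def speak(role, description, transcript, agenda):
    """Ask one agent to contribute, given the meeting so far."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are the {role}. {description}"},
            {"role": "user", "content": f"Agenda: {agenda}\n\nMeeting so far:\n{transcript}\n\nGive your contribution."},
        ],
    )
    return response.choices[0].message.content

agenda = "Identify gaps, barriers and testable hypotheses for treating Alzheimer's disease."
transcript = ""
for round_number in range(3):          # the user chooses how many rounds the agents get
    for role, description in agents.items():
        turn = speak(role, description, transcript, agenda)
        transcript += f"\n[Round {round_number + 1}] {role}: {turn}\n"

print(transcript)                      # the 'meeting transcript' handed back to the user
```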
The Google co-scientist, developed by Google DeepMind’s Alan Karthikesalingam in London and Vivek Natarajan in Mountain View, California, and their colleagues, is an extension of the researchers’ work on AI that is specialized in biomedicine, including the LLM Med-PaLM.
In contrast to the Stanford system, the Google tool doesn’t let users assign scientific specialities to agents. Instead, agents have predefined specific functions: idea generation; reflection or critique; evolution of ideas; determining the proximity of ideas to reduce duplication; ranking; and meta-review. These six agents are powered by Google’s LLM Gemini 2.0.
Users prompt the system with a few sentences, including a goal and a desired format for output. They can choose to add background information, such as relevant papers. The agents collaborate to tackle the problem and search the Internet, then spit out a summary report that can be tens or hundreds of pages long. “The co-scientist is like a smart scientific partner, capable of seeing the obvious and non-obvious connections in a sea of research,” says Natarajan. “We hope to give scientists superpowers.”
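Google has not released the co-scientist’s code, but the division of labour it describes can be read as a fixed pipeline in which each stage is an agent with a single job. The sketch below is one such reading, not Google’s implementation: to keep the example self-contained it reuses the same generic chat-completion call as the earlier sketch as a stand-in for Gemini 2.0, and the prompts, the helper names (`ask`, `co_scientist`) and the liver-fibrosis query are illustrative.

```python
# Illustrative pipeline in the spirit of the description of Google's
# co-scientist: generation, reflection, evolution, proximity (de-duplication),
# ranking and meta-review. An interpretation for explanation only, not
# Google's implementation.
from openai import OpenAI

client = OpenAI()

def ask(instruction: str, material: str) -> str:
    """One agent call: an instruction (the agent's function) plus working material."""
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the real system is backed by Gemini 2.0
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": material},
        ],
    )
    return response.choices[0].message.content

def co_scientist(goal: str, background: str, n_ideas: int = 5) -> str:
    # 1. Generation: propose candidate hypotheses for the stated goal.
    ideas = [ask("Propose one concrete research hypothesis.", f"{goal}\n\n{background}")
             for _ in range(n_ideas)]
    # 2. Reflection: critique each idea and keep the critique alongside it.
    ideas = [f"{idea}\n\nCritique: {ask('Critique this hypothesis.', idea)}" for idea in ideas]
    # 3. Evolution: revise each idea in light of its critique.
    ideas = [ask("Revise the hypothesis to address the critique.", idea) for idea in ideas]
    # 4. Proximity: merge near-duplicates so the ranking is not padded with repeats.
    merged = ask("Merge hypotheses that say essentially the same thing; return one per line.",
                 "\n\n".join(ideas))
    # 5. Ranking: order the surviving ideas by novelty and likely impact.
    ranked = ask("Rank these hypotheses by novelty and impact, best first.", merged)
    # 6. Meta-review: write the summary report handed back to the scientist.
    return ask("Write a structured report covering the top-ranked hypotheses, "
               "their rationale and proposed experiments.", ranked)

print(co_scientist("Find repurposable drugs for liver fibrosis.",
                   "Focus on epigenomic changes in myofibroblast generation."))
```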
Like all LLMs, the models powering these agents sometimes hallucinate, making up text that can be wrong. But having a critic or judge in a multi-agent conversation tends to weed out things that don’t make sense, says Stevens. Besides, he adds, hallucinations can be useful for creative processes such as thinking outside of the box during brainstorming, as long as experts check that the output makes sense.
There is evidence that the multi-agent strategy improves output, compared with chatting to just one AI agent or bot. For instance, Zou has found that adding a critic to a conversation that used GPT-4o bumped up the model’s performance score by a couple of percentage points on graduate-level science tests, and improved its answers in test cases, including designing radiotherapy treatment plans5.
Google tested its AI co-scientist system to see whether human experts preferred its multi-agent answers over content produced by a lone chatbot1. The experts rated the novelty and impact of the ideas generated by the co-scientist system slightly higher than those generated by Gemini 2.0 or OpenAI’s o1.
Some research has even delved into the optimum number of agents and rounds of discussion. Computer scientist Nanqing Dong at the Shanghai Artificial Intelligence Laboratory and his colleagues, who are building the VirSci system, say that they see peak creativity with a team of eight agents each taking five turns to converse3. Swanson says that in his experience, adding more than three specialists often just leads to “wasted text”, and more than three rounds of conversation sometimes sends the agents off on a tangent.
Bright ideas
Stanford University medical researcher Gary Peltz, who often uses AI in his work, was one of the first testers of the Google AI co-scientist. He wanted to use it to find drugs to treat liver fibrosis. Because the AI system was still in development, he sent his query to an intermediary at Google. His prompt read: “Develop hypotheses about the genes and epigenomic changes required for myofibroblast generation in liver fibrosis and indicate what drugs should we test as new treatments for liver fibrosis.” It also included some paragraphs of detailed background information. He got the report back about a day later.
After some preliminaries, the report begins “We propose a novel hypothesis…” then works its way to concluding: “This research could have a profound impact on liver fibrosis research and therapeutic advancement.” (See ‘Testing an AI co-scientist’.)
“When I read it, I literally fell off my chair,” says Peltz. He had just written a grant proposal focused on the importance of epigenetic changes in liver fibrosis, and the AI had targeted the same theme for its proposed therapeutics.
The AI co-scientist suggested three drugs, and Peltz came up with two more (all of which are already approved to treat other conditions). Google paid Peltz to help accelerate lab testing, and over the next few months, Peltz’s lab tested all five drugs in its human organoid model. Two of the AI’s three suggestions showed promise for promoting liver regeneration and inhibiting fibrosis6, whereas neither of Peltz’s worked out.
The experience left him impressed, he says: “These LLMs are what fire was for early human societies.”
Other liver researchers, however, say that the AI’s suggestions for drugs were neither particularly innovative nor profound. “I personally think they are pretty common sense, not much insight really,” says Shuang Wang, who works on liver disease at the Icahn School of Medicine at Mount Sinai in New York City. Google’s Natarajan counters: “Sometimes things look obvious in hindsight”.
Peltz says he was “particularly struck by the fact that it didn’t prioritize the things that I prioritized”. For the most promising drug in the AI’s candidate list, called vorinostat, he could find only two papers in PubMed that relate to its use in treating liver fibrosis. His choices had many more hits, making them seem like more obvious candidates. He adds that reading the AI report was similar to his discussions with postdocs. “They have a completely different perspective on things than I would,” he says.
Stilted conversations
The code for Stanford’s Virtual Lab is available on the developer platform GitHub, but the team has whipped up a simplified, private web interface for testers who don’t have the computer-science chops to deal with code. This interface differs a little from the full system, says Swanson, but the experience is roughly the same.
In the paper introducing the Virtual Lab2, an AI team tackled designing biological components that could stick to a particular variant of the coronavirus SARS-CoV-2. In the first of a series of lab meetings mediated by human researchers, the AIs chose to focus on nanobodies (small antibody fragments) and selected four candidates to tweak. The researchers then asked the AI team to pick some existing software tools to redesign those nanobodies, and tasked specific AI agents with writing computer code to assess and rank the results. The process designed 92 nanobodies, of which 2 did indeed bind to the SARS-CoV-2 variant in lab tests.
The researchers whom Nature asked to try out the Virtual Lab didn’t go as far as to conduct suggested experiments or get their AI teams to write code. But they still found the AI helpful. Montine, for example, says his AI team did a great job of synthesizing the current knowledge (a task that other LLM systems can also handle well) and wrote answers to his grant-application questions with aplomb. “It went further than a postdoc would on a first try of writing a grant, and it only took like 2 minutes,” he says. “And it is a hoot to use.”

Cancer-genome researcher Francisco Barriga testing the Virtual Lab system. Credit: Francisco Barriga
Another tester, cancer-genome researcher Francisco Barriga at the Vall d’Hebron Institute of Oncology in Barcelona, describes himself as a biochemist by training and a mouse modeller and genome engineer by choice, with zero coding skills and little experience with AI. He went into the trial hesitantly, suspecting that he would serve as a kind of non-tech-savvy control.
Barriga tasked the Virtual Lab with designing mouse-model experiments to test specific biological compounds, known as type 1 interferons, for their ability to affect tumours or immune cells while using a minimal number of mice — a topic that he knows inside out (see ‘Testing an AI virtual lab’). The AI team suggested exactly what he would have done, Barriga says: it chose “the right models, the right experiments”.
Still, Barriga says he feels like something essential is lacking. “It definitely doesn’t feel like humans are behind this.” The AI agents take turns, often ‘speaking’ in numbered lists, and are never rude, interruptive or argumentative. “It’s missing some of those leaps of intuition that you’ll get from a random conversation with some, I don’t know, plant biologist, over a coffee at 3 p.m. in a random hallway.” He could, of course, add a plant biologist — or a quantum physicist, or anyone at all — to his Virtual Lab group, but hasn’t tried that yet.
“Maybe it’s good to bounce ideas off. But will it be a game changer in my day-to-day? I doubt it,” Barriga says. He adds that the system might be something his PhD students could consult: “If they ever run into trouble and I’m too busy, maybe I’m replaceable.”
Broader insights
A third tester approached by Nature, Catherine Brownstein, is a geneticist who works on orphan diseases at Boston Children’s Hospital in Massachusetts, and has more experience with AI tools. She says that she uses LLMs for speed, efficiency and to broaden her thinking. But she cautions that users typically have to be experts so that they can spot errors — in the past, chatbots have sent her on time-consuming wild-goose chases, with incorrect paper summaries forcing her to reread a paper and its references to be sure she hadn’t got things wrong. “You have to kind of know what you’re talking about, otherwise it’s really easy to get completely led astray,” she says.

Catherine Brownstein cautions that research expertise is still needed when using chatbots. Credit: Kevin Ferguson/Boston Children’s Hospital
However, when Brownstein used the Virtual Lab to critique a paper she was writing, she was startled — and grateful — when the AI suggested that she ask the patients where they felt the research should go next. This had not occurred to her, although she says it should have. “I was embarrassed,” she says. “I stopped and stared for a full minute, because I was just like, ‘Oh my God. How did I get so far away from my original passion of having patient-focused and -centred research?’”
A simple checklist — or a chat with a friend, chatbot or even a bartender — might have led to the same insight. Yet, she says, none of her colleagues who had read her paper had thought to mention it. “It was actually a very humbling moment.”