Enhancing LLM collaboration for smarter, more efficient solutions

“Co-LLM” uses a general-purpose large language model to start replying to a prompt, with a “switch variable” intervening at certain words to call upon a more accurate answer from the expert model.

Image: Alex Shipps/MIT CSAIL

“Co-LLM” algorithm helps a general-purpose AI model collaborate with an expert large language model by combining the best parts of both answers, leading to more factual responses.

Alex Shipps | MIT CSAIL

September 16, 2024

Ever been asked a question you only knew part of the answer to? To give a more informed response, your best move would be to phone a friend with more knowledge on the subject.

This collaborative process can also help large language models (LLMs) improve their accuracy. Still, it’s been difficult to teach LLMs to recognize when they should collaborate with another model on an answer. Instead of using complex formulas or large amounts of labeled data to spell out where models should work together, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have envisioned a more organic approach.

Their new algorithm, called “Co-LLM,” can pair a general-purpose base LLM with a more specialized model and help them work together. As the former crafts an answer, Co-LLM reviews each word (or token) within its response to see where it can call upon a more accurate answer from the expert model. This process leads to more accurate replies to things like medical prompts and math and reasoning problems. Since the expert model is not needed at each iteration, this also leads to more efficient response generation.

To decide when a base model needs help from an expert model, the framework uses machine learning to train a “switch variable,” a tool that indicates, for each word in a response, which of the two LLMs is better suited to produce it. The switch is like a project manager, finding areas where it should call in a specialist. If you asked Co-LLM to name some examples of extinct bear species, for instance, the two models would draft an answer together. The general-purpose LLM begins to put together a reply, with the switch variable intervening at the parts where it can slot in a better token from the expert model, such as adding the year when the bear species became extinct.
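The token-by-token handoff described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: it assumes the switch can be treated as a function that returns a deferral probability, which is then compared against a threshold. The names `co_generate`, `scripted`, `defer_prob`, and `threshold`, along with the scripted toy "models," are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical toy interfaces (assumptions, not the paper's API):
NextToken = Callable[[List[str]], str]    # token prefix -> next token
DeferProb = Callable[[List[str]], float]  # token prefix -> P(defer to expert)

EOS = "<eos>"


def co_generate(
    prompt: List[str],
    base: NextToken,
    expert: NextToken,
    defer_prob: DeferProb,
    threshold: float = 0.5,
    max_tokens: int = 20,
) -> Tuple[List[str], List[str]]:
    """Generate token by token, deferring to the expert when the switch fires."""
    tokens = list(prompt)
    sources: List[str] = []
    for _ in range(max_tokens):
        # The switch decides, per token, which model writes the next token.
        use_expert = defer_prob(tokens) > threshold
        token = expert(tokens) if use_expert else base(tokens)
        if token == EOS:
            break
        tokens.append(token)
        sources.append("expert" if use_expert else "base")
    return tokens[len(prompt):], sources


def scripted(script: List[str], prompt_len: int) -> NextToken:
    """Toy stand-in for a model: replays a fixed answer, one token per step."""
    def next_token(tokens: List[str]) -> str:
        i = len(tokens) - prompt_len
        return script[i] if i < len(script) else EOS
    return next_token


prompt = ["What", "is", "a^3*a^2", "if", "a=5", "?"]
base = scripted(["The", "answer", "is", "125"], len(prompt))
expert = scripted(["The", "answer", "is", "3125"], len(prompt))


def defer_prob(tokens: List[str]) -> float:
    # Toy switch: defer right after "is", where the numeric answer is expected.
    return 0.9 if tokens[-1] == "is" else 0.1


answer, sources = co_generate(prompt, base, expert, defer_prob)
print(" ".join(answer))  # The answer is 3125
print(sources)           # ['base', 'base', 'base', 'expert']
```

In this toy run the base model supplies the scaffolding of the sentence and the expert is consulted for a single token. In the real system the switch is learned from domain-specific data rather than hand-written.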

“With Co-LLM, we’re essentially training a general-purpose LLM to ‘phone’ an expert model when needed,” says Shannon Shen, an MIT PhD student in electrical engineering and computer science and CSAIL affiliate who’s a lead author on a new paper about the approach. “We use domain-specific data to teach the base model about its counterpart’s expertise in areas like biomedical tasks and math and reasoning questions. This process automatically finds the parts of the data that are hard for the base model to generate, and then it instructs the base model to switch to the expert LLM, which was pretrained on data from a similar field. The general-purpose model provides the ‘scaffolding’ generation, and when it calls on the specialized LLM, it prompts the expert to generate the desired tokens. Our findings indicate that the LLMs learn patterns of collaboration organically, resembling how humans recognize when to call upon an expert to fill in the blanks.”

A combination of flexibility and factuality

Imagine asking a general-purpose LLM to name the ingredients of a specific prescription drug. It may reply incorrectly, necessitating the expertise of a specialized model.

To showcase Co-LLM’s flexibility, the researchers used data like the BioASQ medical set to couple a base LLM with expert LLMs in different domains, like the Meditron model, which is pretrained on unlabeled medical data. This enabled the algorithm to help answer inquiries a biomedical expert would typically receive, such as naming the mechanisms causing a particular disease.

With the added expertise of a model that specializes in biomedical data, you’d get a more accurate answer to such a question. Co-LLM also alerts users where to double-check answers.

Another example of Co-LLM’s performance boost: When tasked with solving a math problem like “a³ · a² if a=5,” the general-purpose model incorrectly calculated the answer to be 125. As Co-LLM trained the model to collaborate more with a large math LLM called Llemma, together they determined that the correct solution was 3,125.
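For readers who want to check the arithmetic, the product rule for exponents gives the result directly:

```latex
\[
a^3 \cdot a^2 = a^{3+2} = a^5,
\qquad
5^5 = 5 \cdot 5 \cdot 5 \cdot 5 \cdot 5 = 3125 .
\]
```

By comparison, 125 is 5 cubed, the value of the first factor alone.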

Co-LLM gave more accurate replies than fine-tuned simple LLMs and untuned specialized models working independently. Co-LLM can guide two models that were trained differently to work together, whereas other effective LLM collaboration approaches, such as “Proxy Tuning,” need all of their component models to be trained similarly. Additionally, Proxy Tuning requires all of its models to be used simultaneously to produce the answer, whereas MIT’s algorithm activates its expert model only for particular tokens, leading to more efficient generation.
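A rough way to see the efficiency argument is to count model calls per generated token. The sketch below is a back-of-the-envelope illustration under stated assumptions, not a benchmark: it assumes the base model runs at every step (so the switch can be evaluated) and the expert runs only on deferred tokens, and it compares that with a scheme in which every model runs at every step. The function names and the choice of `num_models=2` are hypothetical.

```python
from typing import List


def passes_with_deferral(sources: List[str]) -> int:
    """Model calls when the expert runs only on deferred tokens.

    Assumption: the base model runs once per generated token, and the
    expert runs once for each token labeled "expert".
    """
    base_passes = len(sources)
    expert_passes = sources.count("expert")
    return base_passes + expert_passes


def passes_all_models(num_tokens: int, num_models: int) -> int:
    """Model calls when every model runs for every generated token."""
    return num_tokens * num_models


# Which model wrote each token of a four-token answer (illustrative only).
sources = ["base", "base", "base", "expert"]

deferral_total = passes_with_deferral(sources)
simultaneous_total = passes_all_models(len(sources), num_models=2)
print(deferral_total, simultaneous_total)  # 5 8
```

Counting calls treats every model as equally expensive, which is a simplification. When the expert is the larger model, skipping it on most tokens would save more than the raw counts suggest.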

When to ask the expert

The MIT researchers’ algorithm highlights that imitating human teamwork more closely can increase accuracy in multi-LLM collaboration. To further elevate its factual precision, the team may draw from human self-correction: They’re considering a more robust deferral approach that can backtrack when the expert model doesn’t give a correct response. This upgrade would allow Co-LLM to course-correct so the algorithm can still give a satisfactory reply.

The team would also like to update the expert model (by training only the base model) when new information is available, keeping answers as current as possible. This would allow Co-LLM to pair the most up-to-date information with strong reasoning power. Eventually, the model could assist with enterprise documents, using the latest information it has to update them accordingly. Co-LLM could also train small, private models to work with a more powerful LLM to improve documents that must remain on the server.

“Co-LLM presents an interesting approach for learning to choose between two models to improve efficiency and performance,” says Colin Raffel, associate professor at the University of Toronto and an associate research director at the Vector Institute, who wasn’t involved in the research. “Since routing decisions are made at the token level, Co-LLM provides a granular way of deferring difficult generation steps to a more powerful model. The unique combination of model-token-level routing also provides a great deal of flexibility that similar methods lack. Co-LLM contributes to an important line of work that aims to develop ecosystems of specialized models to outperform expensive monolithic AI systems.”

Shen wrote the paper with four other CSAIL affiliates: PhD student Hunter Lang ’17, MEng ’18; former postdoc and Apple AI/ML researcher Bailin Wang; MIT assistant professor of electrical engineering and computer science Yoon Kim; and professor and Jameel Clinic member David Sontag PhD ’10. Kim and Sontag are both part of the MIT-IBM Watson AI Lab. Their research was supported, in part, by the National Science Foundation, the National Defense Science and Engineering Graduate (NDSEG) Fellowship, the MIT-IBM Watson AI Lab, and Amazon. Their work was presented at the Annual Meeting of the Association for Computational Linguistics.