Since the release of OpenAI’s Generative Pretrained Transformer (GPT) models, large language models (LLMs) have demonstrated remarkable capabilities in medical reasoning and knowledge-based examinations. After the introduction of GPT-3.5 in late 2022, many studies assessed its performance on medical licensing exams such as the United States Medical Licensing Examination (USMLE). Researchers found substantial variability in responses to diagnostic questions across successive model versions, reflecting the probabilistic nature of LLM token generation. This variability underscores an inherent limitation of LLMs: the same question can elicit multiple reasoning paths, some correct and others incorrect or hallucinated. Yet these divergent reasoning pathways can also contain complementary insights. This study examined whether structured deliberation among multiple AI models can improve diagnostic accuracy and reasoning reliability.
This study introduced a “Council of AI Agents,” a collaborative framework comprising multiple GPT-4 instances designed to simulate collective reasoning. The researchers hypothesized that deliberation among multiple agents would reduce variability, promote consensus, and improve correctness compared with individual responses or simple majority voting.
Five autonomous GPT-4 agents were programmed to answer USMLE questions under the supervision of a Facilitator algorithm. Each agent independently restated the question, reasoned through the answer options using medical logic, and proposed a single answer. The Facilitator compared the responses, identified disagreements, and iteratively re-prompted the agents to discuss until consensus or a stable split was reached.
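To make the Facilitator’s role concrete, the following Python sketch shows one way such a deliberation loop could be organized. It is illustrative only, not the authors’ implementation: the `ask` callable stands in for a GPT-4 agent call, and the round limit and fallback to majority voting are assumptions.

```python
# Illustrative sketch of a facilitator-mediated deliberation loop (not the authors' code).
from collections import Counter
from typing import Callable, List

def deliberate(question: str, ask: Callable[[str], str],
               n_agents: int = 5, max_rounds: int = 10) -> str:
    """Re-prompt agents with each other's answers until they agree (or rounds run out)."""
    # Round 0: each agent answers independently.
    answers: List[str] = [ask(question) for _ in range(n_agents)]
    for _ in range(max_rounds):
        if len(set(answers)) == 1:        # consensus reached
            return answers[0]
        # Facilitator summarizes the disagreement and asks every agent to reconsider.
        summary = ", ".join(f"agent {i + 1}: {a}" for i, a in enumerate(answers))
        follow_up = (f"{question}\n\nThe agents currently disagree ({summary}). "
                     "Reconsider the options, explain your reasoning, and give one final answer.")
        answers = [ask(follow_up) for _ in range(n_agents)]
    # Assumed fallback if no consensus within the round limit: majority vote.
    return Counter(answers).most_common(1)[0][0]
```

In practice, `ask` would wrap a chat-completion call to the underlying model; keeping it as a parameter keeps the deliberation logic independent of any particular LLM client.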
The dataset comprised 325 publicly available USMLE questions from the 2022 sample exams: Step 1 (94 questions), Step 2 Clinical Knowledge (CK) (109), and Step 3 (122). Questions containing tables or images were excluded. The researchers assessed accuracy against the official answer keys and used semantic entropy, a numerical measure of disagreement, to evaluate how the Council’s deliberation converged over time. High entropy indicated greater variability in answers, and low entropy indicated increased agreement among agents.
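As a rough illustration of how such a disagreement measure behaves (not necessarily the authors’ exact formulation, which may group semantically equivalent responses), Shannon entropy over the distribution of the agents’ answer choices is high when the agents scatter across options and zero at consensus:

```python
# Shannon entropy of the agents' answer distribution (illustrative stand-in for semantic entropy).
import math
from collections import Counter

def answer_entropy(answers: list) -> float:
    """Entropy in bits over the answer-choice distribution; 0 means full agreement."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(answer_entropy(["C", "C", "B", "C", "A"]))  # ~1.37 bits: substantial disagreement
print(answer_entropy(["C", "C", "C", "C", "C"]))  # 0.0: consensus
```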
The results showed a marked improvement in performance with deliberation. The Council’s consensus accuracy reached 97% for Step 1, 93% for Step 2CK, and 94% for Step 3, compared with individual model accuracies of 77% to 79%. About 22% of the questions required deliberation because at least one agent made an initial error. The Council also outperformed simple majority voting (95% vs 91%), and analysis revealed that deliberation was highly effective at correcting errors: the odds of converting an incorrect majority into a correct consensus were five times greater than the reverse, with only one case of regression from a correct to an incorrect answer. However, the Council could not generate a correct consensus when all agents were initially wrong, underscoring the need for at least one valid reasoning path.
Deliberation averaged 3.6 rounds for Step 1, 4.1 for Step 2CK, and 2.4 for Step 3. Semantic entropy decreased steadily across rounds, confirming progressive convergence. Notably, entropy approached zero even when the final consensus was incorrect, indicating that deliberation ensures agreement but not necessarily truth.
The Council of AI Agents effectively transformed individual response variability into a collaborative advantage and produced the highest AI accuracy reported on the USMLE to date. The framework showed that structured, multi-agent deliberation can improve the interpretability and reliability of reasoning. Limitations include high computational cost, token limits that constrain the length of discussion, reliance on a single underlying model (GPT-4), and the Council’s dependence on at least one initially correct answer. Future research should explore heterogeneous councils that combine different LLMs, such as Claude, GPT, and Bard, and parallelize deliberation to improve robustness and efficiency. This study highlights the potential of collective AI reasoning to improve medical knowledge assessment and complex decision-making.
Reference: Shaikh Y, Jeelani-Shaikh ZA, Jeelani MM, Javaid A, Mahmud T, Gaglani S, et al. Collaborative intelligence in AI: evaluating the performance of a council of AIs on the USMLE. PLOS Digit Health. 2025;4(10):e0000787. doi:10.1371/journal.pdig.0000787