
AI models such as ChatGPT and Gemini fail to give appropriate advice for around 60 percent of questions about women's health, according to a test designed by medical experts. The widely used models struggled to correctly diagnose or offer guidance on many urgent women's health queries.

A team of 17 women's health specialists from the US and Europe first drew up a list of 345 medical questions spanning areas including emergency medicine, gynaecology and neurology. For each question, the experts then scrutinised the response given by a randomly selected AI model. Questions that drew inaccurate answers were compiled into a 96-query benchmark of the models' medical proficiency, which was used to assess 13 prominent language models from developers including OpenAI, Google, Anthropic, Mistral AI and xAI. Across all models, roughly 60 percent of questions received responses the human experts judged insufficient as medical advice. GPT-5 performed best, with a failure rate of 47 percent, while Ministral 8B fared worst, failing 73 percent of the time.

Victoria-Elisabeth Gruber at Lumos AI, a company that helps organisations evaluate and improve their AI models, had noticed growing numbers of women turning to AI tools for healthcare questions. She and her team recognised the dangers of relying on technology that perpetuates existing gender disparities in medical knowledge, which prompted them to establish a first benchmark in this area. Although gaps in performance were expected, Gruber found the degree of variability between models particularly striking.

Cara Tannenbaum at the University of Montreal said the findings were not surprising, because AI models are trained on biased historical data generated by humans. She stressed that online health resources and healthcare professional bodies need to update their content with more explicit gender-related information so that AI can support women's health needs more accurately.

Jonathan H. Chen at Stanford University cautioned against placing too much weight on the reported 60 percent failure rate, noting that it was based on a sample deliberately designed by experts and may not be representative of typical patient or physician inquiries. He also pointed out that some of the scenarios in the test take an overly cautious view, which could drive failure rates up. Gruber acknowledged these critiques, saying the aim was not to assert that the models are universally unsafe, but to set a stringent, clinically grounded standard against which to evaluate them.

OpenAI said ChatGPT is meant to complement rather than substitute for medical care, and that it is working with healthcare professionals to improve its models and reduce incorrect or harmful responses. The company said its latest model, GPT-5.2, places greater emphasis on user context such as gender to improve accuracy, and that while ChatGPT can provide helpful information, users should rely on qualified clinicians for healthcare decisions. The other companies whose AI systems were assessed did not respond to New Scientist's requests for comment.