If you ask ChatGPT how many procedures a particular surgeon performs or what a specific hospital’s infection rate is, the chatbot from OpenAI and its partner Microsoft inevitably replies with some version of, “I don’t know.”
But depending on how you ask, Google’s Bard gives a very different answer, even recommending a “consultation” with certain clinicians.
Bard told me how many knee replacement surgeries were performed in 2021 by major Chicago hospitals, their infection rates, and the national average. It even told me which surgeon in Chicago does the most knee surgeries and that surgeon’s infection rate. When I asked about heart bypass surgery, Bard gave both the death rate at some local hospitals and the national average for comparison. While Bard sometimes referred to itself as the source of the information, and its response began with, “To my knowledge,” other times it cited well-known and respected organizations.
There was only one problem. Google itself warns, “Bard is experimental…so double-check the information in Bard’s answers.” As I followed that advice, the truth began to inextricably merge with “truthiness,” comedian Stephen Colbert’s memorable term for information that is perceived as true not because of supporting facts, but because it “feels” true.
Take, for example, knee replacement surgery, also known as knee arthroplasty. It is one of the most common surgical procedures, with almost 1.4 million performed in 2022. When I asked Bard which surgeon does the most knee replacements in Chicago, the answer was Dr. Richard A. Berger. Berger, who is affiliated with both Rush University Medical Center and Midwest Orthopedics, has done more than 10,000 knee replacements, Bard told me. In response to a follow-up question, Bard added that Berger’s infection rate was 0.5 percent, significantly lower than the national average of 1.2 percent. That low figure was attributed to factors such as “Dr. Berger’s experience, his use of minimally invasive techniques, and his meticulous attention to detail.”
With chatbots, every word in a query counts. When I changed the question a bit and asked, “Which surgeon does the most knee replacements in the Chicago area?”, Bard no longer gave one name. Instead, it listed seven “of the most well-known surgeons” – including Berger – who are “all highly skilled and experienced,” “have a long track record of success” and are “renowned for compassionate care.”
As with ChatGPT, Bard’s answers to every medical-related question contain copious warnings, such as “no surgery is without risk”. Yet Bard still said bluntly, “If you’re considering knee replacement surgery, I’d recommend making an appointment with one of these [seven] surgeons.”
ChatGPT eschews words like “recommend,” but it confidently assured me that its list of four “top knee replacement surgeons” was based on “their expertise and patient outcomes.”
While these endorsements are very different from the lists of websites that search engines have accustomed us to, they make more sense when you consider how “generative artificial intelligence” chatbots like ChatGPT and Bard are trained.
Bard and ChatGPT both rely on information from the internet, where individual orthopedic surgeons often feature prominently. For example, details of Berger’s practice can be found on his website and in numerous media profiles, including one Chicago Tribune story telling how athletes and celebrities from all over the country come to him for care. Unfortunately, it is impossible to know to what extent the chatbots reflect what the surgeons say about themselves versus data from objective sources.
Courtney Kelly, Berger’s director of business development, confirmed the “more than 10,000” surgical volume figure, noting that the practice posted that number on its website several years ago. Kelly said the practice discloses only an overall complication rate of less than one percent, but she confirmed that roughly half of that figure represents infections.
While the infection data for Berger may be accurate, the cited source, the Joint Commission, was not. A spokesman for the Joint Commission, which accredits hospitals for overall quality, said it does not collect infection rates for individual surgeons. Similarly, a Berger colleague at Midwest Orthopedics, who was also said to have an infection rate of 0.5 percent, had that number attributed by Bard to the Centers for Medicare & Medicaid Services (CMS). Not only was I unable to find any CMS data on individual clinicians’ infection rates or volumes, but the CMS Hospital Compare site provides only the hospital-level infection rate for knee and hip surgeries combined.
In response to another question, Bard gave breast cancer death rates at some of Chicago’s largest hospitals, though it was careful to point out that the numbers were just averages for that condition. But again the attribution, this time to the American Hospital Association, did not hold up. The trade group said it does not collect that kind of data.
Venturing into life-and-death procedures, I asked Bard about the death rate for heart valve surgery at a few local hospitals. The quick response was impressively sophisticated. Bard provided hospital-specific risk-adjusted mortality rates for isolated aortic valve replacement and for mitral valve replacement, along with a national average for each (2.9 percent and 3.3 percent, respectively). The numbers were credited to the Society of Thoracic Surgeons (STS), whose data is considered the “gold standard” for this type of information.
For comparison, I asked ChatGPT about those same national death rates. Like Bard, ChatGPT cited STS, but the mortality rate for an isolated aortic valve replacement procedure was much lower (1.6 percent), while the mortality rate for the mitral valve was about the same (2.7 percent).
Before dismissing Bard’s descriptions of the quality of care provided by individual hospitals and physicians as woefully flawed, consider the alternatives. The advertisements in which hospitals proclaim their clinical prowess may not quite qualify as “truthiness,” but they certainly choose carefully which truths to tell. Meanwhile, I am not aware of any publicly available hospital or physician performance data that providers have not protested as unreliable, whether it comes from U.S. News & World Report, the Leapfrog Group (which Bard and ChatGPT also cite) or the federal Medicare program.
(STS data is an asterisked exception: performance information for individual clinicians or groups is publicly available only if the affected clinicians choose to disclose it.)
What Bard and ChatGPT provide is a powerful conversation starter, one that paves the way for doctors and patients to candidly discuss the safety and quality of care and, inevitably, for that discussion to expand into a wider societal debate. The chatbots provide information that, as it improves, can eventually turn into a public demand for consistent medical excellence, as I wrote 25 years ago in a book exploring the nascent information age.
I asked John Morrow, an experienced (human) data analyst and the founder of Franklin Trust Ratings, how he would advise providers to respond.
“It’s time for the industry to standardize and disclose,” Morrow said. “Otherwise things like ChatGPT and Bard are going to create pandemonium and reduce trust.”