Last year, headlines describing artificial intelligence (AI) research were attention-grabbing, to say the least.
At first glance, the idea that an AI-powered chatbot could generate good answers to patient questions is not surprising. After all, ChatGPT has reportedly passed a Wharton MBA final exam, written a book in a matter of hours, and composed original music.
But showing more empathy than your doctor? Ouch. Before we award final honors for quality and empathy to either contestant, let's take a second look.
What tasks is AI taking on in healthcare?
AI's rapidly growing list of medical applications already includes drafting doctors' notes, suggesting diagnoses, helping to read X-rays and MRI scans, and monitoring real-time health data such as heart rate and oxygen levels.
But the idea that AI-generated responses might be more empathetic than actual doctors struck me as amazing, and a bit sad. How could even the most advanced machine outdo a doctor in demonstrating this important and distinctly human virtue?
Can artificial intelligence provide good answers to patients’ questions?
This is an intriguing question.
Imagine you called your doctor's office with a question about one of your medications. Later that day, a clinician on your healthcare team calls you back to discuss it.
Now imagine a different scenario: you send your question by email or text message, and within minutes you receive an answer generated by a computer using AI. How would the medical answers in these two situations compare in quality? And in empathy?
To answer these questions, researchers collected 195 exchanges from an online social media site on which anonymous users posed medical questions and volunteer physicians answered them. The same questions were then submitted to ChatGPT, and the chatbot's responses were collected.
A panel of three physicians or nurses then rated both sets of responses for quality and empathy, answering the question "Which response is better?" on a five-point scale. The quality options were very poor, poor, acceptable, good, and very good. The empathy options were not empathetic, slightly empathetic, moderately empathetic, empathetic, and very empathetic.
What did the study find?
The results weren't even close: the evaluators judged ChatGPT's responses better than the physicians' for nearly 80% of the questions.
- Good or very good quality answers: ChatGPT received these ratings for 78% of its responses; physicians, for only 22%.
- Empathetic or very empathetic answers: ChatGPT received these ratings for 45% of its responses; physicians, for just 4.6%.
Notably, the physicians' responses were much shorter (52 words on average) than ChatGPT's (211 words on average).
As I said, it wasn't even close. So were all those breathless headlines warranted?
Not so fast: Important limitations of this AI research
The study was not designed to answer two key questions:
- Do AI responses offer accurate medical information and improve patient health while avoiding confusion or harm?
- Will patients accept the idea that the questions they ask their doctor can be answered by a bot?
And there were some serious limitations:
- Evaluating and comparing the answers: The evaluators applied untested, subjective criteria for quality and empathy. Importantly, they never assessed the actual accuracy of the answers. Nor were answers scored for fabrication, a problem that has been noted with ChatGPT.
- Differences in response length: Longer, more detailed responses may appear to reflect patience or concern, so the higher empathy scores may have had more to do with word count than with genuine empathy.
- Incomplete blinding: To minimize bias, the raters were not supposed to know whether a response came from a physician or from ChatGPT, a common research technique called "blinding." But AI-generated text doesn't always sound exactly human, and the AI responses were significantly longer. So for at least some responses, the raters likely weren't truly blinded.
The bottom line
Can doctors learn something about expressing empathy from AI-generated responses? Possibly. Could AI work well as a collaborative tool, generating responses that a physician reviews and revises? Some health systems are already using AI in exactly this way.
But it seems premature to rely on AI answers to patients' questions without solid proof of their accuracy and real oversight by healthcare professionals. This study wasn't designed to provide either.
And by the way, ChatGPT agrees: I asked it whether it could answer medical questions better than a doctor. Its answer was no.
We’ll need more research to know when it’s time to unleash the AI genie to answer patients’ questions. We may not be there yet, but we’re getting closer.
Want more information about the research? Read the responses written by the doctors and the chatbot, including their answers to a question about the consequences of swallowing a toothpick.