Sometimes, even the most powerful artificial intelligence systems struggle to recognize the simplest of words — such as “no,” “not,” or “doesn’t.”
In a recent study, MIT researchers found that vision-language models — widely used in medical imaging, manufacturing, and media search — often fail to correctly process negation words. And the consequences can be serious.
Take healthcare, for example. If a radiologist uses an AI model to speed up the diagnosis of chest X-rays and the patient shows swelling in the tissue but no enlarged heart, the AI’s failure to correctly interpret the word “no” could completely change the diagnosis.
“There’s a lot of justified excitement around large language models (LLMs) and vision-language models (VLMs),” Kumail Alhamoud, an MIT graduate student and lead author of the study, told Newsmax. “But this excitement can sometimes obscure the limitations of these systems, especially for professionals outside of AI. What seems like a simple or intuitive query to a human may be completely misunderstood by the model.”
Vision-language models used in AI chatbots are built on the transformer architecture originally developed by Google researchers. Transformer models excel at “capturing how a word can mean different things in different contexts.”
“What they effectively do is build a representation of all the words in a given sentence or paragraph together, rather than word-by-word,” Karin Verspoor, dean of the School of Computing Technologies at the Royal Melbourne Institute of Technology in Australia, told Newsmax.
Verspoor, who specializes in AI-driven natural language processing for biomedical data and scientific literature, explained that some words carry different meanings depending on context.
For example, the word “orange” is represented differently by transformer models in the sentences: “She wore an orange sweater” versus “She had an orange for a snack.” The model assigns unique mathematical representations based on the word’s meaning in each sentence.
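The effect is easy to see directly. The following minimal sketch, not drawn from the study, assumes the open-source Hugging Face transformers library, PyTorch, and the public bert-base-uncased checkpoint: it pulls out the contextual vector the model assigns to “orange” in each sentence and compares the two.

```python
# Minimal sketch (assumes "transformers", "torch", and the public
# "bert-base-uncased" checkpoint): show that a transformer assigns the word
# "orange" a different vector depending on the sentence it appears in.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["She wore an orange sweater.", "She had an orange for a snack."]
orange_vectors = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    orange_vectors.append(hidden[tokens.index("orange")])

# A cosine similarity noticeably below 1.0 shows the two "orange" vectors differ.
cos = torch.nn.functional.cosine_similarity(orange_vectors[0], orange_vectors[1], dim=0)
print(f"cosine similarity between the two 'orange' vectors: {cos.item():.3f}")
```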
However, negation words such as “not” and “no,” and prefixes such as “un-,” don’t carry distinct, context-specific meanings in the same way, making them harder for these models to process and represent accurately.
In medicine, capturing negative information is essential — it’s just as important to know what conditions a patient does not have as it is to know what they do.
Accurately identifying what has been ruled out is critical for reaching the right diagnosis.
As Verspoor explains, medical records are filled with negation phrases such as: “No family history of breast cancer,” “No evidence of anemia,” “No fracture,” “No abnormalities noted on exam.”
These statements play a vital role in guiding medical decisions and ensuring accurate patient care.
“If negation is missed or ignored, the system will make mistakes in understanding the characteristics of a patient,” she added. “It’s also important in contexts like clinical trials, where inclusion and exclusion criteria define which patients qualify.”
The MIT team tested vision-language models (VLMs) and found that they often performed at or below random chance when processing captions that contained negation.
In side-by-side comparisons where nearly identical captions differed only in whether they included a negation word like “not” or mentioned an object that was absent from the image, the models frequently chose the wrong answer.
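The flavor of that comparison can be reproduced with a short, hedged sketch. It is not the researchers’ benchmark; it assumes the public openai/clip-vit-base-patch32 checkpoint from Hugging Face, PyTorch, Pillow, and a placeholder image file, and simply asks one model to score an image against an affirmative caption and its negated twin.

```python
# Hedged sketch (assumes "transformers", "torch", "Pillow", and the public
# "openai/clip-vit-base-patch32" checkpoint; the image path is a placeholder):
# score one image against two nearly identical captions, one affirmative and
# one negated, to see whether the model distinguishes them.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")                     # placeholder image
captions = [
    "a chest x-ray with an enlarged heart",
    "a chest x-ray with no enlarged heart",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image            # image-text match scores

probs = logits.softmax(dim=-1)[0]
for caption, p in zip(captions, probs):
    print(f"{p:.2f}  {caption}")
# If the model ignores "no", the two captions receive nearly identical scores.
```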
The researchers discovered an “affirmation bias”—as Verspoor explained, VLMs tend to overlook negation and instead focus on identifying positive objects in an image.
This bias is a shortcut that stems from how these models are trained. Most VLMs learn from image-caption pairs that describe what is in the image, not what is absent.
As study co-author Marzyeh Ghassemi, an associate professor in MIT’s Department of Electrical Engineering and Computer Science, points out:
“No one writes captions that say: ‘A dog jumping a fence — without helicopters.’ So VLMs never learn what absence looks like.”
To address this shortcoming, the researchers created a synthetic dataset with captions that specifically describe what is not in each image.
Fine-tuning VLMs with this dataset improved their performance by 10% in tasks involving negated-image retrieval and by 30% in multiple-choice captioning accuracy.
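One simple way such captions could be generated, shown here purely as an illustrative sketch and not as the authors’ actual pipeline, is to template a “without X” clause onto an existing caption using an object known to be absent from the image. The object vocabulary, annotations, and template wording below are assumptions for the example.

```python
# Hedged sketch (not the study's pipeline): build a negated caption by naming
# an object that is known to be absent from the image. The vocabulary,
# annotations, and template wording are illustrative assumptions.
import random

VOCAB = ["helicopter", "umbrella", "bicycle", "cat"]     # candidate objects

def negated_caption(caption: str, present_objects: set) -> str:
    """Append a 'without X' clause for an object absent from the image."""
    absent = [obj for obj in VOCAB if obj not in present_objects]
    return f"{caption}, without a {random.choice(absent)}"

annotations = [  # (caption, objects actually present in the image)
    ("a dog jumping a fence", {"dog", "fence"}),
    ("a radiologist reading a chest x-ray", {"person", "monitor"}),
]

for caption, objects in annotations:
    print(negated_caption(caption, objects))
# Pairing each image with such a caption gives the model explicit examples of
# absence to learn from during fine-tuning.
```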