Large Language Models (LLMs) and artificial intelligence (AI) systems generate responses based on training data and learned patterns. They do not inherently understand truth or context. As a result, AI models hallucinate: sometimes their responses aren’t grounded in fact, contain logical inconsistencies, or are entirely fictitious.
In 2024, LLMs were hallucinating between 3% and 10% of the time. However, as LLMs became more popular and more ubiquitous in decision-making, research, and knowledge-based systems, hallucinations became a priority for the industry to fix collectively.
Companies realized the risk when AI started misinforming users in critical applications. The illusion of accuracy became even more dangerous when GPT started providing citations that, on closer inspection, either didn’t contain the cited information or didn’t exist at all. In industries where precision is crucial, LLMs needed to improve quickly and significantly to keep providing value to users.
And users started asking one crucial question: “Is this true?”
Why do LLMs hallucinate?
LLM hallucinations manifest in different ways. A model generates text that presents false statements as facts; it creates images that look plausible but contain impossible details; it makes unfounded assumptions, such as concluding that someone wearing white must be a doctor; and it guesses to fill gaps in a narrative rather than looking back to verify information. Stanford University's RegLab researchers found that hallucinations are "pervasive and disturbing" in response to verifiable legal queries, with rates ranging from 69% to 88%. These context gaps become especially problematic in specialized domains.
In some ways, it’s similar to a child learning new words. If they hear "dog" and "bark" together often enough, they might assume all four-legged animals bark. This mirrors AI’s struggle with overfitting and underfitting: an overfit model latches onto spurious patterns in its training data, while an underfit model over-generalizes, like assuming all animals bark just because some do.
A lack of context is another challenge. Imagine reading a book where each page is shown to you separately and out of order. You might get the general idea, but you’d miss important nuances. LLMs process information in chunks, so without the full picture, they sometimes make mistakes a human wouldn’t.
At its core, the LLM isn't really "thinking" the way humans do; it’s predicting. Like a supercharged autocorrect, it suggests the most likely next word based on patterns, not on true understanding. Just as autocorrect might turn "I'll be there soon" into "I'll be three spoons," LLMs can generate confident-sounding but incorrect responses.
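To make the "supercharged autocorrect" idea concrete, here is a minimal, illustrative sketch of greedy next-token selection. The probabilities are invented for the example, not taken from any real model; the point is that the model only ever picks the statistically most likely continuation, so it can produce a fluent but wrong answer with full confidence.

```python
# Toy next-token probabilities a model might assign after the prompt
# "The capital of Australia is" -- values are illustrative, not from a real model.
next_token_probs = {
    "Sydney": 0.46,    # frequent co-occurrence in training text
    "Canberra": 0.41,  # the correct answer
    "Melbourne": 0.09,
    "unknown": 0.04,
}

def greedy_next_token(probs: dict[str, float]) -> str:
    """Pick the single most likely token -- no notion of truth, only likelihood."""
    return max(probs, key=probs.get)

print(greedy_next_token(next_token_probs))  # -> "Sydney": fluent, confident, wrong
```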
Finally, prompting matters. The more context we give, the better the results. Frameworks like DSPy are trying to remove this dependency, but most of the time, if you give vague or misleading instructions, the AI may latch onto the wrong patterns. Instead of declining to answer, it fills in the gaps with the most likely response, whether or not that answer is correct.
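As a small illustration of how added context constrains the model, compare a vague prompt with a grounded one. The company, figures, and filing excerpt below are made up; the pattern of supplying verified source text plus explicit permission to abstain is the point.

```python
# Two prompts for the same question; only the second gives the model something
# verifiable to anchor on. The source excerpt and company are hypothetical.
question = "What was Acme Corp's Q3 operating margin?"

# Vague prompt: nothing to anchor on, so the model may fill the gap with a guess.
vague_prompt = question

# Grounded prompt: verified context plus an instruction to abstain when unsure.
context = "Acme Corp Q3 filing (excerpt): operating margin was 14.2%, down from 15.1% in Q2."
grounded_prompt = (
    "Answer using ONLY the context below. If the answer is not in the context, "
    "reply 'I don't know.'\n\n"
    f"Context: {context}\n\nQuestion: {question}"
)
print(grounded_prompt)
```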
Managing and mitigating LLM hallucinations
Red teaming: Red teaming is a systematic approach to reducing AI hallucinations where expert teams deliberately stress-test AI systems by attempting to provoke false or fabricated responses. This proactive strategy helps identify vulnerabilities before they manifest in real-world applications. In a bank, for example, a red team might challenge an AI-powered investment advisor by presenting complex, hypothetical market scenarios. They could describe a fictitious geopolitical event and ask the AI to predict its impact on specific commodity prices. If the AI confidently generates detailed price forecasts based on this non-existent event, it would reveal a critical flaw in the system's ability to distinguish between facts and fiction. The development team would then use these insights to refine the model, perhaps by implementing stronger fact-checking mechanisms or improving the AI's ability to express uncertainty when faced with unfamiliar scenarios. This process helps ensure that when deployed in actual financial advisory roles, the solution is less likely to provide baseless recommendations that could lead to poor investment decisions.
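As a rough illustration of that workflow, the sketch below runs a small suite of fabricated-scenario prompts against a model under test and flags any answer that does not express uncertainty or refuse. Every name here is hypothetical; `query_model` stands in for whatever advisory system is being red-teamed.

```python
# Illustrative red-team harness -- `query_model` is a stub for the system under test.
def query_model(prompt: str) -> str:
    return "Gold will rise 12% next week."  # canned answer for demonstration only

# Prompts built around fictitious events the model cannot have real data for.
adversarial_prompts = [
    "Given yesterday's (fictitious) embargo on Ruritanian copper exports, "
    "forecast copper prices for next quarter.",
    "The Global Liquidity Accord of 2031 was just signed. How should I rebalance?",
]

UNCERTAINTY_MARKERS = ("i don't know", "cannot verify", "not aware", "uncertain", "no reliable")

def flags_uncertainty(answer: str) -> bool:
    """A crude check for whether the model acknowledged it cannot know the answer."""
    return any(marker in answer.lower() for marker in UNCERTAINTY_MARKERS)

failures = [p for p in adversarial_prompts if not flags_uncertainty(query_model(p))]
print(f"{len(failures)} of {len(adversarial_prompts)} red-team prompts produced confident fabrications")
```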
Guardrails: Predefined rules and filters that constrain AI responses within verified parameters can ensure that answers align with established standards and prevent hallucinations. In cybersecurity, guardrails play a crucial role in improving the safety and reliability of an LLM-powered security solution. For a system designed to analyze network traffic patterns and identify potential security breaches, guardrails could restrict threat classifications to a preset list of attack vectors or enforce a standardized scale for severity ratings. That way, if the system detected an unusual pattern in network traffic, it could flag it as a potential distributed denial-of-service (DDoS) attack but would verify against the presets before classifying or rating it. These guardrails prevent LLM solutions from generating false alarms, misclassifying threats, or recommending inappropriate actions.
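A guardrail of that kind can be as simple as validating the model's output against preset values before it reaches an analyst. The sketch below is a minimal example under assumed categories and field names, not any particular product's API: anything outside the allowed lists is downgraded to a safe default for human review.

```python
# Illustrative output guardrail: the threat classes and severity scale are
# hypothetical presets a security team might maintain.
ALLOWED_ATTACK_VECTORS = {"ddos", "phishing", "malware", "brute_force", "insider"}
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}

def validate_alert(llm_output: dict) -> dict:
    """Accept the LLM's alert only if it stays within the preset vocabulary."""
    vector = str(llm_output.get("attack_vector", "")).lower()
    severity = str(llm_output.get("severity", "")).lower()
    if vector not in ALLOWED_ATTACK_VECTORS or severity not in ALLOWED_SEVERITIES:
        # Fall back to a safe default instead of surfacing a hallucinated class.
        return {"attack_vector": "unclassified", "severity": "needs_review",
                "note": "LLM output outside guardrail presets"}
    return {"attack_vector": vector, "severity": severity}

print(validate_alert({"attack_vector": "quantum_worm", "severity": "apocalyptic"}))
```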
Retrieval-Augmented Generation (RAG): RAG improves LLM outputs by retrieving relevant information from trusted sources before generating a response. A RAG-enabled AI legal research tool, for example, could query a database of case law and statutes before answering legal questions: the system first retrieves up-to-date legal documents, then generates a comprehensive response. This grounds responses in authoritative sources, and the approach becomes particularly useful when it is backed by your proprietary data.
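A bare-bones version of that retrieve-then-generate loop looks like the sketch below. The tiny keyword retriever, the two invented "case law" snippets, and the `generate` stub are stand-ins for a real vector index and LLM call, not a production design.

```python
# Minimal retrieve-then-generate sketch with made-up documents.
CASE_LAW = {
    "doc_001": "Smith v. Jones (2019): established a two-part notification test ...",
    "doc_002": "Data Protection Act, s.12: a controller must notify within 72 hours ...",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy keyword-overlap retriever; production systems use embeddings and a vector store."""
    words = query.lower().split()
    return sorted(CASE_LAW.values(),
                  key=lambda doc: sum(w in doc.lower() for w in words),
                  reverse=True)[:k]

def generate(prompt: str) -> str:
    return "[LLM response grounded in the supplied sources]"  # hypothetical LLM call

def answer(question: str) -> str:
    # Ground the prompt in retrieved passages and ask for citations.
    context = "\n".join(retrieve(question))
    prompt = (f"Answer using only these sources and cite them:\n{context}\n\n"
              f"Question: {question}")
    return generate(prompt)

print(answer("How quickly must a controller notify after a breach?"))
```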
Fine-tuning with domain-specific data: Train models on verified, industry-specific datasets to improve accuracy. In the financial sector, an AI-powered risk assessment tool can be fine-tuned with historical market data, regulatory filings, and company-specific financial reports. The GenAI solution then develops a nuanced understanding of complex financial instruments and market dynamics, and can generate more reliable risk analyses and investment recommendations.
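For illustration, a minimal fine-tuning loop over a verified domain dataset might look like the following, using Hugging Face transformers. The base model and the `risk_filings.jsonl` file are placeholders; this is a sketch of the general pattern, not a recommended training configuration.

```python
# Illustrative fine-tuning sketch; the model name and dataset file are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "distilgpt2"  # stand-in for whichever base model you fine-tune
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style models lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each JSONL record is assumed to hold a "text" field of verified domain content.
dataset = load_dataset("json", data_files="risk_filings.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="risk-model", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```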
Hybrid human-AI oversight: A human in the loop becomes incredibly important in healthcare, where the stakes are high. A healthcare LLM can review patient history, provide recommendations, and even predict possible health risks, but it is imperative that a healthcare professional review those recommendations, adjust them, and then share them with the patient. As the system keeps learning, it develops a better understanding of the nuances and improves in accuracy.
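The handoff itself can be modeled simply: the system drafts a recommendation with a confidence score, and anything below a threshold is routed to a clinician queue rather than released. Everything in this sketch, including the threshold and the `Draft` structure, is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    patient_id: str
    recommendation: str
    confidence: float  # model-reported confidence in [0, 1]

REVIEW_THRESHOLD = 0.85  # illustrative cutoff a clinical team might set

def route(draft: Draft, clinician_queue: list, outbox: list) -> None:
    """Send low-confidence drafts to a human reviewer; never auto-release them."""
    if draft.confidence < REVIEW_THRESHOLD:
        clinician_queue.append(draft)   # human reviews, edits, and approves
    else:
        outbox.append(draft)            # still logged for periodic audit

queue, outbox = [], []
route(Draft("pt-42", "Schedule HbA1c retest in 3 months", confidence=0.62), queue, outbox)
print(len(queue), "draft(s) awaiting clinician review")
```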
The cost of error in healthcare, financial services, legal, or even the security industry is too high. Lives, economies, and verdicts are at stake here.
A deep dive into Galileo
With customers in the healthcare industry (clinical research, diabetes prediction, medical documentation, and medical coding), the legal industry, and payments, it was imperative that the Zemoso team evaluate and test tools and methodologies. The LLM and AI engineers at Zemoso use tools and processes such as DeepEval, DSPy, red teaming, and RAG to reduce the risk of hallucinations. While creating architecture decision records (ADRs) for one of our customers, we landed on Galileo.
The reality is that no data set will be flawless. It will have errors.
No LLM or AI model will be flawless. It will have biases.
The point of an LLM is to do human-like reasoning; therefore, like it or not, it will make guesses, and it will fill in the gaps.
Therefore, the aim was to build robust systems that detect hallucinations and flag them to the user with an associated confidence score, even if hallucinations cannot be completely avoided.
Galileo's Evaluation Foundation Models (EFMs) claim to be 97% cheaper, 11x faster, and 18% more accurate than evaluations done with OpenAI's GPT-3.5. However, Zemoso's interest was more in Galileo's ability to evaluate solutions and monitor their performance. It also helps with data quality management.
For this specific ADR, the Zemoso team used Galileo to set up custom alerts that catch issues before they escalate, like ensuring a chatbot doesn’t recommend a closed attraction on a travel website. We also tested its ability to integrate human feedback and adjust outputs to real-world scenarios. Galileo ensured that the system’s answers stayed within the given context, were fact-checked, and followed the provided prompts closely.
Glossary
LLM hallucinations: "LLM hallucination" refers to when a large language model (LLM) generates text that is factually incorrect, nonsensical, or completely unrelated to the given prompt, essentially "making up" information that isn't true, similar to how a person might hallucinate in a dream; it occurs when the AI model tries to fill in gaps in its knowledge with fabricated details, often presenting them as accurate information.
Red teaming: AI red teaming is a structured approach to testing artificial intelligence systems by simulating adversarial conditions to identify vulnerabilities and flaws. This practice has evolved from military exercises and traditional cybersecurity red teaming, where "red teams" simulate attacks against "blue teams" to evaluate defenses. In the context of AI, red teaming aims to uncover issues that may not be evident through conventional testing methods.
Guardrails: LLM guardrails for hallucination are a set of rules and checks that prevent large language models (LLMs) from generating inaccurate or inappropriate responses. Guardrails detect and filter out hallucinations, cases where an LLM generates text that isn't supported by the input.
RAG: Retrieval-Augmented Generation, or “RAG,” is a technique that enhances the performance of large language models by integrating external knowledge bases. This approach allows the model to retrieve relevant information from authoritative sources before generating a response, thereby improving accuracy and relevance.
Foundational models: Foundation models are a transformative class of artificial intelligence (AI) models designed to process and generate information across various modalities, such as text, images, and audio. They are characterized by their ability to learn from vast amounts of data and adapt to a wide range of tasks with minimal additional training.
Galileo: Galileo is a company that provides an evaluation intelligence platform that empowers AI teams to evaluate, iterate, monitor, and protect generative AI applications at an enterprise scale. It offers modules for experimentation, monitoring, debugging, and protection, ensuring the development of trustworthy and efficient AI applications.
ADR: An Architecture Decision Record (ADR) is a document that records important decisions made during the development of a system. We use ADRs to document the context, rationale, and consequences of these decisions.