In June 2023, a federal judge in the Southern District of New York sanctioned two attorneys and their law firm after they submitted a brief citing cases that did not exist. The cases had not been misquoted or mischaracterized. They had never existed at all. ChatGPT invented them, complete with realistic-sounding party names, docket numbers, and holdings. The attorneys filed the brief. The court sanctioned them $5,000 and required them to notify the judges whose names appeared in the fabricated opinions.
The case was Mata v. Avianca, Inc., and it became the most-cited example of what the legal technology industry now calls "hallucination." But hallucination is not a bug that will be patched in the next model update. It is a structural property of how large language models work. Understanding the architecture is the only way to know which AI tools are safe to use in legal practice and which are not.
What a Large Language Model Actually Does
When you ask ChatGPT or a similar general-purpose language model for a case citation, you are not querying a database. You are asking a statistical system to predict what a plausible-sounding answer looks like.
Large language models are trained on massive amounts of text. During training, the model learns statistical relationships between words, phrases, and concepts. When you give it a prompt, it generates a response one token at a time, each time choosing a likely next token given everything that came before. It is, at its core, a very sophisticated text completion system.
This architecture is extraordinarily capable for many tasks. It can summarize documents, explain complex concepts, draft correspondence, and assist with analysis. But it has a fundamental limitation when it comes to factual recall: the model does not retrieve information from a stored record. It generates text that looks like what the answer should look like.
When the model has seen enough examples of case citations during training, it knows what they structurally look like: party names, a reporter, a volume number, a page number, a year, a court. It can generate output in that format confidently and fluently, regardless of whether the underlying case actually exists in any reporter. There is no lookup. There is no verification. There is only prediction.
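A toy sketch of that point, in Python. This is not how a language model works internally; it is a deliberately crude stand-in that fills a citation template from lists of plausible parts. Every name and value below is invented for illustration. What it shares with a real model is the missing step: nothing checks the output against a reporter.

```python
import random

# Hypothetical fragments of the kind a model absorbs from training text.
# None of these lists is a database of real cases.
PARTIES = ["Harlan", "Whitcombe", "Delmar Freight Co.", "Ostrander", "Pell Airlines"]
REPORTERS = ["F.3d", "F. Supp. 2d", "F.4th"]
COURTS = ["2d Cir.", "11th Cir.", "S.D.N.Y."]


def citation_shaped_text(rng: random.Random) -> str:
    """Assemble a string in citation format. No lookup, no verification."""
    plaintiff, defendant = rng.sample(PARTIES, 2)
    volume = rng.randint(100, 999)
    page = rng.randint(1, 1500)
    year = rng.randint(1995, 2022)
    reporter = rng.choice(REPORTERS)
    court = rng.choice(COURTS)
    return f"{plaintiff} v. {defendant}, {volume} {reporter} {page} ({court} {year})"


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        # Each line is well-formed and entirely fictional.
        print(citation_shaped_text(rng))
```

The output is well-formed every time, which is exactly the problem: format is not evidence of existence.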
A language model "hallucinates" when it generates output that is grammatically and stylistically correct but factually false. For legal citations, this means case names, docket numbers, and holdings that sound real but are not. The model is not lying. It is completing text in the most statistically probable direction, and sometimes that direction leads somewhere that does not exist.
Why General LLMs Are the Wrong Tool for Case Research
Mata v. Avianca was not an anomaly. Since 2023, courts in multiple jurisdictions have seen filings with fabricated citations. Courts have responded with standing orders and guidance on AI use, the ABA addressed generative AI in Formal Opinion 512 (2024), and state bars have issued their own guidance. The pattern is consistent: attorneys who treat a general-purpose chatbot as a legal research database get burned.
The problem is not that general LLMs are bad tools. The problem is that they are the wrong tool for a specific task. Asking ChatGPT whether Procopio v. Wilkie changed the law on Blue Water Navy presumptions is a reasonable use of the tool: you get a summary of what the model learned about the case during training, and you can verify it elsewhere. But asking ChatGPT to find you three cases supporting a specific proposition and give you the citations is an invitation to disaster, because the model will complete that task with the same confident fluency whether or not the cases it names actually exist.
"The court is not prepared to conclude that the mere use of an artificial intelligence tool for legal research constitutes misconduct. But such use is misconduct when the attorney files the result without verification."
Mata v. Avianca, Inc., S.D.N.Y. 2023 (paraphrased from the court's order)
That last point is worth dwelling on. The judge in Mata was not ruling that AI is categorically off-limits. The ruling was about the failure to verify. But verification only works if you have something real to verify against. When the citation is fabricated, there is nothing to find. The attorney who filed the brief in Mata did try to verify, and was further misled when the chatbot confirmed the fictional cases existed. That is the nature of the problem: a system optimized to generate plausible-sounding output will generate plausible-sounding confirmations too.
The Architectural Fix: Retrieval Before Generation
The approach that solves this problem is called Retrieval-Augmented Generation, or RAG. The key distinction is in the order of operations.
In a pure language model workflow:
- You submit a prompt.
- The model generates a response from its parametric memory (the knowledge encoded during training).
- The response may or may not correspond to anything real.
In a retrieval-augmented workflow:
- You submit a query.
- The system searches a real, bounded source set of actual documents and returns the most relevant ones.
- The language model generates a response grounded in those retrieved documents, with source citations pointing back to the originals.
The safety property is simple: if a document is not in the source set, it cannot appear in your results. The system cannot invent a case that is not there, because it does not generate cases from parametric memory. It retrieves them from a real archive and shows you the actual text.
A retrieval-based system is bounded by its source set. It can only return what is actually indexed. That constraint, which might sound like a limitation, is exactly what makes it safe for legal research. You are searching an archive of real documents, not prompting a text completion engine.
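The order of operations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular product's implementation: the three documents, their `DOC-` identifiers, and the court labels are invented placeholders, the keyword-overlap ranking stands in for a real search index, and the generation step is reduced to quoting the retrieved text where a production system would have a language model summarize it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    docket: str  # placeholder identifier, not a real docket number
    court: str
    text: str


# A tiny stand-in for the indexed source set. Contents are invented.
SOURCE_SET = [
    Document("DOC-001", "Example Court A",
             "Service in offshore waters qualifies for the presumption of exposure."),
    Document("DOC-002", "Example Court B",
             "The notice of appeal was filed after the statutory deadline."),
    Document("DOC-003", "Example Court A",
             "The presumption of exposure does not extend to service outside the listed areas."),
]

STOPWORDS = {"the", "of", "for", "to", "in", "was", "does", "not", "a", "after"}


def tokenize(text: str) -> set[str]:
    """Lowercase, strip punctuation, and drop common filler words."""
    words = {w.strip(".,;:").lower() for w in text.split()}
    return words - STOPWORDS


def retrieve(query: str, source_set: list[Document], k: int = 2) -> list[tuple[Document, int]]:
    """Step 1: rank indexed documents by words shared with the query."""
    q = tokenize(query)
    scored = [(doc, len(q & tokenize(doc.text))) for doc in source_set]
    hits = [(doc, score) for doc, score in scored if score > 0]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:k]


def answer(query: str, source_set: list[Document]) -> str:
    """Step 2: build the response only from what retrieval returned."""
    hits = retrieve(query, source_set)
    if not hits:
        return "No matching documents in the source set."
    return "\n".join(f'[{doc.docket}, {doc.court}] "{doc.text}"' for doc, _ in hits)


if __name__ == "__main__":
    print(answer("presumption of exposure", SOURCE_SET))
    print(answer("maritime salvage liens", SOURCE_SET))
```

Note what happens on the second query: with nothing relevant indexed, the sketch reports that it found nothing. A citation can only come out of `answer` if the document went into `SOURCE_SET` first.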
What This Looks Like in Practice
| Question | General LLM | Retrieval-Based Tool |
|---|---|---|
| Can it fabricate citations? | Yes. Generates plausible-sounding citations from statistical patterns. | No. Only returns documents that exist in the indexed source set. |
| Where does the answer come from? | Parametric memory encoded during training. No live lookup. | Actual documents retrieved from a real archive at query time. |
| Can you verify the source? | Only by searching independently. The model gives you no document to verify against. | Yes. Every result includes the verbatim passage, docket number, date, and court. |
| Is the source set current? | Bounded by training cutoff. New decisions may not be included. | Depends on how frequently the source set is updated. Auditable. |
| Does it know if a case is still good law? | May reflect outdated understanding with no indication of staleness. | Depends on tool. Auditable; can flag overruled authority when citator data is part of the source set. |
The Limits of Retrieval-Based Tools
It is worth being honest about what retrieval-based AI does not solve.
First, a retrieval system is only as good as its source set. If the relevant case is not indexed, it will not appear. This is a different failure mode than hallucination, but it is still a failure mode. Any retrieval-based tool should make its source set transparent: what is indexed, how current it is, and what the coverage gaps are.
Second, retrieval-based tools still use a language model to generate summaries and explanations. That generative step can introduce errors in interpretation even when the source documents are real. The discipline of reading the verbatim passage, not just the summary, matters.
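One mechanical check that supports this discipline is confirming that a quoted passage really appears, word for word, in the source document. The Python sketch below assumes you have the source text as a string; the function names and the sample sentence are hypothetical, and a real tool would also need to handle things like page breaks and footnote markers.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passage_is_verbatim(passage: str, source_text: str) -> bool:
    """True only if the quoted passage appears word for word in the source."""
    needle = normalize(passage)
    return bool(needle) and needle in normalize(source_text)


if __name__ == "__main__":
    # Invented sample text, used only to exercise the check.
    source = "The presumption applies only to veterans who served\nin the listed areas."
    quoted = "applies only to veterans who served in the listed areas"
    summary = "applies to veterans who served in the listed areas"

    print(passage_is_verbatim(quoted, source))   # True
    print(passage_is_verbatim(summary, source))  # False: the word "only" was dropped
```

The second call fails because the summary dropped a single limiting word, which is the kind of change a generated paraphrase can make without anyone noticing.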
Third, even a retrieved document can be a bad citation if the case has been overruled or the regulatory provision has been amended. Retrieval grounding eliminates the fabrication problem. It does not eliminate the good-law problem. That requires a citator layer on top.
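As a rough illustration of what a citator layer adds, the Python sketch below attaches a status to each retrieved result. The `CITATOR` table, the status labels, and the `DOC-` identifiers are all hypothetical; real citator data is far richer than a three-value flag. The design point is the default: a document with no citator entry is reported as not checked rather than silently treated as good law.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    GOOD = "good law"
    OVERRULED = "overruled"
    UNKNOWN = "not checked"


@dataclass(frozen=True)
class Flagged:
    docket: str
    status: Status


# Hypothetical citator data keyed by placeholder docket identifiers.
CITATOR = {
    "DOC-001": Status.GOOD,
    "DOC-003": Status.OVERRULED,
}


def flag_results(dockets: list[str], citator: dict[str, Status]) -> list[Flagged]:
    """Attach a status to each retrieved docket; missing entries stay UNKNOWN."""
    return [Flagged(d, citator.get(d, Status.UNKNOWN)) for d in dockets]


if __name__ == "__main__":
    for item in flag_results(["DOC-001", "DOC-003", "DOC-007"], CITATOR):
        print(f"{item.docket}: {item.status.value}")
```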
Retrieval-based AI eliminates the category of risk that got attorneys sanctioned in Mata. It does not replace professional judgment, citator verification, or careful reading of the source. It replaces the part of the workflow where a general chatbot could hand you a citation that never existed.
What to Look for in an AI Legal Research Tool
If you are evaluating AI tools for your practice, here are the questions worth asking:
- Is the source set explicit? The tool should tell you exactly what is indexed: which courts, which date ranges, how many documents, when it was last updated. If this is opaque, that is a warning sign.
- Does every result link to the source document? You should be able to read the original opinion. A result that gives you a summary but no way to open the underlying document has not actually improved your verification burden.
- Does it show you the verbatim passage? The specific language a court used matters. A paraphrase that drops a limiting word can change the meaning of a holding entirely.
- Does it surface any indication of good-law status? Not all tools do this. A tool that returns overruled authority without flagging it has a different kind of reliability problem than hallucination, but it is still a reliability problem.
- Is it designed for your specific practice area? A legal research tool optimized for contract disputes is not the same as a tool built on a VA disability law source set. Source depth matters for specialized practice areas.
The Broader Point
The legal profession is not going to stop using AI. The economics are too compelling and the tools are improving too fast. The question is not whether to use AI, but which tools use architectures that are safe for legal work and which do not.
The distinction is not subtle. A tool that generates citations from statistical patterns is fundamentally different from a tool that retrieves them from a real, bounded archive. One can fabricate a citation. The other can only return what is in its archive. That difference is not a matter of degree or how carefully you prompt the system. It is architectural.
Attorneys who understand this distinction can use AI tools aggressively and safely. Attorneys who do not are taking a risk that the courts have already demonstrated they will not excuse.
Case Strategy Services searches a real archive of CAVC, CAFC, OGC, and 38 CFR documents. Every result includes the source document, the verbatim passage, and the docket number. Nothing is generated from parametric memory.