Large language models (LLMs) are increasingly being used to power tasks that require extensive information processing. Several companies have rolled out specialized tools that use LLMs and information retrieval systems to assist in legal research.
However, a new study by researchers at Stanford University finds that despite claims by providers, these tools still suffer from a significant rate of hallucinations, or outputs that are demonstrably false.
The study, which according to the authors is the first “preregistered empirical evaluation of AI-driven legal research tools,” tested products from major legal research providers and compared them to OpenAI’s GPT-4 on over 200 manually constructed legal queries. The researchers found that while hallucinations were reduced compared to general-purpose chatbots, the legal AI tools still hallucinated at an alarmingly high rate.
The challenge of retrieval-augmented generation in law
Many legal AI tools use retrieval-augmented generation (RAG) to mitigate the risk of hallucinations. Unlike plain LLM systems, which rely solely on the knowledge acquired during training, RAG systems first retrieve relevant documents from a knowledge base and provide them to the model as context for its response. RAG is widely treated as the gold standard for enterprises that want to reduce hallucinations across domains.
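The products in the study do not disclose their internals, so purely as an illustration of the general pattern described above, here is a minimal RAG sketch in Python using the OpenAI client. The vector `index` object and its `search` method are hypothetical placeholders, not any provider's actual API, and the prompt wording is an assumption.

```python
# Minimal, illustrative RAG sketch. The `index` object and its .search()
# method are hypothetical stand-ins for a vector store; nothing here is
# taken from the legal research products discussed in the article.
from openai import OpenAI

client = OpenAI()

def retrieve(query: str, index, k: int = 5) -> list[str]:
    """Embed the query and return the k most similar documents from the index."""
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    return index.search(query_embedding, top_k=k)  # hypothetical index API

def answer_with_rag(query: str, index) -> str:
    """Retrieve supporting documents, then ask the model to answer
    using only that retrieved context."""
    context = "\n\n".join(retrieve(query, index))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "Cite the passages you rely on."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```

The point of the pattern is that the model's answer is grounded in retrieved text rather than in whatever it memorized during training, which is why, as the study confirms, it reduces but does not eliminate hallucinations.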
However, the researchers note that legal queries often do not have a single clear-cut answer that can be retrieved from a set of documents. Deciding what to retrieve can be challenging, as the system may need to locate information from multiple sources across time. In some cases, there may be no available documents that definitively answer the query if it is novel or legally indeterminate.
Moreover, the researchers warn that hallucinations are not well defined in the context of legal research. In their study, they count a model's response as a hallucination if it is either incorrect or misgrounded, meaning the facts it states are correct but do not apply to the legal matter being discussed. “In other words, if a model makes a false statement or falsely asserts that a source supports a statement, that constitutes a hallucination,” they write.
The study also points out that document relevance in law is not based on text similarity alone, which is how most RAG systems work. Retrieving documents that only seem textually relevant but are actually irrelevant can negatively impact the system’s performance.
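To make concrete what retrieval by “text similarity” means, the toy sketch below ranks documents purely by cosine similarity between embedding vectors, which is how a passage that merely resembles the query's wording can outrank the legally controlling authority. This is an illustrative assumption about typical RAG retrievers, not any vendor's actual retrieval code.

```python
# Illustrative similarity-based ranking: documents are ordered solely by how
# close their embeddings are to the query embedding, with no notion of legal
# authority, jurisdiction, or currency.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query_vec: np.ndarray, doc_vecs: list[np.ndarray]) -> list[int]:
    """Return document indices sorted from most to least textually similar."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```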
“Our team had conducted an earlier study that showed that general-purpose AI tools are prone to legal hallucinations — the propensity to make up bogus facts, cases, holdings, statutes, and regulations,” Daniel E. Ho, Law Professor at Stanford and co-author of the paper, told VentureBeat. “As elsewhere in AI, the legal tech industry has relied on [RAG], claiming boldly to have ‘hallucination-free’ products. This led us to design a study to evaluate these claims in legal RAG tools, and we show that in contrast to these marketing claims, legal RAG has not solved the problem of hallucinations.”
Evaluating legal AI tools
The researchers designed a diverse set of legal queries representing real-life research scenarios and tested them on three leading AI-powered legal research tools: Lexis+ AI from LexisNexis, and Westlaw AI-Assisted Research and Ask Practical Law AI from Thomson Reuters. Though the tools are not open source, all three indicate that they use some form of RAG behind the scenes.
The researchers manually reviewed the tools' outputs and compared them against GPT-4 without RAG as a baseline. The study found that all three tools perform significantly better than GPT-4 but are far from perfect, hallucinating on 17% to 33% of the queries.
The researchers also found that the systems struggled with basic legal comprehension tasks that require close analysis of the sources cited by the tools. The researchers argue that the closed nature of legal AI tools makes it difficult for lawyers to assess when it is safe to rely on them.
However, the authors note that despite their current limitations, AI-assisted legal research can still provide value compared to traditional keyword search methods or general-purpose AI, especially when used as a starting point rather than the final word.
“One of the positive findings in our study is that legal hallucinations are reduced by RAG relative to general-purpose AI,” Ho said. “But our paper also documents that RAG is no panacea. Errors can be introduced along the RAG pipeline, for instance, if the retrieved documents are inappropriate, and legal retrieval is uniquely challenging.”
The need for transparency
“One of the most important arguments we make in the paper is that we have an urgent need for transparency and benchmarking in legal AI,” Ho said. “In sharp contrast to general AI research, legal technology has been uniquely closed, with providers offering virtually no technical information or evidence of the performance of products. This poses a huge risk for lawyers.”
According to Ho, one big law firm spent close to a year and a half evaluating one product, coming up with nothing better than “whether the attorneys liked using the tool.”
“The paper calls for public benchmarking, and we’re pleased that providers we’ve talked to agree on the immense value of doing what has been done elsewhere in AI,” he said.
In a blog post responding to the paper, Mike Dahn, head of Westlaw Product Management at Thomson Reuters, described the company's process for testing the tool, which he said included rigorous testing with lawyers and customers.
“We are very supportive of efforts to test and benchmark solutions like this, and we’re supportive of the intent of the Stanford research team in conducting its recent study of RAG-based solutions for legal research,” Dahn wrote, “but we were quite surprised when we saw the claims of significant issues with hallucinations with AI-Assisted Research.”
Dahn suggested that the Stanford researchers may have found higher rates of inaccuracy than Thomson Reuters' internal testing because “the research included question types we very rarely or never see in AI-Assisted Research.”
Dahn also stressed that the company makes it “very clear with customers that the product can produce inaccuracies.”
However, Ho said that these tools are “marketed as general purpose legal research tools and our questions include bar exam questions, questions of appellate litigation, and Supreme Court questions — i.e., exactly the kinds of questions requiring legal research.”
Pablo Arredondo, VP of CoCounsel at Thomson Reuters, told VentureBeat, “I applaud the conversation Stanford started with this study, and we look forward to diving into these findings and other potential benchmarks. We are in early discussions with the university to form a consortium of universities, law firms and legal tech firms to develop and maintain state-of-the-art benchmarks across a range of legal use cases.”
VentureBeat also reached out to LexisNexis for comment and will update this post if we hear back. In a blog post published after the study's release, LexisNexis wrote, “It’s important to understand that our promise to you is not perfection, but that all linked legal citations are hallucination-free. No Gen AI tool today can deliver 100% accuracy, regardless of who the provider is.”
LexisNexis also stressed that Lexis+ AI is meant “to enhance the work of an attorney, not replace it. No technology application or software product can ever substitute for the judgment and reasoning of a lawyer.”
Source: venturebeat.com