Patronus AI Launches Industry-first LLM Benchmark for Finance to Address Hallucinations

Model evaluation shows state-of-the-art systems fail spectacularly on finance-related questions

Patronus AI today launched “FinanceBench”, the industry’s first benchmark for testing how LLMs perform on financial questions.

Developed by AI researchers at Patronus AI and 15 financial industry domain experts, FinanceBench is a high-quality, large-scale set of 10,000 question-and-answer pairs based on publicly available financial documents such as SEC 10-Ks, SEC 10-Qs, SEC 8-Ks, earnings reports, and earnings call transcripts. It is presented as a first line of evaluation for LLMs on financial questions, with more advanced tests to be released in the future.

Initial analysis by Patronus AI shows that state-of-the-art LLM retrieval systems fail spectacularly on a sample set of questions from FinanceBench.

GPT-4 Turbo with a retrieval system fails 81% of the time
Llama 2 with a retrieval system fails 81% of the time

Patronus AI also evaluated LLMs with long context windows, noting that they perform better but are less practical for use in a production setting. In particular:

GPT-4 Turbo with long context fails 21% of the time
Anthropic’s Claude-2 with long context fails 24% of the time

Patronus AI notes that LLM retrieval systems are commonly used by enterprises today for multiple reasons. LLMs with long context windows are not only much slower and more expensive to use, but the context windows are still not large enough to support long documents typically used by analysts.

“While LLMs show promise in analyzing mass volumes of financial data, most models out in the market need a lot of refinement and steering to work properly,” said Anand Kannappan, CEO and co-founder of Patronus AI. “And based on our specific evaluation of GPT-4 Turbo and other models, the margin of error is just too big for financial applications.”

“Analysts are spending valuable time creating prompt test sets to evaluate LLM retrieval systems and manually inspecting outputs to identify hallucinations,” said Rebecca Qian, CTO and co-founder of Patronus AI. “And there exist no benchmarks that can help identify exactly where LLMs fail in real world financial use cases. This is precisely why we developed FinanceBench.”

The new benchmark spans several LLM capabilities in finance:

Numerical reasoning: Finance metrics requiring numerical calculations, e.g. EBITDA, P/E ratio, CAGR.
Information retrieval: Specific details extracted directly from the documents.
Logical reasoning: Questions involving financial recommendations, which require interpretation and a degree of subjectivity.
World knowledge: Basic accounting and finance questions that analysts are familiar with.

As part of this release, customers can now evaluate their LLM systems against FinanceBench on the Patronus AI platform. The platform can also detect hallucinations and other unexpected LLM behavior on financial questions in a scalable way. Several financial services companies are set to pilot Patronus AI in the coming months.

PR Newswire
