Projects on the Implementation of LLMs
- How Much Should We Trust LLM-Based Measures for Accounting and Finance Research? (Solo-authored, SSRN)
- Linguistic Complexity and Investor Horizons (with Brian Bushee, WP)
1. How Much Should We Trust LLM-Based Measures for Accounting and Finance Research?
This paper explores the reliability of self-reported confidence scores from large language models (LLMs), such as ChatGPT, within the context of accounting and finance research.
I examine two key aspects: (1) how well LLMs’ expressed confidence aligns with actual accuracy (calibration), and (2) how effectively these models distinguish between correct and incorrect predictions on average (failure prediction).
The results show a 40% gap between prediction accuracy and self-reported confidence scores, indicating significant overconfidence.
- A modified prompt strategy reduces this gap to 19%, while a fine-tuning approach eliminates it altogether.
- Although it does not improve calibration, the Chain-of-Thought (CoT) prompt strategy improves failure prediction over standard prompting, supporting better cross-group comparisons.
- Additionally, smaller non-generative models like RoBERTa do not exhibit overconfidence and outperform prompted ChatGPT in both calibration and failure prediction when fine-tuned.
1.1 Why Should We Care?
Accounting and finance researchers are increasingly asking ChatGPT to provide confidence scores from 0 to 100 alongside its answers for use in empirical analysis (e.g., Bond et al. [2023]; Kim et al. [2024]; Qu [2024]).
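To make the setup concrete, below is a minimal sketch of this kind of verbalized-confidence prompt using the OpenAI Python SDK; the prompt wording and the example sentence are illustrative placeholders, not the exact instructions used in the paper or the cited studies.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

sentence = "The company expects margins to improve materially next quarter."  # illustrative

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": (
                "Classify the sentiment of the following sentence as "
                "Positive, Negative, or Neutral, and report your confidence "
                "as an integer from 0 to 100.\n"
                f"Sentence: {sentence}\n"
                "Answer format: <label>, <confidence>"
            ),
        }
    ],
    temperature=0,
)

print(response.choices[0].message.content)  # e.g., "Positive, 95"
```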
However, ChatGPT's expressed confidence is based on the probability of selecting the next word in a sequence, not the actual accuracy of its answers. Studies from the AI literature support this. Xiong et al. [2023] found that ChatGPT tends to express confidence in multiples of 5, with most values between 80% and 100%, similar to human expressions of confidence. Zhou et al. [2023] also observed that when ChatGPT stated “I am 90% sure” in a trivia task, it was only correct 57% of the time.
ChatGPT’s overconfidence is an example of hallucination, a byproduct of its text-generation mechanism. At their core, generative LLMs function much like the autocomplete feature used by search engines. When we ask a question, ChatGPT selects the most probable next word, then iteratively chooses the next most probable word, continuing until the response is complete. This entire process unfolds in real time as the model generates its answer.
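Those next-word probabilities can be inspected directly. Below is a minimal sketch using the Chat Completions API's logprobs option; note that these token-level probabilities are the generation mechanism described above, not the verbalized 0-100 confidence score.

```python
from math import exp

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "The sentiment of 'profits fell sharply' is"}],
    max_tokens=1,
    logprobs=True,    # return log-probabilities of the sampled tokens
    top_logprobs=5,   # plus the five most likely alternatives at that position
)

# Each candidate next token with the probability the model assigned to it.
for cand in response.choices[0].logprobs.content[0].top_logprobs:
    print(f"{cand.token!r}: {exp(cand.logprob):.2%}")
```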
1.2 How Do I Test Self-Reported Confidence Scores?
In this paper, I conduct experiments using ChatGPT on sentiment analysis of financial text, a common task in accounting and finance research.
- I evaluate ChatGPT’s calibration and failure prediction performance to assess its level of overconfidence (both metrics are sketched after this list).
- I explore methods to reduce this overconfidence, such as applying techniques from the AI literature, fine-tuning models, and using non-generative LLMs.
- I demonstrate how the results of empirical analyses can vary significantly depending on the methods used to obtain confidence scores.
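As a rough illustration of the two evaluation ideas (not the paper's exact implementation), calibration can be summarized with an expected calibration error (ECE) and failure prediction with an AUROC on the correct/incorrect indicator; the input arrays below are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

confidence = np.array([0.95, 0.90, 0.80, 0.90, 0.60])  # self-reported scores, rescaled to [0, 1]
correct = np.array([1, 0, 1, 1, 0])                     # 1 if the label matched the ground truth

def expected_calibration_error(conf, corr, n_bins=10):
    """Average |accuracy - confidence| across equal-width confidence bins,
    weighted by the share of observations in each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(corr[mask].mean() - conf[mask].mean())
    return ece

# Calibration: do stated confidence levels match realized accuracy?
print("ECE:", expected_calibration_error(confidence, correct))
# Failure prediction: does confidence rank correct answers above incorrect ones?
print("AUROC:", roc_auc_score(correct, confidence))
```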
1.3 What Are the Main Findings?
The results show a 40% gap between prediction accuracy and self-reported confidence scores, indicating significant overconfidence.
A modified prompt strategy, repeatedly prompting the model to generate its top-K guesses, reduces this gap to 19%, while a fine-tuning approach eliminates it altogether. Fine-tuned non-generative models like RoBERTa and FinBERT perform comparably to fine-tuned ChatGPT despite being 1,000 times smaller. Failure prediction, by contrast, varies little across prompt strategies and fine-tuning methods, although the CoT prompt strategy improves failure prediction over standard prompting by 10%.
Overall, the fine-tuning approach produces better-calibrated confidence scores that closely match the actual accuracy of the answers, and both the CoT prompt strategy and the fine-tuning approach support better cross-group comparisons.
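A minimal sketch of the repeated top-K guesses idea, assuming the model is queried several times and the stated probabilities for each label are averaged across repetitions; the prompt wording, parsing, and aggregation here are illustrative, not the paper's exact procedure.

```python
from collections import defaultdict

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Give your top 3 guesses for the sentiment of the sentence below "
    "(Positive, Negative, or Neutral), each with a probability from 0 to 100, "
    "one guess per line in the format: <label>: <probability>.\n"
    "Sentence: {sentence}"
)

def top_k_confidence(sentence: str, n_repeats: int = 5) -> dict[str, float]:
    """Query the model n_repeats times for top-K guesses and average the
    stated probability of each label across repetitions (illustrative only)."""
    totals: dict[str, float] = defaultdict(float)
    for _ in range(n_repeats):
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": PROMPT.format(sentence=sentence)}],
            temperature=1,  # allow sampling variation across repetitions
        ).choices[0].message.content
        for line in reply.splitlines():
            if ":" in line:
                label, prob = line.split(":", 1)
                try:
                    totals[label.strip()] += float(prob.strip().rstrip("%")) / 100
                except ValueError:
                    continue  # skip lines that do not parse
    return {label: total / n_repeats for label, total in totals.items()}

print(top_k_confidence("Revenue guidance was cut for the third straight quarter."))
```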
1.4 A Few More Thoughts
- In conclusion, this paper demonstrates that self-reported confidence scores from LLMs suffer from overconfidence. To address this issue, I recommend fine-tuning when a labeled training dataset is available. Smaller non-generative models can also be effective (see the sketch after this list). If a labeled dataset is not available, the repeated top-K guesses prompt strategy can help deflate confidence scores toward the actual accuracy level.
- Good calibration is essential when making cross-sectional inferences. For example, suppose we have two samples: Sample A labeled "Positive" with a confidence score of 80% and Sample B also labeled "Positive" but with a confidence score of 40%. Saying that Sample A is 40 percentage points more positive than Sample B only holds if the confidence scores are well-calibrated. Without proper calibration, the 40-point difference is meaningless.
- Good failure prediction allows for comparisons at the aggregate level, even if confidence scores are not perfectly calibrated. In that case the difference in confidence scores is not directly interpretable, but failure prediction still supports comparisons on average. For example, Kim et al. [2024] use the CoT prompt strategy and report results separately for high-confidence (top 25%) and low-confidence (bottom 25%) groups.
- There is a trade-off between calibration and accuracy. For example, a model that classifies samples perfectly but is nearly 100% confident all the time—such as a fine-tuned version of ChatGPT—turns the measure into more of an indicator. In this scenario, the model effectively sorts samples into discrete categories rather than along a continuum. While this can be useful in some applications, it's important to recognize this limitation when interpreting results in studies aiming to capture cross-sectional differences, as demonstrated in this paper's empirical analysis.
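For the non-generative route mentioned above, confidence scores come from the model's softmax output rather than from generated text. A minimal sketch, using the public ProsusAI/finbert checkpoint as a stand-in for the paper's own fine-tuned models, and assuming transformers with a PyTorch backend:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "ProsusAI/finbert"  # public FinBERT sentiment checkpoint, used as a stand-in
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

inputs = tokenizer("Operating costs rose faster than revenue this quarter.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()

# The softmax probability of the predicted class serves as the confidence score;
# it comes from the model's output layer rather than from a verbalized 0-100 number.
pred = probs.argmax().item()
print(model.config.id2label[pred], round(probs[pred].item(), 3))
```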
1.5 Would GPT-4o or o1 perform differently?
Short answer: No. The current version of the paper uses GPT-3.5 Turbo as the main model due to research budget constraints, but Xiong et al. [2023] report similar findings with GPT-4 in tests of the model’s ability to express uncertainty. I plan to repeat the experiments with newer models soon and will update the findings. However, the overconfidence problem is structural rather than a matter of scale. Unless the text generation mechanism in LLMs changes, this issue is likely to persist.
2. Linguistic Complexity and Investor Horizons (with Brian Bushee)
We offer a novel empirical approach to measuring linguistic complexity by fine-tuning prominent open-source large language models (LLMs) such as Google's BERT and Meta's Llama2. This approach has proven effective across samples of disclosures and earnings call transcripts.
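As a rough illustration of the general setup only (the label scheme, toy examples, and hyperparameters below are placeholders, not the paper's actual design), fine-tuning an encoder such as BERT on sentence-level complexity labels might look like this:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy labeled examples: 1 = linguistically complex, 0 = not complex (illustrative only).
data = Dataset.from_dict({
    "text": [
        "Notwithstanding the foregoing, amortization of the aforementioned intangibles remains subject to reassessment.",
        "Sales grew 5% this quarter.",
    ],
    "label": [1, 0],
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=128))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="complexity-bert", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```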
Paper descriptions, linguistic complexity score data (disclosures and conference call transcripts from 2008 to 2023), and fine-tuned models will be available soon!