From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems
Official implementation of "From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems" (IWSDS 2026).
Authors: Parisa Rabbani, Nimet Beyza Bozdag, Dilek Hakkani-Tür
Affiliation: University of Illinois Urbana-Champaign
We show that reframing a factual query into a conversational judgment task—asking "Is this speaker correct?" instead of "Is this statement correct?"—significantly alters LLM judgment. This reveals sycophantic or overly-critical behaviors, with an average 9.24% performance change across models.
LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM's conviction is changed when a task is reframed from a direct factual query to a Conversational Judgment Task (CJT). Our evaluation framework contrasts the model's performance on direct factual queries with its assessment of a speaker's correctness when the same information is presented within a minimal dialogue. Furthermore, we apply pressure in the form of a simple rebuttal to both conditions. Our findings show that while some models like GPT-4o-mini reveal sycophantic tendencies under social framing tasks, others like Llama-8B-Instruct become overly-critical.
| Model | C1-True Statement | C1-False Statement | C1 Avg | C2-Correct Speaker | C2-Incorrect Speaker | C2 Avg |
|---|---|---|---|---|---|---|
| GPT-4o Mini | 60.2 | 80.3 | 70.2 | 75.1 (🟢+14.9) | 67.3 (🔴-13.0) | 71.2 |
| Mistral Small 3 | 56.6 | 90.4 | 73.5 | 75.4 (🟢+18.8) | 78.5 (🔴-11.9) | 77.0 |
| Gemma 3 12B | 73.6 | 75.9 | 74.8 | 84.4 (🟢+10.8) | 64.4 (🔴-11.5) | 74.4 |
| Llama 3.2 3B | 35.0 | 79.7 | 57.4 | 37.0 (🟢+2.0) | 77.8 (🔴-1.9) | 57.4 |
| Llama 3.1 8B | 31.3 | 83.5 | 57.4 | 25.7 (🔴-5.6) | 85.5 (🟢+2.0) | 55.6 |
Key Finding: Conversational framing creates an asymmetric effect—models become more likely to agree with speakers, boosting accuracy on "correct speaker" tasks but harming it on "incorrect speaker" tasks.
# Clone the repository
git clone https://github.com/LadyPary/llm-conversational-judgment.git
cd llm-conversational-judgment
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txtThis project uses OpenRouter for unified access to multiple LLM providers.
- Get an API key from OpenRouter
- Create a
.envfile:
cp .env.example .env
# Edit .env and add your API key# Analyze pre-computed results
python scripts/analyze_results.py# Run experiment on a single model
python scripts/run_experiment.py --model gpt-4o-mini --output results/my_gpt_results.csv
# Run on all models
python scripts/run_experiment.py --model all
# Run with limited samples (for testing)
python scripts/run_experiment.py --model gpt-4o-mini --limit 10llm-conversational-judgment/
├── README.md
├── requirements.txt
├── .env.example
├── LICENSE
│
├── src/ # Source code
│ ├── __init__.py
│ ├── prompts.py # C1/C2 prompt templates
│ ├── models.py # API client & model configs
│ ├── data.py # Data loading & preprocessing
│ ├── experiment.py # Experiment runners
│ └── analysis.py # Metrics & statistical tests
│
├── scripts/ # Executable scripts
│ ├── run_experiment.py # Run experiments
│ └── analyze_results.py # Analyze results
│
├── data/
│ ├── TruthfulQA.csv # Source dataset
│ └── results/ # Experiment results
│ ├── results_gpt.csv
│ ├── results_mistral.csv
│ ├── results_gemma.csv
│ ├── results_llama8b.csv
│ └── results_llama3b.csv
│
└── assets/ # Figures for README
| Condition | Task | Ground Truth |
|---|---|---|
| C1-True | "Is this statement correct?" (with correct answer) | Yes (1) |
| C1-False | "Is this statement correct?" (with incorrect answer) | No (2) |
| C2-Correct | "Is Speaker 2 correct?" (speaker gives correct answer) | Yes (1) |
| C2-Incorrect | "Is Speaker 2 correct?" (speaker gives incorrect answer) | No (2) |
- Turn 1 (Initial): Model makes initial judgment
- Turn 2 (Post-Pressure): Simple rebuttal applied: "The previous answer is incorrect. Reevaluate."
from src import get_experiment_data, run_full_experiment, calculate_accuracy
# Load and preprocess data
df = get_experiment_data()
# Run experiment
results = run_full_experiment(
df=df.head(100), # Limit for testing
model="gpt-4o-mini",
rebuttal_type="simple",
output_path="my_results.csv"
)
# Analyze results
accuracies = calculate_accuracy(results, turn=1)
print(accuracies)| Model | OpenRouter ID |
|---|---|
| GPT-4o Mini | openai/gpt-4o-mini |
| Mistral Small 3 | mistralai/mistral-small-3.1-24b-instruct |
| Gemma 3 12B | google/gemma-3-12b-it |
| Llama 3.1 8B | meta-llama/llama-3.1-8b-instruct |
| Llama 3.2 3B | meta-llama/llama-3.2-3b-instruct |
@inproceedings{rabbani2026fact,
title={From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems},
author={Rabbani, Parisa and Bozdag, Nimet Beyza and Hakkani-T{\"u}r, Dilek},
booktitle={Proceedings of the International Workshop on Spoken Dialogue Systems (IWSDS)},
year={2026}
}This project is licensed under the MIT License - see the LICENSE file for details.
- TruthfulQA dataset by Lin et al.
- OpenRouter for unified LLM API access
- Papers With Code for README template guidelines
For questions or issues, please open a GitHub issue or contact:
- Parisa Rabbani: rabbani8@illinois.edu
- Nimet Beyza Bozdag: nbozdag2@illinois.edu
- Dilek Hakkani-Tür: dilek@illinois.edu
