🏆 TD-EVAL Model Evaluation Leaderboard
This leaderboard displays aggregated model performance across multiple evaluation metrics.
Variants:
mwoz: Baseline variant.
tau-airline: Airline specialty variant.
tau-retail: Retail specialty variant.
Use the checkboxes below to select which variants to include. At least one variant must be active.
mwoz
tau-airline
tau-retail
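As a rough illustration of how variant selection could feed the aggregated scores, the sketch below averages each metric over the selected variants and rejects an empty selection. The page does not state the aggregation rule, so the plain mean, the `aggregate` function, the metric keys, and the sample numbers are all hypothetical assumptions.

```python
from statistics import mean

VARIANTS = ("mwoz", "tau-airline", "tau-retail")
# Hypothetical keys standing in for the three metrics shown on the page.
METRICS = ("conversation_consistency", "backend_consistency", "policy_completeness")


def aggregate(scores, selected):
    """Average each metric over the selected variants, per model.

    scores:   {model: {variant: {metric: float}}}
    selected: iterable of variant names to include
    Assumption: aggregation is a plain mean; the real leaderboard may differ.
    """
    chosen = set(selected)
    active = [v for v in VARIANTS if v in chosen]
    if not active:
        raise ValueError("At least one variant must be active.")

    rows = {}
    for model, by_variant in scores.items():
        present = [by_variant[v] for v in active if v in by_variant]
        if not present:
            continue  # model has no results for any selected variant
        row = {m: mean(r[m] for r in present) for m in METRICS}
        # Assumption: Average Score is the mean of the three metrics.
        row["average_score"] = mean(row[m] for m in METRICS)
        rows[model] = row
    return rows


# Placeholder values for illustration only, not real leaderboard results.
example = {
    "model-a": {
        "mwoz": {
            "conversation_consistency": 4.0,
            "backend_consistency": 3.0,
            "policy_completeness": 5.0,
        },
        "tau-airline": {
            "conversation_consistency": 2.0,
            "backend_consistency": 3.0,
            "policy_completeness": 1.0,
        },
    },
}

only_mwoz = aggregate(example, ["mwoz"])
both = aggregate(example, ["mwoz", "tau-airline"])
```

Under these assumptions, unchecking a variant simply drops its rows from the mean, which is why an empty selection has to be rejected rather than averaged.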
Search models
Sort by: Average Score ▼, Conversation Consistency, Backend Consistency, Policy Completeness
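The search box and sort control can be pictured as a filter followed by a sort. The following is a minimal sketch, assuming a case-insensitive substring match on the model name and reading the ▼ marker as descending order. The `search_and_sort` function, the column keys, and the sample rows are hypothetical.

```python
def search_and_sort(rows, query="", sort_by="average_score", descending=True):
    """Filter models by name, then order them by one score column.

    rows: {model_name: {column: float}}
    Assumptions: matching is a case-insensitive substring test, and the
    default order is descending by average score.
    """
    needle = query.strip().lower()
    matches = [(name, row) for name, row in rows.items() if needle in name.lower()]
    return sorted(matches, key=lambda item: item[1][sort_by], reverse=descending)


# Placeholder rows for illustration only, not real leaderboard results.
rows = {
    "model-a": {"average_score": 3.0, "policy_completeness": 4.0},
    "model-b": {"average_score": 4.0, "policy_completeness": 2.0},
    "other": {"average_score": 5.0, "policy_completeness": 1.0},
}

by_average = [name for name, _ in search_and_sort(rows, "model")]
by_policy = [name for name, _ in search_and_sort(rows, sort_by="policy_completeness")]
```

An empty query matches every model, so the same function covers both the unfiltered default view and a narrowed search.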