Leaderboards

UC Berkeley

Chatbot Arena LLM Leaderboard: Community-driven Evaluation for Best LLM and AI chatbots

Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots

Maintained by researchers at UC Berkeley SkyLab and LMArena

Berkeley Function-Calling Leaderboard

Berkeley Function Calling Leaderboard V3 (aka Berkeley Tool Calling Leaderboard V3)

The Berkeley Function Calling Leaderboard V3 (also called the Berkeley Tool Calling Leaderboard V3) evaluates an LLM's ability to call functions (also known as tools) accurately. The leaderboard is built from real-world data and is updated periodically. For more information on the evaluation dataset and methodology, see the project's blog posts: BFCL-v1, which introduces AST as an evaluation metric; BFCL-v2, which introduces enterprise and OSS-contributed functions; and BFCL-v3, which introduces multi-turn interactions. Check out the code and data.
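To give a rough feel for what an AST-based check of a function call can look like, here is a minimal sketch in Python. It is not the BFCL implementation: the helper names `parse_call` and `call_matches`, the `get_weather` function, and the "list of acceptable values per parameter" format are all assumptions made for illustration. The idea is to parse the model's output into a syntax tree and compare the function name and arguments structurally, rather than comparing raw strings.

```python
import ast


def parse_call(source: str):
    """Parse a single call expression into (function_name, keyword_args).

    Hypothetical helper for illustration; requires Python 3.9+ for ast.unparse.
    """
    tree = ast.parse(source.strip(), mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call):
        raise ValueError("expected a single function call")
    if call.args:
        raise ValueError("this sketch only supports keyword arguments")
    name = ast.unparse(call.func)  # also handles dotted names like `api.get`
    kwargs = {}
    for kw in call.keywords:
        if kw.arg is None:
            raise ValueError("**kwargs expansion is not supported")
        # literal_eval accepts an AST node and rejects non-literal values
        kwargs[kw.arg] = ast.literal_eval(kw.value)
    return name, kwargs


def call_matches(model_output: str, expected_name: str, allowed: dict) -> bool:
    """Return True if the call uses the expected function and every argument
    takes one of the acceptable values listed in `allowed` (assumed format)."""
    try:
        name, kwargs = parse_call(model_output)
    except (SyntaxError, ValueError):
        return False  # unparseable output counts as a failed call
    if name != expected_name:
        return False
    if set(kwargs) != set(allowed):
        return False  # missing or unexpected parameters
    return all(kwargs[key] in allowed[key] for key in allowed)


if __name__ == "__main__":
    allowed = {"city": ["Berkeley"], "unit": ["celsius", "C"]}
    output = 'get_weather(unit="C", city="Berkeley")'
    print(call_matches(output, "get_weather", allowed))
```

Because the comparison works on the parsed structure, argument order and whitespace in the model output do not affect the result, which is the main appeal of this style of check over exact string matching.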

Coding Evaluation

BigCodeBench
Big Code Models Leaderboard
Chatbot Arena Leaderboard
CrossCodeEval
ClassEval
CRUXEval
Code Lingua
Evo-Eval
EffiBench
HumanEval.jl - Julia version HumanEval with EvalPlus test cases
LiveCodeBench
MHPP
NaturalCodeBench
RepoBench
SWE-bench
TabbyML Leaderboard
TestEval
EvalPlus Leaderboard: https://evalplus.github.io/leaderboard.html


