Ryan Whitwam
2025-05-01 16:31:00
arstechnica.com
The rapid proliferation of AI chatbots has made it difficult to know which models are actually improving and which are falling behind. Traditional academic benchmarks only tell you so much, which has led many to lean on vibes-based analysis from LM Arena. However, a new study claims this popular AI ranking platform is rife with unfair practices, favoring large companies that just so happen to rank near the top of the index. The site’s operators say the study draws the wrong conclusions.
LM Arena was created in 2023 as a research project at UC Berkeley. The pitch is simple: users feed a prompt into two unidentified AI models in the “Chatbot Arena” and vote for the output they like more. These votes are aggregated in the LM Arena leaderboard, which shows which models people like the most and can help track improvements in AI models.
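To make the idea of turning head-to-head votes into a ranking concrete, here is a minimal sketch. It assumes a plain win-rate tally, which is only an illustration: the article does not say how LM Arena actually aggregates votes, and the model names and votes below are hypothetical.

```python
from collections import defaultdict


def win_rate_leaderboard(votes):
    """Rank models by the share of head-to-head comparisons they won.

    `votes` is an iterable of (winner, loser) model-name pairs.
    This is an assumed, simplified aggregation, not LM Arena's method.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    rates = {model: wins[model] / games[model] for model in games}
    # Highest win rate first.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
]
board = win_rate_leaderboard(votes)
for model, rate in board:
    print(f"{model}: {rate:.2f}")
```

A raw win rate ignores how strong each opponent was, which is one reason real leaderboards tend to use rating systems instead.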
Companies are paying more attention to this ranking as the AI market heats up. Google noted when it released Gemini 2.5 Pro that the model debuted at the top of the LM Arena leaderboard, where it remains to this day. Meanwhile, DeepSeek’s strong performance in the Chatbot Arena earlier this year helped to catapult it to the upper echelons of the LLM race.
The researchers, hailing from Cohere Labs, Princeton, and MIT, believe AI developers may have placed too much stock in LM Arena. The new study, available on the arXiv preprint server, claims the arena rankings are distorted by practices that make it easier for proprietary chatbots to outperform open ones. The authors say LM Arena allows developers of proprietary large language models (LLMs) to test multiple versions of their AI on the platform. However, only the highest-performing one is added to the public leaderboard.
Meta tested 27 versions of Llama-4 before releasing the version that appeared on the leaderboard. Credit: Shivalika Singh et al.
Some AI developers are taking extreme advantage of the private testing option. The study reports that Meta tested a whopping 27 private variants of Llama-4 before release. Google is also a beneficiary of LM Arena’s private testing system, having tested 10 variants of Gemini and Gemma between January and March 2025.
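Why would publishing only the best of many private variants matter? The following is a rough simulation of the selection effect, not the study's methodology. It assumes every variant has the same true quality and that a measured arena score is that quality plus Gaussian noise; the score of 1000, the noise level of 20, and the function name are all hypothetical.

```python
import random
import statistics


def simulate(n_variants, trials=2000, true_score=1000.0, noise_sd=20.0, seed=0):
    """Compare one variant's average measured score with the best-of-N average.

    Assumption: all variants share the same true score, so any gap
    between the two averages comes purely from picking the luckiest one.
    """
    rng = random.Random(seed)
    single, best = [], []
    for _ in range(trials):
        scores = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
        single.append(scores[0])   # what one honest submission would show
        best.append(max(scores))   # what publishing only the top variant shows
    return statistics.mean(single), statistics.mean(best)


single_mean, best_mean = simulate(27)
print(f"single variant: {single_mean:.1f}")
print(f"best of 27:     {best_mean:.1f}")
```

Under these assumptions, the best-of-27 average lands well above the single-variant average even though no variant is actually better, and the gap grows with the number of variants tested.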
The study also calls out LM Arena for what appears to be much greater promotion of private models like Gemini, ChatGPT, and Claude. Developers collect data on model interactions from the Chatbot Arena API, but teams focusing on open models consistently get the short end of the stick.