Programming & Development

Arena ELO History: the chart that exposes how LLMs degrade

An independent researcher has just published a tool that exposes one of the most poorly documented patterns in the artificial intelligence industry: large language models degrade after launch. Arena AI Model ELO History, published by Erwin Mayer on his personal site, plots the daily evolution of the Elo rating of every flagship model on LM Arena from 2023 to today. The premise is simple: when Anthropic, OpenAI, Google, or xAI launch a model, the initial benchmarks tend to be impressive. What this visualization reveals is that those same models, weeks or months later, tend to lose points in blind evaluations made by real humans. The chart offers no opinion; it only shows the data.

TL;DR

- Erwin Mayer published Arena AI Model ELO History, a dashboard that charts the daily Elo rating of each lab on LM Arena since 2023.
- The data is updated automatically from the official LM Arena Leaderboard dataset on Hu
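To make "losing points" concrete, here is a minimal sketch of how a rating drop could be read, assuming the textbook Elo expected-score formula on a 400-point logistic scale. LM Arena's own rating computation may differ, and this is not the dashboard's code. The function names and the three sample ratings are hypothetical and synthetic; they are not real LM Arena data.

```python
from datetime import date


def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def drift_since_launch(history: list[tuple[date, float]]) -> float:
    """Latest rating minus the earliest one; a negative value means points were lost."""
    ordered = sorted(history)  # sort by date, the first tuple element
    return ordered[-1][1] - ordered[0][1]


# Synthetic, made-up values for illustration only (NOT real LM Arena data).
history = [
    (date(2025, 1, 1), 1300.0),
    (date(2025, 2, 1), 1290.0),
    (date(2025, 3, 1), 1280.0),
]

drift = drift_since_launch(history)
launch_rating = min(history)[1]
latest_rating = max(history)[1]

print(f"drift since first recorded rating: {drift:+.1f}")
print(
    "expected score of the latest rating against the launch rating: "
    f"{expected_score(latest_rating, launch_rating):.3f}"
)
```

Under this formula, equal ratings give an expected score of exactly 0.5, so any negative drift translates into a sub-50% expected score against the model's own launch-day rating.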

DEV Community

21h ago
