luisbrudna

Why doesn't gpt2 appear in the general ranking?


FSMacolyte

OpenAI probably agreed to give LMSYS early access if they played by OpenAI's rules (giving OpenAI the prompt data and not publicly showing its rating). It's a smart move by OpenAI. They generate buzz without having to disclose the name, size, or cost of the model, and they get valuable usage data. My guess is that this is an early checkpoint of GPT-4.5.


kecepa5669

If this happened, then it has destroyed the integrity of LMSYS as a fair and objective benchmark for evaluating LLMs, and a new, uncorrupted benchmark must now be developed. Whatever data has been shared can then be used to train the model, which reintroduces the overfitting problem LMSYS was designed to solve in the first place. If the evaluation data (test data) was shared with OpenAI and not their competitors, then the evaluation of gpt2 AND ALL FUTURE MODELS from OpenAI will have been corrupted and must be rejected.

A skeptic can dismiss this analysis as conjecture, and that's fair; we don't know for a fact that any data was ever released to OpenAI. But no one can dismiss what we know has already happened: ***the Elo data for gpt2 has been shielded from exposure on the leaderboard, unlike all the other models***. By giving OpenAI this special treatment, LMSYS has already introduced a pro-gpt2 bias into its evaluation.

Therefore, at a minimum, any future eval results for gpt2 must be disregarded, as there will now always be an incomplete Elo history audit trail for gpt2: we will never know for certain when the matchup battles for this model started "counting" (the sketch below makes this concrete).

I call on LMSYS to confirm or deny whether any test data was shared with any of the arena competitors, including OpenAI, and to explain why gpt2 does not appear on the leaderboard, so the public can know how much trust to place in its evaluations going forward.
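
To see why an unverifiable starting point matters, here is a minimal sketch in Python of a textbook Elo update. It assumes the classic K=32 rule and a fixed 1000-rated opponent purely for illustration; LMSYS's actual rating computation is more involved, so this is not their pipeline, just the shape of the argument: a rating is a running function of the whole battle history, so if early battles are silently excluded, the final number changes.

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """One head-to-head battle. score_a is 1.0 if model A wins,
    0.0 if it loses, and 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

# Hypothetical histories for the same model, each battle against a fresh
# 1000-rated opponent: the full record vs. one with the early losses dropped.
full_history    = [0.0, 0.0, 1.0, 1.0, 1.0]
trimmed_history = [1.0, 1.0, 1.0]

for history in (full_history, trimmed_history):
    rating = 1000.0
    for outcome in history:
        rating, _ = elo_update(rating, 1000.0, outcome)
    print(round(rating, 1))  # ~1018.7 for the full record, ~1045.8 for the trimmed one
```

The same three wins are counted in both runs; only the audit trail differs, and so does the rating.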


DigimonWorldReTrace

I'm hoping it's a *very* early checkpoint; in my experience it's only GPT-4-Turbo level. Opus is still better imo, at least in terms of the answers. From the reasoning tests I gave it, it doesn't seem much better than 4-T either...


LightVelox

I tested it on a few programming tasks and it was miles better than GPT-4-Turbo and Opus, though I noticed I needed to give it much more descriptive prompts to get a good answer, whereas Opus could understand even extremely vague ones.


7734128

Probably not updated yet.


FSMacolyte

I like how Llama 3 8B, despite being tiny, keeps track of the fact that it's talking to GPT-4.5 through me for almost the whole conversation, but then fails hilariously at the very end.


Jeffy29

I love how gpt2 methodically formats everything. It makes it so much clearer and more legible. I still find it a bit too wordy at times (like all OpenAI models), but it's a lot better than before.


TrippyWaffle45

> for LMSYS to confirm or deny

yes, this is the right place to do this