Based on the verified leaderboard at https://arcprize.org/leaderboard. Same concept but opposite dimension as https://manifold.markets/EchoNolan/when-will-the-first-model-reach-50
I will likely extend the close date until this is achieved.
Update 2026-04-14 (PST) (AI summary of creator comment): The creator is considering that scaffolds and non-LLMs may not count for resolution. The current inclination is that only relatively "plain" LLM models would qualify, meaning:
Third-party scaffolds around existing models may not count
Non-traditional LLMs may not count
No final decision has been made yet; the creator is open to suggestions from traders.
Update 2026-04-15 (PST) (AI summary of creator comment): The creator is concerned that third-party harnesses/scaffolds (e.g., a trivial harness around an existing model) should not count for resolution, as this would turn the question into "which lab submits a harness to the verification process first." Only solutions on the verified leaderboard at arcprize.org/leaderboard are in-bounds for resolution. The spirit of the market is the first relatively plain LLM model reaching 50%, not a scaffolded solution.
Hmm, I realize that it is unclear from the resolution criteria (and frankly unclear to my own intuitive understanding of the question) how this should resolve if the first verified solution over 50% is:
Not a traditional LLM at all?
A third-party scaffold around an existing model?
I can see arguments for these results not counting and only a "pure" model counting. I can see arguments that third-party scaffolds should be allowed and resolve to "Other", or equally to the lab of the underlying model.
I dunno.
I think I am currently inclined to say scaffolds and non-LLMs don't count, and the spirit of the market is the first relatively "plain" LLM model, but I realize that's probably contentious given I argued against this exact interpretation in a previous similar market.
@traders no decision yet, I'm open to suggestions.
@eapache I don't know how you could possibly define "scaffolding" here in an objective way. I would ignore all "scaffolding" objections unless it's shown that the solution was only possible because the private ARC-AGI-3 questions leaked. Otherwise, whichever lab's model first reaches 50% should count as the answer, ignoring everything else.
I would advocate for keeping things simple, leaving https://arcprize.org/leaderboard as the only arbiter of resolution (and using the underlying model(s) to determine which "lab" reached 50% first, split among the winners if several reach it at the same time). They've said in communications that ARC-AGI-3's point is to test general capabilities, so they won't be running it against specialized scaffolds at all; if they end up going back on that, they'll probably have had a good reason. I'd rather this just reflect top AI capabilities that are sufficiently not-gotchas to be judged by the arcprize team as counting.
My original comment was prompted by https://blog.alexisfox.dev/arcagi3, which uses Opus 4.6 to score ~80% on the public set using a fairly trivial harness. It's not verified (yet?) so it's not in-bounds for resolution (yet?), but I think accepting solutions like this would turn the question into "which lab submits a benchmark-specific harness to the verification process first", which was not at all the spirit of the question…
@eapache the public set is much easier than the private set, and I don't know if they'll even verify these; nor would they show up on the leaderboard (which tracks the semi-private set). You could resolve based on which lab's qualifying model was released first, to avoid the problem of which harness gets evaluated first. But FWIW none of this is likely to come up IMO; it may be simpler to explicitly ban third-party scaffolds.