Will scaling transformers lead to a 60% score on ARC-AGI-2?
98% chance

Will any plain transformer model achieve 60% or more on ARC-AGI-2 by 2030?

The inference cost to achieve this result does not matter.

The model that achieves this result must use the same "transformer recipe" common between 2023 and 2025: techniques such as RLHF, RLAIF, CoT, RAG, and vision encoders are allowed, but any specialized components must also be built from vanilla transformer blocks. Any new inductive biases, such as tree search or neurosymbolic logic, would not qualify.

The result must be verified by at least one reputable, unaffiliated organization (ARC, Epoch, OpenAI Evals, an academic lab, etc.) or be publicly reproducible (e.g., a re-runnable notebook on Kaggle).

Resolution uses the ARC-AGI-2 evaluation set and scoring script as published on arcprize.org on the day this market opens. Later revisions are ignored.

  • Update 2025-10-12 (PST) (AI summary of creator comment): PPO, GRPO, and RLVR are allowed training methods.

Generating synthetic data using other models to train the transformer is allowed, as long as the final model follows the common transformer recipe.


https://x.com/arcprize/status/1999182732845547795

"GPT-5.2 Pro (High) is SOTA for ARC-AGI-2, scoring 54.2% for $15.72/task."

@lumi is PPO or GRPO, or ig RLVR, allowed?

also, is generating synthetic data (using other models) to train this model on allowed, as long as training follows the common transformer recipe?

@Bayesian both are allowed

@CraigDemel wanna bet more on this around market price? I can do a lot more volume

or anyone else; ping me

@Bayesian good for now, thanks!

Honestly, I don't believe 60% or more on ARC-AGI-2 is truly AGI in any meaningful sense:

Humans can score 100%, not 60%.

It's a single benchmark that doesn't test the full breadth of capabilities; it's entirely possible for a system to be good at this benchmark while being useless at other tasks.

I propose renaming the question.

© Predita Markets, Inc. • Terms of Use • Privacy