The ARChitects - Technical Report

ARC Prize 2025 Solution Summary

Daniel Franzen 1* Jan Disselhoff 1* David Hartmann 2*

1 JGU Mainz 2 Lambda, Inc. * “The ARChitects” Kaggle team members

TL;DR:

This year, we used an exploration-exploitation strategy: improving last year's solution while simultaneously exploring several new directions.

Our Initial Strategy: Autoregressive Model Improvements

In the first half of the competition, we focused on maxing out last year's approach.

You can find all the details in our ICML publication, read the shorter summary below, or jump directly down to our most recent addition to the community, a masked diffusion-based ARC solver.

Previously on ARC Prize 2024

Our approach last year consisted of three key ideas:

  1. Data Augmentations: We applied ARC-specific data augmentations to improve the effectiveness of both offline training and test-time training.

  2. Depth-First Search Sampling: By caching results during generation, a depth-first traversal of all candidate solutions above a certain probability threshold lets us generate a large number of candidates quickly, without additional memory cost.

  3. Product-of-Experts (PoE) Scoring: The model itself already provides a good estimate of the probability that a candidate is correct. By re-applying the augmentations to the candidates and aggregating the per-view scores as a product of likelihoods (equivalently, a sum of negative log-likelihoods), we get a stable selection mechanism to decide which candidate is most probable, considering all “views” at the same time (see the sketch below).
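
A minimal sketch of this selection step (the `aug` and `model_nll` helpers below are illustrative placeholders for an augmentation and a per-view scoring function, not our actual code):

    def poe_score(candidate, task, augmentations, model_nll):
        # Product of per-view likelihoods == sum of per-view negative log-likelihoods;
        # a lower total means the candidate is more probable across all "views".
        return sum(model_nll(aug(task), aug(candidate)) for aug in augmentations)

    def select_best(candidates, task, augmentations, model_nll):
        # Pick the candidate with the lowest aggregated score.
        return min(candidates, key=lambda c: poe_score(c, task, augmentations, model_nll))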

Meanwhile, Back at ARC Prize 2025 …

So what has changed since our 2024 solution in terms of the AR approach? Three things helped us squeeze out as much performance as possible and reach the top of the leaderboard on July 21st, at least for a brief period of time.

Using this approach, we achieved a score of 16.94% on the public ARC2 leaderboard on August 11th. We quickly realized that this approach would not be enough for this year's ARC competition. So, while maxing out the performance of the autoregressive approach, we started research towards a better model: masked diffusion language models.

Our Final Submission: Recursive Masked Diffusion Model

When we attended ICML 2025 to present last year's paper, it became clear to us that our autoregressive approach wouldn't be strong enough for certain puzzle types, especially tasks like this one (which we haven't seen solved at all yet) and this one (which our model did solve in the end), where the model must understand and alter the global structure of the solution.

From that point on, we leaned fully into an exploration-exploitation strategy: continuing to push our old model to its limits, while simultaneously working on entirely new techniques.

Two Issues Stood Out Immediately:

  1. The AR model was not trained to re-iterate on its own first guess, and was therefore unable to perform the necessary puzzling over and trying out of several solutions. This was only done implicitly via our Product-of-Experts selection mechanism.
  2. The simple augmentations we used previously, while a valuable tool for gauging the model's assessment of its own global mistakes, weren't expressive enough to mitigate the issues of autoregressive models on the new benchmark. In particular, we saw a lot of trouble on puzzle and simulation tasks, as well as issues with diagonal line predictions.

Towards Recursive Masked Diffusion Models

The observations above ultimately led us to a masked diffusion approach and, later on, to a recursive and continuous sampling method that allowed the model to improve upon its own guesses.

We used a LLaDA-8B masked diffusion LLM and fine-tuned it to de-mask the solution, assuming the output shape is already known.

Technical Overview

In essence, we used:

Model & Training Process

In terms of the model setup, besides hyperparameter tuning, we adapted the positional encoding and the masking method to better suit the ARC setting.
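
As a purely illustrative example (one possible way to make positions grid-aware, not necessarily the exact scheme we settled on), row/column indices can be derived from the flattened token sequence and fed to the positional encoding:

    def grid_positions(height, width):
        # Map each flattened grid index to an explicit (row, column) pair so the
        # positional encoding can reflect 2D structure instead of a 1D offset.
        return [(i // width, i % width) for i in range(height * width)]

    print(grid_positions(3, 4)[:5])  # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0)]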

Observations: Token Algebra Enables Recursion

Three observations really helped to boost the performance of the masked diffusion model.

  1. Model Output Looks Like Model Input Already. For discrete masked diffusion, the model predicts a distribution over tokens at each position, and that output can be fed right back in as the next input. Typical remasking strategies use the following loop:

    # start from a fully masked output and let the model fill it in
    guess = model(all_masked)
    for i in range(steps):
        guess = mask_strategy(guess)   # re-mask a subset of positions
        guess = sample(guess)          # discretize: sample hard tokens
        guess = model(guess)           # ask the model for a refined prediction

    This gave us an early hint: the model can refine its own guesses. However, this usually only happens after a discretization step, here done in sample. But is this discretization step actually fully necessary?

  2. There Are No Tokens (🥄)

    Modern LLMs don’t actually operate on discrete symbols except in the first embedding layer. Inside the model, every token is just a point in continuous embedding space. And in that space we can create linear combinations of input tokens!

    (Other people are using similar techniques, for example in T2I diffusion models, see here.)

    Input (what we pretend the model sees)

    <token_what>, <token_is>, <token_2>, <token_plus>, <token_2>, <token_question>

    Actual internal representation

    <token_is> → [0.756, 0.452, 0.112, 0.980, 0.221, ...]

    Token Algebra

    Because tokens are just vectors, we can blend them:

    <token_is> → 0.5 * <token_is> + 0.5 * <token_equals>

    This simple idea (that tokens can be mixed) turns out to unlock recursion for diffusion models.

  3. Key Observation: Soft-Masking

    We discovered especially interesting behaviour when we added the <mask> embedding to every input position: <color_ANY> * 1.0 + <mask> * 1.0

    It seems that during training the model has learned that <mask> means “this position needs improvement”. So when we soft-mask all positions, we are essentially instructing the model to refine the direction of the guess. In other words, adding <mask> everywhere implicitly turns the diffusion model into a continuous iterative solver that improves upon its own output each step. This creates a natural form of recursive self-improvement, where each iteration becomes: guess → soft-mask → refine → repeat

    This alone isn't very stable without further tricks, though, probably because the model has never seen its own outputs as input during training. For this reason (and because we found this technique too late in the competition to finalize the approach, say, by baking the recursion into the fine-tuning of the model), our sampling methods are, in that sense, "stabilization" techniques around this observation (see the sketch right after this list).
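
To make the token algebra concrete, here is a small sketch (names and shapes are illustrative) of how a per-position distribution over the vocabulary can be turned into a soft input embedding, with the <mask> embedding blended into every position:

    import numpy as np

    def soft_embed(probs, embedding_matrix, mask_id, mask_weight=1.0):
        # probs: (seq_len, vocab) distribution per position
        # embedding_matrix: (vocab, dim) token embedding table
        soft = probs @ embedding_matrix              # weighted mix of token embeddings
        # Soft-mask: blend the <mask> embedding into every position, signalling
        # "this position may still need improvement".
        return soft + mask_weight * embedding_matrix[mask_id]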

In summary, the combined observation here, and the one that enabled recursive sampling, is that the model copes surprisingly well with soft, non-discrete inputs.

This means inference is not limited to hard tokens. Instead, we can nudge the model by giving it soft combinations of token embeddings.

Sampling Strategy

During development, we worked on two sampling methods in parallel (part of our ongoing strategy of following multiple promising directions at the same time). Initially, each method solved a different set of tasks, and over the last two weeks of the competition we merged the efforts by taking the best parts of each.

The core tricks were: (1) discrete projection plus added noise, and (2) continuous soft-masking.

Having found the soft-masking trick only 5 days before the competition ended, we decided to focus on the latter approach, since it required fewer refinement steps in this compute-budget-limited competition.

However, the other method (discrete projection + noise) did solve different tasks with longer refinement loops, indicating that it is also a promising candidate for future iterations.

Soft-Masking Sampling Loop

Sampling with masked diffusion LLMs (dLLMs) starts by providing a fully masked output grid and asking the model to replace the mask tokens with content tokens. The sampling loop that follows is typically some variation of: re-masking parts of the input, feeding it back into the dLLM, and requesting a refined prediction. This process repeats until the model converges on a stable solution. Note that we use a different random augmentation of the task in each refinement step.

The extension to use soft-masking is very similar:

import numpy as np

# start fully masked: put all weight on the <mask> token at every position
logits = np.zeros(shape)
logits[..., mask_position] = 1

for step in range(iterations):
    logits = model(logits)

    # discretize, normalize or any other mixing of logits
    logits = normalize(logits)

    # soft-mask every position by re-adding weight on the <mask> token
    logits[..., mask_position] = 1

The normalization method serves two purposes:

  1. It ensures that the model can actually handle the logits, both numerically and conceptually, since it was never trained on non-discrete inputs, and
  2. it has the potential to re-introduce noise, encouraging additional creative exploration during sampling; see the sketch after this list. While this did help in our tests, the extra noise also required larger sampling budgets to fully benefit from it. (For this reason, we did not pursue this direction in this competition.)
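
A possible shape of such a normalization step (purely illustrative; the temperature and noise knobs here are hypothetical, not the exact values or scheme we used):

    import numpy as np

    def normalize(logits, temperature=1.0, noise_scale=0.0):
        # Turn each position's scores into a proper distribution so the model
        # sees numerically well-behaved (if non-discrete) inputs.
        x = (logits - logits.max(axis=-1, keepdims=True)) / temperature
        probs = np.exp(x)
        probs /= probs.sum(axis=-1, keepdims=True)
        if noise_scale > 0:
            # Optionally re-introduce noise to encourage creative exploration.
            noise = np.random.dirichlet(np.ones(probs.shape[-1]), size=probs.shape[:-1])
            probs = (1 - noise_scale) * probs + noise_scale * noise
        return probs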

Most-Visited-Candidate Selection

Convergence alone was not stable enough, especially on harder tasks where the model remained uncertain and tended to flip back and forth between two or more solution states. As a result, the recursive sampling loop typically produced multiple candidate solutions.

We found that the most reliable approach was a stateful selection method: counting which candidates were visited most often during the soft-masking refinement process. For each task, we then selected the two most-visited candidates as our final guesses.
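
In code, this stateful selection boils down to a counter over the discretized states visited during refinement (a sketch; how candidates are discretized and recorded is simplified here):

    from collections import Counter

    def most_visited(candidate_history, top_k=2):
        # candidate_history: discretized grids (e.g. tuples of token ids) recorded
        # at each step of the soft-masking refinement loop.
        counts = Counter(candidate_history)
        # Return the top_k grids the loop settled on most often.
        return [grid for grid, _ in counts.most_common(top_k)]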

Alongside this, we also explored several stateless scoring methods, which do not require access to the sampling history. Two of them were particularly interesting:

Across all experiments, both stateful and stateless methods worked comparably well, but our final submission relied on most-visited count for selecting both the first and second guess.

Shape Prediction

With the demasking model working reliably, one major challenge remains: determining the size of the output grid. To address this, we introduce a second model and make a small adjustment to the data format. As before, we provide multiple input/output pairs, but we alter the representation of the final output grid to make the model predict its shape:

We then perform an additional finetuning run, initializing from our previously finetuned demasking model. In the model input, all tokens of the final output grid are replaced by mask tokens, and the loss is applied to all of them, since we want the model to correctly place the delimiter tokens on the grid; during inference, we then use these delimiters to detect the correct size of the output.
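
For illustration only (the end-of-row token and its name are hypothetical here, not our exact data format): if every row of the masked output canvas ends with a dedicated delimiter token, the predicted delimiter positions directly encode the shape:

    def infer_shape(predicted_tokens, end_of_row_id):
        # Illustrative only: count delimiter tokens to get the height, and count
        # the cells before the first delimiter to get the width.
        delimiters = [i for i, tok in enumerate(predicted_tokens) if tok == end_of_row_id]
        if not delimiters:
            return None
        height = len(delimiters)
        width = delimiters[0]
        return height, width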

Things That Didn’t Work

The exploratory nature of this year's competition meant trying a lot of ideas: probably a hundred on paper, a couple dozen in practice, and only a handful that truly increased accuracy. We would like to share what we've learned, but also want to highlight the things that looked promising on paper but didn't translate well in practice: architectural adaptations we were excited about, and large-scale synthetic data generation.

Synthetic Data

We did work on synthetic data early on. Our best-guess approach involved fine-tuning a coder LLM (Qwen/Qwen2.5-Coder-32B-Instruct) to produce steerable Atari-like game screens of up to 30x30 pixels. The approach was to use GRPO training with a vision LLM as the reward, deciding whether the rendered screen looked like an Atari game or not.

Specifically, we created a “game screen” generator, asking the model to present a specific feature of a game, for example: “the dashboard of a flight simulator”, “the UI of a real-time strategy game”, “a jump-and-run side-scroller”, “a close-up of a fighting game”, “the health bar in a …. game”. The resulting generators (some shown below) quickly resembled game screens and allowed us to “steer” some features using input variables.

The hardest part, however, was to define “meaningful puzzles” on these Atari-like game generators.

One approach was to use our best predictive model so far to decide whether a task was too easy or too hard, and to use that as the reward function for an additional GRPO training loop. Another idea involved measuring the compressibility (using either an LLM or classic compression algorithms) of the input-to-output mappings, since less compressible games contain more independent noise and thus cannot be inferred unambiguously.
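
As a rough illustration of the classic-compression variant of that idea (zlib stands in for whatever compressor one might use; this is not the exact metric we computed):

    import zlib

    def compressibility(input_bytes, output_bytes):
        # Ratio of compressed to raw size for the concatenated input -> output
        # mapping; values close to 1.0 indicate near-incompressible (noisy) tasks.
        raw = input_bytes + output_bytes
        return len(zlib.compress(raw)) / len(raw)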

Unfortunately, our best approach to synthetic generation didn't scale well enough; only about 1 in 50 to 100 generations struck us as novel and interesting enough. Moreover, early tests showed that our small, semi-manually curated synthetic dataset of 150 additional tasks did not yield sufficient performance gains to justify pursuing this direction.

Novel Techniques on the Horizon

Throughout the competition, we kept a close eye on new ideas popping up on X and arXiv, and a few of them sparked directions we explored ourselves. One particularly exciting development was tiny recursive models, which not only validated ARC as a meaningful benchmark in our opinion but also highlighted how crucial the embedding space is for ARC-like tasks. Their approach shares some conceptual ideas with ours, since both rely on recursion between input and output, but the embedding design and task representations are radically different, and that difference seems to matter a lot.

Inspired by this, we also experimented with ways to find better representations of the ARC problem space. We tried alternative positional encodings, hybrid schemes, and various embedding tweaks, though most of our tests didn’t end up helping much.

On the architectural side, we tested a range of ideas: tiny language models, Canon layers, H-Net–style architectures, intermediate reasoning tokens, and alternative grid representations. Some were promising, others dead ends. Such is the life of trying a thousand small ideas. ¯\_(ツ)_/¯

But also remember: just because these approaches did not pan out for us doesn't mean they can't work at all!

Compute Budget

For the duration of the challenge, Lambda generously provided us with one to three GH200 machines for development and experiments (depending on current needs), and, in addition, with 16 x NVIDIA H100 GPUs to max out our solution in the final two weeks of the competition. Early on, we also used an additional machine with 8 x NVIDIA A100 GPUs for a total runtime of about three weeks, mainly while exploring synthetic data generation, which, however, was not used for the final submission.

We sincerely thank Lambda for providing us with the resources that enabled rapid iteration on our ideas.

Final Submission’s Results

Since our final submission involved two models, one to predict the shape of the solution and another to fill in the empty solution grid, our estimates of the final score had a higher variance.

Assuming the shape was known, our best two sampling techniques achieved a score

To infer the shape, we used a second LLaDA model, which achieved an accuracy of about 85% ± 2% on the evaluation set in predicting the correct shape, given the example set of the respective ARC task and the challenge input.

Combining these two models, we expected a score of about 26%; however, our best submission achieved 21.67% on the public leaderboard, which suggests some overfitting towards the evaluation set.

Things To Improve

Because of the multiplicative effect of the additional shape model (the accuracy of the shape model times the accuracy of the known-shape model), a combined model would most likely have helped to improve the score further. The main reason we started with a dedicated model for each sub-task was the limited compute budget on the Kaggle servers, in terms of both memory and speed.

Similarly, the recursive sampling method that we found in the final 5 days of the competition was not reinforced by a proper training objective; the model never learned to use its own logits, yet we used it that way anyway. To improve upon this, the training loop would have required only a single change, namely one (or more) additional recursive forward passes of the model, potentially eliminating any need to schedule and normalize logits during inference. (Getting the normalization just right proved rather difficult; otherwise, inference became unstable for longer recursions.)

Looking at our scores over the last weeks, we made the biggest jumps when we scaled up training resources for faster iterations in the final two weeks; better models enabled faster re-iteration of the sampling techniques.

We should have scaled up earlier. ;D

See Y’All Soon!

This year has been an exciting ride for us: from the moment we switched from autoregressive models to masked diffusion, to the breakthrough that continuous, recursive sampling could actually work with the right tricks. It was also inspiring to see ARC becoming more and more the de-facto benchmark in the broader community, with results from Anthropic, Gemini, OpenAI, and xAI pushing everyone forward.

One lesson that surprised us again and again: models with lower training losses don't just perform better; they unlock entirely new ways of using them, which is why we found the soft-masking tricks so late in the competition. Also, many of last year's optimizations didn't transfer to the new model, but taking a two-pronged exploration-exploitation approach helped us move fast while developing novel methods. On the practical side, we kept things generalizable by being careful with submissions (we used only 60 submissions throughout the competition, and only 17 of those used our diffusion approach), reducing the sampling steps of the diffusion refinement, shortening test-time training, and always being slightly biased towards parameter-free methods.

Thanks for following along, we’re looking forward to what comes next in the ARC community. <3

Greetings, The ARChitects - that is: Daniel, Jan & David. 👋