Research & Benchmarks

ARC-AGI Benchmark Results

Exploring how different neural network training strategies perform on the Abstraction and Reasoning Corpus (ARC) — a benchmark designed to measure genuine artificial general intelligence through novel reasoning tasks that require learning from just a few examples.

400
Training Tasks
400
Evaluation Tasks
6
Training Modes
3500+
Architectures Tested

🧠 What is ARC-AGI?

The Abstraction and Reasoning Corpus (ARC) is a benchmark created by François Chollet to measure machine intelligence in a way that goes beyond pattern recognition. Each task presents a few input-output examples, and the system must infer the underlying transformation rule to apply it to a new input — much like an IQ test for AI.

Unlike traditional ML benchmarks, ARC tasks require genuine abstraction: recognizing objects, understanding spatial relationships, counting, detecting symmetry, and more — all from just 2-4 examples.
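Concretely, each ARC task is distributed as JSON with a few "train" input/output grid pairs and one or more "test" inputs, where grids are small matrices of color indices 0-9. The toy task and brute-force recoloring search below are illustrative only; real ARC transformation rules are far more varied:

```python
# Toy ARC-style task (illustrative; real tasks ship as JSON with the same
# "train"/"test" shape, grids being lists of lists of color indices 0-9).
task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [{"input": [[0, 1], [1, 1]]}],
}

def recolor(grid, mapping):
    """Apply a color-index remapping cell by cell."""
    return [[mapping.get(c, c) for c in row] for row in grid]

def infer_recolor_rule(pairs):
    """Brute-force the single recoloring that explains every training pair."""
    for src in range(10):
        for dst in range(10):
            if all(recolor(p["input"], {src: dst}) == p["output"] for p in pairs):
                return {src: dst}
    return None

rule = infer_recolor_rule(task["train"])          # {1: 2}
print(recolor(task["test"][0]["input"], rule))    # [[0, 2], [2, 2]]
```

A solver only "passes" a task when the predicted test grid matches pixel for pixel, which is why the tables below count pixel-perfect solves rather than partial accuracy.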

📊 What We're Measuring

  • Stability — How consistent is the accuracy over time? (Higher = more stable)
  • Throughput — Samples processed per second during real-time task switching
  • Consistency — How reliably does the model perform across different tasks?
  • Tasks Solved — Number of unique ARC tasks where the model achieved pixel-perfect accuracy

📈 Training Modes

Mode Comparison

Comparing 6 training strategies under real-time task switching: cycling through 400 tasks, one every 100 ms, while trying to maintain accuracy.

⚡ Real-Time Task Switching: Task1 → Task2 → Task3 → ... → Task400 → Task1 → ...
We track pixel accuracy (%) every 100 ms while cycling through 400 ARC tasks. NormalBP pauses to batch-train (reducing throughput), while StepTweenChain trains on every sample and so maintains accuracy across switches.
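Under the hood, the harness can be sketched like this; `DummyModel`, the flat task pairs, and the method names are stand-ins invented for illustration, not the project's actual API:

```python
import time
from itertools import cycle

class DummyModel:
    """Stand-in for a real model: identity 'prediction', no-op training."""
    def predict(self, x):
        return x
    def train_step(self, x, y):
        pass  # a real online learner would update weights here, every sample

def run_switching_benchmark(model, tasks, duration_s=0.3, switch_ms=100):
    """Cycle tasks every switch_ms; record pixel accuracy % per window."""
    windows = []
    task_iter = cycle(tasks)
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        x, y = next(task_iter)                     # switch to the next task
        window_end = time.monotonic() + switch_ms / 1000
        correct = total = 0
        while time.monotonic() < window_end:
            pred = model.predict(x)
            correct += sum(p == t for p, t in zip(pred, y))
            total += len(y)
            model.train_step(x, y)                 # online: no batching pause
        windows.append(100.0 * correct / max(total, 1))
    return windows

# Two flat "tasks": the identity model scores 100% on one, 50% on the other.
acc = run_switching_benchmark(DummyModel(), [([1, 0], [1, 0]), ([0, 1], [1, 1])])
```

The per-window accuracy list is exactly what feeds the stability and consistency scores described below.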
Accuracy Over Time
Comparing stability and throughput across training modes

πŸ† WINNER

β€”

⚠️ WORST

β€”

βš™οΈ SCORING ALGORITHM

Formula:
β€’ Stability = 100 - stddev of window accuracies
β€’ Consistency = % of windows above threshold
β€’ Throughput = outputs per second
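Translated directly into code; the population standard deviation and the 80% consistency threshold are assumptions for illustration, since the page doesn't pin them down:

```python
from statistics import pstdev

def score_run(window_accs, outputs, seconds, threshold=80.0):
    """Score one benchmark run from its per-window accuracy samples."""
    stability = 100.0 - pstdev(window_accs)          # 100 - stddev of windows
    consistency = 100.0 * sum(a >= threshold for a in window_accs) / len(window_accs)
    throughput = outputs / seconds                   # outputs per second
    return stability, consistency, throughput

# Four windows at 90/95/85/90 % accuracy, 5000 outputs in 10 s.
s, c, t = score_run([90.0, 95.0, 85.0, 90.0], outputs=5000, seconds=10.0)
# stability ~= 96.46, consistency = 100.0, throughput = 500.0
```

Note that a mode which pauses to batch-train can keep decent stability while its throughput collapses, which is why the two are scored separately.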

📊 Mode Comparison
10-second real-time task switching benchmark results
Mode | Stability | Throughput | Consistency | Solved | Score
👑 Collective Intelligence

Council of 1000

1000 randomized neural network architectures competing to find unique task solutions. Testing statistical saturation: does the discovery curve flatten or keep rising?

👑 THE COUNCIL OF 1000 - Massive-Scale Architecture Search
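The discovery curve is just the running count of distinct tasks solved as agents are processed in order; a flat tail means saturation. A minimal sketch on toy per-agent solve sets (not real results):

```python
def discovery_curve(solved_by_agent):
    """Cumulative count of unique tasks as each agent's solves arrive."""
    seen, curve = set(), []
    for solved in solved_by_agent:
        seen.update(solved)
        curve.append(len(seen))
    return curve

# Toy data: each set holds the task IDs one agent solved.
solved_by_agent = [{1, 2}, {2, 3}, {3}, {4, 5}, {1}, {6}]
curve = discovery_curve(solved_by_agent)       # [2, 3, 3, 5, 5, 6]

# Crude saturation check: did the final quarter of agents add anything new?
cutoff = 3 * len(curve) // 4
still_rising = curve[-1] > curve[cutoff - 1]   # True here
```

The same cumulative-unique computation drives the "Efficiency (tasks per 100 agents)" metric: final unique count divided by agents processed, scaled by 100.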

📊 COUNCIL METRICS

—
Total Agents
—
Collective Tasks Solved
—
Efficiency (tasks per 100 agents)

⏱️ RUN INFO

Duration: —
Workers: —
Timestamp: —

🧠 COLLECTIVE WISDOM

🏆 TOP 10 EXPERTS

Agent | Architecture | Solved
📈 Discovery Curve - Cumulative Unique Tasks
X = agents processed, Y = unique tasks discovered. Is the curve still rising?
🦎 Architecture Search

Evolutionary Zoo

2500+ mutant architectures with different topologies, brain types, activations, and learning rates — exploring the architectural fitness landscape.

🦎 THE EVOLUTIONARY ZOO - Deep Architectural Mutations
2500 mutant architectures with different topologies (1x1, 2x2, 3x3, 4x1, 1x4, 2x3, 3x2), brain types (MHA, LSTM, RNN, Dense), activations (ReLU, LeakyReLU, SiLU, Tanh), and log-uniform learning rates.
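Sampling one Zoo mutant from those pools might look like the sketch below; the learning-rate range (1e-4 to 1e-1) is an assumption for illustration, as the page doesn't state the actual bounds:

```python
import math
import random

# Option pools taken from the description above.
TOPOLOGIES = ["1x1", "2x2", "3x3", "4x1", "1x4", "2x3", "3x2"]
BRAINS = ["MHA", "LSTM", "RNN", "Dense"]
ACTIVATIONS = ["ReLU", "LeakyReLU", "SiLU", "Tanh"]

def sample_mutant(rng, lr_lo=1e-4, lr_hi=1e-1):
    """Draw one random config; learning rate is log-uniform in [lr_lo, lr_hi]."""
    lr = math.exp(rng.uniform(math.log(lr_lo), math.log(lr_hi)))
    return {
        "topology": rng.choice(TOPOLOGIES),
        "brain": rng.choice(BRAINS),
        "activation": rng.choice(ACTIVATIONS),
        "lr": lr,
    }

rng = random.Random(0)                        # seeded for reproducibility
zoo = [sample_mutant(rng) for _ in range(2500)]
```

Log-uniform sampling spreads draws evenly across orders of magnitude, so small learning rates like 3e-4 are as likely as large ones like 3e-2, rather than the range being dominated by its upper decade.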

📊 ZOO METRICS

—
Total Mutants
—
Collective Tasks Solved
—
Duration

πŸ† HALL OF FAME - TOP 10 MUTANTS

# Mutation Config Solved

🧠 COLLECTIVE WISDOM

🌳 Phylogenetic Tree - Species Comparison
Unique tasks solved by each architecture type (species)
📈 Discovery Curve
Cumulative unique tasks as mutants are processed

🔬 What Do These Results Mean?

Our experiments reveal several key insights about neural network training strategies for few-shot reasoning tasks:

  • StepTweenChain excels at real-time adaptation — By training on every sample without batching, it maintains high accuracy even when rapidly switching between tasks.
  • Collective intelligence matters — No single architecture solves all tasks. The Council of 1000 shows that combining diverse architectures covers more of the task space.
  • Architecture diversity beats optimization — The Evolutionary Zoo demonstrates that exploring different topologies and brain types yields better coverage than hyperparameter tuning alone.
  • Discovery curves are still rising — Adding more architectures continues to discover new solvable tasks, suggesting we haven't hit the ceiling yet.