Research & Benchmarks

ARC-AGI Benchmark Results

Exploring how different neural network training strategies perform on the Abstraction and Reasoning Corpus (ARC) — a benchmark designed to measure genuine artificial general intelligence through novel reasoning tasks that require learning from just a few examples.

400
Training Tasks
400
Evaluation Tasks
6
Training Modes
3500+
Architectures Tested

🧠 What is ARC-AGI?

The Abstraction and Reasoning Corpus (ARC) is a benchmark created by François Chollet to measure machine intelligence in a way that goes beyond pattern recognition. Each task presents a few input-output examples, and the system must infer the underlying transformation rule to apply it to a new input — much like an IQ test for AI.

Unlike traditional ML benchmarks, ARC tasks require genuine abstraction: recognizing objects, understanding spatial relationships, counting, detecting symmetry, and more — all from just 2-4 examples.
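Concretely, each ARC task is distributed as JSON with a few "train" input/output grid pairs and one or more "test" inputs, where grids are small matrices of color indices 0-9. The toy task and brute-force recoloring search below are illustrative only; real ARC transformation rules are far more varied:

```python
# Toy ARC-style task (illustrative; real tasks ship as JSON with the same
# "train"/"test" shape, grids being lists of lists of color indices 0-9).
task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [{"input": [[0, 1], [1, 1]]}],
}

def recolor(grid, mapping):
    """Apply a color-index remapping cell by cell."""
    return [[mapping.get(c, c) for c in row] for row in grid]

def infer_recolor_rule(pairs):
    """Brute-force the single recoloring that explains every training pair."""
    for src in range(10):
        for dst in range(10):
            if all(recolor(p["input"], {src: dst}) == p["output"] for p in pairs):
                return {src: dst}
    return None

rule = infer_recolor_rule(task["train"])          # {1: 2}
print(recolor(task["test"][0]["input"], rule))    # [[0, 2], [2, 2]]
```

A solver only "passes" a task when the predicted test grid matches pixel for pixel, which is why the tables below count pixel-perfect solves rather than partial accuracy.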

📊 What We're Measuring

  • Stability — How consistent is the accuracy over time? (Higher = more stable)
  • Throughput — Samples processed per second during real-time task switching
  • Consistency — How reliably does the model perform across different tasks?
  • Tasks Solved — Number of unique ARC tasks where the model achieved pixel-perfect accuracy

📈 Training Modes

Mode Comparison

Comparing 6 training strategies under real-time task switching: cycling through 400 tasks, one every 100 ms, while trying to maintain accuracy.

⚡ Real-Time Task Switching: Task1 → Task2 → Task3 → ... → Task400 → Task1 → ...
We track pixel accuracy (%) every 100 ms while cycling through 400 ARC tasks. NormalBP pauses to batch-train (reducing throughput), while StepTweenChain trains on every sample and so maintains accuracy across switches.
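Under the hood, the harness can be sketched like this; `DummyModel`, the flat task pairs, and the method names are stand-ins invented for illustration, not the project's actual API:

```python
import time
from itertools import cycle

class DummyModel:
    """Stand-in for a real model: identity 'prediction', no-op training."""
    def predict(self, x):
        return x
    def train_step(self, x, y):
        pass  # a real online learner would update weights here, every sample

def run_switching_benchmark(model, tasks, duration_s=0.3, switch_ms=100):
    """Cycle tasks every switch_ms; record pixel accuracy % per window."""
    windows = []
    task_iter = cycle(tasks)
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        x, y = next(task_iter)                     # switch to the next task
        window_end = time.monotonic() + switch_ms / 1000
        correct = total = 0
        while time.monotonic() < window_end:
            pred = model.predict(x)
            correct += sum(p == t for p, t in zip(pred, y))
            total += len(y)
            model.train_step(x, y)                 # online: no batching pause
        windows.append(100.0 * correct / max(total, 1))
    return windows

# Two flat "tasks": the identity model scores 100% on one, 50% on the other.
acc = run_switching_benchmark(DummyModel(), [([1, 0], [1, 0]), ([0, 1], [1, 1])])
```

The per-window accuracy list is exactly what feeds the stability and consistency scores described below.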
Accuracy Over Time
Comparing stability and throughput across training modes

πŸ† WINNER

β€”

⚠️ WORST

β€”

βš™οΈ SCORING ALGORITHM

Formula:
β€’ Stability = 100 - stddev of window accuracies
β€’ Consistency = % of windows above threshold
β€’ Throughput = outputs per second
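Translated directly into code; the population standard deviation and the 80% consistency threshold are assumptions for illustration, since the page doesn't pin them down:

```python
from statistics import pstdev

def score_run(window_accs, outputs, seconds, threshold=80.0):
    """Score one benchmark run from its per-window accuracy samples."""
    stability = 100.0 - pstdev(window_accs)          # 100 - stddev of windows
    consistency = 100.0 * sum(a >= threshold for a in window_accs) / len(window_accs)
    throughput = outputs / seconds                   # outputs per second
    return stability, consistency, throughput

# Four windows at 90/95/85/90 % accuracy, 5000 outputs in 10 s.
s, c, t = score_run([90.0, 95.0, 85.0, 90.0], outputs=5000, seconds=10.0)
# stability ~= 96.46, consistency = 100.0, throughput = 500.0
```

Note that a mode which pauses to batch-train can keep decent stability while its throughput collapses, which is why the two are scored separately.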

📊 Mode Comparison
10-second real-time task switching benchmark results
Mode | Stability | Throughput | Consistency | Solved | Score
👑 Collective Intelligence

Council of 1000

1000 randomized neural network architectures competing to find unique task solutions. Testing statistical saturation: does the discovery curve flatten or keep rising?

👑 THE COUNCIL OF 1000 - Massive-Scale Architecture Search
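The discovery curve is just the running count of distinct tasks solved as agents are processed in order; a flat tail means saturation. A minimal sketch on toy per-agent solve sets (not real results):

```python
def discovery_curve(solved_by_agent):
    """Cumulative count of unique tasks as each agent's solves arrive."""
    seen, curve = set(), []
    for solved in solved_by_agent:
        seen.update(solved)
        curve.append(len(seen))
    return curve

# Toy data: each set holds the task IDs one agent solved.
solved_by_agent = [{1, 2}, {2, 3}, {3}, {4, 5}, {1}, {6}]
curve = discovery_curve(solved_by_agent)       # [2, 3, 3, 5, 5, 6]

# Crude saturation check: did the final quarter of agents add anything new?
cutoff = 3 * len(curve) // 4
still_rising = curve[-1] > curve[cutoff - 1]   # True here
```

The same cumulative-unique computation drives the "Efficiency (tasks per 100 agents)" metric: final unique count divided by agents processed, scaled by 100.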

📊 COUNCIL METRICS

—
Total Agents
—
Collective Tasks Solved
—
Efficiency (tasks per 100 agents)

⏱️ RUN INFO

Duration: —
Workers: —
Timestamp: —

🧠 COLLECTIVE WISDOM

🏆 TOP 10 EXPERTS

Agent | Architecture | Solved
📈 Discovery Curve - Cumulative Unique Tasks
X = agents processed, Y = unique tasks discovered. Is the curve still rising?
🦎 Architecture Search

Evolutionary Zoo

2500+ mutant architectures with different topologies, brain types, activations, and learning rates — exploring the architectural fitness landscape.

🦎 THE EVOLUTIONARY ZOO - Deep Architectural Mutations
2500 mutant architectures with different topologies (1x1, 2x2, 3x3, 4x1, 1x4, 2x3, 3x2), brain types (MHA, LSTM, RNN, Dense), activations (ReLU, LeakyReLU, SiLU, Tanh), and log-uniform learning rates.
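Sampling one Zoo mutant from those pools might look like the sketch below; the learning-rate range (1e-4 to 1e-1) is an assumption for illustration, as the page doesn't state the actual bounds:

```python
import math
import random

# Option pools taken from the description above.
TOPOLOGIES = ["1x1", "2x2", "3x3", "4x1", "1x4", "2x3", "3x2"]
BRAINS = ["MHA", "LSTM", "RNN", "Dense"]
ACTIVATIONS = ["ReLU", "LeakyReLU", "SiLU", "Tanh"]

def sample_mutant(rng, lr_lo=1e-4, lr_hi=1e-1):
    """Draw one random config; learning rate is log-uniform in [lr_lo, lr_hi]."""
    lr = math.exp(rng.uniform(math.log(lr_lo), math.log(lr_hi)))
    return {
        "topology": rng.choice(TOPOLOGIES),
        "brain": rng.choice(BRAINS),
        "activation": rng.choice(ACTIVATIONS),
        "lr": lr,
    }

rng = random.Random(0)                        # seeded for reproducibility
zoo = [sample_mutant(rng) for _ in range(2500)]
```

Log-uniform sampling spreads draws evenly across orders of magnitude, so small learning rates like 3e-4 are as likely as large ones like 3e-2, rather than the range being dominated by its upper decade.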

📊 ZOO METRICS

—
Total Mutants
—
Collective Tasks Solved
—
Duration

πŸ† HALL OF FAME - TOP 10 MUTANTS

# Mutation Config Solved

🧠 COLLECTIVE WISDOM

🌳 Phylogenetic Tree - Species Comparison
Unique tasks solved by each architecture type (species)
📈 Discovery Curve
Cumulative unique tasks as mutants are processed

🔬 What Do These Results Mean?

Our experiments reveal several key insights about neural network training strategies for few-shot reasoning tasks:

  • StepTweenChain excels at real-time adaptation — By training on every sample without batching, it maintains high accuracy even when rapidly switching between tasks.
  • Collective intelligence matters — No single architecture solves all tasks. The Council of 1000 shows that combining diverse architectures covers more of the task space.
  • Architecture diversity beats optimization — The Evolutionary Zoo demonstrates that exploring different topologies and brain types yields better coverage than hyperparameter tuning alone.
  • Discovery curves are still rising — Adding more architectures continues to discover new solvable tasks, suggesting we haven't hit the ceiling yet.