The full implementation of this tag-partitioned training approach is available as open source on GitHub: NeuralArena: MNIST Neural Network Partitioning. The repository includes the Go-based neural architecture, the training strategy, and the experiment setup for reproducing the results.
The training setup divides the MNIST dataset by a tag (even vs. odd digit labels) and routes each subset through a different partition of a single neural network. In each epoch, two separate forward-and-backward passes are performed: one on even-labeled examples (tag 0) activating one subset of the network, and one on odd-labeled examples (tag 1) activating a different subset. The network’s layers are partially connected so that each “tag” uses its own dedicated subnetwork (with some possible shared layers) while the other part remains inactive. This effectively trains two specialized pathways within one model, each expert in classifying either even or odd digits, analogous to an ensemble of experts within one neural net.
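To make the routing concrete, the sketch below shows the per-epoch control flow in Go. The names (Example, tagOf, trainPartition) are hypothetical stand-ins rather than the NeuralArena API: trainPartition represents one forward/backward pass over the subnetwork dedicated to a tag, while the other partition's weights are left untouched.

```go
package main

import "fmt"

// Example is one MNIST sample; the label's parity serves as the routing tag.
type Example struct {
	Pixels []float64
	Label  int
}

// tagOf maps a sample to its partition: 0 for even labels, 1 for odd.
func tagOf(e Example) int { return e.Label % 2 }

// trainPartition stands in for one forward/backward pass over the
// subnetwork dedicated to the given tag; the other partition stays inactive.
func trainPartition(tag int, batch []Example) {
	fmt.Printf("epoch pass: tag %d, %d samples\n", tag, len(batch))
}

func main() {
	data := []Example{{Label: 4}, {Label: 7}, {Label: 0}, {Label: 3}}

	// Split the epoch's data by tag, then run one pass per partition.
	byTag := map[int][]Example{}
	for _, e := range data {
		byTag[tagOf(e)] = append(byTag[tagOf(e)], e)
	}
	for t := 0; t <= 1; t++ {
		trainPartition(t, byTag[t])
	}
}
```

The essential point is that each epoch performs two passes, and the gradients from each pass update only the weights belonging to that tag's partition (plus any shared layers).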
The described approach is a form of conditional computation, where only a portion of the model is active for a given input. Mixture-of-Experts (MoE) models, originally introduced by Jacobs et al. (1991), consist of multiple expert sub-networks, each trained on a subset of the data, combined with a gating mechanism that selects an expert per input (IBM, "What is mixture of experts?"). Modern MoE systems (e.g., Shazeer et al., 2017) use a trainable gating network that learns to route each input to one or a few experts, activating only those parts of the model for that example (arXiv:2503.07137, "A Comprehensive Survey of Mixture-of-Experts"). This yields sparse activation and can greatly expand model capacity without a proportional increase in compute. In essence, the tag-partitioning described here is equivalent to a two-expert MoE with a hardwired gating function (even/odd label parity) in place of a learned gating network.
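A rough sketch of that equivalence, again with illustrative names rather than any particular library's API: the fixed routing rule reduces to a gating function over the label, with each expert shown here as a stub forward pass.

```go
package main

import "fmt"

// Expert is a sub-network's forward pass, reduced to a function for this sketch.
type Expert func(x []float64) []float64

// hardGate is the fixed routing rule: label parity picks the expert.
// A learned MoE (Shazeer et al., 2017) would instead compute routing
// scores from the input features with a trainable gating network.
func hardGate(label int) int { return label % 2 }

func main() {
	experts := []Expert{
		func(x []float64) []float64 { return x }, // expert 0: even digits (stub)
		func(x []float64) []float64 { return x }, // expert 1: odd digits (stub)
	}

	sample := []float64{0.1, 0.9}
	label := 7
	out := experts[hardGate(label)](sample)
	fmt.Printf("label %d routed to expert %d, output %v\n", label, hardGate(label), out)
}
```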
Researchers have explored architectures where certain neurons or layers are active only under specific conditions. For example, conditional gating of channels or layers has been employed in continual learning settings where task-specific masks or gating modules activate different filters based on the task or input (Conditional Channel Gated Networks). In these approaches, a task label or a learned classifier is used to select a subnetwork for each input, conceptually similar to using a “tag” to choose a partition of the model.
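A minimal illustration of this masking idea, assuming a fixed half-and-half split of one hidden layer rather than the learned, task-conditional masks used in those papers (all names are hypothetical):

```go
package main

import "fmt"

// maskFor returns a per-neuron binary mask for a hidden layer of the given
// width: tag 0 activates the first half, tag 1 the second half. Conditional
// gating methods learn such masks per task; here the split is fixed.
func maskFor(tag, width int) []float64 {
	mask := make([]float64, width)
	half := width / 2
	for i := range mask {
		if (tag == 0 && i < half) || (tag == 1 && i >= half) {
			mask[i] = 1
		}
	}
	return mask
}

// applyMask zeroes the activations of neurons outside the tag's partition,
// so gradients only flow through the active subnetwork.
func applyMask(activations, mask []float64) []float64 {
	out := make([]float64, len(activations))
	for i, a := range activations {
		out[i] = a * mask[i]
	}
	return out
}

func main() {
	acts := []float64{0.9, 0.1, 0.4, 0.7}
	fmt.Println(applyMask(acts, maskFor(0, len(acts)))) // [0.9 0.1 0 0]
	fmt.Println(applyMask(acts, maskFor(1, len(acts)))) // [0 0 0.4 0.7]
}
```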
Other strategies split training by groups of classes. For instance, Hinton et al. (2015) introduced "specialist" networks that focus on subsets of classes the main model finds confusable, and SplitNet (Kim et al., 2017) automatically partitions a network by grouping output classes into disjoint sub-networks. The even/odd division here is a manually defined instance of this idea, with two disjoint branches handling different groups of classes.
The idea of having a single neural network with distinct sub-networks activated per condition is well-established. This strategy is essentially a manual implementation of conditional computation, closely related to established methods such as mixture-of-experts and multi-task learning.
There is little evidence in the literature of using label parity specifically as a gating signal. While this choice is somewhat contrived, it serves as an example of static expert assignment—using known labels to route training samples to specialized network partitions.
Compared to mixture-of-experts or conditional gating models, this implementation resembles a hard mixture-of-experts with two experts and a fixed routing rule. Although MoE models have been extensively studied, the approach here is a specific instance rather than a fundamentally new paradigm.
Implementing such training schemes in Go is relatively uncommon in both academia and industry. While Go-based frameworks like Gorgonia exist, few research papers focus on conditional computation in Go. The uniqueness of the implementation therefore lies more in the engineering choice than in scientific novelty.
In summary, the tag-partitioned forward/backward pass approach is a specific instance of conditional computation rather than a novel training paradigm. While using label parity as a gating signal is an interesting and instructive demonstration, it aligns closely with existing techniques such as mixture-of-experts and specialist models. Its primary innovation lies in its engineered implementation in Go, making it a valuable educational and experimental tool rather than a fundamentally new research contribution.