Document Version: 3.0
Core Concept: A cognitive learning framework designed to transform fixed hyperparameters (such as learning rate and model capacity) into dynamic policies driven in real time by the intrinsic "surprise" (Surprise) of the data. It is essentially an adaptive hyperparameter scheduling algorithm that allows a model to autonomously decide "how much to learn" and "with what capacity to learn" based on the value of the learning content. This framework originates from the Integrated Predictive Workspace Theory, with further details available in the paper at https://github.com/dmf-archive/IPWT.
Traditional training paradigms rely on manually set hyperparameters that are typically fixed or decay according to a predetermined schedule throughout the training process. This “one-size-fits-all” approach ignores the vast differences in learning value contained in different data batches.
PILF’s design philosophy is: to replace static, human-set rules with dynamic, data-driven policies.
It no longer blindly uses a fixed learning rate or model capacity. Instead, it dynamically and proportionally adjusts its learning behavior by assessing the Surprise from each data batch:
- Dynamic Learning Rate: When Surprise is moderate, it signals valuable "learnable zone" information, and the system assigns a higher learning rate. When Surprise is too low (redundant information) or too high (anomalous information), it assigns a learning rate close to zero, naturally achieving "ignore" and "reject" effects. This directly replaces manually set learning rate schedulers.
- Dynamic Capacity: In a Mixture-of-Experts (MoE) architecture, Surprise not only adjusts the learning rate but also determines the number of "experts" k to activate. Simple tasks (low Surprise) require only a few experts, while complex tasks (high Surprise) dynamically engage more experts. This replaces fixed Top-K routing. A minimal sketch of both policy functions follows this list.
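To make these two policies concrete, here is a minimal sketch of a surprise-to-learning-rate mapping f(S) and a surprise-to-capacity mapping g(S). The Gaussian form of f follows the formula given later in this document; the linear form of g and all constants here are illustrative assumptions, not the repository's implementation.

```python
import math

def lr_modifier(surprise: float, mu: float, sigma: float) -> float:
    """f(S): Gaussian response -- highest when Surprise sits near its running mean (mu),
    falling toward 0 for very low (redundant) or very high (anomalous) Surprise."""
    return math.exp(-0.5 * ((surprise - mu) / sigma) ** 2)

def num_experts(surprise: float, mu: float, sigma: float, k_min: int = 1, k_max: int = 4) -> int:
    """g(S): illustrative capacity policy -- activate more experts as Surprise rises
    above its running mean (the exact mapping is an assumption)."""
    z = max(0.0, (surprise - mu) / sigma)              # how far above "normal" this batch is
    k = k_min + int(round(min(z, 3.0) / 3.0 * (k_max - k_min)))
    return min(k, k_max)

# Example: with mu=1.0 and sigma=0.5, a moderate Surprise of 1.2 keeps ~92% of the base
# learning rate, while a Surprise of 3.0 is damped to ~0 yet routed to the full k_max experts.
print(lr_modifier(1.2, 1.0, 0.5), lr_modifier(3.0, 1.0, 0.5), num_experts(3.0, 1.0, 0.5))
```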
PILR-S is the direct application of the PILF idea to any standard neural network. It focuses on one question: how to dynamically adjust the learning rate based on Surprise? This is achieved using the core calculation toolkit from the SigmaPI project, which is a required dependency. The testing framework and experiments for PILF are detailed in Section 3.
It replaces the traditional "gating" logic of whether to execute `optimizer.step()` with a smooth, continuous learning rate modulator.
```mermaid
sequenceDiagram
    participant Trainer
    participant Model
    participant SigmaPI_Monitor
    participant LRScheduler as PILR-S
    participant Optimizer

    Trainer->>Model: Feedforward
    Model-->>Trainer: Return logits
    Trainer->>SigmaPI_Monitor: calculate(model, logits)
    SigmaPI_Monitor-->>Trainer: Return pi_metrics (incl. Surprise)
    Trainer->>LRScheduler: update(Surprise)
    activate LRScheduler
    LRScheduler->>LRScheduler: lr_modifier = gaussian(Surprise, EMA, std)
    LRScheduler-->>Trainer: Return lr_modifier
    deactivate LRScheduler
    Trainer->>Trainer: Calculate loss & loss.backward()
    Trainer->>Optimizer: Set effective_lr = base_lr * lr_modifier
    Trainer->>Optimizer: step()
    Trainer->>Optimizer: Restore base_lr
```
Mechanism Explained:

- Surprise Calculation: Currently, Surprise is calculated using the norm of the backpropagation gradients. In the future, it is entirely feasible to use accumulated gradients from the Forward-Forward Algorithm as the source of surprise. This process would not need to wait for expensive backpropagation, allowing for a rapid assessment of learning value.
- Dynamic Modulation: The PILR-S module receives the Surprise and calculates a smooth modulation factor `lr_modifier` (ranging from 0 to 1) using a Gaussian function `exp(-0.5 * ((surprise - mu) / sigma)^2)`, based on its relationship with the Exponential Moving Average (EMA) and standard deviation (std) of Surprise.
- Weight Update: The standard `loss.backward()` is executed only after `lr_modifier` is calculated. Subsequently, the optimizer uses `effective_lr = base_lr * lr_modifier` to perform the weight update. `optimizer.step()` is always executed, but its update magnitude has been pre-emptively and dynamically scaled by Surprise. A self-contained sketch of this scheduler appears after this list.
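Below is a minimal, self-contained sketch of the loop described above. Names such as `SurpriseLRScheduler`, `grad_norm_surprise`, and `train_step` are illustrative rather than the SigmaPI/PILF API, and for simplicity the sketch derives Surprise from the gradient norm after `loss.backward()`, whereas the pipeline above obtains it from the SigmaPI monitor on the logits before backpropagation.

```python
import math
import torch

def grad_norm_surprise(model: torch.nn.Module) -> float:
    """Illustrative Surprise proxy: the global L2 norm of the current gradients."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return math.sqrt(total)

class SurpriseLRScheduler:
    """Tracks an EMA and std of Surprise and emits a Gaussian lr_modifier in (0, 1]."""
    def __init__(self, momentum: float = 0.99, eps: float = 1e-8):
        self.momentum, self.eps = momentum, eps
        self.ema, self.var = None, 0.0

    def update(self, surprise: float) -> float:
        if self.ema is None:                       # first batch: initialize the statistics
            self.ema = surprise
            return 1.0
        delta = surprise - self.ema
        self.ema += (1 - self.momentum) * delta    # exponential moving average
        self.var = self.momentum * (self.var + (1 - self.momentum) * delta * delta)
        std = math.sqrt(self.var) + self.eps
        return math.exp(-0.5 * ((surprise - self.ema) / std) ** 2)

# Usage inside a standard training step (base_lr is the optimizer's nominal learning rate):
def train_step(model, optimizer, scheduler, loss_fn, x, y, base_lr):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                                # gradients needed for the Surprise proxy
    lr_mod = scheduler.update(grad_norm_surprise(model))
    for group in optimizer.param_groups:
        group["lr"] = base_lr * lr_mod             # effective_lr = base_lr * lr_modifier
    optimizer.step()                               # always executed, magnitude pre-scaled
    for group in optimizer.param_groups:
        group["lr"] = base_lr                      # restore base_lr for the next batch
    return loss.item(), lr_mod
```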
PILF is the full implementation on an MoE architecture, extending the dynamic scheduling concept to model capacity allocation.
```mermaid
graph TD
    Input --> InitialSurprise["Initial Surprise Assessment"]

    subgraph DynamicPolicy [Surprise-Driven Dynamic Policy]
        direction LR
        InitialSurprise -- "g(Surprise)" --> k_Value["k = g(S)"]
        InitialSurprise -- "f(Surprise)" --> lr_mod_Value["lr_mod = f(S)"]
    end

    k_Value --> HierarchicalGatingNetwork["Hierarchical Gating (route to k experts)"]
    HierarchicalGatingNetwork --> MicroExpertPool[...]
    MicroExpertPool --> Aggregator
    Aggregator --> Logits
    Logits --> LossCalculation
    LossCalculation -- Gradients --> SelectiveUpdate

    subgraph SelectiveUpdate [Selective Update Module]
        direction LR
        lr_mod_Value --> SetLR["Set effective_lr"]
        SetLR --> OptimizerStep["Optimizer.step()"]
    end

    OptimizerStep -- Updates only active experts & gating --> FinalModel
```
Training Loop Explained:

- Dual Dynamic Decision: The model receives data and calculates an initial Surprise. Based on this Surprise, PILF makes two decisions in parallel:
  - Capacity Decision: `k = g(Surprise)`, determining how many experts to activate.
  - Learning Rate Decision: `lr_modifier = f(Surprise)`, determining the learning intensity.
- Dynamic Routing and Computation: The gating network routes the task to the most appropriate experts based on the `k` value.
- Dynamic Weight Update: After calculating the loss and gradients, the optimizer uses the effective learning rate modulated by `lr_modifier` to update only the activated experts and the gating network. A condensed sketch of this step follows below.
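A condensed sketch of this dual decision is shown below, reusing `num_experts` (g) and `SurpriseLRScheduler` (f) from the earlier sketches. The `top_k=` argument on the MoE forward pass and the use of a cheap no-grad loss as the initial Surprise proxy are assumptions made for illustration, not the repository's interfaces.

```python
import torch

def pilf_step(moe_model, optimizer, scheduler, loss_fn, x, y, base_lr, k_default=2, k_max=16):
    """One PILF-style training step (interfaces such as `top_k=` are assumptions)."""
    # 1) Initial Surprise assessment: approximated here by the loss of a cheap,
    #    no-grad forward pass with a small default expert budget.
    with torch.no_grad():
        surprise = loss_fn(moe_model(x, top_k=k_default), y).item()

    # 2) Dual dynamic decision driven by the same Surprise value.
    mu = scheduler.ema if scheduler.ema is not None else surprise
    k = num_experts(surprise, mu, sigma=1.0, k_min=1, k_max=k_max)   # g(S): capacity
    lr_mod = scheduler.update(surprise)                              # f(S): learning intensity

    # 3) Dynamic routing, loss, and selective update of the active experts + gating network.
    optimizer.zero_grad()
    loss = loss_fn(moe_model(x, top_k=k), y)
    loss.backward()                          # gradients exist only for the k active experts
    for group in optimizer.param_groups:
        group["lr"] = base_lr * lr_mod       # effective_lr = base_lr * lr_modifier
    optimizer.step()
    for group in optimizer.param_groups:
        group["lr"] = base_lr                # restore base_lr for the next batch
    return loss.item(), k, lr_mod
```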
Our test suite is now centered around a lightweight (~1M parameter) Vision Transformer architecture to facilitate rapid experimentation on cognitive learning principles. We compare three main variants on CIFAR-10, using SVHN as an Out-of-Distribution (OOD) validation set.
The goal is to observe how different learning strategies perform under resource constraints, providing a clearer view of the benefits of mechanisms like Predictive Integrity Learning Rate Scheduler (PILR-S).
“Don’t just train your model. Understand its mind.”
| Baseline ViT | 4×1 MoE-ViT | 16×4 MoE-ViT | 16×4 PILR-S-MoE-ViT with 3σ Learning |
| --- | --- | --- | --- |
| ~0.81M | ~1.21M | ~1.23M | ~1.23M |
We also conducted rehearsal experiments on MNIST and FashionMNIST datasets to further explore continual learning capabilities.
| 8×2 all time (FashionMNIST -> MNIST) | 8×2 in pretrain + 8×2 PILR-S in rehearsal (FashionMNIST -> MNIST) | 8×2 PILR-S all time (FashionMNIST -> MNIST) |
| --- | --- | --- |
This project relies on the `sigma-pi` package for core calculations. To replicate the experiments and use the full testing framework, you must first clone this repository:
```bash
git clone https://github.com/dmf-archive/PILF.git
cd PILF
```
Note: This package does not automatically install PyTorch. Please manually install the appropriate version for your system (CPU or CUDA) before proceeding. For CUDA-enabled systems, it is recommended to use `uv` or `pip`:
```bash
# Example for CUDA 12.1
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
After setting up PyTorch, install the testing framework dependencies:
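The exact command depends on how the repository packages its test dependencies; a common pattern, assuming a `requirements.txt` at the repository root (an assumption, not confirmed here), is:

```bash
# Assumed dependency file name; check the repository for the actual instructions.
pip install -r requirements.txt
```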
The testing framework is modular and configuration-driven.
Create or modify a configuration file in `test/configs/`. For example, `test/configs/base_vit.py`:
```python
# test/configs/base_vit.py

# Model parameters
model_config = {
    'model_type': 'base',
    'embed_dim': 128,
    'depth': 6,
    # ... other model params
}

# Training parameters
train_config = {
    'epochs': 20,
    'batch_size': 256,
    # ... other training params
}
```
Launch the experiment from the root directory using the `test/run_experiment.py` script:
```bash
python test/run_experiment.py --config test/configs/base_vit.py
```
To run the other variants, simply point to their respective config files:
```bash
# Run MoE-ViT experiment
python test/run_experiment.py --config test/configs/moe_vit.py

# Run PILR-S-MoE-ViT experiment
python test/run_experiment.py --config test/configs/gbp_moe_vit.py
```
- Transforms Hyperparameters into Policies: Converts learning rate and model capacity from developer-set “static hyperparameters” into “dynamic policies” that the model adjusts autonomously based on data value.
- Unifies "Learning" and "Forgetting": By linking the learning rate to Surprise, PILF provides a unified framework to handle learning, ignoring (low Surprise leads to low `lr`), and rejecting (high Surprise leads to low `lr`), thereby intrinsically mitigating catastrophic forgetting.
- On-Demand Resource Allocation: PILF achieves true on-demand computation, where simple tasks consume minimal resources and complex tasks dynamically call upon more resources, significantly improving efficiency.
This project is licensed under the AGPLv3. See the `LICENSE` file for details.