Tech Showcase: XCS for Low SWaP Online Reinforcement Learning

Bringing Intelligent Learning to Resource-Constrained Environments

At Transparent AI, we're committed to developing machine learning solutions that are not only powerful but also explainable, efficient, and suitable for deployment on devices with limited size, weight, and power (SWaP) constraints. Today, we're excited to share our latest project: implementing eXtended Classifier Systems (XCS) on a Raspberry Pi to showcase the potential of rule-based reinforcement learning on low-power devices.

What Are Learning Classifier Systems and Why Do They Matter?

Learning Classifier Systems (LCS) are a family of rule-based machine learning algorithms that combine reinforcement learning with evolutionary computing. The eXtended Classifier System (XCS) is a particularly successful variant that has several distinct advantages over neural networks for certain applications:

  1. True Online Learning: Unlike neural networks that often suffer from catastrophic forgetting when learning new information, XCS performs true online reinforcement learning. It can continuously adapt to new situations without forgetting previously learned knowledge, making it ideal for systems that must learn in the field.

  2. Transparent and Explainable: XCS produces human-readable rules that clearly explain how decisions are made. Each rule associates a specific state pattern with an action and a payoff prediction. For example, in our Tic-tac-toe implementation, a rule might state: "If the opponent has two marks in a row and the third cell is empty, then place your mark in the empty cell to block." (A minimal sketch of this rule format appears after this list.)

  3. Modular Knowledge Sharing: Perhaps most importantly for multi-agent systems, XCS's rule-based structure enables agents to share discrete units of knowledge with each other. Instead of transferring entire weight matrices, as a neural network would require, XCS agents can exchange individual rules or rule sets, facilitating more efficient multi-agent reinforcement learning (MARL).
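
To make that rule format concrete, here is a minimal Python sketch of how such a classifier might be represented. The field names, the ternary-plus-wildcard board encoding, and the example blocking rule are illustrative assumptions for this post, not our production data structures.

```python
from dataclasses import dataclass

# Board cells encoded as '0' (empty), '1' (our mark), '2' (opponent's mark),
# with '#' as the "don't care" wildcard used only in rule conditions.
@dataclass
class Classifier:
    condition: str     # 9-character pattern over {'0', '1', '2', '#'}
    action: int        # cell index 0-8 to play
    prediction: float  # expected payoff when this rule fires
    error: float       # running prediction error
    fitness: float     # relative accuracy used for selection

    def matches(self, state: str) -> bool:
        """A rule matches when every non-wildcard symbol equals the state."""
        return all(c == '#' or c == s for c, s in zip(self.condition, state))

# Illustrative rule: the opponent holds the first two cells of the top row and
# the third is empty, so play cell 2 to block.
block_top_row = Classifier(condition="220######", action=2,
                           prediction=1.0, error=0.0, fitness=1.0)

print(block_top_row.matches("220102010"))  # True: the rule fires on this board
```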

Rule Induction: Improving the Learning Process

In our implementation, we've replaced the traditional genetic algorithm component of XCS with a more targeted rule induction methodology. While genetic algorithms work well for unstructured problems, the more structured nature of our demonstration games (Tic-tac-toe and Iterated Prisoner's Dilemma) benefits from this refined approach.

Our rule induction process creates new rules through three mechanisms:

  • Specialization: Making general rules more specific by replacing wildcards with concrete values

  • Generalization: Creating broader rules by replacing specific values with wildcards

  • Pattern Extraction: Identifying common patterns across successful rules

Compared with the sometimes chaotic behavior of genetic operators, this approach yields more stable learning and faster convergence to optimal solutions.
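
As a rough illustration of the specialization, generalization, and pattern-extraction mechanisms described above, the sketch below operates on rule conditions in the ternary-plus-wildcard encoding. The triggering logic (when each operator is applied, which rules are chosen as parents) is an assumption made for readability, not our exact induction schedule.

```python
import random

WILDCARD = '#'

def specialize(condition: str, state: str) -> str:
    """Replace one wildcard with the value observed in the current state,
    making the rule apply to fewer situations."""
    slots = [i for i, c in enumerate(condition) if c == WILDCARD]
    if not slots:
        return condition
    i = random.choice(slots)
    return condition[:i] + state[i] + condition[i + 1:]

def generalize(condition: str) -> str:
    """Replace one concrete value with a wildcard, broadening the rule."""
    slots = [i for i, c in enumerate(condition) if c != WILDCARD]
    if not slots:
        return condition
    i = random.choice(slots)
    return condition[:i] + WILDCARD + condition[i + 1:]

def extract_pattern(conditions: list[str]) -> str:
    """Keep the symbols that all successful rules agree on; wildcard the rest."""
    return ''.join(cs[0] if len(set(cs)) == 1 else WILDCARD
                   for cs in zip(*conditions))

print(extract_pattern(["220######", "22#0#####", "2200#####"]))  # '22#######'
```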

The Tech Showcase: Learning Games on a Raspberry Pi

To demonstrate the capabilities of XCS on low SWaP systems, we implemented two classic game environments on a Raspberry Pi 5. What makes this showcase particularly impressive is the online reinforcement learning aspect—our XCS agents take feedback and learn from each individual game played, continuously improving their performance without any "training phase" separate from deployment.

To accelerate the learning process, we created teacher agents that play against the XCS agents, allowing them to experience thousands of games in minutes rather than the hours or days it would take with human opponents. However, the learning process is identical to what would occur during human play—the XCS agent would develop the same strategies if trained one game at a time against a human player, just more slowly.
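
As a rough sketch of what "learning from each individual game" looks like, the snippet below applies the Widrow-Hoff style updates used in XCS to the rules that proposed the chosen action once a game's payoff is known. The dictionary fields, learning rate, and toy reward are illustrative assumptions, not our actual API.

```python
BETA = 0.2  # learning rate for the Widrow-Hoff style updates used by XCS

def update_action_set(action_set, reward):
    """After each game, track each rule's running prediction error and nudge
    its payoff prediction toward the observed reward; there is no separate
    training phase."""
    for rule in action_set:
        rule["error"] += BETA * (abs(reward - rule["prediction"]) - rule["error"])
        rule["prediction"] += BETA * (reward - rule["prediction"])

# Toy run: one rule reinforced over 20 simulated games with a win payoff of 1.0.
rule = {"condition": "220######", "action": 2, "prediction": 0.0, "error": 0.0}
for _ in range(20):
    update_action_set([rule], reward=1.0)
print(round(rule["prediction"], 3))  # climbs toward 1.0 as games accumulate
```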

Tic-Tac-Toe: Learning Perfect Play in a Complex State Space

While Tic-tac-toe might seem simple, the rule space an XCS agent must search is surprisingly large. With each of the 9 cells taking one of 4 values in a rule condition (empty, X, O, or a "don't care" wildcard), the theoretical rule space is 4^9, or 262,144 possible patterns. Under perfect play by both sides the game always ends in a draw, so forcing a draw against an optimal opponent is the best achievable outcome.

We implemented two types of teacher agents:

  1. Heuristic Teacher: Uses simple rules like "win if possible" and "block opponent's winning moves," but doesn't look ahead multiple moves.

  2. MinMax Teacher: Implements the minimax algorithm with alpha-beta pruning to play optimally, looking ahead to the end of the game (a minimal sketch of this search follows the list).
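
For context on the second teacher, here is a minimal sketch of minimax search with alpha-beta pruning on a Tic-tac-toe board. The board encoding and helper names are illustrative, not the teacher agent's actual implementation.

```python
# Board is a list of 9 cells containing 'X', 'O', or ' ' (empty).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player, me, alpha=-2, beta=2):
    """Return (score, move) with `player` to act, scored from `me`'s view."""
    won = winner(board)
    if won is not None:
        return (1 if won == me else -1), None
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if not moves:
        return 0, None  # draw
    best_move = None
    for i in moves:
        board[i] = player
        score, _ = minimax(board, 'O' if player == 'X' else 'X', me, alpha, beta)
        board[i] = ' '
        if player == me:           # maximizing player
            if score > alpha:
                alpha, best_move = score, i
        else:                      # minimizing player
            if score < beta:
                beta, best_move = score, i
        if alpha >= beta:
            break                  # prune the remaining moves
    return (alpha if player == me else beta), best_move

# Example: 'X' to move should take the winning cell 2 on this board.
print(minimax(list("XX OO    "), 'X', 'X')[1])  # 2
```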

Results Against the Heuristic Teacher:

  • XCS learned to play to a draw within the first 250 games

  • Training on 1000 games took only 10 seconds

  • The final rule population contained just 650 rules—roughly 0.25% of the theoretical rule space

  • This compact rule set demonstrates the agent's ability to generalize effectively

Results Against the MinMax Teacher:

  • Learning against this optimal player was more challenging

  • Required approximately 800 games to consistently achieve draws

  • Training time increased to about 500 seconds for 1000 games

  • Final population size was slightly larger at 700 rules (roughly 0.27% of the rule space)

Iterated Prisoner's Dilemma: Outsmarting Tit-for-Tat

The Iterated Prisoner's Dilemma (IPD) is a cornerstone of game theory, modeling cooperation and competition between rational agents. Since Robert Axelrod's famous tournaments in the 1980s, researchers have sought algorithms that perform well in this seemingly simple but strategically deep game.

Tit-for-Tat emerged as a remarkably successful strategy: cooperate on the first move, then simply copy what your opponent did in the previous round. Despite its simplicity, Tit-for-Tat has proven difficult to consistently outperform due to its balance of cooperation with swift punishment for defection.
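For readers unfamiliar with the setup, the sketch below shows Tit-for-Tat alongside the conventional Axelrod payoff values (3 points each for mutual cooperation, 1 each for mutual defection, 5 and 0 when one player defects against a cooperator). The function names and round-playing harness are illustrative, not our tournament code.

```python
# Conventional IPD payoffs: (my payoff, opponent payoff) for each move pair,
# with 'C' = cooperate and 'D' = defect.
PAYOFFS = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
           ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def tit_for_tat(my_history, opponent_history):
    """Cooperate on the first move, then copy the opponent's last move."""
    return 'C' if not opponent_history else opponent_history[-1]

def play_round(strategy_a, strategy_b, history_a, history_b):
    move_a = strategy_a(history_a, history_b)
    move_b = strategy_b(history_b, history_a)
    history_a.append(move_a)
    history_b.append(move_b)
    return PAYOFFS[(move_a, move_b)]

# Two Tit-for-Tat players lock into mutual cooperation (3 points each round).
ha, hb = [], []
print([play_round(tit_for_tat, tit_for_tat, ha, hb) for _ in range(3)])
# [(3, 3), (3, 3), (3, 3)]
```
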

Our XCS agent's performance against Tit-for-Tat was impressive:

  • Learned an effective counter-strategy within just 25 games

  • Over 1000 games in 40 seconds, the XCS agent maintained a 25-game rolling average score that matched or exceeded Tit-for-Tat's

  • The rule population remained extremely compact at only 26 rules throughout training

This doesn't mean the XCS agent won every single game—rather, it learned from any losses and adapted its strategy to maintain a winning average over time, demonstrating the power of online reinforcement learning.

Resource Efficiency: Truly Low SWaP Performance

One of the most impressive aspects of this demonstration was the minimal resource footprint of the XCS agents:

  • Memory Usage:

    • Tic-Tac-Toe: Only 62MB of the 16GB available RAM

    • Iterated Prisoner's Dilemma: Just 115MB of RAM

  • Power Consumption:

    • Total power draw: ~4.5 watts

    • Raspberry Pi 5 idle consumption: 2.4 watts

    • Net power used by XCS models: ~2.1 watts

These figures highlight why XCS is particularly well-suited for low SWaP applications where neural networks might be prohibitively resource-intensive.
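
As a rough indication of how numbers like these can be gathered, the snippet below reports the resident memory of the running process using psutil. The power figures quoted above were measured at the device rather than in software, so treat this as a sketch of one possible measurement setup, not our exact instrumentation.

```python
import os
import psutil  # third-party: pip install psutil

# Resident set size (RSS) of the current process, in megabytes. Sampling a
# figure like this while an agent trains gives memory numbers comparable to
# those quoted above.
rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
print(f"Resident memory: {rss_mb:.1f} MB")
```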

Conclusion: Applications and Future Directions

Our demonstration successfully showcases the potential of XCS for online reinforcement learning in low SWaP environments. The implications extend far beyond game playing, with applications including:

  • Unmanned Vehicles and Drones: Adaptive navigation and decision-making with minimal power requirements

  • Remote Sensors: Smart sensors that learn from environmental patterns without requiring constant retraining

  • Edge Computing Devices: Intelligent systems that can adapt to user behavior or changing conditions

  • Field-Deployed Systems: Any application where systems must learn in situ without relying on cloud connections

The explainable nature of XCS's rules also makes these systems more trustworthy in critical applications, as decisions can be audited and understood by human operators.

At Transparent AI, we're continuing to explore how rule-based machine learning can provide effective alternatives to neural networks in resource-constrained environments. This demonstration represents just one step toward more efficient, explainable, and adaptable AI systems for the real world.
