Chapter 9

Realisations

This chapter documents three deployed realisations of the distributed reinforcement learning framework developed in the preceding chapters. The first is a hardware device—a USB-C prototype built around a Raspberry Pi Zero 2 W, bench-tested at the signal level and validated for end-to-end signal flow, with task-level evaluation remaining future work. The second is EnvCraft (https://envcraft.com), the production deployment of the environment-generation research of Chapter 6, in which users describe games in plain language and receive validated, browser-playable Gymnasium environments through a nine-stage generation pipeline. The third is RLPlayground (https://rlplayground.com), the production deployment of the CALF framework of Chapter 8, in which users run personal distributed reinforcement learning sessions over the NEXUS relay infrastructure with GPU-accelerated DQN training. Together, the three realisations show that each element of the thesis's deployment stack—environment generation, edge encoding, and distributed execution—has been instantiated in operational form, with all three sharing a common interface contract that makes them interoperable in principle.

Chapter Abstract

9.1

Introduction

The preceding chapters resolved each of the thesis's deployment challenges in turn. Policy graphs (Chapter 5) provide the modular execution abstraction, decomposing complex control into callable specialists with hard routing and commitment bounds. EnvCraft (Chapter 6) provides the validated training environment diversity that makes generalisation measurable beyond narrow fixed benchmarks. MiniConv (Chapter 7) provides edge-efficient visual encoding, compiling convolutional inference to OpenGL fragment shaders executable on commodity embedded GPUs. CALF (Chapter 8) provides network-aware distributed execution, training policies to remain robust under the latency, jitter, and packet loss that real communication channels impose. This chapter documents the points at which those contributions meet the world: three deployed systems, each a realisation of one or more of the contributions, each operational today.

The first realisation is physical. A hardware device built around a \$15 (USD) Raspberry Pi Zero 2 W receives display output from a host computer via DisplayPort Alt Mode over a single USB-C connection, runs MiniConv inference locally on the VideoCore GPU, and returns control decisions as USB Human Interface Device (HID) events over the same connector—placing a trained policy graph inside an unmodified machine's input chain without software modification of the host. The device is built and bench-tested at the signal level; full task-level evaluation of trained policy graphs on the assembled hardware remains future work. The second realisation is digital. EnvCraft (https://envcraft.com) is a publicly accessible web service that instantiates the environment-generation pipeline of Chapter 6 in production form, allowing any user to describe a game or task in plain language and receive a validated, browser-playable Gymnasium environment within minutes. The third realisation is infrastructural. RLPlayground (https://rlplayground.com) is a hosted deployment of the CALF framework of Chapter 8, in which users run personal distributed training sessions connected via the NEXUS relay to GPU-accelerated DQN training, with trained agents watchable in the browser.

What makes these three systems realisations of a single vision rather than three separate projects is the interface contract they share. The observation space exposed by every EnvCraft environment is an RGB pixel array $\mathtt{Box}(0,\,255,\,(H,\,W,\,3),\,\mathtt{uint8})$ with $H, W \in [64, 512]$ ; the action space is $\mathtt{MultiDiscrete}([5,\,2,\,2])$ —a five-direction movement command and two binary auxiliary buttons. This is the same contract that BrowserEnv defined in Chapter 5, that MiniConv encodes in Chapter 7, that CALF distributes in Chapter 8, and that the hardware device exposes at the USB-C connection: the Pi captures pixel frames from the host display, encodes them with MiniConv, and injects the resulting $\mathtt{MultiDiscrete}$ action through its HID gadget driver. An environment built in EnvCraft can be trained in RLPlayground and, in principle, deployed on the physical hardware device, because all three speak the same interface language.
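The contract above can be written down as a pair of checks. The following is a minimal sketch in plain Python, deliberately free of the Gymnasium dependency so that it runs on its own; the function names are hypothetical and are not taken from the EnvCraft or RLPlayground code. The flattening helpers show one common way to present a $\mathtt{MultiDiscrete}$ action to a learner that expects a single discrete index; the text does not state whether the deployed services do this.

```python
from typing import Sequence, Tuple

ACTION_NVEC = (5, 2, 2)        # MultiDiscrete([5, 2, 2]) as stated in the text
MIN_SIDE, MAX_SIDE = 64, 512   # H, W in [64, 512] as stated in the text


def valid_observation_shape(shape: Sequence[int]) -> bool:
    """True if shape is (H, W, 3) with H and W inside the contract's range."""
    if len(shape) != 3 or shape[2] != 3:
        return False
    h, w = shape[0], shape[1]
    return MIN_SIDE <= h <= MAX_SIDE and MIN_SIDE <= w <= MAX_SIDE


def valid_action(action: Sequence[int]) -> bool:
    """True if action has one in-range integer per MultiDiscrete component."""
    if len(action) != len(ACTION_NVEC):
        return False
    return all(isinstance(a, int) and 0 <= a < n
               for a, n in zip(action, ACTION_NVEC))


def flatten_action(action: Sequence[int]) -> int:
    """Map a (move, button_a, button_b) triple to one index in [0, 20)."""
    index = 0
    for a, n in zip(action, ACTION_NVEC):
        index = index * n + a   # mixed-radix encoding, most significant first
    return index


def unflatten_action(index: int) -> Tuple[int, ...]:
    """Inverse of flatten_action."""
    parts = []
    for n in reversed(ACTION_NVEC):
        index, a = divmod(index, n)
        parts.append(a)
    return tuple(reversed(parts))
```

The three components give $5 \times 2 \times 2 = 20$ joint actions, so `flatten_action` and `unflatten_action` are mutual inverses over the indices 0 to 19.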

9.2

Hardware

9.2.1

USB-C Signal Path

Figure 9.1: High-level USB-C signal flow for the prototype. A USB Power Delivery controller powers the system and negotiates DisplayPort Alt Mode. USB 2.0 (D+/D$-$) remains connected to the Raspberry Pi in Linux device mode, where it exposes management networking and USB HID functionality. DisplayPort video is routed through an Alt Mode multiplexer, converted to HDMI, then bridged to CSI-2 for capture on the Raspberry Pi.

Figure 9.2: USB-C pinout diagram illustrating the separation of signal pathways. High-speed differential pairs (TX, RX), shown in red, are used for DisplayPort Alt Mode and routed through the Alt Mode switch into the DisplayPort-to-HDMI converter. USB 2.0 data lines (D+, D$-$), shown in dark blue, remain independent and connect directly to the Raspberry Pi for HID and virtual Ethernet functionality. Configuration Channel (CC) and VBUS pins, shown in light blue, enable power negotiation via the power-delivery controller, which distributes 5\,V to the Raspberry Pi and 3.3\,V to the video-conversion components.

A USB-C output from a host computer provides multiple signal pathways over a single connector: USB 2.0 data lines, high-speed differential pairs for DisplayPort Alt Mode, and power-delivery negotiation. As illustrated in Figure 9.2, the USB-C connector separates these into three independent paths. The TPS65987D power-delivery controller manages the negotiation process, enabling DisplayPort Alt Mode and distributing VBUS (5\,V) to the Raspberry Pi whilst generating a 3.3\,V rail for the HD3SS460 DisplayPort multiplexer and the DisplayPort-to-HDMI converter. The USB 2.0 D+/D$-$ lines are routed directly to the Raspberry Pi, allowing it to function as a USB HID device without interference from the DisplayPort switching circuitry.

Once DisplayPort Alt Mode is established, the HD3SS460 multiplexer routes the TX/RX differential pairs to the DisplayPort-to-HDMI converter, which translates them into an HDMI signal. The AUX channel (SBU1/SBU2) is directed to the converter for DisplayPort link training and communication. The HDMI output feeds the Toshiba TC358743XBG, which converts it into a CSI-2 video stream for capture on the Raspberry Pi. Throughout this process, the TPS65987D monitors hot-plug-detect (HPD) signals, ensuring proper display connection status is relayed to the source; the HD3SS460 uses the controller's output to switch differential pairs correctly regardless of USB-C cable orientation.

Power distribution is managed entirely by the TPS65987D, which passes VBUS to the Raspberry Pi whilst supplying 3.3\,V to the video-conversion chain. The Raspberry Pi operates in USB OTG mode, receiving power whilst simultaneously acting as a USB HID device over the USB 2.0 lines. By keeping the USB 2.0 and DisplayPort signal paths independent, the system allows simultaneous HID input, video conversion, and power delivery through a single USB-C connection—the same physical connector that any modern laptop or desktop exposes as a standard display output.

9.2.2

Runtime Path and Prototype Status

The intended runtime is an end-to-end loop. The host emits display output over USB-C via DisplayPort Alt Mode; the power-delivery controller negotiates the session, the Alt Mode multiplexer routes the differential pairs, and the resulting video is bridged into CSI-2 for capture on the Raspberry Pi. The Pi performs the first processing stage: frame acquisition, optional resizing, and lightweight local model execution—specifically, a MiniConv-style visual encoder running as an OpenGL fragment shader on the VideoCore GPU, producing a compact feature tensor per frame at low power consumption.

From that point, execution may remain local or continue remotely. In the distributed configuration the earlier chapters describe, feature tensors are forwarded over a CALF networking channel—routed, in the deployed configuration, via the NEXUS relay (nexus.standardrl.com:57012) described in Section 9.5.1—to a remote policy head, which returns either a direct action or a routing decision selecting the next active unit in the policy graph. The resulting control decision returns to the Raspberry Pi, which injects it into the host as keyboard or mouse input through the USB HID interface on the USB 2.0 lines. The complete loop is therefore: capture path, local encoding, optional remote policy execution, HID action return—a physical instantiation of the split-policy architecture from Chapter 7, mediated by the networking model from Chapter 8, running policy-graph units from Chapter 5.
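The final step of the loop, turning a $\mathtt{MultiDiscrete}([5,\,2,\,2])$ action into keyboard input, can be sketched as follows. This is an illustration under stated assumptions, not the prototype's driver code: the ordering of the five movement values (none, up, down, left, right) and the choice of Space and X for the two buttons are hypothetical. The 8-byte layout is the standard USB boot-keyboard report (one modifier byte, one reserved byte, six key slots), and the key codes are the usage IDs from the keyboard page of the USB HID Usage Tables.

```python
from typing import Sequence

# Usage IDs from the USB HID Usage Tables (Keyboard/Keypad page).
KEY_RIGHT, KEY_LEFT, KEY_DOWN, KEY_UP = 0x4F, 0x50, 0x51, 0x52
KEY_SPACE, KEY_X = 0x2C, 0x1B

# ASSUMPTION: movement component 0..4 means none/up/down/left/right.
MOVE_KEYS = (None, KEY_UP, KEY_DOWN, KEY_LEFT, KEY_RIGHT)


def action_to_report(action: Sequence[int]) -> bytes:
    """Build an 8-byte boot-keyboard report for one (move, a, b) action."""
    move, button_a, button_b = action
    keys = []
    if MOVE_KEYS[move] is not None:
        keys.append(MOVE_KEYS[move])
    if button_a:
        keys.append(KEY_SPACE)   # ASSUMPTION: first button mapped to Space
    if button_b:
        keys.append(KEY_X)       # ASSUMPTION: second button mapped to X
    keys += [0] * (6 - len(keys))          # pad the six key slots
    return bytes([0x00, 0x00] + keys)      # no modifiers, reserved byte zero


RELEASE_ALL = bytes(8)  # all-zero report: every key released
```

On a Linux USB gadget configured with the HID function, reports of this form are typically written to a character device such as `/dev/hidg0`; whether the prototype uses that path is an assumption here. Sending `RELEASE_ALL` after each report prevents the host from treating a key as held down.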

The maturity of the system is mixed and should be stated plainly. The physical prototype, board-level integration, USB-C signal separation, and HID/device-mode pathway have been built and bench-tested. The video-capture and host-control paths have been validated at the signal level, confirming end-to-end flow from DisplayPort capture through CSI-2 to the Pi's camera interface and from the Pi's USB gadget driver to the host's HID stack; characterisation of frame-capture rate and HID round-trip latency at the task level remains to be completed on the assembled hardware. What remains future work is the full task-level evaluation of trained policy graphs: coupling a MiniConv encoder to the capture pipeline, connecting its output to a CALF channel, and evaluating closed-loop performance on a concrete computer-interaction task.

9.3

BrowserEnv as Training Setting

Figure 9.3: BrowserEnv architecture for computer-interaction policy training (Chapter 5). Each instance runs Firefox in an isolated Docker container; agents connect via VNC for pixel observations and low-level input, whilst a lightweight WebSocket extension provides structured interaction signals. The observation--action interface (pixel frames in, keyboard/mouse events out) is structurally identical to that exposed by the physical device, enabling direct transfer from browser-based training to hardware deployment.

The software counterpart to the hardware device is a browser-based training environment. BrowserEnv (Chapter 5) presents a live web browser as a Gymnasium-compatible environment: the agent observes rendered page content—screenshot pixels, DOM structure, or a combination—and acts through synthetic keyboard and mouse events. Reward is derived from structured interaction signals: form completion, navigation to target URLs, element interaction sequences. The environment family spans a wide range of computer-interaction tasks, from simple button clicks to multi-step web workflows, providing the training diversity that Chapter 6's EnvCraft analysis showed to matter for held-out performance.

The critical property for this hardware chapter is interface alignment: the observation and action spaces BrowserEnv exposes in simulation are structurally identical to those the physical device exposes at deployment. The Pi captures display output as pixel frames, which MiniConv encodes using the same architecture trained in BrowserEnv. The policy's action output—discrete keyboard scan codes or relative mouse deltas—is injected by the Pi's USB HID gadget driver using the same event format that BrowserEnv's synthetic input layer simulates. There is therefore no interface gap between training and deployment: a policy trained in BrowserEnv can, in principle, be deployed on the physical device without modification to observation preprocessing or action post-processing.

This alignment is the direct hardware analogue of the sim-to-real methodology developed in Chapter 8: just as CALF's network-aware training ensured that Mode 2 synthetic network conditions accurately represented Mode 3 real conditions, BrowserEnv's interface alignment ensures that simulation-side training accurately represents device-side deployment. The remaining gap—rendering fidelity, timing jitter, and any HID latency not captured in simulation—is the hardware-side analogue of the network axis addressed by CALF: a domain shift that further work on the physical prototype must characterise and, where possible, close. A complete deployment loop would close with a policy trained in RLPlayground's hosted CALF environment—validated against the same interface contract—then deployed to a physical device whose MiniConv encoder and NEXUS-connected CALF channel provide the matching deployment path. The two sections that follow describe the software side of that loop, which is operational today.

9.4

EnvCraft: Environment Generation in Production

9.4.1

From Research Pipeline to Live Service

The environment-generation system described in Chapter 6 comprised a multi-stage LLM pipeline producing 9,694 validated Gymnasium environments, evaluated through ten-fold cross-validation on held-out Tetris variants and demonstrating statistically significant positive transfer within environment families. That system has since been deployed as a publicly accessible web service at https://envcraft.com, under the name EnvCraft.

Figure 9.4: The EnvCraft web interface at https://envcraft.com. Users describe a game or interactive task in plain language; the service guides them through structured specification and generates a validated, browser-playable Gymnasium environment through a nine-stage pipeline. The fixed interface contract ($\mathtt{MultiDiscrete}([5,\,2,\,2])$ actions, $\mathtt{Box}(0,\,255,\,(H,\,W,\,3),\,\mathtt{uint8})$ observations) ensures that every generated environment is directly compatible with RLPlayground training and, in principle, with hardware deployment.

The workflow begins with a natural-language description. A user types a description—a snake game where walls slow down the snake, or a 2D maze escape with timed doors that open and close, or a gravity platformer where the player must collect keys before the level floods—and EnvCraft engages them in a structured specification dialogue, collecting the title, theme, core objective, win and lose conditions, reward terms, difficulty level, and maximum episode length. Once the specification is complete, a nine-stage generation pipeline executes automatically, with live progress streamed to the browser window. On success, the environment is playable immediately in the browser—rendered at ten to fifteen frames per second, controlled by keyboard, using the same $\mathtt{MultiDiscrete}([5,\,2,\,2])$ action interface that any training agent would use. Users can collect experience datasets from their own play, submit GPU-accelerated DQN training jobs, and watch trained agents playing back through the same browser interface. The service carries the attribution: EnvCraft is part of ActionLearn, work carried out at the University of Cambridge on the future of real-world reinforcement learning; the research paper underlying the system is linked directly from the homepage.

Figure 9.5: Example specification dialogue on EnvCraft. The system iteratively refines a free-text game description into a structured JSON specification, asking for clarification on objectives, win and lose conditions, reward terms, and difficulty. The structured specification then drives the nine-stage generation pipeline. This chat-based workflow, which collects title, theme, objective, terminal conditions, reward structure, and episode length, is the production instantiation of the brief-generation stage described in Chapter 6.

9.4.2

The Production Pipeline

The production pipeline extends the evaluated research system with additional validation gates and a mandatory interface declaration, reaching nine stages in total. The full sequence is as follows.

Stage 0: Specification chat. An LLM-driven dialogue collects the structured JSON specification from the user's natural-language description, iterating until all required fields are present and internally consistent.

Stage 1: Specification lint. A rule-based checker validates field types, reward bound consistency, and action-space alignment before code generation begins, catching specification errors that would otherwise propagate downstream.

Stage 2: Code generation. An LLM generates a complete env.py implementing the BrowserRLEnv abstract base class. The generated class must define reset(), step(), and render() methods, and must declare a module-level ENV_MANIFEST dictionary and a MINIMUM_SKILL_THRESHOLD constant. The MINIMUM_SKILL_THRESHOLD is a human-readable declaration of the minimum agent performance level required to demonstrate non-trivial competence—a production operationalisation of the difficulty-filtering principle underlying the privileged-agent validation in Chapter 6.

Stage 3: Static safety check. An AST-level scanner rejects code that imports or calls banned identifiers (os.system, subprocess, socket, and others), ensuring that generated environments cannot exfiltrate data or escape the sandbox at the import level.

Stage 4: Dynamic validation. The environment executes inside an isolated Docker container with no network access, a non-root user, a read-only filesystem, a 256\,MB memory limit, and a sixty-second timeout. The validator runs reset(), exercises step() across a short trajectory, checks observation shape against the declared contract, verifies reward bounds, and tests determinism across identical seeds.

Stage 5: Visual validation. Rendered frames are captured and submitted to a multimodal vision model for majority-vote pass/fail assessment. This gate catches environments that execute and validate numerically but render degenerate output: blank screens, corrupted sprites, or visual layouts inconsistent with the specification. No equivalent gate existed in the research pipeline described in Chapter 6; its addition substantially reduces the rate of visually broken environments reaching users.

Stage 6: Policy smoke test. A random rollout of five hundred steps verifies that the environment makes positive reward achievable—that some trajectory, however unlikely, can accrue non-zero return. Environments in which positive reward is structurally impossible under any strategy are rejected at this stage.

Stage 7: Trivial-policy check. The validator verifies that maximum return is not achievable by a fixed or near-random strategy. This prevents environments in which the optimal policy is trivially farmable without learning, filtering out cases where, for example, holding a single button pressed yields maximal reward indefinitely.

Stage 8: Behavioural commentary. An LLM-assisted stage generates a brief natural-language description of the environment's emergent dynamics—the kinds of strategy that succeed and fail—which is stored alongside the environment and surfaced to users and to the training pipeline.

A generated environment passes all nine gates or is rejected with a stage-specific reason code made available to the user, who may revise their specification and resubmit.
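The reject-with-reason-code behaviour and the Stage 3 scanner can be illustrated together. The sketch below is a simplified stand-in, not the EnvCraft implementation: the function names and the reason-code string are hypothetical, and only the three identifiers the text names (os.system, subprocess, socket) are banned. It also misses aliased forms such as `from os import system`, which a production scanner would need to handle.

```python
import ast
from typing import Callable, List, Optional, Tuple

BANNED_MODULES = {"subprocess", "socket"}   # named in the text
BANNED_CALLS = {"os.system"}                # named in the text


def _dotted_name(node: ast.AST) -> Optional[str]:
    """Return 'a.b.c' for a chain of Attribute/Name nodes, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def static_safety_check(source: str) -> Optional[str]:
    """Return a reason string if a banned identifier is used, else None."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BANNED_MODULES:
                    return f"banned import: {alias.name}"
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BANNED_MODULES:
                return f"banned import: {node.module}"
        elif isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name in BANNED_CALLS:
                return f"banned call: {name}"
    return None


Gate = Tuple[str, Callable[[str], Optional[str]]]


def run_gates(source: str, gates: List[Gate]) -> Tuple[bool, Optional[str]]:
    """Run gates in order; stop at the first failure and report its code."""
    for code, gate in gates:
        reason = gate(source)
        if reason is not None:
            return False, f"{code}: {reason}"
    return True, None
```

Stopping at the first failing gate mirrors the text's description of a single stage-specific reason code; the cheaper static gates naturally come before the container-based ones, so that obviously unsafe code is never executed.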

Figure 9.6: A selection of environments generated by EnvCraft. Each environment was produced from a plain-language description, validated through the nine-stage pipeline, and rendered at 10--15 frames per second in the browser. The fixed interface contract ($\mathtt{MultiDiscrete}([5,\,2,\,2])$ actions, pixel observations) is common to all; game mechanics, visual themes, and reward structures vary freely across the generated corpus.

9.4.3

What the Production System Adds and What It Does Not Claim

The empirical results reported in Chapter 6—68.7\,% of held-out environments showing positive transfer, a mean improvement of 1.96 episode steps, and the within-family generalisation scaling experiment demonstrating monotonically increasing transfer with corpus size—were produced with the research pipeline under the ten-fold cross-validation protocol described there. The production system is a deployment of that research contribution, extended with additional validation gates and a substantially richer user-facing workflow; it should not be interpreted as a source of new empirical claims. In particular, the within-family limitation identified in Chapter 6—that the generalisation evidence is specific to Tetris variant families and does not establish cross-domain transfer—is not addressed by the production deployment. The production system generates a larger and more diverse corpus of environments than the research system, but no cross-domain generalisation experiment has been conducted under the same held-out protocol.

What the production deployment does confirm is the pipeline's operational viability: the nine-stage generation process runs reliably, the Docker sandbox validation prevents unsafe code from reaching users, and the browser-playable output is accessible without installation or configuration. The prompt version used by the production system (version 1.5.0) differs from the version used in the evaluated research system; no systematic comparison of the two prompt versions has been conducted. These are infrastructure properties rather than algorithmic results, and they are the appropriate standard by which the deployment should be assessed.

9.5

RLPlayground: Distributed Training Infrastructure

9.5.1

NEXUS and the Relay Architecture

The infrastructure layer underlying RLPlayground is the NEXUS relay: a custom TCP relay running at nexus.standardrl.com:57012 that allows distributed CALF nodes to communicate across the public internet without requiring public IP addresses, VPN configuration, or pre-arranged network topology. Each node—whether a personal laptop, a cloud server, a GPU training instance, or an embedded device—registers with NEXUS using RSA challenge-response authentication: the relay issues an 8-byte challenge, the node signs it using its registered private key, and the relay verifies the signature before admitting the node to a named session group. Nodes within a group can then exchange length-prefixed packets through the relay regardless of their physical location or network provider.
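Length-prefixed framing of the kind described above can be sketched with the standard library alone. The text does not give the width or byte order of the prefix, so the 4-byte big-endian header used here is an assumption, as are the names `frame`, `FrameDecoder`, and `new_challenge`. The RSA signing and verification steps are omitted because they need a cryptography library; only the 8-byte challenge generation is shown.

```python
import os
import struct
from typing import List

# ASSUMPTION: 4-byte unsigned big-endian length prefix; the real NEXUS
# wire format is not specified in the text.
HEADER = struct.Struct(">I")


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length so it can be sent over TCP."""
    return HEADER.pack(len(payload)) + payload


class FrameDecoder:
    """Reassemble packets from a TCP byte stream delivered in arbitrary chunks."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, data: bytes) -> List[bytes]:
        """Add received bytes; return every packet that is now complete."""
        self._buf.extend(data)
        packets = []
        while len(self._buf) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buf)
            end = HEADER.size + length
            if len(self._buf) < end:
                break                      # wait for the rest of this packet
            packets.append(bytes(self._buf[HEADER.size:end]))
            del self._buf[:end]
        return packets


def new_challenge() -> bytes:
    """Generate the 8-byte random challenge the relay would ask a node to sign."""
    return os.urandom(8)
```

The decoder is needed because TCP preserves byte order but not message boundaries: a single `recv` may return half a packet or several packets at once.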

NEXUS implements the three-layer hierarchy described in Chapter 8: the relay layer (NEXUS itself) handles authentication, session management, and packet routing; the host layer manages named session groups and tracks node membership; the services layer carries the application-level messages—env_reset, env_step, and action packets—that CALF uses to coordinate environment-side and policy-side execution. This hierarchy was designed specifically to support the heterogeneous deployment topology that CALF requires: a user's browser session might host the environment, a cloud server might host the policy, and a GPU cluster might run the training loop, with NEXUS providing transparent message routing between all three.

Chapter 8's experimental evaluation was conducted in controlled local-area network conditions—Ethernet and Wi-Fi in a university laboratory setting. This reflects the scope of the empirical study, not a limitation of the infrastructure. The NEXUS relay routes sessions across the public internet as a matter of course; multi-hop communication through heterogeneous network providers, across Cloudflare tunnels and residential broadband links, is the normal operating condition for RLPlayground users. Extending CALF's empirical evaluation to WAN and multi-hop conditions—measuring the degradation profile and the network-aware training benefit under cellular jitter or intercontinental routing—remains a natural direction for future work; the infrastructure to run those experiments is operational.

9.5.2

Personal CALF Environments

RLPlayground (https://rlplayground.com) provides each authenticated user with a personal CALF container: a Docker instance identified by a UUID and exposed at https://{uuid}.rlverse.com via Cloudflare tunnel. The container runs the CALF node software, registers with NEXUS, and joins the user's session group. From the user's perspective, the container is a private instance of the CALF framework: it can connect to the user's EnvCraft environments, receive environment observations over the session group, and send actions back, all mediated by NEXUS and transparent to the environment and policy code.

The session group also accepts external nodes. A user with access to their own hardware—a Raspberry Pi, a laptop running local simulation, or any machine on which the CALF node software can execute—can register it as a NEXUS client using a generated RSA key pair, join the session group, and participate in the training loop alongside the cloud-hosted container. This bring-your-own-hardware capability is the mechanism by which the hardware device described in Section 9.2 would eventually connect to RLPlayground's training infrastructure: the Pi Zero 2 W would register as a NEXUS client, join the user's group, and receive policy decisions from the remote policy head via the same channel described in Section 9.2.2. The software infrastructure for this connection is operational; the hardware-side integration—coupling MiniConv to the CSI-2 capture pipeline and opening a CALF channel from the Pi's network interface—remains the outstanding engineering task on the hardware path.

The service carries the tagline "distributed RL training that survives the real world" and reproduces the CALF experimental results from Chapter 8 on its homepage alongside the citation for the underlying paper: CartPole return 495 (clean) $\to$ 92 (Wi-Fi degraded) without network-aware training versus 378 with CALF, which cuts the loss in return from 403 to 117, a roughly 3.4-fold reduction in degradation; MiniGrid success rate 94\,% (clean) $\to$ 44\,% (Wi-Fi degraded) without versus 74\,% with CALF. These figures are taken directly from Tables 8.3 and 8.4 and form the basis of the service's headline claims. Users can verify them against the published arXiv preprint (arXiv:2603.12543).
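The degradation figures can be checked by hand. The working below assumes that both conditions are measured against the same clean value (495 for CartPole, 94\,% for MiniGrid); the tables in Chapter 8 remain the authoritative source.

$$
\begin{aligned}
\text{CartPole:}\quad & \Delta_{\text{baseline}} = 495 - 92 = 403, \qquad \Delta_{\text{CALF}} = 495 - 378 = 117, \qquad \frac{403}{117} \approx 3.4 \\
\text{MiniGrid:}\quad & \Delta_{\text{baseline}} = 94 - 44 = 50 \text{ points}, \qquad \Delta_{\text{CALF}} = 94 - 74 = 20 \text{ points}, \qquad \frac{50}{20} = 2.5
\end{aligned}
$$

Under that assumption, network-aware training removes roughly 71\,% of the CartPole degradation ($1 - 117/403$) and 60\,% of the MiniGrid degradation ($1 - 20/50$).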

9.5.3

GPU Training and the End-to-End Pipeline

For users who submit training jobs, RLPlayground provides an end-to-end pipeline. An LLM policy-generation step reads the environment's source code and generates a heuristic Python policy class: a starting strategy informed by the game's rules and reward structure, implemented without access to internal state, using only the same pixel observations available to any trained agent. This heuristic policy then executes a seeded experience-collection phase, populating a replay buffer with trajectories that demonstrate basic competence and ensure the buffer is not empty when training begins. The GPU training phase runs Stable-Baselines3 DQN for a configurable number of timesteps (five hundred thousand by default, corresponding to the training budget used in Chapter 6's generalisation experiments), using the seeded buffer as a warm start. On completion, the resulting model is saved and made available for browser playback; the user can watch the trained agent interact with their generated environment in real time.
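The seeded experience-collection phase can be illustrated on a toy problem. Everything in the sketch below is hypothetical: `ToyCorridor` stands in for a generated environment and uses an integer position instead of pixel observations, `heuristic_policy` stands in for the LLM-generated policy class, and the buffer is a plain deque, not the replay buffer of any particular library. It shows only the idea that a rule-based policy fills the buffer before learning starts.

```python
from collections import deque
from typing import Callable, Deque, Tuple

# (observation, action, reward, next_observation, done)
Transition = Tuple[int, int, float, int, bool]


class ToyCorridor:
    """Hypothetical stand-in environment: reach position `length` from 0."""

    def __init__(self, length: int = 5, max_steps: int = 20) -> None:
        self.length, self.max_steps = length, max_steps
        self.pos, self.t = 0, 0

    def reset(self) -> int:
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action: int) -> Tuple[int, float, bool]:
        # Action 1 moves right, any other action moves left (floored at 0).
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        self.t += 1
        reached = self.pos >= self.length
        done = reached or self.t >= self.max_steps
        return self.pos, (1.0 if reached else 0.0), done


def heuristic_policy(obs: int) -> int:
    """Rule-based starting strategy: always move towards the goal."""
    return 1


def seed_buffer(env: ToyCorridor,
                policy: Callable[[int], int],
                buffer: Deque[Transition],
                episodes: int) -> int:
    """Roll out the heuristic policy and store every transition it produces."""
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = policy(obs)
            next_obs, reward, done = env.step(action)
            buffer.append((obs, action, reward, next_obs, done))
            obs = next_obs
    return len(buffer)
```

A learner started from such a buffer sees rewarded transitions from its first update, which is the warm-start effect the paragraph describes; how the production pipeline hands the buffer to the DQN trainer is not specified in the text.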

The current training algorithm is DQN—a flat, single-policy learner. The distributed training infrastructure supports the policy graph architecture from Chapter 5 in principle—the NEXUS relay can route messages between arbitrarily many nodes, and the session-group model supports the multi-unit coordination that hard routing requires—but the current production deployment uses DQN as its training backbone for reasons of engineering simplicity and training stability. Policy graph training in the production system, with hard routing, commitment bounds, and multiple specialist units, is the natural next development step; the infrastructure does not require fundamental revision to support it, only the integration of the policy-graph training loop described in Chapter 5 into the GPU training pipeline.

The division of labour between the edge device and the cloud training service instantiates, in operational form, the System 1/System 2 architecture that the thesis proposes throughout. The Pi Zero 2 W on the deployment side (running MiniConv inference at up to five frames per second on the SIMD GPU of a \$15 (USD) board, and reacting to the host display within the HID latency budget) corresponds to the reactive, fast, resource-constrained System 1 layer. The GPU Hub on the training side (running five hundred thousand steps of DQN over minutes, and deliberating over the full environment distribution before committing a policy) corresponds to the deliberative, slow, resource-rich System 2 layer. The trained policy crosses from System 2 to System 1 via NEXUS: the remote policy head receives compact feature tensors from the Pi, selects actions, and returns them to the Pi, which injects them into the host as HID events. This is not an architectural aspiration but a description of an infrastructure whose components are each, individually, operational.

9.5.4

What the Deployment Confirms and What Remains

The production deployment confirms the CALF infrastructure's operational viability across real heterogeneous networks. NEXUS relay sessions run through Cloudflare tunnels and across diverse network providers, in conditions that the LAN-only laboratory experiments of Chapter 8 did not test. The observation that the infrastructure operates reliably under these conditions is not an empirical RL result—it does not extend or modify the reward degradation numbers reported in Chapter 8—but it does validate the architectural claim that NEXUS-mediated CALF sessions are robust enough to serve real users outside laboratory conditions.

What remains undone in the production deployment is the GPU CALF integration: a NEXUS-connected GPU agent that receives environment observations from the session group, runs DQN inference and online training on GPU, and sends actions back through the relay, without the separation between experience collection and offline training that the current pipeline requires. This component exists in skeletal form in the codebase. When it is complete, RLPlayground will support fully online GPU-accelerated CALF training with the distributed topology used in the CartPole and MiniGrid experiments, extended from LAN to the public internet. The GPU CALF node will then be connectable to any NEXUS session, including one in which the Pi Zero 2 W hardware device is the environment-side endpoint.
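The difference between the current offline pipeline and the planned online agent is that the online agent learns from every message it receives and replies within the same step. A minimal sketch of that loop follows, under loud assumptions: tabular Q-learning is swapped in for DQN to keep the example dependency-free, the `(obs, reward, done)` message shape is invented for illustration, and an in-memory list stands in for the relay. None of the names below belong to the NEXUS or RLPlayground codebases.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Optional, Tuple


class OnlineAgent:
    """Acts and learns on every incoming message (no offline phase).

    Tabular Q-learning is a stand-in for DQN in this sketch.
    """

    def __init__(self, n_actions: int, seed: int = 0, epsilon: float = 0.1,
                 alpha: float = 0.5, gamma: float = 0.9) -> None:
        self.q: Dict[Hashable, List[float]] = defaultdict(
            lambda: [0.0] * n_actions)
        self.n_actions = n_actions
        self.rng = random.Random(seed)
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma
        self.updates = 0
        self._last: Optional[Tuple[Hashable, int]] = None

    def on_message(self, obs: Hashable, reward: float,
                   done: bool) -> Optional[int]:
        """Handle one message; reward and done describe the previous action."""
        if self._last is not None:
            # Learn immediately from the transition this message completes.
            prev_obs, prev_action = self._last
            target = reward if done else reward + self.gamma * max(self.q[obs])
            old = self.q[prev_obs][prev_action]
            self.q[prev_obs][prev_action] = old + self.alpha * (target - old)
            self.updates += 1
        if done:
            self._last = None  # episode over: nothing to send back
            return None
        if self.rng.random() < self.epsilon:
            action = self.rng.randrange(self.n_actions)
        else:
            values = self.q[obs]
            action = values.index(max(values))
        self._last = (obs, action)
        return action


def run_session(agent: OnlineAgent,
                messages: Iterable[Tuple[Hashable, float, bool]]
                ) -> List[Optional[int]]:
    """In-memory stand-in for a relay session: one reply per message."""
    return [agent.on_message(obs, reward, done)
            for obs, reward, done in messages]
```

In this arrangement there is no separate replay-seeding or batch-training stage: each call to `on_message` both updates the value estimates and produces the next action.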

9.6

Three Realisations of One Vision

The chapters preceding this one assembled the thesis's argument in stages. Policy graphs provide the modular execution abstraction; EnvCraft provides the validated environment diversity that makes generalisation measurable; MiniConv provides the edge-efficient encoding that makes policy execution lightweight on commodity hardware; CALF provides the network-aware training that makes distributed execution robust to the communication impairments of real deployment. This chapter has shown where those contributions meet the world: a hardware device, an environment-generation service, and a distributed training platform, each a realisation of one or more strands of the thesis, and all three speaking the same interface language.

The four deployment gaps identified in Chapter 4 each find a corresponding realisation here. The interpretability gap (the difficulty of understanding and auditing the decisions of monolithic policies) is addressed structurally by the policy graph architecture, whose routing traces make the active specialist visible at every step; in RLPlayground, those traces will become accessible to users inspecting their trained agents once policy-graph training replaces the current DQN backbone. The generalisation gap (the overfitting of policies to narrow training distributions) is addressed by EnvCraft's production corpus of validated environments, each with distinct mechanics, reward structures, and win conditions. The edge deployment gap (the computational cost of running vision-based policies on resource-constrained hardware) is addressed by MiniConv on the Pi Zero 2 W, which demonstrates that OpenGL shader inference can sustain the encoding rate required for real-time interaction. The latency and communication gap (the degradation of policies trained under idealised synchrony when deployed over real networks) is addressed by CALF, whose training methodology is now accessible to any user through RLPlayground's hosted service.

The interface contract is what makes the three realisations interoperable rather than merely related. Any environment generated by EnvCraft exposes a $\mathtt{MultiDiscrete}([5,\,2,\,2])$ action space and a $\mathtt{Box}(0,\,255,\,(H,\,W,\,3),\,\mathtt{uint8})$ observation space—the same interface that MiniConv encodes, that CALF distributes, and that the hardware device exposes at the USB-C connection. A policy trained in RLPlayground against an EnvCraft environment could, without interface modification, be deployed to the Pi Zero 2 W, receive observations through its CSI-2 capture pipeline, and inject decisions through its HID gadget driver. The gap between that potential and its current realisation is a hardware engineering task—coupling the capture pipeline to a MiniConv encoder, opening the CALF channel, and evaluating closed-loop performance on the physical device—not an algorithmic or infrastructure one.
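The contract can be made concrete with a small, dependency-free sketch. It checks an observation shape and an action against the two spaces named above, and it also shows one possible bridge to DQN: Stable-Baselines3's DQN accepts only `Discrete` action spaces, so a $\mathtt{MultiDiscrete}([5,\,2,\,2])$ interface has to be mapped to a flat index somewhere. The text does not say how RLPlayground does this; the flattening below (20 joint actions, row-major order) and all function names are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

NVEC = (5, 2, 2)               # MultiDiscrete([5, 2, 2])
N_JOINT = math.prod(NVEC)      # 20 joint actions when flattened


def is_valid_observation(shape: Sequence[int], dtype_name: str) -> bool:
    """Box(0, 255, (H, W, 3), uint8): three positive dims, RGB last, uint8."""
    return (len(shape) == 3 and shape[2] == 3
            and all(d > 0 for d in shape) and dtype_name == "uint8")


def is_valid_action(action: Sequence[int]) -> bool:
    """Each component must lie in [0, n) for its entry n of NVEC."""
    return (len(action) == len(NVEC)
            and all(0 <= a < n for a, n in zip(action, NVEC)))


def flatten_action(action: Sequence[int]) -> int:
    """Map a MultiDiscrete action to a single index in [0, N_JOINT)."""
    if not is_valid_action(action):
        raise ValueError(f"action {tuple(action)} violates NVEC {NVEC}")
    index = 0
    for a, n in zip(action, NVEC):
        index = index * n + a  # row-major (mixed-radix) encoding
    return index


def unflatten_action(index: int) -> Tuple[int, ...]:
    """Inverse of flatten_action."""
    if not 0 <= index < N_JOINT:
        raise ValueError(f"index {index} outside [0, {N_JOINT})")
    parts = []
    for n in reversed(NVEC):
        index, a = divmod(index, n)
        parts.append(a)
    return tuple(reversed(parts))
```

A learner that only understands a flat `Discrete(20)` space can then be wrapped so that it emits `unflatten_action(i)` at the environment boundary, leaving the contract seen by the environment, the encoder, and the hardware device unchanged.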

What the three realisations do not yet constitute is a closed-loop empirical demonstration. The hardware device has no task-level evaluation. The EnvCraft production system has not been evaluated under the same cross-validation protocol as the research pipeline. The RLPlayground training pipeline uses DQN rather than policy graphs, and the GPU CALF integration remains future work. Each system is honest about its maturity: the infrastructure is real, the claims are bounded, and the remaining steps are named.

The deeper significance of those remaining steps is worth stating. In his 1914 essay Ensayos sobre automática, Torres Quevedo insisted that machines must possess discernimiento—the capacity to weigh the circumstances surrounding them in determining their actions. El Ajedrecista gave that claim physical form: an automaton that played real chess moves on physical pieces, in the real world, without a human intermediary. A policy graph running on a USB-C device interacting with an unmodified laptop is a modern expression of the same ambition: a learned system that perceives a real environment through its native output channel, acts on it through its native input channel, and does so without requiring special instrumentation of the host. The gap between Torres's relay logic and a trained MiniConv policy is vast; the gap between the operational software infrastructure described in this chapter and a policy that actually plays on a real screen is small by comparison. The environments are generated, the training infrastructure is running, and the signals flow through the hardware. Connecting them is the natural final step.