Building the initial codebase and passing the baseline tests. Still a long way from a successful run, however.
Welcome back. For the past three weeks I have been reading and planning; this week I began building. I now have a running codebase with a Gymnasium environment wrapping Basilisk, a PPO training pipeline, and a validation script, and I have completed a 200,000-timestep training run on a fixed LEO-to-GEO transfer. The trajectories are not quite right yet, though. This post covers the most significant design decisions I made while building the environment.
Blog 3 laid out the conceptual mapping from Basilisk to a Gymnasium environment. This week I locked in the concrete formulation. The three fundamental choices were what the agent observes, what it controls, and what Basilisk simulates during each decision step.
Last week's open question was whether to observe Cartesian state or Keplerian elements. I chose both. The 16-dimensional observation combines normalised ECI position and velocity with the orbital elements most useful for interpreting orbit shape, specifically semi-major axis $a$, eccentricity $e$, and the angular elements encoded as sines and cosines to avoid branch-cut discontinuities. Everything is scaled to roughly $[-3, 3]$ for training stability.
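To make the layout concrete, here is a minimal sketch of how such an observation vector could be assembled. The function name, the specific element selection, and the scale constants are my illustrative assumptions, not the environment's actual code (this sketch comes out to 15 dimensions rather than the real 16):

```python
import numpy as np

# Illustrative scale constants; the real environment's choices may differ.
R_REF = 4.2164e7   # GEO radius [m], used as the length scale
V_REF = 7_800.0    # LEO circular speed [m/s], used as the velocity scale

def build_obs(r_eci, v_eci, a, e, inc, raan, a_tgt, e_tgt, mass_frac):
    """Assemble a normalised observation vector.

    Angles are encoded as (sin, cos) pairs to avoid the 0 / 2*pi
    branch-cut discontinuity that a raw angle would introduce.
    """
    obs = np.concatenate([
        np.asarray(r_eci) / R_REF,          # ECI position, roughly O(1)
        np.asarray(v_eci) / V_REF,          # ECI velocity, roughly O(1)
        [a / R_REF, e,                      # orbit shape
         np.sin(inc), np.cos(inc),          # angular elements as sin/cos
         np.sin(raan), np.cos(raan),
         a_tgt / R_REF, e_tgt,              # the goal, readable by the policy
         mass_frac],                        # m / m0, depleted by burns
    ]).astype(np.float32)
    return np.clip(obs, -3.0, 3.0)          # keep everything in ~[-3, 3]
```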
Notice that I included the target SMA and eccentricity directly in the observation. Without them, the policy network can never know what orbit it is trying to reach, which means it cannot generalise to the randomised-orbit curriculum I plan to enable later. The agent needs to read its goal, not memorise it.
I went with burn on-times rather than delta-v vectors. Last week's question was whether to have the agent command an abstract velocity change or a physical firing duration. I chose the latter for two reasons. First, it maps directly to Basilisk's THRArrayOnTimeCmdMsg with no additional controller layer. Second, every burn depletes the tank in a way that shows up in the $m/m_0$ observation term, and the agent is responsible for that tradeoff.
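The action-to-command mapping can then be as simple as scaling a burn fraction by the decision-step length. This is a hedged sketch: `STEP_DT` and the deadband are illustrative values I am assuming, with the resulting on-time destined for the THRArrayOnTimeCmdMsg payload.

```python
STEP_DT = 400.0  # decision-step length in seconds (illustrative)

def action_to_on_time(action, min_on_time=5.0):
    """Map a policy action (burn fraction in [0, 1]) to a thruster
    on-time in seconds.

    The on-time is what gets written into Basilisk's
    THRArrayOnTimeCmdMsg; the deadband is an assumption, there to
    suppress burns too short to matter physically.
    """
    burn_fraction = min(max(float(action), 0.0), 1.0)
    on_time = burn_fraction * STEP_DT
    return on_time if on_time >= min_on_time else 0.0
```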
Basilisk's thrusterStateEffector in this configuration does not feed propellant depletion back into the orbital integrator. The propagated mass stays fixed at the wet mass. I track fuel consumption analytically in Python and use it for observations and the reward. The error from this constant-mass assumption accumulates to under 5% in delta-v over a full Hohmann, which is acceptable for the current stage.
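Here is a minimal sketch of that analytic bookkeeping, using the standard mass-flow relation $\dot{m} = F / (I_{sp} \, g_0)$. The function name and the thruster parameters in any example call are placeholders, not the scenario's actual values.

```python
G0 = 9.80665  # standard gravity [m/s^2]

def update_fuel(mass, thrust, isp, on_time):
    """Deplete propellant analytically for a constant-thrust burn.

    Basilisk's propagated mass stays at the wet mass in this
    configuration, so the mass used for observations and reward is
    tracked here instead: mdot = F / (Isp * g0).
    """
    mdot = thrust / (isp * G0)
    fuel_used = min(mdot * on_time, mass)  # cannot burn more than we have
    return mass - fuel_used, fuel_used
```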
Following Zavoli and Federici's discretisation, I use 40 fixed decision steps per episode. The episode length is set to $1.5 \times T_\text{Hohmann}$, giving the agent 50% more time than the ideal two-burn transfer requires. Each step therefore spans a substantial chunk of orbital time, roughly 400 seconds per step for the LEO-to-GEO problem. As a result, the agent makes 40 coarse burn-or-coast decisions spread across the entire transfer window.
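The step length falls out of the Hohmann transfer time. A quick sketch, assuming standard two-body values ($\mu$ for Earth) and an illustrative LEO altitude:

```python
import math

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter [m^3/s^2]

def step_duration(a_start, a_target, n_steps=40, margin=1.5):
    """Decision-step length in seconds.

    The episode spans margin * T_Hohmann, divided into n_steps equal
    steps. T_Hohmann is half the period of the transfer ellipse, whose
    semi-major axis is the mean of the two circular orbit radii.
    """
    a_transfer = 0.5 * (a_start + a_target)
    t_hohmann = math.pi * math.sqrt(a_transfer ** 3 / MU_EARTH)
    return margin * t_hohmann / n_steps
```

The exact per-step duration depends on the starting altitude chosen; for any LEO-to-GEO pairing it lands in the several-hundred-second range.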
The two essential components were managing Basilisk's simulation lifecycle inside Gymnasium's reset-step loop, and handling the attitude settling period at the start of every episode.
Basilisk has no checkpoint-restore mechanism, so every call to reset() constructs a new SimBaseClass from scratch, configures the spacecraft dynamics, thruster effector, navigation sensor, and attitude control modules, initialises the orbital state, and calls InitializeSimulation(). Memory leaks were a potential problem from doing this tens of thousands of times in a single Python session, but in practice the Python garbage collector handles the C++ resource cleanup correctly when the previous sim object is dereferenced.
This was the detail I had not fully appreciated until I ran the first episode without it. The velocityPoint guidance law slews the spacecraft to align the thruster with the prograde direction, and the mrpFeedback controller closes the attitude loop with a settling time on the order of several hundred seconds. On an equatorial orbit, the initial attitude misalignment from the identity MRP can be close to 90 degrees. If the agent fires the thruster immediately at step zero, the burn is nearly perpendicular to the velocity vector and the orbit degrades rather than rises. The reference scenarioOrbitManeuverTH handled this with an explicit pre-burn coast; I replicated it with a fixed 600-second settling period before the first observation is returned. Although the agent never sees this coast, the physics depends on it.
On the question of running Basilisk in-process versus in a separate worker process, I went with in-process. The latency per step() is dominated by the Basilisk integrator itself, not inter-process communication overhead, so there was nothing to gain from a separate process. The main risk, memory accumulation across thousands of episode resets, did not materialise in practice.
Before running any training I wrote two test modules and used Gymnasium's built-in environment checker to catch API violations.
test_env_basic.py verifies the Gymnasium contract: observations returned by reset() and step() must lie within the declared observation_space bounds at every step, including degenerate states near the termination conditions. I added clipping as the final operation in _get_obs() to handle edge cases where orbital elements go singular near perfectly circular orbits.
test_env_oracle.py runs a hardcoded burn sequence designed to approximate the optimal Hohmann transfer with full burn fraction at perigee, coast through the transfer ellipse, and a second burn at apogee to circularise. It checks that the oracle agent reaches a final SMA within a reasonable fraction of the target without triggering any catastrophic termination. This is the physics validation test: if a hand-coded near-optimal agent cannot complete the transfer, the simulation is wrong.
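The oracle's schedule is just a fixed list of burn fractions. A sketch, where the function name and the specific step counts are illustrative rather than the test's actual values:

```python
def oracle_actions(n_steps=40, burn1=3, coast=20, burn2=3):
    """Hard-coded burn schedule approximating a Hohmann transfer:
    full burn at perigee, coast through the transfer ellipse, full
    burn at apogee to circularise, then coast out the step budget.
    """
    schedule = [1.0] * burn1 + [0.0] * coast + [1.0] * burn2
    schedule += [0.0] * (n_steps - len(schedule))
    return schedule[:n_steps]
```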
With the environment passing all tests, I ran the first training run: 200,000 timesteps, PPO with a 64x64 MLP, fixed orbit, single environment, seed 42. The configuration was deliberately minimal: I wanted a clean diagnostic signal before scaling up.
The reward curve turned out to be erratic, bouncing up and down with no clear trend across the full run. Training did not diverge catastrophically, but it did not converge either. The evaluation results were more interesting. Running 20 episodes with the best checkpoint and plotting orbital SMA over each decision step gives this:
Evaluation trajectories over 20 episodes. The red dashed line is the target orbit. The bottom panel shows the thruster command at each step. Most episodes overshoot the target significantly; Ep8 misses by under 1%.
Every episode starts at LEO and every trajectory rises. The spacecraft is consistently firing prograde and climbing out of LEO. A random policy would produce flat or chaotic SMA traces, so something is being learned.
However, SMA errors range from under 10% to over 300%, with most episodes overshooting the target altitude by a wide margin. Episode 8 rises, levels off near the target, and circularises correctly. It is surrounded by episodes that sail past the target entirely, reaching two or three times the target SMA before the step budget expires.
The bottom panel shows that the thruster command sequence follows the same coarse pattern in every episode: the burn fraction starts near 1.0 and tapers gradually to near zero by step 40. The agent has converged on a monotonically decreasing burn schedule rather than the two distinct on-off pulses that the Hohmann transfer actually requires. It is learning a heuristic that sometimes produces a good answer and usually overshoots.
A Hohmann transfer uses two short impulsive burns: a prograde burn at perigee to raise apogee to the target altitude, a coast through the transfer ellipse, and then a second prograde burn at apogee to circularise. The optimal command sequence is pulsed: burn, stop, coast for ~20 steps, burn, stop. The debug policy is a continuous taper: burn hard early, bleed off gradually. It has learned to go up, but not when to stop.
The trajectories show that the reward function never penalizes overshooting. To see why, look at the SMA progress term.
The SMA progress term needs to reward proximity to the target, not just upward movement. A potential based on signed distance in $(a, e)$ space, something like $\Phi(s) \propto -|a - a_\text{target}| / a_\text{target}$, would penalize overshooting and undershooting symmetrically. The terminal success bonus needs to come down to the same order of magnitude as the accumulated per-step signal, so the agent is not navigating in the dark between episode boundaries. And the eccentricity shaping should only be valuable near the right altitude, not universally.
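A sketch of what that symmetric shaping could look like, written in the standard potential-based form $\gamma \Phi(s') - \Phi(s)$; the weight and discount values are placeholders, not a tuned configuration:

```python
def potential(a, a_target, w=1.0):
    """Phi(s) = -w * |a - a_target| / a_target: penalises overshoot
    and undershoot symmetrically, unlike a one-sided progress term."""
    return -w * abs(a - a_target) / a_target

def shaping_reward(a_prev, a_curr, a_target, gamma=0.99):
    """Potential-based shaping term, gamma * Phi(s') - Phi(s).

    In this form (Ng et al.) the shaping cannot change the optimal
    policy; it only densifies the learning signal.
    """
    return gamma * potential(a_curr, a_target) - potential(a_prev, a_target)
```

Climbing toward the target earns positive shaping; sailing past it earns negative shaping, which is exactly the signal the debug run's taper policy never received.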
Getting reward shaping right on a sparse, delayed-outcome problem is notoriously difficult, and a first-pass reward function almost never works. The good news is that the environment is correct, the physics is correct, and the policy is learning. Because of Episode 8's near-perfect trajectory, I now know the task is solvable with this setup.
Episodes that happen to time out near the target altitude receive the massive +200 terminal bonus, spiking the curve upward. Episodes that miss receive only per-step shaping and land near zero, and the occasional crash adds a -50 dip. The critic cannot learn accurate value estimates under this much variance, so the advantage estimates that PPO's clipped objective trains on are lost. Even with $\varepsilon = 0.2$ limiting the per-update policy change, a noisy advantage signal means noisy gradient updates, which shows up as the policy's instability.
Next week will start with the reward function. I am not going to scale up to more timesteps or parallel environments until the reward curve shows a cleanly improving trend on this fixed, deterministic problem.
The specific changes I plan to try include replacing the one-sided SMA progress term with a distance-to-target potential that penalizes overshoot symmetrically with undershoot, rebalancing the terminal bonus to be proportional to the per-step reward scale, and possibly tying the eccentricity shaping weight to proximity to the target altitude so the agent is only rewarded for circularising in the right place. I also want to look at the value function loss in TensorBoard more carefully. If the critic is not learning an accurate value estimate, which seems likely given the variance, the advantage estimates will be garbage regardless of how well the actor is searching.
One thing that keeps sticking with me is that despite the chaos in the reward curve, the EvalCallback managed to find a checkpoint that produced Episode 8, which was a near-perfect Hohmann transfer with under 10% SMA error. That means the task is tractable with this exact environment and this exact policy architecture. Something in the reward signal, however noisy, was enough to push the policy toward the right answer at least once.