Senior Project Blog · Week 1 · March 3, 2026
An introduction to using deep reinforcement learning to autonomously optimize fuel consumption across multi-planetary missions.
Space is expensive because of fuel. Every kilogram you want to move requires fuel to lift it, and that fuel adds mass that in turn requires even more fuel. The Tsiolkovsky rocket equation captures this feedback loop, and it makes the cost of space travel grow exponentially. Delivering one kilogram to low Earth orbit costs roughly $7,342. Delivering that same kilogram to the surface of the Moon costs nearly $1.2 million. Push past Mars, and the numbers become staggering.
There is, however, a trick that spacecraft engineers have long exploited: the gravity assist. By going past a planet at the right angle and speed, spacecraft can "steal" energy from a planet's orbital motion around the Sun, dramatically reducing the fuel required for deep-space missions. The Voyager probes, launched in 1977, used a rare planetary alignment to achieve a "Grand Tour" of the outer solar system that would have been physically impossible without gravity assists. Cassini used three separate gravity assists (Venus twice, then Earth) just to build up enough speed to reach Saturn.
The problem is that planning these trajectories is extraordinarily complex. The search space for multi-gravity-assist routes is enormous: every possible flyby sequence and encounter timing is a different candidate route, and traditional optimization methods struggle with this combinatorial explosion. These methods are also slow: they require extensive computation up front and cannot adapt when reality differs from the model, so ground operators must intervene every time a correction is required. For missions to the outer planets, where communication delays can exceed an hour each way, waiting for a human to correct course is simply not an option.
Traditional methods find near-optimal trajectories but are rigid and computationally expensive. A spacecraft that could learn its own guidance policy that adapts in real-time to sensor noise and unexpected perturbations would be a fundamentally different kind of mission capability.
This is the problem that motivates my senior project. I want to explore whether modern deep reinforcement learning, specifically an algorithm called Proximal Policy Optimization (PPO), can learn fuel-optimal trajectory policies for multi-planetary missions that remain robust to real-world uncertainties while still running onboard in milliseconds.
This first week was devoted entirely to reading to build a strong theoretical foundation. My reading list covered three major sources, each addressing a different layer of the problem: the mathematics of learning, the algorithm I plan to use, and the state of the art in applying that algorithm to spacecraft.
What follows is a walkthrough of the key ideas I pulled from each source and how they connect to what I'm trying to build.
Sutton and Barto's textbook is essentially the foundational reference for the entire field of reinforcement learning. Rather than learning from labeled examples (supervised learning) or by finding structure in data (unsupervised learning), RL is about learning from interaction. An agent takes actions in an environment, observes what happens, and gradually learns which actions lead to the most reward. As Sutton and Barto describe it in Chapter 1, the RL agent is always trying to maximize a cumulative reward signal, and crucially, nobody tells it how to do this. It must discover this on its own through trial and error.
Chapter 3 formalizes this with the concept of the Markov Decision Process (MDP), which is the mathematical framework used to model trajectory problems. In an MDP, at each time step $t$, an agent observes a state $s$, chooses an action $a$, receives a reward $r$, and transitions to a new state $s'$. The agent's goal is to find a policy (a mapping from states to actions) maximizing its expected cumulative future reward.
Spacecraft trajectory optimization fits surprisingly naturally into this framework. The spacecraft's position, velocity, and remaining fuel mass define the state. The thrust vector (magnitude and direction) defines the action. The negative of fuel consumed is a simple reward. And the optimal policy is the one that gets the spacecraft to its destination using the least propellant. Zavoli and Federici (2021) explicitly formalize a three-dimensional Earth-Mars rendezvous mission this way, defining the spacecraft state at time step $k$ as a seven-dimensional vector comprising position, velocity, and mass.
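As a concrete sketch, the MDP pieces described above might look like the following toy environment. Everything here is an illustrative placeholder of my own (the class name, the point-mass dynamics, the mass-flow constant), not Basilisk's API or the paper's model:

```python
import numpy as np

class TrajectoryMDP:
    """Toy trajectory MDP: state is the 7-dim vector [position (3), velocity (3), mass (1)]."""

    def __init__(self, initial_state, dt):
        self.state = np.asarray(initial_state, dtype=float)  # [rx, ry, rz, vx, vy, vz, m]
        self.dt = dt

    def step(self, thrust):
        """Apply a thrust vector (the action) and return (next_state, reward)."""
        r, v, m = self.state[:3], self.state[3:6], self.state[6]
        accel = np.asarray(thrust, dtype=float) / m        # toy dynamics: no gravity term
        v_next = v + accel * self.dt
        r_next = r + v_next * self.dt
        fuel_used = 1e-4 * np.linalg.norm(thrust) * self.dt  # made-up mass-flow model
        self.state = np.concatenate([r_next, v_next, [m - fuel_used]])
        reward = -fuel_used                                # reward = negative fuel consumed
        return self.state, reward

# One agent-environment interaction step:
mdp = TrajectoryMDP([7000.0, 0, 0, 0, 7.5, 0, 1000.0], dt=10.0)
s, r = mdp.step([0.1, 0.0, 0.0])
```

The point is only the interface: a 7-dimensional state in, a thrust action applied, a fuel-penalizing reward out.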
Chapter 5 introduced me to Monte Carlo methods, which I'll need for validating my trained policies. The key idea in MC methods is that you can estimate the expected return of a policy by simply running it many times and averaging the results. I'll run the same policy across hundreds or thousands of randomly varied scenarios, adding different levels of sensor noise and unexpected perturbations, and measure how often it succeeds. Sutton and Barto frame this as estimating value functions directly from sampled episodes, which makes MC methods especially powerful in environments where we don't have a perfect model of the dynamics (which is always true for real spacecraft).
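A Monte Carlo validation campaign is simple to sketch. Here `run_episode` and the 80% toy success probability are hypothetical stand-ins for a real noisy policy rollout:

```python
import numpy as np

def monte_carlo_success_rate(run_episode, n_episodes=500, seed=0):
    """Estimate a policy's success rate by averaging many independent rollouts.
    `run_episode` is any callable that simulates one mission under its own
    noise draw and returns True on success."""
    rng = np.random.default_rng(seed)
    successes = sum(run_episode(rng) for _ in range(n_episodes))
    return successes / n_episodes

# Toy stand-in for a noisy rollout that "succeeds" about 80% of the time:
rate = monte_carlo_success_rate(lambda rng: rng.random() < 0.8)
```

The same averaging works whether the episode is a one-line lambda or a full Basilisk simulation; only the cost per episode changes.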
Chapter 13 on Policy Gradient Methods was perhaps the most directly relevant to my project. Rather than learning the value of states and then deriving a policy, policy gradient methods directly optimize the parameters of the policy itself, following the gradient of expected return. The key result is the Policy Gradient Theorem, which gives us a way to compute this gradient from sampled experience:
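In Sutton and Barto's notation (Chapter 13), with $\mu(s)$ the on-policy state distribution and $q_\pi$ the action-value function, the theorem states:

$$\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla_\theta \pi(a \mid s, \theta)$$

In its sampled (REINFORCE) form this becomes $\nabla_\theta J(\theta) = \mathbb{E}_\pi\!\big[G_t\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)\big]$, where $G_t$ is the return, which is what makes the gradient computable from experience alone.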
For continuous action spaces, such as spacecraft thrust commands, Sutton and Barto recommend parameterizing the policy as a Gaussian distribution, with the neural network outputting the mean (and sometimes the standard deviation) of that distribution. This allows the agent to output continuous-valued thrust vectors rather than being restricted to discrete choices.
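A minimal sketch of such a Gaussian policy head, assuming a single linear layer with placeholder (untrained, all-zero) weights standing in for the real network:

```python
import numpy as np

def gaussian_policy_sample(state, W, b, log_std, rng):
    """Sample a continuous action from a diagonal-Gaussian policy."""
    mean = W @ state + b                      # network output: action mean
    std = np.exp(log_std)                     # std kept positive via exp
    action = rng.normal(mean, std)            # sample a continuous thrust vector
    # Diagonal-Gaussian log-probability, needed later for PPO's probability ratio
    log_prob = -0.5 * np.sum(((action - mean) / std) ** 2
                             + 2 * log_std + np.log(2 * np.pi))
    return action, log_prob

rng = np.random.default_rng(0)
state = np.zeros(7)                           # 7-dim state (position, velocity, mass)
W, b = np.zeros((3, 7)), np.zeros(3)          # placeholder weights: mean = 0
action, logp = gaussian_policy_sample(state, W, b, np.zeros(3), rng)
```

Learning the log-standard-deviation (rather than the std directly) is a common trick that keeps the distribution valid for any parameter value.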
The critical tradeoff in RL is between exploration (trying new actions to learn what is better) and exploitation (using what has already been learned). For spacecraft guidance, this has a real cost: exploration means occasionally wasting fuel on suboptimal maneuvers during training. The design of the reward function and the training environment must balance these pressures carefully.
The Schulman et al. (2017) paper introduced Proximal Policy Optimization, which has since become one of the most widely used deep RL algorithms. When you update a neural network policy using gradient ascent, it's easy to accidentally take a step that drastically changes the policy's behavior, destroying everything it has learned. Earlier approaches like Trust Region Policy Optimization (TRPO) addressed this with an explicit constraint on how much the policy could change per update, but TRPO required second-order optimization (expensive and complex to implement).
PPO's insight is to instead use a clipped surrogate objective, which modifies the loss function to discourage large policy updates without requiring any expensive constraint machinery. Let $r(\theta)$ be the probability ratio between the new and old policy for a given action:
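Written out, the ratio and the clipped objective from Schulman et al. are:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_t\big)\Big]$$

where $\hat{A}_t$ is the estimated advantage of the action taken at time step $t$.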
The clipping means that if an update would move the policy too far in a beneficial direction (i.e., when $r_t(\theta) > 1 + \varepsilon$ and $\hat{A}_t > 0$), that excess gain is simply ignored and the algorithm doesn't get credit for going too far. This creates a natural "pessimistic bound" that discourages aggressive updates. Schulman et al. found that using $\varepsilon = 0.2$ performed best in their benchmarks, outperforming TRPO on almost all continuous control tasks while being dramatically simpler to implement.
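The clipping behavior is easy to verify numerically. This small helper is my own sketch of the per-sample objective, not code from the paper:

```python
import numpy as np

def l_clip(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective for one (ratio, advantage) pair."""
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, pushing the ratio past 1 + eps earns no extra credit:
assert l_clip(1.1, advantage=1.0) == 1.1   # inside the clip range: full credit
assert l_clip(1.5, advantage=1.0) == 1.2   # excess gain beyond 1 + eps is ignored
# With a negative advantage, the min() still passes the worse (unclipped) value through:
assert l_clip(1.5, advantage=-1.0) == -1.5
```

The asymmetry in the last line is the "pessimistic bound": the objective never hides how bad a large update could be, it only caps how good one can look.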
In addition to the clipped objective, PPO also adds a value function loss term and an entropy bonus to its total objective:
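That combined objective, again in Schulman et al.'s notation, is

$$L^{\text{CLIP}+\text{VF}+S}(\theta) = \hat{\mathbb{E}}_t\Big[L^{\text{CLIP}}_t(\theta) - c_1\, L^{\text{VF}}_t(\theta) + c_2\, S[\pi_\theta](s_t)\Big]$$

where $L^{\text{VF}}_t$ is a squared-error loss on the value function, $S$ is the policy's entropy, and $c_1$, $c_2$ are weighting coefficients.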
The entropy bonus is particularly important for my application. Without it, the policy could converge prematurely to a deterministic, suboptimal solution, for instance always thrusting in the same direction regardless of the spacecraft's actual state. Adding entropy to the objective pushes the policy to remain somewhat stochastic during training, maintaining exploration. Schulman et al. validated PPO across 49 Atari games and 7 MuJoCo continuous control benchmarks, showing it struck the best overall balance between simplicity and final performance.
Spacecraft trajectory optimization involves a high-dimensional, continuous action space (thrust magnitude and direction) with complex, nonlinear dynamics. PPO's strengths of stable training without second-order methods and empirically strong performance on continuous control make it a natural fit.
Zavoli and Federici's (2021) paper is the work I keep returning to as the closest precedent for my project. They apply PPO to a three-dimensional, time-fixed Earth-Mars rendezvous mission. It's a minimum-fuel problem where a spacecraft must match Mars's position and velocity at a specified arrival time. What makes the paper especially relevant is that they explicitly introduce four distinct types of uncertainty: state noise (unmodeled dynamics), observation noise (sensor error), control actuation errors (thruster imprecision), and missed thrust events (complete loss of thrust for one or more consecutive time steps).
The MDP formulation they use maps directly onto the Sutton & Barto framework I studied this week. The spacecraft state at each time step is a seven-dimensional vector of position, velocity, and mass. The commanded action is an impulsive delta-V (a change in velocity), subject to a maximum magnitude constraint derived from the Tsiolkovsky equation and the Sims-Flanagan trajectory model. The reward function penalizes propellant consumption, constraint violations on the maximum delta-V, and terminal state errors (missing Mars at the end).
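A hedged sketch of a reward in this spirit, penalizing propellant use, over-limit delta-V, and terminal error. The weights and the penalty shapes are my own placeholders, not the paper's exact coefficients:

```python
import numpy as np

def step_reward(dv, dv_max, terminal=False, pos_err=0.0, vel_err=0.0,
                w_violation=10.0, w_terminal=100.0):
    """Per-step reward: negative propellant proxy, minus penalties for
    exceeding the delta-V limit and (at the final step) for missing the target."""
    dv_mag = np.linalg.norm(dv)
    reward = -dv_mag                               # propellant consumption proxy
    if dv_mag > dv_max:                            # max delta-V constraint violation
        reward -= w_violation * (dv_mag - dv_max)
    if terminal:                                   # terminal position/velocity error
        reward -= w_terminal * (pos_err + vel_err)
    return reward
```

The relative weights matter a lot in practice: too small a terminal penalty and the policy saves fuel by missing Mars; too large and it ignores fuel entirely.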
In the deterministic (no-noise) case, the RL policy trained by PPO achieved a final spacecraft mass of 600.23 kg, nearly identical to the 599.98 kg obtained by a classical indirect optimization method that took seconds to compute (versus 2-3 hours for the RL training). This validates that PPO can, in principle, match optimal solutions. The real power emerged in the stochastic scenarios: when the policy was trained with state uncertainty and then tested across 500 Monte Carlo episodes, it succeeded 80% of the time. By contrast, the deterministic policy (trained without noise) succeeded essentially 0% of the time in those same noisy conditions, because it had no ability to adapt.
Zavoli and Federici focus on a single fixed mission: Earth to Mars in 358.79 days with one trajectory arc. They explicitly note that the high computational cost "discourages the use of a model-free RL algorithm" for deterministic problems. Multi-gravity-assist trajectories, which involve sequential planetary encounters and vastly larger search spaces, remain largely unexplored in the RL literature. This is the gap I aim to address.
Zavoli and Federici used an $\varepsilon$-constraint relaxation technique, where the terminal constraint tolerance starts loose ($10^{-2}$) and tightens to the final value ($10^{-3}$) halfway through training. This gives the policy room to explore freely at first before being held to strict accuracy standards. I anticipate needing a similar curriculum-style approach in my own work, especially as mission complexity increases with multi-gravity-assist scenarios.
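As a sketch of that schedule (assuming an abrupt halfway switch; the paper's exact schedule may anneal differently):

```python
def constraint_tolerance(step, total_steps, eps_start=1e-2, eps_final=1e-3):
    """Curriculum-style tolerance schedule: loose terminal-constraint tolerance
    for the first half of training, tight final tolerance afterward."""
    return eps_start if step < total_steps // 2 else eps_final
```

The same pattern generalizes to other curriculum knobs, like gradually increasing noise levels or mission complexity as training progresses.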
All three readings this week converge on a single, sharp inquiry: Can deep reinforcement learning, specifically PPO, enable a spacecraft to autonomously optimize fuel consumption across multi-gravity-assist trajectories while correcting for real-world uncertainties in real time?
This breaks down into three sub-questions that will drive my experimental design:
What RL architecture (feedforward MLP, LSTM, or transformer) most effectively learns optimal control policies across the different phases of a multi-gravity-assist mission?
How robust are the learned policies to uncertainties like state noise, observation noise, control errors, and missed thrust events as quantified by Monte Carlo validation campaigns?
Can a trained RL policy achieve fuel efficiency comparable to or better than traditional methods (indirect optimization, heuristic controllers) while delivering real-time inference in milliseconds?
The third question is the most practically significant. Sutton and Barto describe RL policies as fundamentally different from traditional planners: once trained, they are just a single forward pass through a neural network, which would take milliseconds regardless of mission complexity. An indirect optimizer for a complex gravity-assist sequence might take hours or days. For a spacecraft already en route to Jupiter with communication delays of 35+ minutes, the difference between a 2-millisecond onboard decision and a 70-minute round-trip consultation with Earth could be the difference between mission success and failure.
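A quick back-of-the-envelope check of that claim, using a tiny placeholder MLP (arbitrary layer sizes, untrained weights) rather than my eventual architecture:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 7)), np.zeros(64)   # hidden layer: 7 -> 64
W2, b2 = rng.normal(size=(3, 64)), np.zeros(3)    # output layer: 64 -> 3

def policy_forward(state):
    """One forward pass: 7-dim state in, 3-dim thrust command out."""
    h = np.tanh(W1 @ state + b1)
    return W2 @ h + b2

state = rng.normal(size=7)
n = 1000
start = time.perf_counter()
for _ in range(n):
    action = policy_forward(state)
per_call_ms = (time.perf_counter() - start) * 1e3 / n
```

Even in plain NumPy on a laptop, a network of this size evaluates in well under a millisecond per call, which is the property that makes onboard inference plausible.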
I believe that a PPO-trained policy, when given a well-designed reward function and integrated with a high-fidelity astrodynamics simulator (Basilisk), will successfully learn fuel-optimal guidance for at least simplified gravity-assist trajectories, and that this policy will generalize robustly to uncertain environments in a way that deterministic trajectories simply cannot.
Specifically, I expect the RL policy to trade a small amount of propellant efficiency (perhaps 1-5% worse than the theoretical optimum) in exchange for dramatically improved robustness: the ability to correct mid-flight for sensor noise, thruster errors, and missed burns. This is precisely the tradeoff Zavoli and Federici observed: their robust policies used slightly more fuel in ideal conditions but maintained mission success rates above 70% in the presence of severe perturbations, while the deterministic policy failed entirely.
I also believe the architecture choice will matter. Gravity-assist trajectories have distinct mission phases (cruise, approach, flyby, post-flyby correction), and I suspect that architectures with memory like LSTMs may outperform simple feedforward networks by better representing the temporal structure of the trajectory. This remains an open question in the literature that I hope to shed some light on.
The deeper question I'm exploring is whether RL can bridge the gap between theoretically optimal trajectories and practical, real-time guidance. Traditional methods often produce trajectories that assume perfect knowledge and a perfectly performing spacecraft. The real universe is messier. Reinforcement learning, as Sutton and Barto describe it, is fundamentally a framework for decision-making under uncertainty. An RL agent learns a policy that maps any possible situation to the best available action. That is exactly what a spacecraft navigating an unpredictable deep-space environment needs.
Next week, I'll be shifting from theory to orbital mechanics. I'll be reading Schaub and Junkins on the analytical mechanics of space systems, and Sarli and Taheri on interplanetary gravity-assist trajectory optimization. I'll also be setting up my development environment, which means installing the Basilisk astrodynamics simulation framework from the University of Colorado Boulder and verifying that its Python API works as expected.
One thing that's been on my mind after this first week of reading: the Zavoli and Federici paper used a relatively simple 40-step mission model (Sims-Flanagan with 40 impulses) and still took 10-12 hours of training on a modern CPU. My project aims to incorporate more complex dynamics through Basilisk's higher-fidelity simulation environment. Managing computational cost is going to be a central challenge. I may need to explore techniques like parallelized environment simulation and curriculum learning to make training tractable.
I'm curious whether anyone reading this has experience with Basilisk specifically, or with integrating high-fidelity astrodynamics simulators into RL training loops. The latency per environment step is the critical bottleneck, and I'd love to hear how others have approached this kind of simulation-in-the-loop RL problem.