We propose two complementary approaches to improve the integration between RL controllers and safety filters.

Abstract

Reinforcement learning (RL) controllers are flexible and performant but rarely guarantee safety. Safety filters impart hard safety guarantees to RL controllers while maintaining flexibility. However, because the controller and the safety filter are designed separately, the filter's corrections can cause undesired behaviours, degrading performance and robustness. This extended abstract unifies two complementary approaches aimed at improving the integration between the safety filter and the RL controller. The first extends the objective horizon of a safety filter to minimize corrections over a longer horizon. The second incorporates safety filters into the training of RL controllers, improving sample efficiency and policy performance. Together, these methods improve the training and deployment of RL controllers while guaranteeing safety.


1 - Multi-Step Safety Filters

 

Fig. 1: Chattering caused by the standard one-step MPSF versus the multi-step MPSF. The multi-step filter reduces the peak-to-peak amplitude of chattering from 16.3 cm to 3.6 cm.

1.1 - Motivation

Safety filters impart hard safety guarantees to controllers, including deep learning controllers [1]. Model predictive safety filters (MPSFs) are a category of safety filters that leverage model predictive control (MPC) to predict whether uncertified (i.e., potentially unsafe) inputs sent from the controller will violate the constraints. In the case of a potential future constraint violation, the MPSF determines the minimal deviation from the uncertified input that results in constraint satisfaction.

Despite strong theoretical guarantees, MPSFs may cause chattering and high-magnitude corrections. Chattering occurs when the controller directs the system towards a constraint boundary and is repeatedly stopped by the safety filter. This leads to jerky and oscillatory behaviour, degrading performance and potentially causing constraint violations.


1.2 - Method

The standard (one-step) safety filter objective function is [1]:

   $J_{\text{SF},1} = \|\pi_{\text{uncert}}(\textbf{x}_k) - \textbf{u}_{0|k}\|^2,$

where $\textbf{x}_k$ is the state at time step $k$, $\pi_{\text{uncert}}$ is the RL policy, and $\textbf{u}_{0|k}$ is the input to be applied (the optimization variable). By generalizing to multiple steps, the filter can minimize corrections over a longer prediction horizon:

   $J_{\text{SF},M} = \sum_{j=0}^{M-1} w(j)\| \pi_{\text{uncert}}(\textbf{z}_{j|k}) - \textbf{u}_{j|k} \|^2,$

where $w(\cdot) : \mathbb{N}_{0} \to \mathbb{R}^+$ calculates the weights associated with the $j\text{-th}$ correction, $M$ is the filtering horizon, $\textbf{z}_{j|k}$ is the estimated future state at the $(k + j)$-th time step computed at time step $k$, and $\textbf{u}_{j|k}$ is the input at the $(k + j)$-th time step computed at time step $k$. The inputs are the optimization variables. This allows the filter to correct the inputs proactively, before the system reaches unsafe states, rather than applying abrupt corrections at the constraint boundary.
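To make the formulation concrete, the sketch below sets up a multi-step filtering objective for a toy linear system using CVXPY. The dynamics, constraints, weights $w(j) = 0.9^j$, and the stand-in policy pi_uncert are illustrative assumptions rather than the setup from the paper (which uses a robust nonlinear MPC formulation), and the policy is evaluated along a nominal rollout so that the optimization stays convex.

# Illustrative sketch of the multi-step MPSF objective on a toy linear system.
# All dynamics, bounds, weights, and the stand-in policy are assumptions for
# illustration; the paper uses a robust nonlinear MPC formulation instead.
import numpy as np
import cvxpy as cp

dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])  # toy double-integrator dynamics
B = np.array([[0.0], [dt]])
nx, nu = 2, 1

N = 10                          # MPC prediction horizon
M = 5                           # filtering horizon (M <= N)
x_max = np.array([1.0, 2.0])    # state constraints |x| <= x_max
u_max = 0.5                     # input constraints |u| <= u_max

def pi_uncert(x):
    """Stand-in for the uncertified RL policy (a clipped linear feedback)."""
    return np.clip(-np.array([1.5, 0.8]) @ x, -u_max, u_max)

def multi_step_mpsf(x_k, decay=0.9):
    # Simplification: evaluate the policy along a nominal rollout under pi_uncert
    # so the objective stays convex; the actual MPSF evaluates pi_uncert(z_{j|k})
    # at the predicted states inside a nonlinear program.
    z_nom = [x_k]
    for _ in range(M - 1):
        z_nom.append(A @ z_nom[-1] + B @ np.atleast_1d(pi_uncert(z_nom[-1])))
    u_des = np.array([np.atleast_1d(pi_uncert(z)) for z in z_nom])  # desired inputs, shape (M, nu)

    u = cp.Variable((N, nu))
    x = cp.Variable((N + 1, nx))
    # Multi-step objective: sum_{j=0}^{M-1} w(j) * ||pi_uncert(z_{j|k}) - u_{j|k}||^2 with w(j) = decay^j
    cost = sum((decay ** j) * cp.sum_squares(u[j] - u_des[j]) for j in range(M))
    constraints = [x[0] == x_k]
    for j in range(N):
        constraints += [x[j + 1] == A @ x[j] + B @ u[j],
                        cp.abs(x[j + 1]) <= x_max,
                        cp.abs(u[j]) <= u_max]
    # A terminal safe-set constraint on x[N] is needed for recursive feasibility (omitted here).
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return u.value[0]  # certified input u_{0|k} applied to the system

print("certified input:", multi_step_mpsf(np.array([0.9, 0.0])))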

Contributions: We propose generalizing the standard safety filter objective function to minimize corrections over a horizon. We apply this approach to model predictive safety filters (MPSFs) and prove our approach inherits the theoretical recursive feasibility guarantees of the underlying MPC. We demonstrate this multi-step approach reduces chattering, jerkiness, and other potentially unsafe corrective actions.


1.3 - Results

To evaluate the efficacy of the proposed multi-step MPSF, we ran experiments on a simulated cartpole in the safe learning-based control simulation environment $\texttt{safe-control-gym}$ [2] and on a real quadrotor, the Crazyflie 2.0. The underlying MPC is a robust nonlinear MPC formulation [3]. The experiments compare the standard one-step MPSF to our proposed multi-step MPSF with $M=2, 5, 10$. Additionally, we consider the one-step MPSF with an added regularization term $J_{\text{reg}}$.

1.3.1 - Simulation Experiments

[Figure panels: norm of the rate of change of the inputs; magnitude of corrections; maximum corrections]

Fig. 3: Results for simulated experiments on a cartpole testing the multi-step approach.

Compared to the one-step MPSF, the norm of the rate of change of the inputs is reduced by up to 73%, the magnitude of the corrections is reduced by up to 25%, and the maximum correction is reduced by up to 52%. This demonstrates that the proposed filter effectively reduces chattering, achieving a norm of the rate of change of the inputs similar to that of the uncertified control inputs, while also generally decreasing the overall correction effort.

1.3.2 - Real Hardware Experiments

[Figure panels: norm of the rate of change of the inputs; magnitude of corrections; maximum corrections]

Fig. 4: Results for real hardware experiments on a Crazyflie 2.0 quadrotor testing the multi-step approach.

Our proposed approach significantly reduces the norm of the rate of change of the inputs, by 80% compared to the one-step approach when $M=10$. The maximum correction and the magnitude of the corrections are either maintained or decreased, and both are decreased by over 30% compared to the one-step approach when $M=10$. The one-step MPSF with regularization is outperformed in every metric, including the norm of the rate of change of the inputs, by the proposed approach with $M=5$ and $M=10$.

2 - RL Training with Safety Filters

 

Fig. 2: An RL controller trained without a safety filter (blue) tracks a reference trajectory (black), but unforeseen interactions with the safety filter cause poor tracking. When trained with a safety filter (green), the behaviour is smoother and more performant. The constraints are in red.

2.1 - Motivation

Safety filters impart hard safety guarantees to RL controllers while maintaining flexibility. However, adding a safety filter changes how the controller interacts with the environment. If this is not accounted for when training the controller, adding a safety filter can cause undesired behaviours, degrading performance and robustness.

Contributions: We analyze three modifications that incorporate a safety filter into the training process of an RL controller. These modifications can be combined or used separately and can be applied to any RL controller and safety filter. We find that the modifications significantly improve sample efficiency, eliminate constraint violations during training, improve final performance, and reduce chattering on the certified system.


2.2 - Methods

2.2.1 - Filtering Training Actions

During training, the controller generates uncertified actions $\textbf{u}_{\text{uncert}, k} \in \mathbb{U}$. By applying the safety filter $\textbf{u}_{\text{cert}, k} = \pi_{\text{SF}}(\textbf{x}_{k}, \textbf{u}_{\text{uncert}, k})$ to every action, safety is guaranteed throughout training [4], and sample efficiency improves because exploration is focused on safe regions of the state space. Additionally, the controller is trained to maximize the return of the final certified system on which it will be evaluated.
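A minimal sketch of this modification, assuming a Gymnasium-style environment and a hypothetical safety_filter(x, u) callable that solves the filtering problem and returns the certified input (the interfaces in $\texttt{safe-control-gym}$ may differ), could look as follows.

# Minimal sketch of filtering every training action through the safety filter.
# The Gymnasium-style interface and the safety_filter(x, u) callable are assumptions.
import gymnasium as gym
import numpy as np

class FilteredActionWrapper(gym.Wrapper):
    """Certifies every action the RL agent takes during training."""

    def __init__(self, env, safety_filter):
        super().__init__(env)
        self.safety_filter = safety_filter

    def step(self, u_uncert):
        x = self.env.unwrapped.state  # assumes the wrapped env exposes its state
        u_cert = self.safety_filter(x, u_uncert)
        obs, reward, terminated, truncated, info = self.env.step(u_cert)
        info["correction"] = np.asarray(u_cert) - np.asarray(u_uncert)  # used by the penalty below
        return obs, reward, terminated, truncated, info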

2.2.2 - Penalizing Corrections

We can penalize corrections during training to encourage the RL agent to execute safe actions [4]. The magnitude of the correction measures how unsafe the action was. Thus, we penalize the reward by $\alpha \|\textbf{u}_{\text{uncert}, k} - \textbf{u}_{\text{cert}, k} \|_2^2$, where $\alpha > 0$ is a tuneable weight. This reduces the size of the corrections and thus reduces chattering and jerkiness.
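Continuing the sketch above, the stored correction can be converted into a reward penalty; the helper below is a hypothetical illustration, with alpha as the tuneable weight from the text.

# Sketch of the correction penalty. Assumes the wrapper above stores
# u_cert - u_uncert in info["correction"]; alpha is the tuneable weight.
import numpy as np

def penalize_correction(reward, info, alpha=1.0):
    correction = np.asarray(info["correction"], dtype=float).ravel()
    return reward - alpha * float(correction @ correction)  # reward - alpha * ||u_uncert - u_cert||_2^2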

2.2.3 - Safely Resetting the Environment

Sample efficiency can be improved by using the safety filter to avoid initiating an episode in an unsafe state. We sample $\textbf{x}_0 \sim \mathbb{S}$, where $\mathbb{S}$ is the set of starting states, and then determine whether an input can be certified from that state. If the safety filtering optimization is feasible, $\textbf{x}_0$ is safe; if it is infeasible, another starting state is randomly sampled until a feasible one is found.
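A possible implementation of this reset loop, with sample_initial_state and is_feasible as hypothetical helpers (drawing $\textbf{x}_0$ from $\mathbb{S}$ and attempting to solve the safety filter's optimization), is sketched below.

# Sketch of a safe reset: resample the initial state until the safety filter's
# optimization is feasible from it. Both helpers passed in are hypothetical.
def safe_reset(sample_initial_state, is_feasible, max_tries=100):
    for _ in range(max_tries):
        x0 = sample_initial_state()  # draw x0 from the starting-state set S
        if is_feasible(x0):          # attempt to certify an input from x0
            return x0
    raise RuntimeError("No feasible initial state found; check the starting-state set.")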


2.3 - Results

The controllers were evaluated on a simulation of a Crazyflie 2.0 in $\texttt{safe-control-gym}$ [2] and on a real Crazyflie 2.0 [5]. The trajectory tracking task consists of tracking a figure-eight reference in three dimensions. The position is constrained to a region 5% smaller than the full extent of the reference trajectory.

2.3.1 - Simulation Experiments

[Figure panels: return; training convergence]

Fig. 5: Results for simulated experiments on a Crazyflie 2.0 quadrotor testing the various training modifications.

Every combination of the modifications was trained and evaluated. "Std." refers to the baseline with no training modifications. The other approaches are combinations of the training modifications: FA = Filtering Actions, PC = Penalizing Corrections, SR = Safe Reset. All the modifications improve the return, while filtering actions significantly improves convergence. Combining all the modifications leads to the best return and convergence.

[Figure panels: return; training convergence]

Fig. 6: Results for simulated experiments on a Crazyflie 2.0 quadrotor testing the effects of the correction penalty weight $\alpha$ and the constraint violation penalty $\beta$.

To study the effects of reward penalties, we conducted experiments with various values of the correction penalization weight $\alpha$ and the constraint violation penalty $\beta$. We compare the safe approach (all of the training modifications together, denoted "Safe") with $\alpha \in \{0.1, 1, 10, 100\}$ to the standard training (none of the training modifications, denoted "Std.") with $\beta \in \{0, 0.01, 0.1, 1\}$. We see that the safe approaches improve performance compared to the standard training without constraint penalties and significantly improve convergence compared to standard training with and without constraint penalties. Additionally, the safe approaches reduce constraint violations during training to nearly zero.

2.3.2 - Real Hardware Experiments

[Figure panels: return; training convergence]

Fig. 7: Results for real hardware experiments on a Crazyflie 2.0 quadrotor testing the combined training modifications.

The safe approach increases the total return by up to 20% compared to the standard training and by up to 1.5% compared to the $\beta=1$ baseline. The convergence of the safe approaches and of the constraint penalty approach is roughly equal, and both converge faster than the standard training.


BibTeX

Multi-Step Safety Filters


@inproceedings{multi-step-mpsfs,
  author={Pizarro Bejarano, Federico and Brunke, Lukas and Schoellig, Angela P.},
  booktitle={2023 62nd IEEE Conference on Decision and Control (CDC)},
  title={Multi-Step Model Predictive Safety Filters: Reducing Chattering by Increasing the Prediction Horizon},
  year={2023},
  pages={4723-4730},
  doi={10.1109/CDC49753.2023.10383734}
}
      

Safety Filtering While Training


@article{sf-while-training,
  author={Pizarro Bejarano, Federico and Brunke, Lukas and Schoellig, Angela P.},
  journal={IEEE Robotics and Automation Letters},
  title={Safety Filtering While Training: Improving the Performance and Sample Efficiency of Reinforcement Learning Agents},
  year={2025},
  volume={10},
  number={1},
  pages={788-795},
  doi={10.1109/LRA.2024.3512374}
}
      

References

[1] L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig, “Safe learning in robotics: From learning-based control to safe reinforcement learning,” Annual Review of Control, Robotics, and Autonomous Systems, 2022.

[2] Z. Yuan, A. W. Hall, S. Zhou, L. Brunke, M. Greeff, J. Panerati, and A. P. Schoellig, “$\texttt{safe-control-gym}$: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics,” IEEE Robotics and Automation Letters, 2022.

[3] J. Köhler, R. Soloperto, M. A. Müller, and F. Allgöwer, “A computationally efficient robust model predictive control framework for uncertain nonlinear systems - extended version,” IEEE Transactions on Automatic Control, 2021.

[4] H. Krasowski, J. Thumm, M. Müller, L. Schäfer, X. Wang, and M. Althoff, “Provably safe reinforcement learning: Conceptual analysis, survey, and benchmarking,” Transactions on Machine Learning Research, 2023.

[5] S. Teetaert, W. Zhao, et al., “A remote sim2real aerial competition: Fostering reproducibility and solutions' diversity in robotics challenges,” arXiv preprint arXiv:2308.16743, 2023.