Multi-Agent Quadrotor Testbed Control Design:
Integral Sliding Mode vs. Reinforcement Learning∗
Steven L. Waslander†, Gabriel M. Hoffmann†,‡
Ph.D. Candidate
Aeronautics and Astronautics
Stanford University
{stevenw, gabeh}@stanford.edu
Jung Soon Jang
Research Associate
Aeronautics and Astronautics
Stanford University
jsjang@stanford.edu
Claire J. Tomlin
Associate Professor
Aeronautics and Astronautics
Stanford University
tomlin@stanford.edu
Abstract—The Stanford Testbed of Autonomous Rotorcraft
for Multi-Agent Control (STARMAC) is a multi-vehicle testbed
currently comprised of two quadrotors, also called X4-flyers,
with capacity for eight. This paper presents a comparison
of control design techniques, specifically for outdoor altitude
control, in and above ground effect, that accommodate the
unique dynamics of the aircraft. Due to the complex airflow in-
duced by the four interacting rotors, classical linear techniques
failed to provide sufficient stability. Integral Sliding Mode and
Reinforcement Learning control are presented as two design
techniques for accommodating the nonlinear disturbances. The
methods both result in greatly improved performance over
classical control techniques.
I. INTRODUCTION
As first introduced by the authors in [1], the Stanford
Testbed of Autonomous Rotorcraft for Multi-Agent Con-
trol (STARMAC) is an aerial platform intended to validate
novel multi-vehicle control techniques and present real-world
problems for further investigation. The base vehicle for
STARMAC is a four rotor aircraft with fixed pitch blades,
referred to as a quadrotor, or an X4-flyer. They are capable
of 15 minute outdoor flights in a 100m square area [1].
Fig. 1. One of the STARMAC quadrotors in action.
There have been numerous projects involving quadrotors
to date, with the first known hover occurring in October,
1922 [2]. Recent interest in the quadrotor concept has been
sparked by commercial remote control versions, such as the
∗Research supported by ONR under the MURI contract N00014-02-1-
0720 called “CoMotion: Computational Methods for Collaborative Motion”,
by the NASA Joint University Program under grant NAG 2-1564, and by
NASA grant NCC 2-5536.
† These authors contributed equally to this work.
‡Funding provided by National Defense Science and Engineering Grant.
DraganFlyer IV [3]. Many groups [4]–[7] have seen signif-
icant success in developing autonomous quadrotor vehicles.
To date, however, STARMAC is the only operational multi-
vehicle quadrotor platform capable of autonomous outdoor
flight, without tethers or motion guides.
The first major milestone for STARMAC was autonomous
hover control, with closed loop control of attitude, altitude
and position. Using inertial sensing, the attitude of the
aircraft is simple to control, by applying small variations in
the relative speeds of the blades. In fact, standard integral
LQR techniques were applied to provide reliable attitude
stability and tracking for the vehicle. Position control was
also achieved with an integral LQR, with careful design in
order to ensure spectral separation of the successive loops.
Unfortunately, altitude control proves less straightforward.
There are many factors that affect the altitude loop specif-
ically that do not amend themselves to classical control
techniques. Foremost is the highly nonlinear and destabilizing
effect of four rotor downwashes interacting. In our experi-
ence, this effect becomes critical when motion is not damped
by motion guides or tethers. Empirical observation during
manual flight revealed a noticeable loss in thrust upon descent
through the highly turbulent flow field. Similar aerodynamic
phenomenon for helicopters have been studied extensively
[8], but not for the quadrotor, due to its relative obscurity
and complexity. Other factors that introduce disturbances into
the altitude control loop include blade flex, ground effect
and battery discharge dynamics. Although these effects are
also present in generating attitude controlling moments, the
differential nature of the control input eliminates much of the
absolute thrust disturbances that complicate altitude control.
Additional complications arise from the limited choice
in low cost, high resolution altitude sensors. An ultrasonic
ranging device [9] was used, which suffers from non-
Gaussian noise—false echoes and dropouts. The resulting
raw data stream includes spikes and echoes that are difficult
to mitigate, and most successfully handled by rejection of
infeasible measurements prior to Kalman filtering.
In order to accommodate this combination of noise and
disturbances, two distinct approaches are adopted. Integral
Sliding Mode (ISM) control [10]–[12] takes the approach
that the disturbances cannot be modeled, and instead designs
a control law that is guaranteed to be robust to disturbances as
long as they do not exceed a certain magnitude. Model-based
reinforcement learning [13] creates a dynamic model based
on recorded inputs and responses, without any knowledge
of the underlying dynamics, and then seeks an optimal
control law using an optimization technique based on the
learned model. This paper presents an exposition of both
methods and contrasts the techniques from both a design and
implementation point of view.
II. SYSTEM DESCRIPTION
STARMAC consists of a fleet of quadrotors and a ground
station. The system communicates over a Bluetooth Class 1
network. The core of the aircraft are microcontroller circuit
boards designed and assembled at Stanford, for this project.
The microcontrollers run real-time control code, interface
with sensors and the ground station, and supervise the system.
The aircraft are capable of sensing position, attitude, and
proximity to the ground. The differential GPS receiver is the
Trimble Lassen LP, operating on the L1 band, providing 1
Hz updates. The IMU is the MicroStrain 3DM-G, a low cost,
light weight IMU that delivers 76 Hz attitude, attitude rate,
and acceleration readings. The distance from the ground is
found using ultrasonic ranging at 12 Hz.
The ground station consists of a laptop computer, to
interface with the aircraft, and a GPS receiver, to provide
differential corrections. It also has a battery charger, and
joysticks for control-augmented manual flight, when desired.
III. QUADROTOR DYNAMICS
The derivation of the nonlinear dynamics is performed in
North-East-Down (NED) inertial and body fixed coordinates.
Let {eN, eE, eD} denote the inertial axes, and {xB,yB, zB}
denote the body axes, as defined in Figure 2. Euler angles of
the body axes are {φ, θ, ψ} with respect to the eN, eE and
eD axes, respectively, and are referred to as roll, pitch and
yaw. Let r be defined as the position vector from the inertial
origin to the vehicle center of gravity (CG), and let ωB be
defined as the angular velocity in the body frame. The current
velocity direction is referred to as ev in inertial coordinates.
Fig. 2. Free body diagram of a quadrotor aircraft.
The rotors, numbered 1− 4, are mounted outboard on the
xB, yB, −xB and −yB axes, respectively, with position
vectors ri with respect to the CG. Each rotor produces an
aerodynamic torque, Qi, and thrust, Ti, both parallel to the
rotor’s axis of rotation, and both used for vehicle control.
Here, Ti ≈ ui kt1+0.1s , where ui is the voltage applied to the
motors, as determined from a load cell test. In flight, Ti can
vary greatly from this approximation. The torques, Qi, are
proportional to the rotor thrust, and are given by Qi = krTi.
Rotors 1 and 3 rotate in the opposite direction as rotors 2
and 4, so that counteracting aerodynamic torques can be used
independently for yaw control. Horizontal velocity results in
a moment on the rotors, Ri, about −ev, and a drag force,
Di, in the direction, −ev.
The body drag force is defined as DB , vehicle mass is
m, acceleration due to gravity is g, and the inertia matrix is
I ∈ R3×3. A free body diagram is depicted in Figure 2. The
total force, F, and moment, M, can be summed as,
F = −DBev + mgeD +
4∑
i=1
(−TizB −Diev) (1)
M =
4∑
i=1
(QizB −Riev −Di(ri × ev) + Ti(ri × zB))
(2)
The full nonlinear dynamics can be described as,
mr¨ = F
Iω˙B + ωB × IωB = M
(3)
where the total angular momentum of the rotors is assumed
to be near zero, because they are counter-rotating. Near hover
conditions, the contributions by rolling moment and drag can
be neglected in Equations (1) and (2). Define the total thrust
as T =
∑4
i=1 Ti. The translational motion is defined by,
mr¨ = F = −Rψ ·Rθ ·RφTzB + mgeD (4)
where Rφ, Rθ, and Rψ are the rotation matrices for roll,
pitch, and yaw, respectively. Applying the small angle ap-
proximation to the rotation matrices,
m
⎡
⎣ r¨xr¨y
r¨z
⎤
⎦ =
⎡
⎣ 1 ψ θψ 1 φ
θ −φ 1
⎤
⎦
⎡
⎣ 00
−T
⎤
⎦+
⎡
⎣ 00
mg
⎤
⎦ (5)
Finally, assuming total thrust approximately counteracts grav-
ity, T ≈ T¯ = mg, except in the eD axis,
m
⎡
⎣ r¨xr¨y
r¨z
⎤
⎦ =
⎡
⎣ 00
mg
⎤
⎦+
⎡
⎣ 0 −T¯ 0T¯ 0 0
0 0 1
⎤
⎦
⎡
⎣ φθ
T
⎤
⎦ (6)
For small angular velocities, the Euler angle accelerations
are determined from Equation (3) by dropping the second
order term, ω × Iω, and expanding the thrust into its four
constituents. The angular equations become,
⎡
⎣ Ixφ¨Iy θ¨
Izψ¨
⎤
⎦ =
⎡
⎣ 0 l 0 −ll 0 −l 0
Kr −Kr Kr −Kr
⎤
⎦
⎡
⎢⎢⎣
T1
T2
T3
T4
⎤
⎥⎥⎦ (7)
where the moment arm length l = ||ri× zB|| is identical for
all rotors due to symmetry. The resulting linear models can
now be used for control design.
IV. ESTIMATION AND CONTROL DESIGN
Applying the concept of spectral separation, inner loop
control of attitude and altitude is performed by commanding
motor voltages, and outer loop position control is performed
by commanding attitude requests for the inner loop. Accurate
attitude control of the plant in Equation (7) is achieved with
an integral LQR controller design to account for thrust biases.
Position estimation is performed using a navigation filter that
combines horizontal position and velocity information from
GPS, vertical position and estimated velocity information
from the ultrasonic ranger, and acceleration and angular rates
from the IMU in a Kalman filter that includes bias estimates.
Integral LQR techniques are applied to the horizontal com-
ponents of the linear position plant described in Equation (6).
The resulting hover performance is shown in Figure 6.
As described above, altitude control suffers exceedingly
from unmodeled dynamics. In fact, manual command of
the throttle for altitude control remains a challenge for the
authors to this day. Additional complications arise from
the ultrasonic ranging sensor, which has frequent erroneous
readings, as seen in Figure 3. To alleviate the effect of this
noise, rejection of infeasible measurements is used to remove
much of the non-Gaussian noise component. This is followed
by altitude and altitude rate estimation by Kalman filtering,
which adds lag to the estimate. This section proceeds with
a derivation of two control techniques that can be used to
overcome the unmodeled dynamics and the remaining noise.
180 190 200 210 220
0
50
100
150
Time [s]
So
ni
c
Ra
ng
e
[cm
]
Fig. 3. Characteristic unprocessed ultrasonic ranging data, displaying
spikes, false echoes and dropouts. Powered flight commences at 185 seconds.
A. Integral Sliding Mode Control
A linear approximation to the altitude error dynamics of a
quadrotor aircraft in hover is given by,
x˙1 = x2
x˙2 = u + ξ(g, x) (8)
where {x1, x2} = {(rz,des−rz), (r˙z,des−r˙z)} are the altitude
error states, u =
∑4
i=1 ui is the control input, and ξ(·) is a
bounded model of disturbances and dynamic uncertainty. It
is assumed that ξ(·) satisfies ‖ξ‖ ≤ γ, where γ is the upper
bounded norm of ξ(·).
In early attempts to stabilize this system, it was observed
that LQR control was not able to address the instability
and performance degradation due to ξ(g, x). Sliding Mode
Control (SMC) was adapted to provide a systematic approach
to the problem of maintaining stability and consistent perfor-
mance in the face of modeling imprecision and disturbances.
However, until the system dynamics reach the sliding mani-
fold, such nice properties of SMC are not assured. In order
to provide robust control throughout the flight envelope, the
Integral Sliding Mode (ISM) technique is applied.
The ISM control is designed in two parts. First, a standard
successive loop closure is applied to the linear plant. Second,
integral sliding mode techniques are applied to guarantee
disturbance rejection. Let
u = up + ud
up = −Kpx1 −Kdx2 (9)
where Kp and Kd are proportional and derivative loop gains
that stabilize the linear dynamics without disturbances. For
disturbance rejection, a sliding surface, s, is designed,
s = s0(x1, x2) + z
s0 = α(x2 + kx1) (10)
such that state trajectories are forced towards the manifold
s = 0. Here, s0 is a conventional sliding mode design,
z is an additional term that enables integral control to be
included, and α, k ∈ R are positive constants. Based on
the following Lyapunov function candidate, V = 12s
2
, the
control component, ud, can be determined such that V˙ < 0,
guranteeing convergence to the sliding manifold.
V˙ = ss˙ = s
[
α(x˙2 + kx˙1) + z˙
]
= s
[
α(up + ud + ξ(g, x) + kx2) + z˙
]
< 0 (11)
The above condition holds if z˙ = −α(up + kx2) and ud can
be guaranteed to satisfy,
s
[
ud + ξ(g, x)
]
< 0, α > 0 (12)
Since the disturbances, ξ(g, x), are bounded by γ, define ud
to be ud = −λs with λ ∈ R. Equation (11) becomes,
V˙ = s
[
α(−λs + ξ(g, x)
]
≤ α
[
− λ|s|2 + γ|s|
]
< 0 (13)
and it can be seen that λ|s| − γ > 0. As a result, for up and
ud as above, the sliding mode condition holds when,
|s| >
γ
λ
(14)
With the input derived above, the dynamics are guaranteed
to evolve such that s decays to within the boundary layer,
γ
λ
, of the sliding manifold. Additionally, the system does
not suffer from input chatter as conventional sliding mode
controllers do, as the control law does not include a switching
function along the sliding mode.
V. REINFORCEMENT LEARNING CONTROL
An alternate approach is to implement a reinforcement
learning controller. Much work has been done on continuous
state-action space reinforcement learning methods [13], [14].
For this work, a nonlinear, nonparametric model of the
system is first constructed using flight data, approximating
the system as a stochastic Markov process [15], [16]. Then a
model-based reinforcement learning algorithm uses the model
in policy-iteration to search for an optimal control policy that
can be implemented on the embedded microprocessors.
In order to model the aircraft dynamics as a stochas-
tic Markov process, a Locally Weighted Linear Regression
(LWLR) approach is used to map the current state, S(t) ∈
R
ns
, and input, u(t) ∈ Rnu , onto the subsequent state esti-
mate, Sˆ(t+1). In this application, S = [ rz r˙z r¨z V ],
where V is the battery level. In the altitude loop, the input,
u ∈ R, is the total motor power, u. The subsequent state
mapping is the summation of the traditional LWLR estimate,
using the current state and input, with the random vector,
v ∈ Rns , representing unmodeled noise. The value for v is
drawn from the distribution of output error as determined by
using a maximum likelihood estimate [16] of the Gaussian
noise in the LWLR estimate. Although the true distribution
is not perfectly Gaussian, this model is found to be adequate.
The LWLR method [17] is well suited to this problem, as it
fits a non-parametric curve to the local structure of the data.
The scheme extends least squares by assigning weights to
each training data point according to its proximity to the input
value, for which the output is to be computed. The technique
requires a sizable set of training data in order to reflect the
full dynamics of the system, which is captured from flights
flown under both automatic and manually controlled thrust,
with the attitude states under automatic control.
For m training data points, the input training samples are
stored in X ∈ R(m)×(ns+nu+1), and the outputs correspond-
ing to those inputs are stored in Y ∈ Rm×ns . These matrices
are defined as
X =
⎡
⎢⎣
1 S(t1)
T
u(t1)
T
.
.
.
.
.
.
.
.
.
1 S(tm)
T
u(tm)
T
⎤
⎥⎦ , Y =
⎡
⎢⎣
S(t1 + 1)
T
.
.
.
S(tm + 1)
T
⎤
⎥⎦
(15)
The column of ones in X enables the inclusion of a constant
offset in the solution, as used in linear regression.
The diagonal weighting matrix W ∈ Rm×m, which acts on
X , has one diagonal entry for each training data point. That
entry gives more weight to training data points that are close
to the S(t) and u(t) for which Sˆ(t + 1) is to be computed.
The distance measure used in this work is
Wi,i = exp
(
−||x(i) − x||
2τ2
)
(16)
where x(i) is the ith row of X , x is the vector
[ 1 S(t)
T
u(t)
T
], and fit parameter τ is used to adjust
the range of influence of training points. The value for τ can
be tuned by cross validation to prevent over- or under-fitting
the data. Note that it may be necessary to scale the columns
before taking the Euclidean norm to prevent undue influence
of one state on the W matrix.
The subsequent state estimate is computed by summing
the LWLR estimate with v,
Sˆ(t + 1) =
(
X
T
WX
)
−1
X
T
W
T
x + v (17)
Because W is a continuous function of x and X , as x is
varied, the resulting estimate is a continuous non-parametric
curve capturing the local structure of the data. The matrix
computations, in code, exploit the large diagonal matrix W ;
as each Wi,i is computed, it is multiplied by row x(i), and
stored in WX .
The matrix being inverted is poorly conditioned, because
weakly related data points have little influence, so their
contribution cannot be accurately numerically inverted. To
more accurately compute the numerical inversion, one can
perform a singular value decomposition, (XTWX) = UΣV T .
Then, numerical error during inversion can be avoided by
using the n singular values σi with values of σmaxσi < Cmax,
where the value of Cmax is chosen by cross validation. In
this work, Cmax ≈ 10 was found to minimize numerical
error, and was typically satisfied by n = 1. The inverse can
be directly computed using the n upper singular values in the
diagonal matrix Σn ∈ Rn×n, and the corresponding singular
vectors, in Un ∈ Rm×n and Vn ∈ Rm×n. Thus, the stochastic
Markov model becomes
Sˆ(t + 1) = VnΣ
−1
n U
T
nX
T
W
T
x + v (18)
Next, model-based reinforcement learning is implemented,
incorporating the stochastic Markov model, to design a
controller. A quadratic reward function is used,
R(S,Sref ) = −c1(rz − rz,ref )
2 − c2r˙
2
z (19)
where R : R2ns → R, c1 > 0 and c2 > 0 are constants
giving reward for accurate tracking and good damping re-
spectively, and Sref = [ rz,ref r˙z,ref r¨z,ref Vref ] is
the reference state desired for the system.
The control policy maps the observed state S onto the input
command u. In this work, the state space has the constraint
of rz ≥ 0, and the input command has the constraint of
0 ≤ u ≤ umax. The control policy is chosen to be
π(S,w) = w1 + w2(rz − rz,ref ) + w3r˙z + w4r¨z (20)
where w ∈ Rnc is the vector of policy coefficients
w1, . . . , wnc . Linear functions were sufficient to achieve good
stability and performance. Additional terms, such as battery
level and integral of altitude error, could be included to make
the policy more resilient to differing flight conditions.
Policy iteration is performed as explained in Algorithm 1.
The algorithm aims to find the value of w that yields the
greatest total reward Rtotal, as determined by simulating the
system over a finite horizon from a set of random initial
conditions, and summing the values of R(S,Sref ) at each
state encountered.
Algorithm 1 Model-Based Reinforcement Learning
1: Generate set S0 of random initial states
2: Generate set T of random reference trajectories
3: Initialize w to reasonable values
4: Rbest ← −∞, wbest ← w
5: repeat
6: Rtotal ← 0
7: for s0 ∈ S0, t ∈ T do
8: S(0) ← s0
9: for t = 0 to tmax − 1 do