complexity can be reduced when compared to those seen in lateral control. This has meant that
instead of vision based set-ups, ranging sensors have been more widely used. Moreover, these
one-dimensional measurements also enable easier design of reward functions for reinforcement
learning since they often directly serve as metrics for performance. While reinforcement learn-
ing has provided good results, it has multiple drawbacks. The major drawback of reinforcement
learning is the large number of experiences required for training convergence, which can be ex-
pensive and/or time consuming to collect [56, 100]. In contrast, supervised learning methods
often lack the level of generalization offered by reinforcement learning, but tend to provide faster
convergence, given a suitably high quality data set can be obtained. For these reasons, combin-
ing the two learning methods can provide significant advantages. Using supervised learning as a
pre-training stage for reinforcement learning, Zhao et al. [22, 101, 102] presented their Super-
vised Actor-Critic (SAC) algorithm for longitudinal control. The system was trained for vehicle
following on dry road conditions, and then evaluated on both dry and wet road conditions. Dur-
ing evaluation, the SAC controller was compared to a PID controller and a supervised learning
controller (without reinforcement learning), and the SAC controller was found to exhibit the best
performance. These results show that combining supervised and reinforcement learning can leverage
the advantages of both strategies.
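To make the two-stage idea concrete, the sketch below pre-trains an actor on expert state-action pairs and then switches to actor-critic updates. This is only a minimal illustration, not the SAC formulation of Zhao et al.; the use of PyTorch, the network sizes, the placeholder data, and the simple deterministic actor-critic update are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical actor: (relative distance, relative velocity) -> acceleration command in [-1, 1].
actor = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1), nn.Tanh())
# Critic: Q(state, action) estimate, only needed in the reinforcement learning stage.
critic = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 1))
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Stage 1: supervised pre-training on (state, expert action) pairs (placeholder demonstrations).
demo_states = torch.randn(256, 2)
demo_actions = torch.tanh(-demo_states.sum(dim=1, keepdim=True))
for _ in range(200):
    loss = nn.functional.mse_loss(actor(demo_states), demo_actions)
    opt_a.zero_grad(); loss.backward(); opt_a.step()

# Stage 2: actor-critic fine-tuning from interaction data (random placeholders here).
states, next_states = torch.randn(256, 2), torch.randn(256, 2)
actions = actor(states).detach()
rewards = -states[:, :1].abs()                      # e.g., penalize spacing error
for _ in range(200):
    with torch.no_grad():                           # one-step bootstrapped target
        target = rewards + 0.99 * critic(torch.cat([next_states, actor(next_states)], 1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([states, actions], 1)), target)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    actor_loss = -critic(torch.cat([states, actor(states)], 1)).mean()  # policy improvement step
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```

The benefit of the pre-training stage is that the actor already produces sensible trajectories when reinforcement learning begins, so the critic evaluates useful behavior from the start rather than random exploration.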
A summary of research works discussed in this section can be seen in Table 3.2. The
general trend in longitudinal control has been to utilize the generalization capabilities of rein-
forcement learning to perform well in different environments. Unlike lateral control, the inputs
seen here are mostly lower-dimensional measurements of host vehicle states and measurements
relative to nearby vehicles (or other objects), which can be obtained from ranging sensors. This
means that network complexity can be reduced, and there is no need to use CNNs. However,
as discussed in this chapter, in reinforcement learning, the reward function needs to successfully
represent the desirability of being in any given state. For a complex task such as driving, this will
require considering multiple conflicting objectives, such as desired velocity, passenger comfort,
driver habits, and safety. A poorly designed reward function can lead to training divergence,
poor control performance, or unexpected behavior.
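As a toy illustration of such a multi-objective reward, the function below combines velocity tracking, comfort, and safety terms into a single weighted sum. The specific terms, weights, and thresholds are assumptions made for the example rather than a recommended design.

```python
def reward(v, v_des, accel, prev_accel, time_headway,
           w_speed=1.0, w_comfort=0.2, w_safety=2.0, min_headway=1.5):
    """Illustrative multi-term reward for longitudinal control (all values assumed)."""
    speed_term = -abs(v - v_des) / max(v_des, 1e-3)              # track the desired velocity
    comfort_term = -abs(accel - prev_accel)                      # penalize jerky control inputs
    safety_term = -1.0 if time_headway < min_headway else 0.0    # penalize unsafe headway
    return w_speed * speed_term + w_comfort * comfort_term + w_safety * safety_term
```

Even in this simple form, the relative weights encode a trade-off: over-weighting the safety term can make the learned policy overly conservative, while under-weighting the comfort term tends to produce oscillatory acceleration commands.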
Table 3.2: A comparison of longitudinal control techniques

| Ref. | Learning Strategy | Network | Inputs | Outputs | Pros | Cons |
|------|-------------------|---------|--------|---------|------|------|
| [94] | Fuzzy reinforcement learning | Feedforward network with one hidden layer | Relative distance, relative speed, previous control input | Throttle angle, brake torque | Model-free, continuous action values | Single-term reward function |
| [23] | Reinforcement learning | Feedforward network with one hidden layer | Time headway, headway derivative | Accelerate, brake, or no-op | Maintains a safe distance | Oscillatory acceleration behavior, no term for comfort in reward function |
| [95] | Reinforcement learning | Actor-Critic Network | Velocity, velocity tracking error, acceleration error, expected acceleration | Gas and brake commands | Learns from minimal training data | Noisy behavior of the steering signal |
| [98] | Reinforcement learning | Feedforward network with five hidden layers | Vehicle velocity, relative position of the pedestrian for past five time steps | Discretized deceleration actions | Reliably avoids collisions | Only considers collision avoidance with pedestrians |
| [96] | Reinforcement learning | Feedforward network with one hidden layer | Relative distance, relative velocity, relative acceleration (normalized) | Desired acceleration | Provides smooth driving styles, learns personal driving styles | No methods for preventing learning of bad habits from human drivers |
| [97] | Reinforcement learning | Actor-Critic Network | Relative distance, host velocity, relative velocity, host acceleration | Desired acceleration | Performs well in a variety of scenarios, safety and comfort considered, learns personal driving styles | Adapting unsafe driver habits could degrade safety |
| [22] | Supervised reinforcement learning | Actor-Critic Network | Relative distance, relative velocity | Desired acceleration | Pre-training by supervised learning accelerates the learning process and helps guarantee convergence, performs well in critical scenarios | Requires supervision to converge, driving comfort not considered |

3.1.3 FULL VEHICLE CONTROL
There has also been some recent interest in investigating vehicle controllers which control all
actions of the vehicle (steering, acceleration, braking) simultaneously. For instance, Zheng et
al. [103] proposed a decision-making system for an autonomous vehicle in a highway scenario
based on reinforcement learning. The system considered three factors in the reward function:
(1) safety, evaluated based on distances to adjacent vehicles; (2) smoothness, based on accumu-
lation of longitudinal speed changes over time; and (3) velocity tracking, based on the difference
between current velocity and desired velocity. The authors used a Least Squares Policy Iteration
approach to find the optimal decision-making policy. The system was then evaluated in a simple
highway simulation, where the vehicle had to undergo multiple overtaking maneuvers. The system
could successfully complete all overtaking maneuvers. In contrast, Shalev-Shwartz et al. [104]
considered a more complex scenario in which an autonomous vehicle must operate around un-
predictable vehicles. The aim of the project was to design an agent which can pass a roundabout
safely and efficiently. The performance of the agent was evaluated based on (1) keeping a safe
distance from other vehicles at all times, (2) the time to finish the route, and (3) smoothness of
the acceleration policy. The authors utilized an RNN to accomplish this task, which was chosen
due to its ability to learn the function between a chosen action, the current state, and the state
at the next time step without explicitly relying on any Markovian assumptions. Moreover, by
explicitly expressing the dynamics of the system in a transparent way, prior knowledge could be
incorporated into the system more easily. The described method was shown to learn to slow down
when approaching roundabouts, to give way to aggressive drivers, and to safely continue when
merging with less aggressive drivers. In the initial testing, the next state was decomposed into
a predictable part, including velocities and locations of adjacent vehicles, and a non-predictable
part, consisting of the accelerations of the adjacent vehicles. The dynamics of the predictable
part of the next state were provided to the agent. However, in the next phase of testing, all state
parameters at the next time step were considered unpredictable and instead had to be learned
during training. The learning process was more challenging in these conditions, but still suc-
ceeded. Additionally, the authors claimed that the described method could be adapted to other
driving policies such as lane change decisions, highway exit and merge, negotiation of the right
of way in junctions, yielding for pedestrians, and complicated planning in urban scenarios.
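The state decomposition used in the first phase of testing can be sketched as follows: a hand-coded kinematic update provides the predictable part of the next state, while a recurrent network learns only the residual, non-predictable part. This is a schematic illustration under assumed dimensions and a GRU backbone, not the architecture of [104].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecomposedPredictor(nn.Module):
    """Next-state prediction split into a known kinematic part and a learned residual."""

    def __init__(self, state_dim=8, action_dim=2, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(state_dim + action_dim, hidden, batch_first=True)
        self.residual_head = nn.Linear(hidden, state_dim)

    def predictable_part(self, states, actions, dt=0.1):
        # Placeholder for known dynamics (e.g., positions advancing with current velocities).
        return states + dt * F.pad(actions, (0, states.shape[-1] - actions.shape[-1]))

    def forward(self, states, actions):
        # states: (batch, time, state_dim), actions: (batch, time, action_dim)
        h, _ = self.rnn(torch.cat([states, actions], dim=-1))
        residual = self.residual_head(h)                   # learned, non-predictable part
        return self.predictable_part(states, actions) + residual


# Example call with random data: predict the next state at every time step.
pred = DecomposedPredictor()(torch.randn(4, 10, 8), torch.randn(4, 10, 2))
```

In the second phase of testing described above, the hand-coded term would simply be dropped and the network would have to learn the full transition on its own.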
As mentioned before, supervised learning can significantly reduce training time for a con-
trol system. For this reason, Xia et al. [105] presented a vehicle control algorithm based on Q-
learning combined with a pre-training phase based on expert demonstration. A filtered experi-
ence replay, in which past experiences were stored while poorly performing ones were discarded,
was used to improve convergence during training. The use of pre-training and filtered experi-
ence replay was shown to not only improve final performance of the control policy, but also speed
up convergence by up to 71%. Comparing two learning algorithms for lane keeping, Sallab et
al. [106] investigated the effect of continuous and discretized action spaces. A DQN was chosen
for discrete action spaces, while an actor-critic algorithm was used for continuous action values.
The two networks were trained and evaluated driving around a simulated race track, where their
goal was to complete the track while staying near the center of the lane. The ability to utilize
continuous action values resulted in significantly stronger performance with the actor-critic al-
gorithm, enabling a much smoother control policy. In contrast, the DQN algorithm struggled
to stay near the center of the lane, especially on curved roads. These results demonstrated the
advantages of using continuous action values.
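The filtered experience replay used by Xia et al. [105] can be illustrated with a buffer that only keeps episodes whose return clears a quality threshold, so poor rollouts do not pollute later training batches. The episode-level filter and the threshold rule below are assumptions for the sketch, not the exact mechanism of the paper.

```python
import random
from collections import deque


class FilteredReplay:
    """Replay buffer that stores transitions only from sufficiently good episodes."""

    def __init__(self, capacity=50_000, min_return=0.0):
        self.buffer = deque(maxlen=capacity)
        self.min_return = min_return

    def add_episode(self, transitions):
        # transitions: list of (state, action, reward, next_state, done) tuples
        episode_return = sum(t[2] for t in transitions)
        if episode_return >= self.min_return:      # discard poorly performing episodes
            self.buffer.extend(transitions)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

Expert demonstrations from the pre-training phase can be used to seed such a buffer, which is one way the two mechanisms complement each other.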
Using vision to control the autonomous vehicle, Zhang et al. [107] presented their su-
pervised learning algorithm, SafeDAgger, based on the Dataset Aggregation (DAgger) [108]
imitation learning technique. In DAgger, the first phase of training consists of traditional super-
vised learning where the model learns from a training set collected from an expert completing
a task. Next, the trained policy performs the task and the resulting experiences are collected
into a new data set. The expert is then asked to label the newly collected data with the correct
actions. The labeled data are then added to the original data set to create an augmented data
set, which is used to further train the model. The process is then repeated in the DAgger train-
ing framework. This approach is used to overcome one of the limitations of supervised learning
for control policies. When using a supervised learning-based algorithm, as the trained model
makes slight errors, it veers away from its known operational environment and such errors can
compound over time. For instance, when steering a vehicle, the model may have learned from a
data set where the expert is always in (or near) the center of the lane. However, when the trained
network is steering the vehicle, small errors in the predicted steering angle can slowly bring the
vehicle away from the center of the lane. If the network has never seen training examples where
the vehicle is veering away from the center of the lane, it will not know how to correct its own
mistakes. Instead, in the DAgger framework, the network can make these mistakes, and the
examples are then labeled with correct actions by the expert, and the next training phase then
allows the network to learn to correct its mistakes. This results in a control policy that is more
robust to small errors; however, it introduces further complexity and cost in accessing the expert
for further ground-truth queries. The framework was later extended in SafeDAgger by developing a
method that estimates, for the current state, whether the trained policy is likely to deviate from
the reference (expert) policy. If it is likely to deviate by more than a specified threshold, the refer-
ence policy is used to control the vehicle instead. The authors used the SafeDAgger method to
train a CNN to control the lateral and longitudinal actions of an autonomous vehicle while driv-
ing on a test track. Three algorithms were compared: traditional supervised learning, DAgger,
and SafeDAgger, with SafeDAgger found to perform the best. Another work exploring a
DAgger-type framework for autonomous driving was presented by Pan et al. [109]. The method
used a model predictive control algorithm to provide the reference policy using high cost sen-
sors, while the learned policy was modeled by a CNN with low cost camera sensors only. The
advantage of this approach was that the reference policy could use highly accurate sensing data,
while the trained policy could be deployed using low cost sensors only. The approach was evalu-
ated experimentally on a sub-scale vehicle driving at high speeds around a dirt track, where the
trained policy demonstrated it could operate the vehicle using the low cost sensors only.
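The DAgger loop described above, together with a SafeDAgger-style safety gate, can be summarized in a few lines. Here expert_policy, rollout, train, and deviation_estimator are assumed callables and the threshold is illustrative; this is a sketch of the training scheme, not the authors' implementation.

```python
def dagger(expert_policy, learner, train, rollout, n_iterations=5):
    """DAgger: iteratively aggregate expert labels for states the learner visits.

    rollout(policy) -> list of visited states
    expert_policy(state) -> expert action label
    train(learner, dataset) -> learner fitted on (state, action) pairs
    """
    dataset = [(s, expert_policy(s)) for s in rollout(expert_policy)]   # initial expert data
    learner = train(learner, dataset)
    for _ in range(n_iterations):
        visited = rollout(learner)                                      # let the learner drive
        dataset += [(s, expert_policy(s)) for s in visited]             # expert relabels those states
        learner = train(learner, dataset)                               # retrain on the aggregated set
    return learner


def safedagger_act(state, learner, expert_policy, deviation_estimator, threshold=0.1):
    """SafeDAgger-style gate: fall back to the expert when the learner is predicted
    to deviate from the reference policy by more than the threshold."""
    if deviation_estimator(state) > threshold:
        return expert_policy(state)
    return learner(state)
```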
In a move away from direct perception methods, Wang et al. [110] presented an object-
centric deep learning control policy for autonomous vehicles. e method used a CNN which
identified salient features in the image, and these features were then used to output a control
action, which was discretized to {left, straight, right, fast, slow, stop} and then executed by a
PID controller. Simulation experiments in Grand Theft Auto V demonstrated that the proposed
approach outperforms control policies without any attention or those based on heuristic feature
selection. In another work, Porav and Newman [111] presented a collision avoidance system
using deep reinforcement learning. The proposed method used a variational autoencoder coupled
with an RNN to predict continuous actions for steering and braking using semantically segmented
images as input. Compared to the collision avoidance system with braking only by Chae et
al. [98], a significant improvement in reducing collision rates for TTC values below 1.5 s was
demonstrated, with up to 60% reduction in collision rates.
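A rough sketch of such an encoder-plus-recurrent policy is given below: segmented frames are compressed to a latent code in VAE fashion, a GRU accumulates temporal context, and separate heads emit continuous steering and braking. The layer sizes, the number of semantic classes, and the omission of the decoder and KL terms are simplifying assumptions, not the architecture of [111].

```python
import torch
import torch.nn as nn


class SegmentationPolicy(nn.Module):
    """Encoder + recurrent policy over semantically segmented frames (illustrative only)."""

    def __init__(self, n_classes=12, latent=32, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(n_classes, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mu = nn.Linear(32, latent)          # VAE-style latent mean
        self.logvar = nn.Linear(32, latent)      # (reconstruction/KL losses omitted)
        self.rnn = nn.GRU(latent, hidden, batch_first=True)
        self.steer = nn.Linear(hidden, 1)
        self.brake = nn.Linear(hidden, 1)

    def forward(self, seg_frames):
        # seg_frames: (batch, time, classes, H, W) one-hot semantic maps
        b, t = seg_frames.shape[:2]
        feats = self.encoder(seg_frames.flatten(0, 1))
        mu, logvar = self.mu(feats), self.logvar(feats)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        h, _ = self.rnn(z.view(b, t, -1))
        return torch.tanh(self.steer(h)), torch.sigmoid(self.brake(h))


# Example call: 5 segmented frames produce per-step steering in [-1, 1] and braking in [0, 1].
steering, braking = SegmentationPolicy()(torch.randn(1, 5, 12, 96, 160))
```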
These works demonstrate that a DNN can be trained for simple driving tasks such as
staying in its lane, following a vehicle, or collision avoidance. However, driving not only consists
of multiple such tasks, but also requires the context of a higher-level goal. Humans drive
vehicles with the aim of starting at point A and arriving at point B. Therefore, having the ability to
only follow a single lane is not enough to fully automate this process. For example, it has been
reported that end-to-end road following algorithms will oscillate between two different driving
directions when coming to a fork in the road [83]. Therefore, the networks should also be able
to learn to utilize a higher level goal to take correct turns and arrive at the target destination.
Aiming to provide such context awareness for autonomous vehicles, Hecker et al. [112] used
a route plan as an additional input to a CNN-based control policy. e network was trained
with supervised learning, where the human driver was following a route plan. These driving
actions, along with the route plan, were then included in the training set. Although no live
testing with the network in control of the car was completed, qualitative testing with example
images from the collected data set suggested the model was learning to follow a given route.
In a similar approach, Codevilla et al. [113] used navigation commands as an additional input
to the network. This allows the human or a route planner to tell the autonomous vehicle where
it should turn at intersections, and the trained control policy simply follows the given route. A
network architecture which shared the initial layers, but had different sub-modules (feedforward
layers) for each navigational command, was found to work best for this approach. When given a
high-level navigational command (e.g., turn left), the algorithm uses the final feedforward layers
specifically trained for that command. The trained model was evaluated both in simulation and
in the real world using a sub-scale vehicle. The proposed approach was shown to successfully
learn to follow high-level navigational commands.
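The branched structure can be sketched as a shared encoder feeding one small head per navigational command, with the command selecting the active head at inference time. The command vocabulary, layer sizes, and two-dimensional output below are assumptions for the example, not the exact network of [113].

```python
import torch
import torch.nn as nn

COMMANDS = ("follow_lane", "turn_left", "turn_right", "go_straight")   # assumed command set


class BranchedPolicy(nn.Module):
    """Command-conditioned policy: shared perception layers, one head per command."""

    def __init__(self, n_outputs=2):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 48, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(48, 128), nn.ReLU())
        self.branches = nn.ModuleDict(
            {c: nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, n_outputs))
             for c in COMMANDS})

    def forward(self, image, command):
        features = self.shared(image)             # shared image encoder
        return self.branches[command](features)   # head selected by the high-level command


# Usage: the route planner supplies the command when approaching an intersection.
controls = BranchedPolicy()(torch.randn(1, 3, 88, 200), "turn_left")   # e.g., [steering, throttle]
```

Because only the selected branch receives gradient updates for a given sample, each head can specialize in the control behavior appropriate to its command while still sharing the bulk of the perception network.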
In contrast to the widely popular end-to-end techniques, Waymo recently presented their
autonomous driving system, ChauffeurNet [114]. ChauffeurNet uses what the authors refer to
as mid-to-mid learning, where the input to the network is a pre-processed top-down view of
the surrounding area and salient objects, while the output is a high-level command describing
target waypoint, vehicle heading, and velocity. The final control actions are then provided by a
low-level controller. The use of a mid-level representation allows easy mixing of simulated and real
input data, making transfer from simulation to the real world easier. Furthermore, this means
that the model can learn general driving behavior, without the burden of learning perception
and low-level control tasks. In order to avoid dangerous behavior, the training set was further
augmented by synthesizing examples of the vehicle in incorrect lane positions or veering off the
road, which enabled the network to learn how to recover the vehicle from such mistakes. The trained
model was evaluated in simulation and real-world experiments, demonstrating desirable driving
behavior in both cases. The authors noted that the use of synthesized data, augmented losses,