complexity can be reduced when compared to those seen in lateral control. This has meant that
instead of vision based set-ups, ranging sensors have been more widely used. Moreover, these
one-dimensional measurements also enable easier design of reward functions for reinforcement
learning since they often directly serve as metrics for performance. While reinforcement learn-
ing has provided good results, it has multiple drawbacks. The major drawback of reinforcement
learning is the large number of experiences required for training convergence, which can be ex-
pensive and/or time consuming to collect [56, 100]. In contrast, supervised learning methods
often lack the level of generalization offered by reinforcement learning, but tend to provide faster
convergence, given a suitably high quality data set can be obtained. For these reasons, combin-
ing the two learning methods can provide significant advantages. Using supervised learning as a
pre-training stage for reinforcement learning, Zhao et al. [22, 101, 102] presented their Super-
vised Actor-Critic (SAC) algorithm for longitudinal control. The system was trained for vehicle
following on dry road conditions, and then evaluated on both dry and wet road conditions. Dur-
ing evaluation, the SAC controller was compared to a PID controller and a supervised learning
controller (without reinforcement learning), and the SAC controller was found to exhibit the best
performance. These results show that combining supervised and reinforcement learning can leverage
the advantages of both strategies.
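To make the two-stage idea concrete, the sketch below pre-trains an actor on expert state-action pairs and then switches to actor-critic updates. This is only a minimal illustration, not the SAC formulation of Zhao et al.; the use of PyTorch, the network sizes, the placeholder data, and the simple deterministic actor-critic update are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical actor: (relative distance, relative velocity) -> acceleration command in [-1, 1].
actor = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1), nn.Tanh())
# Critic: Q(state, action) estimate, only needed in the reinforcement learning stage.
critic = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 1))
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Stage 1: supervised pre-training on (state, expert action) pairs (placeholder demonstrations).
demo_states = torch.randn(256, 2)
demo_actions = torch.tanh(-demo_states.sum(dim=1, keepdim=True))
for _ in range(200):
    loss = nn.functional.mse_loss(actor(demo_states), demo_actions)
    opt_a.zero_grad(); loss.backward(); opt_a.step()

# Stage 2: actor-critic fine-tuning from interaction data (random placeholders here).
states, next_states = torch.randn(256, 2), torch.randn(256, 2)
actions = actor(states).detach()
rewards = -states[:, :1].abs()                      # e.g., penalize spacing error
for _ in range(200):
    with torch.no_grad():                           # one-step bootstrapped target
        target = rewards + 0.99 * critic(torch.cat([next_states, actor(next_states)], 1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([states, actions], 1)), target)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    actor_loss = -critic(torch.cat([states, actor(states)], 1)).mean()  # policy improvement step
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```

The benefit of the pre-training stage is that the actor already produces sensible trajectories when reinforcement learning begins, so the critic evaluates useful behavior from the start rather than random exploration.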
A summary of research works discussed in this section can be seen in Table 3.2. The
general trend in longitudinal control has been to utilize the generalization capabilities of rein-
forcement learning to perform well in different environments. Unlike lateral control, the inputs
seen here are mostly lower-dimensional measurements of host vehicle states and measurements
relative to nearby vehicles (or other objects), which can be obtained from ranging sensors. This
means that network complexity can be reduced, and there is no need to use CNNs. However,
as discussed in this chapter, in reinforcement learning, the reward function needs to successfully
represent the desirability of being in any given state. For a complex task such as driving, this will
require considering multiple conflicting objectives, such as desired velocity, passenger comfort,
driver habits, and safety. A poorly designed reward function can lead to training divergence,
poor control performance, or unexpected behavior.
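As a toy illustration of such a multi-objective reward, the function below combines velocity tracking, comfort, and safety terms into a single weighted sum. The specific terms, weights, and thresholds are assumptions made for the example rather than a recommended design.

```python
def reward(v, v_des, accel, prev_accel, time_headway,
           w_speed=1.0, w_comfort=0.2, w_safety=2.0, min_headway=1.5):
    """Illustrative multi-term reward for longitudinal control (all values assumed)."""
    speed_term = -abs(v - v_des) / max(v_des, 1e-3)              # track the desired velocity
    comfort_term = -abs(accel - prev_accel)                      # penalize jerky control inputs
    safety_term = -1.0 if time_headway < min_headway else 0.0    # penalize unsafe headway
    return w_speed * speed_term + w_comfort * comfort_term + w_safety * safety_term
```

Even in this simple form, the relative weights encode a trade-off: over-weighting the safety term can make the learned policy overly conservative, while under-weighting the comfort term tends to produce oscillatory acceleration commands.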
Table 3.2: A comparison of longitudinal control techniques

| Ref. | Learning Strategy | Network | Inputs | Outputs | Pros | Cons |
|------|-------------------|---------|--------|---------|------|------|
| [94] | Fuzzy reinforcement learning | Feedforward network with one hidden layer | Relative distance, relative speed, previous control input | Throttle angle, brake torque | Model-free, continuous action values | Single-term reward function |
| [23] | Reinforcement learning | Feedforward network with one hidden layer | Time headway, headway derivative | Accelerate, brake, or no-op | Maintains a safe distance | Oscillatory acceleration behavior, no term for comfort in reward function |
| [95] | Reinforcement learning | Actor-Critic Network | Velocity, velocity tracking error, acceleration error, expected acceleration | Gas and brake commands | Learns from minimal training data | Noisy behavior of the steering signal |
| [98] | Reinforcement learning | Feedforward network with five hidden layers | Vehicle velocity, relative position of the pedestrian for past five time steps | Discretized deceleration actions | Reliably avoids collisions | Only considers collision avoidance with pedestrians |
| [96] | Reinforcement learning | Feedforward network with one hidden layer | Relative distance, relative velocity, relative acceleration (normalized) | Desired acceleration | Provides smooth driving styles, learns personal driving styles | No methods for preventing learning of bad habits from human drivers |
| [97] | Reinforcement learning | Actor-Critic Network | Relative distance, host velocity, relative velocity, host acceleration | Desired acceleration | Performs well in a variety of scenarios, safety and comfort considered, learns personal driving styles | Adapting unsafe driver habits could degrade safety |
| [22] | Supervised reinforcement learning | Actor-Critic Network | Relative distance, relative velocity | Desired acceleration | Pre-training by supervised learning accelerates the learning process and helps guarantee convergence, performs well in critical scenarios | Requires supervision to converge, driving comfort not considered |

3.1.3 FULL VEHICLE CONTROL
There has also been some recent interest in investigating vehicle controllers which control all
actions of the vehicle (steering, acceleration, braking) simultaneously. For instance, Zheng et
al. [103] proposed a decision-making system for an autonomous vehicle in a highway scenario
based on reinforcement learning. The system considered three factors in the reward function:
(1) safety, evaluated based on distances to adjacent vehicles; (2) smoothness, based on accumu-
lation of longitudinal speed changes over time; and (3) velocity tracking, based on the difference
between current velocity and desired velocity. The authors used a Least Squares Policy Iteration
approach to find the optimal decision-making policy. The system was then evaluated in a simple
highway simulation, where the vehicle had to undergo multiple overtaking maneuvers. The system
could successfully complete all overtaking maneuvers. In contrast, Shalev-Shwartz et al. [104]
considered a more complex scenario in which an autonomous vehicle must operate around un-
predictable vehicles. The aim of the project was to design an agent which can pass a roundabout
safely and efficiently. The performance of the agent was evaluated based on (1) keeping a safe
distance from other vehicles at all times, (2) the time to finish the route, and (3) smoothness of
the acceleration policy. The authors utilized an RNN to accomplish this task, which was chosen
due to its ability to learn the function between a chosen action, the current state, and the state
at the next time step without explicitly relying on any Markovian assumptions. Moreover, by
explicitly expressing the dynamics of the system in a transparent way, prior knowledge could be
incorporated into the system more easily. The described method was shown to learn to slow down
when approaching roundabouts, to give way to aggressive drivers, and to safely continue when
merging with less aggressive drivers. In the initial testing, the next state was decomposed into
a predictable part, including velocities and locations of adjacent vehicles, and a non-predictable
part, consisting of the accelerations of the adjacent vehicles. The dynamics of the predictable
part of the next state were provided to the agent. However, in the next phase of testing, all state
parameters at the next time step were considered unpredictable and instead had to be learned
during training. The learning process was more challenging in these conditions, but still suc-
ceeded. Additionally, the authors claimed that the described method could be adapted to other
driving policies such as lane change decisions, highway exit and merge, negotiation of the right
of way in junctions, yielding for pedestrians, and complicated planning in urban scenarios.
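The state decomposition used in the first phase of testing can be sketched as follows: a hand-coded kinematic update provides the predictable part of the next state, while a recurrent network learns only the residual, non-predictable part. This is a schematic illustration under assumed dimensions and a GRU backbone, not the architecture of [104].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecomposedPredictor(nn.Module):
    """Next-state prediction split into a known kinematic part and a learned residual."""

    def __init__(self, state_dim=8, action_dim=2, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(state_dim + action_dim, hidden, batch_first=True)
        self.residual_head = nn.Linear(hidden, state_dim)

    def predictable_part(self, states, actions, dt=0.1):
        # Placeholder for known dynamics (e.g., positions advancing with current velocities).
        return states + dt * F.pad(actions, (0, states.shape[-1] - actions.shape[-1]))

    def forward(self, states, actions):
        # states: (batch, time, state_dim), actions: (batch, time, action_dim)
        h, _ = self.rnn(torch.cat([states, actions], dim=-1))
        residual = self.residual_head(h)                   # learned, non-predictable part
        return self.predictable_part(states, actions) + residual


# Example call with random data: predict the next state at every time step.
pred = DecomposedPredictor()(torch.randn(4, 10, 8), torch.randn(4, 10, 2))
```

In the second phase of testing described above, the hand-coded term would simply be dropped and the network would have to learn the full transition on its own.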
As mentioned before, supervised learning can significantly reduce training time for a con-
trol system. For this reason, Xia et al. [105] presented a vehicle control algorithm based on Q-
learning combined with a pre-training phase based on expert demonstration. A filtered experi-
ence replay, in which past experiences were stored while poorly performing ones were discarded,
was used to improve convergence during training. The use of pre-training and filtered experi-
ence replay was shown to not only improve final performance of the control policy, but also speed
up convergence by up to 71%. Comparing two learning algorithms for lane keeping, Sallab et
al. [106] investigated the effect of continuous and discretized action spaces. A DQN was chosen
for discrete action spaces, while an actor-critic algorithm was used for continuous action values.
The two networks were trained and evaluated driving around a simulated race track, where their
goal was to complete the track while staying near the center of the lane. The ability to utilize
continuous action values resulted in significantly stronger performance with the actor-critic al-
gorithm, enabling a much smoother control policy. In contrast, the DQN algorithm struggled
to stay near the center of the lane, especially on curved roads. These results demonstrated the
advantages of using continuous action values.
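The filtered experience replay used by Xia et al. [105] can be illustrated with a buffer that only keeps episodes whose return clears a quality threshold, so poor rollouts do not pollute later training batches. The episode-level filter and the threshold rule below are assumptions for the sketch, not the exact mechanism of the paper.

```python
import random
from collections import deque


class FilteredReplay:
    """Replay buffer that stores transitions only from sufficiently good episodes."""

    def __init__(self, capacity=50_000, min_return=0.0):
        self.buffer = deque(maxlen=capacity)
        self.min_return = min_return

    def add_episode(self, transitions):
        # transitions: list of (state, action, reward, next_state, done) tuples
        episode_return = sum(t[2] for t in transitions)
        if episode_return >= self.min_return:      # discard poorly performing episodes
            self.buffer.extend(transitions)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

Expert demonstrations from the pre-training phase can be used to seed such a buffer, which is one way the two mechanisms complement each other.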
Using vision to control the autonomous vehicle, Zhang et al. [107] presented their su-
pervised learning algorithm, SafeDAgger, based on the Dataset Aggregation (DAgger) [108]
imitation learning technique. In DAgger, the first phase of training consists of traditional super-
vised learning where the model learns from a training set collected from an expert completing
a task. Next, the trained policy performs the task and the resulting experiences are collected
into a new data set. The expert is then asked to label the newly collected data with the correct
actions. The labeled data are then added to the original data set to create an augmented data
set, which is used to further train the model. The process is then repeated in the DAgger train-
ing framework. This approach is used to overcome one of the limitations of supervised learning
for control policies. When using a supervised learning-based algorithm, as the trained model
makes slight errors, it veers away from its known operational environment and such errors can
compound over time. For instance, when steering a vehicle, the model may have learned from a
data set where the expert is always in (or near) the center of the lane. However, when the trained
network is steering the vehicle, small errors in the predicted steering angle can slowly bring the
vehicle away from the center of the lane. If the network has never seen training examples where
the vehicle is veering away from the center of the lane, it will not know how to correct its own
mistakes. Instead, in the DAgger framework, the network can make these mistakes, and the
examples are then labeled with correct actions by the expert, and the next training phase then
allows the network to learn to correct its mistakes. This results in a control policy that is more
robust to small errors; however, it introduces further complexity and cost in accessing the expert
for further ground-truth queries. The framework was later extended in SafeDAgger by developing a
method that estimates, for the current state, whether the trained policy is likely to deviate from
the reference (expert) policy. If it is likely to deviate by more than a specified threshold, the refer-
ence policy is used to control the vehicle instead. The authors used the SafeDAgger method to
train a CNN to control the lateral and longitudinal actions of an autonomous vehicle while driv-
ing on a test track. Three algorithms were compared: traditional supervised learning, DAgger,
and SafeDAgger, with SafeDAgger found to perform the best. Another work exploring a
DAgger-type framework for autonomous driving was presented by Pan et al. [109]. The method
used a model predictive control algorithm to provide the reference policy using high cost sen-
sors, while the learned policy was modeled by a CNN with low cost camera sensors only. The
advantage of this approach was that the reference policy could use highly accurate sensing data,
while the trained policy could be deployed using low cost sensors only. The approach was evalu-
ated experimentally on a sub-scale vehicle driving at high speeds around a dirt track, where the
trained policy demonstrated it could operate the vehicle using the low cost sensors only.
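The DAgger loop described above, together with a SafeDAgger-style safety gate, can be summarized in a few lines. Here expert_policy, rollout, train, and deviation_estimator are assumed callables and the threshold is illustrative; this is a sketch of the training scheme, not the authors' implementation.

```python
def dagger(expert_policy, learner, train, rollout, n_iterations=5):
    """DAgger: iteratively aggregate expert labels for states the learner visits.

    rollout(policy) -> list of visited states
    expert_policy(state) -> expert action label
    train(learner, dataset) -> learner fitted on (state, action) pairs
    """
    dataset = [(s, expert_policy(s)) for s in rollout(expert_policy)]   # initial expert data
    learner = train(learner, dataset)
    for _ in range(n_iterations):
        visited = rollout(learner)                                      # let the learner drive
        dataset += [(s, expert_policy(s)) for s in visited]             # expert relabels those states
        learner = train(learner, dataset)                               # retrain on the aggregated set
    return learner


def safedagger_act(state, learner, expert_policy, deviation_estimator, threshold=0.1):
    """SafeDAgger-style gate: fall back to the expert when the learner is predicted
    to deviate from the reference policy by more than the threshold."""
    if deviation_estimator(state) > threshold:
        return expert_policy(state)
    return learner(state)
```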
In a move away from direct perception methods, Wang et al. [110] presented an object-
centric deep learning control policy for autonomous vehicles. e method used a CNN which
identified salient features in the image, and these features were then used to output a control
action, which was discretized to {left, straight, right, fast, slow, stop} and then executed by a
PID controller. Simulation experiments in Grand Theft Auto V demonstrated that the proposed
approach outperforms control policies without any attention or those based on heuristic feature
selection. In another work, Porav and Newman [111] presented a collision avoidance system
using deep reinforcement learning. The proposed method used a variational autoencoder coupled
with an RNN to predict continuous actions for steering and braking using semantically segmented
images as input. Compared to the collision avoidance system with braking only by Chae et
al. [98], a significant improvement in reducing collision rates for TTC values below 1.5 s was
demonstrated, with up to 60% reduction in collision rates.
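A rough sketch of such an encoder-plus-recurrent policy is given below: segmented frames are compressed to a latent code in VAE fashion, a GRU accumulates temporal context, and separate heads emit continuous steering and braking. The layer sizes, the number of semantic classes, and the omission of the decoder and KL terms are simplifying assumptions, not the architecture of [111].

```python
import torch
import torch.nn as nn


class SegmentationPolicy(nn.Module):
    """Encoder + recurrent policy over semantically segmented frames (illustrative only)."""

    def __init__(self, n_classes=12, latent=32, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(n_classes, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mu = nn.Linear(32, latent)          # VAE-style latent mean
        self.logvar = nn.Linear(32, latent)      # (reconstruction/KL losses omitted)
        self.rnn = nn.GRU(latent, hidden, batch_first=True)
        self.steer = nn.Linear(hidden, 1)
        self.brake = nn.Linear(hidden, 1)

    def forward(self, seg_frames):
        # seg_frames: (batch, time, classes, H, W) one-hot semantic maps
        b, t = seg_frames.shape[:2]
        feats = self.encoder(seg_frames.flatten(0, 1))
        mu, logvar = self.mu(feats), self.logvar(feats)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        h, _ = self.rnn(z.view(b, t, -1))
        return torch.tanh(self.steer(h)), torch.sigmoid(self.brake(h))


# Example call: 5 segmented frames produce per-step steering in [-1, 1] and braking in [0, 1].
steering, braking = SegmentationPolicy()(torch.randn(1, 5, 12, 96, 160))
```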
These works demonstrate that a DNN can be trained for simple driving tasks such as
staying in its lane, following a vehicle, or collision avoidance. However, driving not only consists
of multiple such tasks, but also requires the context of a higher-level goal. Humans drive
vehicles with the aim of starting at point A and arriving at point B. Therefore, having the ability to
only follow a single lane is not enough to fully automate this process. For example, it has been
reported that end-to-end road following algorithms will oscillate between two different driving
directions when coming to a fork in the road [83]. Therefore, the networks should also be able
to learn to utilize a higher level goal to take correct turns and arrive at the target destination.
Aiming to provide such context awareness for autonomous vehicles, Hecker et al. [112] used
a route plan as an additional input to a CNN-based control policy. e network was trained
with supervised learning, where the human driver was following a route plan. These driving
actions, along with the route plan, were then included in the training set. Although no live
testing with the network in control of the car was completed, qualitative testing with example
images from the collected data set suggested the model was learning to follow a given route.
In a similar approach, Codevilla et al. [113] used navigation commands as an additional input
to the network. This allows the human or a route planner to tell the autonomous vehicle where
it should turn at intersections, and the trained control policy simply follows the given route. A
network architecture which shared the initial layers, but had different sub-modules (feedforward
layers) for each navigational command, was found to work best for this approach. When given a
high-level navigational command (e.g., turn left), the algorithm uses the final feedforward layers
specifically trained for that command. The trained model was evaluated both in simulation and
in the real world using a sub-scale vehicle. The proposed approach was shown to successfully
learn to follow high-level navigational commands.
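The branched structure can be sketched as a shared encoder feeding one small head per navigational command, with the command selecting the active head at inference time. The command vocabulary, layer sizes, and two-dimensional output below are assumptions for the example, not the exact network of [113].

```python
import torch
import torch.nn as nn

COMMANDS = ("follow_lane", "turn_left", "turn_right", "go_straight")   # assumed command set


class BranchedPolicy(nn.Module):
    """Command-conditioned policy: shared perception layers, one head per command."""

    def __init__(self, n_outputs=2):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 48, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(48, 128), nn.ReLU())
        self.branches = nn.ModuleDict(
            {c: nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, n_outputs))
             for c in COMMANDS})

    def forward(self, image, command):
        features = self.shared(image)             # shared image encoder
        return self.branches[command](features)   # head selected by the high-level command


# Usage: the route planner supplies the command when approaching an intersection.
controls = BranchedPolicy()(torch.randn(1, 3, 88, 200), "turn_left")   # e.g., [steering, throttle]
```

Because only the selected branch receives gradient updates for a given sample, each head can specialize in the control behavior appropriate to its command while still sharing the bulk of the perception network.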
In contrast to the widely popular end-to-end techniques, Waymo recently presented their
autonomous driving system, ChauffeurNet [114]. ChauffeurNet uses what the authors refer to
as mid-to-mid learning, where the input to the network is a pre-processed top-down view of
the surrounding area and salient objects, while the output is a high-level command describing
target waypoint, vehicle heading, and velocity. The final control actions are then provided by a
low-level controller. The use of a mid-level representation allows easy mixing of simulated and real
input data, making transfer from simulation to the real world easier. Furthermore, this means
that the model can learn general driving behavior, without the burden of learning perception
and low-level control tasks. In order to avoid dangerous behavior, the training set was further
augmented by synthesizing examples of the vehicle in incorrect lane positions or veering off the
road, which enabled the network to learn how to recover the vehicle from such mistakes. The trained
model was evaluated in simulation and real-world experiments, demonstrating desirable driving
behavior in both cases. The authors noted that the use of synthesized data, augmented losses,