further employed to guide the training of the student network $p$ of interest (i.e., the aforementioned
data-driven neural network designed for compatibility modeling). Ultimately, we aim to achieve a
good balance between the prediction performance of the student network $p$ on the ground truth
and its capability to mimic the teacher network $q$. Accordingly, we formulate the objective at
iteration $t$ as
\[
\Theta^{(t+1)} = \arg\min_{\Theta} \sum_{(i,j,k)\in\mathcal{D}_S} \Big\{ (1-\pi)\,\mathcal{L}_{bpr}\big(m^{p}_{ij},\, m^{p}_{ik}\big) + \pi\,\mathcal{L}_{crs}\big(q^{(t)}(i,j,k),\, p(i,j,k)\big) \Big\} + \frac{\mu}{2}\,\big\|\Theta\big\|_F^2, \tag{4.5}
\]
where $\mathcal{L}_{crs}$ stands for the cross-entropy loss, and $p(i,j,k)$ and $q(i,j,k)$ refer to the sum-normalized
distributions over the compatibility scores predicted by the student network $p$ and the teacher
network $q$ (i.e., $[m^{p}_{ij}, m^{p}_{ik}]$ and $[m^{q}_{ij}, m^{q}_{ik}]$), respectively. $\Theta$ denotes the set of to-be-learned model
parameters and $\mu$ is the nonnegative trade-off parameter for the regularization term. $\pi$ is the
imitation parameter calibrating the relative importance of the two objectives.
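To make the objective concrete, here is a minimal PyTorch sketch of Eq. (4.5) for a mini-batch of triplets, assuming the student scores $m^{p}_{ij}, m^{p}_{ik}$ and teacher scores $m^{q}_{ij}, m^{q}_{ik}$ are nonnegative (e.g., sigmoid outputs) and already computed, and that $\mathcal{L}_{bpr}$ takes the standard form $-\ln\sigma(m_{ij}-m_{ik})$; the function and argument names are illustrative only, not from the original.

```python
import torch
import torch.nn.functional as F

def sum_normalize(scores, eps=1e-12):
    """Turn nonnegative compatibility scores into a distribution by sum
    normalization, as described for p(i,j,k) and q(i,j,k)."""
    return scores / (scores.sum(dim=-1, keepdim=True) + eps)

def distillation_objective(m_p_ij, m_p_ik, m_q_ij, m_q_ik, pi, mu, params):
    """Batch version of Eq. (4.5): a convex combination of the BPR loss on the
    student scores and the cross-entropy between the teacher and student
    distributions, plus a Frobenius-norm regularizer."""
    # BPR term -log sigmoid(m_ij - m_ik): prefer the positive bottom j over k.
    l_bpr = -F.logsigmoid(m_p_ij - m_p_ik).mean()

    # Sum-normalized distributions over the two candidate scores.
    p_dist = sum_normalize(torch.stack([m_p_ij, m_p_ik], dim=-1))
    q_dist = sum_normalize(torch.stack([m_q_ij, m_q_ik], dim=-1))

    # Cross-entropy of the student distribution w.r.t. the fixed teacher.
    l_crs = -(q_dist.detach() * torch.log(p_dist + 1e-12)).sum(dim=-1).mean()

    # (mu / 2) * ||Theta||_F^2 over all trainable parameters.
    reg = 0.5 * mu * sum((w ** 2).sum() for w in params)

    return (1.0 - pi) * l_bpr + pi * l_crs + reg
```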
4.3.4 TEACHER NETWORK CONSTRUCTION
As the teacher network plays a pivotal role in the knowledge distillation process, we now proceed
to introduce the derivation of the teacher network q. On the one hand, we expect that the student
network $p$ can learn well from the teacher network $q$, a property that can be naturally measured
by the closeness between the compatibility predictions of the two networks. On the other
hand, we attempt to utilize the rule regularizer to encode the general domain knowledge. In
particular, we adapt the teacher network construction method proposed in [45, 46] as follows:
\[
\min_{q}\; \mathrm{KL}\big(q(i,j,k)\,\big\|\,p(i,j,k)\big) - C\sum_{l}\lambda_l\,\mathbb{E}_{q}\big[f_l(i,j,k)\big], \tag{4.6}
\]
where $C$ is the regularization parameter balancing the two terms and $\mathrm{KL}(\cdot\,\|\,\cdot)$ measures the KL-divergence
between $p(i,j,k)$ and $q(i,j,k)$. This formulation has been proven to be a convex problem and admits
the following closed-form solution,
\[
q(i,j,k) \;\propto\; p(i,j,k)\,\exp\Big\{\sum_{l} C\,\lambda_l\, f_l(i,j,k)\Big\}, \tag{4.7}
\]
where $\lambda_l$ stands for the confidence of the $l$-th rule, and a larger $\lambda_l$ indicates a stronger rule
constraint. $f_l(i,j,k)$ is the $l$-th rule constraint function, devised to reward the predictions of the
student network that meet the rules and penalize those that do not. In our work, given the sample
$(i,j,k)$, we expect to reward the compatibility score $m_{ij}$ if $(i,j)$ satisfies a positive rule while $(i,k)$
does not, or if $(i,k)$ triggers a negative rule while $(i,j)$ does not. In particular, we define $f^{ij}_l(i,j,k)$, the
element of $f_l(i,j,k)$ calibrating $m_{ij}$, as follows:
\[
f^{ij}_l(i,j,k) =
\begin{cases}
1, & \text{if } \delta_l(ij)=1,\ \delta_l(ik)=0,\ l\in\mathcal{R}^{+}, \\
   & \text{or } \delta_l(ij)=0,\ \delta_l(ik)=1,\ l\in\mathcal{R}^{-}, \\
0, & \text{otherwise},
\end{cases} \tag{4.8}
\]
where $\delta_l(ab)=1\,(0)$ means that the sample $(a,b)$ satisfies the $l$-th rule (or not), and $\mathcal{R}^{+}$ and
$\mathcal{R}^{-}$ denote the sets of positive and negative rules, respectively. We define the other element
$f^{ik}_l(i,j,k)$ of $f_l(i,j,k)$, corresponding to $m_{ik}$, in the same manner.
Traditionally, $\lambda_l$ in Eq. (4.7) can be either manually assigned or automatically learned
from the data, and both approaches assume that each rule has a universal confidence over all samples.
In fact, however, different rules may have different confidence levels for different samples, which
can be attributed to the fact that human knowledge rules tend to be general and fuzzy. It is thus
intractable to directly pre-define a universal rule confidence. Therefore, to allow different rules to
contribute flexibly to the guidance for a given sample, we adopt the attention
mechanism [1], which has proven to be effective in many machine learning tasks, such as
recommendation [7, 11] and representation learning [95]. The key to the success of the attention
mechanism lies in the observation that humans tend to selectively attend to parts of the input
signal rather than the entirety at once during recognition. In our work, we
adopt a soft attention model to assign the rule confidence adaptively according to the given
sample. In particular, for a given sample $(i,j,k)$ and the set of rules it activates, $\mathcal{R}(i,j,k)$, we
assign $\lambda_l(i,j,k)$ as follows:
\[
\lambda'_l(i,j,k) = \mathbf{w}^{T}\sigma\big(\mathbf{W}_t[\tilde{\mathbf{v}}_i, \tilde{\mathbf{c}}_i] + \mathbf{W}_b[\tilde{\mathbf{v}}_j, \tilde{\mathbf{c}}_j] + \mathbf{W}_b[\tilde{\mathbf{v}}_k, \tilde{\mathbf{c}}_k] + \mathbf{W}_l\,\mathbf{r}_l + \mathbf{b}\big) + c, \quad l\in\mathcal{R}(i,j,k), \tag{4.9}
\]
where $\mathbf{W}_t\in\mathbb{R}^{h\times(D_v+D_t)}$, $\mathbf{W}_b\in\mathbb{R}^{h\times(D_v+D_t)}$, $\mathbf{W}_l\in\mathbb{R}^{h\times L}$, $\mathbf{w}\in\mathbb{R}^{h}$, $\mathbf{b}\in\mathbb{R}^{h}$, and $c$ are the
model parameters, $\sigma(\cdot)$ is the nonlinear activation function, and $h$ represents the hidden-layer size of
the attention network. $\mathbf{r}_l\in\mathbb{R}^{L}$ stands for the one-hot encoding of the $l$-th rule, where $L$ is the
total number of rules. The attention scores are then normalized as follows:
\[
\lambda_l(i,j,k) = \frac{\exp\big(\lambda'_l(i,j,k)\big)}{\sum_{u\in\mathcal{R}(i,j,k)}\exp\big(\lambda'_u(i,j,k)\big)}. \tag{4.10}
\]
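The attentive confidence assignment of Eqs. (4.9) and (4.10) amounts to a small scoring network followed by a softmax over the activated rules. The PyTorch module below is a sketch under stated assumptions: the latent visual and contextual representations are already concatenated per item, the nonlinearity is taken to be a sigmoid, and the class and argument names are illustrative rather than from the original.

```python
import torch
import torch.nn as nn

class RuleAttention(nn.Module):
    """Soft attention over the activated rules (Eqs. (4.9)-(4.10)).

    d_v, d_t : dimensions of the latent visual / contextual representations.
    n_rules  : total number of rules L (size of the one-hot encoding r_l).
    hidden   : hidden-layer size h of the attention network.
    """
    def __init__(self, d_v, d_t, n_rules, hidden):
        super().__init__()
        self.W_t = nn.Linear(d_v + d_t, hidden, bias=False)  # top item i
        self.W_b = nn.Linear(d_v + d_t, hidden, bias=False)  # shared for j and k
        self.W_l = nn.Linear(n_rules, hidden, bias=False)    # rule encoding
        self.b = nn.Parameter(torch.zeros(hidden))
        self.w = nn.Linear(hidden, 1)  # w^T(.) + c

    def forward(self, top_i, bot_j, bot_k, rule_onehots):
        """top_i, bot_j, bot_k: concatenated [v~, c~] vectors of size d_v + d_t.
        rule_onehots: (|R(i,j,k)|, L) one-hot rows of the activated rules.
        Returns the normalized confidences lambda_l(i,j,k) for those rules."""
        context = self.W_t(top_i) + self.W_b(bot_j) + self.W_b(bot_k) + self.b
        scores = self.w(torch.sigmoid(context + self.W_l(rule_onehots)))
        return torch.softmax(scores.squeeze(-1), dim=-1)  # Eq. (4.10)
```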
Figure 4.3 illustrates the workflow of our model, while the optimization procedure of our
framework is summarized in Algorithm 4.1. Notably, the teacher network is first constructed
from the student network at the very beginning of the training, which may induce the poor
guidance at first. erefore, we expect the whole framework favors to the prediction of the
ground truth more at the initial stage but gradually bias toward the imitate capability of the
student network to the teacher network. erefore, we adopt the parameter assigning strategy
in [45] to assign dynamically, which keeps increasing as the training process goes.
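The exact schedule is not restated here; one common instantiation of the strategy in [45], consistent with the description above, raises the imitation parameter from near zero toward a ceiling as training proceeds. The helper below is one possible sketch, with the ceiling $\pi_0$ and decay rate $\alpha$ as assumed hyperparameters.

```python
def imitation_weight(t, pi_0=0.95, alpha=0.9):
    """Monotonically increasing imitation parameter for epoch t: close to 0
    at the start (favor the ground-truth BPR term) and approaching pi_0
    later (favor imitating the teacher). One possible schedule, not
    necessarily the exact one used in the book."""
    return min(pi_0, 1.0 - alpha ** t)
```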