further employed to guide the training of the student network $p$ of interest (i.e., the aforementioned
data-driven neural network designed for compatibility modeling). Ultimately, we aim to achieve a
good balance between the prediction performance of the student network $p$ on the ground truth
and its capability to mimic the teacher network $q$. Accordingly, we formulate the objective at
iteration $t$ as
\[
\Theta^{(t+1)} = \arg\min_{\Theta} \sum_{(i,j,k)\in\mathcal{D}_S} \Big\{ (1-\pi)\,\mathcal{L}_{bpr}\big(m^{p}_{ij},\, m^{p}_{ik}\big) + \pi\,\mathcal{L}_{crs}\big(q^{(t)}(i,j,k),\, p(i,j,k)\big) \Big\} + \frac{\mu}{2}\,\big\|\Theta\big\|_F^2, \tag{4.5}
\]
where $\mathcal{L}_{crs}$ stands for the cross-entropy loss, and $p(i,j,k)$ and $q(i,j,k)$ refer to the sum-normalized
distributions over the compatibility scores predicted by the student network $p$ and the teacher
network $q$ (i.e., $[m^{p}_{ij}, m^{p}_{ik}]$ and $[m^{q}_{ij}, m^{q}_{ik}]$), respectively. $\Theta$ denotes the set of to-be-learned model
parameters and $\mu$ is the nonnegative trade-off parameter for the regularization term. $\pi$ is the
imitation parameter calibrating the relative importance of the two objectives.
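To make the objective concrete, here is a minimal PyTorch sketch of Eq. (4.5) for a mini-batch of triplets, assuming the student scores $m^{p}_{ij}, m^{p}_{ik}$ and teacher scores $m^{q}_{ij}, m^{q}_{ik}$ are nonnegative (e.g., sigmoid outputs) and already computed, and that $\mathcal{L}_{bpr}$ takes the standard form $-\ln\sigma(m_{ij}-m_{ik})$; the function and argument names are illustrative only, not from the original.

```python
import torch
import torch.nn.functional as F

def sum_normalize(scores, eps=1e-12):
    """Turn nonnegative compatibility scores into a distribution by sum
    normalization, as described for p(i,j,k) and q(i,j,k)."""
    return scores / (scores.sum(dim=-1, keepdim=True) + eps)

def distillation_objective(m_p_ij, m_p_ik, m_q_ij, m_q_ik, pi, mu, params):
    """Batch version of Eq. (4.5): a convex combination of the BPR loss on the
    student scores and the cross-entropy between the teacher and student
    distributions, plus a Frobenius-norm regularizer."""
    # BPR term -log sigmoid(m_ij - m_ik): prefer the positive bottom j over k.
    l_bpr = -F.logsigmoid(m_p_ij - m_p_ik).mean()

    # Sum-normalized distributions over the two candidate scores.
    p_dist = sum_normalize(torch.stack([m_p_ij, m_p_ik], dim=-1))
    q_dist = sum_normalize(torch.stack([m_q_ij, m_q_ik], dim=-1))

    # Cross-entropy of the student distribution w.r.t. the fixed teacher.
    l_crs = -(q_dist.detach() * torch.log(p_dist + 1e-12)).sum(dim=-1).mean()

    # (mu / 2) * ||Theta||_F^2 over all trainable parameters.
    reg = 0.5 * mu * sum((w ** 2).sum() for w in params)

    return (1.0 - pi) * l_bpr + pi * l_crs + reg
```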
4.3.4 TEACHER NETWORK CONSTRUCTION
As the teacher network plays a pivotal role in the knowledge distillation process, we now proceed
to introduce the derivation of the teacher network q. On the one hand, we expect that the student
network $p$ can learn well from the teacher network $q$, a property that can be naturally measured
by the closeness between the compatibility predictions of the two networks. On the other
hand, we attempt to utilize the rule regularizer to encode the general domain knowledge. In
particular, we adapt the teacher network construction method proposed in [45, 46] as follows:
\[
\min_{q}\; \mathrm{KL}\big(q(i,j,k)\,\big\|\,p(i,j,k)\big) - C\sum_{l}\lambda_l\,\mathbb{E}_{q}\big[f_l(i,j,k)\big], \tag{4.6}
\]
where $C$ is the regularization parameter balancing the two terms and $\mathrm{KL}(\cdot\,\|\,\cdot)$ measures the KL-divergence
between $p(i,j,k)$ and $q(i,j,k)$. This formulation has been proven to be a convex problem and admits
the following closed-form solution,
\[
q(i,j,k) \;\propto\; p(i,j,k)\,\exp\Big\{\sum_{l} C\,\lambda_l\, f_l(i,j,k)\Big\}, \tag{4.7}
\]
where $\lambda_l$ stands for the confidence of the $l$-th rule, and a larger $\lambda_l$ indicates a stronger rule
constraint. $f_l(i,j,k)$ is the $l$-th rule constraint function, devised to reward the predictions of the
student network that meet the rules and penalize those that do not. In our work, given the sample
$(i,j,k)$, we expect to reward the compatibility score $m_{ij}$ if $(i,j)$ satisfies a positive rule while $(i,k)$
does not, or if $(i,k)$ triggers a negative rule while $(i,j)$ does not. In particular, we define $f^{ij}_l(i,j,k)$, the
element of $f_l(i,j,k)$ calibrating $m_{ij}$, as follows:
\[
f^{ij}_l(i,j,k) =
\begin{cases}
1, & \text{if } \delta_l(ij)=1,\ \delta_l(ik)=0,\ l\in\mathcal{R}^{+}, \\
   & \text{or } \delta_l(ij)=0,\ \delta_l(ik)=1,\ l\in\mathcal{R}^{-}, \\
0, & \text{otherwise},
\end{cases} \tag{4.8}
\]
where $\delta_l(ab)=1\,(0)$ means that the sample $(a,b)$ satisfies the $l$-th rule (or not), and $\mathcal{R}^{+}$ and
$\mathcal{R}^{-}$ denote the sets of positive and negative rules, respectively. We define the other element
$f^{ik}_l(i,j,k)$ of $f_l(i,j,k)$, corresponding to $m_{ik}$, in the same manner.
Traditionally, $\lambda_l$ in Eq. (4.7) can be either manually assigned or automatically learned
from the data, and both approaches assume that each rule has a universal confidence over all samples.
In fact, however, different rules may have different confidence levels for different samples, which
can be attributed to the fact that human knowledge rules tend to be general and fuzzy. It is thus
intractable to directly pre-define a universal rule confidence. Therefore, to allow different rules to
contribute flexibly to the guidance for a given sample, we adopt the attention
mechanism [1], which has proven to be effective in many machine learning tasks, such as
recommendation [7, 11] and representation learning [95]. The key to the success of the attention
mechanism lies in the observation that humans tend to selectively attend to parts of the input
signal rather than the entirety at once during recognition. In our work, we
adopt a soft attention model to assign the rule confidence adaptively according to the given
sample. In particular, for a given sample $(i,j,k)$ and the set of rules it activates, $\mathcal{R}(i,j,k)$, we
assign $\lambda_l(i,j,k)$ as follows:
\[
\lambda'_l(i,j,k) = \mathbf{w}^{T}\sigma\big(\mathbf{W}_t[\tilde{\mathbf{v}}_i, \tilde{\mathbf{c}}_i] + \mathbf{W}_b[\tilde{\mathbf{v}}_j, \tilde{\mathbf{c}}_j] + \mathbf{W}_b[\tilde{\mathbf{v}}_k, \tilde{\mathbf{c}}_k] + \mathbf{W}_l\,\mathbf{r}_l + \mathbf{b}\big) + c, \quad l\in\mathcal{R}(i,j,k), \tag{4.9}
\]
where $\mathbf{W}_t\in\mathbb{R}^{h\times(D_v+D_t)}$, $\mathbf{W}_b\in\mathbb{R}^{h\times(D_v+D_t)}$, $\mathbf{W}_l\in\mathbb{R}^{h\times L}$, $\mathbf{w}\in\mathbb{R}^{h}$, $\mathbf{b}\in\mathbb{R}^{h}$, and $c$ are the
model parameters, $\sigma(\cdot)$ is the nonlinear activation function, and $h$ represents the hidden-layer size of
the attention network. $\mathbf{r}_l\in\mathbb{R}^{L}$ stands for the one-hot encoding of the $l$-th rule, where $L$ is the
total number of rules. The attention scores are then normalized as follows:
\[
\lambda_l(i,j,k) = \frac{\exp\big(\lambda'_l(i,j,k)\big)}{\sum_{u\in\mathcal{R}(i,j,k)}\exp\big(\lambda'_u(i,j,k)\big)}. \tag{4.10}
\]
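The attentive confidence assignment of Eqs. (4.9) and (4.10) amounts to a small scoring network followed by a softmax over the activated rules. The PyTorch module below is a sketch under stated assumptions: the latent visual and contextual representations are already concatenated per item, the nonlinearity is taken to be a sigmoid, and the class and argument names are illustrative rather than from the original.

```python
import torch
import torch.nn as nn

class RuleAttention(nn.Module):
    """Soft attention over the activated rules (Eqs. (4.9)-(4.10)).

    d_v, d_t : dimensions of the latent visual / contextual representations.
    n_rules  : total number of rules L (size of the one-hot encoding r_l).
    hidden   : hidden-layer size h of the attention network.
    """
    def __init__(self, d_v, d_t, n_rules, hidden):
        super().__init__()
        self.W_t = nn.Linear(d_v + d_t, hidden, bias=False)  # top item i
        self.W_b = nn.Linear(d_v + d_t, hidden, bias=False)  # shared for j and k
        self.W_l = nn.Linear(n_rules, hidden, bias=False)    # rule encoding
        self.b = nn.Parameter(torch.zeros(hidden))
        self.w = nn.Linear(hidden, 1)  # w^T(.) + c

    def forward(self, top_i, bot_j, bot_k, rule_onehots):
        """top_i, bot_j, bot_k: concatenated [v~, c~] vectors of size d_v + d_t.
        rule_onehots: (|R(i,j,k)|, L) one-hot rows of the activated rules.
        Returns the normalized confidences lambda_l(i,j,k) for those rules."""
        context = self.W_t(top_i) + self.W_b(bot_j) + self.W_b(bot_k) + self.b
        scores = self.w(torch.sigmoid(context + self.W_l(rule_onehots)))
        return torch.softmax(scores.squeeze(-1), dim=-1)  # Eq. (4.10)
```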
Figure 4.3 illustrates the workflow of our model, while the optimization procedure of our
framework is summarized in Algorithm 4.1. Notably, the teacher network is first constructed
from the student network at the very beginning of the training, which may induce the poor
guidance at first. erefore, we expect the whole framework favors to the prediction of the
ground truth more at the initial stage but gradually bias toward the imitate capability of the
student network to the teacher network. erefore, we adopt the parameter assigning strategy
in [45] to assign dynamically, which keeps increasing as the training process goes.
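The exact schedule is not restated here; one common instantiation of the strategy in [45], consistent with the description above, raises the imitation parameter from near zero toward a ceiling as training proceeds. The helper below is one possible sketch, with the ceiling $\pi_0$ and decay rate $\alpha$ as assumed hyperparameters.

```python
def imitation_weight(t, pi_0=0.95, alpha=0.9):
    """Monotonically increasing imitation parameter for epoch t: close to 0
    at the start (favor the ground-truth BPR term) and approaching pi_0
    later (favor imitating the teacher). One possible schedule, not
    necessarily the exact one used in the book."""
    return min(pi_0, 1.0 - alpha ** t)
```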