5.3. METHODOLOGY
tributes. Each attribute $a_q$ is associated with a set of elements representing its possible values, $E_q = \{e_q^1, e_q^2, \ldots, e_q^{M_q}\}$, where $e_q^i$ refers to the $i$-th element and $M_q$ is the total number of elements regarding $a_q$. For simplicity, we compile all $E_q$'s in order and hence derive a unified set of attribute elements $E = \bigcup_{q=1}^{Q} E_q = \{e_1, e_2, \ldots, e_M\}$, where $M = \sum_{q=1}^{Q} M_q$. In addition,
we have a set of positive top-bottom pairs $S = \{(t_{i_1}, b_{j_1}), (t_{i_2}, b_{j_2}), \ldots, (t_{i_N}, b_{j_N})\}$ curated by fashion experts, where $N$ is the total number of positive pairs. Accordingly, for each top $t_i$, we can derive a set of positive bottoms $B_i^+ = \{b_j \in B \mid (t_i, b_j) \in S\}$. Let $s_{ij}$ denote the compatibility between the top $t_i$ and the bottom $b_j$, based on which we can distinguish whether the given fashion items are compatible or not.
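The notation above can be illustrated with a small sketch: given the expert-curated pair set $S$, we derive each top's positive-bottom set $B_i^+$. All identifiers and pair values below are hypothetical toy data, not from the actual dataset.

```python
from collections import defaultdict

# Hypothetical toy pair set S: each tuple is a positive (t_i, b_j) pair,
# with tops and bottoms identified by integer indices.
S = [(0, 2), (0, 5), (1, 3)]

# Derive, for each top t_i, its set of positive bottoms B_i^+.
positive_bottoms = defaultdict(set)
for t_i, b_j in S:
    positive_bottoms[t_i].add(b_j)

print(positive_bottoms[0])  # {2, 5}
```

Any bottom not in $B_i^+$ can then be treated as a (potentially) negative match for $t_i$ when learning the compatibility score $s_{ij}$.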
5.3.2 SEMANTIC ATTRIBUTE REPRESENTATION
In practice, an online fashion item is usually characterized by a visual image, certain user-generated contextual descriptions, and structured category labels. The visual image and the structured category labels can faithfully capture the essential features of fashion items, such as the color, shape, and category, while the user-generated contextual description may be unreliable, as it is intrinsically noisy, not to mention the mendacious descriptions edited by crafty sellers.
Therefore, similar to the existing work [139], we exploit only the reliable visual cues and the structured category information to model the compatibility between fashion items. Notably, existing efforts mainly adopt advanced deep neural networks to learn effective representations for fashion items and measure their compatibility, owing to the compelling success of such networks in various research tasks. Nevertheless, as a purely data-driven learning scheme, a deep neural network suffers from poor interpretability, since each dimension of the learned representation does not explicitly refer to an intuitive semantic aspect of fashion items. Toward this end, we aim to learn meaningful representations for fashion items whose dimensions directly stand for semantic attributes, and hence enhance the model's interpretability.
On the one hand, regarding the sophisticated visual signals, we argue that taking advantage of well pre-trained attribute classification networks is the most natural and straightforward way to obtain interpretable semantic representations of fashion items. To ensure the performance of the attribute classification networks, we align each attribute $a_q$ with a separate attribute classification network $h_q$. It is worth noting that, as the category information also constitutes an essential attribute of fashion items, here we have $Q-1$ attributes characterized by the visual cues. We feed the visual image $I_i$ of the $i$-th top/bottom into these $h_q$'s and obtain the semantic attribute representations as follows:
$$f_i^q = h_q(I_i \mid \Theta_q), \quad q = 1, 2, \ldots, Q-1, \tag{5.1}$$
where $\Theta_q$ denotes the network parameters of $h_q$ and $f_i^q \in \mathbb{R}^{M_q}$ is the network output of $h_q$. The $d$-th entry in $f_i^q$ refers to the probability that the top $t_i$ presents the attribute element $e_q^d$. In particular, we denote $f_i^v = [f_i^1; f_i^2; \ldots; f_i^{Q-1}]$ as the final semantic attribute representation of the
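Equation (5.1) and the concatenation of the per-attribute outputs can be sketched as follows. This is a minimal illustration, not the actual architecture: random linear maps followed by a softmax stand in for the pre-trained attribute classifiers $h_q$, and the attribute sizes $M_q$, the image-feature dimension, and all values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    # Numerically stable softmax, turning scores into a probability vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


# Hypothetical stand-ins for the pre-trained classifiers h_q: each maps a
# toy image feature I_i to a probability vector over the M_q elements of
# attribute a_q (e.g., color, shape). Here Q - 1 = 3 visual attributes.
M = [4, 3, 5]      # M_q for each attribute
D = 8              # dimension of the toy image feature
weights = [rng.normal(size=(m, D)) for m in M]  # Theta_q stand-ins


def semantic_representation(image_feature):
    # f_i^q = h_q(I_i | Theta_q): one probability vector per attribute,
    # concatenated into the final representation f_i^v.
    parts = [softmax(W @ image_feature) for W in weights]
    return np.concatenate(parts)


I_i = rng.normal(size=D)
f_v = semantic_representation(I_i)
print(f_v.shape)  # (12,), i.e., the sum of the M_q's
```

Each dimension of `f_v` corresponds to one attribute element in the unified set $E$, which is exactly what makes the representation interpretable.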