Discriminative Feature Learning with Consistent Attention Regularization for Person Re-identification

Sanping Zhou 1, Fei Wang 2, Zeyi Huang 3, Jinjun Wang 1∗
1. The Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
2. School of Computer Science and Technology, Xi'an Jiaotong University
3. Robotics Institute, Carnegie Mellon University
∗ Jinjun Wang is the corresponding author.

Abstract

Person re-identification (Re-ID) has undergone rapid development with the blooming of deep neural networks. Most methods are easily affected by target misalignment and background clutter in the training process. In this paper, we propose a simple yet effective feedforward attention network to address these two problems, in which a novel consistent attention regularizer and an improved triplet loss are designed to learn foreground attentive features for person Re-ID. Specifically, the consistent attention regularizer aims to keep the foreground masks deduced from the low-level, mid-level and high-level feature maps similar. As a result, the network will focus on the foreground regions at the lower layers, which helps it learn discriminative features from the foreground regions at the higher layers. Last but not least, the improved triplet loss is introduced to enhance the feature learning capability, which can jointly minimize the intra-class distance and maximize the inter-class distance in each triplet unit. Experimental results on the Market1501, DukeMTMC-reID and CUHK03 datasets show that our method outperforms most of the state-of-the-art approaches.

1. Introduction

Person re-identification (Re-ID) is a critical technology in video surveillance, which aims to associate the same pedestrian across non-overlapping camera views. With the blooming of convolutional neural networks, current deep feature learning based methods [5, 8, 53, 61] have significantly outperformed a variety of traditional feature learning based approaches [33, 43]. In practice, it is critical to learn a discriminative feature representation when solving the person Re-ID problem. However, the learned features are easily degraded by target misalignment and background clutter, because most deep feature learning based methods usually try to learn discriminative features from the whole input images.

Figure 1. Motivation of our consistent attention regularizer, which aims to drive the network to focus on foreground regions at the lower layers. Therefore, the network will learn a discriminative feature representation that enhances the useful signals from point A and suppresses the noise signals from point B at the higher layers. From the final features learned in (a) feature learning without our consistent attention regularizer and (b) feature learning with our consistent attention regularizer, we can see that the consistent attention regularizer is critical to associate two samples with target misalignment and background clutter.

As a data-driven approach, deep feature learning based methods [22, 46, 50] can autonomously focus most of their attention on the foreground regions of input images. However, the networks are easily misguided if there is no explicit regularizer to drive their attention in the feature learning process [58]. To solve this problem, two mainstream approaches have been widely studied in the past few years. The first line of methods is based on part-based networks [5, 38, 62], which try to learn discriminative features from predefined body parts.
The second line of methods is based on foreground attentions [20, 29, 34, 39, 54, 59], in which person masks are used to drive the attention in a supervised manner, or attention mechanisms are applied to deduce the attention in an unsupervised manner. In general, it is much easier to learn a discriminative feature representation with annotated person masks, because they can help the network precisely focus on the foreground regions at the lower layers. Many off-the-shelf methods [9, 57] have been widely used to generate foreground masks for person Re-ID; however, the resulting person masks are usually of poor quality due to the low resolution of the input images. As a result, there is a high risk that the foreground attention will be misguided at the lower layers [34]. To alleviate this problem, it is better to incorporate discriminative feature learning and foreground attention deducing in an end-to-end network, because they can benefit from each other in the training process. As shown in Figure 1, an important issue is how to deduce the foreground attentions at the lower layers, so as to learn foreground attentive features at the higher layers and suppress the noise signals caused by target misalignment and background clutter. In this paper, we design a simple yet effective attention network to learn a discriminative feature representation from the foreground regions for person Re-ID. Our method is inspired by the phenomenon [58] that high-level feature maps usually contain much more semantic information than low-level feature maps. Therefore, it is much easier to deduce high-quality foreground masks from the high-level feature maps than from the low-level feature maps. Specifically, we first design a novel feedforward attention network which can learn the foreground masks from the low-level, mid-level and high-level feature maps, respectively.
Then, a novel consistent attention regularizer is designed to transmit the foreground information from the high-level to the mid-level and low-level feature maps. In this manner, the high-quality foreground masks learned from the high-level feature maps can be further used to help the lower layers focus on the foreground regions. Finally, an improved triplet loss is introduced to enhance the feature learning capability, which can jointly minimize the intra-class distance and maximize the inter-class distance in each triplet unit. Our network is trained in an end-to-end manner, and it can effectively learn discriminative features to match images of the same person in a large camera system. The main contributions of our paper can be highlighted as follows: 1) A novel feedforward attention network is designed to learn foreground masks from the low-level, mid-level and high-level feature maps, respectively. 2) A novel consistent attention regularizer is put forward to keep the deduced foreground masks similar in the training process, which helps drive the network to focus on foreground regions at the lower layers. 3) A novel triplet loss is built to supervise feature learning by jointly minimizing the intra-class distance and maximizing the inter-class distance in each triplet unit. We conduct extensive experiments on the Market1501 [56], DukeMTMC-reID [27] and CUHK03 [18] datasets, which show significant improvements by our method as compared with the state-of-the-art approaches.

2. Related Work

Our method aims to learn a discriminative feature representation through consistent attention regularization; therefore, we review two lines of related work: deep feature learning and deep attention learning.

Deep feature learning. A robust feature representation is critical to solving the person Re-ID problem, and the deep feature learning based methods mainly focus on learning a discriminative feature representation from input images.
For this purpose, different loss functions have been developed to guide the feature learning process, such as the triplet loss [8], quadruplet loss [5], center loss [47] and softmax loss [16]. Meanwhile, a large number of well-known networks have been designed to extract features from the input images, including ResNet [10], DenseNet [13], MobileNet [28] and ShuffleNet [23]. In addition, different part strategies [5, 17, 38, 60] have been widely used to enhance the feature representation capability of backbone networks. In recent years, Generative Adversarial Networks (GANs) [7, 45, 58] have been extensively studied to augment the training data for person Re-ID, which is an effective way to enhance the generalization ability of the learned features. Apart from learning features from single images, another line of methods [3, 24, 49, 63] has tried to learn temporal-spatial features from video clips. Due to the strong representation capability of deep neural networks, the deep feature learning based methods have achieved state-of-the-art performance on the benchmark datasets for person Re-ID.

Deep attention learning. Deep attention learning has been extensively studied in the computer vision community, as it can effectively improve an algorithm's performance by attending to the useful information [40]. In general, the deep attention learning based methods can be divided into supervised and unsupervised lines. In the former, labeled ground truth is needed to supervise the learning process. For example, foreground masks [15, 34, 39] have been widely used to guide networks to focus their attention on the body regions, so as to learn discriminative features for person Re-ID. Besides, predefined regions [12, 55] are usually used to drive the network to learn fine features from local regions, which has been extensively studied in solving the fine-grained image classification problem.
In the latter, self-attention mechanisms or heuristic knowledge are usually used to guide the attention learning. For instance, several works [20, 54] have designed different attention modules to guide networks to put their attention on the discriminative body regions. Deep residual attention learning [41] has been successfully applied to image classification. In addition, temporal-spatial clues [25, 35] have been widely used to supervise the attention learning in video recognition and classification.

3. Our Method

Given a set of training samples X = {(X_i, Y_i)}_{i=1}^N, in which X_i indicates the i-th input image and Y_i represents the corresponding label, our method tries to learn a discriminative feature representation from the foreground regions of input images. The structure of our feedforward attention network is illustrated in Figure 2, in which a novel consistent attention regularizer and an improved triplet loss are designed to learn discriminative features for person Re-ID. Without loss of generality, we choose ResNet50 [10] as the backbone. In the following paragraphs, we explain our method in detail.

3.1. Network Structure

Our feedforward attention network aims to learn discriminative features from the foreground regions; therefore, two requirements need to be satisfied in the network design. Firstly, the backbone network should be powerful enough to extract discriminative features at the output layer. In our network structure, we choose ResNet50 as the backbone, which mainly consists of a convolutional layer, a max pooling layer and four residual blocks.
In particular, one Global Average Pooling (GAP) [21] layer and a Fully-Connected (FC) layer are used to obtain a 2048-dimensional feature vector. Besides, one Batch Normalization (BN) [14] layer is deployed between the GAP and FC layers. Secondly, an attention module should be designed to deduce the foreground masks from feature maps. For this purpose, we take a heat map to represent the foreground mask and use the resulting foreground mask to filter the corresponding feature maps in the training process. As shown in Figure 3, our attention module takes the feature maps T_k as input and outputs the deduced foreground mask H_k, which can be modeled as follows:

H_k = Mask(T_k; Θ_k),    (1)

where Θ_k represents the parameters of our k-th attention module.

Figure 2. Illustration of our feedforward attention network, which works as follows: The foreground masks are first learned from the low-level, mid-level and high-level feature maps, respectively. Then, the consistent attention regularizer is applied to keep the deduced foreground masks similar, so as to drive the network to focus on foreground regions at the lower layers. Finally, the improved triplet loss and softmax loss are jointly used to learn discriminative features in a multi-task learning framework.

Figure 3. Illustration of our attention module. For simplicity, we suppose the input feature maps T_k have L_k channels; we then fuse them in a gradual way: L_k^1 = (1/2)L_k and L_k^2 = (1/2)L_k^1. Besides, three dilated convolutional layers with different dilation ratios are used to deduce the foreground mask from a local to global view.
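To make the mask deduction and filtering concrete, the following minimal NumPy sketch deduces a heat map from a stack of feature maps and uses it to gate them. The module body here is simplified to a single 1 × 1 convolution (a weighted sum over channels) followed by a sigmoid; the actual module additionally stacks dilated 3 × 3 convolutions, and all shapes and names below are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_mask(feature_maps, weights):
    """Sketch of Eq. (1): deduce a foreground heat map H_k from T_k.

    feature_maps: array of shape (C, H, W); weights: (C,), a stand-in
    for the 1x1 convolution that collapses the channels to one map.
    """
    heat = np.tensordot(weights, feature_maps, axes=([0], [0]))  # (H, W)
    return sigmoid(heat)  # normalized to [0, 1]

def filter_features(feature_maps, mask):
    """Sketch of Eq. (2): T_a(x, y, c) = T_b(x, y, c) * H(x, y)."""
    return feature_maps * mask[None, :, :]  # broadcast over channels

# Toy example with random "feature maps".
rng = np.random.default_rng(0)
T = rng.normal(size=(8, 4, 4))   # C=8 channels on a 4x4 spatial grid
w = rng.normal(size=8)           # stand-in 1x1 conv weights
H = attention_mask(T, w)
T_filtered = filter_features(T, H)
```

Because the sigmoid keeps every mask value in [0, 1], the element-wise product can only attenuate responses, which is exactly the "suppress background, keep foreground" behavior the module is meant to learn.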
In our design, we have the following considerations: 1) At first, we take two convolutional layers to reduce the number of feature maps to 1/4 of the original, so as to summarize them in a gradual way. Then, another convolutional layer with a 1 × 1 kernel is applied to obtain the heat map. At last, a sigmoid function is used to normalize the heat map to [0, 1]. 2) Multi-scale information is applied to deduce the foreground masks from a local to global view. As in [17], three different receptive fields, namely 7, 5 and 3, are used to extract context information by using different dilation ratios in the dilated convolutional layers. Once the attention module is designed, we embed it in the ResNet50 and use the resulting heat map to filter the output feature maps of each residual block as follows:

T_k^a(x, y, c) = T_k^b(x, y, c) × H_k(x, y),    (2)

where H_k(x, y) denotes the deduced attention response at the coordinate (x, y), and T_k^a(x, y, c) and T_k^b(x, y, c) represent the output and input responses at the coordinate (x, y) from the c-th feature map, respectively.

Figure 4. Differences between the two triplet losses in gradient back-propagation: (a) gradient of the basic triplet loss; (b) gradient of the improved triplet loss. In particular, our triplet loss introduces one point c_i^j on the line between x_i and x_j to model all the pairwise relationships in each triplet unit, so as to consistently minimize the intra-class distance in the training process.

As shown in Figure 2, our feedforward attention network works as follows: 1) In the forward propagation, the backbone network first extracts
the discriminative features from input images, then the attention module deduces the foreground mask from the corresponding feature maps, and finally the generated feature maps are further filtered by the resulting foreground masks with the element-wise product. 2) The parameters of the backbone network and attention modules are jointly optimized in the backward propagation; therefore, our feedforward attention network will focus most of its attention on the foreground regions in the next iteration.

3.2. Objective Function

The objective function consists of two loss terms and one regularizer, which can be formulated as follows:

L(W, Θ) = L1(X; W) + αL2(X; W) + L3(H; Θ),    (3)

where L1(·) represents the softmax loss, L2(·) indicates the improved triplet loss, L3(·) denotes the consistent attention regularizer, and α is a constant weight. In the training process, the two loss terms aim to learn a discriminative feature representation from the raw input images, and the consistent attention regularizer tries to keep the foreground masks, which are deduced from the low-level, mid-level and high-level feature maps, respectively, similar. Because of its powerful capability, the softmax loss has been widely used in training deep neural networks. Therefore, we introduce it to supervise the feature learning process, which can be formulated as follows:

L1(X; W) = −(1/N) Σ_{i=1}^{N} log( exp(p_{Y_i}^T x_i) / Σ_g exp(p_g^T x_i) ),    (4)

where p_g denotes the g-th column of the learned classifier, and x_i represents the feature vector learned by our feedforward attention network for input image X_i. In order to apply the improved triplet loss to learn discriminative features from input images, we first organize the training samples into a set of triplet units, S = {(X_i, X_j, X_k)}, in which (X_i, X_j) represents a positive pair with Y_i = Y_j, and (X_i, X_k) indicates a negative pair with Y_i ≠ Y_k.
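As an illustration of how such triplet units can be enumerated from a batch of labels, consider the toy sketch below (exhaustive enumeration for clarity; in practice triplets are usually sampled per mini-batch rather than enumerated):

```python
def build_triplet_units(labels):
    """Organize samples into triplet units S = {(i, j, k)} where
    (i, j) is a positive pair (same label) and (i, k) a negative
    pair (different labels), following the definition of S above.
    """
    S = []
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if j == i or labels[j] != labels[i]:
                continue  # (i, j) must be a distinct positive pair
            for k in range(n):
                if labels[k] != labels[i]:
                    S.append((i, j, k))  # (i, k) is a negative pair
    return S

labels = [0, 0, 1, 1]          # two identities, two images each
S = build_triplet_units(labels)
```

With two identities of two images each, every anchor has one positive partner and two negatives, giving eight triplet units in total.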
In each triplet unit, we solve a ranking problem by using the improved triplet loss:

T = [m + d(x_i, c_i^j) + d(x_j, c_i^j) − d(x_k, c_i^j)]_+,    (5)

where d(z_1, z_2) = ||z_1 − z_2||_2^2 denotes the squared distance in feature space, m represents the margin parameter, and c_i^j = ηx_i + (1 − η)x_j indicates one point lying on the line between x_i and x_j.¹ As a result, x_i and x_j will move towards c_i^j, and the intra-class distance can be consistently minimized in the training process.

Discussion. To the best of our knowledge, a series of triplet losses has been designed in the past few years. The basic triplet loss [8] is defined as follows:

T_1 = [m + d(x_i, x_j) − d(x_i, x_k)]_+.    (D1)

Besides, some researchers have focused on how to improve the gradient back-propagation in their modifications. For example, the dual triplet loss [52] is defined as follows:

T_2 = [m + d(x_i, x_j) − (1/2)[d(x_i, x_k) + d(x_j, x_k)]]_+,    (D2)

and the symmetric triplet loss [62] is defined as follows:

T_3 = [m + d(x_i, x_j) − [u d(x_i, x_k) + v d(x_j, x_k)]]_+.    (D3)

Firstly, we compare the gradient back-propagation between our triplet loss and the basic one, as shown in Figure 4, and the differences come from two aspects: 1) The basic triplet loss only considers one positive pair (X_i, X_j) and one negative pair (X_i, X_k), which neglects the other negative pair (X_j, X_k) in its formulation. Our triplet loss introduces the center point c_i^j of the positive pair to help model all the pairwise relationships in each triplet unit. 2) Because of the resulting advantages in gradient back-propagation, our triplet loss can continuously minimize the intra-class distance, while the basic triplet loss can hardly achieve this goal in the training process. Secondly, we summarize the relationships among these triplet losses as follows: 1) We can find that T_2(x_i, x_j, x_k) = (1/2)[T_1(x_i, x_j, x_k) + T_1(x_j, x_i, x_k)], which indicates that it is important to model all the pairwise relationships in each triplet unit.
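For reference, the basic loss (D1) and the improved loss of Eq. (5) can be sketched in NumPy as follows (a toy forward-pass sketch only; in training the gradients come from the network's automatic differentiation). Note that setting η = 1 collapses c_i^j onto x_i, so the improved loss reduces exactly to the basic one in that special case.

```python
import numpy as np

def d(z1, z2):
    """Squared Euclidean distance, d(z1, z2) = ||z1 - z2||_2^2."""
    return float(np.sum((z1 - z2) ** 2))

def basic_triplet_loss(xi, xj, xk, m=1.0):
    """Basic triplet loss (D1): [m + d(xi, xj) - d(xi, xk)]_+ ."""
    return max(0.0, m + d(xi, xj) - d(xi, xk))

def improved_triplet_loss(xi, xj, xk, m=1.0, eta=0.5):
    """Improved triplet loss, Eq. (5), built around the point
    c = eta * xi + (1 - eta) * xj on the line between xi and xj.
    """
    c = eta * xi + (1.0 - eta) * xj
    return max(0.0, m + d(xi, c) + d(xj, c) - d(xk, c))

rng = np.random.default_rng(1)
xi, xj, xk = rng.normal(size=(3, 16))  # toy 16-dim feature vectors
loss = improved_triplet_loss(xi, xj, xk)
```

Because both d(x_i, c) and d(x_j, c) appear with positive sign, the loss keeps pulling both positive samples toward the shared point c, which is how the intra-class distance is driven down even when the negative term is already satisfied.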
2) The symmetric triplet loss is a generalized version of the dual triplet loss, which designs a novel algorithm to update u and v in the training process. 3) Our triplet loss does not need any additional algorithm to achieve a more robust performance than the symmetric triplet loss. Now, we extend our triplet loss to the whole set of triplet units, which can be formulated as follows:

L2(X; W) = (1/|S|) Σ_{(X_i, X_j, X_k) ∈ S} T(x_i, x_j, x_k),    (6)

where |S| indicates the number of triplet units in S.

¹To ensure that our triplet loss outperforms the basic one, we need to set η ∈ (0, 1), and we choose η = 0.5 in all the experiments. If η = 1, the basic triplet loss becomes a special case of our method.

Figure 5. Illustration of the heat maps H_1 to H_4 deduced from the low-level, mid-level and high-level feature maps of the four residual blocks, respectively. In particular, (a) shows the heat maps learned without applying the consistent attention regularizer, and (b) shows the heat maps learned by using our consistent attention regularizer.

Finally, we introduce the consistent attention regularizer to keep all the deduced foreground masks similar in the training process, which is defined as follows:

L3(H; Θ) = (β/K) Σ_{k=1}^{K} ||H_{k+1} − Ĥ_k||_F^2 + (ϖ/(K+1)) Σ_{k=1}^{K+1} ||H_k||_1,    (7)

where K + 1 denotes the number of heat maps, and β and ϖ are two constant weights. Besides, Ĥ_k is of the same size as H_{k+1}, and is obtained by max-pooling H_k with stride 2. Because there are four residual blocks in the ResNet50, we set K = 3 in all the experiments. Our consistent attention regularizer consists of two terms, i.e., the consistence term and the sparsity term, in which: 1) The consistence term aims to keep the heat maps, which are learned from the low-level, mid-level and high-level feature maps, respectively, similar.
As a result, the high-quality foreground masks learned from the high-level feature maps can be used to help the network focus on foreground regions at the lower layers. 2) The sparsity term tends to perform feature selection, which helps remove some false positive responses in the background. We compare two different sets of heat maps in Figure 5, from which we can see that the heat maps learned by using our consistent attention regularizer are much better than those learned without it.

3.3. Optimization

We optimize the deep parameters W, Θ by using the Stochastic Gradient Descent (SGD) algorithm. For simplicity, we take Ω = [W, Θ] as a whole and compute the partial derivative of Eq. (3) as follows:

∂L(Ω)/∂Ω = ∂L1(X; W)/∂W + α ∂L2(X; W)/∂W + ∂L3(H; Θ)/∂Θ,    (8)

where ∂L1(X; W)/∂W can be easily computed by using the off-the-shelf algorithm, and ∂L2(X; W)/∂W and ∂L3(H; Θ)/∂Θ are derived in the following paragraphs.

Algorithm 1 Consistent attention gradient descent.
Input: The training data X, learning rate τ, maximum iteration number Q, weight parameters α, β and ϖ, and margin parameter m.
Output: The network parameters Ω = [W, Θ].
repeat
  repeat
    1) Compute ∂L1/∂W using the off-the-shelf algorithm;
    2) Compute ∂L2/∂W according to Eq. (9);
    3) Compute ∂L3/∂Θ according to Eq. (11);
    4) Update the gradient ∂L/∂Ω according to Eq. (8);
  until all the triplet inputs {(x_i, x_j, x_k)} in each mini-batch are traversed;
  Update Ω^(q+1) = Ω^(q) − τ_q ∂L/∂Ω^(q) and q ← q + 1.
until q > Q

We denote r = m + d(x_i, c_i^j) + d(x_j, c_i^j) − d(x_k, c_i^j); then the partial derivative of our triplet loss can be formulated as follows:

∂L2(X; W)/∂W = (1/|S|) Σ_{(x_i, x_j, x_k) ∈ S} ∂P(x_i, x_j, x_k)/∂W if r > 0, and 0 otherwise,    (9)

in which ∂P(x_i, x_j, x_k)/∂W is computed as follows:

∂P(x_i, x_j, x_k)/∂W = 2(x_i − c_i^j) · (∂x_i − ∂c_i^j)/∂W + 2(x_j − c_i^j) · (∂x_j − ∂c_i^j)/∂W − 2(x_k − c_i^j) · (∂x_k − ∂c_i^j)/∂W.    (10)

The partial derivative of our consistent attention regularizer is computed as follows:

∂L3(H; Θ)/∂Θ = (β/K) Σ_{k=1}^{K} ℓ_c(H_{k+1}, Ĥ_k) + (ϖ/(K+1)) Σ_{k=1}^{K+1} ℓ_s(H_k),    (11)

where ℓ_c(H_{k+1}, Ĥ_k) and ℓ_s(H_k) are computed as follows:

ℓ_c(H_{k+1}, Ĥ_k) = 2(H_{k+1} − Ĥ_k) · (∂H_{k+1} − ∂Ĥ_k)/∂Θ,    (12)

ℓ_s(H_k) = sign(H_k) · ∂H_k/∂Θ,    (13)

where sign(·) denotes the sign function, in which sign(z) = 1 if z > 0 and sign(z) = −1 otherwise. Because our method needs to back-propagate gradients to learn a discriminative feature representation under our consistent attention regularizer, we name it the consistent attention gradient descent algorithm. Algorithm 1 shows the overall implementation of our training process.

| Index | Network | Losses | Market1501 Single-Query Top 1 / mAP | Market1501 Multi-Query Top 1 / mAP | DukeMTMC-reID Single-Query Top 1 / mAP | CUHK03 Labeled Top 1 / Top 5 | CUHK03 Detected Top 1 / Top 5 |
| 1 | ResNet. | S | 87.5 / 72.8 | 91.2 / 79.4 | 78.3 / 62.1 | 72.1 / 91.2 | 66.5 / 88.4 |
| 2 | ResNet. | BT | 87.0 / 72.4 | 91.3 / 79.5 | 77.6 / 61.8 | 73.2 / 92.2 | 68.1 / 89.6 |
| 3 | ResNet. | S+BT | 89.1 / 75.0 | 92.4 / 81.0 | 79.7 / 64.9 | 76.8 / 93.8 | 74.8 / 93.0 |
| 4 | ResNet. | IT | 89.7 / 75.8 | 92.9 / 81.4 | 79.2 / 64.5 | 77.1 / 94.2 | 74.1 / 92.9 |
| 5 | ResNet. | S+IT | 93.4 / 79.2 | 94.2 / 82.5 | 82.1 / 68.4 | 82.4 / 96.6 | 78.4 / 94.5 |
| 6 | ResNet.(AM) | S | 87.8 / 73.0 | 91.6 / 79.8 | 78.9 / 63.6 | 74.1 / 92.8 | 70.9 / 90.9 |
| 7 | ResNet.(AM) | BT | 87.1 / 72.5 | 91.2 / 79.5 | 78.1 / 62.0 | 76.5 / 93.6 | 72.9 / 91.8 |
| 8 | ResNet.(AM) | S+BT | 89.4 / 75.4 | 92.5 / 81.1 | 81.2 / 68.1 | 81.1 / 95.8 | 77.8 / 94.3 |
| 9 | ResNet.(AM) | IT | 90.2 / 76.6 | 93.3 / 82.0 | 79.8 / 65.2 | 81.3 / 96.1 | 78.1 / 94.4 |
| 10 | ResNet.(AM) | S+IT | 93.9 / 79.5 | 94.6 / 82.9 | 82.6 / 69.1 | 88.4 / 97.8 | 85.5 / 96.6 |
| 11 | ResNet.(AM) | S+CA | 89.3 / 75.4 | 92.7 / 81.2 | 81.6 / 68.4 | 78.5 / 94.6 | 75.1 / 93.2 |
| 12 | ResNet.(AM) | BT+CA | 88.9 / 74.9 | 92.7 / 80.9 | 80.9 / 67.9 | 80.1 / 95.4 | 76.9 / 93.8 |
| 13 | ResNet.(AM) | S+BT+CA | 92.1 / 78.6 | 93.8 / 82.5 | 83.5 / 70.4 | 86.6 / 97.2 | 82.4 / 96.0 |
| 14 | ResNet.(AM) | IT+CA | 93.3 / 79.2 | 95.2 / 83.7 | 83.1 / 70.2 | 89.1 / 98.1 | 87.1 / 97.3 |
| 15 | ResNet.(AM) | S+IT+CA | 96.1 / 84.7 | 98.2 / 87.3 | 86.3 / 73.1 | 96.9 / 99.6 | 93.2 / 99.2 |

Table 1.
Matching rates (%) of different variants of our method on the three benchmark datasets, in which 1) AM: Attention Module; 2) S: Softmax Loss; 3) BT: Basic Triplet Loss; 4) IT: Improved Triplet Loss; 5) CA: Consistent Attention Regularizer.

4. Experiments

4.1. Settings

Datasets. We conduct experiments on three large-scale datasets, i.e., Market1501 [56], DukeMTMC-reID [27] and CUHK03 [18]. The Market1501 dataset contains 32,668 images, including 12,936 training samples from 751 identities and 19,732 testing samples from 750 identities. The DukeMTMC-reID dataset consists of 1,812 identities captured by 8 different cameras, in which 16,522 images from 702 identities are used as training samples, 2,228 images of another 702 identities are used as queries, and the remaining 17,661 images, including the distractors, are used for the gallery set. The CUHK03 dataset contains 13,164 images of 1,467 identities, in which samples of 1,367 identities are randomly chosen for training, and samples of the remaining identities are used for testing.

Implementation. In our implementation, we first resize the input images to 256 × 128, followed by random cropping and flipping for data augmentation. The batch size is 32, and the learning rate is τ = 0.01, decayed by 0.1 every 10 epochs. The weight parameters are set as α = β = 0.1 and ϖ = 0.01, and the margin parameter is chosen as m = 1.0. Once the network is trained, we simply use it to extract features from the testing images and formulate person Re-ID as a nearest neighbor search problem.

4.2. Ablation Study

Variants.
To evaluate how much our method improves the final results, we design 15 experiments on each dataset, as shown in Table 1, which support the following conclusions: 1) The multi-task learning framework is more effective than the single-task learning framework in learning discriminative features. 2) The improved triplet loss is superior to the basic triplet loss in supervising the feature learning. 3) The attention subnetwork can slightly improve the network's representation capability. 4) The consistent attention regularizer can guide the attention subnetwork to better explore the foreground regions of input images. As a result, we incorporate our three contributions in a multi-task learning framework to learn a discriminative feature representation for person Re-ID. In the next paragraphs, we explain the above conclusions in detail.

| Methods | Labeled Top 1 / Top 5 | Detected Top 1 / Top 5 |
| LDNS [51] (CVPR2016) | 62.6 / 90.5 | 54.7 / 84.8 |
| PDC [36] (ICCV2017) | 88.7 / 98.6 | 78.3 / 94.8 |
| DLPA [54] (ICCV2017) | 85.1 / 97.6 | – / – |
| SVDNet [37] (ICCV2017) | – / – | 81.8 / 95.2 |
| DCAF [17] (CVPR2017) | 74.2 / 94.3 | 68.0 / 91.0 |
| SSM [1] (CVPR2017) | 76.6 / 94.6 | 72.7 / 92.4 |
| DPFL [6] (CVPR2017) | 86.7 / – | 82.0 / – |
| JLML [19] (IJCAI2017) | 82.8 / – | 78.1 / – |
| PRGP [39] (CVPR2018) | 83.2 / 98.0 | 80.6 / 96.9 |
| DGRW [30] (CVPR2018) | 94.9 / 98.7 | – / – |
| BraidNet [44] (CVPR2018) | 88.2 / 98.7 | 85.9 / 98.5 |
| AACN [48] (CVPR2018) | 91.4 / 98.9 | 89.5 / 97.7 |
| GCSL [4] (CVPR2018) | 90.2 / 98.5 | 88.8 / 97.2 |
| SGGNN [31] (ECCV2018) | 95.3 / 99.1 | – / – |
| PN-GAN [26] (ECCV2018) | 79.8 / 96.2 | – / – |
| Our Method | 96.9 / 99.6 | 93.2 / 99.2 |

Table 2. The matching rates (%) comparison with the state-of-the-art methods on the CUHK03 dataset, in which '–' means the corresponding results are not reported.

For clarity, we check the above conclusions based on the performances on the Market1501 dataset using the single-query evaluation.
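Concretely, the single-query matching rates reported here come from the nearest-neighbor search formulation mentioned in Section 4.1; a minimal sketch of the Top 1 computation is shown below (illustrative only, omitting the same-camera filtering used by the standard evaluation protocols):

```python
import numpy as np

def top1_matching_rate(query_feats, query_ids, gallery_feats, gallery_ids):
    """Single-query evaluation as nearest-neighbor search: each query
    feature is ranked against the gallery by squared Euclidean distance,
    and a match counts when the nearest gallery image shares the query
    identity. Same-camera filtering of the standard protocols is omitted.
    """
    hits = 0
    for f, pid in zip(query_feats, query_ids):
        dists = np.sum((gallery_feats - f) ** 2, axis=1)
        nearest = int(np.argmin(dists))
        hits += int(gallery_ids[nearest] == pid)
    return hits / len(query_ids)

# Toy example: two gallery identities, two queries that should match.
gallery = np.array([[0.0, 0.0], [10.0, 10.0]])
g_ids = [1, 2]
queries = np.array([[0.5, 0.0], [9.0, 10.0]])
q_ids = [1, 2]
rate = top1_matching_rate(queries, q_ids, gallery, g_ids)
```

The mAP numbers are computed from the same distance-sorted ranking lists, averaging precision over all correct gallery positions rather than checking only the first one.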
To evaluate how much the multi-task learning framework outperforms the single-task learning framework, we can compare the experimental results listed in indexes 1, 2 and 3; indexes 1, 4 and 5; indexes 6, 7 and 8; indexes 6, 9 and 10; indexes 11, 12 and 13; and indexes 11, 14 and 15, from which we can find that the multi-task learning framework significantly improves the person Re-ID results in all six situations. Take the experimental results in indexes 1, 2 and 3 as an example: the

Figure 6. Influences of different parameter settings on the final matching rates. Specifically, we compare the Top 1 and mAP performances of our method on the Market1501 dataset using the single-query evaluation, in which the influences of α ∈ {0.0, 0.1, 0.2, 0.3, 0.4}, β ∈ {0.0, 0.1, 0.2, 0.3, 0.4}, ϖ ∈ {0.0, 0.01, 0.02, 0.03, 0.04} and m ∈ {0.6, 0.8, 1.0, 1.2, 1.4} are illustrated in (a) to (d), respectively.

| Methods | Single-Query Top 1 / mAP | Multi-Query Top 1 / mAP |
| LDNS [51] (CVPR2016) | 61.0 / 35.6 | 71.6 / 46.0 |
| PDC [36] (ICCV2017) | 84.1 / 63.4 | – / – |
| SVDNet [37] (ICCV2017) | 82.3 / 62.1 | – / – |
| DLPA [54] (ICCV2017) | 81.0 / 63.4 | – / – |
| DPFL [6] (CVPR2017) | 88.6 / 72.6 | 92.3 / 80.7 |
| PRGP [39] (CVPR2018) | 81.2 / – | – / – |
| MLFN [2] (CVPR2018) | 90.0 / 74.3 | 92.3 / 82.4 |
| HA-CAN [20] (CVPR2018) | 91.2 / 75.7 | 93.8 / 82.8 |
| DGRW [30] (CVPR2018) | 92.7 / 82.5 | – / – |
| DuATM [32] (CVPR2018) | 91.4 / 76.6 | – / – |
| MGCAN [34] (CVPR2018) | 83.8 / 74.3 | – / – |
| BraidNet [44] (CVPR2018) | 83.7 / 69.5 | – / – |
| AACN [48] (CVPR2018) | 85.9 / 66.9 | 76.8 / 59.3 |
| GCSL [4] (CVPR2018) | 93.5 / 81.6 | – / – |
| PCB [38] (ECCV2018) | 93.8 / 81.6 | – / – |
| SGGNN [31] (ECCV2018) | 92.3 / 82.8 | – / – |
| PN-GAN [26] (ECCV2018) | 89.4 / 72.6 | 92.9 / 80.2 |
| MGN [42] (ACM MM2018) | 95.7 / 86.9 | 96.9 / 90.7 |
| Our Method | 96.1 / 84.7 | 98.2 / 87.3 |

Table 3.
The matching rates (%) comparison with the state-of-the-art methods on the Market1501 dataset, in which '–' means the corresponding results are not reported.

| Methods | Top 1 | Top 5 | Top 10 | mAP |
| SVDNet [37] (ICCV2017) | 75.9 | 86.4 | 89.5 | 56.3 |
| DLPA [54] (ICCV2017) | 81.0 | – | – | – |
| GAN [58] (ICCV2017) | 67.7 | – | – | 47.1 |
| DPFL [6] (CVPR2017) | 79.2 | – | – | 60.6 |
| MLFN [2] (CVPR2018) | 81.0 | – | – | 62.8 |
| HA-CAN [20] (CVPR2018) | 80.5 | – | – | 60.8 |
| DGRW [30] (CVPR2018) | 80.7 | 88.5 | 90.8 | 66.4 |
| DuATM [32] (CVPR2018) | 81.8 | 90.2 | – | 64.6 |
| BraidNet [44] (CVPR2018) | 76.4 | – | – | 59.5 |
| AACN [48] (CVPR2018) | 76.8 | – | – | 59.3 |
| GCSL [4] (CVPR2018) | 84.9 | – | – | 69.5 |
| PCB [38] (ECCV2018) | 83.3 | 90.5 | 92.5 | 69.2 |
| SGGNN [31] (ECCV2018) | 81.1 | 88.4 | 91.2 | 68.2 |
| PN-GAN [26] (ECCV2018) | 73.6 | – | 88.8 | 53.2 |
| MGN [42] (ACM MM2018) | 88.7 | – | – | 78.4 |
| Our Method | 86.3 | 92.3 | 95.2 | 73.1 |

Table 4. The matching rates (%) comparison with the state-of-the-art methods on the DukeMTMC-reID dataset, in which '–' means the corresponding results are not reported.

S+BT outperforms S and BT by 1.6% and 2.1% in Top 1, and by 2.2% and 2.6% in mAP, respectively. For the improvements brought by our triplet loss, we compare the results between indexes 2 and 4; between indexes 3 and 5; between indexes 7 and 9; between indexes 8 and 10; between indexes 11 and 14; and between indexes 13 and 15, respectively. The results show that the improved triplet loss is superior to the basic triplet loss in learning discriminative features. For instance, the results obtained by our triplet loss outperform those achieved by the basic triplet loss by 3.1% in Top 1 and 4.1% in mAP, when we compare the performances between indexes 7 and 9. From the results listed in Block 1 (indexes 1 to 5) and Block 2 (indexes 6 to 10), we can see that the improvement brought by the attention subnetwork is insignificant, because it is hard to directly deduce attention from the low-level feature maps.
Specifically, the improvements are only 0.3%, 0.1%, 0.3%, 0.5% and 0.5% in Top 1, and 0.2%, 0.1%, 0.4%, 0.8% and 0.3% in mAP, when we compare the corresponding results between Block 1 and Block 2, respectively. When the consistent attention regularizer is used to help deduce attention, the results are significantly improved. Specifically, the improvements are 1.5%, 1.8%, 2.7%, 3.1% and 2.2% in Top 1, and 2.4%, 2.4%, 3.2%, 2.6% and 5.2% in mAP, when we compare the corresponding results between Block 2 and Block 3 (indexes 11 to 15), respectively.

Parameters. As in most deep learning methods, the performance of our method is highly dependent on the weight parameters α, β and ϖ, and the margin parameter m. In order to clarify this influence, we design four sets of experiments to evaluate how the parameter settings affect the final person Re-ID performance. Specifically, we change only one parameter and keep the others fixed in each set of experiments. For simplicity, we conduct the experiments on the Market1501 dataset and evaluate the results using the single-query evaluation. The results are shown in Figure 6, from which we find that: 1) The experimental results are robust to α, β and m, for which a large variation range is allowed while maintaining the final person Re-ID performance at a relatively high level. 2) The experimental results are slightly sensitive to ϖ, because the sparsity is hard to control in the training process. If ϖ is large, some of the useful information may be filtered out, and the person Re-ID performance will be seriously affected. If ϖ is small, the ability of feature selection will be weakened, which is also not beneficial for further improving the final performance. Taking the two situations into account, we prefer a small ϖ in our experiments.

Figure 7.
Visualization of the averaged heat maps on the CUHK03, Market1501 and DukeMTMC-reID datasets. From the results we can see that, with the consistent attention regularizer, the network focuses on foreground regions at the lower layers.

Losses        CUHK03           Market1501        DukeMTMC
              Top 1   Top 5    Top 1   mAP       Top 1   mAP
BT            88.6    97.2     92.1    78.6      83.5    70.4
DT            90.3    98.2     93.5    79.8      84.2    70.9
ST            92.8    98.6     94.2    80.3      85.0    71.5
Our Triplet   96.9    99.6     96.1    84.7      86.3    73.1

Table 5. Results of four different triplet losses on the three benchmark datasets, in which 'BT' denotes the basic triplet loss, 'DT' the dual triplet loss and 'ST' the symmetric triplet loss.

Visualization. Our consistent attention regularizer effectively keeps the foreground masks deduced from the low-level, mid-level and high-level feature maps similar. As a result, our network focuses its attention on foreground regions at the lower layers. We visualize the averaged heat maps on the three datasets in Figure 7, from which we can see that most of the network's attention is concentrated on the foreground regions from the lower to the higher layers. Therefore, the resulting features are robust to target misalignment and background clutter.

4.3. Comparison Results

First, we compare our method with many state-of-the-art competitors on the CUHK03, Market1501 and DukeMTMC-reID datasets, as shown in Table 2 to Table 4. From the results we can see that: 1) Our method achieves the best result on the CUHK03 dataset, outperforming the previously best-performing SGGNN [31] by 1.6% in Top 1; 2) Our method performs close to MGN [42] on the Market1501 and DukeMTMC-reID datasets, where our method is better in Top 1 and MGN is better in mAP.
The reason comes from two aspects: 1) Our network is much lighter, while MGN needs three part-branch networks to extract features; 2) Our triplet loss does not use any hard-mining strategy, while MGN further applies the batch-hard triplet loss [11] to improve the final results. From this point of view, our method achieves a competitive result in a very simple yet effective way. Second, we compare the performance of four different triplet losses on the three datasets, as shown in Table 5. From the results we can conclude that: 1) The dual triplet loss outperforms the basic triplet loss, and the symmetric triplet loss outperforms the dual triplet loss, on all three datasets, which indicates that revising the gradient back-propagation is an effective way to minimize the intra-class distances. 2) Our triplet loss outperforms the symmetric triplet loss on all three datasets, because it does not need any additional algorithm to help update the weights in the training process.

5. Conclusion

In this paper, we propose a simple yet effective feedforward attention network to learn discriminative features from the foreground regions for person Re-ID. Specifically, a novel consistent attention regularizer is designed to keep the foreground masks deduced from the low-level, mid-level and high-level feature maps similar. As a result, the network focuses on the foreground regions at the lower layers, and can effectively deal with target misalignment and background clutter at the higher layers. Besides, a novel triplet loss is introduced to enhance the feature learning capability, which can jointly minimize the intra-class distance and maximize the inter-class distance in each triplet unit. Extensive experimental results on the Market1501, DukeMTMC-reID and CUHK03 datasets show that our method outperforms most of the state-of-the-art approaches.
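To make the two components concrete, the snippet below gives a minimal NumPy sketch, not the paper's exact formulation: the pairwise mean-squared form of the consistent attention regularizer, the hinge-plus-pull form of the improved triplet loss, and the margin value 0.5 are all illustrative assumptions.

```python
import numpy as np

def consistent_attention_regularizer(masks):
    """Penalize disagreement between the foreground masks deduced at
    different depths (low-, mid- and high-level), all assumed to be
    resized to a common H x W beforehand. The pairwise-MSE form is an
    assumption, not the paper's exact equation."""
    loss = 0.0
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            loss += np.mean((masks[i] - masks[j]) ** 2)
    return loss

def improved_triplet_loss(anchor, positive, negative, margin=0.5):
    """Jointly pull each positive toward its anchor (intra-class term)
    and push the negative beyond a margin (inter-class term) in every
    triplet unit; the exact form and margin are illustrative."""
    d_ap = np.sum((anchor - positive) ** 2, axis=1)  # intra-class distances
    d_an = np.sum((anchor - negative) ** 2, axis=1)  # inter-class distances
    hinge = np.maximum(0.0, d_ap - d_an + margin)    # relative margin term
    return np.mean(hinge) + np.mean(d_ap)            # plus explicit pull term
```

When the masks at all three depths agree exactly, the regularizer vanishes; likewise, the triplet term vanishes once each positive coincides with its anchor and the negative lies beyond the margin.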
Acknowledgement

This work is jointly supported by the National Key Research and Development Program of China under Grant No. 2017YFA0700800, and the National Natural Science Foundation of China under Grant No. 61629301.

References

[1] Song Bai, Xiang Bai, and Qi Tian. Scalable person re-identification on supervised smoothed manifold. In CVPR, July 2017.
[2] Xiaobin Chang, Timothy M. Hospedales, and Tao Xiang. Multi-level factorisation net for person re-identification. In CVPR, 2018.
[3] Dapeng Chen, Hongsheng Li, Tong Xiao, Shuai Yi, and Xiaogang Wang. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In CVPR, June 2018.
[4] Dapeng Chen, Dan Xu, Hongsheng Li, Nicu Sebe, and Xiaogang Wang. Group consistent similarity learning via deep CRF for person re-identification. In CVPR, pages 8649–8658, 2018.
[5] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. A multi-task deep network for person re-identification. In AAAI, pages 3988–3994, 2017.
[6] Yanbei Chen, Xiatian Zhu, and Shaogang Gong. Person re-identification by deep learning multi-scale representations. In CVPR, pages 2590–2600, 2017.
[7] Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, Yi Yang, and Jianbin Jiao. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In CVPR, June 2018.
[8] Shengyong Ding, Liang Lin, Guangrun Wang, and Hongyang Chao. Deep feature learning with relative distance comparison for person re-identification. PR, 48(10):2993–3003, 2015.
[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[11] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification.
arXiv preprint arXiv:1703.07737, 2017.
[12] Shaoli Huang, Zhe Xu, Dacheng Tao, and Ya Zhang. Part-stacked CNN for fine-grained visual categorization. In CVPR, pages 1173–1182, 2016.
[13] Forrest Iandola, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell, and Kurt Keutzer. DenseNet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869, 2014.
[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[15] Mahdi M. Kalayeh, Emrah Basaran, Muhittin Gökmen, Mustafa E. Kamasak, and Mubarak Shah. Human semantic parsing for person re-identification. In CVPR, June 2018.
[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[17] Dangwei Li, Xiaotang Chen, Zhang Zhang, and Kaiqi Huang. Learning deep context-aware features over body and latent parts for person re-identification. In CVPR, pages 384–393, 2017.
[18] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, pages 152–159, 2014.
[19] Wei Li, Xiatian Zhu, and Shaogang Gong. Person re-identification by deep joint learning of multi-loss classification. In AAAI, pages 2194–2200, 2017.
[20] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In CVPR, June 2018.
[21] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[22] Jinxian Liu, Bingbing Ni, Yichao Yan, Peng Zhou, Shuo Cheng, and Jianguo Hu. Pose transferrable person re-identification. In CVPR, June 2018.
[23] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. arXiv preprint arXiv:1807.11164, 2018.
[24] Niall McLaughlin, Jesus Martinez del Rincon, and Paul Miller. Recurrent convolutional network for video-based person re-identification. In CVPR, pages 1325–1334, 2016.
[25] Wenjie Pei, Tadas Baltrušaitis, David M. J. Tax, and Louis-Philippe Morency. Temporal attention-gated model for robust sequence classification. In CVPR, pages 820–829, 2017.
[26] Xuelin Qian, Yanwei Fu, Tao Xiang, Wenxuan Wang, Jie Qiu, Yang Wu, Yu-Gang Jiang, and Xiangyang Xue. Pose-normalized image generation for person re-identification. In ECCV, September 2018.
[27] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV Workshop on Benchmarking Multi-Target Tracking, 2016.
[28] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.
[29] M. Saquib Sarfraz, Arne Schumann, Andreas Eberle, and Rainer Stiefelhagen. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In CVPR, June 2018.
[30] Yantao Shen, Hongsheng Li, Tong Xiao, Shuai Yi, Dapeng Chen, and Xiaogang Wang. Deep group-shuffling random walk for person re-identification. In CVPR, pages 2265–2274, 2018.
[31] Yantao Shen, Hongsheng Li, Shuai Yi, Dapeng Chen, and Xiaogang Wang. Person re-identification with deep similarity-guided graph neural network. In ECCV, September 2018.
[32] Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C. Kot, and Gang Wang. Dual attention matching network for context-aware feature sequence based person re-identification. In CVPR, June 2018.
[33] Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, page 1470, 2003.
[34] Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. Mask-guided contrastive attention model for person re-identification. In CVPR, June 2018.
[35] Jifei Song, Qian Yu, Yi-Zhe Song, Tao Xiang, and Timothy M. Hospedales. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In ICCV, pages 5552–5561, 2017.
[36] Chi Su, Jianing Li, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Pose-driven deep convolutional model for person re-identification. In ICCV, pages 3980–3989, 2017.
[37] Yifan Sun, Liang Zheng, Weijian Deng, and Shengjin Wang. SVDNet for pedestrian retrieval. In ICCV, October 2017.
[38] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, September 2018.
[39] Maoqing Tian, Shuai Yi, Hongsheng Li, Shihua Li, Xuesen Zhang, Jianping Shi, Junjie Yan, and Xiaogang Wang. Eliminating background-bias for robust person re-identification. In CVPR, pages 5794–5803, 2018.
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
[41] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In CVPR, pages 3156–3164, 2017.
[42] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In ACM MM, pages 274–282, 2018.
[43] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In CVPR, pages 3360–3367, 2010.
[44] Yicheng Wang, Zhenzhong Chen, Feng Wu, and Gang Wang. Person re-identification with cascaded pairwise convolutions.
In CVPR, pages 1470–1478, 2018.
[45] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer GAN to bridge domain gap for person re-identification. In CVPR, June 2018.
[46] Xing Wei, Yue Zhang, Yihong Gong, and Nanning Zheng. Kernelized subspace pooling for deep local descriptors. In CVPR, June 2018.
[47] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515, 2016.
[48] Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. Attention-aware compositional network for person re-identification. arXiv preprint arXiv:1805.03344, 2018.
[49] Shuangjie Xu, Yu Cheng, Kang Gu, Yang Yang, Shiyu Chang, and Pan Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In ICCV, October 2017.
[50] Masataka Yamaguchi, Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Spatio-temporal person retrieval via natural language queries. In ICCV, October 2017.
[51] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a discriminative null space for person re-identification. In CVPR, pages 1239–1248, 2016.
[52] Shun Zhang, Yihong Gong, Jia-Bin Huang, Jongwoo Lim, Jinjun Wang, Narendra Ahuja, and Ming-Hsuan Yang. Tracking persons-of-interest via adaptive discriminative features. In ECCV, pages 415–433, 2016.
[53] Shizhou Zhang, Qi Zhang, Xing Wei, Yanning Zhang, and Yong Xia. Person re-identification with triplet focal loss. IEEE Access, 6:78092–78099, 2018.
[54] Liming Zhao, Xi Li, Yueting Zhuang, and Jingdong Wang. Deeply-learned part-aligned representations for person re-identification. In ICCV, October 2017.
[55] Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In ICCV, 2017.
[56] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian.
Scalable person re-identification: A benchmark. In ICCV, pages 1116–1124, 2015.
[57] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. In ICCV, pages 1529–1537, 2015.
[58] Zhedong Zheng, Liang Zheng, and Yi Yang. A discriminatively learned CNN embedding for person re-identification. TOMM, 14(1):13, 2017.
[59] Sanping Zhou, Jinjun Wang, Deyu Meng, Yudong Liang, Yihong Gong, and Nanning Zheng. Discriminative feature learning with foreground attention for person re-identification. TIP, 2019.
[60] Sanping Zhou, Jinjun Wang, Deyu Meng, Xiaomeng Xin, Yubing Li, Yihong Gong, and Nanning Zheng. Deep self-paced learning for person re-identification. PR, 76:739–751, 2018.
[61] Sanping Zhou, Jinjun Wang, Rui Shi, Qiqi Hou, Yihong Gong, and Nanning Zheng. Large margin learning in set-to-set similarity comparison for person re-identification. TMM, 20(3):593–604, 2017.
[62] Sanping Zhou, Jinjun Wang, Jiayun Wang, Yihong Gong, and Nanning Zheng. Point to set similarity based deep feature learning for person re-identification. In CVPR, 2017.
[63] Zhen Zhou, Yan Huang, Wei Wang, Liang Wang, and Tieniu Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In CVPR, pages 6776–6785, 2017.