Received June 28, 2019, accepted July 3, 2019, date of publication July 16, 2019, date of current version August 5, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2929080

Precise Ship Location With CNN Filter Selection From Optical Aerial Images

SAMER ALASHHAB1, ANTONIO-JAVIER GALLEGO1,2, ANTONIO PERTUSA1,2, AND PABLO GIL1,3, (Senior Member, IEEE)
1 Computer Science Research Institute, University of Alicante, 03690 Alicante, Spain
2 Department of Software and Computing Systems, University of Alicante, 03690 Alicante, Spain
3 Department of Physics, Systems Engineering and Signal Theory, University of Alicante, 03690 Alicante, Spain

Corresponding author: Antonio-Javier Gallego ([email protected])

This work was supported in part by the Spanish Government's Ministry of Economy, Industry, and Competitiveness under Project RTC-2014-1863-8, and in part by Babcock MCS Spain under Project INAER4-14Y (IDI-20141234).

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

The associate editor coordinating the review of this manuscript and approving it for publication was Amjad Ali.

ABSTRACT This paper presents a method that can be used for the efficient detection of small maritime objects. The proposed method employs aerial images in the visible spectrum as inputs to train a categorical convolutional neural network (CNN) for the classification of ships. A subset of those filters that make the greatest contribution to the classification of the target class is selected from the inner layers of the CNN. The gradients with respect to the input image are then calculated on these filters, which are subsequently normalized and combined. Thresholding and a morphological operation are then applied in order to eventually obtain the localization. One of the advantages of the proposed approach with regard to previous object detection methods is that only a few images need to be labeled with bounding boxes of the targets in order to train for localization. The method was evaluated with an extended version of the MASATI (MAritime SATellite Imagery) dataset. This new dataset has more than 7,000 images, 4,157 of which contain ships. Using only 14 training images, the proposed approach achieves better results for small targets than other well-known object detection methods, which also require many more training images.

INDEX TERMS Artificial neural networks, learning systems, object detection, remote sensing.

I. INTRODUCTION
Systems for automatic ship detection are very important for maritime surveillance operations. They can be used to monitor marine traffic [1], illegal fishing, and sea border activities, and also during search and rescue operations such as the detection of bodies lying in the sea [2]. These types of algorithms are usually based on information gathered from satellite or aerial images, either by means of visible-spectrum imagery or through the use of SAR-type sensors [3]–[6], each of which has different advantages and disadvantages.

The detection of small objects in large swaths of imagery is one of the primary problems in aerial imagery analytics [7], and is a particularly challenging task in satellite imagery. The objects of interest in this type of images are often very small and densely clustered, while in other types of lateral or general images the targets are much larger and more prominent, as occurs in the ImageNet dataset [8]. Moreover, objects viewed from overhead can have any orientation (e.g. ships can have any heading angle, whereas the traffic lights or trees in ImageNet are reliably vertical).

Object detection can be addressed using different strategies. The most evident technique is the use of a sliding window on the input image, which yields a prediction for each frame until the entire image has been processed. In this case, the accuracy of the detection varies according to the size of the window and the overlap used. However, this approach is very slow and computationally expensive. Most recent works overcome these limitations by performing classification and localization simultaneously.

The automatic detection of ships has been an active research field for decades, and continues to attract increasing interest. The first techniques used for ship detection were based on hand-crafted descriptors. For example, Lure et al. [9] and Weiss et al. [10] proposed a detection system for the tracking of ships using High Resolution Radiometer imagery. In this work, image features were
first extracted and subsequently classified using similarity measures obtained from those features. More recent examples include the use of Boosted Local Structured HOG-LBP for object localization [11], a multi-fold multiple instance learning procedure [12], implicit cues from image tags [13], or image pixel intensity probabilities combined with LBP descriptors [14]. A complete review of ship detection methods can be found in [15].

Selecting hand-crafted features that can be employed to detect targets in images is a challenging task, particularly when objects have a different appearance and size. Recent image classification techniques have attempted to deal with this problem by making use of deep learning [16] and, in particular, of Convolutional Neural Networks (CNN), in order to perform classification without having to apply either hand-crafted feature extraction or pre-processing techniques. The performance of these networks has proven to be close to the human level, or even better for some types of tasks. Widely known CNN topologies include Xception [17], Inception v3 [18], ResNet [19], and VGG [20], among others. For example, Wu et al. [21] classified ships using a CNN and then unified iterative bounding-box regression and ship classification in a multi-task network. In Yang et al. [22], in addition to the bounding boxes, the orientations of the ships were provided by using a model consisting of five parts: a dense feature pyramid network, an adaptive region-of-interest alignment, a rotational bounding-box regression, a prow direction prediction, and a rotational non-maximum suppression. Yu et al. [23] used Haar-like features to obtain the approximate positions of ships, and then applied a PCNet architecture to the candidate windows.

Many deep learning methods are dedicated to the detection of objects in general. A review of those methods can be found in [24]–[26], while an evaluation of small object detection can be found in [27], which analyzes the results of well-known methods such as YOLO (You Only Look Once) [28], SSD (Single Shot MultiBox Detector) [29], and Faster R-CNN [30].

However, these types of techniques also have a number of disadvantages, principally the fact that, since they are supervised methods, they need a large amount of labeled data in order to be trained, which is very expensive in terms of time, resources and effort. In addition, methods using weakly-supervised techniques usually have very low accuracy as regards detecting small objects. Moreover, object detection networks usually require adaptations when targets are very small [31], [32], which makes it impossible to apply this type of method in a general manner.

In this paper, we propose a weakly-supervised deep learning method for efficient object detection. The method is particularly focused on the detection of small ships in satellite images and requires only a few training samples labeled with the location (bounding boxes) of the ships in order to obtain their precise position. The proposed approach addresses the object detection task on the basis of a network trained for classification. The low precision of weakly-supervised algorithms is improved through the use of a filter selection process: the filters learned by the categorical network are analyzed in order to select only those that allow targets to be detected with greater precision. The method calculates the gradient obtained between the activations of each of these filters and the input image. It then normalizes and combines these gradients, in addition to applying a threshold and a morphological operation, in order to eventually obtain the location of the targets.

This approach was evaluated with an extended version of the MASATI (MAritime SATellite Imagery) dataset [1], to which more than one thousand images of ships were added, in addition to the labeling of their locations. The new dataset consists of a total of 7,389 aerial images, of which ships represent only 0.03 % of the pixels.

We also performed a comparison with current state-of-the-art approaches based on deep learning, and specifically with RetinaNet [33], Faster R-CNN [30], YOLO v2 [34], YOLO v3 [35], YOLT [7], and class-activation maps using backpropagation with VGG-16 and VGG-19 [36]. The results of this comparison are very competitive as regards small objects, particularly when the background is relatively uniform, as occurs with the ship detection task, thus demonstrating that the approach can generalize and learn from very few images.

The remainder of the paper is organized as follows: The following section provides a review of the state of the art of object detection methods; the proposed weakly-supervised object detection method is described in Section III; the new version of the MASATI dataset used for evaluation is described in Section IV; the series of experiments carried out is detailed in Section V; and finally, the main conclusions of this work are summarized in Section VI.

II. STATE-OF-THE-ART
In this section, we review the state of the art of object detection methods, which are, within the scope of this work, divided into supervised and weakly-supervised object detection methods.

A. SUPERVISED OBJECT DETECTION METHODS
Object detection methods can be roughly classified [24] as one-stage detectors (including methods such as YOLO [28], [34], [35], RetinaNet [33], or SSD [29]), two-stage detectors (Faster R-CNN [30] or YOLT [7]), cascade detectors (Bai & Ghanem [37]), and part-based models (Dai et al. [38]).

One of the first two-stage object detectors was Faster R-CNN [30], a method consisting of class-agnostic proposals and class-specific detections. In this work, the authors present an efficient fully convolutional approach, denominated Region Proposal Network (RPN), that can be used to propose regions. The detector further classifies and refines the bounding boxes around those proposals.

One of the best-known single-stage object detectors is YOLO (You Only Look Once) [28]. This architecture addresses object detection as a regression problem in order to obtain spatially separate bounding boxes and associated class probabilities. A single neural network directly predicts bounding boxes and class probabilities from full images in one evaluation. YOLO v2 [34] was an improvement on the first version: the image is divided into regions, and bounding boxes and probabilities are predicted for each region. It outperformed previous state-of-the-art methods, such as Faster R-CNN [30] and SSD [29]. YOLO v3 [35] is an improvement on YOLO v2 which, despite being larger, is faster and more accurate.

RetinaNet [33] proposes a focal loss that makes it possible to train a high-accuracy one-stage detector.
The focal loss was designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training. The name RetinaNet originates from its dense sampling of object locations in an input image. Its design comprises an efficient in-network feature pyramid and the use of anchor boxes.

YOLT (You Only Look Twice) [7] is one of the methods designed specifically for the detection of ships in satellite imagery. It is a two-stage detector consisting of a fully-convolutional neural network with a passthrough layer (similar to identity mappings in ResNet [19]) that concatenates the final layer onto the last convolutional layer, thus giving the detector access to the finer-grained features of this expanded feature map.

A number of existing methods use feature maps for ship detection and have an architecture similar to that of Faster R-CNN (employing a two-step methodology). For example, Li et al. [39] proposed a topology similar to Faster R-CNN (called HSF-Net) that employs a region proposal network to generate ship candidates from feature maps. In Huang et al. [40], a new neural network architecture denominated squeeze-excitation skip-connection path networks (SESPNets) was proposed. The authors added a bottom-up path to a feature pyramid network in order to improve the feature extraction capability and obtain more accurate multi-scale proposals.

B. WEAKLY-SUPERVISED OBJECT DETECTION METHODS
The localization of objects can also be estimated by using visualization methods, which have localization capabilities despite not being explicitly trained to do so. These approaches use a standard CNN trained for classification and analyze the feature maps (also called activation maps), which are the output activations of each convolutional filter. Some of these methods also consider error gradients in order to highlight those locations that have made the greatest contribution to the prediction of a particular class. Their output (namely saliency maps or class-activation maps) serves to visually analyze what a network has learned and also to localize objects within the image.

One of the first methods for weakly-supervised object localization from a CNN was proposed in [36]. This approach performs a single backpropagation (BP) pass to obtain the true gradient, which masks out negative bottom data entries via the forward ReLU [41]. The class-activation map for an input image and a given class is computed as the average of the gradients for the filters whose feature map value is positive. In the context of this paper, we shall denominate this method as BP. A more recent technique, denominated Class Activation Mapping (CAM), was proposed in [42]. In this case, the feature maps of the last convolutional layer are spatially pooled using a Global Average Pooling (GAP) [43] operation and are linearly transformed using the weights learned from the final layer in order to obtain the class-activation map.

The main issue of CAM is that it is necessary to adapt architectures with fully-connected layers in order to use this method, and also that it requires the retraining of multiple linear classifiers (one for each class) after the initial model has been trained. Grad-CAM (Gradient-weighted Class Activation Mapping) [44] was introduced to overcome these limitations and to enable its use with any CNN architecture without having to adapt it.

The proposed method belongs to this group, since it makes use of the feature maps learned by a CNN and the gradient obtained for each of the activation maps with respect to the input layer. However, this method introduces a filter selection process that uses only those filters that detect the target class with greater precision. It also combines the selected filters in order to improve the accuracy of the location and remove possible false positives.

III. METHOD
Previous weakly-supervised localization methods can help show the regions of the image that make the greatest contribution to the classification of a particular class. However, a CNN tends to focus on more elements than the main target to be searched, as some of these elements may contribute to the classification decision. For example, in our case, in addition to the ships, the network can detect whether there is sea or coast.

Some examples of this problem can be seen in Figure 1. The first row shows the original input image, while the second and third rows show the saliency maps after the application of backpropagation [36] and Grad-CAM [44], respectively. Figures 1(a) and (e) clearly show how the attention of the network focuses on locations other than the ship targets. In addition, depending on the architecture and the selected layer, the precision of the localization may be very poor when the layer activation is high in a wide zone of the input image (see Figures 1(b), (c) and (d)).

FIGURE 1. Examples of saliency maps from backpropagation [36] and Grad-CAM [44]. The first row shows the original images, in which ships are marked with a bounding box.

This occurs because a feature hierarchy is learned in the different convolutional layers of the CNN, from the low-level features (such as edges, corners, etc.) to the last convolutional layers (which are usually those employed to calculate the heatmaps or visual saliency), from which high-level features are obtained. However, in the last layers, filters are usually activated by different elements in the image, and the classification is eventually performed by using a combination of activations. This means that, for classification, some filters are activated that do not necessarily contain the target object, but rather other elements in the image that help perform the classification.

Figure 2 shows a subset of feature maps from different filters for a sample image, which contains coast and one ship.
As can be seen, most of the filters do not have activations in the target location, and those filters that do detect the ship also have activations for other elements of the image which may be helpful for classification.

FIGURE 2. Example of the activations (feature maps) obtained when classifying a coast sample containing a ship. The top-left image shows the input sample, and the others display activations of a random subset of filters from the VGG-16 last convolutional layer. Ships are marked with a bounding box only if the activation detected it correctly.

A. SCHEME OF THE PROPOSED METHOD
The objective of the proposed approach is to use only those filters with high activation values for the target object. Figure 3 shows a scheme of the method. First, a categorical CNN is trained for classification. Once the weights have been learned, a Filter Selection process is performed to select the set of filters that maximize the precision as regards the location of the target class. Finally, in the inference stage, a new image is classified using the CNN and, if the predicted class corresponds to the target class, the subset of filters selected in the previous stage is used to calculate its position in the image.

Steps 1 (Train CNN) and 2 (Fit FS) of the scheme in Figure 3 correspond to the training stage of the method, while step 3 corresponds to the inference stage, once training has finished. Details of the steps in this method are provided in the following sections.

B. STEP 1 – TRAIN THE CNN
In this first step, a categorical CNN is trained for classification. In the experimentation, we evaluated two widely known CNN topologies for categorical classification, VGG-16 and VGG-19 [20]. These two architectures were selected because they obtained good results for the classification of this dataset (close to 100%, as will be seen in the evaluation section), and also because they are frequently used as a basis for localization methods such as SSD [29], Faster R-CNN [30] and CAM [42], among others.

VGG-16 has 13 convolutional and 3 fully-connected layers, whereas VGG-19 is composed of 16 convolutional and 3 fully-connected layers. Both topologies use dropout [45], max-pooling [46] and ReLU [41] activation functions.

Fine-tuning was performed for training: the networks were initialized with the pre-trained weights from the ILSVRC dataset,1 and then trained with the classes from our dataset. This process usually speeds up training and obtains better results when the domains are similar [47]. The last fully-connected layer of the pretrained networks was modified to match the number of classes in our dataset, as is usual in transfer learning tasks.

Training was performed by means of standard backpropagation using Stochastic Gradient Descent [48], considering the adaptive learning rate method proposed in [49]. In the backpropagation algorithm, categorical cross-entropy was used as the loss function between the CNN output and the expected result. The training stage lasted a maximum of 500 epochs, with early stopping when the loss did not decrease during 10 epochs. The mini-batch size was set to 32 samples.

1 ILSVRC is a 1,000-class subset of ImageNet [46], a generic-purpose database for object classification.
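The loss and update rule used in this training stage can be illustrated with a minimal NumPy sketch of one Stochastic Gradient Descent step on a linear softmax classifier under categorical cross-entropy. This is only an illustration of the optimization described above, not the actual VGG training code; the array sizes, learning rate, and function names (`sgd_step`, `xent`) are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stabilised softmax over the class axis
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def xent(W, x, y_onehot):
    # Categorical cross-entropy between predictions and one-hot targets
    p = softmax(x @ W)
    return -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))

def sgd_step(W, x, y_onehot, lr=0.1):
    # One SGD update: for softmax + cross-entropy, the gradient of the
    # loss w.r.t. the logits is simply (p - y), backpropagated to W.
    p = softmax(x @ W)
    grad = x.T @ (p - y_onehot) / len(x)
    return W - lr * grad
```

In the actual method this update is applied per mini-batch of 32 samples, with the adaptive learning rate of [49] and early stopping on the loss.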
FIGURE 3. Scheme of the proposed method.

C. STEP 2 – FIT FILTER SELECTION
Once the CNN had been trained for classification, we proceeded to fit the filter selection algorithm. This process basically consisted of selecting the most relevant filters of this network for the location of a target class c. This was done by calculating the subset of filters F^c ⊆ F, from all the possible filters F in the selected convolutional layer l, whose average localization results were over a given threshold α. This subset of filters was subsequently used in the inference stage to obtain the location of targets in unseen images.

In order to obtain the subset F^c, we first calculated the localization results obtained for each filter f ∈ F. This was done by computing the prediction set P_{l,f}^{(i)} for an input image i and a filter f from layer l, as follows:

    P_{l,f}^{(i)} = Blobs((G̃_{l,f}^{(i)} > β) ⊕ s)    (1)

where the set G̃_{l,f}^{(i)} contains the normalized gradients in the range [0, 1] obtained for the filter f in layer l (see Equation 2). Only those values over a threshold β were selected from these gradients, thus allowing us to obtain a binary matrix of the same size, R^(w×h) → [0, 1]^(w×h), where w and h are the width and height of the input image, respectively. A dilation morphological operation (denoted by ⊕) was then applied with a structuring element s. Since the noise had been removed by the thresholding operation, this dilation was intended to close small gaps and increase the size of the detections after thresholding. Finally, the function Blobs calculated the groups of connected pixels (or blobs), returning a list of bounding boxes containing the detected blobs.

The gradients G_{l,f}^{(i)} of the filter f of layer l with respect to an input image i were computed by performing a single backpropagation pass, calculating the partial derivative of the activation map A_{l,f}^{(i)} (also known as the feature map) obtained for the filter f with respect to the input image space I, evaluated at the image I^(i). The gradients obtained were then rescaled to the range [0, 1] using the function r, as follows:

    G̃_{l,f}^{(i)} = r( ∂A_{l,f}^{(i)} / ∂I |_{I^(i)} )    (2)

where A_{l,f}^{(i)} represents the activation map obtained by the filter f of layer l when the input image i is processed by the previously trained CNN.

As stated previously, the normalized gradients G̃_{l,f}^{(i)} were used in Equation 1 to calculate the prediction set P_{l,f}^{(i)} by selecting only the higher activations.

Once the prediction set P_{l,f}^{(i)} had been obtained for all the selected input images I^c of a given class c, it was possible to calculate the subset of filters F^c that would be used to predict the location of that class in the inference stage.
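The pipeline of Equations (1)–(2) — rescale the gradients to [0, 1], threshold at β, dilate, and extract the bounding boxes of the connected blobs — can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; the function names (`prediction_set`, `dilate`, `blobs`) and the 3×3 structuring element are assumptions.

```python
import numpy as np

def dilate(mask):
    """Binary dilation with a 3x3 square structuring element (the '⊕ s' step)."""
    p = np.pad(mask, 1)
    return (p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:] |
            p[:-2, :-2] | p[:-2, 2:] | p[2:, :-2] | p[2:, 2:] | p[1:-1, 1:-1])

def blobs(mask):
    """Return bounding boxes (x0, y0, x1, y1) of 8-connected components."""
    h, w = mask.shape
    visited = np.zeros_like(mask, bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not visited[y, x]:
                stack, ys, xs = [(y, x)], [], []
                visited[y, x] = True
                while stack:
                    cy, cx = stack.pop()
                    ys.append(cy); xs.append(cx)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w and
                                    mask[ny, nx] and not visited[ny, nx]):
                                visited[ny, nx] = True
                                stack.append((ny, nx))
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

def prediction_set(grad, beta=0.5):
    """Equation (1): P = Blobs((G̃ > β) ⊕ s), with G̃ rescaled to [0, 1]
    as in Equation (2)'s function r."""
    g = (grad - grad.min()) / (grad.max() - grad.min() + 1e-8)
    return blobs(dilate(g > beta))
```

A production implementation would typically use library routines for the dilation and labeling steps; the explicit loops here only make the operations self-contained.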
To do this, the Intersection over Union (IoU) between the prediction set P_{l,f}^{(i)} and the ground truth was computed for all the images in that class. Only those filters whose average IoU was greater than a threshold α were selected. Formally, the subset of filters F^c is calculated as follows:

    F^c = { f ∈ F | (1/|I^c|) Σ_{i=1}^{|I^c|} IoU(P_{l,f}^{(i)}, B_g^{(i)}) > α }    (3)

where B_g^{(i)} are the ground-truth localizations for the image i and class c, and |I^c| represents the cardinality of the set I^c containing the input images of class c.

In order to calculate the IoU of the predictions obtained for an input image i, each predicted bounding box from the set P_{l,f}^{(i)} was mapped onto the ground-truth bounding box (B_g^{(i)}) with which it had the maximum IoU overlap. A detection was considered to be positive if the area overlap ratio between the predicted bounding box and the ground-truth bounding box exceeded a certain threshold λ, according to Equation 4:

    IoU(P_{l,f}^{(i)}, B_g^{(i)}) = area(P_{l,f}^{(i)} ∩ B_g^{(i)}) / area(P_{l,f}^{(i)} ∪ B_g^{(i)})    (4)

where area(P_{l,f}^{(i)} ∩ B_g^{(i)}) denotes the intersection between the object proposal and the ground-truth bounding box, and area(P_{l,f}^{(i)} ∪ B_g^{(i)}) denotes their union.

Once this stage had been completed, the selected subset of filters F^c for each target class c was stored to be used in the inference stage for unseen images.

The influence of the different configuration parameters of the proposed method is evaluated in Section V-B, which also provides a summary of the values selected.
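The filter-selection rule of Equations (3)–(4) can be sketched as follows, assuming predictions and ground truth are given as (x0, y0, x1, y1) bounding boxes. The data layout (`predictions[f][i]` as the boxes predicted by filter f on image i) and the function names are illustrative assumptions, not taken from the paper.

```python
def iou(a, b):
    """Equation (4): intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    if ix1 < ix0 or iy1 < iy0:
        return 0.0
    area = lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1)
    inter = (ix1 - ix0 + 1) * (iy1 - iy0 + 1)
    return inter / float(area(a) + area(b) - inter)

def select_filters(predictions, ground_truth, alpha=0.5):
    """Equation (3): keep the filters whose mean best-match IoU over the
    training images of the target class exceeds the threshold alpha.
    predictions[f][i] -> list of boxes predicted by filter f on image i;
    ground_truth[i]   -> list of labelled boxes for image i."""
    selected = []
    for f, per_image in predictions.items():
        scores = []
        for preds, gts in zip(per_image, ground_truth):
            # Each predicted box is matched to the ground-truth box with
            # which it has the maximum overlap.
            best = [max((iou(p, g) for g in gts), default=0.0) for p in preds]
            scores.append(sum(best) / len(best) if best else 0.0)
        if sum(scores) / len(scores) > alpha:
            selected.append(f)
    return selected
```

Only a handful of labelled images per class are needed to fit this selection, which is what keeps the labeling cost of the method low.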
D. STEP 3 – INFERENCE STAGE
Once steps 1 and 2 (corresponding to the training stage) have been completed, it is possible to use the proposed method to calculate the location of the ships. In the inference stage (see Figure 3), an input sample is forwarded through the trained model and, if the prediction for any of the target classes is positive (in our case, if a ship is detected), the feature maps of the network for that class are used to obtain its precise localization.

This is done by following the same steps as in Equation 1, but performing the sum of the gradients obtained from the selected subset of filters F^c. The function FS(i, c, l) calculates the localization of targets of a given class c for an input image i, using the pre-calculated subset of filters F^c from layer l, as follows:

    FS(i, c, l) = Blobs( ( (1/|F^c|) Σ_{f=1}^{|F^c|} G̃_{l,f}^{(i)} > β ) ⊕ s )    (5)

where |F^c| represents the cardinality of the set F^c.

As can be seen, Equation 5 is similar to Equation 1. However, Equation 1 calculates the prediction for a single filter, whereas Equation 5 performs the combination of the set of selected filters F^c.

Note that, during the inference stage, the proposed approach performs classification and localization simultaneously, as it is based on the filter activations obtained by classifying the image; that is, it is not necessary to perform any additional forward pass of the image through the network in order to calculate the localization.
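The combination step of Equation (5) — averaging the rescaled gradient maps of the selected filters F^c before thresholding and dilating — might be sketched as below. This is a minimal NumPy sketch under assumed names (`fs_localize`, a dict of per-filter gradient maps); the final Blobs step, identical to that of Equation (1), is omitted here.

```python
import numpy as np

def fs_localize(grad_maps, selected, beta=0.5):
    """Equation (5) up to the Blobs step: average the rescaled gradient maps
    of the selected filters, threshold at beta, then apply a 3x3 dilation.
    grad_maps maps filter index -> 2-D gradient map for the input image."""
    def rescale(g):  # the function r of Equation (2)
        return (g - g.min()) / (g.max() - g.min() + 1e-8)

    mean = sum(rescale(grad_maps[f]) for f in selected) / float(len(selected))
    mask = mean > beta              # keep only the strongest combined activations
    p = np.pad(mask, 1)             # 3x3 binary dilation (the '⊕ s' step)
    return (p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:] |
            p[:-2, :-2] | p[:-2, 2:] | p[2:, :-2] | p[2:, 2:] | p[1:-1, 1:-1])
```

Averaging before thresholding is what suppresses activations that only a few filters produce, which is how the combination removes false positives.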
ever, Equation 1 calculates the prediction for a single filter, We evaluated the proposed method by creating two sets, whereas Equation 5 performs the combination of the set of one denominated as ‘‘simple set’’, which included all the selected filters F c . classes, with the exception of the samples from ‘‘multi’’, and 96572 VOLUME 7, 2019 S. Alashhab et al.: Precise Ship Location With CNN Filter Selection From Optical Aerial Images FIGURE 4. Image examples of different classes from the MASATI v2 dataset. The first four rows show the ship classes, while the three lower rows show the non-ship categories. The dataset samples are highly varied and the categories ‘‘Ships’’, ‘‘Ships & coast’’ and ‘‘multi’’ contain challenging images. TABLE 1. Distribution of the classes in the MASATI v2 dataset. analyzing its behavior when multiple objects from the same class appeared. Each of these two sets was divided into two, using 80% of the samples for training and the remaining 20% for the evaluation. These partitions did not overlap (i.e., the test set did not contain any of the samples seen during training) and the same percentage of samples of each class was kept in each partition. The same training and validation partitions were used to perform the experiments with all the methods, including the compared approaches. With regard to the ship categories, we manually labeled bounding boxes with the exact locations of each ship in the another denominated as ‘‘complex set’’, which included all image. This was done by using the LabelImg2 tool, which the classes. This allowed us to carry out a better evaluation of the proposed method by first analyzing the precision of 2 Tzutalin. LabelImg. Git code (2015). https://github.com/tzutalin/ the detection of a single instance of small objects and then labelImg VOLUME 7, 2019 96573 S. Alashhab et al.: Precise Ship Location With CNN Filter Selection From Optical Aerial Images generates XML files in PASCAL VOC format. These data TABLE 2. 
V. EXPERIMENTS
In this section, we show the experimentation carried out for the different parts of the proposed method using the dataset described in Section IV. We first evaluated the first stage of the process, i.e., the results of the categorical CNN network, after which we analyzed the filter selection process by evaluating different parameter values. Finally, we compared the results obtained by the proposed approach with those of other state-of-the-art methods.

A. CATEGORICAL CNN
The first step of the proposed method involves training the categorical CNN network. As indicated in Section III-B, this is done by training the VGG-16 and VGG-19 networks using the dataset and the classes described in Section IV.

In order to evaluate the performance of this experiment, three evaluation metrics that are widely used for classification were chosen: Precision, Recall and F-measure (F1). These metrics can be calculated using the following equations:

Precision = TP / (TP + FP) (6)
Recall = TP / (TP + FN) (7)
F1 = 2·TP / (2·TP + FN + FP) (8)

where TP (True Positives) denotes the number of positive class samples correctly classified, FN (False Negatives) denotes the number of positive class samples that were misclassified, and FP (False Positives) denotes the number of predictions of the positive class that are incorrect.

TABLE 2. Results obtained with the categorical CNN for the two sets considered (simple and complex sets).

Table 2 shows the average results (in percentages) obtained for the two sets considered (simple and complex sets). As can be seen, both networks obtain excellent average results, close to 100%, when discriminating between the different classes in the simple set. Reliable results are also obtained in the case of the complex set, although they are slightly lower owing to the complexity of the new samples. Note that the VGG-16 network obtains better results than VGG-19 for the simple set, but that this behavior is reversed for the complex set. This difference may be motivated by the complexity of the network and the number of parameters to learn, since the larger network may overfit on simple data, although in this case the differences are very small.

Having shown how the networks to be used are trained, we shall now evaluate the second step of the proposed method: the filter selection stage.

B. FILTER SELECTION
In this section, we evaluate the filter selection process by analyzing the influence of the different hyperparameters. In order to simplify this analysis, we use the simple set, given that the results obtained for the complex set and the observed trends for the different hyperparameters were quite similar. Finally, the results obtained for the complex set are also reported.

The categorical networks trained for the simple set in the previous section are now employed to analyze the localization accuracy for the ship class obtained using the proposed method. In this case, we have merged the samples from the ''ships'' and ''coast & ship'' classes. This is because, as can be seen in Figure 3, the image is classified first and, in the case of obtaining a class that contains a ship, the filter selection method is used to recover its position in the image. The ''detail'' category was used only to improve the accuracy of the categorical networks (in order to provide more examples of ships at different scales). Since the ships in this class are centered and occupy almost the entire image, finding their location is not an issue.

The results are also evaluated using the F1 metric (Equation 8), but in this case we measure the objects (or ships) whose location was correctly detected. This is done by calculating the bounding box of each predicted object (P), which is then paired with the ground-truth bounding box (B) with which it has the highest IoU (using Equation 4). A predicted bounding box P is considered to be properly localized if IoU(P, B) ≥ λ. In this case, we set λ = 0.5 (a threshold value commonly used in this type of task, such as in PASCAL VOC), and calculate the metric F1, considering the correct detections to be TP (when IoU ≥ λ), the wrong detections to be FP (i.e., when a P does not overlap with any B), and those cases in which a ground-truth object is not detected to be FN. Note that if multiple detections of the same object are predicted, only the first one is counted as a positive, while the rest are counted as negatives.
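The localization-level evaluation just described (greedy IoU pairing with λ = 0.5, duplicates of an already-matched object counted as FP, and Equations 6–8 over the resulting counts) can be sketched as follows. The function names and the (xmin, ymin, xmax, ymax) box convention are our assumptions, not part of the paper's code:

```python
def iou(p, b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0, min(p[2], b[2]) - max(p[0], b[0]))
    iy = max(0, min(p[3], b[3]) - max(p[1], b[1]))
    inter = ix * iy
    union = (p[2]-p[0])*(p[3]-p[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def localization_f1(preds, gts, lam=0.5):
    """Pair each prediction with the free ground truth of highest IoU;
    count TP (IoU >= lam), FP (no valid match), FN (unmatched ground truth)."""
    matched = set()
    tp = fp = 0
    for p in preds:
        best, best_iou = None, 0.0
        for i, b in enumerate(gts):
            v = iou(p, b)
            if i not in matched and v > best_iou:
                best, best_iou = i, v
        if best is not None and best_iou >= lam:
            matched.add(best)   # later duplicates of this object count as FP
            tp += 1
        else:
            fp += 1
    fn = len(gts) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0   # Equation 6
    recall = tp / (tp + fn) if tp + fn else 0.0      # Equation 7
    f1 = 2*tp / (2*tp + fn + fp) if tp else 0.0      # Equation 8
    return precision, recall, f1

# One good detection and one spurious one against a single ground truth.
print(localization_f1([(0, 0, 10, 10), (50, 50, 60, 60)],
                      [(1, 1, 10, 10)]))  # → (0.5, 1.0, 0.6666666666666666)
```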
We first evaluated the influence of the size of the training set used to select the filters. This was done by conducting an experiment with an incremental training set, i.e., we took only one training image, performed the filter selection process and evaluated the result obtained. This process was then repeated with two training images, and so on, until 100 images had been evaluated (we stopped the experiment at this size since the results did not improve). In order to evaluate only the influence of the training size, we froze the remaining parameters of the method, selecting the penultimate convolution layer of each network, a size |F_c| = 4, β = 0.8 and a square structuring element of size 7 × 7. Figure 5a shows the result of this experiment. As can be seen, in both cases it is sufficient to label between 14 and 20 training images with bounding boxes to obtain the best accuracy; we therefore eventually set the size of the training set to only 14 labeled images.

FIGURE 5. Localization results (F1 %) obtained by varying (a) the size of the training set and (b) the layer of the network from which the filters are selected (given as a percentage of the total network depth).

Figure 5b shows the influence of the CNN layer selected for the localization calculation (variable l of Equation 3). This was done by computing the result obtained with all the layers from each of the two network models evaluated, while the remaining parameters were set to the aforementioned values. Since the VGG-16 network has a total of 18 layers and VGG-19 has 21, in this figure we represent the result as a function of the layer depth, where 100% of the depth means the last layer of the network. As can be seen, the results in the first part of both networks (up to 40% of depth) were not good. However, as expected, better localization results were obtained in the last layers of the networks (from 70% of depth), from which the higher-level representations of the images are extracted. The layer ''block5_conv2'' was, therefore, eventually selected for VGG-16, and ''block4_conv3'' for VGG-19 (see the full network architectures in [20]).

Another important variable to analyze is the number of filters selected in order to obtain the localization, that is, the size of the set |F_c| in Equation 5, which can be adjusted by modifying the threshold value α. For this experiment, we also used the best layer previously selected (which, in both cases, contains 512 filters), a training set of 14 images, β = 0.8 and a structuring element of 7 × 7. Figure 6a shows the results obtained by varying the number of filters used to calculate the location. As can be seen, a maximum is obtained for the two network models when using between 3 and 5 filters, and the best results are, in both cases, obtained with 4 filters. These filters were selected by setting the α threshold to 0.458 for VGG-16 and 0.454 for VGG-19.

FIGURE 6. (a) Localization results (F1 %) obtained when varying the number of filters in the set |F_c|. Figures (b) and (c) show the histogram with the average IoU obtained by each of the filters in the selected layer from the VGG-16 and VGG-19 networks, respectively.

Figures 6b and 6c show a histogram of the average IoU obtained for each of the filters in the selected layer of the VGG-16 and VGG-19 networks, respectively. The vertical coordinate of this graph was truncated to a maximum value of 30 filters in order to facilitate its visualization, since the first column, corresponding to the range [0, 0.005] of IoU, contains 33% of the filters of VGG-16 and 6% of those of VGG-19. Upon analyzing the results of VGG-16, it will be noted that 63.87% of the filters have an IoU lower than 0.2 and that only 13.48% exceed 0.4, with the maximum value obtained by an individual filter being 0.4603. In the case of VGG-19, a lower percentage of filters does not exceed 0.2 (53.52%) and more filters exceed 0.4 (25.00%), with a very similar maximum value of 0.4604.

Another parameter to be analyzed is the threshold β (see Eqs. 1 and 5). As before, we set the rest of the parameter values to the best ones found and varied only this parameter in the range [0, 1]. Figure 7a shows that better results are obtained with higher values for this threshold, i.e., when selecting only those pixels with the highest activations. The specific value selected for VGG-16 was β = 0.94, while that for VGG-19 was β = 0.82.

Finally, we also analyzed the influence of the size of the structuring element s (see Eqs. 1 and 5) that is used for the dilation of the result obtained from the filters' activations before calculating the bounding box with the position of the detected objects. The influence of this parameter was assessed by varying the size of the structuring element between 3 × 3 and 13 × 13, and setting the remaining parameters to the best ones found in the previous experiments. Figure 7b shows the result of this experiment. As can be seen, the result remains fairly stable when varying this parameter, and improves only slightly for the kernel size 7 × 7, which is why we eventually selected this size.

FIGURE 7. Localization results (F1 %) when varying (a) the threshold β and (b) the size of the structuring element s.

TABLE 3. Configuration for each of the networks with the Filter Selection method obtained for the simple and complex sets. The size of the training set is not a parameter of the algorithm, but it was evaluated to analyze its influence on the results. The α parameter also includes the number of selected filters in parentheses.

Table 3 shows the best hyperparameters found after carrying out the experimentation for both the simple and the complex sets. As can be seen, the configurations obtained with each of the networks for the two sets are very similar. In both, the same number of training images was used, the same layer was employed to extract the filters, and the same kernel size was utilized. Variations occur only as regards the number of filters selected and the threshold β. In the case of the complex set, it would appear to be beneficial to combine more filters so as to obtain a more precise detection. With respect to the threshold β, a high value allows only the most likely detections to be selected, so when the number of targets is small (as in the simple set) it is better to use a higher value; in the case of the complex set, however, it is better to reduce this threshold slightly when attempting to detect many more targets.
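The localization pipeline that these hyperparameters control (normalize each selected filter's gradient map, sum them, threshold at a fraction β of the maximum, dilate with an s × s square structuring element, and take the bounding box) can be sketched in a few lines of NumPy. This is our illustrative reimplementation under those assumptions, not the authors' code, and for brevity it returns a single box for the whole mask rather than one box per connected component:

```python
import numpy as np

def locate(grad_maps, beta=0.94, s=7):
    """Combine per-filter gradient maps into one (xmin, ymin, xmax, ymax) box.

    Each map is min-max normalized and the maps are summed; the sum is
    thresholded at beta times its maximum; the binary mask is dilated
    with an s x s square structuring element; the bounding box of the
    remaining pixels is returned.
    """
    combined = np.zeros_like(grad_maps[0], dtype=float)
    for g in grad_maps:
        rng = g.max() - g.min()
        combined += (g - g.min()) / rng if rng else 0.0
    mask = combined >= beta * combined.max()
    # Dilation implemented as a sliding logical-OR over the s x s neighborhood.
    r = s // 2
    padded = np.pad(mask, r)
    dil = np.zeros_like(mask)
    for dy in range(s):
        for dx in range(s):
            dil |= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    ys, xs = np.nonzero(dil)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy example: two maps that both peak at pixel (5, 5) of a 10 x 10 grid.
demo = np.zeros((10, 10))
demo[5, 5] = 1.0
print(locate([demo, demo], beta=0.9, s=7))  # → (2, 2, 8, 8)
```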
Figure 8 shows an example of the filters obtained for an input image, along with the process of adding up the results until the final prediction is obtained. A challenging image of a coast with a ship (located in the upper-left part) has been selected, for which most of the methods compared (as will be seen in the next section) make mistakes. The first row of this figure shows the input image and the output obtained, while the second row shows the gradient obtained for the four filters selected. The last row shows, in the first column, the result for filter 35, and in the following columns, the result of the incremental sum with the previous predictions. A higher activation value is indicated using dark red. As will be noted, when using only filter number 35, the prediction made is wrong (it detects a coast projection as a ship), but thanks to the combination of the four filters, the algorithm correctly detects the position of the ship.

FIGURE 8. Example of the process performed to calculate the location of a ship. The first row shows the input and the output images, marking the bounding box of the detected ship in the upper-left part. The second row shows the gradients obtained by the four selected filters for the input image. The third row shows the process of incrementally adding up the result obtained. A higher activation value is indicated in dark red.

C. COMPARISON
Having analyzed the different parameters of the proposed method and determined the best configuration (see Table 3), we will now compare the results obtained for the simple and complex sets with those of other state-of-the-art methods. In particular, we have compared our approach with the following methods (already described in the introduction):

• Visual saliency with backpropagation (BP) [36], using the VGG-16 and VGG-19 models.
• SelAE [5]: This approach uses a Selectional Auto-Encoder (SAE) network specialized in the segmentation of oil spills. It returns a probability distribution to which a threshold is applied in order to select the pixels to be segmented.
• Faster R-CNN (FRCNN) [30] and RetinaNet [33], which yielded competitive results for ship detection in SAR images in [50]. Both models use a ResNet50 network initialized with pre-trained weights from ILSVRC. Training included data augmentation, and the size of the anchors was adjusted to the average size of the bounding boxes from our dataset in order to improve the accuracy with small objects.
• YOLO v2 [34], YOLO v3 [35] and YOLT [7], initialized with pre-trained weights from ILSVRC. These models were also trained with data augmentation, adjusting the size of the anchors as with the previous methods. In the case of YOLT, the parameter ''min retain prob'' was set to 0.35, as [7] stated that the highest F1 score was obtained using values of between 0.3 and 0.4.

For this comparison, in addition to the metrics (Precision, Recall and F1) previously used at the object detection level, we show the average value of the IoU obtained, along with the Average Precision (AP), given that these metrics are widely used to evaluate object detection methods, such as in the PASCAL VOC challenge. The most recent PASCAL challenge AP metric has been used (interpolating all points rather than using a fixed set of uniformly-spaced recall values) [51].

TABLE 4. Comparison of the results obtained for the simple set using the proposed method (VGG-16/19 + FS) and other state-of-the-art methods. These results were calculated by employing a threshold of λ = 0.5 in the IoU metric to consider a correct detection. The two best results for each metric are marked in bold type.
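The all-points interpolated AP used here can be computed directly from the ranked detections. The sketch below is our minimal implementation, assuming detections are already sorted by confidence and flagged as TP or FP against a known number of ground-truth objects:

```python
def average_precision(tp_flags, n_gt):
    """All-points interpolated AP: area under the precision-recall curve,
    with precision made monotonically non-increasing from right to left."""
    recalls, precisions = [0.0], [1.0]
    tp = 0
    for i, is_tp in enumerate(tp_flags, start=1):
        tp += is_tp
        recalls.append(tp / n_gt)
        precisions.append(tp / i)
    # Interpolate: p(r) = max precision observed at any recall >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area of each recall segment under the interpolated curve.
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * precisions[i]
    return ap

# Three detections ranked by confidence (TP, FP, TP), with 2 ground truths.
print(average_precision([1, 0, 1], 2))  # → 0.8333333333333333
```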
This metric calculates the mean value in the recall interval [0, 1], which is equivalent to the area under the curve (AUC) of the Precision-Recall curve (PRC).

Table 4 shows the results of the comparison with other methods obtained for the simple set. For each metric, the two best results are marked in bold type. In general, the proposed method appears among the two best in all the metrics, with the exception of the average IoU, although this indicates only that the accuracy of the detected area is slightly lower, it being necessary to analyze the other metrics in order to count the number of correct detections. Upon observing F1, it will be noted that the best results are obtained with VGG19+FS (our proposal) and with YOLO v3, and that the proposed method is 1.19% better than YOLO v3. Note that the proposed approach was trained using only 14 images labeled with the location, while YOLO v3 used the entire dataset with the bounding boxes. In the case of the AP metric, note that the proposed method obtained the best results for the two network models to which it was applied.

With regard to the results obtained for the complex set (see Table 5), the proposed method also obtained competitive results. The best results with the F1 metric were obtained by VGG-19+FS, followed by YOLT. The result obtained with VGG-16+FS was also, in this case, among the best. The YOLO v2 and v3 methods did not perform so well when dealing with multiple targets, and in this case, other approaches that are more oriented toward the detection of multiple small objects, such as YOLT or SelAE, obtained better results. The latter were the two that obtained the best results for the AP metric, although the proposed approach also obtained results close to them. It should be noted that the remaining methods used the complete training set, labeled with the location of the ships, while the proposed method used only 14 labeled images for this purpose.

We also analyzed the capability of the different methods evaluated to generalize when processing images with a different number of targets from that which they were trained to detect, and also verified whether they can extrapolate the knowledge learned using a small subset of the training data to the full dataset. The results of this experiment are shown in Table 6. In this case, we analyzed only the F1 and AP metrics for each of the tests performed.

The first two columns in this table show the results obtained when evaluating the different methods with the complex set but using the models trained with the simple set. In this case, the methods that best generalize are SelAE and VGG19+FS, with the latter being only 1.6% of F1 below. Please recall that SelAE was trained using all the images and by applying data augmentation (signifying that this may help it to generalize better). As shown previously, some methods, such as YOLO and BP, are very dependent on the training set and cannot generalize well when processing images with a greater number of targets, even though the objects and the type of images are the same (up to 50% worse in the case of YOLO v3 or 46% in the case of VGG16+BP).

In the central and last columns of this table, the learning and generalization capabilities are evaluated by training the different methods on a reduced set of data (using the same number of images as the proposed method, that is, only 14). It is, therefore, also possible to evaluate how the other methods behave when a large amount of training data is not available. As can be seen, the results obtained worsen considerably for all the methods compared, decreasing by between 30% and almost 70% in some cases. For the simple set (central columns), the compared method that works best with few data is YOLO v3, followed by RetinaNet, and for the complex set (columns on the right) it is YOLT, which obtains a fairly stable result in both cases. However, upon comparing these results with those of the proposed method, there is a very significant difference in favor of the latter, showing its generalization capability.

Figure 9 shows a comparison of the results obtained with the different methods. An example of each type of method is selected (see the columns in the figure) for some of the images that were most difficult. The bounding boxes of the detections obtained are marked for each result (TP in green and FP in red), and a colored circle has been added in a corner to indicate whether the detection was successful (green), whether the detection failed (red), or whether the targets were detected but false positives were also obtained (blue). As will be noted, the most reliable methods are SelAE, YOLO v3 and VGG19+FS, which detected all the targets and yielded only some FP. Some images that may appear to be simple, such as those shown in the 4th and 5th rows, are problematic for the BP, RetinaNet and YOLT methods, principally owing to the small size of the objects to be detected. The last two rows show examples for the multi class, and in this case, the SelAE, YOLT and VGG19+FS methods also obtain the best detection results.

FIGURE 9. Example of results obtained by the different methods, including the proposed method (VGG19 + FS). Examples of each type of method for some of the images that were most difficult are shown. The bounding boxes of the correctly detected ships (TP) are marked in green and the incorrect detections (FP) in red. FN are not marked. A colored circular indicator has also been added to facilitate the visualization of a correct (green), incorrect (red), or partially correct (blue) detection.
TABLE 5. Comparison of the results obtained for the complex set using the proposed method (VGG-16/19 + FS) and other state-of-the-art methods. These results were calculated by employing a threshold of λ = 0.5 in the IoU metric to consider a correct detection. The two best results obtained for each metric are marked in bold type.

TABLE 6. Evaluation of the generalization capabilities of the different methods analyzed. In the first columns, we compare the results obtained when training with the simple set but using the complex set (with more targets) for testing. The central and last columns show the results obtained when training with a reduced amount of data but evaluating on the full test set. For each metric and column, the two best results are marked in bold type. In all cases, a threshold of λ = 0.5 is used in the IoU metric to consider a correct detection.

VI. CONCLUSIONS
This work presents a weakly-supervised approach for object detection that can be applied to CNN classification models. The proposed method is specialized in the detection of small objects (that is, objects that occupy a very small percentage of pixels within the image) in satellite images. The localization is performed by applying a Filter Selection process in order to obtain the set of filters that allow the target class to be detected with the greatest precision. The gradients are calculated on these filters with respect to the input image, and are then normalized and combined. A thresholding and a morphological operation are subsequently applied to eventually obtain the location. This method makes it possible to adapt a network that has already been trained for classification into a network for object detection, using only a few images labeled with the corresponding bounding boxes for localization.

This approach was evaluated with an updated version of the MAritime SATellite Imagery (MASATI) dataset, which was extended for this work. We have specifically increased the number of samples from the 6,212 that were employed in the previous version of MASATI to 7,389 in this new version, principally by adding new samples to the ''coast & ship'' and ''multi'' classes. We have additionally labeled the ground truth with the location of the ships, which was not provided in the previous version.

The results obtained when analyzing the different parameters of the proposed method show that, in general, this method needs to be trained with between only 14 and 20 images containing the location of ships in order to obtain precise results. When employing more than 20 images, the score remains stable and there are no significant improvements. In addition, it was also observed that the best location results are obtained when using the last (deepest) layers of the network. With regard to the filters in the selected layer, the method needs to combine between only 4 and 5 filters to calculate the locations of the ships.

When compared to other state-of-the-art methods, the proposed approach is able to achieve the best average scores for the detection of a single target. It obtains similar results to YOLT and YOLO v3, but with the difference that it requires only a few labeled samples. When calculating the location of multiple targets, the method obtains reliable results. It yields the best results according to the F1 metric, and similar results to YOLT, SelAE and YOLO v3 according to the AP and the IoU metrics. In addition, when analyzing the generalization capacity by evaluating the method for the localization of multiple targets while using the model trained for single targets, or when training with a reduced set of images, the proposed method is also among those that obtain the best results.

As future work, we intend to carry out more exhaustive experiments with the proposed method by evaluating it with other generic object detection datasets, analyzing the results with larger targets, and also evaluating the extension to multi-class detection.

REFERENCES
[1] A.-J. Gallego, A. Pertusa, and P. Gil, ''Automatic ship classification from optical aerial images with convolutional neural networks,'' Remote Sens., vol. 10, no. 4, p. 511, Mar. 2018.
[2] A.-J. Gallego, A. Pertusa, P. Gil, and R. B. Fisher, ''Detection of bodies in maritime rescue operations using unmanned aerial vehicles with multispectral cameras,'' J. Field Robot., vol. 36, no. 4, pp. 782–796, 2019.
[3] J. Jiao, Y. Zhang, H. Sun, X. Yang, X. Gao, W. Hong, K. Fu, and X. Sun, ''A densely connected end-to-end neural network for multiscale and multiscene SAR ship detection,'' IEEE Access, vol. 6, pp. 20881–20892, 2018.
[4] J. Zhao, Z. Zhang, W. Yu, and T.-K. Truong, ''A cascade coupled convolutional neural network guided visual attention method for ship detection from SAR images,'' IEEE Access, vol. 6, pp. 50693–50708, 2018.
[5] A.-J. Gallego, P. Gil, A. Pertusa, and R. B. Fisher, ''Segmentation of oil spills on side-looking airborne radar imagery with autoencoders,'' Sensors, vol. 18, no. 3, p. 797, 2018.
[6] Z. Deng, H. Sun, S. Zhou, and J. Zhao, ''Learning deep ship detector in SAR images from scratch,'' IEEE Trans. Geosci. Remote Sens., vol. 57, no. 6, pp. 4021–4039, Jun. 2019.
[7] A. Van Etten, ''You only look twice: Rapid multi-scale object detection in satellite imagery,'' May 2018, arXiv:1805.09512.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ''ImageNet: A large-scale hierarchical image database,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248–255.
[9] F. Y. M. Lure and Y.-C. Rau, ''Detection of ship tracks in AVHRR cloud imagery with neural networks,'' in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Aug. 1994, pp. 1401–1403.
[10] J. M. Weiss, R. Luo, and R. M. Welch, ''Automatic detection of ship tracks in satellite imagery,'' in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), vol. 1, Aug. 1997, pp. 160–162.
[11] J. Zhang, K. Huang, Y. Yu, and T. Tan, ''Boosted local structured HOG-LBP for object localization,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 1393–1400.
[12] R. G. Cinbis, J. Verbeek, and C. Schmid, ''Multi-fold MIL training for weakly supervised object localization,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 2409–2416.
[13] S. J. Hwang and K. Grauman, ''Reading between the lines: Object localization using implicit cues from image tags,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 6, pp. 1145–1158, Jun. 2012.
[14] F. Yang, Q. Xu, and B. Li, ''Ship detection from optical satellite images based on saliency segmentation and structure-LBP feature,'' IEEE Geosci. Remote Sens. Lett., vol. 14, no. 5, pp. 602–606, May 2017.
[15] U. Kanjir, H. Greidanus, and K. Oštir, ''Vessel detection and classification from spaceborne optical images: A literature survey,'' Remote Sens. Environ., vol. 207, pp. 1–26, Mar. 2018.
[16] Y. LeCun, Y. Bengio, and G. Hinton, ''Deep learning,'' Nature, vol. 521, pp. 436–444, May 2015.
[17] F. Chollet, ''Xception: Deep learning with depthwise separable convolutions,'' Oct. 2016, arXiv:1610.02357.
[18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, ''Rethinking the inception architecture for computer vision,'' Dec. 2015, arXiv:1512.00567.
[19] K. He, X. Zhang, S. Ren, and J. Sun, ''Deep residual learning for image recognition,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[20] K. Simonyan and A. Zisserman, ''Very deep convolutional networks for large-scale image recognition,'' Sep. 2014, arXiv:1409.1556.
[21] F. Wu, Z. Zhou, B. Wang, and J. Ma, ''Inshore ship detection based on convolutional neural network in optical satellite images,'' IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 11, pp. 4005–4015, Nov. 2018.
[22] X. Yang, H. Sun, X. Sun, M. Yan, Z. Guo, and K. Fu, ''Position detection and direction prediction for arbitrary-oriented ships via multitask rotation region convolutional neural network,'' IEEE Access, vol. 6, pp. 50839–50849, 2018.
[23] Y. Yu, H. Ai, X. He, S. Yu, X. Zhong, and M. Lu, ''Ship detection in optical satellite images using Haar-like features and periphery-cropped neural networks,'' IEEE Access, vol. 6, pp. 71122–71131, 2018.
[24] S. Agarwal, J. O. D. Terrail, and F. Jurie, ''Recent advances in object detection in the age of deep convolutional neural networks,'' 2018, arXiv:1809.03193.
[25] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, ''Deep learning for generic object detection: A survey,'' 2018, arXiv:1809.02165.
[26] D. Chaves, S. Saikia, L. Fernández-Robles, E. Alegre, and M. Trujillo, ''A systematic review on object localisation methods in images,'' Revista Iberoamericana de Automática e Informática Industrial, vol. 15, no. 3, pp. 231–242, 2018.
[27] P. Pham, D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, ''Evaluation of deep models for real-time small object detection,'' in Neural Information Processing, D. Liu, S. Xie, Y. Li, D. Zhao, and E.-S. El-Alfy, Eds. Cham, Switzerland: Springer, 2017, pp. 516–526.
[28] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ''You only look once: Unified, real-time object detection,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 779–788, doi: 10.1109/CVPR.2016.91.
[29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, ''SSD: Single shot multibox detector,'' in Computer Vision—ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham, Switzerland: Springer, 2016, pp. 21–37.
[30] S. Ren, K. He, R. Girshick, and J. Sun, ''Faster R-CNN: Towards real-time object detection with region proposal networks,'' in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015, pp. 1–9.
[31] C. Eggert, S. Brehm, A. Winschel, D. Zecha, and R. Lienhart, ''A closer look: Small object detection in faster R-CNN,'' in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2017, pp. 421–426.
[32] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, ''Perceptual generative adversarial networks for small object detection,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1951–1959.
[33] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, ''Focal loss for dense object detection,'' IEEE Trans. Pattern Anal. Mach. Intell., to be published.
[34] J. Redmon and A. Farhadi, ''YOLO9000: Better, faster, stronger,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 6517–6525, doi: 10.1109/CVPR.2017.690.
[35] J. Redmon and A. Farhadi, ''YOLOv3: An incremental improvement,'' Apr. 2018, arXiv:1804.02767.
[36] K. Simonyan, A. Vedaldi, and A. Zisserman, ''Deep inside convolutional networks: Visualising image classification models and saliency maps,'' 2013, arXiv:1312.6034.
[37] Y. Bai and B. Ghanem, ''Multi-scale fully convolutional network for face detection in the wild,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 2078–2087.
[38] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, ''Deformable convolutional networks,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 764–773, doi: 10.1109/ICCV.2017.89.
[39] Q. Li, L. Mou, Q. Liu, Y. Wang, and X. X. Zhu, ''HSF-Net: Multiscale deep feature embedding for ship detection in optical remote sensing imagery,'' IEEE Trans. Geosci. Remote Sens., vol. 56, no. 12, pp. 7147–7161, Dec. 2018.
[40] G. Huang, Z. Wan, X. Liu, J. Hui, Z. Wang, and Z. Zhang, ''Ship detection based on squeeze excitation skip-connection path networks for optical remote sensing images,'' Neurocomputing, vol. 332, pp. 215–223, Mar. 2019.
[41] X. Glorot, A. Bordes, and Y. Bengio, ''Deep sparse rectifier neural networks,'' J. Mach. Learn. Res., vol. 15, no. 4, pp. 315–323, 2011.
[42] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, ''Learning deep features for discriminative localization,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2921–2929.
[43] M. Lin, Q. Chen, and S. Yan, ''Network in network,'' in Proc. Int. Conf. Learn. Represent. (ICLR), Apr. 2014, pp. 1–10.
[44] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, ''Grad-CAM: Visual explanations from deep networks via gradient-based localization,'' 2016, arXiv:1610.02391.
[45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, ''Dropout: A simple way to prevent neural networks from overfitting,'' J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jan. 2014.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ''ImageNet classification with deep convolutional neural networks,'' in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1–9.
[47] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, ''How transferable are features in deep neural networks?'' in Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Cambridge, MA, USA: MIT Press, 2014, pp. 3320–3328.
[48] L. Bottou, ''Large-scale machine learning with stochastic gradient descent,'' in Proc. COMPSTAT. Berlin, Germany: Springer, 2010, pp. 177–186.
[49] M. D. Zeiler, ''ADADELTA: An adaptive learning rate method,'' 2012, arXiv:1212.5701.
[50] Y. Wang, C. Wang, H. Zhang, Y. Dong, and S. Wei, ''Automatic ship detection based on RetinaNet using multi-resolution Gaofen-3 imagery,'' Remote Sens., vol. 11, no. 5, p. 531, 2019.
[51] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, ''The PASCAL visual object classes challenge: A retrospective,'' Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136, Jan. 2015.

ANTONIO-JAVIER GALLEGO received the B.Sc. and M.Sc. degrees in computer science, and the Ph.D. degree in computer science and artificial intelligence from the University of Alicante, in 2004 and 2012, respectively. He has been a Researcher on 10 research projects funded by both the Spanish Government and private companies. He is currently an Assistant Professor with the Department of Software and Computing Systems, University of Alicante, Spain. He has authored or coauthored 40 works published in international journals, conferences, books, and book chapters. His research interests include deep learning, pattern recognition, and computer vision.

ANTONIO PERTUSA received the B.Sc. degree in computer science and the Ph.D. degree from the University of Alicante, Spain, where he is currently an Associate Professor with the Department of Software and Computing Systems. He has been a Researcher on over 15 research and development projects funded by Spanish Government agencies and private companies. He has authored or coauthored more than 40 works in international journals, conferences, and book chapters. His research interests include signal processing, deep learning, and pattern recognition methods applied to computer vision, music information retrieval, remote sensing, and medical knowledge extraction. He is a member of the executive committee of the Spanish AERFAI Association and is the Secretary of the University Institute of Computing Research (IUII), University of Alicante.

PABLO GIL (M'12–SM'14) received the B.Sc. degree in computer science engineering and the Ph.D. degree from the University of Alicante, Alicante, Spain, in 1999 and 2008, respectively, where he is currently an Associate Professor with the Department of Physics, Systems Engineering, and Signal Theory. From 2016 to 2018, he was the Secretary of the Computer Science Research Institute, University of Alicante, and is currently the Head of that research institute.
He has also been a Researcher on over 18 research and development projects funded by the SAMER ALASHHAB received the B.Sc. degree European Commission, Spanish Government agencies, and private compa- in computer science from Al-Ahliyya Amman nies. He has authored or coauthored more than 100 works in international University, in 2001, the Higher Diploma degree journals (29 indexed in JCR), conferences, and book chapters. His research in computer information systems, and the M.Sc. interests include computer vision, 3-D vision, deep learning, and perception degree in computer information systems from the for robots. He was a Guest Editor of a special issue for the Journal of Sensors Arab Academy for Banking and Financial Sci- and is an Associate Editor of the International Journal of Advanced Robotics ences, Amman, Jordan, in 2004 and 2005, respec- Systems and Mathematical Problems in Engineering. He is a member of tively, and the master’s degree as an Expert in the Spanish Automatic Committee of IFAC and a Senior Member of the developing applications for smart devices from the IEEE Robotics and Automation Society, Education Society, and the IEEE University of Alicante, Spain, in 2013, where he Sensor Council. Since 2018, he has been the Secretary of the IEEE-RAS is currently pursuing the Ph.D. degree in computer science. He has worked Spanish Chapter. His awards and honors include the Teaching Excellence as a Researcher for the Computer Science Research Institute, University of Prize at the University of Alicante, in 2011, and the Best Paper Award in the Alicante. His research interests include pattern recognition, machine learn- 14th International Conference on Informatics in Control, Automation and ing, and artificial intelligence. He is a member of the Spanish Association of Robotics (ICINCO 2017). Recognition of Forms and Analysis of Images (AERFAI). 96582 VOLUME 7, 2019