This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1 Tactile Transfer Learning and Object Recognition With a Multifingered Hand Using Morphology Specific Convolutional Neural Networks Satoshi Funabashi , Member, IEEE, Gang Yan, Student Member, IEEE, Fei Hongyi , Student Member, IEEE, Alexander Schmitz , Member, IEEE, Lorenzo Jamone , Member, IEEE, Tetsuya Ogata , Member, IEEE, and Shigeki Sugano , Fellow, IEEE Abstract— Multifingered robot hands can be extremely effec- tive in physically exploring and recognizing objects, especially if they are extensively covered with distributed tactile sensors. Con- volutional neural networks (CNNs) have been proven successful in processing high dimensional data, such as camera images, and are, therefore, very well suited to analyze distributed tactile infor- mation as well. However, a major challenge is to organize tactile inputs coming from different locations on the hand in a coherent structure that could leverage the computational properties of the CNN. Therefore, we introduce a morphology-specific CNN Fig. 1. Examples of applications for multifingered hands. Picking objects (MS-CNN), in which hierarchical convolutional layers are formed with diverse object properties is difficult even though the motion seems following the physical configuration of the tactile sensors on the the same because the properties change the motion. During those kinds of robot. We equipped a four-fingered Allegro robot hand with multifingered tasks, acquiring object features through tactile information is several uSkin tactile sensors; overall, the hand is covered with important to achieve dexterous and stable manipulation. 240 sensitive elements, each one measuring three-axis contact force. The MS-CNN layers process the tactile data hierarchically: at the level of small local clusters first, then each finger, and then multiple fingers dexterously (see Fig. 1). To achieve such the entire hand. We show experimentally that, after training, the multifingered tasks stably and effectively, tactile sensing was robot hand can successfully recognize objects by a single touch, with a recognition rate of over 95%. Interestingly, the learned applied to a lot of tasks, such as grasp stability, detecting MS-CNN representation transfers well to novel tasks: by adding tactile events, and tactile exploration [1]. Those skills can also a limited amount of data about new objects, the network can be crucial for multifingered manipulation where quick tactile recognize nine types of physical properties. feedback or recognition is required. Index Terms— Convolutional neural network (CNN), multifin- Tactile sensing is considered to be complementary to other gered hand, object recognition, tactile sensing. sensing modalities [2], especially when there is visual occlu- sion. In camera-based situations [3], the hand needs to be simple in shape to avoid occlusions, but there is a limitation I. I NTRODUCTION in realizing difficult tasks, such as with multifingered hands. Therefore, for multifingered tasks, tactile sensing becomes M ULTIFINGERED hands are useful for the exploration and recognition of objects or environments by using a more important modality. In multifingered hand tasks, a diverse and relatively large area, including not only the Manuscript received 15 September 2021; revised 16 August 2022; fingertips but also the phalanges, comes into contact with the accepted 4 October 2022. This work was supported by the Japan Science and object, and the forces act in various directions as the fingers Technology Agency ACT-I Information and Future under Grant JPMJPR18UP and its Acceleration Phase under Grant JPMJPR18UP. (Corresponding author: touch the grasped object from different directions [4]. A mul- Satoshi Funabashi.) tifingered hand, such as a human mimetic hand, is capable Satoshi Funabashi is with the Institute for AI and Robotics, Future of performing multipurpose tactile tasks [5], [6], but it is Robotics Organization, Waseda University, Tokyo 169-8555, Japan (e-mail:
[email protected]). difficult to process such a rich amount of tactile information. Gang Yan, Fei Hongyi, Alexander Schmitz, and Shigeki Sugano are with the Much research has been done on how a robotic hand equipped Department of Modern Mechanical Engineering, Waseda University, Tokyo with tactile sensors can accomplish a task [7]. By considering 169-8555, Japan. Lorenzo Jamone is with the School of Electronic Engineering and Computer the grasping state and fingertip position analytically, it has Science, Queen Mary University of London, E1 4NS London, U.K. become possible to optimize the grasp and recognize the slip of Tetsuya Ogata is with the Department of Intermedia Art and Science, the grasped object [8], [9]. However, complex grasping states Waseda University, Tokyo 169-8555, Japan. are difficult to model analytically, and there are cases where Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2022.3215723. only two fingers are used [10] or the touch is limited to the Digital Object Identifier 10.1109/TNNLS.2022.3215723 fingertips [11]. 2162-237X © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information. Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. 2 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS Various machine learning methods are used for the recog- It also makes it difficult to extract multifingered level nition task; for example, a random forest was used for object meaning from tactile information, such as object grasping classification for a simple two-fingered robot hand [12] and and in-hand manipulation, due to its incapability of process- also for slip prediction for an arbitrary number of robot ing tactile information on multifingered hands at the same fingers [13]. Also, a support vector machine (SVM) has been time. Previously, we used the uSkin three-axis tactile sensors used for zero-shot learning of object recognition tasks even [28], [29] attached to the Allegro hand, and in this article, though it has tactile information limited to fingertips [11]. we use the same dataset as in our papers [25], [30]. Both Some other researchers have experimented with the fast in [25] and [30], we obtained sufficient object recognition estimation of object shape recognition when there is only rates (95%). However, in [25], the input was a sequence of one tactile sensor at a fingertip [14]. Active exploration tactile states from different regions of the hand, while, in [30], methods, which use SVMs to learn by combining tactile it was a single, tactile state of the entire hand. Therefore, the and visual information, have been shown to be less accurate architecture in [30] permits to have a much faster recognition. than learning from either alone [15]. This is one example This is crucial especially if the information has to be used where machine learning methods were not able to process for real-time object manipulation (e.g., retrieving a stored sensor information of enormous size. Distributed tactile sen- 3-D model of the object after the object has been correctly sors have also been studied [16]. Even with SVM and recognized). self-organized maps (SOMs), the recognition rate remained However, Funabashi et al. [30] just showed recognition low [17]. Also, research has already been done focusing on rates and did not analyze how the CNNs process tactile infor- the features of multifingered hands [18]. By concatenating mation, and thus, it was not clear why this specific structure the information of each finger, a high recognition rate of of the CNN was beneficial for multifingered hands. A more objects can be obtained. While many studies have already careful and detailed analysis of the internal representations been reported on the combination of robot hands and tactile created by the morphology-specific CNN (MS-CNN) would sensors, there are still challenges in dealing with a large explain why the proposed structure is superior to other possible number of tactile sensors and learning high-dimensional tactile arrangements of the data, and whether it is beneficial for information. other tasks rather than object recognition. Finally, whether Deep learning has been used for tactile sensors as one of the the CNNs are usable with other efficient training methods, methods to process a large amount of tactile information. Deep such as transfer learning, to achieve better results has not learning has achieved better recognition rates compared to been confirmed yet. It has to be evaluated whether CNNs other machine learning methods, such as SVM [19]. Another can embrace a useful viewpoint, which is based on robotic example is the use of deep reinforcement learning to change morphology. the orientation of a cylindrical object. Although the hand is Even though the proposed method could be applied to multifingered, the small number of pressure sensors and the multiple tasks, the data collection would require a high hard- high degree of freedom of the finger joints could make the ware load especially because tactile sensors directly touch situation difficult; moreover, it has not been used to manipulate objects and, in general, get worn out easily (e.g., [22]). This various objects [20]. Many researchers have been working on is especially important for multifingered hands because a lot convolutional neural networks (CNNs) for image and speech of tactile sensors on the hands can break. In this case, transfer recognition because of their robust performance in extracting learning that is widely used for image recognition tasks [31] features from multidimensional information. This advantage of can be useful to reduce the size of the required training dataset CNNs is well suited for distributed tactile sensors because the with tactile sensors. There are some tactile transfer leaning sensors are physically distributed on a 2-D surface. Therefore, methods that achieved high recognition rates [32], [33], yet CNNs have become widely used in distributed tactile sensors they focus on only fingertips. N-shot learning was also used for robotic hands [21]. A state-of-the-art method focusing on by specifying the size of training data [34]. Multimodal CNNs and distributed tactile sensors has been developed [22]. transfer learning, including vision and tactile information, CNNs have also been applied to tactile sensors for multifin- was conducted [35], which can be difficult to implement gered hands [23]. In our previous works, CNNs were also in the case of multifingered hands due to occlusions. Sim- applied for in-hand manipulation and object recognition of an to-real transfer learning is one of the effective ways to collect Allegro hand with three-axis tactile sensors [24], [25], and training data [36], [37]. However, touching an object from the results were better than the results of other modeling and diverse orientations happens due to multiple fingers and, thus, machine learning methods. requires three-axis tactile information. That information is While CNNs have achieved prominent results, it is still difficult to implement in simulation systems. Even though necessary to consider the problem of how to input tactile many tactile transfer learning methods have been used, transfer information from such sensors into the CNN when it comes learning with a multifingered hand with tactile information to multifingered hands, as some hands have tactile sensors at from not only the fingertips but also the phalanges has not the fingertips [26], while others have distributed sensors of been investigated yet. Therefore, transfer learning was chosen different sizes and shapes [27]. This is particularly difficult for evaluating the MS-CNNs. because the size and shape of tactile patches on hands vary Therefore, the contributions of this study are given as much as the size of fingers, and in general, CNNs require as follows. First, we review and compare the possible rectangular input, which makes the implementation of CNNs architectures of convolutional layers of the MS-CNN and difficult. how their combinations affect the object recognition rate. Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. FUNABASHI et al.: TACTILE TRANSFER LEARNING AND OBJECT RECOGNITION WITH A MULTIFINGERED HAND 3 Second, we visualize the internal representations of the differ- ent network structures using Grad-CAM++ [38] and highlight why and how such representations are the best candidates to support a wide range of robotic tasks related to multifingered hands. Finally, and most importantly, we demonstrate how these learned representations permit efficient transfer learning from the recognition of object instances to the recognition of physical properties (i.e., heaviness, slipperiness, and softness) of novel objects during in-hand manipulation with complex contact states on several fingers and show whether the CNNs are useful with transfer (WT) learning. II. S YSTEM A RCHITECTURE A. Hardware Design This study uses the Allegro hand, a commercially available robotic hand from Wonik Robotics. Our uSkin tactile sensor, which was designed to detect forces in three axes, is used to cover the fingertips [28] and phalanges [29] of the Allegro hand, as shown in Fig. 2(a). As a multifingered robot hand with 16 degrees of freedom, the Allegro hand generates forces in many directions during manipulation. Therefore, the uSkin sensor has been implemented to detect such complicated grasping states. A total of 15 uSkin patches are installed: four on the index, middle, and little fingers, and three on the thumb. Thus, the customized Allegro hand has a total of 15 Fig. 2. Hardware setup and architecture of the proposed CNN. (a) Allegro hand with a set of uSkin three-axis tactile sensors mounted on the phalanges (sensor patches) × 16 (sensors) × 3 (tactile axes) + 16 (joint and fingertips (hardware setting). There are four uSkin sensor patches on the angles) = 736 measurements. index, middle, and little fingers, and three patches on the thumb. (b) From left to right, the mounted sensor patches, the location of the sensors in black dots, and the input map for the CNN (how to input tactile information). Each red B. How to Input Tactile Information? “0” stands for a position where no sensor is mounted on the corresponding The locations of the sensors on the phalanges and fingertips actual sensor patch (other values are arbitrary). The map is used for input to the CNN in three channels (x, y, z). (c) One example is how to combine the are shown in Fig. 2(a) and (b). There are 16 sensors in each convolution layers from each tactile sensor patch (how to combine tactile sensor patch. The phalanges and fingertips, however, differ in features). The current robotic hand platform has different sizes of sensor size and shape [see Fig. 2(b)]. Regarding the position of the patches for fingertips and phalanges. In addition, there are different numbers of patches for the thumb and other fingers. In the example shown in this figure, sensors, the input map of the phalanges has a shape of 4 × 4, each tactile patch has its own convolution layer (patch-level convolution) and and the input map of the fingertips has a shape of 6 × 4. is combined at a later stage (hand-level convolution). In the fingertip input map, the number “0” [the red number in Fig. 2(b)] is received at the position where each fingertip is not equipped with a sensor, resulting in a rectangular input layer convolutes the input from each sensor patch and converts map. This allows to convolute the input map with a filter it into an output of the required size. The height and width of a size equal to or larger than 2 × 2. Some studies on sizes of the filters in the convolution layer are adjusted as image recognition use three channels of “RGB” input. This follows: is because, in an image, each pixel has “RGB” information. H + 2P − FH Likewise, in this article, the input to CNN is set to three OH = +1 channels. This is because each of the sensors (or “taxel”) S W + 2P − FW provides x yz information [see Fig. 2(b)], and the orientation OW = +1 (1) of x yz information on the patch is shown in [see Fig. 2(a)]. S For the fingertips, the x yz-direction varies between taxels as where H and W are the height and width of the current the fingertips are curved. Specifically, the z-axis corresponds convolution layer’s input, OH and OW are the height and to the tactile information in the direction perpendicular to width of the current convolution layer’s output, and FH and the sensor surface, and the other axes correspond to the FW are the height and width of the filters that convolute the tactile information in the tangential direction. This method of input into the output of the next convolution layer. P is the processing was used in [24], [30]. One way to process the padding, which typically adds “0” around the input map, so as tactile sensor patches is shown in Fig. 2(c). Here, as the first to keep the outputs of the convolution layers the same size. convolution layer, a convolution layer is prepared for each Since the size of the input varies depending on the combination tactile patch. In the second layer, all the convolution layers of of convolution layers, padding is not used in this article. the first layer are combined in accordance with the position of In addition, as we focus on how the convolution layers are the tactile sensor on the hand. In order to make the shape of the combined, there is no pooling layer in the CNN used in our convolution layer rectangular by filtering, the first convolution experiments that change the size of the input. Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. 4 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS III. E XPERIMENT D ESIGN a power grasp by starting with picking motion of fingertips from a ground [see Fig. 3(d)]. Since this motion embraces a A. Data Collection variety of complicated contact states (e.g., slip, rolling contact, Instead of collecting tactile information only when an object touch, and not touch by finger gating) during executing the was statically grasped, we recorded tactile information through manipulation, it was chosen to evaluate the proposed CNN’s a series of manipulations in the hand as an active tactile adaptability and whether it is applicable to other tactile recog- sensing method. Since the objects were provided to the Allegro nition tasks or not. Since it is difficult to derive a formula to hand in random positions for training, recognition should be generate the motion, the training data were collected via the according to the object, not the way it was grasped. Note CyberGlove [40], which is a dataglove that enables teleoperate that, because the tactile measurements are not calibrated, the human-mimetic robot hands by sending joint motions of their sensitivity varies, and furthermore, crosstalk between measure- fingers to robot fingers. One experimenter collected all data ment axes may occur (see [28], [29] for details). However, because the motion can be drastically different among trials we assume that, as long as the features from the measurements when different people do the experiment, i.e., human hands are extracted by neural networks, there is no need to calibrate are different from each other and require different calibration the measurements. settings for the dataglove. 1) Object Recognition: For the training data of object As shown in Fig. 3(b), there are 45 objects, including recognition, we use the same data as in our previous paper three objects from the Yale-CMU-Berkeley Model Set. The [25], [30], where we confirmed the effect of uSkin sensors on objects were separated into three heaviness classes [Heaviness a multifingered hand by focusing on time series information High (more than 136 g), Heaviness Medium (77–114 g), and spatial information to improve the accuracy of object and Heaviness Low (under 68 g)], three softness classes recognition by CNNs. In this article, we focus on spatial [Softness High (deformable), Softness Medium (only surface information and its analysis. The data were collected in a deformable), and Softness Low (stiff objects)], and three slip- way that mimicks an infant’s manipulation and exploration of periness classes [Slipperiness High (plastic or coated paper), objects to obtain information about their physical properties. Slipperiness Medium (paper or bumpy), and Slipperiness Low Fig. 3(a) shows the 20 common objects used in the experiment, (textile or rubber)]. Each class has at least one object and which also includes ten objects from the Yale-CMU-Berkeley sometimes several objects in terms of outer properties so that Model Set. In particular, the objects in the top row of Fig. 3(a) the networks can acquire a generalization skill for the inner and 20: spray bottle are elongated in shape, and when grasping properties (e.g., different-sized balls and differently shaped the object, the palm of the hand always faces roughly the plastic fruits). elongated side, e.g., the bottle was never grasped from the Two trials of the in-hand manipulation motion for each cap or bottom, and the orientation varied, with the object being of the 45 objects were collected resulting in 90 trials in grasped close to the center, but not necessarily exactly at the total. Data were collected with a sampling rate of 100 Hz. center. It is also important to note that, in this experiment, The objects were placed near the Allegro hand in random even after repeated grasping, the objects do not always have positions. Fig. 3(f) shows tactile trajectories obtained from the same orientation, as the final grasping posture depends on one trial, where red indicates the raw data recorded and green the weight distribution. The other ten objects were roughly indicates the data extracted for training and testing. The data spherical and grasped in random orientations. Some examples were recorded for 17 s, and the manipulation was executed of active tactile sensing for elongated objects can be seen in during the period. Since the motion was executed by a human, Fig. 3(c). Note that the Allegro hand controlled the position the motion itself and when the motion finishes are always with a constant controller gain in all trials, and the reaction differently shown in Fig. 4. Therefore, the data for training in force varied depending on the size and shape of the object. each trial were extracted especially when the hand touches the Thirty manipulation trials were performed for each object, grasped object with a threshold of tactile measurements. Sev- for a total of 600 trials. Twenty-five trials for each object eral timesteps where the movement stops were also sampled were used for training the CNNs, and five trials were used as a static state. 291 timesteps were randomly sampled from for each CNN’s test set [see Fig. 3(a)]. Data were collected at each trial. In total, 26 189 samples were randomly chosen and a sampling rate of 30 Hz. Red is the raw data recorded, and used. The samples were split randomly into the training dataset green is the data extracted for training and testing shown in and the test dataset consisting of 23 570 and 2619 samples, Fig. 3(e). The last step of the 250 steps was set less than one respectively. A different random split was conducted for each second before the end of the recording (the recording always training trial of CNNs. stops at the same time after the movement stops). Note that the 3) Object Property Recognition With Transfer Learning: hand remains grasping the object even after the movement has For the training data of object property recognition, the stopped. Twenty-five timesteps from 250 timesteps in each of datasets described in Sections III-A1 and III-A2 were used. the 25 trials used for training were randomly sampled and used First, the datasets for object recognition were used for a as the training dataset. The training dataset contains a total pretraining phase of the transfer learning, and thus, the task of 12 500 samples. The test dataset contained 2500 samples for the pretraining was the same object recognition task randomly sampled from five trials for each object. described in Section III-A1. Therefore, the size of samples 2) Object Property Recognition: For the training data of for the training dataset and the test dataset is 23 570 and object property recognition, the target in-hand manipulation 2619, respectively. Second, the datasets for object recognition chosen in this study was a movement of a precision grasp to property recognition were used for a fine-tuning phase of the Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. FUNABASHI et al.: TACTILE TRANSFER LEARNING AND OBJECT RECOGNITION WITH A MULTIFINGERED HAND 5 Fig. 3. Collection of objects and training data. (a) Selection of 20 daily objects (target objects for object recognition). Ten objects were selected from the “Yale-CMU-Berkeley Model Set” [39]. To increase the difficulty of recognition, we chose ten slightly more difficult identifiable objects. The bottles are labeled L (large), M (medium), and S (small). 1: bottle (L, spheric); 2: bottle (L, cornered); 3: bottle (S, spheric); 4: bottle (S, cornered); 5: bottle (M, spheric); 6: bottle (M, waisted); 7: powder can; 8: pringles; 9: hand model; 10: pack of Styrofoam dices; 11: pack of snacks; 12: pack of solid dices; 13: tuna can; 14: large can; 15: spam can; 16: bowl; 17: clipper; 18: baseball; 19: soccer ball; and 20: spray bottle. (b) 45 daily objects that were selected in terms of inner object properties (heaviness, softness, and slipperiness) (target objects for object property recognition). (c) shows how the training data were collected (target motion for object recognition). The motion is to mimic a human baby’s squishing. Fingers move left and right. The skin sensors are rubbed to the grasped object. (d) Target motion of the object property recognition is to start a grasping posture by pinching it with fingertips. (e) Example of tactile time-series data throughout squishing trial (tactile trajectories during object recognition motion). The repetitive motion is reflected in the tactile data stream. Red indicates the complete data stream; green indicates the extracted raw data for learning object recognition. (f) Example of tactile time-series data throughout object picking trial (tactile trajectories during object properly recognition motion). Red indicates the complete data stream; green indicates the extracted raw data for learning object property recognition. The data are extracted when the hand touches objects, and thus, the extracted timesteps depend on each trial. transfer learning. The task for the fine-tuning was the same 16 (joint angles) + 15 (sensor patches) × 16 (sensor) × 3 object property recognition task described in Section III-A2 (tactile axis); thus, 736 measurements at each time step were but with different training settings to the one for object obtained. To train the CNN, we added “0” to the input from property recognition without transfer learning. the fingertip sensor, as shown in Fig. 3(b). Consequently, the number of dimensions of the input is 736 + 8 (number of “0”s for one fingertip) × 3 (tactile axis) × 4 (number of B. Training Setting fingertips) = 832 dimensions. All the CNNs were built with the Tensorflow library for Transfer learning was conducted with object property recog- Python and a GTX Geforce 1080, 1080Ti and RTX 2080 were nition. In this study, acquired features (or trained weights used as the GPU. Object recognition and object property and biases) from the object recognition task were reused for recognition have different training settings because the object (transferred to) the object property recognition task to get recognition task requires the CNNs to generate one-hot vectors higher recognition rates. As far as the authors know, this is for so-called multiclass classification, while the object property the first time to achieve transfer learning for tactile tasks with recognition task requires the CNNs to generate “multi”-hot a multifingered hand. vectors for so-called multilabel classification. 1) Object Recognition: Each CNN was trained with Otherwise, with the exception of the size of the convolution 12 500 samples, and the test set consisted of 2500 different layer, all the CNNs used in this article have the same network samples for up to 10 000 epochs until the training loss con- parameters and training settings, and they are described in verged and the test loss for the test set did not go up. Weights Fig. 5(b). For the CNN input, we used samples, including for all CNNs were initialized with a random number from a Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. 6 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS Fig. 4. Examples of the used objects for object property recognition and corresponding time-series (joint and tactile measurements) data. As shown in the graphs, even though each trial of manipulation starts and ends at a different time and the trajectories of joint and tactile information differently change, the proposed CNNs could get high recognition rates of the object properties. truncated normal distribution with a standard deviation of 0.1, the output layer. The output layer has a sigmoid activation with a random seed defined as 1. All layers except the output function unlike object recognition layer used ReLU as the activation function. Softmax was used 1 as the activation function for the output layer f (x i ) = (4) 1 + e−xi ex f (x i ) = Ci (2) where x i is a feature from i th neuron in the output layer and xk k e f (x i ) is activated output from the output layer. The output where x i is a feature from the i th neuron in the output layer values from the output layer take a range of 0 to 1 through and f (x i ) is the activated output from the output layer. Since the sigmoid. The loss function was binary cross (BC) entropy. the sum of the values from the output layer is 1.0 through This BC loss is defined as Softmax, it can be regarded as a probability. It is suitable for categorical cross (CC) entropy as the loss function. This CC C BCLoss = − tk log(x k ) (5) Loss is defined as k exp CCLoss = − log C (3) where tk and x k are the target label and the output of the CNN xk k e for class k in C classes. The optimizer, the learning rate, and where tk and x k are the target label and the output of the the minibatch size were the same as that of object recognition. CNN for class k of C classes. The optimizer used was Adam, 3) Object Property Recognition With Transfer Learning: and the minibatch size was set to 100. The step size of the For pretraining in transfer learning, the models that were made optimizer α was 0.0001, the first exponential decay rate β1 in the object recognition task described in Section III-B1 were was 0.9, the second exponential decay rate β2 was 0.999, and prepared as pretrained models for object property recognition. the small value of numerical stability ε was 1e−08. Moreover, Those pretrained models were fine-tuned for the object prop- the learning rate was 0.00001. erty recognition task. Each MS-CNN has differently shaped 2) Object Property Recognition: For the training of the convolution layers, as shown in Fig. 5, the difference is object property recognition task, each CNN was trained with investigated by considering the recognition results for transfer 23 490 samples, and the test set consisted of 2610 different learning in Sections IV-D–IV-F, and the size of the output samples for up to 10 000 epochs until the training loss con- layer for the two tasks is different (20 for object recognition verged and the test loss for the test set did not go up. ReLU and nine for object property recognition). From these points, was used as an activation function for all the layers except the weights and biases in the fully connected (FC) layers Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. FUNABASHI et al.: TACTILE TRANSFER LEARNING AND OBJECT RECOGNITION WITH A MULTIFINGERED HAND 7 Fig. 5. Nine architectures for combining convolution layers. (a) First FC gets features from the third convolution (Conv) layer and joint angles (J) (combining patterns of MS-CNN architectures). Architectures II and III, and VII and IX are prepared to see the difference in when to combine convolution layers. (b) Each parameter for constructing convolution layers is described (parameter settings for MS-CNN architectures). (FC1 and output layers in Fig. 5) were not transferred, only set was almost 50% larger) as transfer learning is supposed to the ones in the convolution layers. transfer useful knowledge to train networks for another task For fine-tuning in transfer learning, the weights and biases and result in being able to reduce the size of the training in the convolution layers were not updated during fine-tuning. dataset. This was also because, when the training and test The pretrained models were trained with 13 094 samples and dataset sizes for object property recognition tasks were used, 13 095 samples as the test set for up to 1000 epochs as training it was difficult to see a difference among recognition results losses of the models got converged, and the test loss for the from the CNNs used for transfer learning. Those datasets were test set did not go up (the original dataset with 26 189 samples created by randomly choosing samples every time the object was split half and a half). The samples were randomly chosen property recognition trial was conducted. for the training and test datasets for each training trial. All the hyperparameters for training, such as learning rate or training C. Combining Architectures epoch, were the same as the object property recognition task in Different architectures were considered to combine the Section III-B2. However, in the case of the transfer learning, convolutional (Conv) layers. The location of the hand sensors the size of the training dataset was almost 50% smaller than was taken into account when defining these architectures. Par- the training of object property recognition task (and the test ticularly, the choice of applying filters across the boundaries of Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. 8 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS finger segments and different stages of fingers was considered. This idea considering merging convolution layers for extract- ing features at the different stages with each resolution can be useful for achieving high object recognition performance [41]. As shown in Fig. 5(a), there are nine architectures, and the parameters of each architecture are shown in Fig. 5(b). For example, in architecture I, the first Conv layer has an input of size 18×16×3. When this is passed to the first Conv layer, the output from the first Conv layer is 10 × 8 × 14. In a previous paper of ours [30], we evaluated the comparison between three-axis and one-axis tactile information and showed that using three-axis tactile information is more effective than that of one-axis. Therefore, Fig. 5(b) shows the parameter settings for the three axes only, which are determined heuristically. The joint angle measurements were added to the first FC layer, represented by J in Fig. 5(a), for all architectures. Weight sharing was tried to reduce the number of filters between Fig. 6. Average and variance of recognition rate for ten times from each architecture. Architecture IV shows the best recognition rate. fingers. However, when we evaluated the results with and without weight sharing, we found that weight sharing did not improve the recognition rate compared to the case where between architectures II and III is when the “Hand Map Layer” different weights were used in [30]. Therefore, only different is constructed. This idea comes from [16], which studied the weights of CNNs were used in this study. To adjust the size timing of information fusion for grasp stability. of the convolution layer, we used filters of different sizes in In Architecture IV, we have a “Patch Map Layer” as the first architectures II, III, and IV. Conv layer. Thereby, the maps of the four fingers (index finger, The measurements of joint angles and tactile sensors middle finger, little finger, and thumb) are constructed as the were, respectively, normalized by their full measurement “Finger Map Layer.” The “Hand Map Layer” is constructed range. In architecture I, a “Hand Map Layer with Zero in the third Conv layer. This architecture changes the map Padding” was constructed as an input layer with a size of architecture from a “Patch Map Layer” to a “Hand Map 18 rows × 16 columns. Three tactile patches [4 × 4 × Layer,” one Conv layer at a time. A filter of size (4, 2) of (2 patches) and 6 × 4 × (1 patch)] for the thumb and four the second Conv layer is used for the Conv layer from the patches [4 × 4 × (3 patches) and 6 × 4 × (1 patch)] for phalanges and fingertip of the thumb. the other fingers were implemented, respectively. Therefore, In Architecture V, a “Patch Map Layer” is used for the input to construct a rectangular input layer, we had to add four layer and the first Conv layer. A “Finger Map Layer” consists rows × four columns of “0” on top of the tactile input map of the second and third Conv layers. Architecture V differs from the thumb fingertip, as shown in Fig. 5(a). As a result, the from Architecture IV in that it does not have a “Hand Map dimensionality of the input is 736 + 8 (number of “0”s on one Layer” but still gradually combines the Conv layers. fingertip) × 3 (tactile axis) × 4 (number of fingertips) + 16 Architecture VI consists of only a “Patch Map Layer” for (4 (rows) × 4 (columns) above the thumb’s fingertip) × 3 the input and all Conv layers. This architecture is to check (tactile axis) = 880 dimensions. The filter was applied across whether we need to combine convolution layers according to the boundaries of the finger segments already set up in the the position of the patch on the hand. first layer. The purpose of this architecture is to see whether Architecture VII uses only a “Finger Map Layer” for the it is sufficient to combine all the information into one large input and all Conv layers, to check whether it is sufficient to input layer or not. consider the position of the patches on the fingers. In Architecture II, each one of the 15 sensor patches Architecture VIII uses the “Finger Map Layer” in the input is processed separately in the first convolution layer (as a and the first Conv layers so that we confirm if a morphological “Patch Map Layer”). In this case, the first convolution layer fusion of convolution layers (i.e., finger- and hand-shaped is the “Patch Map Layer,” i.e., no filters are applied across convolution layers) extracts more useful features for tactile the boundaries of the sensor patches, and hence, for every recognition tasks or not. different sensor patch, a different filter is trained. In the next Architecture IX uses the “Finger Map Layer” in the input, layer, the output of the first convolution layer is combined first, and second Conv layers. The difference between archi- taking into account the location of the tactile patches of tectures VIII and IX lies in when to build the “Hand Map the hand in Fig. 2(c) (“Hand Map Layer”). For the input Layer” with the same reason as for architectures II and III. map from the phalanges of the thumb, a filter of size (2, 3) is used in the first Conv layer. The details are shown IV. E VALUATION in Fig. 5(b). In Architecture III, a “Patch Map Layer” is also used for A. Combining Architectures and Object Recognition Rates the second Conv layer. A “Hand Map Layer” is implemented In Fig. 6, the accuracy and variance presented are the mean in the third Conv layer. For the thumb fingertip, a filter of size values of ten recognition trials. In each trial, 1500 samples (2, 2) is used in the second Conv layer. The only difference were randomly selected from the 2500 samples in the test set. Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. FUNABASHI et al.: TACTILE TRANSFER LEARNING AND OBJECT RECOGNITION WITH A MULTIFINGERED HAND 9 C. Visualization of Sensor Map With Weights From Convolution Layers Since the error rates of object recognition can change according to the structure of the convolution layers, how the weights in the last convolution layer (third Conv layer) in each architecture react to tactile measurements was investi- gated. Grad-CAM++ and guided Grad-CAM++ [38], which provides a saliency map of calculated weights from the last convolution layer corresponding to tactile measurements, were used, as shown in Fig. 8. The saliency map is defined as follows [38]: L i j = ReLU c w k · Ai j c k (6) k where wkc is a weight for a feature map. The map is defined as Akij at the i th and j th spatial location for class c. For the map provided by Grad-CAM++, blue to red represent the lowest to highest values, respectively, and the pixels where high values are placed show where the network regards the information as important. For the map provided by guided Grad-CAM++, Fig. 7. Error rates of object recognition for six objects from each architecture. The red solid square shows where architectures that do not have the “Patch the pixels where convolution layers focus on are emphasized Map Layer” have relatively higher error rates. The red dot square shows where with colors, while the other pixels are depicted with gray. architectures that do not have the “Patch Map Layer” have lower error rates. This section focuses on the objects that are described in The green solid square shows where architecture VI that only has “Patch Map Layer” has relatively higher error rates. The green dot circle shows where Section IV-B. For architectures I, II, and IV, the saliency maps architecture VI that has lower error rates. provided by Grad-CAM++ show that the tactile information is wholly focused on by their last convolution layer. On the other hand, architecture VI shows that the layer focuses on a small Architecture II showed a slightly lower recognition rate part of the tactile information, relatively. Also, architecture VII than the others. The variance for architecture VI is the highest focuses on tactile information in a line-shaped fashion. as the architecture has the largest number of weight parame- It seems that these differences happen because one or ters in the convolution filters. The best recognition rate was several convolution layers in the last layer (third Conv layer) achieved by architecture IV, which includes the Patch, Finger, are weighed heavily among the convolution layers. Since there and Hand Map Layers. is only one convolution layer in the last layer of architectures I, III, and IV, they seem to regard entire tactile information as important. Some objects picked in Figs. 7 and 8 (2: bottle B. Object Recognition Rate for Each Object (L, cornered), 12: pack of solid dices, 13: tuna can, and In this section, the object recognition rate for each object 16: bowl) have a relatively complicated shape compared to is investigated to elaborate on the effect of morphological the rest of the objects [4: bottle (S, cornered) and 5: bottle convolution architectures. Specifically, six objects (2: bottle (M, spheric)]. This shape can change contact patterns on (L, cornered), 4: bottle (S, cornered), 5: bottle (M, spheric), each part of the multifingered hand. Therefore, when the 12: pack of solid dices, 13: tuna can, and 16: bowl) are hand grasps 12: pack of solid dices, for example, contact investigated, as shown in Fig. 7. For 2: bottle (L, cornered), 12: patterns on each finger segment can be different which are pack of solid dices, and 13: tuna can, four CNN architectures provided by an edge, side, and plane of the dices. On the [architectures I, VII, VIII, and IX (with red dot squares)] other hand, 5: bottle (M, spheric) has small enough size achieved a lower error rate than other architectures. On the that the hand grasps it wholly and has a cylindrical shape, other hand, they have relatively higher error rates (with which can produce a similar contact pattern on the hand while red solid squares) than the other architectures for 5: bottle grasping. From this result, we deduce that it was easy for (M, spheric). Those architectures do not have Patch Map architecture VI to recognize relatively simple shaped objects Layers, and tactile sensor patches are combined from their [4: bottle (S, cornered) and 5: bottle (M, spheric)] because the input layer as the “Finger Map Layer” or the “Hand Map network focuses on a small part of the contact areas, which Layer.” can be enough to perform object recognition due to the similar Moreover, architecture VI has a lower error rate for 4: contact pattern on any part of the Allegro hand. However, bottle (S, cornered) and 5: bottle (M, spheric) (with green dot when it comes to complicated shaped objects, such as 2: bottle squares), and the others have the highest wrong recognition (L, cornered), 12: pack of solid dices, 13: tuna can, and 16: rate with 12: pack of solid dices, 13: tuna can, and 16: bowl bowl, they were difficult for architecture VI to recognize (with green solid squares). From this result, depending on because it focuses on small contact areas, but contact patterns having the “Patch Map Layer” or not, the recognition rate on the contact areas are diverse. This misleads the network for each object can be changed. to recognize objects wrongly. For the other architectures, Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. 10 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS Fig. 8. Saliency maps from the last convolution layer in each MS-CNN generated by Guided Grad-CAM++ and Grad-CAM++ from [38]. The maps generated by both visualization methods for architectures I, III, and IV that have the “Hand Map Layer” clarify that the networks focus on the entire tactile information. Architecture VI that only has the “Patch Map Layer” shows it focuses on small areas compared to the other networks. Interestingly, architecture VII seems to see tactile information in a line fashion. Since those maps are generated by the last convolution layer and the layer shape is different for each architecture, the difference in the saliency maps among networks seems to happen as each network chooses which layer in the last convolution layer to focus on. architectures I, II, and IV focus on whole tactile information, and architecture VII focuses on larger areas compared to architecture VI. Thus, they have better recognition rates when the hand grasps the relatively complicated shaped objects. Also, as shown in Fig. 7, architectures I, VII, VIII, and IX have lower error rates for the complicated objects specifically because they have combined inputs from the input layer. Furthermore, it can be considered that architecture IV could have the best object recognition rate with a variety of objects Fig. 9. (a) Recognition rate of the object property (comparison of the in terms of shape (or contact patterns) in total because the proposed CNN architectures). On the left-hand side of the table, averages and variances of recognition rate for five times of no transfer (NT) models are network has a “well-balanced” network architecture (i.e., the shown. The accuracy difference among CNNs for object property recognition “Patch Map Layer,” the “Finger Map Layer,” and the “Hand is larger than that of object recognition shown in Fig. 6. Furthermore, Map Layer”). From this result, we hypothesize that a CNN that architectures I, III, and IV have a large difference in comparison with architectures VI and VII. On the right-hand side of the table, averages and has combined convolution layers (i.e., the “Hand Map Layer”) variances of recognition rate for five times WT models are shown. The sees the entire tactile information on the multifingered hand recognition rate got around 10% better than without transfer learning. Note and is well suited for recognizing complicated contact states. that architecture IV still has the highest recognition rate, and architectures I, III, and IV that have a “Hand Map Layer” have better recognition rates than the others. (b) Recognition rate of the object property (comparison with popular CNNs). On the right-hand side of the table, averages and variances D. Object Property Recognition and Transfer Learning of recognition rate for five times of each model are shown. Object property recognition was chosen as the target task. The accuracy was calculated as a mean of all nine outputs of the CNNs. The recognition trials were conducted five times Also, architectures I and II got better recognition rates than for each CNN. Fig. 9(a) shows the mean accuracies and their the others, which do not have a “Hand Map Layer.” However, variances of the five trials. Architectures I, III, IV, VI, and VII a “Patch Map Layer” can improve the recognition rate from were chosen. As a result, architecture IV got again the best the result that architecture I has less recognition rate and huge recognition rate. variance compared to architectures III and IV. The “Finger Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. FUNABASHI et al.: TACTILE TRANSFER LEARNING AND OBJECT RECOGNITION WITH A MULTIFINGERED HAND 11 Map Layer” also contributes to a good result as architecture IV got better results than architecture III. Regarding transfer learning, most architectures got around 10% better recognition rates, and architecture IV got the best accuracy. Furthermore, the trend that networks get better results with the “Hand Map Layer” and/or the “Patch Map Layer” was kept. E. Classifier Comparison for Object Property Recognition From Section IV-E, architecture IV got the best accuracy of object property recognition. To validate the recognition performance of the proposed CNNs, other CNN-based neural Fig. 10. Weight values (filters) from the last convolution layers. The top row networks and machine learning models were used for compar- shows one of the filters consisting of the weights from the last convolution ison. Since each network has a different architecture, training layer in architecture VI. The middle row shows one of the filters consisting epochs where each network converged was different. Resnet of the weights from the last convolution layer in architecture VII. The bottom row shows one of the filters consisting of the weights from the last convolution with 18 layers (trained for 800 epochs) and 34 layers (trained layer in architecture IV. The errors between weights values from the NT model for 900 epochs), MobileNetV2 (trained for 900 epochs), after training and the transferred weights’ values from the WT model are ShuffleNetV2 (trained for 800 epochs), and MnasNet (trained shown in the center. Also, the errors between the weights from the NT model after training and the NT model before training (the weights were randomly for 750 epochs) as CNN-based models (provided by PyTorch) initialized by a normal distribution with a mean of 0.0 and a standard deviation and SVM (provided by scikit-learn) as a machine learning of 0.1) are shown on the right-hand side. Most of the errors between the NT model were prepared. The “Hand Map Layer with Zero model after training and the transferred weights of the WT model are less than 1.00, but the errors between the NT model after training and the NT Padding” was constructed as an input layer with a size of model before training are various and huge. 18 rows × 16 columns (same as architecture I) to be input to the deep learning models. Note that the deep learning models were chosen, which could process the tactile input with a size of 18 × 16, which is relatively small-sized input compared deviation of 0.1 as an NT model. Note that the transferred to visual inputs from a camera with a larger size, such as weights were from only convolution layers (not FC layers) as 224 × 224. Moreover, the convolution layers soon before FC this study focused on the convolution mechanism for tactile layers of the deep learning models were used as a feature information. extractor and a first FC, and output layers were prepared to be In Fig. 10, from the top row, one of the filters (weights) in applied to a new domain referring [22]. During the training, the last convolution layer of architectures VI (2 × 2 filter), the layers of the deep learning models were not updated, but VII (2 × 4 filter), and IV (3 × 3 filter) as ones of examples only the first FC and output layers were, so that the effect of are shown in a gray scale. On the left-hand side, the weights convolution layers of the deep learning models were validated in the NT model after training is shown as target weights in the same way as the transfer learning in Section IV-D, all and the weights are supposed to be the optimized (trained the models were trained with the same training setting as and converged) ones for object property recognition. In the one used in Section III-B-2 except the training epoch. The middle of Fig. 10, the weights in the WT model, which are accuracy was calculated as a mean of all nine outputs of the the transferred weights of convolution layers from a model networks. The recognition trials were conducted five times for trained for object recognition, are shown. On the right-hand each network. Fig. 9(b) shows the mean accuracies and their side, the weights in the NT model before training are shown variances of the five trials. In the proposed CNN, architecture as a baseline for this comparison study. The values shown in IV could achieve the best recognition rate. Especially, the deep Fig. 10 are the errors of the weights between the NT model learning models that are larger than architecture IV and do after training and the WT model in the center, and the NT not have combined convolution layers following tactile sensor model after training and the NT model before training on the positions on the hand produced very low recognition rates. right-hand side. The weights in the WT model are similar to the weights in the NT model after training with the dataset of object property recognition as most of the errors between F. Analysis of Weights in CNNs weights of the NT model after training and weights of the Even though transfer learning is useful for problems with WT model are under 1.00. This shows that the weights in insufficient training data in general [42], it is not clear why the WT model were already optimized enough that the WT the CNNs could achieve better recognition rates for object model required less size of training dataset and training epochs property recognition, as shown in Fig. 9(a). The difference for object property recognition. Then, the WT model focused between WT and NT models is how to prepare the weights on updating only weights in the FC layers was enough to in a neural network for a new task, i.e., the weights from achieve high recognition rates (technically, the weights in a pretrained model trained in the other task or the weights the convolution layers were fixed and were not updated). generated from a random initialization method. Therefore, On the other hand, as most of the errors between weights of the we compared the weights in pretrained models from the object NT model after training and weights of the NT model before recognition task as a WT model and the weights initialized training are over 1.00 and some of the errors are even more by a normal distribution with a mean of 0.0 and a standard than 100, the NT model needs to update weights in convolution Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. 12 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS Fig. 11. AUC of architectures I–IV, VI, and VII. AUC is an indicator of how much the models are capable of distinguishing between classes and its value, which is an area under the ROC curve. The AUC of the label “Heaviness High” is very low from architecture I. The AUC of the label “Softness High” is very low from architecture VI. This also implies that the fusion combination of CNN architectures affects the accuracy of each physical property of objects. layers and the weights in FC layers. This comparison result outputs from the CNNs are positive, and thus, TN and FN can be a reason why the recognition rates of the WT models are 0); otherwise, they are 0 (i.e., all outputs from the CNNs were better than those of the NT models. are negative, and thus, TP and FP are 0). From this point, an ROC curve can be built only by coordinates TPR and G. Object Property and CNN Architectures FPR = (0, 0) and (1, 1). Finally, an AUC derived from (8) would be 0.500. On the other hand, the AUC of “Softness By using the WT models, recognition performance for High” in architectures VI and VII shows a very low value. The each object property was investigated. Since the binary-cross difference is whether each model has the “Hand Map Layer.” entropy was used as the cost function, recognition results Furthermore, architectures II and III have similar network would change by cutoff values that decide an output of CNNs architecture, while each model has a low AUC value for as 0 or 1. In this case, the receiver operating characteris- different properties. Architecture II has a lower number of tic (ROC) and the area under the curve (AUC) were calculated. the “Patch Map Layers” and a low AUC value for “Heaviness The ROC curve is depicted in a map with the true positive High.” Architecture III has a lower number of the “Hand Map rate (TPR) axis and the false positive rate (FPR) axis, which Layers” and a low AUC value for “Softness High.” Therefore, are defined as we deduce that the fusion of convolution layers affects not TP only the recognition of objects but also the physical property TPR = TP + FN of objects. FP FPR = (7) FP + TN H. Object Property and Tactile Information where true positive (TP) is that the actual class and prediction Finally, tactile information was analyzed for a better under- are positive (correct answer), true negative (TN) is that the standing of CNN architectures for object property recognition. actual class and the prediction are negative (correct answer), Specifically, the architectures that have the “Patch Map Layer” false positive (FP) is that the actual class is negative but showed a low AUC value for “Softness High,” while the the prediction is positive (incorrect answer), and false neg- other architectures that have “Hand Map Layer” showed a ative (FN) is that the actual class is positive but the prediction low AUC value for “Heaviness High.” Grad-CAM++ showed is negative (incorrect answer). how the architectures that have a “Patch Map Layer” are The AUC is an area under the ROC curve and is defined as good at recognizing simpler contact patterns, while the other 1 architectures that have the “Hand Map Layer” are good at AUC = TPR(FPR)dFPR recognizing complicated contact patterns. 0 1 From this point, how each object property changes contact = TPR FPR−1 (x) d x (8) patterns was analyzed. In Fig. 12, each tactile trajectory 0 represents an average on each tactile sensor patch for simpler where x is a continuous random variable. Fig. 11 shows AUCs visualization of the tactile information. This information is for object properties from each CNN. Architecture IV achieved taken from the third to fourth grasping postures in Fig. 3(d) as a high value of AUC for each physical property. Interestingly, the last 70 timesteps of the motion. The red line is on a digital there are some properties that each CNN model is good at value of 2400 where noises and responses to the grasped recognizing. The AUC of “Heaviness High” in architectures object are clearly separated. The tactile trajectories over the I–III shows a very low value of 0.500, which is theoretically digital value are regarded as where the hand touches the object the same value as what a predictor randomly outputs. There firmly. Fig. 12(a) shows that five sensor patches have tactile is a reason why 0.500 appears for some property labels. First, value. We deduce that the soft object deformed and followed CNNs output 0 for the labels at all grasping states. Although a grasping posture of the hand, and thus, the soft object a cutoff value, which is a threshold to decide whether the affected relatively many tactile sensor patches. Also, tactile CNNs recognize an object property or not, varies from 0 to trajectories are dynamic due to the softness of the object, under 1, the output from CNNs is always classified as 0. Only which means that contact patterns are complicated. Fig. 12(b) when the cutoff value is set as 0, the output from CNNs shows that a smaller number of tactile sensor patches (three are always classified as 1. Therefore, when the cutoff value patches) have tactile values over 2400 to the contrary. Also, is set to 0, TPR and FPR derived from (7) are 1 (i.e., all tactile trajectories are relatively flat due to the stiffness of the Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. FUNABASHI et al.: TACTILE TRANSFER LEARNING AND OBJECT RECOGNITION WITH A MULTIFINGERED HAND 13 recognition and with heaviness labels for object property recognition. On the other hand, when the convolution layers are fused and built in the “Hand Map Layer” (e.g., architec- ture IV), it has better recognition rates with more complicated shaped objects for object recognition and with softness labels for object property recognition. Thus, the CNN architecture can be customized depending on tactile measurements or tasks (i.e., simple or complicated touch), reflecting outer (e.g., size and shape), and inner (e.g., softness, slipperiness, and heav- iness) properties of objects. This finding is investigated by a visualized localization skill of the Grad-CAM++ and thor- oughgoing analyses of recognition results for each object and object property using not only a variety of proposed CNN architectures and also CNN-based neural network and machine learning models. Moreover, transferred weights and tactile measurements are analyzed to investigate the factors behind successful recognition. This kind of approach to geometrical features of grasped objects from tactile sensors can be a key for further understanding of tactile sensing as a tactile system in a human’s skin also recognizes such object geometries [43]. Overall, considering robotic configuration represented by distributed tactile sensors with different scales (patch, finger, Fig. 12. Tactile information during grasping: (a) kitchen paper with “Softness and hand mappings) can be a useful approach to achieving High,” “Slipperiness Low,” and “Heaviness High” labels and (b) spray bottle robotic tactile tasks, including simple and complicated contact with “Softness Low,” “Slipperiness High,” and “Heaviness High” labels. Those states, and the approach can be used with other useful methods objects are chosen to compare tactile measurements varied by their softness. Despite a difference in the slipperiness labels of the objects, they are not (i.e., transfer learning that is specifically important for tasks taken into consideration for this comparison as AUCs of CNN architectures with many tactile sensors as they can be worn out) for are barely different. They have the same heaviness label so that tactile achieving better results. measurements largely vary. The last 70 timesteps of tactile information during in-hand manipulation are shown. (a) Tactile information with kitchen paper Nowadays, some of the differently shaped robots have dis- (soft object). (b) Tactile information with a spray bottle (stiff object). tributed tactile sensors on their surface, for example, humanoid or disaster robots with distributed tactile sensors [44], [45]. object, which means that contact patterns are not complicated. The proposed concept, i.e., building network architectures We deduce that the heavy and stiff objects are held at a following robot configurations, could be a suggestion for how small number of phalanges because it does not deform. These to process the tactile information with CNNs for such robots. physical properties of objects change the recognition rates of Also, the proposed concept can be applied to infant robots to an object property. This result revealed that each architecture identify their body via distributed tactile sensors for further has the robustness to process contact patterns that depend on understanding of cognitive science. Not only tactile informa- the physical properties of objects. tion but also thermal information following robot configura- tion could be another application for our proposed methods V. C ONCLUSION and used for thermal imaging tasks [46], [47], [48]. As the This study investigated how the MS-CNN architecture proposed networks achieved high accuracies in recognition affects tactile-based multifingered hand tasks with distrib- tasks, real-time recognition during multifingered manipulation uted three-axis tactile sensors. Object recognition and object is the next step for dexterous manipulation [49] using the property recognition were targeted to evaluate the CNN. graph convolutional network [50] inspired by the results of the The best object recognition rates over 95% were achieved MS-CNNs with morphology-related convolution. Moreover, in the experiments by initially separating and subsequently transfer learning can be applied to more different domains, combining convolution layers following the robot’s configu- such as in-hand manipulation and measuring grasping stability. ration, especially when making the patch, finger, and hand ACKNOWLEDGMENT maps (architecture IV). Moreover, we clarified that the CNNs can make better results by synergy with another generally The authors would like to thank Dr. S. Somlor and useful training method, i.e., transfer learning, which achieved Dr. T. Tomo Pradhono for their technical support. They would prominent object property recognition rates up to 98% with the also like to thank the Editor-in-Chief, the Associate Editor, CNNs. Since this recognition was achieved at a single touch and all the anonymous reviewers for their time and helpful of the multifingered hand, the CNN could also be applied to comments. in-hand manipulation and grasp stability tasks in which quick R EFERENCES processing is required. Most importantly, this study revealed an interesting finding [1] A. Yamaguchi and C. G. Atkeson, “Recent progress in tactile sensing and sensors for robotic manipulation: Can we turn tactile sensing when convolution layers are not fused (architecture VI); it has into vision?” Adv. Robot., vol. 33, no. 14, pp. 661–673, 2019, doi: better recognition rates with simpler shaped objects for object 10.1080/01691864.2019.1632222. Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. 14 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS [2] M. A. Lee et al., “Making sense of vision and touch: Self-supervised [24] S. Funabashi et al., “Stable in-grasp manipulation with a low-cost robot learning of multimodal representations for contact-rich tasks,” in Proc. hand by using 3-axis tactile sensors with a CNN,” in Proc. IEEE/RSJ Int. Conf. Robot. Automat. (ICRA), May 2019, pp. 8943–8950. Int. Conf. Intell. Robots Syst. (IROS), Jan. 2020, pp. 9166–9173. [3] Y. Gao, L. A. Hendricks, K. J. Kuchenbecker, and T. Darrell, “Deep [25] S. Funabashi et al., “Object recognition through active sensing using a learning for tactile understanding from visual and haptic data,” in Proc. multi-fingered robot hand with 3D tactile sensors,” in Proc. IEEE/RSJ IEEE Int. Conf. Robot. Autom. (ICRA), May 2016, pp. 536–543. Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 2589–2595. [4] I. Akkaya et al., “Solving Rubik’s cube with a robot hand,” 2019, [26] B. Romero, F. Veiga, and E. Adelson, “Soft, round, high resolution arXiv:1910.07113. tactile fingertip sensors for dexterous robotic manipulation,” in Proc. [5] A. Billard and D. Kragic, “Trends and challenges in robot manipu- IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 4796–4802. lation,” Science, vol. 364, no. 6446, Jun. 2019. [Online]. Available: [27] Q. Li, O. Kroemer, Z. Su, F. F. Veiga, M. Kaboli, and H. J. Ritter, https://science.sciencemag.org/content/364/6446/eaat8414 “A review of tactile information: Perception and action through touch,” [6] Y. Chebotar, K. Hausman, Z. Su, G. S. Sukhatme, and S. Schaal, IEEE Trans. Robot., vol. 36, no. 6, pp. 1619–1634, Dec. 2020. “Self-supervised regrasping using spatio-temporal tactile features and [28] T. P. Tomo et al., “A modular, distributed, soft, 3-axis sensor system reinforcement learning,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. for robot hands,” in Proc. IEEE-RAS 16th Int. Conf. Humanoid Robots (IROS), Oct. 2016, pp. 1960–1966. (Humanoids), Nov. 2016, pp. 454–460. [7] H. Yousef, M. Boukallel, and K. Althoefer, “Tactile sensing for [29] T. P. Tomo et al., “Covering a robot fingertip with uSkin: A soft dexterous in-hand manipulation in robotics-A review,” Sens. Actua- electronic skin with distributed 3-axis force sensitive elements for tors A, Phys., vol. 167, no. 2, pp. 171–187, 2011. [Online]. Available: robot hands,” IEEE Robot. Automat. Lett., vol. 3, no. 1, pp. 124–131, https://www.sciencedirect.com/science/article/pii/S0924424711001105 Jan. 2018. [8] M. Andrés and R. Suárez, “Manipulation of unknown objects to improve [30] S. Funabashi, G. Yan, A. Geier, A. Schmitz, T. Ogata, and S. Sugano, the grasp quality using tactile information,” in Proc. SENSORS, vol. 18, “Morphology-specific convolutional neural networks for tactile object 2018, pp. 1628–1635. recognition with a multi-fingered hand,” in Proc. Int. Conf. Robot. [9] T. Narita, S. Nagakari, W. Conus, T. Tsuboi, and K. Nagasaka, “Theo- Autom. (ICRA), May 2019, pp. 57–63. retical derivation and realization of adaptive grasping based on rotational [31] Z. Peng, Z. Li, J. Zhang, Y. Li, G.-J. Qi, and J. Tang, “Few-shot image incipient slip detection,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), recognition with knowledge transfer,” in Proc. IEEE/CVF Int. Conf. May 2020, pp. 531–537. Comput. Vis. (ICCV), Oct. 2019, pp. 441–449. [10] B. Sundaralingam and T. Hermans, “In-hand object-dynamics infer- [32] C. Sferrazza and R. D’Andrea, “Transfer learning for vision-based tactile ence using tactile fingertips,” IEEE Trans. Robot., vol. 37, no. 4, sensing,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 1115–1126, Aug. 2021, doi: 10.1109/TRO.2020.3043675. Nov. 2019, pp. 7961–7967. [11] Z. Abderrahmane, G. Ganesh, A. Crosnier, and A. Cherubini, “Hap- [33] J. M. Gandarias, A. J. Garcia-Cerezo, and J. M. Gomez-de-Gabriel, tic zero-shot learning: Recognition of objects never touched before,” “CNNbased- methods for object recognition with high-resolution tactile Robot. Auto. Syst., vol. 105, pp. 11–25, Jul. 2018. [Online]. Available: sensors,” IEEE Sensors J., vol. 19, no. 16, pp. 6872–6882, 2019. https://www.sciencedirect.com/science/article/pii/S0921889017307492 [34] B. Bauml and A. Tulbure, “Deep n-Shot transfer learning for tactile material classification with a flexible pressure-sensitive skin,” in Proc. [12] A. J. Spiers, M. V. Liarokapis, B. Calli, and A. M. Dollar, “Single- Int. Conf. Robot. Autom. (ICRA), May 2019, pp. 4262–4268. grasp object classification and feature extraction with simple robot hands and tactile sensors,” IEEE Trans. Haptics, vol. 9, no. 2, pp. 207–220, [35] P. Falco, S. Lu, C. Natale, S. Pirozzi, and D. Lee, “A transfer learning Apr./Jun. 2016. approach to cross-modal object recognition: From visual observation to robotic haptic exploration,” IEEE Trans. Robot., vol. 35, no. 4, [13] F. Veiga, B. Edin, and J. Peters, “Grip stabilization through inde- pp. 987–998, Aug. 2019. pendent finger tactile feedback control,” Sensors, vol. 20, no. 6, [36] H. Lee, H. Park, G. Serhat, H. Sun, and K. J. Kuchenbecker, “Calibrating p. 1748, Mar. 2020. [Online]. Available: https://www.mdpi.com/1424- a soft ERT-based tactile sensor with a multiphysics model and sim-to- 8220/20/6/1748 real transfer learning,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), [14] T. Matsubara and K. Shibata, “Active tactile exploration with uncertainty May 2020, pp. 1632–1638. and travel cost for fast shape estimation of unknown objects,” Robot. [37] Z. Ding, N. F. Lepora, and E. Johns, “Sim-to-real transfer for optical tac- Auto. Syst., vol. 91, pp. 314–326, May 2017. [Online]. Available: tile sensing,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, https://www.sciencedirect.com/science/article/pii/S092188901630522X pp. 1639–1645. [15] P. Falco, S. Lu, A. Cirillo, C. Natale, S. Pirozzi, and D. Lee, [38] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Cross-modal visuo-tactile object recognition using robotic active explo- “Grad-CAM++: Generalized gradient-based visual explanations for deep ration,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017, convolutional networks,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. pp. 5273–5280. (WACV), Mar. 2018, pp. 839–847. [16] J. Kwiatkowski, D. Cockburn, and V. Duchaine, “Grasp stability assess- [39] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and ment through the fusion of proprioception and tactile signals using A. M. Dollár, “The YCB object and model set: Towards common convolutional neural networks,” in Proc. IEEE/RSJ Int. Conf. Intell. benchmarks for manipulation research,” in Proc. Int. Conf. Adv. Robot. Robots Syst. (IROS), Sep. 2017, pp. 286–292. (ICAR), Jul. 2015, pp. 510–517. [17] A. Vasquez, Z. Kappassov, and V. Perdereau, “In-hand object shape iden- [40] [XXXX]. tification using invariant proprioceptive signatures,” in Proc. IEEE/RSJ [41] H. Zhou, Z. Li, C. Ning, and J. Tang, “CAD: Scale invariant framework Int. Conf. Intell. Robots Syst. (IROS), Oct. 2016, pp. 965–970. for real-time object detection,” in Proc. IEEE Int. Conf. Comput. Vis. [18] H. Liu, D. Guo, and F. Sun, “Object recognition using tactile measure- Workshops (ICCVW), Oct. 2017, pp. 760–768. ments: Kernel sparse coding methods,” IEEE Trans. Instrum. Meas., [42] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, “A survey vol. 65, no. 3, pp. 656–665, Mar. 2016. on deep transfer learning,” Artificial Neural Networks and Machine [19] S. S. Baishya and B. Bauml, “Robust material classification with a tactile Learning–(ICANN) (Lecture Notes in Computer Science), vol. 11141. skin using deep learning,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots 2018, pp. 270–279. Syst. (IROS), Oct. 2016, pp. 8–15. [43] J. A. Pruszynski and R. S. Johansson, “Edge-orientation processing [20] V. Kumar, A. Gupta, E. Todorov, and S. Levine, “Learning dex- in first-order tactile neurons,” Nature Neurosci., vol. 17, no. 10, terous manipulation policies from experience and imitation,” 2016, pp. 1404–1409, 2014, doi: 10.1038/nn.3804. arXiv:1611.05095. [44] G. Cheng, E. Dean-Leon, F. Bergner, J. R. G. Olvera, Q. Leboutet, and [21] M. Meier, F. Patzelt, R. Haschke, and H. J. Ritter, “Tactile convolutional P. Mittendorfer, “A comprehensive realization of robot skin: Sensors, networks for online slip and rotation detection,” in Artificial Neural sensing, control, and applications,” Proc. IEEE, vol. 107, no. 10, Networks and Machine Learning– I C AN N , vol. 9887. 2016, pp. 12–19. pp. 2034–2051, Oct. 2019. [22] W. Yuan, Y. Mo, S. Wang, and E. Adelson, “Active clothing [45] D. Inoue, M. Konyo, and S. Tadokoro, “Distributed tactile sensors material perception using tactile sensing and deep learning,” 2017, for tracked robots,” in Proc. 5th IEEE Conf. Sensors, Oct. 2006, arXiv:1711.00574. pp. 1309–1312. [23] M. Lambeta et al., “DIGIT: A novel design for a low-cost compact [46] A. Glowacz, “Thermographic fault diagnosis of ventilation in BLDC high-resolution tactile sensor with application to in-hand manipulation,” motors,” Sensors, vol. 21, no. 21, p. 7245, Oct. 2021. [Online]. Avail- IEEE Robot. Automat. Lett., vol. 5, no. 3, pp. 3838–3845, Jul. 2020. able: https://www.mdpi.com/1424-8220/21/21/7245 Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. FUNABASHI et al.: TACTILE TRANSFER LEARNING AND OBJECT RECOGNITION WITH A MULTIFINGERED HAND 15 [47] A. Glowacz, “Ventilation diagnosis of angle grinder using thermal imag- Alexander Schmitz (Member, IEEE) received the ing,” Sensors, vol. 21, no. 8, p. 2853, Apr. 2021. [Online]. Available: master’s degree (Hons.) from the University of https://www.mdpi.com/1424-8220/21/8/2853 Vienna, Vienna, Austria, in 2007, and the Ph.D. [48] A. Glowacz, “Fault diagnosis of electric impact drills using ther- degree from The University of Sheffield, Sheffield, mal imaging,” Measurement, vol. 171, Feb. 2021, Art. no. 108815. U.K., in 2011. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ He performed his Ph.D. research as part of a S0263224120313099 joint location program with the Italian Institute [49] S. Funabashi et al., “Multi-fingered in-hand manipulation with various of Technology, Genoa, Italy. He is currently an object properties using graph convolutional networks and distributed tac- Associate Professor with the Department of Modern tile sensors,” IEEE Robot. Automat. Lett., vol. 7, no. 2, pp. 2102–2109, Mechanical Engineering, Waseda University, Tokyo, Apr. 2022. Japan. He has published 14 journal articles, one book [50] T. N. Kipf and M. Welling, “Semi-supervised classification with graph chapter, and 34 international conference papers. Furthermore, he has applied convolutional networks,” Tech. Rep., 2017. for three national and five international patents. His research interests include tactile sensing, intrinsically safe actuation, human symbiotic robotics, and robotic object handling. Dr. Schmitz received a grant of 117 million JPY from the JST START Program (Program for Creating STart-ups from Advanced Research and Technology) in 2016. Satoshi Funabashi (Member, IEEE) received the B.E., M.E., and Ph.D. degrees from Waseda Uni- Lorenzo Jamone (Member, IEEE) received the versity, Tokyo, Japan, in 2015, 2017, and 2021, M.S. degree (Hons.) in computer engineering from respectively. the University of Genoa, Genoa, Italy, in 2006, and Since 2021, he has been a Junior Researcher with the Ph.D. degree in humanoid technologies from the the Institute for AI and Robotics, Future Robot- Italian Institute of Technology, Genoa, in 2010. ics Organization, Waseda University. From 2018 He was an Associate Researcher with the Takan- to 2019, he was a Visiting Student with the Com- ishi Laboratory, Waseda University, Tokyo, Japan, puter Science and Artificial Intelligence Laboratory, from 2010 to 2012, and Vislab, Instituto Superior Massachusetts Institute of Technology, Cambridge, Técnico, Lisbon, Portugal, from 2012 to 2016. He is MA, USA. His current research interests include currently a Senior Lecturer in robotics with the multifingered hands, tactile perception, and dexterous manipulation. School of Electronic Engineering and Computer Dr. Funabashi received the Research Fellowship for Young Scientists DC1 Science, Queen Mary University of London, London, U.K. He is part of from the Japan Society for the Promotion of Science (JSPS) and the Strategic Advanced Robotics at Queen Mary (ARQ), London. He is the Founder and the Basic Research Programs ACT-I from the Japan Science and Technology Director of the CRISP Group: Cognitive Robotics and Intelligent Systems for Agency (JST) in 2017 and 2018, respectively. the People. He has over 100 publications with an H-index of 26. His current research interests include cognitive robotics, robotic manipulation, and tactile sensing. Dr. Jamone has been a Turing Fellow since 2018. Tetsuya Ogata (Member, IEEE) received the B.S., Gang Yan (Student Member, IEEE) received M.S., and D.E. degrees in mechanical engineering the B.E. degree from Northeastern University, from Waseda University, Tokyo, Japan, in 1993, Shenyang, China, in 2016, and the M.E. degree from 1995, and 2000, respectively. Waseda University, Tokyo, Japan, in 2020, where He was a Research Associate with Waseda Univer- he is currently pursuing the Ph.D. degree with the sity from 1999 to 2001. From 2001 to 2003, he was Department of Modern Mechanical Engineering. a Research Scientist with the RIKEN Brain Science In 2022, he was a Visiting Student with the Robot- Institute, Saitama, Japan. From 2003 to 2012, he was ouch Laboratory, The Robotics Institute, Carnegie an Associate Professor with the Graduate School Mellon University, Pittsburgh, PA, USA. He has of Informatics, Kyoto University, Kyoto, Japan. been working on research about grasping stability From 2009 to 2015, he was a JST (Japan Science and estimation and slip detection relying on either tactile Technology Agency) PREST Researcher. Since 2012, he has been a Professor or multimodal tactile-visual sensing using a data-driven approach. His research with the Faculty of Science and Engineering, Waseda University. Since 2017, results have been published at the International Conference on Robotics and he has been a Joint-Appointed Fellow with the Artificial Intelligence Research Automation and the IEEE ROBOTICS AND AUTOMATION L ETTERS . His Center, National Institute of Advanced Industrial Science and Technology, current research interests include tactile perception, tactile sensor simulation, Tokyo. robotic manipulation, and human–robot interaction. Shigeki Sugano (Fellow, IEEE) received the B.S., M.S., and D.E. degrees in mechanical engineering from Waseda University, Tokyo, Japan, in 1981, 1983, and 1989, respectively. Since 1986, he has been a Faculty Member of Fei Hongyi (Student Member, IEEE) received the the Department of Mechanical Engineering, Waseda B.E. degree from China Jiliang University, Zhejiang, University, where he is currently a Professor. Since China, and the M.E. degree from Waseda University, 2014, he has been the Dean of the School/Graduate Tokyo, Japan, in 2022. School of Creative Science and Engineering, Waseda Since 2020, he has been a student with the Depart- University. Since 2020, he has been the Senior Dean ment of Modern Mechanical Engineering, Waseda of the Faculty of Science and Engineering, Waseda University. Since 2020, he has been working on University. research about object recognition and dexterous Dr. Sugano is a fellow of four academic societies: IEEE, the Japan Society in-hand manipulation with multifingered hands and of Mechanical Engineers, the Society of Instrument and Control Engineers, tactile sensors. He has developed a controlling sys- and the Robotics Society of Japan. He has served as the General Chair of the tem of anthropomorphic hands and machine learning IEEE/ASME International Conference on Advanced Intelligent Mechatronics methods. He worked on the project for generating in-hand manipulation in 2003 and the IEEE/RSJ International Conference on Intelligent Robots and published for IEEE ROBOTICS AND AUTOMATION L ETTERS (RA-L) in 2022. Systems in 2013. From 2001 to 2010, he has served as the President of the His research interests include machine learning, tactile sensing, and dexterous Japan Association for Automation Advancement. In 2017, he has served as manipulation. the President of SICE. Authorized licensed use limited to: Queen Mary University of London. Downloaded on November 09,2022 at 10:54:40 UTC from IEEE Xplore. Restrictions apply.