On Lightweight Privacy-Preserving Collaborative Learning for Internet-of-Things Objects

Linshan Jiang, School of Computer Science and Engineering, Nanyang Technological University ([email protected])
Rui Tan, School of Computer Science and Engineering, Nanyang Technological University ([email protected])
Xin Lou, Advanced Digital Sciences Center, Illinois at Singapore Pte Ltd ([email protected])
Guosheng Lin, School of Computer Science and Engineering, Nanyang Technological University ([email protected])

ABSTRACT

The Internet of Things (IoT) will be a main data generation infrastructure for achieving better system intelligence. This paper considers the design and implementation of a practical privacy-preserving collaborative learning scheme, in which a curious learning coordinator trains a better machine learning model based on the data samples contributed by a number of IoT objects, while the confidentiality of the raw forms of the training data is protected against the coordinator. Existing distributed machine learning and data encryption approaches incur significant computation and communication overhead, rendering them ill-suited for resource-constrained IoT objects. We study an approach that applies independent Gaussian random projection at each IoT object to obfuscate the data and trains a deep neural network at the coordinator based on the projected data from the IoT objects. This approach introduces light computation overhead to the IoT objects and moves most of the workload to the coordinator, which can have sufficient computing resources. Although the independent projections performed by the IoT objects address the potential collusion between the curious coordinator and some compromised IoT objects, they significantly increase the complexity of the projected data. In this paper, we leverage the superior capability of deep learning in capturing sophisticated patterns to maintain good learning performance. Extensive comparative evaluation shows that this approach outperforms other lightweight approaches that apply additive noisification for differential privacy and/or support vector machines for learning, in applications with light data pattern complexities.

CCS CONCEPTS

• Computer systems organization → Sensor networks; • Computing methodologies → Supervised learning; • Security and privacy → Domain-specific security and privacy architectures.

KEYWORDS

Internet of Things, collaborative learning, privacy

ACM Reference Format:
Linshan Jiang, Rui Tan, Xin Lou, and Guosheng Lin. 2019. On Lightweight Privacy-Preserving Collaborative Learning for Internet-of-Things Objects. In International Conference on Internet-of-Things Design and Implementation (IoTDI '19), April 15–18, 2019, Montreal, QC, Canada. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3302505.3310070

© 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6283-2/19/04. $15.00. https://doi.org/10.1145/3302505.3310070

1 INTRODUCTION

The recent research advances in machine learning have led to performance breakthroughs in various tasks such as image classification, speech recognition, and language understanding. The drastically increasing amount of data generated by the Internet of Things (IoT) will further foster machine learning performance and enable new applications in various domains. In particular, collaborative learning, which builds a machine learning model (e.g., a supervised classifier) based on the training data contributed by many participants, is a desirable and empowering paradigm for smarter IoT systems. By leveraging the much increased volume of training data and coverage of data patterns, collaborative learning approaches the intelligence of a crowd and improves the learning performance beyond that achieved by any single participant alone. Moreover, a resource-rich learning coordinator (e.g., a desktop-class edge device or a cloud computing service) allows the execution of advanced, compute-intensive machine learning algorithms to capture deeper structures in the aggregated data, whereas the participants (e.g., IoT objects) are often resource-constrained and incompetent for intensive computation. By contributing training data, the individual participants enjoy the improved machine intelligence in return.

However, the data contributed by the participants may contain privacy-sensitive information. On the Internet, various online services (e.g., webmail and social networking) generally collect and analyze user data in raw form. In this scheme, the users risk privacy leakage due to potential inadvertent actions by the service providers and/or targeted cyber-attacks from external parties. This risk has been evidenced by several recent large-scale user privacy leakage incidents [7, 34, 38]. Data anonymization can mitigate the concern, but it is inadequate for privacy preservation, because cross-correlations among different databases may be used to re-identify data [33]. Moreover, the correlations between different properties of anonymous individuals (e.g., race, income, political views, etc.) can be exploited to identify an interested user group to target for advertisement and advocacy. In the coming era of IoT, with many smart objects penetrating into our private space and time, the current raw data collection approach will only raise larger privacy concerns and may potentially violate relevant laws such as the recent General Data Protection Regulation in the European Union. Therefore, to be successful, IoT-driven collaborative learning applications must be privacy-preserving.

Privacy-preserving collaborative learning (PPCL) has received increasing research attention recently under enterprise settings, where the participants are entities with rich computing resources. The existing approaches can be broadly classified into two categories. The first category of approaches [9, 23, 31, 36, 40] follows the distributed machine learning (DML) scheme, such that the participants do not need to transmit the training data to the coordinator. The recently proposed federated learning [31] is a type of DML. In the second category of approaches [12, 19, 22], each participant applies homomorphic encryption to the data before transmitting it to the coordinator, such that the training and inference can be performed on ciphertexts. However, for resource-constrained IoT objects, these DML and data encryption approaches incur significant and even prohibitive compute overhead. DML requires the participants to execute machine learning algorithms to train local models, which is often too compute-intensive for IoT objects. Moreover, the iterative communication rounds of DML introduce large communication overhead. Currently, homomorphic encryption algorithms are still too compute-intensive to be realistic for resource-constrained devices (cf. §6). Therefore, these existing approaches are ill-suited or impractical for resource-constrained smart objects beneath the IoT edge.

In this paper, we study the design and implementation of a PPCL approach that is lightweight for resource-constrained participants, while remaining privacy-preserving against an honest-but-curious learning coordinator. The coordinator can be a cloud server or a resource-rich edge device, e.g., an access point, base station, network router, etc. We propose to apply (1) multiplicative Gaussian random projection (GRP) at the resource-constrained IoT objects to obfuscate the contributed training data and (2) deep learning at the coordinator to address the much increased complexity of the data patterns due to the GRP. Specifically, each participant uses a private, time-invariant but randomly generated Gaussian matrix to project each plaintext training data vector and transmits the result to the coordinator. GRP gives several privacy preservation properties: (1) the computational difficulty for the coordinator to reconstruct the plaintext without knowing the Gaussian matrix [30, 37], and (2) quantifiable plaintext reconstruction error bounds even if the coordinator obtains the Gaussian matrix [30]. From a system perspective, GRP is computationally lightweight and does not increase the data volume. Thus, GRP is a practical privacy protection method suitable for resource-constrained IoT objects. Regarding GRP's impact on the design of the machine learning algorithms, the random projection can be viewed as a process of mapping the original data vectors to some domain in which the data vectors in different classes are less separable. If the original data vectors are readily separable (that is, they are features), the inverse of the Gaussian matrix can be considered as a linear feature extraction matrix. With deep learning's unsupervised feature learning capability, this inverse matrix can be implicitly captured by the trained deep model. Thus, we conjecture that the randomly projected training samples can still be used by the coordinator to build the deep model for classification.

To achieve robustness of the privacy preservation against collusion between any single participant and the curious learning coordinator, each participant should generate its own Gaussian matrix independently. However, this presents a main challenge to the PPCL system's scalability with respect to the number of participants (denoted by N). Specifically, assuming that the training data samples for each class are horizontally distributed among the participants, the number of data patterns for a class will increase from one in the plaintext domain to N in the projected data domain. This increased pattern complexity is to be addressed by the strong learning capability of deep learning. Thus, in the proposed PPCL approach, most of the computational workload is offloaded to the resourceful coordinator at the edge or in the cloud, unlike the existing DML and homomorphic encryption approaches that introduce significant or prohibitive compute overhead to the smart objects beneath the IoT edge.

To understand the effectiveness of the GRP approach and its scalability with the number of participants and the pattern complexity of the training data, we conduct extensive evaluation to compare GRP with several other lightweight PPCL approaches. The evaluation is based on two example applications with low and moderate pattern complexities, i.e., handwritten digit recognition and spam e-mail detection. The baseline approaches include various combinations of (1) multiplicative GRP versus additive noisification for differential privacy (DP) at the participants, and (2) deep neural networks (DNNs), including the multilayer perceptron (MLP) and convolutional neural network (CNN), versus support vector machines (SVMs) at the coordinator. The results show that, for the two example applications, the proposed GRP-DNN approach can support up to hundreds of participants without sacrificing much learning performance, whereas the GRP-SVM approach may fail to capture the projected data patterns and the performance of the DP-DNN approach is susceptible to the additive noisification. The results of this paper suggest that GRP-DNN is a practical PPCL approach for resource-constrained IoT objects observing data with low- or moderate-complexity patterns.

We also implement GRP-DNN, Crowd-ML [23] (a federated learning approach based on shallow learning), and CryptoNets [19] (a homomorphic encryption approach) on a testbed of 14 IoT devices. Experiments show that, compared with GRP-DNN, Crowd-ML incurs 350x compute overhead and 3.5x communication overhead on each IoT device. Deep federated learning would only incur more compute overhead. CryptoNets incurs 2.6 million times higher compute overhead on the IoT device, compared with GRP.

The remainder of this paper is organized as follows. §2 introduces the background and preliminaries. §3 reviews related work. §4 states the problem and overviews our approach. §5 presents the learning performance evaluation for various lightweight PPCL approaches. §6 presents the benchmark results of GRP-DNN, Crowd-ML, and CryptoNets on the testbed. §7 concludes this paper.
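To make the lightweight projection step concrete, here is a minimal NumPy sketch (ours, not the authors' implementation; the dimensions d = 784 and k = 196 are illustrative choices, e.g., a flattened 28x28 image compressed 4x). It shows the participant-side cost, a single matrix-vector product per sample, and the distance-preservation behavior (Property 1 in §2.2) that makes learning on projected data plausible.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma = 784, 196, 1.0   # plaintext dim, projected dim; compression ratio rho = d/k = 4

# Participant side: a private Gaussian matrix, generated once and kept secret.
R = rng.normal(0.0, sigma, size=(k, d))

def project(x):
    """Scaled GRP: y = R x / (sqrt(k) * sigma); per-sample cost is one mat-vec."""
    return (R @ x) / (np.sqrt(k) * sigma)

# Two plaintext vectors (e.g., flattened images) and their projections.
x1, x2 = rng.random(d), rng.random(d)
y1, y2 = project(x1), project(x2)

# Property 1 flavor: the squared Euclidean distance is preserved in expectation,
# so class structure usable by a learner survives the obfuscation.
true_d2 = np.sum((x1 - x2) ** 2)
proj_d2 = np.sum((y1 - y2) ** 2)
print(f"relative distance error: {abs(proj_d2 - true_d2) / true_d2:.3f}")
```

The same matrix R is reused for every sample of this participant (it is time-invariant), while different participants draw different matrices; it is exactly this independence that breaks cross-participant distance preservation and motivates the deep model at the coordinator.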
2 BACKGROUND AND PRELIMINARIES

2.1 Supervised Collaborative Learning

Supervised machine learning has two phases, i.e., the learning phase and the classification phase. We now formally describe the collaborative learning scheme. The trained classifier, denoted by h(x|θ), classifies a d-dimensional data vector x ∈ ℝ^d into one of a finite number of classes represented by a set C. The learning process determines the parameter θ based on the training data. Let N denote the number of participants of the collaborative learning. Let D_i denote the set of M_i training data samples generated by participant i, i.e., D_i = {(x_{i,j}, y_{i,j}) | j ∈ [1, M_i], y_{i,j} ∈ C}, where x_{i,j} is the training data vector and y_{i,j} is the corresponding class label. For a training data sample (x, y), denote by l(h(x|θ), y) the loss function. The collaborative learning solves the following problem to determine the optimal classifier parameter, denoted by θ*:

    θ* = argmin_θ Σ_{i=1}^{N} (1/M_i) Σ_{j=1}^{M_i} l(h(x_{i,j}|θ), y_{i,j}) + λ‖θ‖²,   (1)

where λ‖θ‖² is a regularization term. With θ*, the classification for a test data sample x is to compute h(x|θ*).

A simple approach is to collect all the plaintext training data at the coordinator and solve Eq. (1). However, this approach raises the concern of privacy breach, as the raw training data are generally privacy-sensitive. The problem of solving Eq. (1) without threatening the participants' privacy contained in D_i, i = 1, ..., N, is called PPCL. Existing approaches to PPCL are reviewed in §3.

2.2 Gaussian Random Projection (GRP)

This section reviews two properties of GRP. Let R ∈ ℝ^{k×d} represent a random Gaussian matrix, i.e., each element of R is drawn independently from the normal distribution N(0, σ²). GRP has the following two properties [30]:

Property 1. For data vectors x1, x2 and their projections y1 = (1/(√k·σ))·R·x1 and y2 = (1/(√k·σ))·R·x2, the dot product and Euclidean distance between y1 and y2 are unbiased estimates of those between x1 and x2, i.e., E[y1⊤y2] = x1⊤x2 and E[‖y1 − y2‖²₂] = ‖x1 − x2‖²₂. The estimation error bounds are Var[y1⊤y2] ≤ 2/k and Var[‖y1 − y2‖²₂] ≤ 32/k.

Property 2. Given a Gaussian matrix instance R ∈ ℝ^{k×d} where k < d and the projection y = (1/(√k·σ))·R·x, the minimum-norm estimate of x, denoted by x̂, is an unbiased estimate of x, i.e., E[x̂] = x. The estimation error for the i-th element of x is Var[x̂_i] = (2/k)·x_i² + (1/k)·Σ_{j≠i} x_j².

Based on Property 1, the study [30] shows that a trained SVM classifier can be transferred to classify the projected data. In a recent study [45], a random projection layer that can be implemented by GRP is added to an MLP for dimension reduction. Such a design is also based on Property 1. However, the studies [30, 45] do not address collaborative learning and privacy.

The estimation error given by Property 2 will be used in later sections of this paper to measure the degree of privacy protection provided by our proposed approach.

3 RELATED WORK

Existing PPCL approaches can be classified into two categories, i.e., distributed machine learning (§3.1) and training data encryption or obfuscation (§3.2). §3.3 reviews other related work.

3.1 Distributed Machine Learning (DML)

DML approaches exploit the computing capability of the participants to solve Eq. (1) using some variant of stochastic gradient descent (SGD) in a distributed manner. During the learning process, the training data samples are not transmitted. The studies [23, 31, 32, 40] share the similar idea of exchanging gradients and classifier parameters among the participants, coordinated by the coordinator. Specifically, in the Crowd-ML approach [23], a participant checks out the global classifier parameters θ from the coordinator and computes the gradients using its own training data. Then, the participants transmit the gradients to the coordinator, which updates θ. In [40], each participant trains a local deep model using SGD and uploads a selected portion of the gradients to the coordinator for combining. Then, each participant downloads a selected portion of the global gradients to update its local deep model. As the exchanged gradients and classifier parameters may still contain privacy, the approaches [23, 40] add random noise to the exchanged values for differential privacy [17]. In the federated learning scheme [31], the coordinator periodically pulls the deep models trained locally by the participants based on their training data and returns an average deep model to the participants. In [32], the participant adds random noise to the deep model parameters before they are sent to the coordinator for privacy protection in the federated learning process.

However, the above DML approaches have the following limitations. First, the local training introduces computation overhead to the participants. Training a DNN locally may be infeasible for resource-constrained IoT objects. Second, DML approaches often require many iterations for the learning algorithm to converge, which may incur a high volume of data traffic between each participant and the coordinator. In §6, we will show this by comparing Crowd-ML [23] and our proposed approach. Third, as shown recently in [24], generative adversarial networks can generate prototypical training data samples based on the exchanged gradients and model parameters, weakening the privacy preservation claimed in [31, 40]. In [36] and [9], homomorphic encryption and secure aggregation have been applied to enhance the privacy preservation of the approach in [40] and the federated learning in [31], respectively. With these enhancements, only the encrypted gradients [36] and the aggregate model update [9] are revealed to the honest-but-curious coordinator. However, these privacy enhancements further increase the computation overhead of each participant, making them even less suitable for resource-constrained IoT objects.

3.2 Training Data Encryption/Obfuscation

Different from the DML approaches that transmit the classifier's parameters, the approaches [22, 29, 39] transmit encrypted or obfuscated training data to the coordinator to solve Eq. (1). The approach proposed in this paper also belongs to this category. In the following, we review each of [22, 29, 39] and then discuss our new design to overcome their shortcomings.

In [22], homomorphic encryption is integrated with a Linear Means classifier and Fisher's Linear Discriminant classifier. During both the training and classification phases, the participant transmits the homomorphically encrypted data vector to the coordinator. However, homomorphic encryption results in intensive computation and an increased volume of data transmissions (cf. §6). Thus, although the homomorphic encryption approach provides provable confidentiality protection, it is inefficient and unrealizable on many resource-constrained IoT platforms.

To reduce the computation and communication overheads, Liu et al. [29] propose a data obfuscation approach based on random projection. Specifically, participant i independently generates a random matrix R_i and transmits the obfuscated training dataset {(R_i x_{i,j}, y_{i,j}) | j ∈ [1, M_i]} to the coordinator. However, different from Property 1 in §2.2, which requires the same projection matrix, the approach [29] uses distinct projection matrices for different participants and thus no longer preserves the Euclidean distance, i.e., ‖R_u x_{u,p} − R_v x_{v,q}‖ ≠ ‖x_{u,p} − x_{v,q}‖. This results in poor training performance for distance-based classifiers, such as k-nearest neighbors and SVM. To address this issue, the study [29] designs a regression phase before the learning phase. Specifically, the coordinator sends a number of public data vectors {z_k | k = 1, 2, ...} to all participants, and participant i returns the projected data {R_i z_k | k = 1, 2, ...}. Based on the original and projected public data vectors, a regression function f_{uv}(·, ·) for each participant pair (u, v) is learned such that f_{uv}(R_u x_{u,p}, R_v x_{v,q}) ≃ ‖x_{u,p} − x_{v,q}‖. As a result, the distance-based classifiers can be trained in the domain of obfuscated data by using the learned regression functions to compute distances during the training phase.

However, the approach [29] has two shortcomings. First, it is only applicable to distance-based classifiers. These conventional classifiers do not scale well with the volume of the training data and the complexity of the data patterns [42]. It is desirable to support the DNNs that give state-of-the-art learning performance in a range of applications. Second, obfuscating the public data vectors and returning the results may invite known-plaintext attacks and engenders a clear privacy breach concern. For instance, a proactively curious coordinator may use a public data vector z_k = [1, 0, 0, ..., 0]⊤ to extract the first column of R_i. Other columns of R_i can be similarly extracted using specific public data vectors. Even without these specific public data vectors, in general, the private random projection matrix R_i can be estimated using regression analysis based on a number of public data vectors and the corresponding projections.

The study [39] also uses random projection to obfuscate the data vector x in training and executing a Sparse Representation Classifier. However, all participants use the same random projection matrix, rendering the system vulnerable to collusion between any single participant and the coordinator.

Different from [39], each participant in our approach uses its own private random projection matrix, rendering the collusion futile. Different from [29], our approach uses DNNs and leverages their deep learning capability to avoid the regression phase that is vulnerable to known-plaintext attacks. Different from [22], which is too compute-intensive for IoT objects, our approach uses GRP, which introduces only light computation overhead.

3.3 Other Related Work

In CryptoNets [19], the computation of each neuron in a neural network trained using plaintext data is performed in the domain of homomorphic encryption. During the classification phase, the participant sends the homomorphically encrypted data to the coordinator for classification. The work [12] extends [19] to support more hidden layers. However, these studies [12, 19] address privacy-preserving classification outsourcing (i.e., offloading the classification computation to an honest-but-curious entity), rather than the collaborative learning addressed in this paper. The training in [12, 19] is performed on plaintext data. Moreover, homomorphic encryption is too computation-intensive for resource-constrained IoT devices, as will be shown in §6.

Differentially private machine learning (DPML) [5, 14, 41] builds a classifier that cannot be used to infer the training data. The training of the classifier is based on plaintext data. For DNNs, DPML can be achieved by perturbing the gradients in each iteration of SGD with additive noise [5, 41]. DPML and PPCL address different problems: PPCL preserves the privacy of the training data against the honest-but-curious coordinator who builds the classifier, whereas DPML trusts the classifier builder and preserves the privacy of the training data against a curious user of the classifier. Thus, in DPML, the plaintext training dataset is available to the classifier builder; differently, in PPCL, only encrypted or obfuscated training data is made available to the classifier builder (i.e., the learning coordinator).

4 PROBLEM STATEMENT AND APPROACH

In this section, we state the PPCL problem in §4.1 and present the proposed GRP approach in §4.2. §4.3 provides two illustrating examples for insights into understanding the effect of GRP on training DNN-based classifiers. §4.4 discusses two other alternative approaches for lightweight PPCL and their limitations.

4.1 Problem Statement

This section states the problem addressed in this paper. §4.1.1 defines the system model; §4.1.2 defines the threat and privacy models; §4.1.3 discusses several relevant issues.

4.1.1 System model. In this paper, we consider a PPCL system with N resource-constrained participants and an honest-but-curious coordinator with sufficient computation power. Fig. 1 illustrates the system.

Figure 1: A collaborative learning system.

During the learning phase, the participants contribute training data samples to build a supervised classifier. As discussed in §2.1, the training dataset D_i contributed by participant i consists of M_i data vectors {x_{i,j} | j ∈ [1, M_i]} and the corresponding class labels {y_{i,j} | j ∈ [1, M_i]}. As the learning process is often compute-intensive, most of the learning computation should be accomplished by the coordinator. In this paper, we focus on addressing the problem of building an effective supervised classifier while protecting certain privacy contained in the data vectors.

4.1.2 Threat and privacy models. The privacy concern regarding the data vectors is primarily due to the fact that the data vectors may contain information beyond the classification objective in question. For example, consider a PPCL system for training a classifier to recognize human body activity (e.g., sitting, walking, climbing stairs, etc.). The recognition is based on various body signals (e.g., motion, heart rate, breath rate, etc.) that are captured by wearable sensors. However, the raw body signals can also be used to infer the health status of the participants and even pinpoint the patients of certain diseases. In this paper, we adopt the following threat and privacy models.

Threat model: It consists of the following two aspects:
• Honest-but-curious coordinator: We assume that the coordinator will honestly coordinate the collaborative learning process, aiming to train the best supervised classifier. Thus, it will not tamper with any data collected from or transmitted to the participants. However, the coordinator is curious about the participants' privacy contained in the training data vectors.
• Potential collusion between participants and coordinator: We assume that the participants are not trustworthy in that they may collude with the coordinator in finding out other participants' privacy contained in the data vectors. The colluding participants are otherwise honest, i.e., they faithfully contribute their training data to improve the supervised classifier. The design of the PPCL system should maintain the privacy preservation for a participant when any or all other participants are colluding with the coordinator.

Privacy model: The raw form of each data vector is the participant's privacy to be protected. The coordinator's error in estimating the raw data form can be used as a metric to measure the degree of privacy protection. Data form confidentiality is an immediate and basic privacy requirement in many applications.

4.1.3 Several other issues. In the following, we discuss three issues that are related to privacy protection:
• Training data anonymization: We aim to support anonymization of the training data. That is, the coordinator should not expect to know the participant's identity for any received training data sample. Moreover, the coordinator cannot determine whether any two training data samples are from the same participant. To achieve the above strong anonymity, the training data samples can be transmitted in separate sessions via an anonymous communication network [16]. Moreover, the transmissions of the data samples from all participants can be interleaved randomly, such that the coordinator cannot associate the data samples from the same participant by their arrival times. Note that the training data anonymization requirement is not mandatory, because the anonymous communication may incur large overhead for some resource-constrained IoT objects. However, the design of our PPCL approach does not leverage the participants' identities, so as to support data anonymization.
• Label privacy: The class labels {y_{i,j} | j ∈ [1, M_i]} may also contain information about the participant. In this paper, we do not consider label privacy because the participant willingly contributes the labeled data vectors and should have no expectation of privacy regarding the labels. In practice, several means can be taken to mitigate the concern of label privacy leakage. First, training data anonymization mitigates the concern during the learning phase. Second, during the classification phase, if the participant has sufficient processing capability to perform the classification computation, the coordinator may send the trained model to the participant for local execution. Existing studies have enabled the execution of deep models on personal and low-end devices [25, 47]. Low-power inference chips (e.g., Google's Edge TPU [21]) will further enhance low-end devices' capabilities in executing classification models. Note that the studies [25, 47] and the inference chips are not intended to support the much more compute-intensive training.
• Other privacy models: Differential privacy [17], which aims to achieve indistinguishability of different data vectors, is another widely used quantifiable privacy definition. However, as discussed in §4.4 and evaluated in §5, the additive noisification implementation of differential privacy is ill-suited for PPCL.

4.2 Gaussian Random Projection Approach

Existing DML and homomorphic encryption approaches incur significant computation and communication overhead due to the many computation/communication rounds and the swell of data volume. In §6, we will provide benchmark results to show this. Thus, these approaches are not promising for resource-constrained participants. This section describes a GRP-based approach that is computationally lightweight and communication-efficient for the participants. The overview of our approach is as follows.

At system initialization, each participant i independently generates a random Gaussian matrix R_i ∈ ℝ^{k×d}, where d is the dimension of the data vector. During the learning phase, participant i keeps R_i secret and uses it to project all the training data vectors. Participant i transmits the projected training dataset D_i = {(R_i x_{i,j}, y_{i,j}) | j ∈ [1, M_i], y_{i,j} ∈ C} to the coordinator. After collecting all projected training datasets D_i, i = 1, ..., N, the coordinator applies deep learning algorithms to train the classifier h(·|θ*). During the classification phase, participant i still uses R_i to project the test data vector x and obtains the classification result h(R_i x|θ*). As discussed in §4.1, the classification computation can be carried out at the participant or the coordinator, depending on whether the participant is capable of executing the trained deep model. In our approach, each participant independently generates its random projection matrix to counteract collusion between participants and the coordinator. We now explain the two key components of our approach: GRP and deep learning on projected data.

4.2.1 Gaussian random projection. In this work, we adopt Gaussian matrices. Specifically, each element of R_i is sampled independently from the standard normal distribution [6]. The rationale for choosing Gaussian matrices will be explained in §4.3.2. We set the row dimension of R_i smaller than or equal to its column dimension, i.e., k ≤ d. Thus, the GRP can also compress the data vector. We define the compression ratio as ρ = d/k. The rationale for admitting compression into the training data projection is as follows. From compressive sensing theory [11], a sparse signal can be represented by a small number of linear projections of the original signal and recovered faithfully. Therefore, in the compressively projected data vector, the feature information still exists, provided that the adopted compression ratio is within an analytic bound [11]. In §5, we will evaluate the impact of the compression ratio ρ on the learning performance.

With GRP, if R_i is kept confidential to the coordinator, it is computationally difficult (practically impossible) for the coordinator to generate a meaningful reconstruction of the original data vector from the projected data vector [30, 37]. Thus, GRP protects the form of the original data. In the worst case where the coordinator obtains R_i, the estimation error given by Property 2 in §2.2 can be used as a measure of privacy protection. Random projection has been used as a lightweight approach to protect data form confidentiality in various contexts [28, 43, 44, 46].

4.2.2 Deep learning on projected data. Feature extraction is a critical step of supervised learning. With traditional shallow learning, the classification system designer needs to handcraft the features; in contrast, deep learning learns the feature extraction automatically, which helps address the distortion caused by the GRP. In practice, effective feature extractions are generally non-linear mappings. Neural network-based deep learning has shown strong capability in capturing sophisticated features beyond the ideal linear features discussed above. In this paper, based on multiple datasets, we will investigate the effectiveness of deep learning in addressing the distortion caused by the GRP.

As discussed earlier, each participant independently generates a Gaussian matrix to counteract the potential collusion between participants and the coordinator. However, this introduces a challenge to deep learning, because the pattern for a class of projected data vectors from N participants will be a composite of N different patterns. Thus, intuitively, a deeper neural network and a larger volume of training data will be needed to capture well the data patterns with increased complexity due to the participants' independence in generating their projection matrices. We note that the participants' independence also engenders the following possible situation that undermines the learning performance and leads to classification errors: R_u x_u = R_v x_v, where x_u and x_v are generated by participants u and v and belong to different classes. However, for high-dimensional data vectors, the probability of this situation is low. The more complex data patterns due to the independent projection matrix generation remain the major challenge. In this paper, we conduct extensive experiments to assess how well deep learning can scale with the number of participants, compared with traditional learning approaches.
The emerging deep learning method [26] automates the de- sign of feature extraction by unsupervised feature learning, which 4.3 Illustrating Examples is often based on a neural network consisting of a large number of We use two examples to illustrate the intuitions discussed in §4.2. parameters. Thus, the deep model is often a tandem of the feature extraction stage and the classification stage. For example, a convo- 4.3.1 A 2-dimensional example. We consider a PPCL system with lutional neural network (CNN) for image classification consists of four participants (i.e., N = 4) to build a two-class classifier. The convolutional layers and dense layers, which are often considered original data vectors in the two classes follow two 2-dimensional performing the feature extraction and classification, respectively. Gaussian distributions with means of [−2, −2] ⊺ and [2, 2] ⊺ , and the Our approach leverages on the unsupervised feature learning ca- same covariance matrix of [1, 0; 0, 1]. Fig. 2(a) shows the plaintext pability of deep learning to address the data distortion introduced data vectors generated by the four participants. From the figure, by the GRP. We now illustrate this using a simple example system, the plaintext data vectors of the two classes can be easily separated in which there is only one participant and the projection matrix R using a simple hyperplane. Each participant independently gener- is a square invertible matrix. Moreover, we make the following two ates a Gaussian random matrix. Figs. 2(b)-2(e) show the projected assumptions to simplify our discussion. First, we assume that a lin- data vectors of each participant. We can see that the patterns of the ear transform Ψ ∈ Rf ×d gives effective features of the data vectors, projected data vectors are different across the participants. Fig. 2(f) where f is the feature dimension. 
That is, f = Ψx is an effective shows the mixed projected data vectors received from all partici- representation of the data vector x for classification. Second, we as- pants. Compared with Fig. 2(a), the pattern of the mixed projected sume that Ψ can be learned in the form of a neural network by the data from all participants is highly complex. Moreover, no simple unsupervised feature learning. Now, we discuss the impact of the hyperplane can well divide the two classes. random projection on the unsupervised feature learning. After the We also generate two other sets of the random projection ma- projection, the data vector becomes Rx. Moreover, the linear trans- trices for all participants. Figs. 2(g) and 2(h) show the mixes of form ΨR−1 will be an effective feature extraction method, since all participants’ projected data vectors with the two sets of ran- f = ΨR−1 (Rx). It is reasonable to expect that the unsupervised dom projection matrices, respectively. Similarly, the pattern of the feature learning can also build a neural network to capture the lin- mixed projected data from all participants is highly complex. ear transform ΨR−1 , similar to the unsupervised feature learning We construct a classifier based on an MLP with two hidden lay- to capture the Ψ based on the plaintext training data x. As a result, ers of 30 and 40 rectified linear units (ReLUs), respectively. The in- the deep model trained using the projected data can still classify put layer admits a 2-dimensional data vector, whereas the output future projected data vectors. In §4.3, we will use a numerical ex- layer consists of two ReLUs. The final classification result is gen- ample to illustrate this. erated using a softmax function based on the output layer’s ReLU The above discussion based on linear features provides a ba- values. Moreover, we construct an SVM classifier as a baseline ap- sis for us to understand how the unsupervised feature learning proach. 
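The participant-side processing in this example — drawing class-0 samples from the 2-dimensional Gaussian and projecting them with a secret Gaussian matrix, as described in §4.2 — can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function names, seed, and sample count are ours:

```python
import numpy as np

def generate_projection_matrix(k, d, rng):
    # Each element of the secret matrix R_i is drawn i.i.d. from the
    # standard normal distribution, as in Section 4.2.1.
    return rng.standard_normal((k, d))

def project(R, X):
    # X holds one d-dimensional data vector per row; each row x is mapped
    # to R x, which is the only view of the data the coordinator receives.
    return X @ R.T

rng = np.random.default_rng(seed=1)
d, k = 2, 2                                         # square R: compression ratio rho = d/k = 1
X = rng.normal(loc=-2.0, scale=1.0, size=(100, d))  # class-0 samples of the 2-D example
R = generate_projection_matrix(k, d, rng)           # kept secret by the participant
Y = project(R, X)                                   # transmitted to the coordinator
print(Y.shape)                                      # prints (100, 2)
```

Each participant would run this with its own independently seeded generator, so that identical plaintext vectors held by different participants map to different projected vectors.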
We use LIBSVM [13] to implement the classifier. The SVM classifier uses a radial basis function (RBF) kernel with two configurable parameters C and λ. During the training phase, we apply grid search to determine the optimal settings for C and λ.

Figure 2: Two-dimensional example. Original data vectors and projected data vectors (red: class 0; blue: class 1). The ranges for the x and y axes are [−10, 10]. Panels: (a) original data; (b)-(e) participants 1-4; (f)-(h) coordinator.

First, we use disjoint subsets of the original data shown in Fig. 2(a) to train and test the MLP and SVM classifiers. Both classifiers can achieve 99% test accuracy. This shows that the MLP and the SVM are properly designed for the 2-dimensional data vectors.

Then, we use disjoint subsets of the randomly projected data shown in Fig. 2(f) to train and test the MLP and SVM classifiers. Moreover, we also increase the number of participants in the PPCL system. Fig. 3 shows the test accuracy versus the number of participants. We can see that the MLP classifier always outperforms the SVM classifier. Moreover, the test accuracy decreases with the number of participants. This is because, with more participants, the pattern of the projected data becomes more complex, introducing challenges to both MLP and SVM. The test accuracy difference between MLP and SVM increases from 2% to 7% when the number of participants increases from 4 to 20. This result is also consistent with the understanding that deep learning is more effective in capturing complex patterns than traditional learning.

Figure 3: Test accuracy based on projected data vs. the number of participants.
Figure 4: Test accuracy based on projected data vs. the condition number.

4.3.2 A 10-dimensional example. Now, we use another example system to understand the effect of deep learning's unsupervised feature learning capability in addressing the data distortion caused by the random projection. This example is a PPCL system with only one participant (i.e., N = 1). The original data vectors in the two classes follow two 10-dimensional Gaussian distributions, with [−2, −2, . . . , −2]^T and [2, 2, . . . , 2]^T as the respective mean vectors, and the 10-dimensional identity matrix as their identical covariance matrix.

In our discussions in §4.2.2, we assume that the projection matrix R is invertible and the unsupervised feature learning tends to capture ΨR^{-1}. As learning algorithms are based on numerical computation on the training data, an ill-conditioned matrix R will impede efficient fitting of ΨR^{-1}. We verify this intuition by assessing the learning performance of the single-participant PPCL system using different R matrices with varying condition numbers. Specifically, by following a method described in [8], the participant generates a random square matrix R that has a certain condition number value. The condition number is defined as ‖R‖_F ‖R^+‖_F [35], where R^+ denotes the pseudoinverse of R and ‖·‖_F represents the Frobenius norm. Fig. 4 shows the test accuracy of the MLP and SVM classifiers trained using data projected by R versus the condition number of R. Note that a larger condition number means that the matrix is more ill-conditioned. We can see that the test accuracy decreases with the condition number, consistent with the intuition.

The study [15] analyzes the distribution of the condition numbers of Gaussian random matrices. The results show that a Gaussian random matrix is well-conditioned with a high probability. For instance, it is shown in [15] that, for a 10 × 5 Gaussian random matrix, the probability that its condition number is larger than 100 is less than 6 × 10^{-7}. This is a basis for our choice of using Gaussian random matrices to project data.

4.4 Alternative Approaches and Limitations
This section discusses two alternative approaches to PPCL and their limitations. These two alternatives will be used as the baseline approaches in our comparative performance evaluation in §5.

4.4.1 Non-collaborative learning. If the data anonymity requirement is not enforced, the coordinator can train a separate deep model based on the projected data vectors contributed by each participant. This alternative approach can address the challenge of the complex mixed patterns due to the different random projection matrices adopted by different participants, as illustrated in §4.3. However, it loses the advantages of collaborative learning, i.e., the increased data volume and pattern coverage. From our evaluation in §5, compared with our proposed approach, even though this non-collaborative learning approach additionally uses the participant identity information, it yields inferior average accuracy.

4.4.2 Differential privacy. Differential privacy (DP) [17] is a rigorous information-theoretic approach to prevent the leakage of individual records through statistical queries on a database of these records. The ϵ-DP [17] is formally defined as follows: A randomized algorithm A : D → R^t gives ϵ-DP if, for all adjacent datasets D1 ∈ D and D2 ∈ D differing on at most one element and all S ⊆ Range(A), Pr(A(D1) ∈ S) ≤ exp(ϵ) · Pr(A(D2) ∈ S). The ϵ, a positive real number, is a measure of privacy loss, i.e., a smaller ϵ implies better privacy. When ϵ is very small, Pr(A(D1) ∈ S) ≃ Pr(A(D2) ∈ S) for all S ⊆ Range(A), which means that the query results A(D1) and A(D2) are almost indistinguishable based on any "test criterion" of S. The indistinguishability between the query results A(D1) and A(D2) decreases with ϵ. The study [18] develops an approach of adding Laplacian noises to implement ϵ-DP. Specifically, for any function F : D → R^t, the randomized algorithm A(D) = F(D) + [n1, n2, . . . , nt]^T gives ϵ-DP, where each ni is drawn independently from a Laplace distribution Lap(S(F)/ϵ) and S(F) denotes the global sensitivity of F. Note that Lap(λ) denotes a zero-mean Laplace distribution with a probability density function of f(x|λ) = (1/(2λ)) e^{−|x|/λ}; the global sensitivity is S(F) = max_{∀D′∈D, ∀D′′∈D} ‖F(D′) − F(D′′)‖1.

Essentially, ϵ-DP gives quantifiable indistinguishability of the query results based on different datasets. The ϵ-DP framework has been applied in various privacy preservation problems in machine learning. As discussed in §3.1, the DML approaches to PPCL [23, 40] add random noises to the parameters exchanged between the participants and the coordinator to achieve ϵ-DP. The original parameters can be viewed as deterministic query results of the training data. Adding random noises to the parameters ensures certain levels of indistinguishability between the noise-added parameters based on different training datasets. The achieved ϵ-DP mitigates the privacy concern that the curious coordinator may use the received parameters to infer the existence of particular data vectors in the training dataset. However, these DML approaches [23, 40] incur significant overhead to resource-constrained participants. For PPCL based on resource-constrained participants, an approach to achieving ϵ-DP is to add a Laplacian noise vector to the original data vector x and then transmit the noise-added data vector to the coordinator for building the classifier. By doing so, certain levels of indistinguishability between the noise-added data vectors based on different original data vectors are achieved.

Additive noisification and multiplicative GRP preserve different forms of privacy. Compared with protecting indistinguishability under the DP framework, we believe that protecting the confidentiality of the raw data form, which can be achieved by GRP, is a more immediate and basic privacy requirement in many applications. The additive noisification, though achieving ϵ-DP, falls short of protecting the confidentiality of the raw data form. Specifically, under the ϵ-DP framework based on zero-mean Laplacian noises, a noise-added data vector can be considered an unbiased estimate of the original data vector with an estimation variance related to ϵ. Thus, the coordinator always has a meaningful (i.e., unbiased) estimate of the raw data. According to Property 2 in §2.2, this only happens to the GRP approach in the worst (and unrealistic) case that the projection matrix is revealed to the coordinator; other than the worst case, the coordinator cannot have a meaningful estimate of the raw data form. In the image classification case studies in §5, we will show that when ϵ is small (i.e., good DP), the contents of the noise-added images can still be interpreted. In contrast, the projected images cannot be interpreted visually at all.

Applying ϵ-DP to PPCL with resource-constrained participants also introduces the following two challenges.

Non-trivial computation overhead: From the DP theory, an independent random noise vector should be generated and added to every data vector x. However, random number generation is often a costly operation due to the use of various mathematical functions. The continuous generation of Laplacian noises will incur non-trivial computation overhead for the resource-constrained participants. Differently, in our approach, the random projection matrix generation is a one-off overhead. The projection to compute Rx is a lightweight operation consisting of multiplications and additions only. Our previous work [43] has implemented the projection operation on an MSP430-based platform. Moreover, the projection can be sped up if a parallel computing chip (e.g., Google's Edge TPU [21]) is available.

Learning performance degradation: As discussed in §4.2.2, the projection matrix can be implicitly learned by the deep learning algorithms. Differently, the additive Laplacian noises to ensure ϵ-DP can be considered neither a pattern nor an embedding that can be learned by learning algorithms. Thus, the Laplacian noises will only negatively affect the learning performance. Our evaluation in §5 shows that the Laplacian noises for achieving moderate ϵ-DP significantly degrade the learning performance.

From the above discussions and the evaluation results in §5, adding Laplacian noises to the training data for ϵ-DP is not a promising approach to PPCL with resource-constrained participants.

5 PERFORMANCE EVALUATION
In this section, we extensively compare the accuracy achieved by various approaches. The computation and communication overhead of these approaches will be profiled in §6 based on their implementations on a testbed.

5.1 Evaluation Methodology and Datasets
We conduct extensive evaluation to compare several approaches:
• GRP-DNN: This is the proposed approach consisting of GRP at the participants and collaborative learning based on a DNN at the coordinator. The design or choice of the DNN model will be application specific. The DNN models and training algorithms are implemented based on PyTorch [2].
• GRP-SVM: This baseline approach applies GRP at the participants and trains an SVM-based classifier at the coordinator. The SVM-based classifier is implemented using LIBSVM [13]. The classifier uses an RBF kernel with two configurable parameters C and λ. During the training phase, we apply grid search to determine the best settings for C and λ. This grid search is often lengthy in time (e.g., several days).
• GRP-NCL: This is the non-collaborative learning (NCL) baseline approach described in §4.4.1.
It runs GRP at the participants and trains a separate DNN for each participant at the coordinator. Compared with other approaches, this approach additionally requires the identity of the participant for each training sample.
• ϵ-DP-DNN: As described in §4.4.2, this approach implements ϵ-DP by adding Laplacian noise vectors to the data vectors and performs collaborative deep learning based on a DNN at the coordinator.
• ϵ-DP-SVM: This approach implements ϵ-DP by adding Laplacian noise vectors to the data vectors and performs collaborative learning based on SVM at the coordinator.
• CNN, SVM, MLP, ResNet-152: These are the plain learning approaches based on the CNN, SVM, MLP, and ResNet-152 models, respectively. They do not protect any privacy.

The performance evaluation is performed based on two datasets, i.e., MNIST [27] and spambase [4]. The MNIST dataset consists of 60,000 training samples and 10,000 testing samples. Each sample is a 28 × 28 grayscale image showing a handwritten digit from 0 to 9. Fig. 5(a) shows an instance of each digit. The spambase dataset consists of 4,601 samples. Each sample consists of (i) a 57-dimensional feature vector that is extracted from an e-mail message and (ii) a class label indicating whether the e-mail message is an unsolicited commercial e-mail. The details of the feature vector can be found in [4]. As the data volume of this spambase dataset is limited, we apply data augmentation to the spambase by adding zero-mean Gaussian noises, resulting in 40,000 training samples and 400 testing samples. We choose these two datasets because the small sizes of the data vectors are commensurate with the limited computing and transmission capabilities of IoT end devices.

Figure 5: Example images from MNIST dataset: (a) original images; (b) projected images in GRP-DNN; (c) noise-added images in ϵ-DP-DNN (ϵ = 50); (d) noise-added images in ϵ-DP-DNN (ϵ = 10).

Training a spam detector based on user-contributed samples (e.g., e-mails) may cause privacy concerns. Thus, our proposed approach well fits this case. The choice of the vision-based character recognition task with the MNIST dataset allows us to leverage the learning capabilities of the deep models that are often designed for image classification. Moreover, by using images as the data vectors, the effect of the distortion caused by noise adding or random projection can be visualized for intuitive understanding. Although the character recognition task is not privacy-sensitive, its results will provide understanding on other image classification-based privacy-sensitive applications, such as collaboratively training a mood classifier using the photos in the albums of the users' smartphones.

For a PPCL system with N participants, we divide both the training and testing samples into N disjoint sets evenly. Each set is assigned to a participant. Under GRP-DNN, GRP-SVM, and GRP-NCL, each participant independently generates its random Gaussian matrix as described in §4.2.1 and uses the matrix to project its plaintext data vectors. The deep models and SVM are trained by the coordinator based on the projected or noise-added training data vectors from the participants. The trained deep models and SVM are used to classify the projected or noise-added testing data vectors to measure the test accuracy as the evaluation results.

5.2 Evaluation Results with MNIST Dataset
We design a CNN that is used in the GRP-DNN, GRP-NCL, and ϵ-DP-DNN approaches. The CNN consists of two convolutional layers and three dense layers of ReLUs. We apply max pooling after each convolutional layer to reduce the dimension of the data after convolution. The max pooling controls overfitting effectively and improves the CNN's robustness to small spatial distortions in the input image. The last dense layer has ten ReLUs corresponding to the ten classes of MNIST. A softmax function is used to make the classification decision based on the outputs of the last dense layer. Fig. 6 illustrates the design of the CNN. Note that, without random projection, the CNN and the SVM with grid search for kernel parameters can achieve test accuracy of 98.7% and 98.52%, respectively. This shows that the CNN and SVM can well capture the patterns of MNIST.

Figure 6: CNN with a projected MNIST image as input (10 5×5 conv filters with 2×2 pooling, 20 5×5 conv filters with 2×2 pooling, then dense layers of 320, 50, and 10 ReLUs followed by softmax).
Figure 7: Impact of the number of participants (MNIST). The error bars for GRP-NCL represent min and max.

First, we evaluate the impact of the number of participants N on the learning performance of GRP-DNN, GRP-NCL, and GRP-SVM. Fig. 7 shows the results. The two horizontal lines in Fig. 7 represent the test accuracy of the plain CNN and SVM without any privacy protection. The two lines overlap. When N increases from 40 to 400, the test accuracy of GRP-DNN decreases from 96.87% to 86.18%.
If N is no greater than 280, GRP-DNN can maintain a test accuracy greater than 90%. The drop of accuracy with increased N is consistent with the understanding that distinct random projection matrices increase the pattern complexity of the aggregated data. However, for MNIST data with light pattern complexities, the GRP-DNN approach can support up to 280 IoT objects for a satisfactory classification accuracy of 90%. Under the GRP-NCL approach, the deep models corresponding to the participants have different test accuracy values. The histogram and error bars in Fig. 7 represent the average, minimum, and maximum of the test accuracy values across all trained deep models. Under each setting of N, the maximum test accuracy is 100%. However, the average test accuracy is consistently lower than that of GRP-DNN. This shows that the GRP-NCL, which needs to compromise data anonymity, yields inferior average learning performance compared with GRP-DNN. This result shows the advantage of collaborative learning. Lastly, the GRP-SVM approach gives poor test accuracy of around 17.5%. This is because no efficient RBF kernels can be found to create proper hyperplanes for classification. This suggests that DNNs are more efficient in coping with the distortions caused by projections.

Second, we evaluate the impact of GRP's data compression on the learning performance. Fig. 8 shows the results when N = 100. When the compression ratio increases from 1 (i.e., no compression) to 2.33 (i.e., 43% of the data volume is retained), the test accuracy of GRP-DNN decreases from 95.52% to 92.85% only. From our discussion in §4.2.1, the good tolerance of GRP-DNN against data compression is due to the high sparsity of the MNIST images. In contrast, the GRP-SVM approach performs poorly under all compression ratio settings.

Figure 8: Impact of data compression on learning performance (MNIST, N = 100).
Figure 9: Impact of differential privacy loss on learning performance (MNIST).
Figure 10: Impact of the number of participants (spambase). The error bars for GRP-NCL represent min and max.

Then, we evaluate the impact of adding Laplacian noises to implement ϵ-DP on the learning performance. Fig. 9 shows the test accuracy of ϵ-DP-DNN versus the privacy loss level ϵ. When ϵ = 100 (small Laplacian noises and large differential privacy loss), the ϵ-DP-DNN achieves a test accuracy of 86.6%, lower than those achieved by GRP-DNN when N is up to 400. When ϵ = 10, the performance of ϵ-DP-DNN drops to 11.4%, close to the performance of random guessing. For comparison, we visualize the projected and noise-added images with two ϵ settings in Fig. 5. From Fig. 5(b), we cannot visually interpret the projected images. However, from Figs. 5(c) and 5(d), the noise-added images are easily interpreted when ϵ is down to 10. Note that in our evaluation, we use the same CNN model as shown in Fig. 6 for the GRP-DNN, GRP-NCL, and ϵ-DP-DNN approaches. We do not spend special efforts to improve the CNN design in favor of any approach; we only make sure that the CNN fed with the original MNIST images achieves satisfactory performance. The poor performance of ϵ-DP-DNN is consistent with the understanding that the performance of deep learning can be susceptible to small perturbations to the data vectors [48]. There are also systematic approaches to generating adversarial examples with small differences from the training samples to yield wrong classification results [10, 20]. Special care is needed in the deep model design to improve robustness against human-indiscernible perturbations [48]. Significant noises, which are required to achieve good DP protection, are still open challenges to deep learning. Thus, under the ϵ-DP framework, it is challenging to achieve a desirable trade-off between the privacy protection strength and the learning performance.

We discussed in §4.4.2 that the additive noisification for ϵ-DP is ineffective in achieving a good trade-off between learning performance and protecting the confidentiality of the raw forms of the training data. Now, we compare the results of GRP-DNN (N = 1, k = d − 1) and ϵ-DP-DNN. We consider the worst case for GRP-DNN, i.e., the projection matrix R is revealed to the curious coordinator. From Property 2 in §2.2, the minimum norm estimate of the original data vector by the coordinator will have a per-element variance of about 410 for any MNIST image. Under this setting, GRP-DNN can achieve a test accuracy of 94.82%. To achieve the same per-element variance of 410, the ϵ value adopted by the ϵ-DP-DNN should be 18.89. Under this ϵ setting, the test accuracy of ϵ-DP-DNN is 12.86% only.

Fig. 9 also shows the test accuracy of the ϵ-DP-SVM approach. It performs poorly when ϵ ≤ 100. Only when the added noises are very small, under the settings of ϵ = 400 and ϵ = 500, can this approach achieve good test accuracy.

5.3 Evaluation Results with Spambase Dataset
We design a 5-layer MLP classifier to detect spams. The numbers of ReLUs in the five layers are 57, 100, 50, 10, and 2, respectively. A softmax function is used lastly to make the final detection decision. Dropout is used during training to suppress overfitting. Without random projection, the MLP and the SVM with grid search for kernel parameters can achieve test accuracy of 96.52% and 96.25%, respectively. This shows that the MLP and SVM can well capture the patterns of spambase.
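The 5-layer spam-detection MLP just described can be sketched as a plain forward pass. This is a minimal NumPy illustration with randomly initialized weights; dropout, training, and the paper's PyTorch implementation are omitted, and all names are ours:

```python
import numpy as np

LAYERS = [57, 100, 50, 10, 2]  # ReLUs per layer, as in the spambase MLP

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def forward(x, weights, biases):
    # Pass a 57-dimensional (projected) feature vector through the ReLU
    # layers, then apply softmax for the spam / non-spam decision.
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return softmax(h)

rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(LAYERS[:-1], LAYERS[1:])]
biases = [np.zeros(m) for m in LAYERS[1:]]
p = forward(rng.standard_normal(57), weights, biases)
print(p.shape)  # prints (2,)
```

The returned vector p holds the two class probabilities; under GRP, the input would be the participant's projected 57-dimensional feature vector rather than the plaintext one.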
We evaluate the impact of the number of participants N on the learning performance of GRP-DNN, GRP-NCL, and GRP-SVM. Fig. 10 shows the results. The two horizontal lines in Fig. 10 represent the test accuracy of the plain MLP and SVM without any privacy protection. When N increases from 1 to 200, the test accuracy of GRP-DNN decreases from 96% to 83.25%. If N is no greater than 100, GRP-DNN can maintain a test accuracy of about 90%. The average test accuracy of GRP-NCL is about 5% lower than that of GRP-DNN, because GRP-NCL misses the advantages of collaborative learning. The test accuracy of GRP-SVM is about 1.25% to 2.75% lower than that of GRP-DNN. Thus, GRP-SVM performs satisfactorily on this spambase dataset. The reasons are two-fold. First, on this spambase dataset, the classifiers operate on the e-mail features rather than the raw data. Second, the RBF kernel is effective in capturing the features. In fact, the nature of this spambase dataset is similar to that of the 2-dimensional and 10-dimensional generated feature datasets used in §4.3, on which GRP-DNN and GRP-SVM perform similarly.

5.4 Summary and Discussion

We have several observations from the results in §5.2 and §5.3:

• Compared with SVM, deep learning can better adapt to the complexity introduced by the multiplicative projections.
• Although the GRP-NCL approach additionally uses the identities of the participants, it gives inferior performance compared with the collaborative GRP-DNN. This shows the advantage of collaborative learning even under the privacy preservation requirement.
• Compared with GRP-DNN, the additive noisification for ϵ-DP achieves an inferior trade-off between learning performance and protecting the confidentiality of the raw forms of the training data.
• GRP-DNN shows promising scalability with the number of participants observing low-complexity data patterns. For the MNIST and spambase datasets, GRP-DNN can well support 100 participants with a test accuracy drop of a few percent. For large-scale PPCL systems involving more participants, we envision a two-tier system architecture as follows. The participants are divided into groups. At the first tier, our GRP-DNN is applied within each group; at the second tier, the DML approach is applied among the group coordinators.
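The additive-noisification baseline compared above is the standard Laplace mechanism; a minimal sketch follows. The per-element sensitivity used here (255, the pixel range) is an assumed placeholder — the paper's own ϵ calibration (e.g., ϵ = 18.89 matching a per-element variance of 410) depends on its sensitivity choice, which this sketch does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_perturb(x, epsilon, sensitivity):
    """ϵ-DP additive noisification: add i.i.d. Laplace(b) noise with b = Δ/ϵ."""
    return x + rng.laplace(0.0, sensitivity / epsilon, size=x.shape)

image = rng.integers(0, 256, size=(28, 28)).astype(float)  # stand-in MNIST image

# The per-element variance of Laplace(b) noise is 2*b^2, so shrinking ϵ by 10x
# grows the noise variance by 100x: stronger privacy costs learning accuracy.
for eps in (100.0, 10.0):
    noisy = laplace_perturb(image, eps, sensitivity=255.0)
    print(f"eps={eps}: per-element noise variance = {2 * (255.0 / eps) ** 2:.1f}")
```

This quadratic blow-up of noise variance as ϵ decreases is what drives the sharp accuracy drop of ϵ-DP-DNN reported in §5.2.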
6 IMPLEMENTATION AND BENCHMARK

In this section, we measure the overhead of two PPCL approaches (i.e., our GRP-DNN and Crowd-ML [23]) and a privacy-preserving classification outsourcing approach (i.e., CryptoNets [19]) on a testbed of 14 Raspberry Pi 2 Model B nodes [3] and a powerful workstation computer. The Raspberry Pi nodes act as the PPCL participants and the workstation acts as the coordinator. They are interconnected by a 24-port network switch. We benchmark these approaches using the MNIST dataset. The training and testing samples are evenly allocated to the participants, resulting in 4,285 training samples and 714 testing samples on each participant. The implementations of the three approaches (GRP-DNN, Crowd-ML, and CryptoNets) on the same platform, i.e., Raspberry Pi, allow fair comparisons. The participant part of our GRP-DNN can be implemented on mote-class platforms: our previous work [43] has implemented Gaussian matrix generation and GRP on the MSP430-based Kmote platform. However, it is difficult, if not impossible, to implement Crowd-ML and CryptoNets on mote-class platforms.

We implement our GRP-DNN approach on the testbed. The compression ratio is ρ = 1 (i.e., no compression). Table 1 shows the benchmark results.

Table 1: The overhead of various approaches.

             Overhead                   GRP-DNN    Crowd-ML   CryptoNets
  Training   Participant comm. vol.    33.6 MB    117.2 MB   n/a
             Participant compute time  0.96 s     367.24 s   n/a
             Coordinator compute time  928.34 s   1.04 s     n/a
  Testing    Participant comm. vol.    5.6 MB     n/a        15.0 MB
             Participant compute time  0.16 s     4.67 s     116 hours
             Coordinator compute time  40.88 s    n/a

  n/a represents "not applicable."

During the training phase, each GRP-DNN participant needs to transmit a total of 33.6 MB of projected data. A participant can complete projecting all the 4,285 training images within 0.96 s. The coordinator needs 928.34 s to train the CNN. In our GRP-DNN implementation, the testing phase is performed on the coordinator. During the testing phase, each participant completes projecting all the 714 testing images within 0.16 s and transmits a total of 5.6 MB of data to the coordinator. The coordinator needs 40.88 s to classify all the projected testing images from the participants. Note that GPU acceleration is not used in this benchmark for GRP-DNN during either the training or the testing phase.

Crowd-ML [23] is a DML approach. In Crowd-ML, a participant checks out the global classifier parameters from the coordinator and computes the gradients using its own training data. Then, the participants transmit the gradients to the coordinator, which updates the global classifier parameters. Thus, during the training phase, the participants and the coordinator repeatedly exchange parameters. We apply an existing implementation of Crowd-ML [1] on our testbed. Our measurements show that, during the training phase, each participant needs to upload and download a total of 117.2 MB of data, which is 3.5x that of our GRP-DNN. The participant compute time is more than 350x of that under GRP-DNN. Despite the larger volume of data exchanged, Crowd-ML achieves a test accuracy of 91.28% only, lower than the 95.58% achieved by GRP-DNN. This is because Crowd-ML uses a simple multiclass logistic classifier, which is inferior to the CNN used by GRP-DNN in terms of learning performance. Note that during the testing phase of Crowd-ML, the participants execute their local classifiers. Thus, they do not need to transmit the testing samples to the coordinator for classification.

CryptoNets [19] uses a homomorphic encryption algorithm to encrypt a testing sample during the classification phase and transmits the encrypted sample to the coordinator. Then, the coordinator uses a neural network trained on plaintext data to classify the encrypted testing sample. We have implemented the homomorphic encryption part of CryptoNets to run on the Raspberry Pis. The volume of the 714 encrypted testing images is 15 MB, almost 3x of the data volume generated by random projection. In particular, a Raspberry Pi node takes about 10 minutes to encrypt one image and a total of 116 hours to encrypt all the testing images, which is 2.6 million times slower than the random projection computation. This result clearly shows that the high computational complexity of homomorphic encryption makes CryptoNets ill-suited for resource-constrained devices.
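The participant-side workload benchmarked above reduces to one matrix product per batch of samples; a minimal sketch of the Gaussian random projection step follows. The dimensions match MNIST with ρ = 1, but the Gaussian parameters, the fixed seed, and the batch handling are illustrative assumptions rather than the testbed configuration.

```python
import numpy as np

D = 784   # flattened 28x28 MNIST image
M = 784   # projected dimension under compression ratio ρ = 1 (no compression)

# Each participant independently draws a private Gaussian projection matrix R
# once and keeps it secret; the fixed seed here is only for reproducibility.
rng = np.random.default_rng(7)
R = rng.normal(0.0, 1.0, size=(M, D))

def project(samples):
    """Obfuscate raw samples via Gaussian random projection: y = R x."""
    return samples @ R.T

batch = rng.random((4285, D))   # stand-in for a participant's 4,285 local images
projected = project(batch)
print(projected.shape)          # (4285, 784): only projections leave the device
```

Because this is a single dense matrix multiplication, it maps directly onto BLAS on a Raspberry Pi and onto fixed-point multiply-accumulate loops on mote-class MCUs, which is why the participant compute time stays below one second in Table 1.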
7 CONCLUSION

This paper proposes a practical privacy-preserving collaborative learning approach, in which the resource-constrained learning participants apply independent Gaussian projections to their training data vectors and the coordinator applies deep learning to train a classifier based on the projected data vectors. Our approach protects the confidentiality of the raw forms of the training data against the honest-but-curious coordinator. Evaluation using two datasets shows that our approach outperforms various baselines and exhibits promising scalability with respect to the number of participants observing low-complexity data patterns. Benchmarks on a testbed show the practicality and efficiency of our approach.

ACKNOWLEDGMENTS

This research was funded by a Start-up Grant at Nanyang Technological University. We acknowledge the support of NVIDIA Corporation with the donation of two GPUs used in this research. We also acknowledge Dr. Yi Li for constructive discussions and Zhenyu Yan for managing the computation resources used in this paper.

REFERENCES
[1] 2018. Crowd-ML. https://github.com/jihunhamm/Crowd-ML.
[2] 2018. PyTorch. https://pytorch.org/.
[3] 2018. Raspberry Pi 2 Model B. https://bit.ly/1b75SRj.
[4] 2018. Spambase data set. https://archive.ics.uci.edu/ml/datasets/spambase.
[5] M. Abadi, A. Chu, I. Goodfellow, H. McMahan, I. Mironov, K. Talwar, and L. Zhang. 2016. Deep learning with differential privacy. In Proc. CCS. ACM, 308–318.
[6] Nir Ailon and Bernard Chazelle. 2009. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput. 39, 1 (2009), 302–322.
[7] Jonathan Berr. 2018. Equifax breach exposed data for 143 million consumers. https://cbsn.ws/2Qc8VOg.
[8] Michel Bierlaire, Ph. L. Toint, and Daniel Tuyttens. 1991. On iterative algorithms for linear least squares problems with bound constraints. Linear Algebra Appl. 143 (1991), 111–143.
[9] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. 2017. Practical secure aggregation for privacy-preserving machine learning. In Proc. CCS. ACM, 1175–1191.
[10] Avishek Joey Bose and Parham Aarabi. 2018. Adversarial Attacks on Face Detectors Using Neural Net Based Constrained Optimization. In Proc. Intl. Workshop Multimedia Signal Process.
[11] Emmanuel J. Candès and Michael B. Wakin. 2008. An introduction to compressive sampling. IEEE Signal Process. Mag. 25, 2 (2008), 21–30.
[12] Hervé Chabanne, Amaury de Wargny, Jonathan Milgram, Constance Morel, and Emmanuel Prouff. 2017. Privacy-Preserving Classification on Deep Neural Network. IACR Cryptology ePrint Archive 2017 (2017), 35.
[13] Chih-Chung Chang and Chih-Jen Lin. 2018. LIBSVM — a library for support vector machines. https://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[14] Kamalika Chaudhuri and Claire Monteleoni. 2009. Privacy-preserving logistic regression. In Proc. NIPS. 289–296.
[15] Zizhong Chen and Jack J. Dongarra. 2005. Condition numbers of Gaussian random matrices. SIAM J. Matrix Anal. Appl. 27, 3 (2005), 603–620.
[16] George Danezis and Claudia Diaz. 2008. A survey of anonymous communication channels. Technical Report MSR-TR-2008-35. Microsoft Research.
[17] C. Dwork. 2006. Differential privacy. In Proc. ICALP.
[18] C. Dwork, F. McSherry, K. Nissim, and A. Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Proc. Conf. Theory of Cryptography. 265–284.
[19] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. 2016. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In Proc. ICML. 201–210.
[20] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and Harnessing Adversarial Examples. In Proc. ICLR.
[21] Google Cloud. 2018. Edge TPU. https://cloud.google.com/edge-tpu/.
[22] Thore Graepel, Kristin Lauter, and Michael Naehrig. 2012. ML confidential: Machine learning on encrypted data. In Proc. Intl. Conf. Inf. Security & Cryptology. Springer, 1–21.
[23] J. Hamm, A. Champion, G. Chen, M. Belkin, and D. Xuan. 2015. Crowd-ML: A Privacy-Preserving Learning Framework for a Crowd of Smart Devices. In Proc. ICDCS. IEEE, 11–20.
[24] B. Hitaj, G. Ateniese, and F. Perez-Cruz. 2017. Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning. In Proc. CCS. ACM, 603–618.
[25] Loc N. Huynh, Youngki Lee, and Rajesh Krishna Balan. 2017. DeepMon: Mobile GPU-based deep learning framework for continuous vision applications. In Proc. MobiSys. ACM, 82–95.
[26] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[27] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. 2018. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/.
[28] Shancang Li, Li Da Xu, and Xinheng Wang. 2013. Compressed sensing signal and data acquisition in wireless sensor networks and internet of things. IEEE Trans. Ind. Informat. 9, 4 (2013), 2177–2186.
[29] Bin Liu, Yurong Jiang, Fei Sha, and Ramesh Govindan. 2012. Cloud-enabled privacy-preserving collaborative learning for mobile sensing. In Proc. SenSys. ACM, 57–70.
[30] Kun Liu, Hillol Kargupta, and Jessica Ryan. 2006. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Trans. Knowl. Data Eng. 18, 1 (2006), 92–106.
[31] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Proc. AISTATS.
[32] H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. 2018. Learning Differentially Private Recurrent Language Models. In Proc. ICLR.
[33] Arvind Narayanan and Vitaly Shmatikov. 2006. How to break anonymity of the Netflix prize dataset. arXiv preprint cs/0610105 (2006).
[34] Lindsey O'Donnell. 2018. Zero-Day Flash Exploit Targeting Middle East. https://threatpost.com/zero-day-flash-exploit-targeting-middle-east/132659/.
[35] Christopher C. Paige and Michael A. Saunders. 1982. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Software 8, 1 (1982), 43–71.
[36] L. Phong, Y. Aono, T. Hayashi, L. Wang, and S. Moriai. 2018. Privacy-Preserving Deep Learning via Additively Homomorphic Encryption. IEEE Trans. Inf. Forensics Security 13, 5 (2018).
[37] Yaron Rachlin and Dror Baron. 2008. The secrecy of compressed sensing measurements. In Proc. Allerton. IEEE, 813–817.
[38] Reuters. 2018. Facebook critics want regulation, investigation after data misuse. https://reut.rs/2GwKF8p.
[39] Yiran Shen, Chengwen Luo, Dan Yin, Hongkai Wen, Daniela Rus, and Wen Hu. 2018. Privacy-preserving sparse representation classification in cloud-enabled mobile applications. Comput. Netw. 133 (2018), 59–72.
[40] R. Shokri and V. Shmatikov. 2015. Privacy-preserving deep learning. In Proc. CCS. ACM, 1310–1321.
[41] Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. 2013. Stochastic gradient descent with differentially private updates. In Proc. GlobalSIP. IEEE, 245–248.
[42] Johan A.K. Suykens. 2003. Advances in Learning Theory: Methods, Models, and Applications. Vol. 190. IOS Press.
[43] Rui Tan, Sheng-Yuan Chiu, Hoang Hai Nguyen, David K.Y. Yau, and Deokwoo Jung. 2017. A Joint Data Compression and Encryption Approach for Wireless Energy Auditing Networks. ACM Trans. Sensor Networks 13, 2 (2017), 9.
[44] Cong Wang, Bingsheng Zhang, Kui Ren, and Janet M. Roveda. 2013. Privacy-assured outsourcing of image reconstruction service in cloud. IEEE Trans. Emerg. Topics Comput. 1, 1 (2013), 166–177.
[45] Piotr Iwo Wójcik and Marcin Kurdziel. 2018. Training neural networks on high-dimensional data using random projection. Pattern Anal. Appl. (2018), 1–11.
[46] Wanli Xue, Chengwen Luo, Guohao Lan, Rajib Rana, Wen Hu, and Aruna Seneviratne. 2017. Kryptein: A compressive-sensing-based encryption scheme for the internet of things. In Proc. IPSN. IEEE, 169–180.
[47] Shuochao Yao, Yiran Zhao, Aston Zhang, Lu Su, and Tarek Abdelzaher. 2017. DeepIoT: Compressing deep neural network structures for sensing systems with a compressor-critic framework. In Proc. SenSys. ACM, 4:1–4:14.
[48] Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. 2016. Improving the robustness of deep neural networks via stability training. In Proc. CVPR. IEEE, 4480–4488.