This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2021.3088528, IEEE Transactions on Communications

Joint Client Scheduling and Resource Allocation under Channel Uncertainty in Federated Learning

Madhusanka Manimel Wadu, Sumudu Samarakoon, and Mehdi Bennis
Centre for Wireless Communications (CWC), University of Oulu, Finland
{madhusanka.manimelwadu, sumudu.samarakoon, mehdi.bennis}@oulu.fi

Abstract—The performance of federated learning (FL) over wireless networks depends on the reliability of the client-server connectivity and on the clients' local computation capabilities. In this article, we investigate the problem of client scheduling and resource block (RB) allocation to enhance the performance of model training using FL over a predefined training duration, under imperfect channel state information (CSI) and limited local computing resources. First, we analytically derive the gap between the training losses of FL with client scheduling and of a centralized training method for a given training duration. Then, we formulate the minimization of this training-loss gap over client scheduling and RB allocation as a stochastic optimization problem and solve it using Lyapunov optimization. A Gaussian process regression-based channel prediction method is leveraged to learn and track the wireless channel, whereby the clients' CSI predictions and computing power are incorporated into the scheduling decision. Using an extensive set of simulations, we validate the robustness of the proposed method under both perfect and imperfect CSI over an array of diverse data distributions. Results show that the proposed method reduces the gap of the training accuracy loss by up to 40.7% compared to state-of-the-art client scheduling and RB allocation methods.

Index Terms—Federated learning, channel prediction, Gaussian process regression (GPR), resource allocation, scheduling, 5G and beyond

This work is supported by the Academy of Finland 6G Flagship (grant no. 318927), project SMARTER, projects EU-ICT IntellIoT and EUCHISTERA LearningEdge, and Infotech-NOOR. A preliminary version of this work appears in the proceedings of IEEE WCNC 2020 [1] and in the first author's thesis work [2].

0090-6778 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: Oulu University. Downloaded on June 17, 2021 at 05:09:45 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

The staggering growth of data generated at the edge of wireless networks has sparked huge interest in machine learning (ML) at the network edge, coined edge ML [3]. In edge ML, training data is unevenly distributed over a large number of devices, and every device holds only a tiny fraction of the data. One of the most popular model training methods in edge ML is federated learning (FL) [4], [5]. The goal of FL is to train a high-quality centralized model in a decentralized manner, based on local model training and client-server communication, while training data remains private [3], [4], [6]. Recently, FL over wireless networks has gained much attention in communications and networking, with applications spanning a wide range of domains and verticals, ranging from vehicular communication to blockchain and healthcare [5]–[7]. The performance of FL highly depends on the communication and link connectivity, in addition to the model size and the number of clients engaged in training. The quality of the trained model (inference accuracy) also depends on the training data distributions over devices, which are generally non-independent and identically distributed (non-IID) [8], [9]. Hence, the impact of the training data distribution, in terms of balanced-unbalancedness and IID versus non-IIDness, on model training has been analyzed in a few works [5], [10], [11]. In [10], the authors analyzed the effect of non-IID data distribution through numerical simulations for a visual classification task. Likewise, in [11], the authors empirically analyzed the impact of non-IID data distribution on the performance of FL.

Except for a handful of works [4], [12]–[17], the vast majority of the existing literature assumes ideal client-server communication conditions, overlooking channel dynamics and uncertainties. In [12], communication overhead is reduced by using lazily aggregated gradients (LAG), based on reusing outdated gradient updates. Due to the limitations in communication resources, scheduling the most informative clients is one possible solution [13], [14], [17]. In [13], the authors propose a client-scheduling algorithm for FL that maximizes the number of scheduled clients, assuming that communication and computation delays are below a predefined threshold; the impact of client scheduling, however, was not studied. In [14], the authors study the impact of conventional scheduling policies (e.g., random, round robin, and proportional fairness) on the accuracy of FL over wireless networks, relying on known channel statistics. In [15], the training loss of FL is minimized by joint power allocation and client scheduling. A probabilistic scheduling framework is proposed in [18], seeking the optimal trade-off between channel quality and the importance of the model update, considering the impact of channel uncertainty in scheduling. Moreover, the impact of dynamic channel conditions on FL is analyzed in only a few works [4], [14], [19]. In [4], the authors propose an over-the-air computation-based approach, leveraging the ideas of [20], to speed up the global model aggregation by utilizing the superposition property of wireless multiple-access channels with scheduling and beamforming. In addition to channel prediction, clients' computing power is utilized for their local model updates. Stochastic gradient descent (SGD) is widely used for updating the local models, in which the computation is executed sample by sample [21]. In [19], the authors proposed the compressed analog distributed stochastic gradient descent (CA-DSGD) method, which is shown to be robust against imperfect channel state information (CSI) at the devices. Interestingly, however, the communication aspects of FL, such as optimal client scheduling and resource allocation in the absence of perfect CSI, along with the limitations in processing power for SGD-based local computations, are neglected in all these aforementioned works.

Acquiring CSI through pilot sequence exchanges introduces an additional signaling overhead that scales with the number of devices. There are a handful of works dealing with wireless channel uncertainties [22], [23]. The authors in [22] demonstrated the importance of reliable fading prediction for adaptive transmission in wireless communication systems. Channel prediction via fully connected recurrent neural networks is proposed in [23]. Among channel prediction methods, Gaussian process regression (GPR) is a lightweight online technique whose objective is to approximate a function with a non-parametric Bayesian approach in the presence of nonlinear and unknown relationships between variables [24], [25]. A Gaussian process (GP) is a stochastic process with a collection of random variables indexed by time or space. Any subset of these random variables forms a multidimensional joint Gaussian distribution, so a GP can be completely described by its mean and covariance matrices. The foundation of the GPR approach is Bayesian inference, where an a priori model is chosen and updated with observed experimental data [24]. In the literature, GPR is used for a wide array of practical applications, including communication systems [26]–[28]. In [26], GPR is used to estimate a Rayleigh channel. In [27], a problem of localization in a cellular network is investigated with GPR-based position predictions. In [28], GPR is used to predict the channel quality index (CQI) to reduce CQI signaling overhead. In our prior works [1], [2], GPR-based channel prediction is used to derive a client scheduling and resource block (RB) allocation policy under imperfect channel conditions, assuming unlimited computational power availability per client. Investigating the performance of FL under clients' limited computation power, for both IID and non-IID data distributions, remains an unsolved problem.

The main contributions of this work over [1], [2] are the derivation of a joint client scheduling and RB allocation policy for FL under communication and computation power limitations, and a comprehensive analysis of the performance of model training as a function of (i) system model parameters, (ii) available computation and communication resources, and (iii) non-IID data distribution over the clients in terms of the heterogeneity of dataset sizes and available classes. In this work, we consider a set of clients that communicate with a server over wireless links to train a neural network (NN) model within a predefined training duration. First, we derive an analytical expression for the loss of accuracy in FL with scheduling compared to a centralized training method. In order to reduce the signaling overhead of pilot transmission for channel estimation, we consider a communication scenario with imperfect CSI and leverage GPR to learn and track the wireless channel while quantifying the information on the unexplored CSI over the network. To do so, we formulate the client scheduling and RB allocation problem as a trade-off between optimal client scheduling, RB allocation, and CSI exploration under both communication and computation resource constraints. Due to the stochastic nature of the aforementioned problem, we resort to the drift-plus-penalty (DPP) technique from the Lyapunov optimization framework to recast the problem into a set of linear problems that are solved at each time slot [29]. In this view, we present a joint client scheduling and RB allocation algorithm that simultaneously explores and predicts CSI to improve the accuracy of FL over wireless links. With an extensive set of simulations, we validate the proposed methods over an array of IID and non-IID data distributions capturing the heterogeneity in dataset sizes and available classes. In addition, we compare the feasibility of the proposed methods in terms of the fairness and accuracy of the trained global model. Simulation results show that the proposed method achieves up to a 40.7% reduction in the loss of accuracy compared to state-of-the-art client scheduling and RB allocation methods.

The rest of this paper is structured as follows. Section II presents the system model and formulates the problem of model training over wireless links under imperfect CSI. In Section III, the problem is first recast in terms of the loss of accuracy due to client scheduling compared to a centralized training method. Subsequently, GPR-based CSI prediction is proposed, followed by the derivation of Lyapunov optimization-based client scheduling and RB allocation policies under both perfect and imperfect CSI. Section IV numerically evaluates the proposed scheduling policies against state-of-the-art client scheduling techniques. Finally, conclusions are drawn in Section V.

II. SYSTEM MODEL AND PROBLEM FORMULATION

Consider a system consisting of a set $\mathcal{K}$ of $K$ clients that communicate with a parameter server (PS) over wireless links. Therein, the $k$-th client has a private dataset $\mathcal{D}_k$ of size $D_k$, which is a partition of the global dataset $\mathcal{D}$ of size $D = \sum_k D_k$. For communication with the server for local model updating, a set $\mathcal{B}$ of $B (\leq K)$ RBs is shared among those clients¹.

A. Scheduling, resource block allocation, and channel estimation

Let $s_k(t) \in \{0,1\}$ be an indicator, where $s_k(t)=1$ indicates that client $k$ is scheduled by the PS for uplink communication at time $t$, and $s_k(t)=0$ otherwise. Multiple clients are simultaneously scheduled by allocating each at most one RB. Hence, we define the RB allocation vector $\lambda_k(t) = [\lambda_{k,b}(t)]_{\forall b\in\mathcal{B}}$ for client $k$, with $\lambda_{k,b}(t)=1$ when RB $b$ is allocated to client $k$ at time $t$, and $\lambda_{k,b}(t)=0$ otherwise. The client scheduling and RB allocation are constrained as follows:

$$s_k(t) \leq \mathbf{1}^\dagger \lambda_k(t) \leq 1 \quad \forall k, t, \qquad (1)$$

where $\mathbf{1}^\dagger$ is the transpose of the all-one vector. The rate at which the $k$-th client communicates in the uplink with the PS at time $t$ is given by

$$r_k(t) = \sum_{b\in\mathcal{B}} \lambda_{k,b}(t) \log_2\Big(1 + \frac{p\,|h_{k,b}(t)|^2}{I_{k,b}(t) + N_0}\Big), \qquad (2)$$

where $p$ is the fixed transmit power of client $k$ and $h_{k,b}(t)$ is the channel between client $k$ and the PS over RB $b$ at time $t$.

¹For $B > K$, all clients simultaneously communicate their local models to the PS.
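As a minimal numerical sketch of the rate expression in (2) (this is not the authors' simulation code; the function name, parameter values, and defaults are illustrative assumptions):

```python
import numpy as np

def uplink_rate(lam_k, h_k, p=1.0, interference=None, n0=1e-3):
    """Per-client uplink rate of Eq. (2).

    lam_k        : (B,) 0/1 RB-allocation vector lambda_k(t)
    h_k          : (B,) complex channel gains h_{k,b}(t)
    p            : fixed transmit power of the client
    interference : (B,) interference I_{k,b}(t); zeros under orthogonal allocation
    n0           : noise power spectral density N_0
    """
    lam_k = np.asarray(lam_k, dtype=float)
    h_k = np.asarray(h_k)
    I = np.zeros_like(lam_k) if interference is None else np.asarray(interference, dtype=float)
    sinr = p * np.abs(h_k) ** 2 / (I + n0)          # per-RB SINR
    return float(np.sum(lam_k * np.log2(1.0 + sinr)))
```

Under the orthogonal allocation enforced later in (6c), the interference term is zero, so only the RB actually allocated to the client contributes to the sum.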
Here, $I_{k,b}(t)$ represents the uplink interference on client $k$ imposed by other clients over RB $b$, and $N_0$ is the noise power spectral density. Note that a successful communication between a scheduled client and the server within a channel coherence time is defined by satisfying a target minimum rate. In this regard, a rate constraint is imposed on the RB allocation in terms of a target signal-to-interference-plus-noise ratio (SINR) $\gamma_0$ as follows:

$$\lambda_{k,b}(t) \leq \mathbb{I}\big(\hat{\gamma}_{k,b}(t) \geq \gamma_0\big) \quad \forall k, b, t, \qquad (3)$$

where $\hat{\gamma}_{k,b}(t) = \frac{p|h_{k,b}(t)|^2}{I_{k,b}(t)+N_0}$ and the indicator $\mathbb{I}(\hat{\gamma} \geq \gamma_0) = 1$ only if $\hat{\gamma} \geq \gamma_0$.

[Fig. 1: FL with client scheduling under limited wireless resources and imperfect CSI. The figure depicts the iterative loop between the server and the clients: (1) CSI estimation at the server via GPR, (2) RB allocation, (3) local model uploading, (4) model averaging, and (5) global model broadcasting.]

B. Computational power consumption and computational time

Most of the clients in wireless systems are mobile devices operated by power-limited energy sources, with limited computational capabilities. As demonstrated in [30], the dynamic power consumption $P_c$ of a processor with clock frequency $\omega$ is proportional to the product $V^2\omega$. Here, $V$ is the supply voltage of the computing unit, which is approximately linearly proportional to $\omega$ [31]. Motivated by [30] and [31], in this work we adopt the model $P_c \propto \omega^3$ to capture the computational power consumption per client. Due to the concurrent tasks handled by the clients in addition to local model training, the dynamics of the available computing power $P_c$ at a client are modeled as a Poisson arrival process [32]. Therefore, we assume that the computing power $P_{c_k}(t)$ of client $k$ at time $t$ follows an independent and identically distributed exponential distribution with mean $\mu_c$. As a result, the minimum computational time required for client $k$ is given by

$$\tau_{c,k}(t) = \frac{\mu_c D_k M_k}{\sqrt[3]{P_{c_k}(t)/b}}, \qquad (4)$$

where $M_k$ is the number of SGD iterations computed over $\mathcal{D}_k$. The constants $\mu_c$ and $b$ are the number of clock cycles required to process a single sample and the power utilized per computation cycle, respectively. To avoid unnecessary delays in the overall training caused by clients' local computations, we impose a constraint on the computational time, with threshold $\tau_0$, over the scheduled clients as follows:

$$s_k(t) \leq \mathbb{I}\big(\tau_{c,k}(t) \leq \tau_0\big) \quad \forall k, t. \qquad (5)$$

C. Model training using FL

The goal in FL is to minimize a regularized loss function $F(\boldsymbol{w}, \mathcal{D}) = \frac{1}{D}\sum_{\boldsymbol{x}_i\in\mathcal{D}} f(\boldsymbol{x}_i^\dagger\boldsymbol{w}) + \xi\varrho(\boldsymbol{w})$, which is parametrized by a weight vector $\boldsymbol{w}$ (referred to as the model), over the global dataset within a predefined communication duration $T$. Here, $f(\cdot)$ captures the loss of either a regression or a classification task, $\varrho(\cdot)$ is a regularization function such as Tikhonov, $\boldsymbol{x}_i$ is the input vector, and $\xi(>0)$ is the regularization coefficient. It is assumed that clients do not share their data; instead, they compute their models over the local datasets.

For distributed model training in FL, clients share with the PS their local models, computed by solving $F(\boldsymbol{w}, \mathcal{D})$ over their local datasets using SGD; the PS in return does model averaging and shares the global model with all the clients.

Under imperfect CSI, prior to transmission, the channels need to be estimated via sampling. Instead of performing channel measurements beforehand, the channels between the clients and the PS over each RB are inferred as $\hat{h}(t) = J\big(t, \{\tau, h(\tau)\}_{\tau\in\mathcal{N}(t)}\big)$ using the past $N$ channel observations. Under channel prediction, $h_{k,b}(t) = \hat{h}_{k,b}(t)$ is used in the SINR of (2) and (3). The choice of a small $N$ ensures low computational complexity of the inference task. Here, the set $\mathcal{N}(t)$ consists of the $N$ most recent time instances satisfying $s(\tau)=1$ and $\tau < t$. With $\lambda_{k,b}(t)=1$, the channel $h_{k,b}(t)$ is sampled and used as an observation for the future, i.e., the RB allocation and the channel sampling are carried out simultaneously. In this regard, we define the information on the channel between client $k$ and the server at time $t$ as $\boldsymbol{j}_k(t) = [j_{k,b}(t)]_{b\in\mathcal{B}}$. For accurate CSI predictions, it is essential to acquire as much information about the CSI over the network as possible [33]. This is done by exploring new information, i.e., maximizing $\sum_k \boldsymbol{j}_k^\dagger(t)\lambda_k(t)$ at each time $t$, while minimizing the loss $F(\boldsymbol{w}, \mathcal{D})$. Therefore, the empirical loss minimization problem over the training duration is formally defined as follows:

$$\underset{\boldsymbol{w}(t),\,\boldsymbol{s}(t),\,\Lambda(t),\,\forall t}{\text{minimize}} \quad F\big(\boldsymbol{w}(T), \mathcal{D}\big) - \frac{\varphi}{T}\sum_{k,t}\boldsymbol{j}_k^\dagger(t)\lambda_k(t), \qquad \text{(6a)}$$
$$\text{subject to} \quad (1)\text{-}(3),\ (5), \qquad \text{(6b)}$$
$$A\Lambda^\dagger(t) \preceq \mathbf{1} \quad \forall t, \qquad \text{(6c)}$$
$$\mathbf{1}^\dagger\boldsymbol{s}(t) \leq B \quad \forall t, \qquad \text{(6d)}$$
$$\boldsymbol{s}(t) \in \{0,1\}^K,\ \lambda_k(t) \in \{0,1\}^B \quad \forall t, \qquad \text{(6e)}$$
$$\boldsymbol{w}_k(t) = \underset{\boldsymbol{w}'}{\arg\min}\ F\big(\boldsymbol{w}'\,|\,\boldsymbol{w}(t-1), \mathcal{D}_k\big) \quad \forall k, t, \qquad \text{(6f)}$$
$$\boldsymbol{w}(t) = \sum_k \tfrac{D_k}{D}\, s_k(t)\,\boldsymbol{w}_k(t) \quad \forall t, \qquad \text{(6g)}$$

where $\Lambda^\dagger(t) = [\lambda_k^\dagger(t)]_{k\in\mathcal{K}}$, $\varphi(>0)$ controls the impact of the CSI exploration, and $A$ is a $B \times K$ all-one matrix. The orthogonal channel allocation in (6c) ensures collision-free client uplink transmission with $I_{k,b}(t)=0$, and constraint (6d) defines the maximum allowable number of scheduled clients. The SGD-based local model calculation at client $k$ is defined in (6f).
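The computation-time model (4) and the eligibility indicator (5) can be sketched as follows; the constants, mean power, and function names are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def min_compute_time(D_k, M_k, P_ck, mu_c=1.0, b=1.0):
    """Minimum local computation time of Eq. (4):
    tau_{c,k} = mu_c * D_k * M_k / (P_ck / b)**(1/3),
    where mu_c is cycles per sample and b is power per cycle."""
    return mu_c * D_k * M_k / (P_ck / b) ** (1.0 / 3.0)

def schedulable(D_k, M_k, mu_c=1.0, b=1.0, tau0=1.2, mean_power=8.0):
    """Indicator of Eq. (5): draw the available computing power from an
    exponential distribution (the Poisson-arrival model in the text) and
    declare client k eligible only if its computation time meets tau0."""
    P_ck = rng.exponential(mean_power)
    return min_compute_time(D_k, M_k, P_ck, mu_c, b) <= tau0
```

Because the exponent in the denominator is cubic, a client needs eight times more available power to halve its computation time, which is why compute-poor clients are filtered out by (5) rather than waited on.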
The choice of the SINR target $\gamma_0$ in (3) ensures that local models are uploaded within a single coherence time interval².

III. OPTIMAL CLIENT SCHEDULING AND RB ALLOCATION POLICY VIA LYAPUNOV OPTIMIZATION

The optimization problem (6) is coupled over all clients in $\mathcal{K}$; hence, in what follows, we elaborate on the decoupling of (6) over the clients and the server, and derive the optimal client scheduling and RB allocation policy.

A. Decoupling the loss function of (6) via dual formulation

Let us consider an ideal unconstrained scenario where the server gathers all the data samples and trains a global model in a centralized manner. Let $F_0 = \min_{\boldsymbol{w}} F(\boldsymbol{w}, \mathcal{D})$ be the minimum loss under centralized training. By the end of the training duration $T$, we define the gap between the proposed FL under communication constraints and centralized training as $\varepsilon(T) = F\big(\boldsymbol{w}(T), \mathcal{D}\big) - F_0$. In other words, $\varepsilon(T)$ is the accuracy loss of FL with scheduling compared to centralized training. Since $F_0$ is independent of the optimization variables in (6a), replacing $F(\boldsymbol{w}, \mathcal{D})$ by $\varepsilon(T)$ does not affect optimality under the same set of constraints.

To analyze the loss of FL with scheduling, we consider the dual function of (6a) with the dual variable $\boldsymbol{\theta} = [\theta_1, \dots, \theta_D]$, $X = [X_k]_{k\in\mathcal{K}}$ with $X_k = [\boldsymbol{x}_i]_{i=1}^{D_k}$, and $\boldsymbol{z} = X^{\mathsf T}\boldsymbol{w}$, as follows:

$$\psi(\boldsymbol{\theta}) = \min_{\boldsymbol{w},\boldsymbol{z}}\ \sum_{\boldsymbol{x}_i\in\mathcal{D}} \tfrac{1}{D} f_i(\boldsymbol{x}_i^{\mathsf T}\boldsymbol{w}) + \xi\varrho(\boldsymbol{w}) + \tfrac{1}{D}\boldsymbol{\theta}^{\mathsf T}\big(\boldsymbol{z} - X^{\mathsf T}\boldsymbol{w}\big) \qquad \text{(7a)}$$
$$\stackrel{(a)}{=} \tfrac{1}{D}\inf_{\boldsymbol{w}}\big[D\xi\varrho(\boldsymbol{w}) - \boldsymbol{\theta}^{\mathsf T}X^{\mathsf T}\boldsymbol{w}\big] + \tfrac{1}{D}\sum_{i=1}^{D}\inf_{z_i}\big[\theta_i z_i + f_i(z_i)\big] \qquad \text{(7b)}$$
$$\stackrel{(b)}{=} -\xi\sup_{\boldsymbol{w}}\big[\boldsymbol{w}^{\mathsf T}\boldsymbol{v} - \varrho(\boldsymbol{w})\big] - \tfrac{1}{D}\sum_{k=1}^{K}\sum_{i=1}^{D_k}\sup_{z_i}\big[-\theta_i z_i - f_i(z_i)\big] \qquad \text{(7c)}$$
$$\stackrel{(c)}{=} -\underbrace{\sum_{k=1}^{K}\sum_{i=1}^{D_k}\tfrac{1}{D} f_i^*(-\theta_i)}_{\#1} - \underbrace{\xi\varrho^*(\boldsymbol{v})}_{\#2}. \qquad \text{(7d)}$$

Here, step (a) rearranges the terms, (b) converts the infimum operations to supremum operations while substituting $\boldsymbol{v} = X\boldsymbol{\theta}/\xi D$, and (c) uses the definitions of the conjugate functions $f_i^*(-\theta_i) = \sup_{z_i}\{-\theta_i z_i - f_i(z_i)\}$ and $\varrho^*(\boldsymbol{v}) = \sup_{\boldsymbol{w}}\{\boldsymbol{w}^{\mathsf T}\boldsymbol{v} - \varrho(\boldsymbol{w})\}$ [34].

With the dual formulation, the relation between the primal and dual variables is $\boldsymbol{w} = \nabla\varrho^*(\boldsymbol{v})$ [14]. Based on the dual formulation, the loss of FL with scheduling is $\varepsilon(T) = \psi_0 - \psi\big(\boldsymbol{\theta}(T)\big)$, where $\psi_0$ is the maximum dual function value obtained by the centralized method.

Note that the first term of (7d) decouples per client and thus can be computed locally. In contrast, the second term in (7d) cannot be decoupled per client. To compute $\varrho^*(\boldsymbol{v})$, first, each client $k$ locally computes $\Delta\boldsymbol{v}_k(t) = \tfrac{1}{\xi D}X_k\Delta\boldsymbol{\theta}_k(t)$ at time $t$. Here, $\Delta\boldsymbol{\theta}_k(t)$ is the change in the dual variable $\boldsymbol{\theta}_k(t)$ for client $k$ at time $t$, given as below,

$$\Delta\boldsymbol{\theta}_k(t) \approx \underset{\boldsymbol{\delta}\in\mathbb{R}^{D_k}}{\arg\max}\ -\tfrac{1}{D}\mathbf{1}^\dagger\big[f_i^*\big(-\boldsymbol{\theta}_k(t) - \boldsymbol{\delta}\big)\big]_{i=1}^{D_k} - \xi\varrho^*\big(\boldsymbol{v}(t)\big) - \tfrac{1}{D}\boldsymbol{\delta}^\dagger X_k^\dagger\nabla\varrho^*\big(\boldsymbol{v}(t)\big) - \tfrac{\eta/\xi}{2D^2}\lVert X_k\boldsymbol{\delta}\rVert^2, \qquad (8)$$

where $\eta$ depends on the partitioning of $\mathcal{D}$ [35]. It is worth noting that $\Delta\boldsymbol{\theta}_k(t)$ in (8) is computed based on the previous global value $\boldsymbol{v}(t)$ received from the server. Then, the scheduled clients upload $\Delta\boldsymbol{v}_k(t)$ to the server. Following the dual formulation, the model aggregation and update in (6g) at the server is modified as follows:

$$\boldsymbol{v}(t+1) := \boldsymbol{v}(t) + \sum_{k\in\mathcal{K}} s_k(t)\,\Delta\boldsymbol{v}_k(t). \qquad (9)$$

Using (9), the server computes the coupled term $\varrho^*\big(\boldsymbol{v}(t+1)\big)$ in (7d).

It is worth noting that at the $t$-th update, $\Delta\boldsymbol{\theta}_k(t)$ in (8) maximizes $\Delta\psi\big(\boldsymbol{\theta}_k(t)\big)$, which is the change in the dual function $\psi\big(\boldsymbol{\theta}(t)\big)$ corresponding to client $k$. Let $\boldsymbol{\theta}_k^\star(t)$ be the local optimal dual variable at time $t$, for which $\Delta\psi\big(\boldsymbol{\theta}_k^\star(t)\big) \geq \Delta\psi\big(\boldsymbol{\theta}_k(t)\big)$ holds. Then, for a given accuracy $\beta_k(t) \in (0,1)$ of the local SGD updates, the following condition is satisfied:

$$\frac{\Delta\psi_k\big(\Delta\boldsymbol{\theta}_k^\star(t)\big) - \Delta\psi_k\big(\Delta\boldsymbol{\theta}_k(t)\big)}{\Delta\psi_k\big(\Delta\boldsymbol{\theta}_k^\star(t)\big) - \Delta\psi_k(0)} \leq \beta_k(t), \qquad (10)$$

where $\Delta\psi_k(0)$ is the change in $\psi$ with a null update from the $k$-th client. For simplicity, we hereinafter assume $\beta_k(t) = \beta$ for all $k \in \mathcal{K}$ and $t$. With (10), the gap between FL with scheduling and the centralized method is bounded as shown in Theorem 1.

Theorem 1. The upper bound of $\varepsilon(T)$ after $T$ communication rounds is given by

$$\varepsilon(T) \leq D\Big(1 - (1-\beta)\sum_{t\leq T}\sum_{k\leq K}\tfrac{D_k}{TD}\,s_k(t)\Big)^{T}.$$

Proof: See Appendix A.

This yields that the minimization of $\varepsilon(T)$ can be achieved by minimizing its upper bound defined in Theorem 1. Henceforth, the equivalent form of (6) is given as follows:

$$\underset{[\Delta\boldsymbol{\theta}_k(t)]_k,\,\boldsymbol{s}(t),\,\Lambda(t),\,\forall t}{\text{minimize}} \quad D\Big(1 - (1-\beta)\sum_{t,k}\tfrac{D_k}{TD}\,s_k(t)\Big)^{T} - \frac{\varphi}{T}\sum_{k,t}\boldsymbol{j}_k^\dagger(t)\lambda_k(t), \qquad \text{(11a)}$$
$$\text{subject to} \quad (6b)\text{-}(6e),\ (8),\ (9). \qquad \text{(11b)}$$

B. GPR-based information metric $J(\cdot)$ for unexplored CSI

For CSI prediction, we use GPR with a Gaussian kernel function to estimate the nonlinear relation of $J(\cdot)$ with a GP prior. For a finite data set $\{t_n, h(t_n)\}_{n\in\mathcal{N}}$, the aforementioned GP becomes a multi-dimensional Gaussian distribution with zero mean and covariance $C = [c(t_m, t_n)]_{m,n\in\mathcal{N}}$ given by

$$c(t_m, t_n) = \exp\Big(-\tfrac{1}{\zeta_1}\sin^2\big(\tfrac{\pi}{\zeta_2}(t_m - t_n)\big)\Big), \qquad (12)$$

where $\zeta_1$ and $\zeta_2$ are the length and period hyper-parameters, respectively [36].

²The prior knowledge of channel statistics, model size, transmit power, and bandwidth is used to choose $\gamma_0$.
its uncertainty/variance is given by [37], Theorem 2. The upper bound of the Lyapunov DPP is given † −1 by, ĥ(t) = c (t)C [h(tn )]n∈N , (13) Υ(t) = c(t, t) − c† (t)C −1 c(t), (14) T −1 ∆L − φ DT 1 − ν̃(t) E[ν(t)|q(t)] + ϕE[l(t)|g(t)] ≤ where c(t) = [c(t, tn )]n∈N . The client and RB dependence E[q(t) ν(t) − u(t) + g(t) l(t) − k j †k (t)λk (t) + L0 P is omitted in the discussion above for notation simplicity. T −1 The uncertainty measure Υ(.) of the predicted channel ĥ − φ DT 1 − ν̃(t) ν(t) + ϕl(t) |q(t), g(t)], (18) calculated using GPR framework is used as j(.) in (6a) which Proof: See Appendix B. in turn allowing to exploring and sampling channels with high The motivation behind deriving the Lyapunov DPP is that uncertainty towards improving the prediction accuracy. Finally, minimizing the upper bound of the expected conditional Lya- it is worth nothing that under perfect CSI ĥ(t) = h(t) and punov DPP at each iteration t with a predefined φ yields the j(t) = 0. tradeoff between the virtual queue stability and the optimality of the solution for (16) [29]. In this regard, the stochastic C. Joint client scheduling and RB allocation optimization problem of (16) is solved via minimizing the Due to the time average objective in (11a), the problem upper bound in (18) at each time t as follows: sk (t) + g(t)j †k (t)λk (t) − P q(t)(1−β)Dk (11) gives rise to a stochastic optimization problem defined maximize k D over t = {1, . . . , T }. Therefore, we resort to the drift plus s(t),Λ(t),ν(t),l(t) penalty (DPP) technique from Lyapunov optimization frame- χ(t)ν(t) − g(t) − φϕ l(t), (19a) work to derive the optimal scheduling policy [29]. Therein, subject to (6b)-(6d), (16c), (16d), (19b) the Lyapunov framework allows us to transform the original s(t) ∈ {0, 1}K , λk (t) ∈ {0, 1}b ∀t, (19c) stochastic optimization problem into a series of optimizations T −1 problems that are solved at each timeP t, as discussed next. 
where χ(t) = q(t) − φDT 1 − ν̃(t) and the variables First, we denote u(t) =P(1 − β) k sk (t)Dk /D and define ∆θ k (t) with constraints (8) and (9) are decoupled from (19). its time average ū = t≤T u(t)/T . Then, we introduce By relaxing the integer (more specifically, boolean) variables auxiliary variables ν(t) andP l(t) with time average lower in (19c) as linear variables, the objective and constraints bounds ν̄ ≤ ū and ¯l ≤ T1 k,t j †k (t)λk (t) ≤ l0 , respectively. become affine, in which, (19) is recast as a linear program To track the time average lower bounds, we introduce virtual (LP) as follows: queues q(t) and g(t) with the following dynamics [29], [38], sk (t) + g(t)j †k (t)λk (t) − P q(t)(1−β)Dk [39]: maximize k D s(t),Λ(t),ν(t),l(t) χ(t)ν(t) − g(t) − φϕ l(t), (20a) q(t + 1) = max 0, q(t) + ν(t) − u(t) , (15a) subject to (6b)-(6d), (16c), (16d), (20b) P † g(t + 1) = max 0, g(t) + l(t) − k j k (t)λk (t) . (15b) 0 s(t), λk (t) 1. (20c) Therefore, (11) can be recast as follows: Due to the independence, the optimal auxiliary variables are derived by decoupling (16a), (16c), and (16d) as follows: minimize D(1 − ν̄)T − ϕ¯l, (16a) ( ( [∆θ k (t)]k ,s(t),Λ(t),ν(t),l(t)∀t ? 1 − β if χ(t) ≥ 0, ? l0 if g(t) ≥ φϕ, ν (t) = l (t) = subject to (11b), (15), (16b) 0 otherwise, 0 otherwise. 0 ≤ ν(t) ≤ 1 − β ∀t, (16c) (21) 0 ≤ l(t) ≤ l0 ∀t, (16d) Theorem 3. The optimal scheduling s? (t) and RB allocation u(t) = k P (1−β)Dk sk (t) ∀t. variables Λ? (t) are found using an interior point method D (16e) (IPM). Proof: See Appendix C. The quadratic Lyapunov function of (q(t), g(t)) is L(t) = The joint client scheduling and RB allocation is summarized q(t)2 + g(t)2 /2. Given q(t), g(t) , the expected condi- tional Lyapunov one slot drift at time t is ∆L = E[L(t + 1) − in Algorithm 1 and the iterative procedure is illustrated in L(t)|q(t), g(t)]. Weighted by a trade-off parameter φ(≥ 0), we Fig. 2. 
First, all clients compute their local models using local SGD iterations with the available computation power. In parallel, at the PS, the channels are predicted using GPR based on prior CSI samples, and clients are scheduled following the scheduling policy shown in Algorithm 1. Scheduled clients upload their local models to the PS, where the received models are averaged to obtain a new global model that is broadcast back to all K clients. To obtain the Lyapunov DPP, a penalty term is added to penalize a deviation from the optimal solution [29]:

φ( ∂/∂ν [(1 − ν)^T D]|_{ν=ν̃(t)} E[ν(t)|q(t)] − ϕ E[l(t)|g(t)] )
    = −φ( DT (1 − ν̃(t))^{T−1} E[ν(t)|q(t)] + ϕ E[l(t)|g(t)] ).   (17)

In this setting, by sampling the scheduled clients, the PS gets additional information on the CSI. In the example presented in Fig. 2, the dashed arrows correspond to non-scheduled clients due to the lack of computational resources and/or poor channel conditions. This strategy allows the proposed scheduling method to avoid unnecessary computation and communication delays.

Algorithm 1 Joint Client Scheduling and RB Allocation
Input: D, γ0, β, p, B, ξ
Output: s*(t), Λ*(t) for all t
1: q(0) = g(0) = 0, ν(0) = l(0) = 0, v(0) = 0
2: for t = 1 to T do
3:   Each client updates the PS with computing power state information (CPSI)
4:   Each client computes ∆θ_k(t) using (8)
5:   Channel prediction using GPR with (13)
6:   Calculate ν*(t) and l*(t) using (21)
7:   Derive s*(t) and Λ*(t) by solving (20) using an IPM
8:   Local model state information (MSI) (∆v_k(t), ∆θ_k(t)) is uploaded to the server
9:   Update ν̃(t) and q(t) via (15), and v(t) and θ(t) with (9)
10:  Global model v(t) is broadcast
11:  t → t + 1
12: end for

TABLE I: Simulation parameters
Number of clients (K): 10
Number of RBs (B): 6
Local model solving optimality (β): 0.7
Transmit power (p): 1 W
Model training and scheduling:
  Training duration (T): 100
  Number of local SGD iterations (M): 10
  Learning rate (η): 0.2
  Regularizer parameter (ξ): 1
  Lyapunov trade-off parameter (φ): 1
  Trade-off weight (ϕ): 1
  SINR threshold (γ0): 1.2
  Computation time threshold (τ0): 1.2
Channel prediction (GPR):
  Length parameter (ζ1): 2
  Period parameter (ζ2): 5
  Number of past observations (N): 20

Fig. 2: Illustration of the flow of the operation per client and PS over the iterative training process.

D. Convergence, optimality, and complexity of the proposed solution

Towards solving (6), the main objective (6a) is decoupled over clients and server using a dual formulation. Then, the upper bound of ε(T), the accuracy loss of scheduling compared to centralized training, is derived in Theorem 1, in which solving (11) is equivalent to solving (6). It is worth noting that minimizing ε(T) guarantees convergence to the minimum training loss, which is achieved as T → ∞ [14, Appendix B]. In (11), the minimization of the analytical expression of ε(T) with a finite T boils down to a stochastic optimization problem with a nonlinear time-average objective. This stochastic optimization problem is decoupled into a sequence of optimization problems (19) that are solved at each time iteration t using the Lyapunov DPP technique with guaranteed convergence [40, Section 4]. Following Remark 1, the solution of (20) yields the optimality of (19), ensuring that the convergence guarantees hold under the Lyapunov DPP method. Since the optimal solution of (20) is optimal for (19), the optimality of the proposed solution depends on the recast problem (16), which relies on the Lyapunov DPP technique. Moreover, the optimality of the Lyapunov DPP-based solution is in the order of O(1/ϕ) [40, Section 7.4]. Finally, the complexity of Algorithm 1 depends on the complexity of the IPM used to solve the LP. Specifically, the per-iteration complexity of the proposed solution is given by the computational complexity of solving the LP, which is in the order of O(n³L) [41] with n = (D + B + 1)K + 2 variables, each represented by an L-bit code.

IV. SIMULATION RESULTS

In this section, we evaluate the proposed client scheduling and RB allocation method using the MNIST and CIFAR-10 datasets, with f(·) and ρ(·) taken as the cross-entropy loss function and the Tikhonov regularizer, respectively. A subset of the MNIST dataset with 6000 samples, consisting of equal shares of the ten digit classes 0-9, is distributed over K = 10 clients, whereas for CIFAR-10 the data samples come from ten different categories but follow the same data distribution. In addition, the wireless channel follows a correlated Rayleigh distribution [42] with mean-to-noise ratio equal to γ0. For perfect CSI, a single RB is dedicated to channel estimation. The remaining parameters are presented in Table I.

For IID datasets, the training data is partitioned into K subsets of equal size, each consisting of an equal number of samples from all 10 classes, which are randomly distributed over the K clients. The impact of non-IID data on the performance is studied for i) heterogeneous dataset sizes: clients having training datasets with different numbers of samples, and ii) heterogeneous class availability: clients' datasets containing different numbers of samples per class.

0090-6778 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: Oulu University. Downloaded on June 17,2021 at 05:09:45 UTC from IEEE Xplore. Restrictions apply.
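To make the round structure of Algorithm 1 concrete, the following is a minimal Python sketch of one training round. It is an illustrative stand-in, not the authors' implementation: the toy scalar objective, the helper names (`local_sgd`, `predict_csi`, `fl_round`), the moving-average CSI predictor replacing the GPR of (13), and the greedy top-B scheduler replacing the LP of (20) are all simplifying assumptions.

```python
import random

K, B, M, ETA = 10, 6, 10, 0.2    # clients, RBs, local SGD steps, learning rate (Table I)

def local_sgd(w, data):
    """M local SGD steps on a scalar least-squares toy objective (stand-in for (8))."""
    for _ in range(M):
        x, y = random.choice(data)
        w -= ETA * (w * x - y) * x       # gradient of (w*x - y)^2 / 2
    return w

def predict_csi(history):
    """Stand-in for the GPR channel predictor (13): mean of the last 5 CSI samples."""
    recent = history[-5:]
    return sum(recent) / len(recent)

def fl_round(w_global, datasets, csi_hist):
    """One round: local training, scheduling, aggregation, global broadcast."""
    # 1) every client runs local SGD starting from the current global model
    local_models = [local_sgd(w_global, d) for d in datasets]
    # 2) the PS predicts each client's channel and greedily schedules the
    #    B best clients (a crude stand-in for solving the LP (20) with an IPM)
    scores = [predict_csi(h) for h in csi_hist]
    scheduled = sorted(range(K), key=lambda k: -scores[k])[:B]
    # 3) uploaded models are averaged into the new global model, broadcast to all
    w_new = sum(local_models[k] for k in scheduled) / len(scheduled)
    return w_new, scheduled

random.seed(1)
datasets = [[(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(600)] for _ in range(K)]
csi_hist = [[random.random() for _ in range(20)] for _ in range(K)]
w, scheduled = fl_round(0.0, datasets, csi_hist)
print(len(scheduled))   # B clients scheduled this round
```

Repeating `fl_round` for T rounds, with the CSI histories refreshed from the uploads of the scheduled clients, reproduces the iterative loop of Fig. 2.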
The dataset size heterogeneity is modeled by partitioning the training dataset over clients using the Zipf distribution, in which the dataset of client k is composed of D_k = D k^{−σ} / Σ_{κ∈K} κ^{−σ} samples [43]. Here, the Zipf parameter σ = 0 yields a uniform/homogeneous data distribution over clients (600 samples per client), whereas increasing σ results in heterogeneous dataset sizes among clients, as shown in Fig. 3. To control the heterogeneity in class availability over clients, we adopt the Dirichlet distribution, which is parameterized by a concentration parameter α ∈ (0, ∞] to distribute data samples from each class among clients [44]. With α = 0, each client's dataset consists of samples from a single class, and increasing α yields datasets with training data from several classes, with the majority of data drawn from a few classes. As α → ∞, each client receives a dataset with samples drawn uniformly from all classes, as illustrated in Fig. 3.

Throughout the discussion, centralized training refers to training that takes place at the PS with access to the entire dataset. In addition, how well the global model generalizes for individual clients is termed "generalization" in this discussion. Training data samples are distributed among clients except in centralized training. In the proposed approach, data samples are drawn from ten classes of handwritten digits.

We compare several proposed RB allocation and client scheduling policies as well as other baseline methods. Under perfect CSI, two variants of the proposed method, named quantity-aware scheduling (QAW) and quantity-unaware scheduling (QUNAW), are compared; their difference stems from either accounting for or neglecting the local dataset size in model updates during scheduling. Under imperfect CSI, GPR-based channel prediction is combined with QAW, yielding the QAW-GPR method. For comparison, we adopt two baselines: a random scheduling technique (RANDOM) and a proportional fair (PF) method, where fairness is expressed in terms of successful model uploads. In addition, we use the vanilla FL method [6] without RB constraints, denoted hereinafter as IDEAL. All proposed scheduling policies as well as baseline methods are summarized in Table II.

TABLE II: Proposed algorithms and benchmark algorithms
Proposed methods:
  QAW-GPR: PS uses dataset sizes to prioritize clients and GPR-based CSI predictions for scheduling.
  QAW: PS uses pilot-based CSI estimation and dataset sizes to prioritize clients.
  QUNAW: PS uses pilot-based CSI estimation without accounting for clients' dataset sizes.
Baselines:
  RANDOM: Clients are scheduled randomly.
  PF: Clients are scheduled with fairness in terms of successful model uploading.
Ideal setup:
  IDEAL: All clients are scheduled assuming no communication or computation constraints.

Fig. 3: From IID datasets to non-IID datasets under different choices of the Zipf parameter (σ) and the Dirichlet parameter (α). The performance of the proposed algorithms is evaluated under the choices highlighted by the four regions A, B, C, and D.

A. Loss of accuracy

Fig. 4 compares the loss of accuracy of all FL methods at each model aggregation round with respect to centralized model training. Here, we have considered the unbalanced dataset distribution (σ = 1.017) among clients to analyze the impact of dataset size on scheduling. It can be noted that IDEAL has the lowest loss of accuracy, ε(100) = 0.03, due to the absence of both communication and computation constraints. Under perfect CSI, Fig. 4a plots QAW and QUNAW for two different RB values B ∈ {3, 6}. With a 2× increase in RBs, the gain in the gap of loss is almost the same for both QAW and QUNAW. For B = 6, Fig. 4a shows that QAW reduces the gap in loss by 22.8 % compared to QUNAW. The reason is that QAW cleverly schedules clients with more data samples than QUNAW does when the dataset distribution among clients is unbalanced. Under imperfect CSI, QAW-GPR, PF, and RANDOM are compared in Fig. 4b alongside IDEAL and QAW. While RANDOM and PF show poor performance, QAW-GPR outperforms QAW by reducing the gap in loss by 23.6 %. The main reason for this improvement is that QAW needs to sacrifice some of its RBs for channel measurements, while the CSI prediction in QAW-GPR leverages all RBs.

B. Impact of system parameters

Fig. 5 shows the impact of the available communication resources (RBs) on the performance of the trained models F(w(100), D). Without computing and communication constraints, IDEAL shows the lowest loss, while RANDOM and PF exhibit the highest losses due to client scheduling with limited RBs. On the other hand, the proposed method QAW-GPR performs better than QAW and QUNAW for B ≤ 10 thanks to the additional RBs available when CSI measurements are missing. Beyond B = 10, the available number of RBs exceeds K, and thus channel sampling in QAW-GPR is limited to at most K samples. Hence, increasing B beyond K = 10 results in an increased number of under-sampled RBs, yielding high uncertainty in GPR and poor CSI predictions.
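The two heterogeneity knobs used above, Zipf over dataset sizes and Dirichlet over class proportions, can be reproduced in a few lines. This is a hedged sketch with our own function names; note that since `random.gammavariate` requires α > 0, the single-class limit α = 0 can only be approached with a small α rather than represented exactly.

```python
import random

def zipf_sizes(D, K, sigma):
    """Dataset sizes D_k = D * k^(-sigma) / sum_kappa kappa^(-sigma), k = 1..K."""
    weights = [k ** -sigma for k in range(1, K + 1)]
    total = sum(weights)
    return [round(D * w / total) for w in weights]

def dirichlet_shares(alpha, n_classes=10):
    """Per-client class proportions: a small alpha concentrates mass on few
    classes, while a large alpha approaches a uniform (IID) spread."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(n_classes)]
    s = sum(g)
    return [x / s for x in g]

print(zipf_sizes(6000, 10, 0))       # sigma = 0: 600 samples per client (balanced)
print(zipf_sizes(6000, 10, 1.017))   # sigma = 1.017: client 1 largest, client 10 smallest
```

Sampling one `dirichlet_shares(alpha)` vector per client and drawing that client's samples class-by-class according to the shares reproduces the sweep from region B (small α) to region A (large α) in Fig. 3.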
Inaccurate CSI prediction leads to scheduling clients with weak channels (stragglers), which consequently degrades the performance of QAW-GPR.

Fig. 4: Comparison of the loss of accuracy of all FL methods at each model aggregation round vs. centralized training, for Zipf parameter σ = 1.017 and α → ∞ (region C of Fig. 3): (a) FL with perfect CSI and B ∈ {3, 6}; (b) FL with imperfect CSI and B = 6.

Fig. 5: Impact of the available RBs on the gap of loss ε(100) for K = 10 clients, for the σ = 0 and α → ∞ dataset distribution (region A of Fig. 3).

Fig. 6: Impact of the number of clients on the gap of accuracy loss ε(100) for B = 5 RBs, for the σ = 0 and α → ∞ dataset distribution (region A of Fig. 3).

The impact of the number of clients (K) on the trained model performance with fixed RBs (B = 5) under different training policies is shown in Fig. 6. It can be noted that all methods exhibit higher losses with increasing K due to: i) local training with fewer data samples (2500/K), which deteriorates in the non-IID regime, and ii) the limited fraction of clients (5/K) that are scheduled at once (except for the IDEAL method). The choice of an equal number of samples per client (balanced datasets with σ = 0) highlights that QAW and QUNAW exhibit identical performance, as shown in Fig. 6. Under limited resources, QAW-GPR outperforms all other proposed methods and baselines by reaping the benefits of additional RBs for GPR-based channel prediction. The baseline methods PF and RANDOM are oblivious to both CSI and training performance, giving rise to higher losses compared to the proposed methods.

Fig. 7: Analysis of the loss of accuracy ε(100) as the system scales with a fixed clients-to-RBs ratio (i.e., B = K), for the σ = 0 and α → ∞ dataset distribution (region A of Fig. 3).

Fig. 7 analyzes the performance as the network scales uniformly with the number of RBs and users, i.e., K = B. Under ideal conditions, the loss in performance when increasing K, shown in Fig. 7, follows the same reasoning presented under Fig. 6. The GPR-based CSI prediction allows QAW-GPR to allocate all RBs to client scheduling, yielding the best performance among the proposed and baseline methods. It is also shown that the worst performance among all methods except IDEAL occurs at K = B = 2, owing to the dynamics of the CSI leading to scheduling stragglers over limited RBs. With increasing K and B, the number of possibilities for clients to be successfully scheduled increases, leading to improved performance in all baseline and proposed methods. In contrast to the baseline methods, the proposed methods optimize client scheduling to reduce the loss of accuracy, achieving performance closer to IDEAL with increasing K and B.
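For intuition on how the PS trades pilots for predictions, below is a generic GPR one-step channel predictor in the spirit of (13). The kernel form (a squared-exponential factor with length parameter ζ1 = 2 multiplied by a periodic factor with period parameter ζ2 = 5, echoing Table I), the jitter level, and the toy fading trace are our assumptions, not the paper's exact model.

```python
import numpy as np

Z1, Z2, NOISE = 2.0, 5.0, 1e-2   # length/period parameters (cf. Table I) and jitter

def kernel(a, b):
    """Assumed kernel: squared-exponential x periodic, capturing smooth fading
    with a repeating component."""
    d = a[:, None] - b[None, :]
    return np.exp(-d ** 2 / (2 * Z1 ** 2)) * np.exp(-2 * np.sin(np.pi * d / Z2) ** 2)

def gpr_predict(t_obs, y_obs, t_new):
    """Standard GPR posterior mean/variance at t_new from past CSI samples."""
    Kxx = kernel(t_obs, t_obs) + NOISE * np.eye(len(t_obs))
    Kxs = kernel(t_obs, t_new)
    mean = Kxs.T @ np.linalg.solve(Kxx, y_obs)
    var = kernel(t_new, t_new).diagonal() - np.sum(Kxs * np.linalg.solve(Kxx, Kxs), axis=0)
    return mean, var

rng = np.random.default_rng(0)
t = np.arange(20.0)                              # N = 20 past observations (Table I)
y = np.sin(2 * np.pi * t / Z2) + 0.05 * rng.normal(size=20)   # toy fading trace
mean, var = gpr_predict(t, y, np.array([20.0]))  # predict the next slot's channel
```

The posterior variance `var` is exactly the uncertainty that grows when RBs are under-sampled: with fewer or staler observations it increases, and scheduling on `mean` alone becomes unreliable, which is the behavior observed for QAW-GPR beyond B = K.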
However, beyond K = B = 5, the performance loss due to smaller local datasets outweighs the performance gain from the increasing number of scheduled clients, and thus all three proposed methods follow the trend of IDEAL, as illustrated in Fig. 7. Compared to the PF baseline with K = B = 15, QAW-GPR shows a 20 % reduction in the loss of accuracy.

C. Impact of dataset distribution

Fig. 8: Impact of the balanced-unbalancedness of the dataset distribution on the loss of accuracy ε(100) for B = 5 and K = 10, with α → ∞.

Fig. 8 plots the impact of the data distribution, in terms of the balanced-unbalancedness of the training sample size per client, on the loss of accuracy ε(100). Here, the x-axis represents the local dataset size of the client with the fewest training data, i.e., the dataset size D10 of the 10th client as per the Zipf distribution, with α → ∞. With balanced datasets, all clients contribute equally to model training; hence, scheduling only a fraction of the clients results in a significant loss in performance. In contrast, differences in dataset sizes reflect the importance of clients with large datasets. Therefore, scheduling important clients yields lower gaps in performance, even compared to methods that are oblivious to dataset sizes but fairly schedule all clients, as shown in Fig. 8. Among the proposed methods, QAW-GPR outperforms the others thanks to using the additional RBs freed by the absence of CSI measurements. Compared to PF and QAW, QAW-GPR shows a reduction of the loss of accuracy by 76.3 % for the most skewed data distribution (D10 = 40). In contrast, QUNAW yields higher losses compared to QAW and QAW-GPR when the training data is unevenly distributed among clients. As an example, the reduction of the loss of accuracy at D10 = 40 is 25.72 % for QAW compared to 40.7 % for QUNAW. The reason behind these lower losses is that client scheduling takes into account the training dataset size. For D10 = 600, due to the equal dataset sizes per client, the accuracy losses provided by QAW and QUNAW are identical. Therein, both QAW and QUNAW exhibit about a 43.6 % reduction in accuracy loss compared to RANDOM.

Fig. 9: Impact of the class-wise data heterogeneity on the loss of accuracy with σ = 0.

Next, in Fig. 9, we analyze the impact of non-IID data, in terms of the number of training samples available from all classes, on the accuracy of the trained model. Here, we compare the performance of the proposed methods with the baselines for several choices of α, with Dk = 250 for all k ∈ K (i.e., σ = 0). Fig. 3 shows the class distribution over clients for different choices of α with equal dataset size Dk = 250. Fig. 9 illustrates that the training performance degrades as the data distribution becomes class-wise heterogeneous. For instance, QAW under α → ∞ achieves a reduction in the loss of accuracy of 86.5 % compared to the most heterogeneous case α = 0. Moreover, QAW, QAW-GPR, and QUNAW perform better than RANDOM and PF. Finally, it can be noticed that from α = 0.01 to α = 10 the gap in terms of the accuracy-loss performance of the trained model changes rapidly for all methods, whereas beyond those points the changes are comparably small.

D. Impact of computing resources

Fig. 10: Impact of CPSI on QAW scheduling.

The impact of the limitations in computation resources on model training and client scheduling is analyzed in Fig. 10. Since allocating RBs to stragglers results in poor RB utilization, we define the RB utilization metric as the percentage of RBs used for a successful model upload over the allocated RBs. Then, we compare two variants of QAW, with and without considering the computation constraint (5), referred to as CAW (the original QAW of the previous discussions) and CUAW, respectively. It is worth highlighting that the computation time threshold is inversely proportional to the average computing power availability as per (4), i.e., a lower τ0 corresponds to a higher ε(t), and vice versa. Fig. 10 indicates that the use of CPSI for client scheduling in CAW reduces the number of scheduled computation stragglers, resulting in a lower accuracy loss in addition to a higher RB utilization over CUAW. Overall, CAW with QAW scheduling performs better than CUAW with QAW scheduling. For instance, compared to CUAW, CAW achieves an 11.6 % reduction in the loss of accuracy with τ0 = 0.6, which increases to 43.1 % with τ0 = 1.4. When τ0 is small, the number of stragglers increases, leading to poor performance for both CAW and CUAW. It is worth noting that by considering CPSI in the scheduling, CAW achieves at least an 18.2 % increase in RB utilization in addition to a reduction of the loss of accuracy by at least 11.6 %.

Fig. 11: Impact of the number of local SGD iterations on the loss of accuracy ε(100).

The impact of the number of local SGD iterations under limited computing power availability is analyzed in Fig. 11. Due to the assumption of unlimited computing power availability, IDEAL performs well with increasing local SGD iterations. Similarly, gradual reductions in the loss of accuracy for both QAW and PF can be seen as M increases from two to eight. However, further increasing M results in longer local computing delays for some clients under limited processing power as per (4). Such computation stragglers do not contribute to the training in both PF (they drop out due to the computation constraint) and QAW (they are not scheduled). Hence, increasing the number of local SGD iterations beyond M = 8 results in fewer clients contributing to the training, in which case increased losses of accuracy for QAW and PF are observed, as shown in Fig. 11.

E. Fairness

Fig. 12 indicates how well the global model generalizes for individual clients. Therein, the global model is used at the client side for inference, and the per-client histogram of model accuracy is presented. It can be seen that the global model in IDEAL generalizes well over the clients, yielding the highest average accuracy (96.8 %) over all clients and the lowest variance of 5.6. With QAW-GPR, a 96.1 % average accuracy and a variance of 7.8 are observed. It is also seen that QAW and QUNAW have almost equal mean accuracies (95.2 %) with variances of 15.2 and 16.3, respectively. Scheduling clients with larger datasets in QAW provides a lower variance in accuracy compared to QUNAW. Although RANDOM and PF are CSI-agnostic, they yield average accuracies of 93.4 % and 94.1 %, respectively, with the highest variances of 18.2 and 17.6. This indicates that client scheduling without any insight into the dataset size distribution and the CSI fails to provide high training accuracy or fairness under communication constraints.

Finally, Fig. 13 compares the generalization performance of one of the proposed methods, QAW, on the CIFAR-10 dataset instead of MNIST, under both IID and non-IID data. In contrast to MNIST, CIFAR-10 consists of color images of physical objects from ten distinct classes [45]. For training, we adopt a 3-layer convolutional neural network (CNN) with K = 10 clients training over T = 1000 iterations. A total of 2500 data samples are distributed over the clients under four settings corresponding to the four regions in Fig. 3: i) IID data with α → ∞ and σ = 0 in region A; ii) equal dataset sizes (σ = 0) under heterogeneous class availability (α = 0) in region B; iii) homogeneous class availability (α → ∞) with heterogeneous dataset sizes (σ = 1.017) in region C; and iv) heterogeneity in both (σ = 1.017 and α = 0) in region D.
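The RB utilization metric defined in this subsection is straightforward to compute from per-round scheduling outcomes. The following sketch uses our own hypothetical bookkeeping format (RBs allocated per scheduled client, together with whether its upload succeeded), not the paper's internal variables.

```python
def rb_utilization(allocations):
    """Percentage of allocated RBs that carried a successful model upload.
    allocations: iterable of (rbs_allocated, upload_succeeded) pairs."""
    used = sum(rbs for rbs, ok in allocations if ok)
    total = sum(rbs for rbs, _ in allocations)
    return 100.0 * used / total if total else 0.0

# Three scheduled clients; the computation straggler (2 RBs, failed upload)
# wastes its allocation, as in the CUAW discussion above.
print(round(rb_utilization([(2, True), (2, True), (2, False)]), 1))  # -> 66.7
```

Averaging this quantity over the T training rounds gives the curves compared for CAW and CUAW in Fig. 10.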
Fig. 12: Fairness comparison of the training accuracy among clients, for Zipf parameter σ = 1.017 and α → ∞ (region C of Fig. 3).

Fig. 13: Performance comparison of the proposed QAW scheduling approach on CIFAR-10 trained with a CNN vs. the IDEAL scheduling scheme: (a) IDEAL scheduling; (b) QAW scheduling.

Fig. 13a indicates that the training performance with the CIFAR-10 dataset suffers significantly from non-IID data even under the IDEAL communication and computation conditions. Interestingly, the proposed QAW yields performance identical to IDEAL under balanced datasets, disregarding the identicalness of the data. However, with unbalanced datasets, the training performance of QAW is degraded, as illustrated in Fig. 13b. The underlying reason is that unbalanced datasets induce non-IID data at each client, and scheduling fewer clients is insufficient to obtain a high training performance.

V. CONCLUSION

In this work, FL over wireless networks with limited computational and communication resources, and under imperfect CSI, is investigated. To achieve a training performance close to that of a centralized training setting, a novel client scheduling and RB allocation policy leveraging GPR-based channel prediction is proposed. Through extensive sets of simulations, the benefits of FL with the proposed client scheduling and RB allocation policy are validated and analyzed in terms of (i) system parameters, model performance, and computation resource limitations (number of RBs, number of clients) and (ii) the heterogeneity of the data distribution over clients (balanced-unbalanced, IID and non-IID). Results show that the proposed methods reduce the gap of the accuracy loss by up to 40.7 % compared to state-of-the-art client scheduling and RB allocation methods.
APPENDIX

A. Proof of Theorem 1

After t and t + 1 communication rounds, the expected increment of the dual function of (6a) satisfies

E[ψ(θ(t+1)) − ψ(θ(t))] ≥ E[ψ(θ(t+1))] − E[ψ(θ(t))].

This inequality holds since, for a convex function, the expectation of the difference is greater than the difference of the expectations. By adding and subtracting the optimal dual function value on the R.H.S.:

E[ψ(θ(t+1)) − ψ(θ(t))] ≥ E[ψ(θ*)] − E[ψ(θ(t))] + E[ψ(θ(t+1))] − E[ψ(θ*)]
  = Σ_{k=1}^K ∆ψ(∆θ*_k(t)) − E[ψ(θ(t))] + Σ_{k=1}^K ∆ψ(∆θ_k(t)) − Σ_{k=1}^K ∆ψ(∆θ*_k(t)).

From (10), and following the definition E[ψ(θ(t))] = Σ_{i=0}^K ∆ψ(0),

E[ψ(θ(t+1)) − ψ(θ(t))] ≥ Σ_{k=1}^K ∆ψ(∆θ*_k(t)) − E[ψ(θ(t))] − β( Σ_{k=1}^K ∆ψ(∆θ*_k(t)) − E[ψ(θ(t))] )
  = (1 − β)( Σ_{k=1}^K ∆ψ(∆θ*_k(t)) − E[ψ(θ(t))] ).

However, by selecting a subset of clients per iteration and repeating up to only t communication iterations, the bound is lower bounded as below:

E[ψ(θ(t+1)) − ψ(θ(t))] ≥ (1 − β)( Σ_{k=1}^K ∆ψ(∆θ*_k(t)) − E[ψ(θ(t))] )
  ≥ (1 − β) Σ_{τ=1}^t Σ_{k=1}^K (D_k/(tD)) s_k(t) ( Σ_{k=1}^K ∆ψ(∆θ*_k(t)) − E[ψ(θ(t))] ).

Now, following [14, Appendix B], Σ_{k=1}^K ∆ψ(∆θ*_k(t)) − E[ψ(θ(t))] ≥ s̄( ψ(θ*) − E[ψ(θ(t))] ) with s̄ ∈ (0, 1), and thus we have

ε(T) = E[ψ(θ*) − ψ(θ(T+1))]
  = E[ψ(θ*) − ψ(θ(T))] − E[ψ(θ(T+1)) − ψ(θ(T))]
  ≤ E[ψ(θ*) − ψ(θ(T))] − (1 − β) Σ_{τ=1}^T Σ_{k=1}^K (D_k/(TD)) s_k(t) ( ψ(θ*) − E[ψ(θ(T))] )
  = ( 1 − (1 − β) Σ_{τ=1}^T Σ_{k=1}^K (D_k/(TD)) s_k(t) )( ψ(θ*) − E[ψ(θ(T))] )
  ≤ ( 1 − (1 − β) Σ_{τ=1}^T Σ_{k=1}^K (D_k/(TD)) s_k(t) )( ψ(θ*) − E[ψ(θ(0))] ).

In [46], it is proved that ψ(θ*) − E[ψ(θ(0))] < D. Following that,

ε(T) ≤ ( 1 − (1 − β) Σ_{τ=1}^T Σ_{k=1}^K (D_k/(TD)) s_k(t) ) D.

B. Proof of Theorem 2

Using the inequality max(0, x)² ≤ x² on (15), we have

q²(t+1)/2 ≤ q²(t)/2 + (ν(t) − u(t))²/2 + q(t)( ν(t) − u(t) ),   (22a)
g²(t+1)/2 ≤ g²(t)/2 + ( l(t) − Σ_k j†_k(t)λ_k(t) )²/2 + g(t)( l(t) − Σ_k j†_k(t)λ_k(t) ).   (22b)

With L(t) = ( q²(t) + g²(t) )/2, the one-slot drift ∆L can be expressed as follows:

∆L ≤ E[ q(t)( ν(t) − u(t) ) + g(t)( l(t) − Σ_k j†_k(t)λ_k(t) )
     + ( ν(t) − u(t) )²/2 + ( l(t) − Σ_k j†_k(t)λ_k(t) )²/2 | q(t), g(t) ].   (23)

Let L0 be a uniform bound on ( ν(t) − u(t) )²/2 + ( l(t) − Σ_k j†_k(t)λ_k(t) )²/2 for all t; thus, (23) can be expressed as follows:

∆L ≤ E[ q(t)( ν(t) − u(t) ) + g(t)( l(t) − Σ_k j†_k(t)λ_k(t) ) + L0 | q(t), g(t) ].   (24)

Adding the penalty term (17), the upper bound of the DPP can be expressed as

∆L − φ( DT(1 − ν̃(t))^{T−1} E[ν(t)|q(t)] + ϕ E[l(t)|g(t)] )
  ≤ E[ q(t)( ν(t) − u(t) ) + g(t)( l(t) − Σ_k j†_k(t)λ_k(t) ) + L0
     − φ( DT(1 − ν̃(t))^{T−1} ν(t) + ϕ l(t) ) | q(t), g(t) ].   (25)

C. Proof of Theorem 3

Let Θ_k = q(t)(1 − β)D_k/D, Ω_{k,b} = g(t)j_{k,b}(t), and let B′ and K′ be the sets of allocated resource blocks and scheduled clients, respectively. Consider a scenario with non-integer solutions at optimality, i.e., λ*_{k,b} ∈ (0, 1] for b ∈ B′ and s*_k ∈ (0, 1] for k ∈ K′. Note that Ω_{k,b} = Φ for some Φ ≥ 0 holds for all b ∈ B′, since if Ω_{k,b′} > Φ for some b′ ∈ B′, then optimality holds only when λ_{k,b′} = 1 and λ_{k,b} = 0 for all b ∈ B′ \ {b′}. Moreover, optimality satisfies 1†λ_k = 1. With the constraint (1), this yields Σ_{b∈B} Ω_{k,b}λ_{k,b} = Σ_{b∈B′} Φλ_{k,b} = Φ. Hence, assigning λ_{k,b′} = 1 for any b′ ∈ B′ with λ_{k,b″} = 0 for all b″ ≠ b′ satisfies all constraints while resulting in the same optimal value Φ. Hence, it can be noted that a solution with non-integer λ*_k is not unique, and there exists a corresponding integer solution for λ*_k [claim A]. Substituting the above result in (1) yields s_k ≤ 1, and hence (1) and (20c) overlap. Following the same argument as for λ*_k, it can be shown that there exists an integer solution for s*_k that yields the same optimal value as s*_k ∈ (0, 1] for k ∈ K′ [claim B]. Based on claims A and B, it can be noted that any non-integer optimal solution is not unique, and there exists at least one integer solution for S* and Λ*. Hence, by solving (20) using an IPM and then selecting integer values for S* and Λ*, the optimal solution of (19) is obtained.

REFERENCES

[1] M. M. Wadu, S. Samarakoon, and M. Bennis, "Federated learning under channel uncertainty: Joint client scheduling and resource allocation," in 2020 IEEE Wireless Communications and Networking Conference (WCNC), pp. 1–6, 2020.
[2] M. D. W. M. Wadu, "Communication-efficient scheduling policy for federated learning under channel uncertainty," JULTIKA, University of Oulu: http://urn.fi/URN:NBN:fi:oulu-201912213414, 2019.
[3] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, "Wireless network intelligence at the edge," Proceedings of the IEEE, vol. 107, pp. 2204–2239, Nov. 2019.
[4] K. Yang, T. Jiang, Y. Shi, and Z. Ding, "Federated learning via over-the-air computation," IEEE Transactions on Wireless Communications, vol. 19, no. 3, pp. 2022–2035, 2020.
[5] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al., "Advances and open problems in federated learning," 2019.
[6] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated optimization: Distributed machine learning for on-device intelligence," arXiv preprint arXiv:1610.02527, 2016.
[7] H. Kim, J. Park, M. Bennis, and S.-L. Kim, "Blockchained on-device federated learning," IEEE Communications Letters, vol. 24, no. 6, pp. 1279–1283, 2019.
[8] N. Görnitz, A. Porbadnigk, A. Binder, C. Sannelli, M. Braun, K.-R. Müller, and M. Kloft, "Learning and evaluation in presence of non-iid label noise," in Artificial Intelligence and Statistics, pp. 293–302, 2014.
[9] J. L. Balcázar, R. Gavalda, and H. T. Siegelmann, "Computational power of neural networks: A characterization in terms of Kolmogorov complexity," IEEE Transactions on Information Theory, vol. 43, no. 4, pp. 1175–1183, 1997.
[10] T.-M. H. Hsu, H. Qi, and M. Brown, "Measuring the effects of non-identical data distribution for federated visual classification," arXiv preprint arXiv:1909.06335, 2019.
[11] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-iid data," arXiv preprint arXiv:1806.00582, 2018.
[12] T. Chen, G. Giannakis, T. Sun, and W. Yin, "LAG: Lazily aggregated gradient for communication-efficient distributed learning," in Advances in Neural Information Processing Systems, pp. 5050–5060, 2018.
[13] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in 2019 IEEE International Conference on Communications (ICC), (Shanghai, China), pp. 1–7, IEEE, May 2019.
[14] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, "Scheduling policies for federated learning in wireless networks," IEEE Transactions on Communications, vol. 68, no. 1, pp. 317–333, 2019.
[15] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," arXiv preprint arXiv:1909.07972, 2019.
[16] Y. Sun, S. Zhou, and D. Gündüz, "Energy-aware analog aggregation for federated learning with redundant data," in ICC 2020 - 2020 IEEE International Conference on Communications (ICC), pp. 1–7, IEEE, 2020.
[17] M. M. Amiri, D. Gunduz, S. R. Kulkarni, and H. V. Poor, "Update aware device scheduling for federated learning at the wireless edge," arXiv preprint arXiv:2001.10402, 2020.
[18] J. Ren, Y. He, D. Wen, G. Yu, K. Huang, and D. Guo, "Scheduling in cellular federated edge learning with importance and channel awareness," arXiv preprint arXiv:2004.00490, 2020.
[19] M. M. Amiri and D. Gündüz, "Federated learning over wireless fading channels," IEEE Transactions on Wireless Communications, vol. 19, no. 5, pp. 3546–3557, 2020.
[20] B. Nazer and M. Gastpar, "Computation over multiple-access channels," IEEE Transactions on Information Theory, vol. 53, no. 10, pp. 3498–3516, 2007.
[21] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010, pp. 177–186, Springer, 2010.
[22] A. Duel-Hallen, "Fading channel prediction for mobile radio adaptive transmission systems," Proceedings of the IEEE, vol. 95, no. 12, pp. 2299–2313, 2007.
[23] W. Liu, L.-L. Yang, and L. Hanzo, "Recurrent neural network based narrowband channel prediction," in 2006 IEEE 63rd Vehicular Technology Conference, vol. 5, pp. 2173–2177, IEEE, 2006.
[24] M. Karaca, O. Ercetin, and T. Alpcan, "Entropy-based active learning for wireless scheduling with incomplete channel feedback," Computer Networks, vol. 104, pp. 43–54, 2016.
[25] M. Osborne and S. J. Roberts, "Gaussian processes for prediction," Tech. Rep. PARG-07-01, 2007.
[26] F. Pérez-Cruz, S. Van Vaerenbergh, J. J. Murillo-Fuentes, M. Lázaro-Gredilla, and I. Santamaria, "Gaussian processes for nonlinear signal processing: An overview of recent advances," IEEE Signal Processing Magazine, vol. 30, no. 4, pp. 40–50, 2013.
[27] A. Schwaighofer, M. Grigoras, V. Tresp, and C. Hoffmann, "GPPS: A Gaussian process positioning system for cellular networks," Advances in Neural Information Processing Systems, vol. 16, pp. 579–586, 2003.
[28] A. Chiumento, M. Bennis, C. Desset, L. Van der Perre, and S. Pollin, "Adaptive CSI and feedback estimation in LTE and beyond: a Gaussian process regression approach," EURASIP Journal on Wireless Communications and Networking, vol. 2015, no. 1, p. 168, 2015.
[29] M. J. Neely, "Stochastic network optimization with application to communication and queueing systems," Synthesis Lectures on Communication Networks, vol. 3, no. 1, pp. 1–211, 2010.
[30] W. Yuan and K. Nahrstedt, "Energy-efficient CPU scheduling for multimedia applications," ACM Transactions on Computer Systems (TOCS), vol. 24, no. 3, pp. 292–331, 2006.
[31] K. De Vogeleer, G. Memmi, P. Jouvelot, and F. Coelho, "The energy/frequency convexity rule: Modeling and experimental validation on mobile devices," in International Conference on Parallel Processing and Applied Mathematics, pp. 793–803, Springer, 2013.
[32] I. T. Castro, L. Landesa, and A. Serna, "Modeling the energy harvested by an RF energy harvesting system using gamma processes," Mathematical Problems in Engineering, vol. 2019, 2019.
[33] M. Karaca, T. Alpcan, and O. Ercetin, "Smart scheduling and feedback allocation over non-stationary wireless channels," in 2012 IEEE International Conference on Communications (ICC), (Ottawa, Canada), pp. 6586–6590, IEEE, June 2012.
[34] S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex optimization.
[37] F. Pérez-Cruz, S. V. Vaerenbergh, J. J. Murillo-Fuentes, M. Lázaro-Gredilla, and I. Santamaría, "Gaussian processes for nonlinear signal processing: An overview of recent advances," IEEE Signal Processing Magazine, vol. 30, pp. 40–50, July 2013.
[38] D. Bethanabhotla, G. Caire, and M. J. Neely, "Adaptive video streaming for wireless networks with multiple users and helpers," IEEE Transactions on Communications, vol. 63, no. 1, pp. 268–285, 2014.
[39] Y. Mao, J. Zhang, and K. B. Letaief, "A Lyapunov optimization approach for green cellular networks with hybrid energy supplies," IEEE Journal on Selected Areas in Communications, vol. 33, no. 12, pp. 2463–2477, 2015.
[40] M. J. Neely, "Stability and probability 1 convergence for queueing networks via Lyapunov optimization," Journal of Applied Mathematics, vol. 2012, 2012.
[41] P. Gritzmann and V. Klee, "On the complexity of some basic problems in computational convexity: I. containment problems," Discrete Mathematics, vol. 136, no. 1-3, pp. 129–174, 1994.
[42] T. S. Rappaport, Wireless communications: principles and practice, vol. 2. 1996.
[43] I. Moreno-Sánchez, F. Font-Clos, and Á. Corral, "Large-scale analysis of Zipf's law in English texts," PLoS ONE, vol. 11, no. 1, p. e0147073, 2016.
[44] J. Pitman and M. Yor, "The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator," The Annals of Probability, pp. 855–900, 1997.
[45] A. Krizhevsky, G. Hinton, et al., "Learning multiple layers of features from tiny images," 2009.
[46] V. Smith, S. Forte, C. Ma, M. Takac, M. I. Jordan, and M. Jaggi, "CoCoA: A general framework for communication-efficient distributed optimization," arXiv preprint arXiv:1611.02189, 2016.

Madhusanka Manimel Wadu received his B.Sc. (Hons) degree in Electrical and Electronics Engineering from the University of Peradeniya, Sri Lanka, in 2015, and his M.Sc. degree in wireless communication engineering from the University of Oulu, Finland, at the end of 2019. He was a part-time research assistant at the University of Oulu, Finland, from 2018 to 2019, and he has almost three years of industry exposure in Sri Lanka. He is currently a Ph.D. student at the University of Oulu, Finland. His research interests include artificial intelligence (AI), machine learning improvements from the perspective of communication, and wireless channel prediction.

Sumudu Samarakoon (S'08, M'18) received the B.Sc. degree (honors) in electronic and telecommunication engineering from the University of Moratuwa, Moratuwa, Sri Lanka, in 2009, the M.Eng. degree from the Asian Institute of Technology, Khlong Nueng, Thailand, in 2011, and the Ph.D. degree in communication engineering from the University of Oulu, Oulu, Finland, in 2017. He is currently a Docent (Adjunct Professor) with the Centre for Wireless Communications, University of Oulu. His main research interests are in heterogeneous networks, small cells, radio resource management, machine learning at the wireless edge, and game theory. Dr. Samarakoon received the
Best Paper Award at the European Wireless Conference, Excellence Awards Cambridge university press, 2004. for innovators and the outstanding doctoral student in the Radio Technology Unit, CWC, University of Oulu, in 2016. He is also a Guest Editor of Telecom [35] J. Hiriart-Urruty and C. Lemaréchal, Fundamentals of Convex Analysis. (MDPI) special issue on “millimeter wave communications and networking Grundlehren Text Editions, Springer Berlin Heidelberg, 2004. in 5G and beyond.” [36] E. P. Xing, Advanced Gaussian Processes. 10-708: Probabilistic Graph- ical Models, University in Pittsburgh, Pennsylvania: Carnegie Mellon School of Computer Science, Feb. 2015. 0090-6778 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: Oulu University. Downloaded on June 17,2021 at 05:09:45 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCOMM.2021.3088528, IEEE Transactions on Communications 14 Dr Mehdi Bennis is an Associate Professor at the Centre for Wireless Communications, University of Oulu, Finland, Academy of Finland Research Fellow and head of the intelligent connectivity and networks/systems group (ICON). His main research interests are in radio resource management, hetero- geneous networks, game theory and distributed ma- chine learning in 5G networks and beyond. He has published more than 200 research papers in interna- tional conferences, journals and book chapters. He has been the recipient of several prestigious awards including the 2015 Fred W. 
Ellersick Prize from the IEEE Communications Society, the 2016 Best Tutorial Prize from the IEEE Communications Society, the 2017 EURASIP Best paper Award for the Journal of Wireless Commu- nications and Networks, the all-University of Oulu award for research, the 2019 IEEE ComSoc Radio Communications Committee Early Achievement Award and the 2020 Clarviate Highly Cited Researcher by the Web of Science. Dr Bennis is an editor of IEEE TCOM and Specialty Chief Editor for Data Science for Communications in the Frontiers in Communications and Networks journal. Dr Bennis is an IEEE Fellow. 0090-6778 (c) 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: Oulu University. Downloaded on June 17,2021 at 05:09:45 UTC from IEEE Xplore. Restrictions apply.