Q∗: Energy and delay-efficient dynamic queue management in TCP/IP virtualized data centers

Enzo Baccarelli, Paola G. Vinueza Naranjo, Mohammad Shojafar, Michele Scarpiniti∗

Department of Information Engineering, Electronics and Telecommunications, Sapienza University of Rome, via Eudossiana 18, 00184, Rome, Italy

Abstract

The emerging utilization of Software-as-a-Service (SaaS) Fog computing centers as an Internet virtual computing commodity is raising concerns over the energy consumption of networked data centers supporting delay-sensitive applications. In addition to the energy consumed by the servers, the energy wasted by the network devices that support reliable TCP/IP inter-Virtual Machine (VM) connections is becoming a significant challenge. In this paper, we propose and develop a framework for the joint characterization and optimization of TCP/IP SaaS Fog data centers that utilize a bank of queues to increase the fraction of admitted workload. Our goal is two-fold: (i) to maximize the average workload admitted by the data center; and (ii) to minimize the resulting networking-plus-computing average energy consumption. For this purpose, we exploit the Lyapunov stochastic optimization approach to design and analyze an optimal (yet practical) online joint resource management framework, which dynamically performs: (i) admission control; (ii) dispatching of the admitted workload; (iii) flow control of the inter-VM TCP/IP connections; (iv) queue control; (v) up/down scaling of the processing frequencies of the instantiated VMs; and (vi) adaptive joint consolidation of both physical servers and TCP/IP connections.
The salient features of the resulting scheduler (namely, the Q∗ scheduler) are that: (i) it admits a distributed and scalable implementation; (ii) it provides deterministic bounds on the instantaneous queue backlogs; (iii) it avoids queue overflow phenomena; and (iv) it effectively tracks the (possibly unpredictable) time-fluctuations of the input workload, in order to perform joint resource consolidation without requiring any a priori information and/or forecast of the input workload. The actual energy and delay performance of the proposed scheduler is numerically evaluated and compared against that of some competing state-of-the-art schedulers, under: (i) Fast, Giga and 10Giga Ethernet switching technologies; (ii) various settings of the reconfiguration-consolidation costs; and (iii) synthetic, as well as real-world, workloads. The experimental results support the conclusion that the proposed scheduler can achieve over 30 percent energy savings.

Keywords: TCP/IP virtualized data centers, energy-vs.-delay tradeoff, dynamic resource reconfiguration-consolidation, SaaS cloud, Fog computing.

∗ Corresponding author. Tel.: +390644585869.
Email addresses:
[email protected] (Enzo Baccarelli), [email protected] (Paola G. Vinueza Naranjo), [email protected] (Mohammad Shojafar), [email protected]
(Michele Scarpiniti)

Preprint submitted to Computer Communications, October 24, 2016

1. Introduction

Software-as-a-Service (SaaS) is quickly emerging as a leading model for the provision of on-demand network-intensive cloud services (like Web 2.0 and HTML5 services) to remote clients over the Internet [1]. SaaS is becoming the increasingly prevalent service delivery model as the underlying broadband networking and virtualization technologies that support Internet Web services and service-oriented architectures mature [2]. In fact, by allowing the clients to run their requests in the form of Virtual Machines (VMs), the SaaS model is compliant with the “pay-as-you-go” paradigm of the Cloud. However, this requires that data center providers shift to a VM-centric perspective, in which both revenues and consumed resources are optimized on a per-VM basis. The consumed resources must also include the network and computing energies, which are quickly becoming critical and, therefore, should be metered on a per-VM basis under the SaaS model [2]. In this regard, we observe that, while energy-saving techniques for virtualized computing servers have evolved [2], the energy consumption of the large number of switches and routers equipping intra-data center networks is becoming a major concern. The analysis in [3] points out that, in typical large-scale data centers by Google, the network energy exceeds 30 percent of the total at server utilizations below 70 percent, which are typical in production data centers.

Motivated by the aforementioned considerations, we focus on the application scenario of Fig. 1. In this scenario, remote clients exploit Internet connections to submit their workload to the serving data center. According to the SaaS model, the data center is virtualized, the client workloads are submitted in the form of demands for VM processing/storage quanta, and reliable inter-VM communication is attained through end-to-end virtualized TCP/IP connections.
Furthermore, in order to effectively cope with the (unpredictable) peaks of the input workload, the Middleware layer of the data center is equipped with an input queue cascaded with a set of per-VM local queues.

Regarding the novelty of the technical contribution of the paper, it mainly relies on the joint dynamic energy-efficient reconfiguration and consolidation of the networking-plus-computing virtualized resources available at the Middleware layer of the data center. Specifically, the addressed resource management problem jointly embraces: (i) dynamic access control of the input workload; (ii) dynamic control of the network flows of the instantiated TCP/IP connections; (iii) dynamic control of the queue backlogs; (iv) dynamic up/down scaling of the processing frequencies of the instantiated VMs; and (v) joint dynamic consolidation of both physical servers and TCP/IP connections. QoS guarantees are provided in the form of hard (i.e., deterministic) upper bounds on the queue backlogs, while the pursued objective is the achievement of an optimized tradeoff between the contrasting targets of low networking-plus-computing energy consumption and low queue backlogs. Due to the high number of involved physical servers and switches and the (typically unpredictable) quick time-fluctuations of the input workload, we also require that the reconfiguration and consolidation actions performed by the proposed scheduler (namely, the Q∗ Scheduler (Q∗S)) are dynamic (i.e., they do not require any forecast and/or a priori information on the statistics of the input workload), and are amenable to distributed and scalable implementation.

Overall, due to its self-configuring and distributed nature, the proposed Q∗S is expected to fit well the requirements of the emerging TCP/IP-based Fog computing paradigm for the energy-saving support of delay- and packet-loss-sensitive applications [4, 5]. In this regard, we anticipate that the bank of local queues of Fig.
1 allows us to perform flow and congestion control of the corresponding per-VM TCP/IP connections, while the input queue enforces admission control on the input workload, in order to guarantee QoS to the admitted clients' requests.

Figure 1: The considered virtualized networked computing platform. Dark blue boxes are virtual network interface cards.

1.1. Related work and outline of the paper

An examination of the related work supports the conclusion that the optimization of the aforementioned scheduling actions has so far been pursued along four main parallel research directions, so that the proposed joint solving approach appears to be novel.

A first research direction focuses on resource provisioning and management in SaaS clouds [6, 7, 8, 9, 10, 11, 12, 13]. Specifically, dynamic scaling of Web applications in multi-tier data centers is the common focus of [6, 7] and [8]. In MUSE [6], each physical server hosts replicas of all instantiated Web applications. The proposed dispatching algorithm allows the clients' requests to be served within finite average delays, while minimizing the number of under-utilized servers. In [7], network flow algorithms are developed to balance the per-application workloads over the running application instances. By focusing on connection-oriented Internet services (like Windows Live Messenger), [8] presents an integrated approach to perform load balancing and server provisioning. However, unlike our approach, none of the cited works relies on the utilization of virtual machines, and all of them assume that the served applications are rigidly structured in static multi-tier architectures. In contrast, our architecture of Fig. 1 is oriented to Fog computing-based environments, so that: (i) resource scheduling is performed on a per-VM basis; and (ii) no restriction is enforced on how applications are structured inside the VMs.

Consolidation of physical servers in virtualized data centers is the main focus of [9, 10] and [11].
Roughly speaking, after recognizing that the optimal VM-to-physical server mapping is an NP-hard problem, these works propose various reduced-complexity greedy-type heuristics that aim at minimizing the energy costs of the performed mappings. However, unlike our contribution, all these works do not consider time-fluctuations of the input workload and, therefore: (i) do not perform dynamic VM placement; and (ii) do not carry out queue management.

Dynamic management of the computing resources hosted by queue-equipped SaaS data centers is, indeed, the topic of [12] and [13]. Like our work, the solving approach pursued by these contributions relies on the exploitation of the Lyapunov criterion and aims at attaining good energy-vs.-delay tradeoffs. However, unlike our work, we note that: (i) both these works do not consider the networking aspects, do not perform network flow control and focus only on the management of the computing resources; (ii) [12] does not consider server consolidation; and (iii) [13] performs server consolidation by directly resorting to an offline (i.e., static) approach that applies exhaustive search and, hence, presents a (worst-case) computational complexity that scales up exponentially with the number of VMs. Furthermore, both [12] and [13] do not provide closed-form expressions for the dynamic scaling of the processing frequencies of the running VMs.

A second research direction focuses on resource provisioning in Infrastructure-as-a-Service (IaaS) data centers [14, 15, 16, 17]. For this purpose, [14] resorts to Dynamic Voltage and Frequency Scaling (DVFS) to tune the CPU speeds to the offered workload. PowerNap [15] considers emerging hardware technologies (like solid-state disks and self-refresh DRAM) for enabling rapid resource consolidation.
When a server is turned OFF, Somniloquy [16] delegates the duties of the main operating system to an embedded system implemented on an ad-hoc designed physical network interface card, in order to give the illusion that the server is always turned ON. Like our contribution, the works in [14, 15] and [16] develop dynamic task scheduling policies but, being focused on the IaaS model, they do not consider queue management, access control and network flow control. The work in [17] deals, indeed, with the network aspects of virtualized IaaS data centers. Specifically, by leveraging the topology properties of fat-tree networks, the authors of [17] develop a traffic engineering-based heuristic for VM placement. It aims at reducing the inter-pod network traffic by exploiting a priori knowledge of the traffic flows generated by the running applications. Hence, like our contribution, [17] focuses on the reduction of the network energy in virtualized data centers. However, unlike our contribution, this work: (i) does not consider the energy wasted by the physical servers; (ii) does not perform access control; and (iii) does not perform queue management.

A third research direction focuses on the optimization of the congestion control algorithm of the TCPNewReno protocol, in order to cope with the so-called “in-cast” traffic congestion induced by MapReduce-type applications [18, 19, 20]. For this purpose, [18] proposes to modify the multiplicative decrement policy of the standard TCP protocol, in order to account for the network congestion messages generated by the switches (see Eq. (2) of [18]). This approach is further refined in [19], which proposes a so-called gamma-correction factor to adjust the TCP congestion window by simultaneously accounting for the experienced congestion level and the deadline of the supported connection.
The latest improvement, proposed in [20], also includes a mechanism for the dynamic tuning of the timers for the generation of suitably delayed ACK messages. Overall, like our work, all these TCP-oriented contributions focus on flow control and consider the TCP/IP protocol as an effective means for managing the end-to-end transport connections in virtualized data centers. However, unlike our work, they: (i) do not consider the management of the computing physical servers; (ii) do not address resource consolidation; and (iii) do not consider the energy consumption of the underlying data centers.

A last research direction deals with energy-efficient distributed resource management in federated clouds. In this regard, the recent contributions in [21] and [22] develop a promising approach that aims at maximizing the management profit through a suitable energy-saving distribution of the workload over the available set of geographically dispersed data centers. Specifically, the framework considered by these papers assumes that each data center: (i) is equipped with input queues, in order to perform admission control; and (ii) is supplied by renewable (i.e., green) and on-grid (i.e., brown) energy sources. The common goal of both papers is the maximization of the utilization of green energy, while lowering the usage of brown energy. For this purpose, the novel notions of green and brown workloads and service rates are introduced, in order to facilitate the separation of the resulting profit-maximization problem into two simpler coupled sub-problems, namely, the green energy maximization sub-problem and the brown energy cost minimization sub-problem. The paper in [21] focuses on an application scenario with a single class of service, while the work in [22] generalizes the framework to the case of multiple classes of service.
For this purpose, in [22], each service class is modeled as a G/D/1 queue and is characterized by a maximum value of the tolerated request-loss probability. Overall, like our work, the papers in [21] and [22] address the joint problem of energy-saving admission control and dynamic scaling of the processing speeds in data centers equipped with input queues. However, unlike our contribution, we point out that: (i) both papers in [21] and [22] focus on inter-data center workload distribution; (ii) they consider only the (green-plus-brown) energy wasted by the computing servers, but neglect the corresponding network energy consumption; (iii) they do not address the topic of server and/or switch consolidation; and (iv) since the data centers considered in [21] and [22] are not virtualized, these papers do not consider the topic of the flow control of the per-VM queues and corresponding per-VM TCP/IP connections.

The remainder of this paper is organized as follows. After introducing the architecture of the considered TCP/IP virtualized data center in Section 2, the tackled optimization problem and the Lyapunov-based solving approach are detailed in Section 3. The proposed Q∗ scheduler is presented in Section 4, while its performance optimality and related QoS guarantees are the topics of Section 5. After discussing several implementation aspects and implementation complexity issues in Section 6, the results of the carried out experimental work are presented in Section 7. Finally, some conclusions and hints for future research are summarized in the conclusive Section 8. The final Appendixes report the proofs of the main analytical results.

Regarding the adopted notation, E{·} indicates expectation, ≜ means "equal by definition", min{a; b} (resp., max{a; b}) is the min (resp., max) operator, while [x]_+ indicates max{0; x}. Finally, arrowed symbols indicate column vectors, while over-barred symbols indicate time averages.

2.
The considered virtualized TCP/IP computing platform

Fig. 1 reports the main functional blocks of emerging virtualized networked data centers working under the SaaS model [4, 5, 23]. They operate at the Middleware layer and are composed of: (i) an Admission Control Server (ACS), which also acts as the gateway Internet router; (ii) an input queue, which temporarily buffers the admitted workload; (iii) a Load Balancer, which dynamically reconfigures and consolidates the available computing-plus-networking physical resources and also dispatches the workload buffered by the input queue to the turned-ON VMs; (iv) a Virtual Switch, which performs network flow control and manages the TCP/IP end-to-end connections; and (v) a bank of VMs, with each VM equipped with a local buffer for temporarily storing the assigned workload.

Inspired by the emerging next generation of broadband Fog-computing data centers [2, 4], we consider a time-slotted system, where the duration T_S (s) of each slot can range from tens of milliseconds to tens of minutes. In the sequel, t is the discrete slot index, with the t-th slot spanning the (semi-open) interval [t T_S, (t + 1) T_S), t ≥ 0.

The infrastructure layer of the considered data center is composed of a set {PS(k), k = 1, 2, ..., N_SE} of (possibly heterogeneous) physical servers. PS(k) (i.e., the k-th server) may host up to M_max(k) (possibly heterogeneous) virtual machines, so that M_V ≜ Σ_{k=1}^{N_SE} M_max(k) is the resulting maximum number of VMs hosted by the considered data center, while {VM(j), j = 1, ..., M_V} is the resulting set of available VMs. The (typically wired) physical intra-data center network is composed of a set {PW(m), m = 1, 2, ..., N_SW} of (possibly heterogeneous) physical switches. Each switch is equipped with multiple ports, and each port hosts a (typically Ethernet-type) Physical Network Interface Card (PNIC).

Before proceeding, four main remarks about the considered networked architecture are in order.
First, in our framework, the physical network may be single- or multi-hop and of arbitrary topology. Being of wired type, we only assume that the (possibly multi-hop) route spanned by each end-to-end TCP/IP connection of Fig. 1 is stable, i.e., it does not change during the lifetime of the supported connection. Second, as a consequence of the performed consolidation actions, the set of turned-ON VMs and associated TCP/IP connections may vary over time. Hence, in the sequel, S(t) ⊆ {1, 2, ..., M_V} indicates the set of VMs and related connections that are turned ON at slot t. Third, the admission control and load balancing actions can be implemented by either centralized front-end servers or distributed back-end servers. In this regard, we anticipate that the pursued solving approach and resulting optimal scheduling actions apply regardless of the specific implementation choice. Fourth, at least in the case when each physical server hosts homogeneous VMs, the corresponding maximum number M_max(k) of VMs hosted by the k-th physical server is directly evaluated as:

M_max(k) = ⌊ min{ Co_PS(k)/co_VM(k); Ra_PS(k)/ra_VM(k); Di_PS(k)/di_VM(k); Bw_PS(k)/bw_VM(k) } ⌋,

where Co_PS(k) (resp., Ra_PS(k), Di_PS(k), and Bw_PS(k)) is the number of physical cores (resp., RAM capacity, disk capacity and I/O bandwidth capacity) of the k-th physical server, while co_VM(k), ra_VM(k), di_VM(k), and bw_VM(k) are the corresponding per-VM demands.

2.1. Input workload and dynamics of the input queue

At the end of each slot, new workload arrives at the input of the ACS of Fig. 1 according to a random process {w(t) ∈ R_0^+, t ≥ 0}, which is limited up to w_max ∈ R_0^+ Information Units (IU) per slot¹. The arrival process is assumed independent of the queue backlogs, and its statistics may vary in an unpredictable way, so that we do not assume any a priori information about the statistical behavior of {w(t)}.
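As a concrete illustration, the per-server VM capacity formula above reduces to a floor over four capacity-to-demand ratios. The following sketch evaluates it for a purely hypothetical server and per-VM demand (the numerical figures are not taken from the paper):

```python
def max_vms_per_server(co_ps, ra_ps, di_ps, bw_ps,
                       co_vm, ra_vm, di_vm, bw_vm):
    """M_max(k): floor of the tightest capacity-to-demand ratio of PS(k)
    over physical cores, RAM, disk and I/O bandwidth."""
    return int(min(co_ps / co_vm, ra_ps / ra_vm,
                   di_ps / di_vm, bw_ps / bw_vm))

# Hypothetical server: 16 cores, 64 GB RAM, 2000 GB disk, 10 Gbit/s I/O;
# hypothetical per-VM demand: 2 cores, 4 GB, 100 GB, 1 Gbit/s.
print(max_vms_per_server(16, 64, 2000, 10, 2, 4, 100, 1))  # cores bind: 8
```

Here the core count is the binding resource, so the server hosts at most 8 VMs even though its RAM and disk would allow more.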
Hence, after indicating by a(t) ∈ R_0^+ (IU/slot) the number of IUs out of w(t) that are admitted into the input queue of Fig. 1 at the end of slot t, the following constraint holds²:

0 ≤ a(t) ≤ w(t) ≤ w_max, ∀ t ≥ 0.    (1)

¹ The meaning of an IU is application dependent. It may represent a bit, byte, segment or packet. We anticipate that, in the tests of Section 7, an IU is understood as a Mbit.
² Regarding the rejected fraction 1 − (a(t)/w(t)) of the workload, the data center can return a negative feedback to the clients, who, in turn, may resend their requests to the data center later.

Let r_j(t) ∈ R_0^+ (IU/slot), j = 1, ..., M_V, be the workload that is drained from the input queue during slot t and, then, forwarded to VM(j) over the j-th TCP/IP connection. Since only the turned-ON VMs may receive new workload, we must have: r_j(t) = 0, j ∉ S(t). Furthermore, the summation of the forwarded workloads cannot exceed the current backlog s(t) ∈ R_0^+ of the input queue, so that we must have: Σ_{j∈S(t)} r_j(t) ≤ s(t). Finally, since the maximum flow conveyed by each TCP/IP connection is upper limited by the maximum size of the corresponding congestion window [18], we introduce the additional constraint: Σ_{j∈S(t)} r_j(t) ≤ r_max. Overall, we must have:

r_j(t) ≥ 0, j = 1, ..., M_V, ∀ t ≥ 0,    (2.1)
r_j(t) = 0, j ∉ S(t), ∀ t ≥ 0,    (2.2)
Σ_{j∈S(t)} r_j(t) ≤ min{s(t); r_max}, ∀ t ≥ 0,    (2.3)

so that the time-evolution of the backlog {s(t) ∈ R_0^+, t ≥ 0} of the input queue reads as (see Fig. 1):

s(t + 1) = [ s(t) − Σ_{j∈S(t)} r_j(t) ]_+ + a(t),  t ≥ 0,  s(0) = 0.    (3)

2.2. Networking-plus-computing energy consumption

The main goal of the Transport-layer connections of Fig. 1 is to provide end-to-end links from the input queue to the local ones by exploiting the (possibly multi-hop) routes made available by the underlying switched physical network.
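Before detailing the energy models, the input-queue bookkeeping of Eqs. (1)–(3) can be sketched in a few lines; the helper names below are illustrative choices, not part of the paper's formalism:

```python
def input_queue_step(s, r_on, a):
    """One slot of Eq. (3): s(t+1) = [s(t) - sum_{j in S(t)} r_j(t)]_+ + a(t).
    r_on collects the flows r_j(t) of the turned-ON VMs only."""
    return max(0.0, s - sum(r_on)) + a

def dispatch_is_feasible(s, r, on_set, r_max):
    """Check constraints (2.1)-(2.3) for a candidate flow vector r (indexed by VM)."""
    nonneg = all(rj >= 0 for rj in r)                                    # (2.1)
    off_idle = all(r[j] == 0 for j in range(len(r)) if j not in on_set)  # (2.2)
    capped = sum(r) <= min(s, r_max)                                     # (2.3)
    return nonneg and off_idle and capped

# Hypothetical slot: backlog s(t) = 5 IU, flows r = (2, 0, 1) with S(t) = {0, 2},
# congestion-window cap r_max = 4 IU/slot, admitted workload a(t) = 3 IU.
assert dispatch_is_feasible(5.0, (2.0, 0.0, 1.0), {0, 2}, 4.0)
print(input_queue_step(5.0, (2.0, 1.0), 3.0))  # [5 - 3]_+ + 3 = 5.0
```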
As pointed out in [18, 19] and [20], there are (at least) two main reasons to resort to the TCP/IP (mainly, the TCPNewReno) protocol in order to implement these connections. First, it guarantees, by design, loss- and error-free transport of data. Second, it performs congestion control, in order to match the per-connection flows to the aggregate traffic conveyed by the overall network. In this regard, we note that the energy E_net(j, t) (Joule) consumed by the j-th TCP/IP connection is the summation:

E_net(j, t) = E_net^setup(j) + E_net^dyn(j, t),    (4)

of a static portion E_net^setup(j) and a dynamic portion E_net^dyn(j, t). As better detailed in Section 6, the static portion does not depend on the conveyed flow and accounts for the summation of the setup energies of the PNICs actually crossed by the j-th connection. The dynamic portion accounts for the additional flow-dependent energy consumed by the crossed PNICs. In this regard, we point out that the analysis detailed, for example, in [24] leads to the conclusion that the dynamic energy wasted by TCPNewReno connections working in the steady state (i.e., in the congestion avoidance state) is well captured by the following closed-form expression:

E_net^dyn(j, t) = σ_j (r_j(t))^γ,    (5)

where the state σ_j ((Joule) × (slot/IU)^γ) of the j-th connection is, in turn, given by:

σ_j = G_j ( RTT_j / (1.22 MSS) )^γ.    (6)

In Eq. (6), we have that [24]: (i) MSS (IU) is the maximum size of a TCP segment; (ii) RTT_j is the average round-trip-time (in multiples of the slot period) of the j-th connection; (iii) γ > 1 is a dimensionless shaping exponent that depends on the utilized switching technology (see Section 7); and (iv) G_j (Joule) is the summation of the dynamic energy consumptions of the PNICs crossed by the j-th connection. We point out that, after performing the routing operation, both E_net^setup(j) in (4) and G_j in (6) may be profiled online (see Section 6.5).
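A small numerical sketch of the connection energy model (4)–(6) follows; the values chosen for G_j, RTT_j, MSS, γ and the setup energy are hypothetical placeholders, since the paper profiles the actual figures online (Section 6.5):

```python
def sigma_state(G_j, rtt_j, mss, gamma):
    """Connection state of Eq. (6): sigma_j = G_j * (RTT_j / (1.22 * MSS))^gamma."""
    return G_j * (rtt_j / (1.22 * mss)) ** gamma

def connection_energy(e_setup, G_j, rtt_j, mss, gamma, r_j):
    """Eqs. (4)-(5): setup energy plus the flow-dependent term sigma_j * r_j^gamma."""
    return e_setup + sigma_state(G_j, rtt_j, mss, gamma) * r_j ** gamma

# Hypothetical connection: G_j = 1 J, RTT_j = 1.22 slots, MSS = 1 IU,
# gamma = 2, setup energy 0.5 J, conveyed flow r_j(t) = 3 IU/slot.
print(connection_energy(0.5, 1.0, 1.22, 1.0, 2.0, 3.0))  # 0.5 + 1 * 3^2 = 9.5
```

Since γ > 1, the dynamic term grows superlinearly with the conveyed flow, which is what makes spreading the workload over several connections energetically attractive.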
Under the SaaS model, clients submit their computing requests as variable-size VMs and, then, the data center provider must charge them on a per-VM basis [1]. Hence, according to this VM-centric perspective, as in [25], we proceed to model the overall computing energy E_com(j, t) (Joule) consumed by VM(j) at slot t as:

E_com(j, t) = E_com^idle(j, t) + ( E_com^max(j, t) − E_com^idle(j, t) ) ( f_j(t) / f_j^max )^α.    (7)

In Eq. (7), f_j(t) (IU/slot) is the number of IUs processed by VM(j) at slot t, while f_j^max (IU/slot) is the corresponding maximum processing capability, so that we must have:

0 ≤ f_j(t) ≤ f_j^max, ∀ t ≥ 0.    (8)

Furthermore, E_com^idle(j, t) (Joule) is the energy consumed by VM(j) at slot t in the idle state (i.e., when VM(j) is turned ON but its processing frequency vanishes), while E_com^max(j, t) is the corresponding energy when VM(j) runs at f_j^max. Finally, α ≥ 2 is a dimensionless exponent that depends on the energy profile of the hosting physical server [25]. In this regard, we observe that, due to the performed server consolidation actions, the number of turned-ON VMs per server is time-varying, so that both the idle and maximum energies in (7) may vary over time [25], and their actual values should be periodically profiled (see Section 6.5).

Passing to consider the dynamics of the j-th local queue of Fig. 1, let q_j(t) ∈ R_0^+ be its current backlog. Hence, since VM(j) acts as the (virtual) server of the j-th local queue and its service rate is f_j(t), we have that:

q_j(t + 1) = [ q_j(t) − f_j(t) ]_+ + r_j(t),  t ≥ 0,  q_j(0) = 0,  j ∈ S(t),    (9.1)

and

q_j(t + 1) ≡ q_j(t),  q_j(0) = 0,  j ∉ S(t),    (9.2)

where Eq. (9.2) accounts for the fact that turned-OFF VMs do not process workload. All queues work under the (general) G/G/1 service model.

Overall, the total networking-plus-computing energy E_tot(t) (Joule) consumed by the data center of Fig.
1 at slot t is the summation of the corresponding computing and networking energies and, then, equates:

E_tot(t) ≜ E_com^tot(t) + E_net^tot(t) ≡ Σ_{j∈S(t)} ( E_com(j, t) + E_net(j, t) ).    (10)

3. Tackled optimization problem and dynamic solving approach

The goal of the data center provider is to attain the (hopefully) best trade-off between two contrasting targets, namely, the maximization of the average admitted workload (i.e., the data center throughput) and the minimization of the average networking-plus-computing energy consumption. This is, indeed, the goal of the optimization problem we are going to formally introduce. For this purpose, let w̄ ≜ lim_{t→∞} (1/t) Σ_{τ=0}^{t−1} E{w(τ)} (IU/slot) be the time-average expected input workload. Furthermore, let Π be the set of all feasible scheduling policies, that is, the policies that, at each slot, select the admitted workload, the network flows, the VM processing frequencies and the set of turned-ON VMs without violating the constraints in (1), (2.1), (2.2), (2.3) and (8). Hence, after considering any feasible scheduling policy π ∈ Π that takes per-slot scheduling decisions S^π(t), a^π(t), r_j^π(t) and f_j^π(t) for all j = 1, ..., M_V, let:

ā^π ≜ lim_{t→∞} (1/t) Σ_{τ=0}^{t−1} E{a^π(τ)},    (11)

be the average expected throughput under the policy π, and let r̄_j^π, f̄_j^π, Ē_com^π(j), and Ē_net^π(j), j = 1, ..., M_V, be the analogously defined time-average expected network flows, VM processing frequencies, VM computing energies and per-connection networking energies, respectively. Therefore, after indicating by:

g : R_0^+ → R_0^+,    (12)

the revenue function adopted by the data center provider, and by β (Joule)^−1 the corresponding Energy Usage Efficiency (EUE) coefficient [2], our target is to solve the following stochastic optimization problem:

max_{π∈Π} [ g(ā^π) − β Σ_{j=1}^{M_V} ( Ē_com^π(j) + Ē_net^π(j) ) ],    (13.1)

s.t.:

0 ≤ ā^π ≤ w̄,    (13.2)

ā^π ≤ Σ_{j=1}^{M_V} f̄_j^π.
(13.3)

Table 1 reports the main introduced taxonomy.

Table 1: Main taxonomy of the paper.

Symbol | Meaning/Role
{PS(k), k = 1, ..., N_SE} | Set of physical servers
{PW(m), m = 1, ..., N_SW} | Set of physical switches
{VM(j), j = 1, ..., M_V} | Set of VMs
f_j(t) (IU/slot) | Processing frequency of VM(j) at t
r_j(t) (IU/slot) | Flow of the j-th TCP/IP connection at t
w(t) (IU/slot) | Input workload at t
a(t) (IU/slot) | Admitted workload at t
s(t), z(t), {q_j(t)} (IU) | Queue backlogs at t
E_com(j, t), E_net(j, t) (Joule) | j-th computing and networking consumptions at t
E_tot(t) (Joule) | Total data center energy consumption at t
w̄, ā, r̄_j, f̄_j (IU/slot) | Time averages of w(t), a(t), r_j(t) and f_j(t)
Ē_com(j), Ē_net(j) (Joule) | Time averages of E_com(j, t) and E_net(j, t)
V | Lyapunov control parameter
α, γ | Per-VM and per-connection power exponents
E_com^idle(j, t), E_com^max(j, t) (Joule) | Per-VM idle and maximum energies
σ_j, E_net^setup(j) (Joule) | Per-connection dynamic and setup energies

Regarding the optimization problem in (13), three explicative remarks are in order. First, the revenue function in (12) prices the data center throughput according to the billing policy of the data center provider [2]. Hence, since its analytical expression is strongly provider-dependent, in accordance with the law of diminishing marginal revenue in economics [1], we only assume that g(·) in (12) is a non-negative, strictly increasing and concave function that vanishes at the origin and admits a continuous first derivative³. Second, several reports corroborate the conclusion that, in state-of-the-art data centers, the energy consumption of non-Information Technology (non-IT) equipment (like cooling) may be considered almost proportional to that consumed by physical switches and servers [2]. Hence, the coefficient β in (13.1) is numerically equal to the ratio of the total energy consumed by the data center to that consumed by the IT equipment. Just as an example, old energy-inefficient data centers can have values of β larger than 2, while emerging energy-proportional data centers exhibit values of β as low as 1.2 [2]. Third, we observe that the constraint in (13.2) on the average throughput is compliant with the per-slot constraint in (1), while (13.3) limits the average throughput up to the average aggregate processing capability of all available VMs.

³ Just as an example, the log-function g(r) = log(1 + r) meets the above assumptions and is, indeed, a popular revenue function [2].

Let π^opt (resp., X^opt) be the solution of (13) (resp., the value of the objective function in (13.1) under π^opt). Although π^opt well meets the (aforementioned) contrasting targets of the data center provider, from an application point of view, its computation presents at least three main challenges. First, the evaluation of π^opt generally resists closed-form expression, and its analytical characterization relies on suitable arguments of randomized scheduling (see [26] for an in-depth formal examination of this topic). Second, the characterization of π^opt requires a priori knowledge of the average input workload w̄ in (13.2).
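To make the objective (13.1) concrete, the following sketch evaluates a per-slot counterpart of the utility, combining the log revenue function of footnote 3 with the VM energy model of Eq. (7); all numerical values below are hypothetical:

```python
import math

def log_revenue(a):
    """Footnote 3: g(a) = log(1 + a), non-negative, increasing, concave, g(0) = 0."""
    return math.log(1.0 + a)

def vm_energy(f, f_max, e_idle, e_max, alpha=2.0):
    """Eq. (7): E_com = E_idle + (E_max - E_idle) * (f / f_max)^alpha."""
    return e_idle + (e_max - e_idle) * (f / f_max) ** alpha

def slot_utility(a, energies, beta):
    """Per-slot analogue of the objective (13.1): revenue minus beta-weighted energy."""
    return log_revenue(a) - beta * sum(energies)

# Hypothetical slot: two ON VMs (half speed and full speed), beta = 1.2 (1/J).
e = [vm_energy(5.0, 10.0, 1.0, 5.0), vm_energy(10.0, 10.0, 1.0, 5.0)]
print(slot_utility(4.0, e, 1.2))
```

Larger β penalizes energy more heavily, mirroring the role of the EUE coefficient in the trade-off between throughput revenue and consumption.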
Since the workload offered to large-scale production data centers is typically nonstationary, its statistics may vary over time and are not known in advance. Third, even if π^opt could be computed for given workload statistics, the obtained expression would not be adaptive to unpredictable changes in the workload statistics and, then, it would have to be re-computed from scratch whenever the workload statistics change. Motivated by these considerations, in the next subsection, we present a dynamic solving approach that bypasses these drawbacks.

3.1. The pursued dynamic solving approach

Inspired by (quite) recent contributions on the dynamic control of queue systems [12, 13], the pursued solving approach relies on a suitable application of the Lyapunov optimization technique. We anticipate that, in so doing, the average performance of the developed dynamic scheduler approaches the optimal one X^opt arbitrarily closely. This is achieved without requiring any a priori knowledge or forecasting of w̄ in (13.2), while assuring QoS guarantees in terms of deterministically limited queue backlogs. Since the used formal framework is similar to that detailed, for example, in [26], in the sequel we report only the three steps that are more peculiar to our scenario.

First, after introducing the non-negative auxiliary variable c̄, we recast the problem in (13) in the following equivalent form⁴:

max [ g(c̄) − β Σ_{j=1}^{M_V} ( Ē_com(j) + Ē_net(j) ) ],    (14.1)

s.t.:

c̄ ≤ ā,    (14.2)
0 ≤ ā ≤ w̄,    (14.3)
ā ≤ Σ_{j=1}^{M_V} f̄_j.    (14.4)

Since g(·) is, by assumption, an increasing function, the constraint in (14.2) is attained with equality at the optimum, so that the problems in (13) and (14) are equivalent, i.e., they admit the same solution. Second, we observe that the constraint in (14.2) may be viewed as the stability constraint on a virtual (i.e., dummy) queue with average arrival (resp., service) rate c̄ (resp., ā).
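The virtual-queue mechanism just described can be sketched as follows (its formal dynamics is Eq. (15) below); intuitively, keeping z(t) stable forces the time-average of the virtual arrivals c(t) below that of the admitted workload a(t), which is exactly constraint (14.2):

```python
def virtual_queue_step(z, a, c):
    """One slot of the virtual (dummy) queue: z(t+1) = [z(t) - a(t)]_+ + c(t).
    Arrivals are the virtual process c(t); the service is the admitted workload a(t)."""
    return max(0.0, z - a) + c

# Hypothetical trace: whenever c(t) persistently exceeds a(t), z(t) grows,
# signaling that the average constraint c_bar <= a_bar is being violated.
z = 0.0
for a_t, c_t in [(1.0, 2.0), (1.0, 2.0), (3.0, 1.0)]:
    z = virtual_queue_step(z, a_t, c_t)
print(z)  # [0-1]_+ + 2 = 2; [2-1]_+ + 2 = 3; [3-3]_+ + 1 = 1.0
```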
Specifically, the dynamics of the backlog $\{z(t),\ t \ge 0\}$ of this virtual queue read as:

$$ z(t+1) = \left[ z(t) - a(t) \right]^{+} + c(t), \quad t \ge 0, \quad z(0) = 0, \qquad (15) $$

where $\{c(t) \in \mathbb{R}_0^+,\ t \ge 0\}$ is a virtual arrival process whose time-averaged expectation equates $\overline{c}$ and whose peak value is limited up to $w_{max}$, that is (see Eqs. (1) and (14.2)),

$$ \overline{c} = \lim_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}\{c(\tau)\}, \qquad (16.1) $$

and

$$ 0 \le c(t) \le w_{max}, \quad \forall\, t \ge 0. \qquad (16.2) $$

Third, the constraint in (14.2) on the stability of the virtual queue in (15) may be, in turn, automatically enforced by applying the Lyapunov criterion (see, for example, Chapters 4 and 5 of [26] for a general discussion of this topic). Toward this end, after noting that the backlogs of the queues of turned OFF VMs do not change (see Eq. (9.2)), let:

$$ L(t) \triangleq \frac{1}{2} \Big[ s^2(t) + z^2(t) + \sum_{j \in S(t)} q_j^2(t) \Big], \qquad (17) $$

be the Lyapunov function at slot $t$. Since Eq. (17) measures the overall queue congestion, in order to stabilize the queue system by persistently pushing the Lyapunov function towards zero, we introduce the one-step conditional Lyapunov drift function, formally defined as in [26]:

$$ D(t) \triangleq \mathbb{E}\left\{ L(t+1) - L(t) \,\middle|\, \vec{h}(t) \right\}, \qquad (18) $$

where $\vec{h}(t) \triangleq \left[ s(t),\, z(t),\, \{q_j(t),\ j \in S(t)\} \right]^{T}$ is the (column) vector of the queue backlogs in (17). Hence, as in [26], the goal of the optimal (in the Lyapunov sense) scheduler is to minimize the following drift-minus-utility function by acting on a per-slot basis:

$$ D(t) \;-\; V\, \mathbb{E}\Big\{ g(c(t)) - \beta \sum_{j \in S(t)} \big( \mathcal{E}_{com}(j,t) + \mathcal{E}_{net}(j,t) \big) \,\Big|\, \vec{h}(t) \Big\}. \qquad (19) $$

As in [26], the (dimensionless and non-negative) scalar parameter $V$ is a knob that allows the data center provider to tune the queue stability-vs.-utility trade-off. Specifically, large values of $V$ put the emphasis on the maximization of the utility, while low values emphasize the minimization of the queue delays.

$^4$ In order to lighten the notation, we omit the explicit dependence on the policy $\pi$.

3.2.
Upper bounding the drift-minus-utility function

Unfortunately, due to the presence of the maximum operator in Eqs. (3), (9) and (15), the per-slot minimization of (19) with respect to the set of variables:

$$ \left\{ c(t),\ a(t),\ S(t),\ \{r_j(t),\ f_j(t),\ j \in S(t)\} \right\}, \qquad (20) $$

resists closed-form expression. Hence, in order to tackle this challenge, we apply the same bounding technique of [26]. Shortly, we square both sides of Eqs. (3), (9) and (15), and then apply the following algebraic inequality:

$$ \left( \max\{0;\, a-b\} + d \right)^2 \;\le\; a^2 + b^2 + d^2 - 2a(b-d), \quad \forall\, a, b, d \ge 0, \qquad (21) $$

in order to bound the Lyapunov drift in (18). Hence, after introducing the attained bound on $D(t)$ into (19), we arrive at the following final upper bound on the drift-minus-utility function:

$$ D(t) - V\, \mathbb{E}\Big\{ g(c(t)) - \beta \sum_{j \in S(t)} \big( \mathcal{E}_{com}(j,t) + \mathcal{E}_{net}(j,t) \big) \,\Big|\, \vec{h}(t) \Big\} \;\le\; B_0 + \sum_{m=0}^{3} \mathbb{E}\left\{ \psi_m(t) \,\middle|\, \vec{h}(t) \right\}, \qquad (22.1) $$

with the dummy positions:

$$ B_0 \triangleq \frac{1}{2} \Big[ r_{max}^2 + 3\, w_{max}^2 + \sum_{j=1}^{M_V} \big( f_j^{max} \big)^2 \Big], \qquad (22.2) $$
$$ \psi_0(t) \triangleq a(t) \left( s(t) - z(t) \right), \qquad (22.3) $$
$$ \psi_1(t) \triangleq c(t)\, z(t) - V\, g(c(t)), \qquad (22.4) $$
$$ \psi_2(t) \triangleq \sum_{j \in S(t)} \left[ r_j(t) \left( q_j(t) - s(t) \right) + V \beta\, \mathcal{E}_{net}(j,t) \right], \qquad (22.5) $$
$$ \psi_3(t) \triangleq \sum_{j \in S(t)} \left[ V \beta\, \mathcal{E}_{com}(j,t) - f_j(t)\, q_j(t) \right]. \qquad (22.6) $$

Before proceeding, three main remarks are in order. First, it is proved in [26] that the upper bound in (22.1) is tight at the optimum, so that, without loss of optimality, we may directly minimize it in place of the drift-minus-utility function. Furthermore, since $B_0$ in (22.2) is a constant and the expectations in (22.1) are conditioned on the queue backlogs, this minimization reduces to the per-slot minimization of the $\psi$ functions in (22.3)-(22.6) [26]. Second, since these functions do not depend on the statistics of the input workload, all the results presented in the sequel apply regardless of the (a priori unknown) workload statistics.
Third, since the queue backlogs in (22.3)-(22.6) play the role of known parameters, the resulting dynamic scheduler of Section 4 is of clairvoyant type. Hence, in our framework, there is no reason to consider dynamic migration of running VMs and/or dynamic re-routing of the ongoing TCP/IP connections. This means, in turn, that the placement of both VMs and TCP/IP connections may be assumed static and captured by the following sets of binary variables:

$$ d_{kj} = \begin{cases} 1, & \text{if } VM(j) \text{ is hosted by } PS(k), \\ 0, & \text{otherwise}, \end{cases} \qquad (23.1) $$

and

$$ b_{mj} = \begin{cases} 1, & \text{if the } j\text{-th connection crosses } PW(m), \\ 0, & \text{otherwise}. \end{cases} \qquad (23.2) $$

In our framework, they play the role of known binary constants, and the topic of their actual setup is addressed in Section 6.4.

4. The proposed joint dynamic scheduler

The proposed dynamic scheduler interleaves reconfiguration and consolidation slots. Specifically, at each reconfiguration slot, the set $S(t)$ of turned ON VMs and turned ON TCP/IP connections remains unchanged, while the scheduler performs the following reconfiguration actions: (i) admission control and updating of the virtual queue; and, (ii) flow control and scaling of the processing frequencies. At each consolidation slot, the set $S(t)$ is, at first, updated and, then, the mentioned reconfiguration actions take place. The analytical expressions:

$$ \left\{ c^*(t),\ a^*(t),\ S^*(t),\ \{r_j^*(t),\ f_j^*(t),\ j \in S^*(t)\} \right\}, \qquad (24) $$

of the control actions taken by the proposed dynamic scheduler are detailed in the remaining part of this Section 4.

4.1. Updating of the virtual queue and access control

Let $g_r(r) \triangleq \partial g(r)/\partial r$ be the derivative of the revenue function in (12), and let $g_r^{-1}(r): \mathbb{R}_0^+ \to \mathbb{R}$ be the resulting inverse function. Hence, the minimization of $\psi_0(\cdot)$ in (22.3) (resp., $\psi_1(\cdot)$ in (22.4)) with respect to $a(t)$ (resp., $c(t)$) under the constraint in Eq. (1) (resp., in Eq. (16.2)) leads to the following results (see Appendix A for the proof).
Proposition 1 (Dynamic virtual queue updating and access control). The dynamic scheduler updates the admitted traffic and the arrivals of the virtual queue according to:

$$ a^*(t) = \begin{cases} w(t), & \text{for } s(t) < z(t), \\ 0, & \text{otherwise}, \end{cases} \qquad (25.1) $$

and

$$ c^*(t) = \max\left\{ 0;\ \min\left\{ w_{max};\ g_r^{-1}\!\left( \frac{z(t)}{V} \right) \right\} \right\}. \qquad (25.2) $$

In order to gain insight into the behavior of the control action in (25.2), we observe that, under the previously reported assumptions, the function $g_r^{-1}(r)$ is strictly decreasing and may also assume negative values. As a consequence, we have that: (i) $c^*(t)$ in (25.2) decreases and, then, vanishes for increasing values of the ratio $z(t)/V$; (ii) at $V = 0$, we have (see Eqs. (15) and (25.2)):

$$ c^*(t) \equiv z(t) \equiv 0, \quad \forall\, t \ge 0; \qquad (26) $$

and, (iii) the sensitivity of $c^*(t)$ to the ratio $z(t)/V$ increases for increasing values of the slope of $g_r^{-1}(\cdot)$.

Passing to consider the admission control in (25.1), we observe that it implements a threshold-based flow control that admits all the current input workload only if the backpressure term $(s(t) - z(t))$ is negative, i.e., only if the input queue is less congested than the virtual one. This implies that, at $V = 0$, we have (see Eqs. (3) and (26)):

$$ a^*(t) \equiv s(t) \equiv 0, \quad \forall\, t \ge 0. \qquad (27) $$

4.2. Per-connection flow control and per-VM scaling of the processing frequencies

The minimization of $\psi_2(\cdot)$ in Eq. (22.5) (resp., $\psi_3(\cdot)$ in Eq. (22.6)) under the constraints in (2.1)-(2.3) (resp., in (8)) leads to the following flow (resp., frequency scaling) control action (see Appendix B for the proof).

Proposition 2 (Dynamic network flow control and VM frequency scaling). For $j \in S(t)$, the dynamic scheduler updates the per-connection network flows and the per-VM processing frequencies as:

$$ r_j^*(t) = \begin{cases} \left( \dfrac{s(t) - \big( q_j(t) + \zeta^*(t) \big)}{\gamma\, V \beta\, \sigma_j} \right)^{\frac{1}{\gamma - 1}}, & \text{for } q_j(t) + \zeta^*(t) \le s(t), \\ 0, & \text{otherwise}, \end{cases} \qquad (28) $$

and
$$ f_j^*(t) = \min\left\{ f_j^{max};\ \left( \frac{ \big( f_j^{max} \big)^{\alpha}\, q_j(t) }{ \alpha\, V \beta \left( \mathcal{E}^{max}_{com}(j,t) - \mathcal{E}^{idle}_{com}(j,t) \right) } \right)^{\frac{1}{\alpha - 1}} \right\}. \qquad (29) $$

For $j \notin S(t)$, we have:

$$ r_j^*(t) = f_j^*(t) = 0. \qquad (30) $$

Furthermore, $\zeta^*(t)$ in (28) is the non-negative Lagrange multiplier of the constraint in (2.3). It is the (unique non-negative) root of the following nonlinear algebraic equation:

$$ \sum_{j \in S(t)} r_j^*\big( \zeta(t) \big) = \min\left\{ s(t);\ r_{max} \right\}, \qquad (31) $$

where $r_j^*(\cdot)$ is given by (28), with $\zeta^*(t)$ replaced by the dummy variable $\zeta(t)$.

Regarding the network flow control in (28), two main remarks are in order. First, in the presence of network costs (i.e., for positive values of $\sigma_j$), the performed flow control no longer follows the (usual) Join-the-Shortest-Queue policy reported, for example, in [12] and [13]. In fact, the flow $r_j^*(\cdot)$ conveyed by the $j$-th connection gently scales down for increasing values of the network cost $\sigma_j$, and the connections with larger energy costs tend to transport lower flows. Second, larger values of $\zeta^*(\cdot)$ tend to reduce more strongly the network flows dispatched to the more congested local queues.

Passing to consider the frequency scaling control action in (29), we observe that the $j$-th processing frequency scales up for increasing values of the ratio $q_j(t)/V$, and approaches $f_j^{max}$ at large enough values of this ratio. Intuitively, this is in agreement with the following two properties: (i) in order to stabilize the most congested local queues, higher values of the processing frequencies are required (see Eq. (9)); and, (ii) vanishing values of $V$ push the overall queue system to be less energy conserving, while enforcing queue stability (see Eq. (19)).

Finally, we observe that, at $V = 0$, the term proportional to the networking (resp., computing) energy in (22.5) (resp., in (22.6)) vanishes. As a consequence, since $s(t)$ also vanishes at $V = 0$ (see Eq. (27)), the minimization of (22.5) and (22.6) directly leads to:

$$ r_j^*(t) \equiv q_j(t) \equiv 0, \quad \text{and} \quad f_j^*(t) \equiv f_j^{max}, \quad \forall\, t \ge 0, \ \text{at } V = 0. \qquad (32) $$

4.3.
Joint consolidation of the networking-plus-computing resources

At each consolidation slot, the set $S(t)$ must be selected over the (discrete) collection $\Omega$ of all feasible configurations of turned ON VMs. After observing that only $\psi_2(\cdot)$ and $\psi_3(\cdot)$ in (22.5) and (22.6) depend on $S(t)$, the optimal set $S^*(t)$ of turned ON VMs is identified by solving the following optimization problem:

$$ S^*(t) = \arg\min_{S(t) \in \Omega} \left\{ \psi_2(t) + \psi_3(t) \right\}, \quad \text{s.t.: Eqs. (2.1), (2.2), (2.3) and (8).} \qquad (33) $$

At $V = 0$, all network flows vanish (see Eq. (32)), so that (33) reduces to the maximization of $\sum_{j \in S(t)} f_j(t)\, q_j(t)$ under the constraint in (8). This maximum is obtained, in turn, by running all the available VMs at their maximum frequencies, so that we have:

$$ S^*(t) \equiv \{1, \dots, M_V\}, \quad \text{and} \quad f_j^*(t) \equiv f_j^{max}, \ \forall\, j, \ \text{at } V = 0. \qquad (34) $$

However, at $V > 0$, the solution of (33) resists closed-form evaluation and, in principle, it should be computed by exhaustive search over the discrete set $\Omega$. This introduces, in turn, two main challenges. First, since the implementation complexity of the exhaustive search is proportional to the size of $\Omega$, it scales up in an exponential (resp., linear) way for increasing $M_V$ in the case of heterogeneous (resp., homogeneous) data centers$^5$. In practice, this limits the maximum affordable values of $M_V$ up to 11-12 (resp., 2000-4000) in the case of heterogeneous (resp., homogeneous) data centers. Second, the exhaustive search-based approach is inherently offline and, then, requires that, at each consolidation slot, the solution of (33) is re-computed from scratch.

$^5$ Specifically, in the homogeneous case, we have $\Omega = \{\emptyset, \{1\}, \{1, 2\}, \dots, \{1, 2, \dots, M_V\}\}$, so that the size of $\Omega$ is $(1 + M_V)$. However, in the heterogeneous case, the size of $\Omega$ grows combinatorially with $M_V$.
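For small instances, the exhaustive search over $\Omega$ can be written down directly. The sketch below brute-forces Eq. (33) in the heterogeneous case by scoring $\psi_2 + \psi_3$ over all $2^n$ subsets of VMs; every per-VM quantity (backlogs, flows, energies) is a hypothetical constant chosen only for illustration:

```python
from itertools import chain, combinations

def psi23(S, q, s, r, f, e_net, e_com, V, beta):
    """psi_2 + psi_3 from Eqs. (22.5)-(22.6) for a candidate ON-set S."""
    total = 0.0
    for j in S:
        total += r[j] * (q[j] - s) + V * beta * e_net[j]   # psi_2 term of VM j
        total += V * beta * e_com[j] - f[j] * q[j]         # psi_3 term of VM j
    return total

def exhaustive_consolidation(n, **kw):
    """Eq. (33) by brute force over all 2^n subsets of {0, ..., n-1}."""
    subsets = chain.from_iterable(combinations(range(n), k) for k in range(n + 1))
    return min(subsets, key=lambda S: psi23(S, **kw))

best = exhaustive_consolidation(
    3, q=[5.0, 1.0, 0.0], s=2.0, r=[1.0, 1.0, 1.0], f=[2.0, 2.0, 2.0],
    e_net=[0.1, 0.1, 0.1], e_com=[0.2, 0.2, 3.0], V=1.0, beta=1.0)
print(best)  # keeps ON only the VMs with a negative per-VM contribution: (0, 1)
```

Already at n = 20 this requires about a million evaluations per consolidation slot, which is precisely the scaling problem that motivates the adaptive approach of Section 4.3.1.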
In order to tackle these challenges, in Section 4.3.1 we develop a dynamic two-step approach for the joint consolidation of the computing-plus-networking resources. Specifically, in the first step, we transform the discrete optimization problem in (33) into an equivalent one that involves only continuous variables. The obtained continuous problem is non-convex and, then, it resists closed-form solution. Hence, in the second step, we develop a suitable set of adaptive iterations that approach the solution of the afforded consolidation problem by starting from the resource allocation configuration already computed at the previous slot.

4.3.1. The proposed dynamic approach for the joint adaptive consolidation

Let $P^{idle}_{SE}(k)$ (resp., $P^{max}_{SE}(k)$) be the idle (resp., maximum) power (measured in Watt) consumed by the $k$-th physical server of Fig. 1. At each consolidation slot, we have that: (i) $VM(j)$ is turned ON iff its processing frequency is positive; (ii) the $k$-th physical server is turned OFF only when all the hosted VMs are turned OFF (see Eq. (23.1)); (iii) the $j$-th TCP/IP connection is turned OFF iff $VM(j)$ is turned OFF (see Eq. (2.2)); and, (iv) all the available VMs and TCP/IP connections are candidates to be turned ON/OFF. As a direct consequence, after indicating by $u_{-1}(x)$ the unit-step Heaviside function (that is, $u_{-1}(x) = 0$ for $x \le 0$, and $u_{-1}(x) = 1$ for $x > 0$), the total computing and networking energies in (10) consumed by all physical servers and PNICs are given by:

$$ \mathcal{E}^{tot}_{com}(t) = T_S \sum_{k=1}^{N_{SE}} \left[ P^{idle}_{SE}(k)\, u_{-1}\!\Big( \sum_{j=1}^{M_V} d_{kj}\, f_j(t) \Big) + \frac{P^{max}_{SE}(k) - P^{idle}_{SE}(k)}{M^{max}(k)} \sum_{j=1}^{M_V} d_{kj} \left( \frac{f_j(t)}{f_j^{max}} \right)^{\!\alpha} \right], \qquad (35.1) $$

and

$$ \mathcal{E}^{tot}_{net}(t) = \sum_{j=1}^{M_V} \left[ \sigma_j \big( r_j(t) \big)^{\gamma} + \mathcal{E}^{setup}_{net}(j)\, u_{-1}\big( f_j(t) \big) \right], \qquad (35.2) $$

where the coefficients $\{d_{kj}\}$ in (35.1) account for the performed VM-to-physical server mapping (see Eq. (23.1)).
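Under the stated per-server model, Eqs. (35.1)-(35.2) can be evaluated directly from the current frequency and flow vectors. A Python sketch, in which the idle/maximum powers follow the commodity-server profile used later in Section 7 and all remaining numerical inputs are illustrative:

```python
def u(x):
    """Unit-step Heaviside function u_{-1}(x): 1 for x > 0, else 0."""
    return 1.0 if x > 0 else 0.0

def computing_energy(f, d, p_idle, p_max, f_max, m_max, alpha, T_S):
    """Eq. (35.1): per-server idle energy (paid iff the server hosts at
    least one ON VM) plus the frequency-dependent dynamic part."""
    E = 0.0
    for k in range(len(p_idle)):
        hosted = sum(d[k][j] * f[j] for j in range(len(f)))
        dyn = sum(d[k][j] * (f[j] / f_max[j]) ** alpha for j in range(len(f)))
        E += p_idle[k] * u(hosted) + ((p_max[k] - p_idle[k]) / m_max[k]) * dyn
    return T_S * E

def networking_energy(r, f, sigma, e_setup, gamma):
    """Eq. (35.2): per-connection dynamic plus setup energies."""
    return sum(sigma[j] * r[j] ** gamma + e_setup[j] * u(f[j])
               for j in range(len(r)))

# One server hosting two VMs (VM 1 is OFF); hypothetical parameters.
e_com = computing_energy(f=[10.0, 0.0], d=[[1, 1]], p_idle=[120.0],
                         p_max=[224.0], f_max=[10.0, 10.0], m_max=[10],
                         alpha=2.0, T_S=1.0)
e_net = networking_energy(r=[5.0, 0.0], f=[10.0, 0.0], sigma=[0.1, 0.1],
                          e_setup=[2.0, 2.0], gamma=2.0)
```

Note how the unit-step terms make both energies discontinuous at $f_j(t) = 0$: this is exactly the non-convexity that rules out a closed-form solution of the consolidation problem below.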
Hence, after introducing the dummy position:

$$ \varphi(t) \triangleq \Big\{ \sum_{j=1}^{M_V} \left[ r_j(t) \big( q_j(t) - s(t) \big) - q_j(t)\, f_j(t) \right] \Big\} + V \beta \left( \mathcal{E}^{tot}_{com}(t) + \mathcal{E}^{tot}_{net}(t) \right), \qquad (36) $$

the consolidation problem in (33) may be recast in the following equivalent form:

$$ \min_{\{r_j(t),\, f_j(t),\ j = 1, \dots, M_V\}} \varphi(t), \qquad (37.1) $$
$$ \text{s.t.:}\quad 0 \le f_j(t) \le f_j^{max}, \quad j = 1, \dots, M_V, \qquad (37.2) $$
$$ 0 \le r_j(t) \le r_{max}\, u_{-1}\big( f_j(t) \big), \quad j = 1, \dots, M_V, \qquad (37.3) $$
$$ \sum_{j=1}^{M_V} r_j(t) \le \min\left\{ s(t);\ r_{max} \right\}. \qquad (37.4) $$

Regarding the problem formulation in (37), four explicative remarks are in order. First, the terms proportional to the unit-step function account for the increments/decrements of the static energies that stem from turning ON/OFF physical servers and TCP/IP connections. Second, the $2M_V$ optimization variables in (37) are continuous-valued. Hence, unlike (33), (37) is a continuous optimization problem. Third, as a consequence of the (aforementioned) structural properties in (i)-(iv), the problem formulations in (33) and (37) are equivalent. Specifically, after computing the solution $\{r_j^*(t),\, f_j^*(t)\}$ of (37), the corresponding optimal set of turned ON VMs is evaluated as:

$$ S^*(t) = \left\{ j:\ f_j^*(t) > 0,\ j = 1, \dots, M_V \right\}. \qquad (38) $$

Fourth, according to (2.2), the constraint in (37.3) guarantees that the turned OFF VMs do not receive workload, while the constraint in (37.4) is compliant with (2.3).

Due to the presence of the unit-step function, the consolidation problem in (37) is not convex and, then, it resists closed-form solution. Hence, as the second step of the proposed solving approach, we proceed to develop a set of iterations in order to approach its solution. For this purpose, we leverage the primal-dual iteration-based algorithm recently revised in [27].
It builds up a set of parallel gradient-based iterations for the adaptive tracking of both the primal variables (i.e., the flow rates and processing frequencies in (37)) and the dual variables (i.e., the Lagrange multipliers of the considered constraints). Specifically, in the proposed dynamic implementation of the Q$^*$ scheduler, primal-dual iterations run: (i) at the beginning of each reconfiguration slot, in order to compute the solution $\zeta^*(t)$ of the nonlinear algebraic equation in (31); and, (ii) at the beginning of each consolidation slot, in order to iteratively approach the solution of the consolidation problem in (37). For this purpose, after indicating by $y(t)$ a (scalar) generic resource variable to be updated at slot $t$ and by $\mathcal{L}(t)$ the Lagrangian function of the considered optimization problem, the $n$-th iteration assumes the following general form [27]:

$$ y^{(n+1)}(t) = \left[ y^{(n)}(t) - \varrho_y^{(n)}(t)\, \nabla_y \mathcal{L}^{(n)}(t) \right]_{+}, \quad n \ge 0, \quad \text{with: } y^{(0)}(t) = y^*(t-1), \qquad (39) $$

where $n = 1, 2, \dots$, is a discrete iteration index which runs at the beginning of the considered slot $t$. Furthermore, according to [27], in Eq. (39) we have that: (i) $\nabla_y \mathcal{L}^{(n)}(t)$ is the scalar derivative of the considered Lagrangian function, taken with respect to the variable $y$ and evaluated at the resource configuration setting attained at the $n$-th iteration; (ii) $\varrho_y^{(n)}(t)$ is a suitable $n$-varying adaptive step-size that, according to [27], is set as: $\varrho_y^{(n)}(t) = 0.5 \big( y^{(n)}(t) \big)^2$; and, (iii) the projection in (39) accounts for the non-negative values of all the involved variables. At each slot $t$, the iterations in (39) start from the solution $y^*(t-1)$ attained at the previous slot. This allows the scheduler to effectively track over time the solution of the afforded consolidation problem without re-computing it from scratch, while escaping from the local minima of the objective function in (36).
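As a concrete companion to the iterations in (39), note that each $r_j^*(\zeta)$ in (28) is non-increasing in $\zeta$, so the left-hand side of (31) is monotone and the root $\zeta^*(t)$ can also be bracketed by plain bisection. The sketch below (with $\gamma = 2$ and made-up backlogs and network costs) is offered only as an illustrative, non-adaptive alternative to the primal-dual updates:

```python
def total_flow(zeta, s, q, V, beta, sigma, gamma=2.0):
    """Left-hand side of Eq. (31): sum of the per-connection flows in (28)."""
    tot = 0.0
    for q_j, sig_j in zip(q, sigma):
        if q_j + zeta <= s:
            tot += ((s - (q_j + zeta)) / (gamma * V * beta * sig_j)) ** (1.0 / (gamma - 1.0))
    return tot

def solve_zeta(s, r_max, q, V, beta, sigma, tol=1e-9):
    """Bisection for the root of Eq. (31); returns 0 when the capacity
    constraint (2.3) is already slack at zeta = 0."""
    target = min(s, r_max)
    if total_flow(0.0, s, q, V, beta, sigma) <= target:
        return 0.0
    lo, hi = 0.0, s                      # at zeta = s all flows in (28) vanish
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total_flow(mid, s, q, V, beta, sigma) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Two connections with hypothetical backlogs/costs; here the root is zeta = 1.
zeta = solve_zeta(s=10.0, r_max=6.0, q=[2.0, 4.0], V=1.0, beta=1.0, sigma=[1.0, 1.0])
```

Bisection re-solves from scratch at every slot, whereas the warm-started iterations in (39)-(40) reuse the previous slot's solution; this is exactly the tracking advantage claimed above.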
Just as an illustrative example, at the beginning of each reconfiguration slot, the root $\zeta^*(t)$ of the nonlinear algebraic equation in (31) for the optimal network flow control is computed by running the following iterations in the $n$-index:

$$ \zeta^{(n+1)}(t) = \left[ \zeta^{(n)}(t) - 0.5 \left( \zeta^{(n)}(t) \right)^2 \left( \min\{s(t);\ r_{max}\} - \sum_{j \in S(t)} r_j^*\big( \zeta^{(n)}(t) \big) \right) \right]_{+}, \quad n \ge 0, \qquad (40) $$

where $r_j^*(\cdot)$ is given by Eq. (28), with $\zeta^*(t)$ replaced by $\zeta^{(n)}(t)$.

5. Performance optimality and QoS of the proposed scheduler

In order to formally present the performance of the proposed scheduler of Section 4, let $T_{max} \ge 1$ be the maximum expected delay (in multiples of the slot period) between two consecutive consolidation slots, and let $g_r(0)$ be the (positive) value of the derivative of the revenue function in (12) at $r = 0$. Furthermore, after indicating by $\mathcal{E}^*_{tot}(t)$ the total energy in (10) consumed by the proposed scheduler of Section 4 at slot $t$, let:

$$ \chi^* \triangleq \liminf_{t \to \infty} \left[ g\!\left( \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}\{a^*(\tau)\} \right) - \beta\, \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}\left\{ \mathcal{E}^*_{tot}(\tau) \right\} \right], \qquad (41) $$

be the corresponding achieved expected utility. Hence, the following performance results hold (see Appendix C for the proof).

Proposition 3 (Hard QoS guarantees). Under arbitrary input workload (possibly exceeding the processing capability of the overall data center), the proposed dynamic scheduler guarantees that:

a) all the queues are strongly stable at each slot $t$ and for any $V \ge 0$, that is,

$$ z(t) \le g_r(0)\, V + w_{max}, \quad \forall\, t \ge 0, \qquad (42.1) $$
$$ s(t) \le g_r(0)\, V + 2\, w_{max}, \quad \forall\, t \ge 0, \qquad (42.2) $$
$$ q_j(t) \le g_r(0)\, V + 2\, w_{max} + r_{max}, \quad \forall\, t \ge 0, \ j = 1, \dots, M_V; \qquad (42.3) $$

b) for any $V \ge 0$, the average utility in (41) attained by the proposed scheduler is lower bounded as:

$$ \chi^* \ge \chi^{opt} - \frac{B_0\, T_{max}}{V}, \qquad (43) $$

where $B_0$ is the constant in (22.2) and $\chi^{opt}$ is the (previously defined) value of the objective function in (13.1) under the (possibly unknown) optimal policy $\pi^{opt}$.
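The backlog bounds of Proposition 3 are directly computable and can be used to dimension the queue buffers. A small sketch for the logarithmic revenue $g(r) = \log(1 + r)$ adopted in Section 7, for which $g_r(0) = 1$; the numerical values of $V$, $w_{max}$ and $r_{max}$ below are illustrative:

```python
def backlog_bounds(V, w_max, r_max, g_r0=1.0):
    """Deterministic bounds of Eqs. (42.1)-(42.3); for g(r) = log(1 + r),
    the derivative at the origin is g_r(0) = 1."""
    z_bound = g_r0 * V + w_max           # Eq. (42.1), virtual queue
    s_bound = g_r0 * V + 2.0 * w_max     # Eq. (42.2), input queue
    q_bound = s_bound + r_max            # Eq. (42.3), per-VM local queues
    return z_bound, s_bound, q_bound

print(backlog_bounds(V=100.0, w_max=10.0, r_max=10.0))  # (110.0, 120.0, 130.0)
```

Sizing each buffer to its bound rules out overflow by construction, and the linear growth of all three bounds in $V$ makes the utility-vs.-delay trade-off of the proposition explicit.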
The above results on the performance of the proposed scheduler merit four main remarks. First, Eqs. (42.1)-(42.3) guarantee deterministic upper bounds on all queue backlogs that, by Little's Theorem, translate into hard QoS guarantees on the total queue delays experienced by the clients of the data center of Fig. 1. Roughly speaking, this is due to the access control policy in (25.1), which is capable of stabilizing the queues even when the input workload exceeds the processing capability of the data center. As a consequence, in order to avoid queue overflow phenomena, it suffices that the buffering capacity of each queue equals the corresponding upper bound in (42). Second, in order to attain limited queue backlogs, the value assumed by $g_r(0)$ in (42) must be finite and (hopefully) as low as possible. This supports the utilization of log-like revenue functions, while it forbids resorting to power-like functions, such as, for example, $g(r) = r^m$ with $0 < m < 1$. Third, the $\chi^*$-vs.-$\chi^{opt}$ gap in (43) is limited up to $(B_0\, T_{max})/V$. Hence, it vanishes for $V \to \infty$ but, in so doing, all the bounds in (42) on the queue backlogs grow linearly unbounded. This confirms that, as in [12, 13], even in our framework there is an $[O(1/V),\, O(V)]$ utility-vs.-delay tradeoff. Fourth, at fixed $V$, the optimality gap in (43) linearly increases with the inter-consolidation delay $T_{max}$. Hence, we expect that the performance of the proposed scheduler degrades under low rates of occurrence of the consolidation actions, and we anticipate that the numerical results of Section 7 confirm, indeed, this expectation. Finally, at vanishing values of $V$, the control actions of the proposed scheduler are given by Eqs. (26), (32) and (34), so that the resulting utility in (41) is directly computable as:

$$ \chi^* = -\,\beta\, T_S \sum_{k=1}^{N_{SE}} P^{max}_{SE}(k). \qquad (44) $$

6.
Implementation aspects and implementation complexity

In this Section, we address several aspects related to the adaptive and distributed implementation of the proposed scheduler, together with the resulting implementation complexity. In so doing, we also point out some possible generalizations of the considered application scenario.

6.1. Dynamic selection of the consolidation slots and implementation complexity of the proposed scheduler

In agreement with the dynamic nature of the proposed scheduler, as in [11], we adopt a dynamic approach for the selection of the consolidation slots, based on the online evaluation of the per-slot average utilization of the turned ON VMs. Specifically, after indicating by $|S(t)|$ the size of the set $S(t)$ (i.e., the number of turned ON VMs at slot $t$), the current average utilization:

$$ U(t) \triangleq \frac{1}{|S(t)|} \sum_{j \in S(t)} \left( \frac{f_j^*(t)}{f_j^{max}} \right), \qquad (45) $$

of the currently turned ON VMs is computed, together with the exponentially weighted moving-average prediction of the next average utilization [11]:

$$ \widehat{U}(t+1) = \lambda\, \widehat{U}(t) + (1 - \lambda)\, U(t), \qquad (46) $$

where the scalar $\lambda$ is (usually) set to 0.7 [11]. Afterwards, as in [11], the following $l$-out-of-$m$ decision rule is applied: the next slot $(t+1)$ is flagged as a consolidation slot if at least $l$ out of the $m$ most recently measured utilizations in (45), as well as the predicted one in (46), fall out of a target interval $I \triangleq [Th_L, Th_U]$, where $Th_L$ (resp., $Th_U$) is the lower (resp., upper) desired value of the average utilization.

Overall, Table 2 reports the pseudo-code for the implementation of the proposed joint dynamic scheduler, namely, the Q$^*$ scheduler. Regarding its implementation complexity, three main remarks are in order.
First, since the access control in (25.1) and the evaluation of the virtual arrivals in (25.2) require the per-slot measurements of the input workload and of the backlogs of the input and virtual queues, these control actions may be implemented by the Access Server and the Load Balancer of Fig. 1, respectively. Likewise, the iterations in (40) for updating the (global) Lagrange multiplier $\zeta^*(\cdot)$ may be carried out by the Load Balancer, which also broadcasts to the turned ON VMs the current backlog of the input queue. Second, each VM may update its own network flow and processing frequency in a distributed way by directly measuring the current backlog of its local queue (see Eqs. (28) and (29)). Hence, the total per-slot implementation complexity of the Q$^*$ scheduler of Table 2 is $O(M_V)$, but the corresponding per-VM implementation complexity does not depend on the (possibly, very large) number $M_V$ of the VMs hosted by the data center, that is, it is of the order of $O(1)$.

Table 2: A pseudo-code of the proposed Q$^*$ scheduler

The Q$^*$ Scheduler (Q$^*$S)
1: for $t \ge 1$ do
2:   Perform access control through Eq. (25.1);
3:   Update the virtual arrivals through Eq. (25.2);
4:   Update $z(t+1)$ through Eq. (15), with $a(t)$ and $c(t)$ replaced by $a^*(t)$ and $c^*(t)$;
5:   Apply the $l$-out-of-$m$ decision rule and flag $t$ as a reconfiguration or consolidation slot;
6:   if $t$ is a consolidation slot then
7:     Perform resource consolidation by running the iterations in Eq. (39);
8:     Update $S(t)$ through Eq. (38);
9:   end if
10:  if $t$ is a reconfiguration slot then
11:    Compute $\zeta^*(t)$ through the iterations in Eq. (40);
12:    Evaluate Eqs. (28) and (29);
13:  end if
14:  Update $s(t+1)$ through Eq. (3), with $\{r_j(t)\}$ and $a(t)$ replaced by $\{r_j^*(t)\}$ and $a^*(t)$;
15:  Update $q_j(t+1)$ through Eq. (9), with $\{f_j(t)\}$ and $\{r_j(t)\}$ replaced by $\{f_j^*(t)\}$ and $\{r_j^*(t)\}$;
16:  Update $U(t)$ and $\widehat{U}(t+1)$ through Eqs. (45) and (46);
17: end for
Third, since the iterations in (39) are carried out at the beginning of each consolidation slot, the iteration index $n$ must run faster than the slot time $T_S$. Hence, the actual time duration $T_I$ (s) of each $n$-indexed iteration must be small enough to allow the iterations in (39) to converge within a limited fraction of the slot time. In this regard, the formal results of Theorem 3.3 of [27] assure the asymptotic convergence of the iterations in (39). Furthermore, we have numerically ascertained that, at least in the carried-out tests, about 20-25 $n$-indexed steps suffice in order to achieve convergence of (39) with a final accuracy below 1 percent. Hence, in the carried-out tests, we pose $T_I = T_S / 300$. Overall, the above remarks lead to the conclusion that the implementation of the Q$^*$ scheduler is distributed over the available VMs, while the resulting per-slot execution time is of the order of about 25-30 iteration periods, regardless of the (possibly large) size of the considered data center.

6.2. Managing the turning OFF/ON delay

Turning ON a VM may require some time, especially when the corresponding TCP/IP connection and hosting physical server must be also turned ON [2, 11]. Let $T^{ON}_{VM}$ be the (dimensionless) fraction of the slot time needed to turn ON a VM and the associated TCP/IP connection. Hence, since the corresponding time available for the data transport is reduced by a fraction equal to $T^{ON}_{VM}$, the constraint in (37.4) modifies as:

$$ \sum_{j \in S(t-1)} r_j(t) \;+\; \left( 1 - T^{ON}_{VM} \right) \sum_{j \notin S(t-1)} r_j(t) \;\le\; \min\left\{ s(t);\ r_{max} \right\}. \qquad (47) $$

The constraint in Eq. (47) applies when $t$ is a consolidation slot, so that the first (resp., second) summation is over the set of VMs that were turned ON (resp., turned OFF) at the previous slot $(t-1)$.

6.3.
Managing multiple applications and multi-tier data centers

The performance optimality of the proposed scheduler is retained also in the more general cases of multi-application [12, 13] and multi-tier [6, 7, 8] data centers.

In multi-application data centers, $N_{AP} \ge 2$ applications generate parallel input workloads that are processed by a bank of networked computing platforms [12]. Since each per-application computing platform retains the architecture of Fig. 1 and manages its own dedicated networking-plus-computing resources, as in [12] and [13], even in our framework, the overall optimal scheduler reduces to $N_{AP}$ parallel sub-schedulers that operate on a per-application basis and still apply the control actions of Section 4.

A similar conclusion holds also for multi-tier data centers for emerging Web 2.0 applications. They are composed of the cascade of $N_T \ge 2$ networked computing platforms that sequentially process the input workload in a pipelined fashion [6, 7, 8]. Hence, after indicating by $w^{(i+1)}(t) \triangleq \sum_{j \in S^{(i)}(t)} f_j^{(i)}(t)$ (resp., $a^{(i+1)}(t)$), $i \ge 1$, the workload processed by the $i$-th tier (resp., the workload admitted at the input of the $(i+1)$-th tier), the backlog of the input queue of the $(i+1)$-th tier evolves as:

$$ s^{(i+1)}(t+1) = \left[ s^{(i+1)}(t) - \min\Big\{ r_{max};\ \sum_{j \in S^{(i+1)}(t)} r_j^{(i+1)}(t) \Big\} \right]_{+} + a^{(i+1)}(t). \qquad (48) $$

The corresponding dynamics of the backlog of the $j$-th local queue are given by:

$$ q_j^{(i+1)}(t+1) = \left[ q_j^{(i+1)}(t) - f_j^{(i+1)}(t) \right]_{+} + r_j^{(i+1)}(t), \quad j \in S^{(i+1)}(t), \qquad (49.1) $$

and

$$ q_j^{(i+1)}(t+1) \equiv q_j^{(i+1)}(t), \quad j \notin S^{(i+1)}(t), \qquad (49.2) $$

where $S^{(i+1)}(t)$ is the set of turned ON VMs of the $(i+1)$-th tier at slot $t$. Hence, under the (common) assumption that each tier is equipped with its own dedicated networking-plus-computing resources [6], each tier is managed by a local scheduler that still applies the control actions of Section 4.

6.4.
Managing Virtual-to-Physical resource mapping

The task of the Virtualization layer of Fig. 1 is to map the demands for the per-connection flows in (28) and the per-VM processing frequencies in (29) made by the Middleware layer into adequate channel bandwidths and CPU cycles at the underlying Network and Server layers. These mappings may be performed by equipping the Virtualization layer of Fig. 1 with the so-called mClock and SecondNet mappers of [28] and [29], respectively. Specifically, Table 1 of [28] points out that the mClock mapper is capable of guaranteeing CPU cycles on a per-VM basis by adaptively managing the computing power of the underlying DVFS-enabled (possibly, multi-core) physical servers. The output of the mClock mapper is the set of binary coefficients $\{d_{kj}\}$ in (23.1). Likewise, the SecondNet network mapper provides Ethernet-type contention-free links atop any set of TCP-based (possibly, multi-hop) end-to-end connections by resorting to a suitable Port-Switching based Source Routing [29]. Its output is the set of binary coefficients $\{b_{mj}\}$ in (23.2). As pointed out in [28] and [29], both the mClock and SecondNet mappers may be implemented by exploiting the primitive functionalities provided, for example, by (commodity) Xen hypervisors [23].

6.5. Dynamic profiling of the computing-networking energy consumptions

The maximum and idle energies in Eq. (7) consumed by each VM may be dynamically profiled on a per-slot basis by equipping the Virtualization layer of Fig. 1 with the Joulemeter tool of [25]. It is a software tool that is capable of providing per-slot and per-VM energy metering functionalities, as currently exist in hardware for physical servers. For this purpose, Joulemeter uses hypervisor-observable hardware power states to track the VM energy usage on each hardware component (see Section 5 of [25] for a detailed description).
Interestingly, the field trials reported in [25] support the conclusion that, at least when the VMs hosted by each physical server are homogeneous, the per-slot maximum and idle energies wasted by a turned ON VM are proportional to the maximum and idle powers consumed by the hosting physical server, that is (see Eq. (7)),

$$ \mathcal{E}^{idle}_{com}(j,t) = \sum_{k=1}^{N_{SE}} d_{kj} \left( \frac{T_S\, P^{idle}_{SE}(k)}{M_k(t)} \right), \quad j \in S(t), \qquad (50.1) $$

and

$$ \mathcal{E}^{max}_{com}(j,t) = \sum_{k=1}^{N_{SE}} d_{kj} \left( \frac{T_S\, P^{max}_{SE}(k)}{M_k(t)} \right), \quad j \in S(t), \qquad (50.2) $$

where $M_k(t)$ is the number of VMs running atop the $k$-th physical server at slot $t$.

Passing to consider the online profiling of the setup and dynamic energies in (4) and (5) of the $j$-th TCP/IP connection, we observe that, in emerging broadband data centers, each PNIC typically supports a single connection [17], in order to reduce the resulting average round-trip time and, then, save energy (see Eqs. (5) and (6)). Hence, after indicating by $P^{setup}_{SW}(m)$ (resp., $P^{dyn}_{SW}(m)$) the setup (resp., dynamic) energy consumed by each PNIC hosted by the $m$-th physical switch, the resulting setup and dynamic energies in (4) and (5) of the $j$-th connection may be profiled online as the corresponding summations of the setup and dynamic energies consumed by the crossed PNICs, that is (see Eq. (23.2)),

$$ \mathcal{E}^{setup}_{net}(j) = \sum_{m=1}^{N_{SW}} b_{mj}\, T_S\, P^{setup}_{SW}(m), \quad j \in S(t), \qquad (51.1) $$

and

$$ G_j = \sum_{m=1}^{N_{SW}} b_{mj}\, T_S\, P^{dyn}_{SW}(m), \quad j \in S(t). \qquad (51.2) $$

Finally, after evaluating Eqs. (51.1) and (51.2), the $\gamma$ exponent in (5) may be profiled as:

$$ \gamma = \frac{ \log\!\left( \big( \mathcal{E}^{max}_{net}(j) - \mathcal{E}^{setup}_{net}(j) \big) / \sigma_j \right) }{ \log(r_{max}) }, \quad j \in S(t), \qquad (51.3) $$

where $\mathcal{E}^{max}_{net}(j)$ is the profiled setup-plus-dynamic energy in (4) consumed by the $j$-th connection when it works at $r_j(t) = r_{max}$. We anticipate that, in the experimental work of Section 7, we use Eqs. (50) and (51) in order to profile the involved energies.

6.6.
Managing discrete processing frequencies

Dynamic scaling of the VM processing frequencies relies on DVFS-enabled physical servers that, in general, make available only a finite set: F ≜ {f̂^(0), f̂^(1), ..., f̂^(H-1)}, of H ≥ 2 discrete CPU processing speeds. Hence, in order to deal with continuous and discrete frequency settings under a unified framework, we borrow the time-sharing approach developed in [30]. For this purpose, let f̂^(s) and f̂^(s+1) be the (discrete) allowed frequencies that surround the (continuous) processing frequency f*_j(t) in Eq. (29), that is, f̂^(s) ≤ f*_j(t) ≤ f̂^(s+1). Hence, according to [30], VM(j) runs at f̂^(s) (resp., f̂^(s+1)) during a fraction k_0 (resp., (1 − k_0)) of slot t. In order to leave the resulting per-slot energy consumption unchanged, we set (see Eq. (7)):

$$k_0=\frac{\left(\hat{f}^{(s+1)}\right)^{\alpha}-\left(f_j^{*}(t)\right)^{\alpha}}{\left(\hat{f}^{(s+1)}\right)^{\alpha}-\left(\hat{f}^{(s)}\right)^{\alpha}}.$$

In practice, the frequency hopping mechanism required for performing this time-sharing operation may be implemented by equipping the physical servers with commodity delta modulators [30].

7. Experimental work

The targeted data center of Fig. 1 is a generic SaaS cloud, so that its performance should be evaluated by considering a large-scale infrastructure. However, since it is challenging to carry out repeatable large-scale field trials on real-world data centers, we choose numerical simulations as a means to evaluate and compare the performance of the Q∗ scheduler, in order to assure the repeatability of the performed tests. The CloudSim toolkit [31] has been selected as the simulation platform, mainly because it natively supports a number of primitives for modeling networked SaaS data centers. In detail, the simulated data center comprises N_SE = 200 heterogeneous servers and each server may host up to M_max = 10 virtual machines, so that the total number of VMs is M_V = 2,000.
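The time-sharing rule of Section 6.6 is easy to make concrete. The following sketch, under the α-power energy model of Eq. (7), picks the two discrete speeds that bracket the continuous optimum and computes the slot fraction k_0; the variable names and the frequency grid are ours and only illustrative:

```python
import bisect

def time_share(f_star, freqs, alpha=2.0):
    """Split a slot between the two discrete frequencies surrounding f_star
    so that the per-slot dynamic energy (proportional to f^alpha) is
    preserved, as in [30]. Returns (f_low, f_high, k0): run at f_low for a
    fraction k0 of the slot and at f_high for the remaining (1 - k0)."""
    s = bisect.bisect_right(freqs, f_star) - 1
    if s >= len(freqs) - 1:            # f_star already equals the top speed
        return freqs[-1], freqs[-1], 1.0
    f_lo, f_hi = freqs[s], freqs[s + 1]
    k0 = (f_hi**alpha - f_star**alpha) / (f_hi**alpha - f_lo**alpha)
    return f_lo, f_hi, k0

freqs = [2.5, 4.5, 6.5, 8.5, 10.5, 12.5]    # H = 6 speeds (Mbit/slot)
f_lo, f_hi, k0 = time_share(7.0, freqs)
# energy check: k0*f_lo^alpha + (1-k0)*f_hi^alpha equals f_star^alpha
assert abs(k0 * f_lo**2 + (1 - k0) * f_hi**2 - 7.0**2) < 1e-9
print(f_lo, f_hi, k0)  # 6.5 8.5 0.775
```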
The simulated servers are partitioned into three sets of 40, 60, and 100 servers, respectively. According to the power profiles of current commodity servers [2], the (idle; maximum) per-server power consumptions of these three sets of servers are set to: (120 W; 224 W), (152 W; 256 W), and (180 W; 284 W), respectively. The topology of the simulated network is the fat-tree one [32], so that the set of servers is partitioned into 20 equal-size pods. Hence, the resulting total number of physical switches is N_SW = 500 (i.e., 200 edge switches, 200 aggregation switches and 100 core switches), while each 20-port switch is equipped with Ethernet-type PNICs. According to the power profile of Cisco Nexus 3548 commodity switches, the per-switch setup (resp., maximum) power is 130 W (resp., 250 W). Both the Admission Control Server and the Load Balancer are co-located at the access router of Fig. 1, so that they act as the root of the fat-tree network and are connected to each core switch by dedicated links. Minimum-hop static routing is applied to compute the Load Balancer-to-VM shortest routes.

Both synthetic and real-world workloads are considered for the tests, with the peak workload set to 70 percent of the overall processing capacity of the data center, that is,

$$r_{max}=w_{max}=0.70\times M_V\times f^{max}\ \text{(Mbit/slot)},\tag{52}$$

with f^max = 12.5 (Mbit/slot). Each simulated VM is equipped with H = 6 discrete processing frequencies, the simulated revenue function is the logarithmic one (i.e., g(r) = log(1 + r)), and the 3-out-of-5 decision rule is implemented in order to dynamically select the consolidation slots. Unless otherwise stated, the default values for the upper and lower consolidation thresholds are Th_L = 0.25 and Th_U = 0.75, the (normalized) delay for turning ON a VM is T^ON_VM = 10^{-2}, and the default value of the EUE coefficient in (13.1) is β = 1.5 (Joule)^{-1}.

Table 3: Default setup of the main simulated parameters.
T_S = 1 (s)                        T^ON_VM = 10^{-2}          P^idle_SE = 120, 152, 180 (Watt)
f^max = 12.5 (Mbit/slot)           H = 6                      P^max_SE = 224, 256, 284 (Watt)
r_max = w_max = 17.5 (Gbit/slot)   Th_L = 0.25                P^setup_SW = 130 (Watt)
N_SE = 200, N_SW = 500             Th_U = 0.75                P^max_SW = 250 (Watt)
M_V = 2,000                        β = 1.5 (Joule)^{-1}       M_max = 10
α = 2, γ = 1.1                     V = 5,500                  MSS = 1,500 (Byte)

Finally, the actual average total delay T̄*_tot (in multiples of the slot period) of the Q∗ scheduler is measured as in:

$$\overline{T}^{*}_{tot}=\frac{\lim_{t\to\infty}\frac{1}{t}\sum_{\tau=0}^{t-1} s^{*}(\tau)}{\lim_{t\to\infty}\frac{1}{t}\sum_{\tau=0}^{t-1}\sum_{j=1}^{M_V} r_j^{*}(\tau)}+\frac{1}{M_V}\sum_{j=1}^{M_V}\frac{\lim_{t\to\infty}\frac{1}{t}\sum_{\tau=0}^{t-1} q_j^{*}(\tau)}{\lim_{t\to\infty}\frac{1}{t}\sum_{\tau=0}^{t-1} f_j^{*}(\tau)},\tag{53}$$

where, according to Little's Theorem, the first ratio is the average delay induced by the input queue of Fig. 1, while each term of the second j-indexed summation is the average delay of the j-th local queue.

7.1. Performance tests under synthetic workload

A first collection of simulations is carried out under synthetic workload. This means that the input workload {w(t)} is an independent and identically distributed random sequence, whose samples are uniformly distributed over the interval [0, w_max]. Each simulated point is averaged over 10^6 slots. The goal of this first set of simulations is to unveil the impact of the Ethernet switching technologies on the energy performance of the Q∗ scheduler. For this purpose, Table 4 reports the numerically evaluated per-connection profiles of the tested Fast, Giga and 10G Ethernet technologies, while Fig. 2 reports the corresponding simulated average per-slot and per-VM energy consumptions. The energy curves of Fig. 2 refer to the case of disabled consolidation. In order to gain an intuitive insight into the decreasing behavior of the plots of Fig. 2, we note that, since the exponents γ and α in Eqs. (5) and (7) are larger than unity, the resulting energy functions are convex.
Hence, lower energy consumptions are obtained by splitting the input workload over larger numbers of VMs and TCP/IP connections, provided that consolidation is not performed, i.e., the number of turned-ON servers is held constant.

Table 4: Numerically evaluated energy profiles of the tested Ethernet switching technologies.

Per-connection parameters      Fast Ethernet    Giga Ethernet    10G Ethernet
σ (Joule × (s/Mbit)^γ)         1.8 × 10^{-3}    1.3 × 10^{-3}    8.5 × 10^{-4}
γ                              1.3              1.25             1.1
E^setup_net (Joule)            1.9 × 10^{-4}    1.1 × 10^{-4}    8 × 10^{-5}
RTT (ms)                       4.2              0.45             0.04

Passing to compare the numerical values of the plots of Fig. 2, we point out that, since the simulated peak workload in (52) is quite large, the energy consumptions of the tested switching technologies are mainly dominated by the corresponding round-trip-times in (6). Hence, since the (profiled) average round-trip-time of the 10G Ethernet technology is the lowest one (see the last row of Table 4), the consumed energies of Fig. 2 decrease by about 15 percent when passing from the Fast Ethernet technology to the 10G one. For this reason, all the numerical results reported in the sequel refer to the 10G Ethernet technology. Furthermore, they also account for the effect of resource consolidation.

Figure 2: Per-slot and per-VM total energy consumptions (Joule) of the Q∗ scheduler versus M_V under the tested switching technologies for the application scenario of Section 7.1 at V = 5,500 and β = 1.5 (Joule)^{-1}.

Let us now test the performance sensitivity of the Q∗ scheduler to the Lyapunov parameter V and the EUE coefficient β, in terms of: (i) the attained average utility (see Fig. 3); (ii) the average queue delay (see Fig. 4); and, (iii) the fraction of the admitted workload (see Fig. 5). In order to explain the behaviors of the plots of Figs.
3, 4 and 5 on an intuitive basis, three main remarks are in order.

First, at fixed β and for increasing values of V, the minimization of the drift-plus-utility function in (19) is dominated by the minimization of the corresponding energy sum term. Hence, we expect that, for increasing values of V, the attained average utility in (14.1) increases, and this expectation is, indeed, consistent with the increasing behavior of the plots of Fig. 3.

Figure 3: Average utility vs. V of the proposed Q∗ scheduler under the application scenario of Section 7.1 for different values of β.

Figure 4: Average queue delay T̄*_tot (slot) of the Q∗ scheduler under the application scenario of Section 7.1 for different values of β.

Second, at fixed β and for increasing values of V, the ratio z(t)/V in Eq. (25.2) decreases, so that c∗(t) in (25.2) increases. Hence, due to the constraint in (14.2), the average admitted workload ā∗ also increases for increasing values of V. The increment of the admitted average workload, in turn, gives rise to: (i) the increment of the fraction of the admitted workload in Fig. 5; and, (ii) the corresponding increment of the average queue delays in Fig. 4.

Figure 5: Average fraction of the input workload admitted by the Q∗ scheduler under the application scenario of Section 7.1 for different values of β.

Third, at fixed V and for increasing values of β, the cost charged for the overall energy wasted by the data center tends to increase (see Eq. (14.1)).
This increment forces the Q∗ scheduler to decrease the fraction of the admitted workload (see Fig. 5), which, in turn, reduces both the average queue delays in Fig. 4 and the attained average utility in Fig. 3.

7.2. Performance tests under real-world workload

In order to test the performance of the Q∗ scheduler under time-correlated workload, we consider the arrival sequence of Fig. 6. It reports the real-world trace in Fig. 2 of [33] and refers to the I/O workload sampled from four RAID volumes of an enterprise storage cluster at Microsoft. In order to meet the peak workload in (52), each arrival of Fig. 6 conveys a workload of 1.1 (kbit). The plots of Fig. 7 report the simulated per-slot and per-VM average total, networking and computing energy consumptions of the Q∗ scheduler at β = 1.5 (Joule)^{-1} under the real-world trace of Fig. 6. In order to explain the decreasing behavior of the plots of Fig. 7, we observe that, for increasing values of V, the term in Eq. (19) proportional to the networking-plus-computing energy increases linearly. Therefore, the target of the scheduler becomes the minimization of these energies, so that the curves of Fig. 7 decrease for increasing values of V. Overall, the interesting conclusion arising from the examination of the numerical values of Fig. 7 is that the contribution of the networking energy to the total consumed energy is around 28-30 percent over the whole range of the considered values of V. This confirms, indeed, the conclusion drawn by [3], namely, that the network component of the total energy consumed by emerging energy-proportional data centers is no longer negligible, especially when the server utilization is below 70 percent.

Figure 6: Sampled trace of an I/O arrival sequence from an enterprise cluster at Microsoft [33].
The (numerically evaluated) Peak-to-Mean-Ratio (PMR) and time-correlation coefficient of the trace are 2.49 and 0.85, respectively.

Figure 7: Per-slot and per-VM average total, computing and networking energy consumptions (Joule) of the Q∗ scheduler versus V under the application scenario of Section 7.2 at β = 1.5 (Joule)^{-1}.

The goal of the remaining part of this section is twofold. First, it aims at investigating the effect of the consolidation thresholds of Section 6.1 on the energy performance of the Q∗ scheduler. Second, it aims at testing the actual capability of the adaptive consolidation algorithm of Section 4.3.1 when the input workload exhibits fast time-fluctuations (see Fig. 6). For this purpose, three different event-driven policies are simulated. They trigger consolidation at slot t when at least l = 3 out of the last m = 5 values of the utilization factor in (45), plus its predicted value in (46), fall out of the following target intervals: (i) I(1) ≜ [0.1, 0.9] (Case 1); (ii) I(2) ≜ [0.25, 0.75] (Case 2); (iii) I(3) ≜ [0.35, 0.65] (Case 3). The corresponding simulated energy loss (in percent) suffered by the iterative consolidation algorithm of Section 4.3.1 with respect to the exhaustive search-based optimal one is reported in Table 5.

Table 5: Percent average energy loss of the Q∗ scheduler against the exhaustive search-based optimal one. The application scenario of Section 7.2 is considered at V = 5,500 and β = 1.5 (Joule)^{-1}.

                       Case 1    Case 2    Case 3
Synthetic workload     1.3%      1.1%      0.85%
Real-world workload    1.1%      0.9%      0.72%

An examination of the results of Table 5 leads to two main conclusions. First, the energy loss suffered by the proposed consolidation algorithm increases by passing from Case 3 to Case 1 under both the synthetic and real-world workloads.
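The l-out-of-m trigger rule just described admits a very compact rendering; the following sketch uses our own naming, and the one-step utilization prediction of Eq. (46) is passed in as an opaque value:

```python
from collections import deque

class ConsolidationTrigger:
    """l-out-of-m rule on the utilization factor of Eq. (45): fire when at
    least l of the last m samples, plus the predicted value of Eq. (46),
    fall outside the target interval [th_low, th_high]."""
    def __init__(self, th_low, th_high, l=3, m=5):
        self.low, self.high, self.l = th_low, th_high, l
        self.window = deque(maxlen=m)   # keeps only the last m samples

    def _out(self, u):
        return u < self.low or u > self.high

    def step(self, u, u_pred):
        self.window.append(u)
        hits = sum(self._out(x) for x in self.window) + self._out(u_pred)
        return hits >= self.l

trig = ConsolidationTrigger(0.25, 0.75)          # Case 2 thresholds
samples = [0.5, 0.8, 0.9, 0.6, 0.85]
fired = [trig.step(u, u_pred=0.9) for u in samples]
print(fired)  # [False, False, True, True, True]
```

With Case 3 thresholds the interval shrinks, so more samples count as outliers and the trigger fires more often, which matches the behavior discussed around Table 5.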
Intuitively, this is due to the fact that, by design, the rate of occurrence of the consolidation events increases when we pass from Case 1 to Case 3, and this increases, in turn, the chances that the iterations in Eq. (39) converge to the optimal consolidated configuration. This behavior is, indeed, in agreement with the lower bound in (43), which decreases for increasing values of the inter-consolidation interval T_max. Second, the energy penalty suffered by the proposed consolidation algorithm remains limited to at most 1.3 percent, even when the time behavior of the input workload is, by design, fully unpredictable, as in the case of the synthetic workload.

The plots of Fig. 8 report the time behaviors of the numbers of VMs and physical servers that are dynamically turned ON by the Q∗ scheduler under the real-world workload of Fig. 6. A comparison of the plots of Fig. 8 with the arrival trace of Fig. 6 supports the conclusion that the consolidation action performed by the proposed scheduler is capable of tracking the abrupt (and unpredicted) changes of the input workload, with a delay limited to 1-2 slot periods.

7.3. Performance comparisons with competing and state-of-the-art schedulers

The goal of a last set of simulations is to compare the energy performance of the Q∗ scheduler against the corresponding ones of some companion schedulers available in the open literature. In order to carry out fair performance-vs.-implementation complexity comparisons, in the first part of this section we consider (quite novel) competing schedulers that retain implementation complexities of the same order as, or larger than, the one of the Q∗ scheduler, namely, the Modified Best-fit-Decreasing Scheduler (MBS) [11] and the Exhaustive-Search-based Scheduler (ESS) [13].
Afterwards, in the second part of this section, we pass to consider some simple-to-implement state-of-the-art schedulers that are currently employed in production data centers, namely, the Maximum Density consolidation Scheduler (MDS) [34] and the First-Fit-Decreasing Scheduler (FFS) [35]. In both cases, the performance of the basic StaTic Scheduler (STS) [24] is evaluated as an ultimate benchmark.

In this regard, we observe that the STS does not perform resource reconfiguration and consolidation actions. It assumes that a fixed number of TCP/IP connections and VMs constantly run at the maximum flow rates and processing frequencies, in order to constantly provide the networking and computing capacities needed to process the peak workload. Hence, the total energy consumed by the STS provides, by design, an upper benchmark. At the opposite side of the performance spectrum, the ESS performs Lyapunov-based dynamic resource reconfiguration and consolidation under the following (over-optimistic) assumptions: (i) the networking energy consumption is negligible; and, (ii) the server consolidation is carried out offline through (computationally expensive) exhaustive search. Hence, the (optimistic) energy performance of the ESS acts as a lower benchmark. The MBS performs Lyapunov-based resource reconfiguration by also accounting for the networking energy consumption. However, the MBS resorts to the Modified Best Fit Decreasing greedy heuristic of [11] in order to perform dynamic resource consolidation. Hence, by design, the MBS implements the minimum-incremental-energy greedy policy.

Figure 8: Time behaviors of the numbers of VMs and physical servers turned ON by the Q∗ scheduler. The application scenario of Section 7.2 is considered at V = 5,500, β = 1.5 (Joule)^{-1}, and T_max = 1 (slot).
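The minimum-incremental-energy greedy policy of the MBS can be sketched as a Best-Fit-Decreasing pass in the spirit of [11]; the rendering below is our own simplification, and the server records (linear idle-to-max power model, cf. Eq. (7)) are only illustrative:

```python
def mbfd_place(demands, servers):
    """Best-Fit-Decreasing consolidation in the spirit of the MBFD
    heuristic of [11]: scan the VM demands in decreasing order and place
    each one on the feasible server whose power draw increases the least."""
    def power(srv, load):
        # linear idle-to-max model; an empty server is assumed OFF (0 W)
        if load <= 0.0:
            return 0.0
        u = load / srv['cap']
        return srv['idle'] + (srv['max'] - srv['idle']) * u

    for dem in sorted(demands, reverse=True):
        best, best_inc = None, float('inf')
        for srv in servers:
            if srv['load'] + dem <= srv['cap']:
                inc = power(srv, srv['load'] + dem) - power(srv, srv['load'])
                if inc < best_inc:
                    best, best_inc = srv, inc
        if best is None:
            raise RuntimeError('insufficient capacity')
        best['load'] += dem
    return {s['name']: s['load'] for s in servers}

servers = [
    {'name': 'A', 'cap': 10.0, 'idle': 120.0, 'max': 224.0, 'load': 0.0},
    {'name': 'B', 'cap': 10.0, 'idle': 180.0, 'max': 284.0, 'load': 0.0},
]
loads = mbfd_place([4.0, 3.0, 2.0], servers)
print(loads)  # {'A': 9.0, 'B': 0.0}
```

Note that turning a server ON incurs its idle power as part of the first increment, which is what steers the heuristic toward packing VMs onto already-active, low-idle-power machines.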
Table 6 reports the simulated average total energy consumptions of the mentioned schedulers under the (previously defined) three settings of the consolidation thresholds. An examination of Table 6 leads to three main conclusions. First, since the STS does not perform resource reconfiguration/consolidation, its energy performance does not depend on the setting of the consolidation thresholds. However, the energy consumptions of all the other schedulers decrease for an increasing rate of occurrence of the consolidation events, that is, when we pass from Case 1 to Case 3. Second, the average energy saving of the proposed Q∗S over the STS is of the order of 57 percent under Case 1, and increases up to about 70 percent under Case 3. The corresponding average energy saving of the Q∗S over the MBS is of the order of 24-26 percent.

Table 6: Per-slot and per-VM average total energy consumptions (Joule) of the tested competing schedulers under the application scenario of Section 7.3 at V = 5,500 and β = 1.5 (Joule)^{-1}. The previously defined three cases of threshold settings are considered.

          STS     MBS      Q∗S      ESS
Case 1    21.8    17.51    13.90    10.6
Case 2    21.8    16.62    13.30    10.2
Case 3    21.8    15.87    12.80    9.8

Third, as expected, the (optimistic) energy consumptions of the ESS are lower than the corresponding ones of the Q∗S by about 29-31 percent. This is consistent with the (previously reported) networking energy consumptions of Fig. 7 and the average energy loss of Table 5, whose aggregate impact on the overall energy consumption is, indeed, around 30 percent.

Passing to consider the performance of the (aforementioned) state-of-the-art schedulers, we briefly review their basic principles. Specifically, the MDS considers the idle energies of the servers as the main sources of energy consumption, while it neglects the dynamic components of the computing energies (see the second term in the inner summation of Eq.
(35.1)) and the overall network energies consumed by the switches (see Eq. (35.2)). As a consequence, at each consolidation slot, the MDS randomly selects the minimum number of servers needed to process the input workload and, then, runs all the VMs hosted by the selected servers at the maximum processing frequencies [34].

Since the number of turned-ON switches needed to support a multi-hop network route increases with the number of hops (see Eqs. (51.1) and (51.2)), the FFS considers the number of hops of the load balancer-to-server route as an (implicit) index of the network energy consumed by the server. Hence, in order to account for the energy efficiencies of the networked servers, the FFS performs, at first, a static ranking of the servers by considering (in an ordered way) the following three indexes: (i) the idle powers of the servers; (ii) their maximum powers; and, (iii) the (aforementioned) numbers of hops of the network routes [35]. In so doing, the first (resp., last) servers of the attained ranked list are the most (resp., least) energy-efficient ones. Afterwards, at each consolidation slot, the obtained ranked list is scanned from top to bottom, and VM(i), i = 1, 2, ..., is assigned to the first server of the list that still retains sufficient spare computing capacity [35]. The VM-to-server assignment halts when the minimum number of VMs needed to process the input workload has been allocated. At this point, the selected physical servers are turned ON, the input workload is evenly split among the allocated VMs and, finally, the per-VM processing frequencies and per-connection flows are computed.

Passing to consider the implementation complexity aspects, we point out that the MDS does not perform, by design, sorting operations, so that its implementation complexity does not vary with the size of the data center [34].
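The static ranking plus first-fit scan of the FFS described above can be condensed into a short sketch; this is our own simplified rendering of [35], and the server records below are illustrative:

```python
def ffs_place(servers, n_vms):
    """First-fit placement in the spirit of the FFS of [35]: rank the
    servers once by (idle power, max power, hop count), then assign each
    VM to the first ranked server with a spare VM slot."""
    ranked = sorted(servers, key=lambda s: (s['idle'], s['max'], s['hops']))
    placement = []
    for vm in range(n_vms):
        srv = next((s for s in ranked if s['slots'] > 0), None)
        if srv is None:
            raise RuntimeError('insufficient capacity')
        srv['slots'] -= 1
        placement.append((vm, srv['name']))
    return placement

servers = [
    {'name': 'A', 'idle': 180, 'max': 284, 'hops': 2, 'slots': 2},
    {'name': 'B', 'idle': 120, 'max': 224, 'hops': 4, 'slots': 2},
    {'name': 'C', 'idle': 120, 'max': 224, 'hops': 2, 'slots': 2},
]
placement = ffs_place(servers, 3)
print(placement)  # [(0, 'C'), (1, 'C'), (2, 'B')]
```

Since the ranking is computed once and never revisited at run-time, the scan itself is cheap, which is precisely the static-ranking limitation discussed next.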
The FFS performs a (single, static) sorting of the physical servers, so that its implementation complexity scales as O(N_SE log N_SE) [35]. However, since the FFS enforces the utilization of the most energy-efficient networked servers, we expect its energy consumptions to be (somewhat) below those of the MDS, especially in heterogeneous data centers. The numerical values of Table 7 confirm, indeed, this expectation, and point out that, at least in the considered application scenarios, the average energy consumptions of the FFS are around 2-3 percent lower than those of the MDS.

Table 7: Per-slot and per-VM average total energy consumptions (Joule) of the MDS and FFS under the application scenario of Section 7.3.

          MDS      FFS
Case 1    18.85    18.80
Case 2    17.90    17.48
Case 3    16.82    16.46

Finally, since the sorting operation performed by the FFS is static (i.e., it does not account for the dynamic components of the computing-plus-networking energies actually consumed at run-time), we expect the (above reported) energy consumptions of the FFS to be larger than the corresponding ones of the Q∗S, especially when the input workload is fast time-varying (see Fig. 6). The comparison of the numerical results of Tables 6 and 7 confirms, indeed, this expectation and supports the conclusion that the energy consumptions of the Q∗S are lower than the corresponding ones of the FFS by about 22-23 percent.

8. Conclusions and future developments

In this paper, we developed a Lyapunov-based dynamic scheduler for the combined: (i) access control; (ii) queue control; (iii) network flow control; (iv) processing frequency scaling; and, (v) joint resource consolidation in virtualized TCP/IP-based data centers that operate under the SaaS model. The overall goal is the dynamic achievement of an optimized energy-vs.-delay tradeoff under the (unpredictable) time-fluctuations of the input workload.
Remarkable features of the proposed Q∗S are that: (i) its implementation is distributed and scalable; and, (ii) it is capable of tracking the time-fluctuations of the input workload and providing hard QoS bounds on the attained utility-vs.-delay performance, without requiring any a priori information and/or forecasting of the statistics of the input workload. The performance tests and comparisons carried out corroborate the actual effectiveness of the proposed scheduler under both synthetic and real-world workloads.

As suggested in Section 6, this work can be extended in some directions of potential interest. Just as an additional example, emerging Big Data Stream applications [36] served by broadband wireless (possibly, multi-antenna) access networks [37, 38, 39] present so rapid time-fluctuations of the wireless environment and input workload that the assumption of static virtual-to-physical resource mapping and routing considered here could fall short. Including dynamic virtual-to-physical resource mapping and routing in the considered optimization framework would lead to a mixed-integer non-convex optimization problem that would involve all layers of the network-plus-computing protocol stacks. This (seemingly very challenging) cross-layer optimization problem is currently under investigation by the authors.

Acknowledgment

This work has been developed under the umbrella of the PRIN2015 project with grant number 2015YPXH4W 004: "A green adaptive Fog computing and networking architecture (GAUChO)", funded by the Italian MIUR. The authors would like to thank all the anonymous reviewers for their precious and helpful comments and suggestions.

Appendix A. Proof of Proposition 1

The minimum of ψ0(t) in Eq. (22.3) under the constraint in (1) is attained at a∗(t) = 0 (resp., a∗(t) = w(t)) when s(t) ≥ z(t) (resp., s(t) < z(t)). This leads to the all-or-nothing threshold-type access control in Eq. (25.1). In order to derive Eq.
(25.2), we observe that the maximization of ψ1(t) in Eq. (22.4) under the box constraint in (16.2) is a convex optimization problem. As a consequence, its solution may be obtained through the orthogonal projection of the solution c̃(t) = g_r^{-1}(z(t)/V) of the unconstrained minimization of ψ1(t) onto the admissible set [0, w_max]. So doing, we obtain Eq. (25.2). In this regard, we observe that, after expanding the max/min operators, Eq. (25.2) may be recast in the following equivalent form:

$$c^{*}(t)=\begin{cases} 0, & \text{for } z(t) > V g_r(0),\\[4pt] g_r^{-1}\!\left(\dfrac{z(t)}{V}\right), & \text{for } V g_r(w_{max}) \le z(t) \le V g_r(0),\\[4pt] w_{max}, & \text{for } 0 \le z(t) < V g_r(w_{max}), \end{cases}\tag{A.1}$$

where g_r(w_max) (resp., g_r(0)) is the derivative of g(r) at r = w_max (resp., r = 0).

Appendix B. Proof of Proposition 2

The expression in (28) for the optimized network flows is the solution of the following constrained optimization problem:

$$\min_{\{r_j(t),\; j\in\mathcal{S}(t)\}} \psi_2(t),\quad \text{s.t.: Eqs. (2.1) and (2.3)}.\tag{B.1}$$

After retaining the non-negativity constraint in (2.1) as implicit, the Lagrangian function of (B.1) reads as in:

$$\mathcal{L}=\psi_2(t)+\zeta(t)\left(\sum_{j\in\mathcal{S}(t)} r_j(t)-\min\{s(t);\, r_{max}\}\right),\tag{B.2}$$

where ζ(t) is the non-negative Lagrange multiplier of the constraint in (2.3). Hence, after equating to zero the derivative of (B.2) with respect to r_j(t) and solving the resulting algebraic equation, we arrive at Eq. (28). Furthermore, the algebraic equation in (31) for the evaluation of ζ(t) follows by equating to zero the derivative of (B.2) with respect to ζ(t).

Passing to prove the validity of Eq. (29), we observe that it is the solution of the following constrained optimization problem:

$$\min_{\{f_j(t),\; j\in\mathcal{S}(t)\}} \psi_3(t),\quad \text{s.t.: Eq. (8)}.\tag{B.3}$$
Since (B.3) is a convex optimization problem, its solution may be obtained by projecting the solution:

$$\tilde{f}_j(t)=\left[\frac{1}{\alpha\, V\, \beta}\cdot\frac{q_j(t)\left(f_j^{max}\right)^{\alpha}}{\mathcal{E}^{max}_{com}(j,t)-\mathcal{E}^{idle}_{com}(j,t)}\right]^{\frac{1}{\alpha-1}},\tag{B.4}$$

of the corresponding unconstrained optimization problem onto the admissible set [0, f_j^max]. So doing, we directly arrive at Eq. (29).

Appendix C. Proof of Proposition 3

Proof of Part a) – The proof of the bound in (42.1) is by induction over t. Specifically, since z(0) = 0, the bound is met at t = 0; let us then assume that Eq. (42.1) holds at t. Hence, we must prove that it also holds at (t + 1). For this purpose, in the sequel, we separately consider the cases of Eq. (A.1).

(i) In the case of V g_r(0) < z(t) ≤ V g_r(0) + w_max, we have (see Eq. (A.1)): c∗(t) = 0, so that, from Eq. (15), we have that:

$$z(t+1)\le z(t)+c^{*}(t)\equiv z(t)\le V g_r(0)+w_{max}.\tag{C.1}$$

(ii) In the case of V g_r(w_max) ≤ z(t) < V g_r(0), the following chain of inequalities is obtained by exploiting Eqs. (15) and (A.1), together with the decreasing behavior of g_r^{-1}(·):

$$z(t+1)\le z(t)+g_r^{-1}\!\left(\frac{z(t)}{V}\right)\le z(t)+g_r^{-1}\!\left(\frac{V g_r(w_{max})}{V}\right)\equiv z(t)+w_{max}\le V g_r(0)+w_{max}.\tag{C.2}$$

(iii) In the case of z(t) < V g_r(w_max), the following developments arise by exploiting Eq. (A.1) and the decreasing behavior of g_r(·):

$$z(t+1)\le z(t)+c^{*}(t)\equiv z(t)+w_{max}\le V g_r(w_{max})+w_{max}\le V g_r(0)+w_{max}.\tag{C.3}$$

This concludes the proof of the bound in Eq. (42.1).

Passing to consider the bound in (42.2), its validity is proved by the following chain of inequalities, which follows from the combined exploitation of Eqs. (2.3), (25.1) and (42.1):

$$\begin{aligned} s(t+1) &= \max\left\{0;\; s(t)-\min\Big\{r_{max};\;\sum_{j\in\mathcal{S}(t)} r_j^{*}(t)\Big\}\right\}+a^{*}(t)\\ &\le s(t)+a^{*}(t)\le z(t)+w(t)\le z(t)+w_{max}\le V g_r(0)+2 w_{max}.\end{aligned}\tag{C.4}$$

Finally, the proof of the bound in (42.3) starts from Eq. (9.1) and, then, exploits Eqs.
(2.3), (28), (42.2) and the non-negativity of ζ∗(t), in order to perform the following developments:

$$\begin{aligned} q_j(t+1) &\le q_j(t)+r_j(t) \le \left(s(t)-\zeta^{*}(t)\right)+r_j^{*}(t)\\ &\le \left(s(t)-\zeta^{*}(t)\right)+\sum_{k\in\mathcal{S}(t)} r_k^{*}(t)\\ &\le s(t)-\zeta^{*}(t)+\min\{s(t);\, r_{max}\} \le s(t)-\zeta^{*}(t)+r_{max}\\ &\le s(t)+r_{max} \le V g_r(0)+2 w_{max}+r_{max}.\end{aligned}\tag{C.5}$$

This completes the proof of Part a) of Proposition 3.

Proof of Part b) – Let us consider the policy π0 that always rejects all the input workload and turns OFF all the physical servers and switches. Since this policy (obviously) meets both constraints in (13.2) and (13.3), the set of the admissible policies is not empty, so that the constrained problem in (13) is feasible. Since the bound in (22.1) on the drift-minus-utility function holds under any feasible policy, it must hold under the (possibly unknown) optimal policy π^opt that solves the constrained problem in (13). Hence, after applying this bound under π^opt and, then, carrying out the unconditional expectations in Eq. (22.1), we arrive at the following lower bound:

$$\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}\{\chi^{*}(\tau)\}\ \ge\ \chi^{opt}-\frac{B_0\, T_{max}}{V}.\tag{C.6}$$

In Eq. (C.6), we have that:

$$\chi^{*}(\tau)\triangleq g\left(c^{*}(\tau)\right)-\beta\,\mathcal{E}^{*}_{tot}(\tau),\tag{C.7}$$

is the value of the utility in (14.1) at slot τ under the dynamic Lyapunov-based policy π∗ of Section 4, while χ^opt is the (previously defined) average value achieved by the objective function in (13.1) under the optimal policy π^opt. Since g(r) is concave, an application of Jensen's inequality allows us to write:

$$\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}\left\{g\left(c^{*}(\tau)\right)\right\}\ \le\ g\!\left(\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}\left\{c^{*}(\tau)\right\}\right).\tag{C.8}$$

Hence, after introducing (C.8) into (C.6) and letting t → ∞, we have that:

$$g\!\left(\overline{c}^{\,*}\right)-\beta\,\overline{\mathcal{E}}^{\,*}_{tot}\ \ge\ \chi^{opt}-\frac{B_0\, T_{max}}{V},\tag{C.9}$$

where c̄∗ (resp., Ē∗_tot) is the time-averaged expected value of {c∗(τ)} (resp., {E∗_tot(τ)}). Since the virtual queue in (15) is (strongly) stable (see Eq. (42.1)), the constraint in (14.2) is guaranteed to hold. Therefore, since g(·) is (strictly) increasing, we have that: g(ā∗) ≥ g(c̄∗), so that Eq.
(C.9) leads to:

$$g\!\left(\overline{a}^{\,*}\right)-\beta\,\overline{\mathcal{E}}^{\,*}_{tot}\ \ge\ \chi^{opt}-\frac{B_0\, T_{max}}{V}.\tag{C.10}$$

Eq. (C.10) proves the validity of Eq. (43) and concludes the proof of Proposition 3.

References

[1] K. V. K. M. Kumar, Software as a service for efficient cloud computing, Int. J. of Research in Eng. and Technology (IJRET) 3 (1) (2014) 178–181.
[2] C. Wu, R. Buyya, Cloud data centers and cost modeling, Morgan Kaufmann, 2015.
[3] D. Abts, M. R. Marty, P. M. Wells, P. Klausler, H. Liu, Energy proportional datacenter networks, in: Proc. of ACM Int. Symp. on Computer Architecture (ISCA2010), Saint-Malo, France, 2010, pp. 338–347.
[4] N. Peter, Fog computing and its real time applications, Int. J. of Emerging Technology and Advanced Engineering (IJETAE) 5 (6) (2015) 266–269.
[5] M. Shojafar, N. Cordeschi, E. Baccarelli, Energy-efficient adaptive resource management for real-time vehicular cloud services, IEEE Trans. Cloud Computing, DOI: 10.1109/TCC.2016.2551747.
[6] J. S. Chase, D. C. Anderson, P. N. Thakar, A. M. Vahdat, A. Doyle, Managing energy and server resources in hosting centers, in: Proc. of ACM Symp. on Operating System Principles (SOSP2001), Banff, Canada, 2001, pp. 103–116.
[7] C. Tang, M. Steinder, M. Spreitzer, G. Pacifici, A scalable application placement controller for enterprise data centers, in: Proc. of the Int. World Wide Web Conference (WWW2007), Banff, Canada, 2007, pp. 331–340.
[8] G. Chen, H. Wenbo, J. Liu, S. Nath, L. Rigas, L. Xiao, F. Zhao, Energy-aware server provisioning and load dispatching for connection-intensive internet services, in: Proc. of the USENIX Symp. on Networked System Design and Implementation (NSDI2008), San Francisco, CA, USA, 2008, pp. 337–350.
[9] Z. Xiao, W. Song, Q. Chen, Dynamic resource allocation using virtual machines for cloud computing environment, IEEE Trans. Parallel Distrib. Syst. 24 (6) (2013) 1107–1117.
[10] D. M. Divakaran, T. N. Le, M. Gurusamy, An online integrated resource allocator for guaranteed performance in data centers, IEEE Trans.
Parallel Distrib. Syst. 25 (6) (2014) 1382–1392. [11] A. Beloglazov, J. Abawajy, R. Buyya, Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing, Future Generation Computer Systems 28 (5) (2012) 755–768. [12] F. Liu, Z. Zhou, H. Jin, B. Li, B. Li, H. Jiang, On arbitrating the power-performance tradeoff in SaaS Clouds, in: Proc. of IEEE INFOCOM 2013 (INFOCOM2013), Turin, Italy, 2013, pp. 872–880. [13] R. Urgaonkar, U. Kozat, K. Igarashi, M. J. Neely, Dynamic resource allocation and power management in vir- tualized data centers, in: Proc. of IEEE Netw. Oper. Manage. Symp. (NOMS2010), Istanbul, Turkey, 2010, pp. 479–486. [14] R. Nathuji, K. Schwan, Virtualpower: coordinated power management in virtualized enterprise systems, in: Proc. of the ACM SIGOPS Symp. on Oper. Syst. Principles (SOSP2007), Skamania, WA, USA, 2007, pp. 265–278. [15] D. Meisner, B. T. Gold, T. F. Wenisch, The Powernap server architecture, ACM Trans. on Computer Systems 29 (1) (2011) 1–24. [16] Y. Agarwal, S. Hodges, R. Chandra, J. Scott, P. Bahl, R. Gupta, Somniloquy: augmenting network interfaces to reduce pc energy usage, in: Proc. of the USENIX Symp. on Networked System Design and Implementation (NSDI2009), Boston, MA, USA, 2009, pp. 365–380. [17] L. Wang, F. Zhang, J. A. Aroca, A. V. Vasilakos, K. Zheng, C. Hou, D. Li, Z. Liu, GreenDCN: a general framework for achieving energy efficiency in data center networks, IEEE J. on Sel. Areas in Communications (JSAC) 32 (1) (2014) 4–15. [18] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhkar, S. Sengupta, M. Sridharan, Data center TCP (DTCP), in: Proc. of ACM SIGCOMM 2010, New Delhi, India, 2010, pp. 63–74. [19] B. Vamanan, J. Hasan, T. N. Vijaykumar, Deadline-aware datacenter TCP (D2 TCP), in: Proc. of ACM SIGCOMM 2012, Helsinki, Finland, 2012, pp. 115–126. [20] T. Das, K. M. Sivalingam, TCP improvements for data center networks, in: Proc. of IEEE Int. Con. on Com. Sys. 
and Networks (COMSNETS2013), Bangalore, India, 2013, pp. 1–10. [21] A. Kiani, N. Ansari, Toward low-cost workload distribution for integrated green data centers, IEEE Communica- tions Letters 19 (1) (2015) 26–29. [22] A. Kiani, N. Ansari, Profit maximization for geographical dispersed green data centers, IEEE Trans. Smart Grid, DOI: 10.1109/TSG.2016.2562565. [23] M. Portnoy, Virtualization Essentials, John Wiley & Sons, 2012. [24] N. Cordeschi, M. Shojafar, D. Amendola, E. Baccarelli, Energy-efficient adaptive networked datacenters for the QoS support of real-time applications, J. Supercomputing 71 (2) (2015) 448–478. [25] A. Kansal, F. Zhao, J. Liu, N. Kothari, A. A. Bhattacharya, Virtual machine power metering and provisioning, in: Proc. ACM Symp. Cloud Computing (SoCC2010), Indianapolis, IN, USA, 2010, pp. 39–50. [26] M. J. Neely, Stochastic network optimization with application to communication and queueing systems, Morgan & Claypool, 2010. [27] Q. Peng, A. Walid, J.-S. Hwang, S. H. Low, Multipath TCP: analysis, design, and implementation, IEEE/ACM Trans. Networking 24 (1) (2016) 596–609. [28] A. Gulati, A. Merchant, P. J. Varman, mClock: handling throughput variability for hypervisor IO scheduling, in: Proc. USENIX Symp. on Networked System Design and Implementation (NSDI2010), San Jose, CA, USA, 2010, pp. 1–7. [29] C. Guo, G. Lu, H. J. Wang, S. Yang, C. Kong, P. Sun, W. Wu, Y. Zhang, Secondnet: a data center network virtualization architecture with bandwidth guarantees, in: Proc. ACM Co-Next, Philadelphia, USA, 2010, pp. 15– 26. [30] K. Li, Performance analysis of power-aware task scheduling algorithms on multiprocessor computers with dynamic voltage and speed, IEEE Trans. Parallel Distrib. Syst. 19 (11) (2008) 1484–1497. [31] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. F. D. Rose, R. 
Buyya, CloudSim: a toolkit for modeling and sim- ulation of cloud computing environments and evaluation of resource provisioning algorithms, Software: Practice and Experience 41 (1) (2011) 23–50. [32] M. Al-Fares, A. Loukissas, A. Vahdat, A scalable, commodity data center network architecture, in: Proc. of ACM SIGCOMM 2008, Seattle, WA, USA, 2008, pp. 63–74. [33] Z. Zhou, F. Liu, Y. Xu, R. Zou, H. Xu, J. Lui, H. Jin, Carbon-aware load balancing for geo-distributed cloud services, in: Proc. IEEE Int. Symp. Modelling Anal. Simulation Comput. Telecomun. Syst. (MASCOTS2013), San Francisco, CA, USA, 2013, pp. 232–241. [34] Z. Cao, S. Dong, An energy-aware heuristic framework for virtual machine consolidation in Cloud computing, J. Supercomputing 69 (1) (2014) 429–451. [35] A. Verma, P. Ahuja, A. Neogi, pMapper: power and migration cost aware application placement in virtualized 36 systems, in: Proc. of ACM/IFIP/USENIX 9th Int. Conf. on Middleware, Leuven, Belgium, 2008, pp. 243–264. [36] E. Baccarelli, N. Cordeschi, A. Mei, M. Panella, M. Shojafar, J. Stefa, Energy-efficient dynamic traffic offloading and reconfiguration of networked data centers for Big Data mobile computing: review, challenges, and a case study, IEEE Network 30 (2) (2016) 54–61. [37] E. Baccarelli, M. Biagi, R. Bruno, M. Conti, E. Gregori, Broadband wireless access networks: a roadmap on emerging trends and standards, in: Broadband services: business models and technologies for community networks, John Wiley & Sons, 2005, pp. 215–240. [38] N. Cordeschi, T. Patriarca, E. Baccarelli, Stochastic traffic engineering for real-time applications over wireless networks, J. Network and Computer Applications 35 (2) (2012) 681–694. [39] E. Baccarelli, M. Biagi, Error resistant space-time coding for emerging 4G-WLANs, in: Proc. of IEEE Wireless Communications and Networking Conference (WCNC2003), New Orleans, USA, 2003, pp. 72–77. 37
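Taken together, the deterministic backlog bound of Part a) and the utility bound in (C.10) exhibit the usual Lyapunov utility-vs-delay tradeoff: the optimality gap $B_{0} T_{\max}/V$ vanishes as $O(1/V)$, while the worst-case backlog $V g_{r}(0) + 2 w_{\max} + r_{\max}$ grows as $O(V)$. The following minimal sketch tabulates both bounds as $V$ increases; the constants $B_{0}$, $T_{\max}$, $g_{r}(0)$, $w_{\max}$ and $r_{\max}$ below are illustrative placeholders, not values taken from the paper.

```python
# Illustrative placeholder constants (assumed, not from the paper).
B0, T_max = 5.0, 2.0          # drift constant B0 and slot duration T_max
g_r0, w_max, r_max = 0.5, 4.0, 8.0  # g_r(0), w_max, r_max in the bound (C.5)

for V in (10, 100, 1000):
    utility_gap = B0 * T_max / V                    # gap in (C.10): O(1/V)
    backlog_bound = V * g_r0 + 2 * w_max + r_max    # bound in (C.5): O(V)
    print(f"V={V:5d}  utility gap <= {utility_gap:.4f}  "
          f"backlog <= {backlog_bound:.1f}")
```

Tuning $V$ therefore trades admitted-utility optimality against the deterministic per-queue delay guarantee, without any a priori knowledge of the input workload.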