Social Network Analysis and Mining (2021) 11:3
https://doi.org/10.1007/s13278-020-00704-0

ORIGINAL ARTICLE

MLPR: Efficient influence maximization in linear threshold propagation model using linear programming

Farzaneh Ghayour-Baghbani · Masoud Asadpour · Heshaam Faili

Received: 20 May 2020 / Revised: 19 September 2020 / Accepted: 28 October 2020
© Springer-Verlag GmbH Austria, part of Springer Nature 2020

Abstract
Influence maximization is an important research topic in social networks, with applications such as analyzing the spread of rumors and interest, the adoption of innovations, and feed ranking. The goal is to select a limited-size subset of vertices (called a seed-set) in a Social Graph so that, upon their activation, a maximum number of vertices of the graph become activated due to the influence of the vertices on each other. The linear threshold model is one of the two classic stochastic propagation models that describe the spread of influence in a network. We present a new approach called MLPR (matrix multiplication, linear programming, randomized rounding), with linear programming as its core, to solve the influence maximization problem in the linear threshold model. Experiments on four real data sets show the efficiency of the MLPR method: the spread of its output seed-sets is as large as that achieved by the state-of-the-art algorithms, yet, unlike most existing algorithms, the runtime of our method is independent of the seed size and does not increase with it.

Keywords Influence maximization · Linear program · Linear threshold model · Viral marketing

1 Introduction

Influence maximization (IM) is one of the important problems in social network analysis, which has recently attracted many researchers in fields such as social science, computer science, economics, politics, and biology. A good real-world example of an application of IM is viral marketing (Chen et al. 2013). In viral marketing, some individuals are selected based on chosen preferences, and free (or discounted) samples of a product are given to them. If the product is satisfactory enough, the receivers of the free samples are expected to become advertisers of the product through word of mouth. It is hoped that people in the network are thereby indirectly influenced and buy the product, or become activated (we use this term hereafter). Many other applications follow similar scenarios, including the spread of rumors, stories, interest, trust, referrals, the adoption of innovations in organizations, human and non-human epidemics, feed ranking, etc.

The social network is modeled here as a directed graph in which vertices represent the individuals in the network and edges represent social relations among the vertices. We call this graph the Social Graph. The initial set of nodes is called the seed-set, and the number of activated nodes is called the spread of the seed-set. The spread achieved by each seed-set defines the spread function. Given a fixed seed size, the goal is to find the seed-set that maximizes the spread function.

Influence propagation in networks is explained by diffusion models. This paper deals with progressive diffusion models, in which once a node is activated, it remains active forever. There are two classic progressive propagation models: the independent cascade model (IC) and the linear threshold model (LT) (Kempe et al. 2003).

IM in both IC and LT is NP-hard (Chen et al. 2010a, b). The IM problem is composed of two sub-problems: spread computation and seed selection. Computing the spread function for a given seed-set is NP-hard in both models, even if the seed size is one.
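To make the first sub-problem concrete: in practice the spread of a seed-set is estimated by repeated simulation of the diffusion process. The sketch below is ours, not from the paper; it estimates the spread under the LT model (defined formally in Sect. 2), and the graph encoding and function names are illustrative assumptions.

```python
import random

def lt_spread_estimate(in_weights, seed_set, runs=200, seed=0):
    """Monte Carlo estimate of the LT spread function sigma(S0).

    in_weights: dict node -> list of (in_neighbor, weight); the incoming
    weights of each node must sum to at most 1 (LT requirement).
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        # Each node draws a threshold uniformly from (0, 1].
        tau = {v: 1.0 - rng.random() for v in in_weights}
        active = set(seed_set)
        changed = True
        while changed:  # propagate until no inactive node crosses its threshold
            changed = False
            for v, edges in in_weights.items():
                if v not in active:
                    mass = sum(w for u, w in edges if u in active)
                    if mass >= tau[v]:
                        active.add(v)
                        changed = True
        total += len(active)
    return total / runs

# Toy chain a -> b -> c with weight 1.0 on both edges: seeding {a}
# always activates b and then c, so the estimated spread is exactly 3.
graph = {"a": [], "b": [("a", 1.0)], "c": [("b", 1.0)]}
print(lt_spread_estimate(graph, {"a"}))  # -> 3.0
```

Even this small estimator hints at the cost discussed above: each call runs many full propagations, and a greedy seed-selection loop must call it over and over.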
The next step deals with the selection of seed nodes in order to achieve a maximum spread, assuming we could compute the spread for any arbitrary seed-set. Even if spread computation were done in polynomial time, finding the optimal seed-set would still be NP-hard (Kempe et al. 2003).

* Corresponding author: Masoud Asadpour ([email protected]), Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran.

The spread function in both models is monotone and sub-modular, so Kempe et al. proposed a greedy approach to this problem (Kempe, Kleinberg, and Tardos 2003). Because of monotonicity and sub-modularity, an approximation ratio of (1 − 1/e) can be achieved by selecting the node with the maximum marginal gain in each step. However, as mentioned, computing the spread function is NP-hard; by using Monte Carlo (MC) simulation, an approximation ratio of (1 − 1/e − ε) is achieved. The major drawback of this greedy algorithm is its poor performance: it makes O(nk) calls to the spread computation procedure, where n is the number of nodes and k is the seed size. Spread computation through numerous MC simulation calls is the other source of poor performance.

Most previous methods have at least a linear dependence on the parameter k. These algorithms follow a greedy approach and repeat a procedure k times, adding one node to the current seed-set in each step through an incremental approach. Unlike these methods, in this paper we propose an approach based on linear programming (LP); the main benefit of our method is its convenient runtime, which is independent of the parameter k, the seed size.

We call our method MLPR (matrix multiplication, LP, randomized rounding); it solves the IM problem in LT. Our algorithm can be implemented using available LP solvers such as GLPK (https://www.gnu.org/software/glpk/), CBC (https://projects.coin-or.org/Cbc), CPLEX (https://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/), and Gurobi (https://www.gurobi.com/).

MLPR comprises three main steps. First, an Influence Graph is created from the Social Network Graph using sparse matrix multiplication; the Influence Graph provides an estimate of the influence of each node on the others. Then, in the second step, a mixed integer linear program (MILP) with 2|V| variables and 2|V| + 1 constraints is used to represent the IM problem on the Influence Graph. The MILP has |V| binary variables that indicate whether each node is in the seed-set or not; the remaining |V| variables give the expected activation probabilities of the nodes. The MILP cannot be solved in polynomial time, so we solve the linear program that results from the linear relaxation of the MILP. In the third step, the seed-set is selected by a randomized rounding procedure using the result of the linear program.

The MLPR method does not guarantee any approximation of the spread, but experimental results show that the spread of seed-sets produced by MLPR is as large as the spread of seed-sets created by the state-of-the-art approaches.

There are methods that try to solve IM by LP, but none of them solves it efficiently in LT: either their runtime is higher than that of the greedy algorithm, or the spread of the produced seed-set is smaller (Ghayour et al. 2019; Ackerman et al. 2010; Keskİn and Güler 2018; Küçükyavuz 2017). To the best of our knowledge, MLPR is the first method that efficiently solves IM in LT using LP.

The main contributions of this paper can be summarized as:

• MLPR efficiently solves the IM problem in LT using LP.
• The runtime of the proposed method is independent of the seed size.
• Experiments show that the MLPR method is much faster than the state-of-the-art heuristic algorithms for the LT propagation model and produces high-quality results, as good as those of the state-of-the-art approximation algorithms.

The paper is organized as follows: Sect. 2 describes the preliminaries. Sect. 3 gives a more detailed introduction to the previous approaches for solving the IM problem that are applicable to LT. Sect. 4 describes the details of the MLPR method. Sect. 5 presents the experimental results. The paper ends with the conclusion and future works.

2 Preliminaries

The social network is modeled here as a directed graph G = (V, E), where V is a finite set of vertices and E ⊆ V × V is the set of directed edges. We call this graph the Social Network Graph. Vertices represent the individuals in the network and edges represent social relations (e.g., influence, following) among vertices. Each edge (u, v) has a weight (or probability) parameter w_uv (or p_uv) showing the strength (or probability) of the influence of node u on node v. Propagation occurs in discrete time steps t = 0, 1, 2, …. Each node v ∈ V has two possible states: active or inactive. The nodes active at time t are denoted S_t ⊆ V and called the active set at time t. S_0 is the initial seed-set, and the final active set reached from the seed-set S_0 is denoted Φ(S_0). The spread function of a seed-set is the spread it achieves and is denoted σ(S_0). IC and LT are stochastic, so the spread function of a seed-set is defined as the expected size of the final active set, σ(S_0) = E(|Φ(S_0)|).

In IC, each edge (u, v) has a probability parameter p_uv. If a node u is activated at time t, it tries to activate each of its outgoing neighbors v in the next time step t + 1. The node u has a one-time chance to activate its neighbor v.
The neighbor v is activated with probability p_uv, and u does not get any other chance to directly activate v in the future (of course, u might still be able to activate v indirectly through the activation of intermediate neighbors of v).

In LT, each edge (u, v) has a weight parameter w_uv, and the sum of the weights of the incoming edges of each node is at most 1. Each node v is given a random threshold τ_v drawn from [0, 1] with uniform distribution. In each time step t, if the sum of the weights of the active incoming edges of an inactive node v is greater than or equal to the threshold τ_v, the node is activated. It has been proven for both models that, if we are only concerned with the final active set, immediate activation is not important (Chen et al. 2013). IC and LT are not practically equivalent, but their generalized models, i.e., the General IC and the General Threshold models, are equivalent (Chen et al. 2013).

3 Related works

IM was first studied by Domingos and Richardson (Domingos and Richardson 2001), who modeled the problem using Markov random fields. Kempe et al. modeled the problem as a discrete optimization problem and introduced the two propagation models IC and LT; in both of them, IM is NP-hard (Kempe, Kleinberg, and Tardos 2003). They showed that the spread function is monotone and sub-modular; hence, a greedy approach can achieve an approximation ratio of (1 − 1/e − ε) if the node with maximum marginal gain is selected in each step. However, computing the spread function is itself NP-hard, so finding the node with maximum marginal gain is also NP-hard. Since Monte Carlo estimation is used for computing the spread function, their final algorithm is called Monte Carlo Greedy (MC Greedy) and has an approximation ratio of (1 − 1/e − ε). Due to the numerous and expensive Monte Carlo estimations required in each seed-selection step, MC Greedy is inefficient, especially in social networks with millions of nodes.

Leskovec et al. noticed that the marginal gain of a node in the current iteration does not surpass its marginal gain in the previous iterations. With this technique, they significantly reduced the number of calls to the spread estimation procedure. This method is called CELF (Leskovec et al. 2007) and was reported to be 700 times faster than MC Greedy. A further improvement of CELF is called CELF++ (Goyal et al. 2009); Goyal et al. reported that CELF++ is approximately 35–55% faster than CELF. These are greedy algorithms with approximation guarantees for IM (Chen et al. 2013) and can be applied to both the IC and LT propagation models.

There are also heuristics for IM that do not have approximation ratios but are faster than CELF++. These heuristic methods are usually model-dependent. IRIE (Jung et al. 2012) and MIA/PMIA (Wang et al. 2012) are the most well-known heuristic algorithms for IC, and LDAG (Chen et al. 2010) and SIMPATH (Goyal et al. 2011) for LT. SIMPATH is the state-of-the-art heuristic algorithm for LT (Chen et al. 2013).

TipTop is the state-of-the-art algorithm with an approximation guarantee for IM in IC (Li et al. 2019). Tang et al. proposed the Martingale approach, which solves IM in both IC and LT and has a (1 − 1/e − ε)-approximation with probability at least (1 − 1/n^l), where n is the number of nodes (Tang 2015). This algorithm runs in O((k + l)(n + m) log n / ε²) expected time (Tang 2015). Hung T. et al. improved the Martingale approach in practice while preserving its approximation ratio and time complexity, and called it Stop-and-Stare (Nguyen, Thai, and Dinh 2016; Huang et al. 2017). The Martingale approach and its improvements are the state-of-the-art algorithms, in terms of runtime and solution quality, that apply to both propagation models.

Güney presented an efficient LP method for solving IM in IC. This method is superior to the Martingale approach in terms of solution quality, but not in terms of runtime and scalability (Güney 2019). There are integer LP models for IM in LT, but they do not lead to an efficient algorithm for solving IM (Ghayour et al. 2019; Keskİn and Güler 2018; Küçükyavuz 2017).

We propose an approach called MLPR for the purpose of efficiently solving the IM problem in LT. In this approach, the size of the seed-set is used only in the constraints and, as a result, does not directly affect the runtime of the algorithm (the runtime for solving a linear program depends on its number of variables and constraints), while all other algorithms with a greedy base have at least a linear growth factor in the seed size k. Our linear program has 2|V| variables and 2|V| + 1 constraints.

4 MLPR: The proposed method

Our method has three main steps. The first step is a pre-processing step done only once per Social Graph. The second, and most important, step assigns to each node a probability representing its membership probability in the seed-set. The third step is a post-processing step that creates a seed-set corresponding to the probabilities obtained in the second step. We name our algorithm MLPR after the techniques used in each step:

Step 1 Matrix multiplication: Estimating the Influence Graph from the Social Network Graph using sparse matrix multiplication.
Step 2 Linear programming: Modeling IM in LT as a linear program using the Influence Graph.
Step 3 Randomized rounding: Creating the candidate seed-set from the result of the linear program via a randomized rounding procedure.

In the following, we describe the details of the MLPR algorithm.

Step 1 - Matrix multiplication: The Influence Graph, M, is a graph in which each element M_{u,v} shows the cumulative influence of node u on node v. Notice that the influence can be direct or indirect: M_{u,v} is the sum of the direct impact of u on v and the indirect impacts of u on v via other nodes.

In LT, the influence of node u on another node v, γ_{u,v}, can be stated as the sum of the influences of u via all simple paths to v (a simple path is a path in a graph that does not have repeating vertices). According to the Live Arc model, the influence of u on v via a path P in LT equals the product of the weights of the edges in P (Chen et al. 2013):

γ_{u,v} = Σ_{P ∈ ρ(u,v)} Π_{e ∈ P} w_e    (1)

where ρ(u, v) is the set of all existing simple paths from u to v and w_e is the weight of edge e. Computing ρ(u, v) is NP-hard. Since 0 ≤ w_e ≤ 1, the influence along long paths is usually negligible, so we use the sum over short paths as an estimate.

The adjacency matrix G represents the direct influence of nodes. Because G represents a Social Graph, it has no self-loops (edges from a node to itself), so all elements on its main diagonal are zero. G², the matrix G multiplied by itself, may contain non-zero values on the main diagonal. We define G²* to be the same as G² but with all elements on the main diagonal set to zero; G^h* is defined similarly, as G^(h−1)* × G with all elements on the main diagonal set to zero. The elements on the main diagonal represent cycles, and because we need simple paths, we exclude them by zeroing the main diagonal in each step.

G²* is the influence of nodes on each other via simple paths of length 2; similarly, G^h* shows the influence of nodes on each other via simple paths of length h. We use the following formula as an estimate of the influence of nodes on each other:

M = I + G + G²* + … + G^h*    (2)

M is the estimated influence matrix and I is the identity matrix. Notice that in real networks G is sparse, so methods for multiplying sparse matrices are feasible.

Step 2 - Linear programming: Now that we have an estimate of the influences, we can represent IM using the following MILP:

maximize Σ_{v∈V} y_v    (3)

subject to
S_v ≤ y_v    ∀v ∈ V    (4)
y_v ≤ Σ_{u∈V} M_{u,v} S_u    ∀v ∈ V    (5)
Σ_{v∈V} S_v = k    (6)
y_v ∈ [0, 1]    ∀v ∈ V    (7)
S_v ∈ {0, 1}    ∀v ∈ V    (8)

For every node v, the variable y_v is the expected activation probability of the node and the variable S_v tells us whether v is selected as a seed or not. The matrix M is the influence matrix estimated in the previous step and k is the seed-set size. Equation (3) states the IM problem as maximizing the expected activation probability of all nodes. Equation (4) ensures that if a node is in the seed-set, it is active. Equation (5) states that the expected activation probability of a node cannot be higher than the influence it receives from the seed nodes, directly or indirectly. Equation (6) states that the seed size is k. Finally, Eqs. (7)–(8) are the variable definitions: for every node v, y_v has a real value in the range [0, 1], and S_v is a binary variable, either 1 (selected for the seed-set) or 0 (not selected).

Solving the MILP is not possible in polynomial time, so we solve its LP relaxation. The LP relaxation is the same as Eqs. (3)–(7), but (8) is replaced with:

S_v ∈ [0, 1]    ∀v ∈ V    (9)

Step 3 - Randomized rounding: Now we can use the answer of the LP relaxation to solve the IM problem. The values of S_v can be interpreted as probabilities, and k nodes can be selected by roulette wheel selection. This procedure is repeated, and the seed-set with the maximum spread is taken as the answer. This way of interpreting the answer of the LP relaxation is known as randomized rounding (Raghavan and Tompson 1987).

The pseudocode of MLPR is given in Algorithm 1. Line 1 of Algorithm 1 sets the parameters; the parameter h is the same as in Eq. (2) and is the depth of the paths we consider when estimating the Influence Graph M. Line 2 calls the pre-processing function with the Social Graph G and depth h as input, which returns the Influence Graph M. Line 3 calls the LP solver with the LP described in Eqs. (3)–(9).
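As a concrete illustration, the three steps can be sketched end to end on a toy graph. The code below is our own minimal reconstruction (NumPy/SciPy, dense matrices for readability, whereas the paper uses sparse multiplication); the function names are ours and it is not the paper's Algorithms 1–3.

```python
import numpy as np
from scipy.optimize import linprog

def influence_matrix(G, h):
    """Eq. (2): M = I + G + G^{2*} + ... + G^{h*}; the diagonal of each
    power is zeroed to discard cycles, as in Step 1."""
    n = G.shape[0]
    M = np.eye(n) + G
    P = G.copy()
    for _ in range(2, h + 1):
        P = P @ G
        np.fill_diagonal(P, 0.0)   # keep (an estimate of) simple paths only
        M = M + P
    return M

def mlpr_lp_relaxation(M, k):
    """LP relaxation (3)-(7), (9); variable vector x = [y_1..y_n, S_1..S_n]."""
    n = M.shape[0]
    c = np.concatenate([-np.ones(n), np.zeros(n)])       # maximize sum(y_v)
    A4 = np.hstack([-np.eye(n), np.eye(n)])              # (4): S_v - y_v <= 0
    A5 = np.hstack([np.eye(n), -M.T])                    # (5): y_v - sum_u M[u,v] S_u <= 0
    A_eq = np.concatenate([np.zeros(n), np.ones(n)])[None, :]  # (6): sum S_v = k
    res = linprog(c, A_ub=np.vstack([A4, A5]), b_ub=np.zeros(2 * n),
                  A_eq=A_eq, b_eq=[k], bounds=[(0.0, 1.0)] * (2 * n),
                  method="highs")
    return res.x[:n], res.x[n:]                          # y values, S values

def roulette_round(S, k, seed=0):
    """Step 3: draw k distinct nodes with probability proportional to the
    relaxed S_v values (assumes at least k of them are nonzero)."""
    p = np.clip(S, 0.0, None)
    rng = np.random.default_rng(seed)
    return {int(v) for v in rng.choice(len(S), size=k, replace=False, p=p / p.sum())}

# Toy graph: node 0 directly influences nodes 1-3 with weight 0.5 each.
G = np.array([[0.0, 0.5, 0.5, 0.5],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
M = influence_matrix(G, h=2)
y, S = mlpr_lp_relaxation(M, k=1)
print(roulette_round(S, k=1))  # the LP puts all seed mass on node 0 -> {0}
```

On this toy instance the relaxation is already integral (S = (1, 0, 0, 0)), so the rounding step is trivial; on real graphs the S_v values are fractional, and the repeated roulette-wheel draws described above are what turn them into candidate seed-sets.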
The answer of the LP is the values of its variables; these values are used in the next step as probabilities for specifying the best seed-set. Algorithm 2 is the pseudocode of the pre-processing step, which produces the estimate of the Influence Graph M as in Eq. (2). Algorithm 3 is the post-processing step, which selects a seed-set based on the values of S_v interpreted as probabilities.

4.1 Complexity analysis

The pre-processing step of the MLPR algorithm, as described in the previous section, is composed of multiplication and summation of matrices. Le Gall presented an algorithm for matrix multiplication with O(n^2.3728639) asymptotic time complexity and O(n²) memory complexity (Gall 2014). Summation of matrices can be done in O(n²) time and memory. So computing the Influence Graph M from the Social Graph G according to Eq. (2) up to depth h can be done in O(h · n^2.3728639) time and O(n²) memory. There are also many parallel and distributed algorithms for matrix multiplication and summation: fork-join style matrix multiplication on shared-memory systems has Θ(n³/log² n) speedup, and a more practical variant achieves Θ(n²) speedup (Leiserson, Smith, and Randall 1998).

The main step of the MLPR algorithm is solving the LP. Karmarkar's algorithm solves an LP in O(N^3.5 L) time, where L is the number of bits in which the input can be encoded and N is the number of variables (Karmarkar 1984). Therefore, if the graph has n nodes, LP (3)–(9), which has 2n variables, can be solved in O(n^3.5 L) asymptotic time.

The third step specifies the seed-set according to the probabilities obtained from the LP. This step has O(k) runtime, so the overall runtime of the algorithm is O(n^3.5 L + k); because k ≤ n, this is O(n^3.5 L). To summarize, the MLPR algorithm has O(h · n^2.3728639) pre-processing time and finds a seed-set of size k in O(n^3.5 L) asymptotic time.

5 Experiments

This section evaluates the MLPR method experimentally by comparing it to the state-of-the-art approaches. We used a 2.27 GHz Core i5 CPU and 8 GB of memory.

5.1 Datasets

Four real datasets were used in the experiments; Table 1 shows their specifications.

Table 1 Datasets
Dataset    #Nodes     #Edges     Corresponding site
NetHEPT    15,233     58,891     arxiv.org
WikiVote   7,115      103,689    wikipedia.org
Flixter    1,049,511  7,058,818  flixter.com
Twitter    2,029,108  8,015,119  twitter.com

The NetHEPT dataset (https://research.microsoft.com/en-us/people/weic/graphdata.zip), frequently used in previous IM studies (Goyal, Lu, and Lakshmanan 2011), is a collaboration network of the "High Energy Physics (Theory)" section of arXiv (https://arxiv.org/). Nodes are authors, and edges between them are co-authorship relations. NetHEPT has 15,233 nodes and 58,891 edges.

The WikiVote dataset (https://snap.stanford.edu/data/wiki-Vote.html) is a voting history network originating from Wikipedia, where nodes represent Wikipedia users and a directed edge from u to v means v voted on u. WikiVote has 7,115 nodes and 103,689 edges.

Flixter is a social movie site, allowing users to share movie ratings, discover new movies, and meet others with similar movie taste. The Flixter dataset has 1 M nodes and 7 M edges; nodes are users and edges show friendships.

Twitter is a famous social network, allowing users to share short messages called tweets with each other. The Twitter dataset (https://sociallab.ir/), with 2 M nodes and 8 M edges, was collected in our laboratory. Nodes of this dataset are users, and edges show which user follows which user.

5.2 Benchmarks

We compared MLPR with three algorithms: Monte Carlo Greedy (Kempe et al. 2005), SIMPATH (Goyal et al. 2011), and the Martingale approach (Tang 2015). Monte Carlo Greedy is the basic approximation algorithm with a (1 − 1/e − ε) approximation guarantee. SIMPATH is the state-of-the-art heuristic algorithm for LT (Tang 2015). The Martingale approach gives a (1 − 1/e − ε) approximation guarantee with probability at least (1 − 1/n^l), where n is the number of nodes (Tang 2015).

5.3 Parameter settings

Table 2 shows the results of the MLPR method for different values of h on the WikiVote and NetHEPT datasets with seed size 50. We observed that by choosing h = 3, we could find seed-sets that trigger a large spread in a short time while consuming an affordable amount of memory. Increasing h above 3 significantly increases the required memory and processing time but does not significantly increase the spread of the seed-set. So, in the next experiments, we compare MLPR with the other IM algorithms on LT with h = 3.

Table 2 Performance of the MLPR method
       WikiVote                        NetHEPT
h      Memory(MB)  Time(s)  Spread    Memory(MB)  Time(s)  Spread
1      45.4        24.6     836.8     55.4        87.0     1207.5
2      397.2       51.4     843.0     120.0       93.6     1343.1
3      1479.7      582.1    868.0     407.0       94.2     1435.6
4      1977.1      663.6    834.0     449.4       150.1    1354.8
5      2097.0      1027.7   868.0     3446.6      2064.4   1438.9

For the SIMPATH algorithm, we used the parameter specification proposed by Goyal et al. (Goyal et al. 2011) and set the cutoff parameter to 10⁻³ and 10⁻². In Figs. 1 and 2, "SIMPATH10-3" is the SIMPATH method with the cutoff parameter set to 10⁻³ and "SIMPATH10-2" is the SIMPATH method with the cutoff parameter set to 10⁻².

The Monte Carlo Greedy method needs the number of Monte Carlo steps to be set: a larger number of steps gives more accurate results but takes much longer. We set the number of Monte Carlo steps to 10 and 100. Although these numbers of steps are too small to achieve high-quality results, the algorithm with more Monte Carlo steps could not complete the task in an affordable time on our hardware. In Figs. 1 and 2, MCG100 is the CELF++ method executed with 100 Monte Carlo steps and MCG10 is the CELF++ method executed with 10 Monte Carlo steps.

5.4 LP solver tools

We used two LP solver tools: GLPK and Gurobi. GLPK is a free LP solver and Gurobi is an enterprise one. Gurobi has been reported to be up to 20 times faster than GLPK in serial runs (Meindl and Templ 2012).

5.5 Results

Fig. 1 (Spread reached by different algorithms for the NetHEPT, WikiVote, Flixter and Twitter datasets with different seed sizes) shows the spread of the seed-sets computed by the different algorithms on the four datasets with seed sizes from 5 to 50. It is observed that the spreads of three of the methods, SIMPATH, the Martingale approach, and MLPR, are very similar. Appendix A examines in more detail how the spread of the MLPR method compares with the other methods and confirms that, in terms of spread, the results of the three methods are very close; so we have to compare them on other factors, e.g., execution time.

Fig. 2 (The execution time of different algorithms with different seed sizes) shows the execution time of the different algorithms; the horizontal axis shows different seed sizes, and the vertical axis displays the runtime in minutes on a logarithmic scale. For MLPR, the sum of the runtimes of steps 2 and 3 (LP solving and randomized rounding) is reported. The first step (multiplying the matrix of the Social Graph to create the Influence Graph) is a pre-processing step done only once per Social Graph, so it is not included in the timing.

The Monte Carlo Greedy is the base algorithm with an approximation guarantee, and CELF++ is its efficient implementation.
By increasing the number of Monte Carlo steps, 13 3 Page 8 of 10 Social Network Analysis and Mining (2021) 11:3 this algorithm could find seed-sets with higher spreads; but Table 3 Improvement percentage of MLPR over previous methods on it is seen, in Fig. 2 for NetHEPT and WikiVote datasets that, NetHEPT dataset it takes a long time and is inefficient in comparison with the Seed size MC10 MC100 SIM- SIM- Martingale other algorithms; so, we did not run this algorithm for the PATH10-2 PATH10-3 larger datasets, i.e., Flixter and Twitter. 5 50.6 −3.0 8.8 2.7 −10.0 We observe in Fig. 2 that the SIMPATH method has a 10 82.2 6 −7.7 −7.5 −12.4 much longer runtime in comparison with the Martingale 15 151.3 20.4 7.3 −5.1 −5.2 approach and MLPR. That is because SIMPATH and most 20 114.5 33.6 −0.1 1.8 0.2 other previous algorithms are incremental and their runtime 25 126.1 43.2 0.4 −0.9 −4.3 increases with seed size. However, MLPR is not incremental 30 124 53.7 3.5 0.2 −3.4 and its runtime does not increase by seed size. The main step 35 105.6 52.2 −0.5 −0.8 −4.5 of MLPR is solving the LP which depends on the number 40 120 53.7 0.5 −3.5 −3.3 of variables and constraints. The LP of the MLPR method 45 133.0 69.7 −0.7 −2.6 −1.8 has 2|V| variables and 2|V|+ 1 constraints that is independ- 50 116.1 69.0 1.1 −1.1 −1.5 ent from seed size. Seed size only appears as a constant k in one of the constraints of the LP, i.e., Equation (6). The first step of MLPR depends on the number of nodes and edges of the graph and has nothing to do with seed size as well. Table 4 Improvement percentage of MLPR over previous methods on The randomized rounding step depends on the number of the WikiVote dataset seed size, but because its runtime is negligible in compari- Seed size MC10 MC100 SIM- SIM- Martingale son with the LP step, it does not affect the overall runtime PATH10−2 PATH10−3 of the algorithm. 
5 3.5 0 0 0 1.5 We mentioned that we have used two LP solver tools: 10 22.7 6.9 0 0 −2.4 GLPK and Gurobi. The answer of the problem (spread) does 15 46.1 8.0 −0.8 1.7 −1.9 not depend on the LP solver tool; so in Fig. 1, there is one 20 52.8 13.5 1.6 0.2 −1.3 curve for MLPR per each dataset. However, the execution 25 57.7 16.9 0.5 0.2 −0.5 time of MLPR significantly changes with the use of efficient 30 72.4 26.4 0.8 0.9 −0.3 LP solvers. In Fig. 2, we include two curves for MLPR per 35 84.1 28.8 0.7 0.6 0.6 each dataset. In Fig. 2, MLPR (GLPK) shows the runtime of 40 84.8 39.2 1.3 0.8 0.8 MLPR with GLPK as the solver tool and MLPR (Gurobi) 45 92.3 40.6 2.3 2.5 1.5 shows the runtime of MLPR with Gurobi. As Fig. 2 shows, 50 89.0 43.4 4.2 3.9 2.0 Gurobi has shorter runtime than GLPK. Runtime of Gurobi is still longer than Martingale approach, but we observe how using powerful LP Solver tools could decrease the runtime of MLPR. Table 5 Improvement percentage of MLPR over previous methods on To conclude, although the Monte Carlo Greedy algorithm Flixter dataset proposed by Kempe et el. has (1 − 1/e − 𝜀)-approximation, Seed size SIMPATH10-2 SIMPATH10-3 Martingale it needs many steps and consequently a lot of time to reach high quality answers. That is why many efforts have been 5 0.02 0.02 0.02 carried out to develop new approximations or new heuris- 10 0.03 0.02 −0.006 tic algorithms. SIMPATH and the Martingale approach, 15 0.02 −0.002 −0.006 the two state-of-the-art algorithms, as well as MLPR—our 20 0.01 −0.002 −0.014 proposed heuristic algorithm—could produce answers that 25 0.02 0.01 −0.005 have very close spreads. However, the runtime of these three 30 0.03 0.01 0.004 algorithms is different. SIMPATH’s runtime increases with 35 0.03 0.01 0.001 40 0.05 0.02 0 seed size and has a longer runtime in comparison with the 45 0.03 0.002 −0.004 Martingale and MLPR. 
In our experiments, the Martingale approach showed a shorter runtime than our method, but our method can replace its LP-solver step with more efficient LP solver engines as they become available. The Martingale approach has a (1 − 1/e − ε)-approximation guarantee with probability at least 1 − 1/n^l. Although MLPR does not have any approximation guarantee, experimental results showed that the spreads of the answers produced by the Martingale approach and by MLPR are very close.

6 Summary and future work

In this paper, we tackled the problem of IM in social networks for the LT diffusion model. We designed a three-step approach called MLPR that benefits from linear programs. MLPR computes seed-sets with large spread while its runtime is independent of the seed size, which makes it stand out in comparison to the previous algorithms.

We compared our proposed method experimentally with the best existing algorithms: CELF++, the fastest greedy approach that reaches a (1 − 1/e − ε) approximation ratio; SIMPATH, the state-of-the-art heuristic method created for LT; and the Martingale approach, the state-of-the-art approximation algorithm, which has a (1 − 1/e − ε)-approximation with probability at least 1 − 1/n^l. The results show that MLPR performed much better than CELF++ and SIMPATH, and matched the performance of the Martingale approach when used alongside suitable LP solver tools, although MLPR does not have an approximation guarantee.

In the future, we would like to work on the ability of our method to express changes in the model, e.g., node weights.

Appendix A: improvement percentage of spread

We observed in Fig. 1 that the spread of MLPR is larger than that of the Monte Carlo Greedy algorithm, while it is similar to that of SIMPATH and the Martingale approach. In this appendix we compare them in more detail. Tables 3, 4, 5 and 6 show the improvement percentage of the MLPR method over the previous methods in terms of the spread.

The columns MC10 and MC100 in Tables 3 and 4 show big improvements for MLPR (except for MC100 on the NetHEPT dataset with seed size 5).

Table 6  Improvement percentage of MLPR over previous methods on the Twitter dataset

Seed size   SIMPATH10-2   SIMPATH10-3   Martingale
5           0             0             0
10          0.0002        0.001         −0.0003
15          0.004         0.001         −0.001
20          0.002         0.001         0.0004
25          0.002         0.001         0.0005
30          0.003         0.001         0.0002
35          0.006         0.005         −0.001
40          0.003         0.001         −0.002
45          0.007         0.002         −0.001
50          0.005         0.002         −0.001

The columns SIMPATH10-2, SIMPATH10-3 and Martingale contain both positive and negative small percentages. We observe that as the size of the dataset grows (from NetHEPT, the smallest, to Twitter, the biggest), the negative percentages shrink and the absolute value of the improvements approaches zero. These results, together with the fact that computing the exact spread of the resulting seed-sets is NP-hard and the computed spreads are therefore approximations, lead us to state that MLPR, SIMPATH and the Martingale approach produce similar spreads. We therefore have to compare them on other performance factors, e.g., execution time (Tables 3, 4, 5, 6).

References

Ackerman E, Ben-Zwi O, Wolfovitz G (2010) Combinatorial model and bounds for target set selection. Theor Comput Sci 411(44–46):4017–4022. https://doi.org/10.1016/j.tcs.2010.08.021
Chen W, Lakshmanan LVS, Castillo C (2013) Information and influence propagation in social networks. Synth Lect Data Manag 5(4):1–177. https://doi.org/10.2200/S00527ED1V01Y201308DTM037
Chen W, Wang C, Wang Y (2010) Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining, p 1029. https://doi.org/10.1145/1835804.1835934
Chen W, Yuan Y, Zhang L (2010) Scalable influence maximization in social networks under the linear threshold model. In: IEEE International Conference on Data Mining, pp 88–97. https://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5693962
Domingos P, Richardson M (2001) Mining the network value of customers. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp 57–66. https://doi.org/10.1145/502512.502525
GhayourBaghbani F, Asadpour M, Faili H (2019) Integer linear programming for influence maximization. Iran J Sci Technol Trans Electr Eng 43:627. https://doi.org/10.1007/s40998-019-00178-7
Goyal A, Lu W, Lakshmanan LVS (2009) CELF++: optimizing the greedy algorithm for influence maximization in social networks. In: International World Wide Web Conference. https://www.cs.ubc.ca/~goyal/research/celf++.pdf
Goyal A, Lu W, Lakshmanan LVS (2011) SIMPATH: an efficient algorithm for influence maximization under the linear threshold model. In: 2011 IEEE 11th International Conference on Data Mining, pp 211–220. https://doi.org/10.1109/ICDM.2011.132
Güney E (2019) An efficient linear programming based method for the influence maximization problem in social networks. Inf Sci 503:589–605. https://doi.org/10.1016/j.ins.2019.07.043
Huang K, Wang S, Bevilacqua G, Xiao X, Lakshmanan LVS (2017) Revisiting the stop-and-stare algorithms for influence maximization. In: International Conference on Computational Social Networks 10:913–924. https://doi.org/10.1007/978-3-030-04648-4_23
Jung K, Heo W, Chen W (2012) IRIE: scalable and robust influence maximization in social networks. In: IEEE International Conference on Data Mining, pp 1–19. https://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6413832
Karmarkar N (1984) A new polynomial-time algorithm for linear programming. Combinatorica 4:373–395. https://doi.org/10.1007/BF02579150
Kempe D, Kleinberg J, Tardos É (2003) Maximizing the spread of influence through a social network. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining. https://dl.acm.org/citation.cfm?id=956769
Kempe D, Kleinberg J, Tardos É (2005) Influential nodes in a diffusion model for social networks. In: Automata, Languages and Programming. https://doi.org/10.1007/11523468_91
Keskin ME, Güler G (2018) Influence maximization in social networks: an integer programming approach. Turk J Electr Eng Comput Sci 26:3383–3396. https://doi.org/10.3906/elk-1802-212
Le Gall F (2014) Powers of tensors and fast matrix multiplication. In: Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation (ISSAC 2014), pp 1–28. https://arxiv.org/abs/1401.7714
Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance N (2007) Cost-effective outbreak detection in networks. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining, p 420. https://doi.org/10.1145/1281192.1281239
Li X, Smith JD, Dinh TN, Thai MT (2019) TipTop: (almost) exact solutions for influence maximization in billion-scale networks. IEEE/ACM Trans Netw 27(2):649–661. https://doi.org/10.1109/TNET.2019.2898413
Meindl B, Templ M (2012) Analysis of commercial and free and open source solvers for linear optimization problems. Technical Report CS-2012-1, Institut für Statistik und Wahrscheinlichkeitstheorie, Wien, Austria. https://www.statistik.tuwien.ac.at/forschung/CS/CS-2012-1complete.pdf
Nguyen HT, Thai MT, Dinh TN (2016) Stop-and-stare: optimal sampling algorithms for viral marketing in billion-scale networks. In: SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, pp 695–710. https://doi.org/10.1145/2882903.2915207
Raghavan P, Thompson CD (1987) Randomized rounding: a technique for provably good algorithms and algorithmic proofs. Combinatorica 7(4):365–374. https://doi.org/10.1007/BF02579324
Randall KH (1998) Cilk: efficient multithreaded computing. PhD thesis, Massachusetts Institute of Technology, pp 1–179. https://supertech.csail.mit.edu/papers/randall-phdthesis.pdf
Tang Y, Shi Y, Xiao X (2015) Influence maximization in near-linear time: a martingale approach. In: SIGMOD, pp 1539–1554. https://dl.acm.org/citation.cfm?id=2723734
Wang C, Chen W, Wang Y (2012) Scalable influence maximization for independent cascade model in large-scale social networks. Data Min Knowl Discov 25:545–576. https://doi.org/10.1007/s10618-012-0262-1
Wu H-H, Küçükyavuz S (2017) A two-stage stochastic programming approach for influence maximization in social networks. Comput Optim Appl. https://doi.org/10.1007/s10589-017-9958-x

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.