Learning Everywhere: Pervasive Machine Learning for Effective High-Performance Computation

Geoffrey Fox1, James A. Glazier1, JCS Kadupitiya1, Vikram Jadhao1, Minje Kim1, Judy Qiu1, James P. Sluka1, Endre Somogyi1, Madhav Marathe2, Abhijin Adiga2, Jiangzhuo Chen2, Oliver Beckstein3, Shantenu Jha4,5
1 Indiana University, Bloomington, IN
2 University of Virginia, Charlottesville, VA
3 Arizona State University, Tempe, AZ
4 Rutgers, the State University of New Jersey, Piscataway, NJ 08854, USA
5 Brookhaven National Laboratory, Upton, New York, 11973

Abstract—The convergence of HPC and data-intensive methodologies provides a promising approach to major performance improvements. This paper provides a general description of the interaction between traditional HPC and ML approaches and motivates the "Learning Everywhere" paradigm for HPC. We introduce the concept of "effective performance" that one can achieve by combining learning methodologies with simulation-based approaches, and distinguish it from traditional performance as measured by benchmark scores. To support the promise of integrating HPC and learning methods, this paper examines specific examples and opportunities across a series of domains. It concludes with a series of open computer science and cyberinfrastructure questions and challenges that the Learning Everywhere paradigm presents.

I. INTRODUCTION

This paper describes opportunities at the interface between large-scale simulations, experiment design and control, machine learning (ML, including deep learning, DL) and High-Performance Computing. We describe both the current status and possible research issues in allowing machine learning to pervasively enhance computational science. How should one do this and where is it valuable? We focus on research challenges in computing for science and engineering (as opposed to commercial) use cases, for both big data and big simulation problems. More details, including further citations, can be found at [1].

The convergence of HPC and data-intensive methodologies [2] provides a promising approach to major performance improvements. Traditional HPC simulations are reaching the limits of their original trajectory of progress. The end of Dennard scaling of transistor power usage and the end of Moore's Law as originally formulated have yielded fundamentally different processor architectures. The architectures continue to evolve, resulting in highly costly if not damaging churn in scientific codes that need to be finely tuned to extract the last iota of parallelism and performance.

In domain sciences such as the biomolecular sciences, advances in statistical algorithms and runtime systems have enabled extreme-scale ensemble-based applications [3] to overcome limitations of traditional monolithic simulations. However, in spite of several orders of magnitude improvement in efficiency from these adaptive ensemble algorithms, the complexity of phase space and dynamics for even modest physical systems requires additional orders of magnitude improvements and performance gains.

In many application domains, integrating traditional HPC approaches with machine learning methods arguably holds the greatest promise towards overcoming these barriers. The need for performance increases underlies the international efforts behind the exascale supercomputing initiatives, and we believe that the integration of ML into large-scale computations (for both simulations and analytics) is a very promising way to get even larger performance gains. Further, it can enable paradigms such as control or steering, and provide a fundamental approach to coarse-graining, which is a difficult but essential aspect of many multi-scale application areas. Papers at two recent workshops, BDEC2 [4] and NeurIPS [5], confirm our point of view, and our approach is synergistic with the BDEC2 process, with its emphasis on new application requirements and their implications for future scientific computing software platforms. We would like to distinguish between traditional performance, measured by operations per second or benchmark scores, and the effective performance that one gets by combining learning with simulation, which gives increased performance as seen by the user without changing the traditional system characteristics. This is of particular interest in cases where there is a tight coupling between the learning and simulation components (as outlined below for MLforHPC). The need for significant enhancement in the effective performance of HPC motivates the introduction of a new paradigm in HPC: Learning Everywhere!
Different Interfaces of ML and HPC: We have identified [6], [4] several important, distinctly different links between machine learning (ML) and HPC. We define two broad categories, HPCforML and MLforHPC:
• HPCforML: Using HPC to execute and enhance ML performance, or using HPC simulations to train ML algorithms (theory-guided machine learning), which are then used to understand experimental data or simulations.
• MLforHPC: Using ML to enhance HPC applications and systems.
This categorization is related to Jeff Dean's "Machine Learning for Systems and Systems for Machine Learning" [7] and Satoshi Matsuoka's convergence of AI and HPC [8]. We further subdivide HPCforML as
• HPCrunsML: Using HPC to execute ML with high performance.
• SimulationTrainedML: Using HPC simulations to train ML algorithms, which are then used to understand experimental data or simulations.
We also subdivide MLforHPC as
• MLautotuning: Using ML to configure (autotune) ML or HPC simulations. Autotuning with systems like ATLAS is already hugely successful and gives an initial view of MLautotuning. As well as choosing block sizes to improve cache use and vectorization, MLautotuning can also be used for simulation mesh sizes [9] and, in big data problems, for configuring databases and complex systems like Hadoop and Spark [10], [11].
• MLafterHPC: ML analyzing the results of HPC, as in trajectory analysis and structure identification in biomolecular simulations.
• MLaroundHPC: Using ML to learn from simulations and produce learned surrogates for the simulations. The same ML wrapper can also learn configurations as well as results. This differs from SimulationTrainedML, where typically the learnt network is applied to observational data, whereas in MLaroundHPC we use the ML to improve the HPC performance (a minimal sketch of this pattern follows the list below).
• MLControl: Using simulations (with HPC) in the control of experiments and in objective-driven computational campaigns [12]. Here the simulation surrogates are very valuable to allow real-time predictions.
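The MLaroundHPC pattern can be made concrete with a small sketch: an ensemble surrogate answers when its members agree, and otherwise falls back to the simulation and learns from the new result. Here run_simulation is a hypothetical stand-in for an HPC code, and the ensemble size and uncertainty threshold are arbitrary illustrative choices, not values from any of the cited papers.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Hypothetical sketch: run_simulation() stands in for a full HPC run.
    models = [GradientBoostingRegressor(subsample=0.8, random_state=s)
              for s in range(5)]
    X_hist, y_hist = [], []          # accumulated (input, result) pairs

    def query(x, tol=0.05, min_data=50):
        if len(y_hist) >= min_data:
            preds = np.array([m.predict([x])[0] for m in models])
            if preds.std() < tol:    # ensemble agrees: trust the surrogate
                return preds.mean()
        y = run_simulation(x)        # otherwise pay for the real simulation
        X_hist.append(x); y_hist.append(y)
        for m in models:             # the wrapper keeps learning
            m.fit(X_hist, y_hist)
        return y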
All six topics above are important and pose many research issues in computer science and cyberinfrastructure, directly in application domains, and in the integration of technology with applications. However, in this paper we focus on topics in MLforHPC, with close coupling between ML, simulations, and HPC. We involve applications as a driver for the requirements and evaluation of the computer science and infrastructure. In researching MLaroundHPC we will consider ML wrappers for either HPC simulations or complex ML algorithms implemented with HPC. Our focus is on how to increase effective performance with the learning everywhere principle, and how to build efficient learning everywhere parallel systems.

One can view the use of ML-learned surrogates as a performance boost that can lead to huge speedups, as the calculation of a prediction from a trained network can be many orders of magnitude faster than full execution of the simulation, as shown in section III-D. One can reach Exa- or even Zetta-scale equivalent performance for simulations with existing hardware systems. These high-performance surrogates are valuable in education and control scenarios by just speeding up existing simulations. Simple examples are the use of a surrogate to represent a chemistry potential, or of a larger grain size to solve the diffusion equation underlying cellular and tissue level simulations. The development of systematic ML-based coarse-graining techniques, in both socio-technical simulations and nano-bio(cell)-tissue layering, arises as an important area of research. In general, domain-specific expertise will be needed to understand the required accuracy and the number of training simulation runs needed.

There are many groups working in MLaroundHPC, but most of the work is just starting and is not built around a systematic study of research issues as we propose. There is some deep work in building reduced-dimension models for use in control scenarios [13]. We look at three distinct important areas: networked systems with socio-technical simulations, multiscale cell and tissue simulations, and, at a finer scale, biomolecular and nanoscale molecular systems.
We note that the biomolecular and biocomplexity areas represent 40% of the HPC cycles used on NSF computational resources, so this is an area that is particularly ready and valuable. The molecular sciences have had several successful examples of using ML for autotuning and ML for analyzing the output of HPC simulation data. Several fields have made progress in using MLaroundHPC; e.g., CosmoFlow and CosmoGAN [14] are amongst the better known projects, and the materials community is actively exploring the uptake of MLControl for the design of materials [4].

This paper does not cover the development of new ML algorithms, but rather advancing the understanding of ML, including Deep Learning (DL), in support of MLaroundHPC. Of course, the usage experience is likely to suggest new ML approaches of value outside the MLaroundHPC arena. If one is to use ML to replace a simulation, then an accuracy estimate is essential, and as discussed in III-B there is a need to build on initial work on UQ (Uncertainty Quantification) with ML [15], such as that using dropout regularization to build ensembles for UQ. There are more sophisticated Bayesian methods to investigate. The research must also address ergodicity, viz., have we learned across the full phase space of initial values. Here methods taken from the Monte Carlo arena could be useful, as reliable integration over a domain is related to reliable estimates of values defined across a domain. Further, much of our learning is for analytic functions, whereas much of the existing DL experience is for the discrete-valued classifiers of commercial importance.

Section III discusses cyberinfrastructure and computer science questions: section III-B covers uncertainty quantification for learnt results, while section III-C covers the infrastructure requirements needed to implement MLforHPC. Section III-D gives a general performance analysis method and applies it to current cases, and Section III-E covers new opportunities and research issues.

II. SCIENCE EXEMPLARS

A. Machine learning for Networked Systems

In this section we describe a hybrid method that fuses machine learning and mechanistic models to overcome the challenges posed by scenarios where data is sparse and knowledge of the underlying mechanism is inadequate. Across domains, the two approaches have been compared [16]. The machine learning approach usually needs a large amount of observation data for training, and does not explicitly account for the mechanisms that govern the complex phenomenon. On the other hand, mechanistic models (like agent-based models) result from a bottom-up approach, but they tend to have too many parameters, are compute intensive, and are hard to calibrate. In recent years, there have been several efforts to study physical processes under the umbrella of theory-guided data science (TGDS), with a focus on artificial neural networks (ANN) as the primary learning tool. [17] provides a survey of these methods and their application to hydrology, climate science, turbulence modeling, etc., where the underlying theory can be used to reduce the variance in model parameters by introducing constraints or priors in the model space.

Here we consider a particular class of mechanistic models, network dynamical systems, which have been applied in diverse domains such as epidemiology and computational social science. A network dynamical system is composed of a network where the nodes are agents (representing population, computers, etc.) and the edges capture the interactions between them. A popular example of such systems is the SEIR model of disease spread in a social network [18].
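As a concrete (if minimal) illustration, the sketch below runs a discrete-time SEIR process on a synthetic contact network built with networkx; the graph model and the transition probabilities beta (S to E on contact), sigma (E to I) and gamma (I to R) are arbitrary illustrative choices rather than values from [18].

    import random
    import networkx as nx

    def seir_step(G, state, beta=0.1, sigma=0.2, gamma=0.1):
        # One discrete-time SEIR update; state maps node -> 'S'|'E'|'I'|'R'.
        new = dict(state)
        for v in G:
            if state[v] == 'S':
                # exposure from each infectious neighbor with probability beta
                if any(state[u] == 'I' and random.random() < beta for u in G[v]):
                    new[v] = 'E'
            elif state[v] == 'E' and random.random() < sigma:
                new[v] = 'I'
            elif state[v] == 'I' and random.random() < gamma:
                new[v] = 'R'
        return new

    G = nx.watts_strogatz_graph(1000, k=6, p=0.05)  # synthetic contact network
    state = {v: 'S' for v in G}
    for v in random.sample(list(G), 5):             # seed a few infections
        state[v] = 'I'
    for t in range(100):
        state = seir_step(G, state)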
The complexity of the dynamics in such a network, due to individual-level heterogeneity and interactions, makes it difficult to train a machine learning model that can be generalized to patterns not yet present in historical data. Completely data-driven models cannot discover higher resolution details (e.g. county level incidence) from lower resolution ground truth data (e.g. state level incidence).

Learning from observational and simulation data: Data sparsity is often a challenge for applying machine learning, especially deep learning methods, to forecasting problems in socio-technical systems. One example of such problems is to predict weekly incidence in future weeks of an influenza epidemic. In such socio-technical systems, we usually have only limited observational data, e.g. weekly incidence numbers reported to the Centers for Disease Control and Prevention (CDC). Such data is of low spatiotemporal resolution (weekly at state level), not real time (at least one week delay), incomplete (reported cases are only a small fraction of actual ones), and noisy (adjusted several times after being published), thus necessitating a hybrid framework for forecasting by learning from both observational and simulation data.

Observations need to be augmented with existing domain knowledge and behavior encapsulated in the agent-based model to inform the learning algorithm. In such a hybrid framework, the network dynamical system is used to guide the learning algorithm so that it conforms to the principles (consistency). At the same time, the learning algorithm will facilitate model selection in a principled manner. Moreover, the synthetic data goes beyond the observation data, thus helping avoid overfitting and making the learned model capable of processing patterns unseen in the observation data (generalizability). When the dynamical system is more detailed (e.g. individual level) than the observation data, the hybrid framework allows detailed forecasting (high resolution).

Epidemic Forecasting: Simulation-trained machine learning methods can be used for epidemic forecasting. An example of such a framework is DEFSI (Deep Learning Based Epidemic Forecasting with Synthetic Information), proposed in [19]. It consists of (i) a model configuration module that estimates a distribution for each parameter in an agent-based epidemic model based on coarse surveillance data; (ii) a simulation-generated synthetic training data module, which generates high-resolution training data by running HPC simulations parameterized from the distributions estimated in the previous module; and (iii) a two-branch deep neural network trained on the synthetic training dataset and used to make detailed forecasts with coarse surveillance data as inputs. Experimental results show that DEFSI performs comparably to or better than other methods for state-level forecasting, and it outperforms the EpiFast method for county-level forecasting. See Ref. [1] and citations therein for details.
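To show what a two-branch network of this kind can look like, here is a purely illustrative Keras sketch; the branch inputs (recent in-season weeks versus a historical season signal), the layer types and sizes, and the county count are all assumptions for exposition, not the actual DEFSI architecture, which is specified in [19].

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    n_counties = 133            # assumed number of county-level outputs

    within = layers.Input(shape=(4, 1), name="within_season")     # recent weeks
    between = layers.Input(shape=(52, 1), name="between_season")  # past seasons
    merged = layers.concatenate([layers.LSTM(32)(within),
                                 layers.LSTM(32)(between)])
    hidden = layers.Dense(64, activation="relu")(merged)
    outputs = layers.Dense(n_counties)(hidden)                    # detailed forecast

    model = Model([within, between], outputs)
    model.compile(optimizer="adam", loss="mse")
    # Trained on high-resolution synthetic data from HPC epidemic simulations,
    # then driven by coarse surveillance inputs at forecast time.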
B. ML for Virtual Tissue and Cellular Simulations

1) Virtual Tissue Models: Virtual Tissue (VT) simulations [20] are mechanism-based multiscale spatial simulations of living tissues that address questions about development, maintenance, damage and repair. They also find application in the design of tissues (tissue engineering) and the development of medical therapies, especially personalized therapies. VT simulations are computationally challenging for a number of reasons: 1) VT simulations are agent-based, with the core agent often representing biological cells; the number of cells in a real tissue is often of the order of 10^8 or more. 2) Agents are often hierarchical, with agents composed of multiple agents at smaller scales. 3) Agents interact strongly with each other, often over significant ranges [21]. 4) Individual agents typically contain complex sub-models that control their properties and behaviors. 5) Material properties may be complex, like the shear thickening or thinning or the swelling or contraction of fiber networks. 6) Modeling transport and diffusion is compute intensive. 7) Models are typically stochastic, so predictivity requires many replicas. 8) Simulations involve uncertainty in both model parameters and model structure. 9) Biological and medical time-series data are often qualitative, semi-quantitative or differential, making their use in classical optimization difficult. 10) VT models often produce movies of configurations over time. 11) Finally, simulating populations can add several orders of magnitude to the computational challenge. It is possible that ML techniques can be used to short-circuit implementations at and between scales.

2) Virtual Tissue Modelling and AI + MLandHPC: AI can directly benefit VT applications in a number of ways (a sketch of item 1 appears at the end of this subsection):
1) Short-circuiting: the replacement of computationally costly modules with learned analogues.
2) Parameter fitting in high-dimensional parameter spaces.
3) Treating stochasticity in results as information rather than noise.
4) Prediction of bifurcations in models.
5) Design of maximally discriminatory experiments: predict the parameter sets by which two models can be differentiated.
6) Running time backwards, to determine initial conditions that lead to observed endpoints.
7) The elimination of short time scales, e.g., short-circuiting the calculations of advection-diffusion.
8) Generating additional spatial data sets from experimental images.

Representative prior work by Karniadakis [13], Kevrekidis [22] and Nemenman [23] shows that neural networks can reproduce the temporal behaviors of biochemical regulatory and signaling networks. Ref. [24] has shown that networks can learn nonlinear biomechanics simulations of the aorta, being able to predict the stress and strain distribution in the human aorta from the morphology observable with MRI or CT.
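Returning to item 1 above (short-circuiting), the sketch below trains a small convolutional network to emulate many explicit finite-difference diffusion steps in a single evaluation, in the spirit of item 7; the grid size, step count, diffusion coefficient, and architecture are arbitrary illustrative assumptions.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    def diffuse(u, steps=50, a=0.1):
        # Explicit finite-difference diffusion with periodic boundaries:
        # the computationally costly module to be replaced.
        for _ in range(steps):
            u = u + a * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                         np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)
        return u

    # Training pairs: random initial fields -> fields after 50 solver steps.
    X = np.random.rand(500, 32, 32)
    Y = np.array([diffuse(u) for u in X])

    # Small CNN surrogate that jumps 50 steps in one forward pass.
    model = tf.keras.Sequential([
        layers.Conv2D(16, 3, padding="same", activation="relu",
                      input_shape=(32, 32, 1)),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.Conv2D(1, 3, padding="same"),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X[..., None], Y[..., None], epochs=10, verbose=0)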
C. Machine Learning and Molecular Simulations

1) Nanoscale simulation: Despite employing the optimal parallelization techniques suited to the size and complexity of the system, nanoscale simulations remain time consuming. In research settings, simulations can take up to several days, and it is often desirable to foresee expected overall trends in key quantities; for example, how does the contact density vary as a function of ion concentration in nanoscale confinement, or how do the peak positions of the pair correlation functions characterizing nanoparticle assembly evolve as the environmental parameters are tuned. Given the dramatic rise in ML and HPC technologies, it is not a question of if, but when, ML can be integrated with HPC to enhance nanoscale simulation methods. Recent years have seen a surge in the use of ML to accelerate materials simulation techniques: ML has been used to predict parameters, generate configurations in material simulations, and classify material properties (see Ref. [1] and citations therein). At this time, it is critical to understand and develop the software frameworks to build ML layers around HPC to 1) enhance simulation performance, 2) enable real-time and anytime engagement, and 3) broaden the applicability of simulations for both research and education (in-classroom) usage.

In the context of nanoscale simulation, an initial set of applications for the MLaroundHPC framework can be the prediction of the structure or correlation functions (outputs) characterizing the nanoscale system over a broad range of experimental control parameters (inputs). MLaroundHPC can enable the following outcomes:
1) Learn pre-identified critical features associated with the simulation output.
2) Generate accurate predictions for un-simulated state points (by entirely bypassing simulations).
3) Exhibit auto-tunability (with new simulation runs, the ML layer gets better at making predictions).
4) Enable real-time, anytime, and anywhere access to simulation results (particularly important for education use).
5) No run is wasted: training needs both successful and unsuccessful runs.

To illustrate these outcomes, we discuss nanoscale simulations aimed at the computation of the structure of ions confined by surfaces that are nanometers apart, which has been the focus of recent experiments and computational studies (see Ref. [1] and citations therein). Typically, the entire ionic distribution, averaged over a sufficient number of independent samples generated during the simulation, is a quantity of interest. However, in many important cases, average values of the contact density or center density directly relate to important experimentally measured quantities such as the osmotic pressure [25]. Further, it is often useful to visualize expected trends in the behavior of the contact or mid-point density as a function of solution conditions or ionic attributes before running simulations to explore specific system conditions. It is thus desirable that a "smart" simulation framework provide rapid estimates of these critical output features with high accuracy. MLaroundHPC can enable precisely this: we recently showed that an artificial neural network successfully learns, from completed simulation results, the desired features associated with the output ionic density profiles, rapidly generating predictions for contact, peak, and center densities in excellent agreement with the results from explicit simulations [26].
2) Biomolecular simulations: The use of ML, and in particular DL, approaches for biomolecular simulations [27] lags behind other areas such as nano-science and materials science [28]. This might be partly due to the difficulty of accounting for large heterogeneous systems with important interactions at short and long length scales. But it might also indicate that the commonly used classical empirical force fields are surprisingly successful [29], and it is not easy to outperform them at this level of approximation. Therefore, one primary direction of research in this area is to improve the accuracy of the simulation while maintaining the performance of empirical energy functions.

One promising approach is based on work by Behler and Parrinello [30], who devised a NN-based potential that was trained on quantum mechanical DFT energies; their key insight was to represent the total energy as a sum of atomic contributions, and to represent the chemical environment around each atom by an identically structured NN, which takes as input appropriate symmetry functions that are rotation and translation invariant, as well as invariant to the exchange of atoms, while correctly reflecting the local environment that determines the energy [31]. Based on this work, Gastegger et al. [32] used ML to accelerate ab-initio MD (AIMD) to compute accurate IR spectra for organic molecules, including the biological Ala3+ tripeptide in the gas phase. Interestingly, the ML model was able to reproduce anharmonicities and incorporate proton transfer reactions between different Ala3+ tautomers without having been explicitly trained on such a chemical event, highlighting the promise of such an approach to incorporate a wide range of physically relevant effects with the right training data. The ML model was more than 1000 times faster than the traditional evaluation of the underlying quantum mechanical physical equations.

Roitberg et al. [33] trained a NN on QM DFT calculations, based on modified Behler-Parrinello symmetry functions. The resulting ANI-1 model was shown to be chemically accurate and transferable, with a performance similar to a classical force field, thus enabling ab-initio molecular dynamics (AIMD) at a fraction of the cost of "true" DFT AIMD. Extensions of their work with an active learning (AL) approach demonstrated that proteins in an explicit water environment can be simulated with a NN potential at DFT accuracy [34]. The AL approach reduced the amount of required training data to 10% of the original model [34] by iteratively adding training data calculations for regions of chemical space where the current ML model could not make good predictions. Using transfer learning, the ANI-1 potential was also extended to predict energies at the highest level of quantum chemistry calculations (coupled cluster CCSD(T)), with speedups in the billions.

In general, the focus has been on achieving DFT-level accuracy, because NN potentials are not cheaper to evaluate than most classical empirical potentials. However, replacing the solvent-solvent and solvent-solute interactions, which typically make up 80%-90% of the computational effort in a classical all-atom, explicit solvent simulation, with a NN potential promises large performance gains at a fraction of the cost of traditional implicit solvent models and with an accuracy comparable to the explicit simulations [35], as also discussed above in the case of electrolyte solutions. Furthermore, the inclusion of polarization, which is expensive (a factor of 3-10 in current classical polarizable force fields [36]) but of great interest when studying the interaction of multivalent ions with biomolecules, might be easily achievable with appropriately trained ML potentials.
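The structural idea behind the Behler-Parrinello construction can be sketched in a few lines of Keras: one shared per-atom network maps precomputed symmetry-function descriptors to atomic energy contributions, which are summed to the total energy, making the result invariant to exchanging like atoms. The atom count, descriptor length, and layer widths here are arbitrary assumptions; real implementations such as ANI-1 [33] use element-specific networks and carefully designed descriptors.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    n_atoms, n_sym = 16, 32   # assumed system size and descriptor length

    # One shared network evaluates every atomic environment identically.
    atomic_net = tf.keras.Sequential([
        layers.Dense(64, activation="tanh", input_shape=(n_sym,)),
        layers.Dense(64, activation="tanh"),
        layers.Dense(1),                       # per-atom energy contribution
    ])

    sym_funcs = layers.Input(shape=(n_atoms, n_sym))  # precomputed descriptors
    per_atom_E = layers.TimeDistributed(atomic_net)(sym_funcs)
    total_E = layers.Lambda(lambda e: tf.reduce_sum(e, axis=1))(per_atom_E)

    model = Model(sym_funcs, total_E)
    model.compile(optimizer="adam", loss="mse")  # fit to reference DFT energies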
III. INTEGRATING ML AND HPC: BACKGROUND AND OPPORTUNITIES

A primary contribution of this paper is the categorization, description and examples of the different ways in which ML can enhance HPC (MLforHPC). Before we expound upon MLforHPC and open research issues, we provide a summary of the status of HPC for ML (beyond the obvious and well-studied use of GPUs for ML).

A. HPC for Machine Learning

There has been substantial community progress here, with the industry-supported MLPerf [37] machine learning benchmark activity and Uber's Horovod open source distributed deep learning framework for TensorFlow [38]. We have studied different parallel patterns (kernels) of machine learning applications, looking in particular at Gibbs Sampling, Stochastic Gradient Descent (SGD), Cyclic Coordinate Descent (CCD) and K-means clustering [39]. These algorithms are fundamental for large-scale data analysis and cover several important categories: Markov Chain Monte Carlo (MCMC), Gradient Descent, and Expectation Maximization (EM). We show that parallel iterative algorithms can be categorized into four types of computation models, (a) Locking, (b) Rotation, (c) Allreduce, and (d) Asynchronous, based on the synchronization patterns and the effectiveness of the model parameter update. A major challenge of scaling is that the computation is irregular and the model size can be huge. At the same time, parallel workers need to synchronize the model continually. By investigating collective vs. asynchronous methods of model synchronization, we discover that optimized collective communication can improve the model update speed, thus allowing the model to converge faster. The performance improvement derives not only from accelerated communication but also from reduced iteration computation time, as the model size may change during model convergence. To foster faster model convergence, we need to design new collective communication abstractions. We identify five classes of data-intensive computation [2], from pleasingly parallel to machine learning and simulations, and aim to re-design a modular software stack with native kernels to effectively utilize scale-up servers for machine learning and data analytics applications. We are investigating how simulations and Big Data can use common programming environments with a runtime based on a rich set of collectives and libraries for a model-centric approach [40], [41].
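Of the four computation models above, the Allreduce pattern is the simplest to sketch. The mpi4py fragment below averages per-worker gradients so that every worker applies the same synchronous update; the gradient itself is a random placeholder standing in for the real per-shard computation.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    # Each worker computes a gradient on its own data shard
    # (a random placeholder stands in for the real computation).
    local_grad = np.random.rand(1_000_000)

    # Synchronous collective: sum across all workers, then average.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= comm.Get_size()

    # Every worker now applies the identical averaged update, keeping
    # model replicas consistent; the Asynchronous model relaxes this barrier.

Run with, e.g., mpirun -n 4 python allreduce_sketch.py (the filename is arbitrary).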
Parallel Computing: We know that heterogeneity can lead to difficulty in parallel computing. This is extreme for MLaroundHPC, as the ML-learnt result can be huge factors (10^5 in our initial example [26]) faster than simulated answers. Further, learning can be dynamic within a job and within different runs of a given job. One can address this by load balancing the unlearnt and learnt cases separately, but this can lead to geometric issues, as it is quite likely that ML learning works more efficiently (for more potential simulations) in particular regions of phase space.

B. Uncertainty Quantification for Deep Learning

An important aspect of the use of a learned ML model is that one must learn not just the result of a simulation but also the uncertainty of the prediction, e.g. whether the learned result is valid enough to be used. This can be explained in terms of the bias-variance trade-off, which is based on the decomposition of the expected error into two parts: variance and bias. The variance part explains the uncertainty of the model training process due to randomness in the training algorithms or a lack of representativeness of the training set. A regularization scheme can reduce the variance, so that the model complexity is under control, and can result in a smoother model. However, the regularization approach comes at the cost of an increased amount of bias, the other term in the expected error decomposition, which explains the fitness of the model: by regularizing the model, the training algorithm can make only a limited effort to minimize the training error. On the contrary, an unregularized model with a higher model complexity than necessary can also result in a minimal training error, while it suffers from high variance.

Ideally, the bias-variance trade-off can be resolved to some degree by averaging trained instances of an originally complex model. Once these model instances are complex enough to fit the training data set, we can use the averaged predictions as the output of the model. However, averaging many different model instances implies a practical difficulty: one has to conduct multiple optimization tasks to secure a statistically meaningful sample distribution of the predictions. Given the assumption that the model might as well be a complex one to minimize the bias component (e.g. a deep neural network), the model averaging strategy is computationally challenging.

Dropout has been extensively used in deep learning as a regularization technique [42], but recent research revisits it as an uncertainty quantification (UQ) tool [43]. The dropout procedure can be seen as an efficient way to maintain a pool of multiple network instances for the same optimization task. It is an efficient ensemble technique, as it applies a randomly sampled Bernoulli mask to the layer-wise input units, thus exposing the optimization process to many differently structured instances of the network. A set of differently thinned versions of the network can form a sample distribution of predictions to be used as a UQ metric. The dropout-based UQ scheme can provide an opportunity for MLaroundHPC simulation experiments. For a data-driven model, it is reasonable to assume that a better ML surrogate can be found once the training routine sees more examples generated from the simulation experiment. However, creating more examples to train a better ML model is a conflicting requirement, as the purpose of training the ML surrogate is to avoid such computation. The UQ scheme can play a role here by providing the training routine with a way to quantify the uncertainty in the prediction: once it is low enough, the training routine is less likely to need more data.
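A minimal Keras sketch of this dropout-based UQ, following [43]: keeping dropout active at inference time samples differently thinned networks, and the spread of their predictions serves as the UQ metric. The layer sizes and dropout rate are illustrative assumptions.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    # Surrogate with dropout; sizes and rate are illustrative choices.
    model = tf.keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(5,)),
        layers.Dropout(0.2),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    # model.fit(X_train, y_train, ...)  # fit on completed simulation results

    def mc_dropout_predict(x, n_samples=100):
        # Calling the model with training=True keeps the Bernoulli masks
        # active, sampling differently thinned networks [43].
        preds = np.stack([model(x, training=True).numpy()
                          for _ in range(n_samples)])
        # The mean is the prediction; the spread is the UQ metric that can
        # tell the training routine whether more simulation data is needed.
        return preds.mean(axis=0), preds.std(axis=0)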
C. Machine Learning for HPC

Here we review the nature of the machine learning needed for MLforHPC in different application domains. The Machine Learning (ML) load depends on 1) the time interval between its invocations, which translates into the number of training samples S, and 2) the size D of the data set specifying each sample. This size could be as large as the number of degrees of freedom in the simulation, or could be (much) smaller if just a few parameters are needed to define the simulation. We note two general issues:
• There can be very important data transfer and storage issues in linking the Simulation and Machine Learning parts of the system. This could need carefully designed architectures for both hardware and software.
• The Simulation and Machine Learning subsystems are likely to require different node optimizations, as in different types and uses of accelerators.

D. Science Exemplar: Nanosimulations

In this subsection, using the example of nanosimulations, we show that progress in all areas at the intersection of HPC and ML is having an impact. In each of the two cases below, one uses scikit-learn, TensorFlow and the Keras wrapper for TensorFlow as the ML subsystem (an illustrative network of this scale is sketched at the end of this subsection). The papers [26], [9] use ML to learn from complete simulations. In [26], the ML learns results (the ionic density at a given location) of a complete simulation:
• D = 5, with the five specifying features being the confinement length h, the positive valency zp, the negative valency zn, the salt concentration c, and the diameter of the ions d.
• S = 4805, which is 70% of the total 6864 runs, with the remaining 30% of the runs used for testing.

In [9], one is not asking ML to predict a result as in [26], but rather training an Artificial Neural Net (ANN) to ensure that the simulation runs at its optimal speed (using, for example, the lowest allowable timestep dt and "good" simulation control parameters for high efficiency) while retaining the accuracy of the final result (e.g. the density profile of ions). For this particular application, we could get away with dividing a 10 million time-step run (~10 nanoseconds, a typical timescale to reach equilibrium and gather data in such systems) into 10 separate runs.
• Input data size D = 6 (1 input uses 64-bit floats and 5 inputs use 32-bit integers; 224 bits in total)
• Input number of samples S = 15640 (70% training, 30% test)
• Hidden layer 1 = 30
• Hidden layer 2 = 48
• Output variables = 3

Creation of the training dataset took 64 cores * 80 hrs * 5400 simulation runs, about 28 million CPU hours, on Indiana University's BigRed2 GPU compute nodes. Each run is 10 million steps long, and the ML is used/trained every 1 million steps (so the block size is a million), yielding 10 times more samples than runs.

Generalizing this, the hardware needs will depend on how often you block, i.e., stop and train the network, and then, either on-the-fly or post-simulation, use that training to accelerate the simulation or to evaluate structure, respectively. Blocking every timestep will not improve the training, as typically it won't produce a statistically independent data point for evaluating any structure you desire. So you want to block at a timescale that is at least greater than the autocorrelation time dc; this is, of course, dependent on the example you are looking at, and so your blocking and learning will depend on the application. In [26], it is small and dc is 3-5 dt; in glasses, it can be huge as the viscosity is high; and in biomolecular simulations, it will also depend on the level of coarse-graining and will be different in fully atomistic or very coarse-grained systems.

The training effort will also depend on the input data size D and on the complexity of the relationship you are trying to learn, which change the number of hidden layers and nodes per layer. For example, suppose you are tracking a particle (a side atom on a molecule in a typical nanoscale simulation) in order to come up with a metric (e.g. the distance between two side atoms on different molecules) to track the diversity of clusters of particles during the self-assembly process. This comes from the expectation that correlations between side atoms may be critical to a macroscopic property (such as the formation of these particles into an FCC crystal). In this case your D is huge, your ML objectives may be looking for a deep relationship, and you may have to invoke an ensemble of ANNs; this will change the hardware needs.
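For concreteness, a Keras network of the scale described in the case-study bullets above might look as follows; the 5 inputs and 3 regression outputs follow [26], the hidden-layer widths follow the [9] bullets, and the exact architectures in those papers may differ.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Illustrative surrogate mixing details from the two case studies.
    surrogate = tf.keras.Sequential([
        layers.Dense(30, activation="relu", input_shape=(5,)),
        layers.Dense(48, activation="relu"),
        layers.Dense(3),      # e.g. contact, peak, and center densities
    ])
    surrogate.compile(optimizer="adam", loss="mse")
    # surrogate.fit(X_sim, Y_sim, validation_split=0.3)
    # One surrogate.predict call then stands in for a multi-hour MD run.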
The relative values will Nlookup even vary over execution time of the application, as the no machine learning and in the limit of large Ntrain becomes Tseq amount of data generated as a ratio of training data will Tlookup which can be huge! vary. This requires runtime systems that are capable of There are many caveats and assumptions here. We are real-time performance tuning and adaptive execution for considering a simple case where one runs the Ntrain sim- workloads comprised of multiple heterogeneous tasks. ulations, followed by the learning and then all the Nlookup 9) The application of these ideas to statistical physics inferences. Further we assume the training simulations are problems may need different techniques than those used useful results and not just overhead. We also have not properly in deterministic time evolutions. considered how to build in the likelihood that training, learning 10) The existing UQ frameworks based on the dropout tech- and lookup phases are probably using different hardware nique can provide the level of certainty as a probabilistic configurations with different node counts. distribution in the prediction space. However, it does not always mean that the quality of the distribution is E. Opportunities and Research Issues dependent on the quality/quantity of data. For example, Research Issues: In addition to the six categories at the two models with different dropout rates can produce interface of ML and HPC, the research issues we identify different UQ results. If the goal of UQ in MLaroundHPC reflect the multiple interdisciplinary activities linked in our context is to supply only an adequate amount of data, study of MLforHPC, including application domains described we need a more reliable UQ method tailored for this in sections II-A, II-B, II-C1 and II-C2, as well as coarse purpose rather than the dropout technique that tends to graining studied in our case for network science and nano- manipulate the architecture of the model. bio areas. 11) Application agnostic description and defintion of effec- We have identified the following research areas, which can tive performance enhancement. be categorized into Algorithms and Methods (1-5), Applied Math (10), Software Systems (6,7), Performance Measurement C ONCLUSIONS and Engineering (8,11). Broken Abstractions, New Abstractions: In traditional 1) Where can application domains use MLaroundHPC and HPC the prevailing orthodoxy is Faster is Better has driven MLautotuning effectively and what science is enabled the quest for abstractions of hierarchical parallelism to speed- by this ing up single units of works. Relinquishing the orthodoxy 2) Which ML and DL approaches are most relevant and based upon hierarchical (vertical) parallelism as the only how can they be set up to enable broad user-friendly route to performance is necessary. The new paradigm in HPC MLaroundHPC and MLautotuning in domain science — Learning Everywhere, implies new performance, scaling 3) How can Uncertainty Quantification be enabled and and execution approaches. In this new paradigm, multiple, separately study ergodicity (bias) and accuracy issues? concurrent heterogeneous units of work replace single large 4) Is there new area of algorithmic research focusing on units of works, which thus require both hierarchical (vertical) finding algorithms that can be most effectively learnt? parallelism as well horizontal (many task) parallelism. 5) Is there a general multiscale approach using MLaroundHPC. 
E. Opportunities and Research Issues

Research Issues: In addition to the six categories at the interface of ML and HPC, the research issues we identify reflect the multiple interdisciplinary activities linked in our study of MLforHPC, including the application domains described in sections II-A, II-B, II-C1 and II-C2, as well as the coarse-graining studied in our case for the network science and nano-bio areas. We have identified the following research areas, which can be categorized into Algorithms and Methods (1-5), Applied Math (10), Software Systems (6, 7), and Performance Measurement and Engineering (8, 11).
1) Where can application domains use MLaroundHPC and MLautotuning effectively, and what science is enabled by this?
2) Which ML and DL approaches are most relevant, and how can they be set up to enable broad, user-friendly MLaroundHPC and MLautotuning in domain science?
3) How can Uncertainty Quantification be enabled, and how can ergodicity (bias) and accuracy issues be studied separately?
4) Is there a new area of algorithmic research focusing on finding algorithms that can be most effectively learnt?
5) Is there a general multiscale approach using MLaroundHPC?
6) What are appropriate systems frameworks for MLaroundHPC and MLautotuning? For example, should we wrap microservices invoked by a Function as a Service environment? Where and how should we enable learning systems? Is Dataflow useful?
7) The different characters of surrogate and real executions produce system challenges, as surrogate execution is much faster and invokes distinct software and hardware. This heterogeneity gives challenges for parallel computing, workload management and resource scheduling (heterogeneous and dynamic workflows). The implication for performance is briefly discussed in sections III-A and III-D.
8) Scaling applications that are composed of multiple heterogeneous computational (execution) units and have distinct forms of parallelism that need balanced performance. Consider a workload comprised of NL learning units and NS simulation units. The relative number of learning units to simulation units will vary with application and problem type. The relative values will even vary over the execution time of the application, as the amount of data generated as a ratio of training data will vary. This requires runtime systems that are capable of real-time performance tuning and adaptive execution for workloads comprised of multiple heterogeneous tasks.
9) The application of these ideas to statistical physics problems may need different techniques than those used in deterministic time evolutions.
10) The existing UQ frameworks based on the dropout technique can provide the level of certainty as a probabilistic distribution in the prediction space. However, the quality of that distribution is not always dependent on the quality or quantity of the data; for example, two models with different dropout rates can produce different UQ results. If the goal of UQ in the MLaroundHPC context is to supply only an adequate amount of data, we need a more reliable UQ method tailored for this purpose, rather than the dropout technique, which tends to manipulate the architecture of the model.
11) Application-agnostic description and definition of effective performance enhancement.

CONCLUSIONS

Broken Abstractions, New Abstractions: In traditional HPC, the prevailing orthodoxy that Faster is Better has driven the quest for abstractions of hierarchical parallelism to speed up single units of work. Relinquishing the orthodoxy based upon hierarchical (vertical) parallelism as the only route to performance is necessary. The new paradigm in HPC, Learning Everywhere, implies new performance, scaling and execution approaches. In this new paradigm, multiple concurrent heterogeneous units of work replace single large units of work, which thus requires both hierarchical (vertical) parallelism as well as horizontal (many-task) parallelism.

ACKNOWLEDGMENTS

This work was partially supported by NSF CIF21 DIBBS 1443054 and nanoBIO 1720625; the Indiana University Precision Health initiative; and Intel, through the Parallel Computing Center at Indiana University. JPS and JAG were partially supported by NSF 1720625, NIH U01 GM111243 and NIH GM122424. SJ was partially supported by ExaLearn, a DOE Exascale Computing project.

REFERENCES

[1] Geoffrey Fox, James A. Glazier, JCS Kadupitiya, Vikram Jadhao, Minje Kim, Judy Qiu, James P. Sluka, Endre Somogyi, Madhav Marathe, Abhijin Adiga, Jiangzhuo Chen, Oliver Beckstein, and Shantenu Jha. Learning Everywhere: Pervasive machine learning for effective High-Performance computation: Application background. Technical report, Indiana University, February 2019. http://dsc.soic.indiana.edu/publications/Learning Everywhere.pdf.
[2] Geoffrey Fox, Judy Qiu, Shantenu Jha, Saliya Ekanayake, and Supun Kamburugamuve. Big data, simulations and HPC convergence. In Springer Lecture Notes in Computer Science LNCS 10044, 2016.
[3] Peter M Kasson and Shantenu Jha. Adaptive ensemble simulations of biomolecules. Current Opinion in Structural Biology, 52:87–94, 2018.
[4] NSF1849625 workshop series BDEC2: Toward a common digital continuum platform for big data and extreme-scale computing (BDEC2). https://www.exascale.org/bdec/.
[5] José Miguel Hernández-Lobato, Klaus-Robert Müller, Brooks Paige, Matt J. Kusner, Stefan Chmiela, and Kristof T. Schütt. Machine learning for molecules and materials. In Proceedings of the NeurIPS 2018 Workshop, December 2018.
[6] Oliver Beckstein, Geoffrey Fox, Judy Qiu, David Crandall, Gregor von Laszewski, John Paden, Shantenu Jha, Fusheng Wang, Madhav Marathe, Anil Vullikanti, and Thomas Cheatham. Contributions to High-Performance big data computing. Technical report, Digital Science Center, September 2018.
[7] Jeff Dean. Machine learning for systems and systems for machine learning. Presentation at the 2017 Conference on Neural Information Processing Systems, 2017.
[8] Satoshi Matsuoka. Post-K: A game changing supercomputer for convergence of HPC and big data / AI. Multicore 2019, February 2019.
[9] JCS Kadupitiya, Geoffrey C. Fox, and Vikram Jadhao. Machine learning for parameter auto-tuning in molecular dynamics simulations: Efficient dynamics of ions near polarizable nanoparticles. Technical report, Indiana University, November 2018.
[10] Microsoft Research. AI for database and data analytic systems at Microsoft Faculty Summit. https://youtu.be/Tkl6ERLWAbA, 2018. Accessed: 2019-1-29.
[11] Microsoft Research. AI for AI systems at Microsoft Faculty Summit. https://youtu.be/MqBOuoLflpU, 2018. Accessed: 2019-1-29.
[12] Francis J. Alexander and Shantenu Jha. Objective driven computational experiment design: An ExaLearn perspective. In Terry Moore and Geoffrey Fox, editors, Online Resource for Big Data and Extreme-Scale Computing Workshop, November 2018.
[13] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (part I): Data-driven solutions of nonlinear partial differential equations. arXiv, November 2017.
[14] Mustafa Mustafa, Deborah Bard, Wahid Bhimji, Rami Al-Rfou, and Zarija Lukić. Creating virtual universes using generative adversarial networks. Technical report, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, June 2017.
[15] Shing Chan and Ahmed H Elsheikh. A machine learning approach for efficient uncertainty quantification using multiscale methods. J. Comput. Phys., 354:493–511, February 2018.
[16] A Townsend Peterson, Monica Papeş, and Jorge Soberón. Mechanistic and correlative models of ecological niches. European Journal of Ecology, 1(2):28–38, 2015.
[17] Anuj Karpatne, Gowtham Atluri, James H. Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, and Vipin Kumar. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data Engineering, 29(10):2318–2331, 2017.
[18] Mark EJ Newman. Spread of epidemic disease on networks. Physical Review E, 66(1):016128, 2002.
[19] Lijing Wang, Jiangzhuo Chen, and Madhav Marathe. DEFSI: Deep learning based epidemic forecasting with synthetic information. In Proceedings of the 30th Innovative Applications of Artificial Intelligence (IAAI), 2019.
[20] James M Osborne, Alexander G Fletcher, Joe M Pitt-Francis, Philip K Maini, and David J Gavaghan. Comparing individual-based approaches to modelling the self-organization of multicellular tissues. PLoS Comput. Biol., 13(2):e1005387, February 2017.
[21] James P. Sluka, Xiao Fu, Maciej Swat, Julio M. Belmonte, Alin Cosmanescu, Sherry G. Clendenon, John F. Wambaugh, and James A. Glazier. A liver-centric multiscale modeling framework for xenobiotics. PLoS ONE, 11(9), 2016.
[22] Qianxiao Li, Felix Dietrich, Erik M. Bollt, and Ioannis G. Kevrekidis. Extended dynamic mode decomposition with dictionary learning: A data-driven adaptive spectral decomposition of the Koopman operator. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(10):103111, 2017.
[23] Adam A Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo Dalla Favera, and Andrea Califano. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7 Suppl 1:S7, March 2006.
[24] Liang Liang, Minliang Liu, Caitlin Martin, John A Elefteriades, and Wei Sun. A machine learning approach to investigate the relationship between shape features and numerically predicted risk of ascending aortic aneurysm. Biomech. Model. Mechanobiol., 16(5):1519–1533, October 2017.
[25] Jos W. Zwanikken and Monica Olvera de la Cruz. Tunable soft structure in charged fluids confined by dielectric interfaces. Proceedings of the National Academy of Sciences, 110(14):5301–5308, 2013.
[26] JCS Kadupitiya, Geoffrey C. Fox, and Vikram Jadhao. Machine learning for performance enhancement of molecular dynamics simulations. Technical report, Indiana University, December 2018.
[27] Adrià Pérez, Gerard Martínez-Rosell, and Gianni De Fabritiis. Simulations meet machine learning in structural biology. Current Opinion in Structural Biology, 49:139–144, 2018.
[28] Keith T Butler, Daniel W Davies, Hugh Cartwright, Olexandr Isayev, and Aron Walsh. Machine learning for molecular and materials science. Nature, 559(7715):547–555, July 2018.
[29] Stefano Piana, John L Klepeis, and David E Shaw. Assessing the accuracy of physical models used in protein-folding simulations: quantitative evidence from long molecular dynamics simulations. Curr Opin Struct Biol, 24:98–105, February 2014.
[30] Jörg Behler and Michele Parrinello. Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces. Physical Review Letters, 98(14):146401, April 2007.
[31] Jörg Behler. First Principles Neural Network Potentials for Reactive Simulations of Large Molecular and Condensed Systems. Angewandte Chemie International Edition, 56(42):12828–12840, 2017.
[32] Michael Gastegger, Jörg Behler, and Philipp Marquetand. Machine learning molecular dynamics for the simulation of infrared spectra. Chemical Science, 8(10):6924–6935, 2017.
[33] J. S. Smith, O. Isayev, and A. E. Roitberg. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chemical Science, 8(4):3192–3203, 2017.
[34] Justin S. Smith, Ben Nebgen, Nicholas Lubbers, Olexandr Isayev, and Adrian E. Roitberg. Less is more: Sampling chemical space with active learning. The Journal of Chemical Physics, 148(24):241733, May 2018.
[35] Jiang Wang, Christoph Wehmeyer, Frank Noé, and Cecilia Clementi. Machine learning of coarse-grained molecular dynamics force fields. arXiv, 1812.01736v2, 2018.
[36] Pedro E. M. Lopes, Olgun Guvench, and Alexander D. MacKerell. Current Status of Protein Force Fields for Molecular Dynamics Simulations, pages 47–71. Springer New York, New York, NY, 2015.
[37] MLPerf benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms. https://mlperf.org/. Accessed: 2019-2-8.
[38] Uber Engineering. Horovod: Uber's open source distributed deep learning framework for TensorFlow. https://eng.uber.com/horovod/. Accessed: 2019-2-8.
[39] Intel Parallel Computing Center at Indiana University, led by Judy Qiu. http://ipcc.soic.iu.edu/. Accessed: 2018-9-30.
[40] Bingjing Zhang, Bo Peng, and Judy Qiu. Parallelizing big data machine learning applications with model rotation. In G. Fox, V. Getov, L. Grandinetti, G.R. Joubert, and T. Sterling, editors, Advances in Parallel Computing: New Frontiers in High Performance Computing and Big Data. IOS Press, 2017.
[41] Judy Qiu. Harp-DAAL for High-Performance big data computing. Intel Parallel Universe, issue 32, page 31. https://software.intel.com/sites/default/files/parallel-universe-issue-32.pdf, http://dsc.soic.indiana.edu/publications/Intel-Magazine-HarpDAAL10.pdf. Accessed: 2018-9-30.
[42] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8609–8613. IEEE, 2013.
[43] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.