Learning Everywhere: Pervasive Machine Learning for Effective High-Performance Computation

Geoffrey Fox1, James A. Glazier1, JCS Kadupitiya1, Vikram Jadhao1, Minje Kim1, Judy Qiu1, James P. Sluka1, Endre Somogyi1, Madhav Marathe2, Abhijin Adiga2, Jiangzhuo Chen2, Oliver Beckstein3, Shantenu Jha4,5
1 Indiana University, Bloomington, IN
2 University of Virginia, Charlottesville, VA
3 Arizona State University, Tempe, AZ
4 Rutgers, the State University of New Jersey, Piscataway, NJ 08854, USA
5 Brookhaven National Laboratory, Upton, New York, 11973

Abstract—The convergence of HPC and data-intensive methodologies provides a promising approach to major performance improvements. This paper provides a general description of the interaction between traditional HPC and ML approaches and motivates the "Learning Everywhere" paradigm for HPC. We introduce the concept of "effective performance" that one can achieve by combining learning methodologies with simulation-based approaches, and distinguish it from traditional performance as measured by benchmark scores. To support the promise of integrating HPC and learning methods, this paper examines specific examples and opportunities across a series of domains. It concludes with a series of open computer science and cyberinfrastructure questions and challenges that the Learning Everywhere paradigm presents.

I. INTRODUCTION

This paper describes opportunities at the interface between large-scale simulations, experiment design and control, machine learning (ML, including deep learning, DL) and High-Performance Computing. We describe both the current status and possible research issues in allowing machine learning to pervasively enhance computational science. How should one do this and where is it valuable? We focus on research challenges in computing for science and engineering (as opposed to commercial) use cases, for both big data and big simulation problems. More details, including further citations, can be found at [1].

The convergence of HPC and data-intensive methodologies [2] provides a promising approach to major performance improvements. Traditional HPC simulations are reaching the limits of their original trajectory of progress. The end of Dennard scaling of transistor power usage and the end of Moore's Law as originally formulated have yielded fundamentally different processor architectures. The architectures continue to evolve, resulting in highly costly if not damaging churn in scientific codes that need to be finely tuned to extract the last iota of parallelism and performance.

In domain sciences such as the biomolecular sciences, advances in statistical algorithms and runtime systems have enabled extreme-scale ensemble-based applications [3] to overcome limitations of traditional monolithic simulations. However, in spite of several orders of magnitude improvement in efficiency from these adaptive ensemble algorithms, the complexity of phase space and dynamics for even modest physical systems requires additional orders of magnitude improvements and performance gains.

In many application domains, integrating traditional HPC approaches with machine learning methods arguably holds the greatest promise towards overcoming these barriers. The need for performance increases underlies the international efforts behind the exascale supercomputing initiatives, and we believe that the integration of ML into large-scale computations (for both simulations and analytics) is a very promising way to get even larger performance gains. Further, it can enable paradigms such as control or steering, and provide a fundamental approach to coarse-graining, which is a difficult but essential aspect of many multi-scale application areas. Papers at two recent workshops, BDEC2 [4] and NeurIPS [5], confirm our point of view, and our approach is synergistic with the BDEC2 process, with its emphasis on new application requirements and their implications for future scientific computing software platforms. We would like to distinguish between traditional performance, measured by operations per second or benchmark scores, and the effective performance that one gets by combining learning with simulation, which gives increased performance as seen by the user without changing the traditional system characteristics. This is of particular interest in cases where there is a tight coupling between the learning and simulation components (as outlined below for MLforHPC). The need for significant enhancement in the effective performance of HPC motivates the introduction of a new paradigm in HPC: Learning Everywhere!
Different Interfaces of ML and HPC: We have identified [6], [4] several important, distinctly different links between machine learning (ML) and HPC. We define two broad categories, HPCforML and MLforHPC:
• HPCforML: Using HPC to execute and enhance ML performance, or using HPC simulations to train ML algorithms (theory-guided machine learning), which are then used to understand experimental data or simulations.
• MLforHPC: Using ML to enhance HPC applications and systems.
This categorization is related to Jeff Dean's "Machine Learning for Systems and Systems for Machine Learning" [7] and Satoshi Matsuoka's convergence of AI and HPC [8]. We further subdivide HPCforML as
• HPCrunsML: Using HPC to execute ML with high performance.
• SimulationTrainedML: Using HPC simulations to train ML algorithms, which are then used to understand experimental data or simulations.
We also subdivide MLforHPC as
• MLautotuning: Using ML to configure (autotune) ML or HPC simulations. Autotuning with systems like ATLAS is already hugely successful and gives an initial view of MLautotuning. As well as choosing block sizes to improve cache use and vectorization, MLautotuning can also be used for simulation mesh sizes [9] and, in big data problems, for configuring databases and complex systems like Hadoop and Spark [10], [11].
• MLafterHPC: ML analyzing the results of HPC, as in trajectory analysis and structure identification in biomolecular simulations.
• MLaroundHPC: Using ML to learn from simulations and produce learned surrogates for the simulations. The same ML wrapper can also learn configurations as well as results. This differs from SimulationTrainedML, where typically the learnt network is applied to observational data, whereas in MLaroundHPC we use the ML to improve the HPC performance (a minimal sketch of this pattern follows the list below).
• MLControl: Using simulations (with HPC) in the control of experiments and in objective-driven computational campaigns [12]. Here the simulation surrogates are very valuable to allow real-time predictions.
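The MLaroundHPC pattern can be made concrete with a small sketch: an ensemble surrogate answers when its members agree, and otherwise falls back to the simulation and learns from the new result. Here run_simulation is a hypothetical stand-in for an HPC code, and the ensemble size and uncertainty threshold are arbitrary illustrative choices, not values from any of the cited papers.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Hypothetical sketch: run_simulation() stands in for a full HPC run.
    models = [GradientBoostingRegressor(subsample=0.8, random_state=s)
              for s in range(5)]
    X_hist, y_hist = [], []          # accumulated (input, result) pairs

    def query(x, tol=0.05, min_data=50):
        if len(y_hist) >= min_data:
            preds = np.array([m.predict([x])[0] for m in models])
            if preds.std() < tol:    # ensemble agrees: trust the surrogate
                return preds.mean()
        y = run_simulation(x)        # otherwise pay for the real simulation
        X_hist.append(x); y_hist.append(y)
        for m in models:             # the wrapper keeps learning
            m.fit(X_hist, y_hist)
        return y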
All six topics above are important and pose many research issues in computer science and cyberinfrastructure, directly in application domains, and in the integration of technology with applications. However, in this paper we focus on topics in MLforHPC, with close coupling between ML, simulations, and HPC. We involve applications as a driver for the requirements and evaluation of the computer science and infrastructure. In researching MLaroundHPC we will consider ML wrappers for either HPC simulations or complex ML algorithms implemented with HPC. Our focus is on how to increase effective performance with the learning everywhere principle, and how to build efficient learning everywhere parallel systems.

One can view the use of ML-learned surrogates as a performance boost that can lead to huge speedups, as the calculation of a prediction from a trained network can be many orders of magnitude faster than full execution of the simulation, as shown in section III-D. One can reach Exa- or even Zetta-scale equivalent performance for simulations with existing hardware systems. These high-performance surrogates are valuable in education and control scenarios by just speeding up existing simulations. Simple examples are the use of a surrogate to represent a chemistry potential, or of a larger grain size to solve the diffusion equation underlying cellular and tissue level simulations. The development of systematic ML-based coarse-graining techniques, in both socio-technical simulations and nano-bio(cell)-tissue layering, arises as an important area of research. In general, domain-specific expertise will be needed to understand the required accuracy and the number of training simulation runs needed.

There are many groups working in MLaroundHPC, but most of the work is just starting and is not built around a systematic study of research issues as we propose. There is some deep work in building reduced-dimension models for use in control scenarios [13]. We look at three distinct important areas: networked systems with socio-technical simulations, multiscale cell and tissue simulations, and, at a finer scale, biomolecular and nanoscale molecular systems.
We note that the biomolecular and biocomplexity areas represent 40% of the HPC cycles used on NSF computational resources, so this is an area that is particularly ready and valuable. The molecular sciences have had several successful examples of using ML for autotuning and ML for analyzing the output of HPC simulation data. Several fields have made progress in using MLaroundHPC; e.g., CosmoFlow and CosmoGAN [14] are amongst the better known projects, and the materials community is actively exploring the uptake of MLControl for the design of materials [4].

This paper does not cover the development of new ML algorithms, but rather advancing the understanding of ML, including Deep Learning (DL), in support of MLaroundHPC. Of course, the usage experience is likely to suggest new ML approaches of value outside the MLaroundHPC arena. If one is to use ML to replace a simulation, then an accuracy estimate is essential, and as discussed in III-B there is a need to build on initial work on UQ (Uncertainty Quantification) with ML [15], such as that using dropout regularization to build ensembles for UQ. There are more sophisticated Bayesian methods to investigate. The research must also address ergodicity, viz., have we learned across the full phase space of initial values. Here methods taken from the Monte Carlo arena could be useful, as reliable integration over a domain is related to reliable estimates of values defined across a domain. Further, much of our learning is for analytic functions, whereas much of the existing DL experience is for the discrete-valued classifiers of commercial importance.

Section III discusses cyberinfrastructure and computer science questions: section III-B covers uncertainty quantification for learnt results, while section III-C covers the infrastructure requirements needed to implement MLforHPC. Section III-D gives a general performance analysis method and applies it to current cases, and Section III-E covers new opportunities and research issues.

II. SCIENCE EXEMPLARS

A. Machine learning for Networked Systems

In this section we describe a hybrid method that fuses machine learning and mechanistic models to overcome the challenges posed by scenarios where data is sparse and knowledge of the underlying mechanism is inadequate. Across domains, the two approaches have been compared [16]. The machine learning approach usually needs a large amount of observation data for training, and does not explicitly account for the mechanisms that govern the complex phenomenon. On the other hand, mechanistic models (like agent-based models) result from a bottom-up approach, but they tend to have too many parameters, are compute intensive, and are hard to calibrate. In recent years, there have been several efforts to study physical processes under the umbrella of theory-guided data science (TGDS), with a focus on artificial neural networks (ANN) as the primary learning tool. [17] provides a survey of these methods and their application to hydrology, climate science, turbulence modeling, etc., where the underlying theory can be used to reduce the variance in model parameters by introducing constraints or priors in the model space.

Here we consider a particular class of mechanistic models, network dynamical systems, which have been applied in diverse domains such as epidemiology and computational social science. A network dynamical system is composed of a network where the nodes are agents (representing population, computers, etc.) and the edges capture the interactions between them. A popular example of such systems is the SEIR model of disease spread in a social network [18].
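As a concrete (if minimal) illustration, the sketch below runs a discrete-time SEIR process on a synthetic contact network built with networkx; the graph model and the transition probabilities beta (S to E on contact), sigma (E to I) and gamma (I to R) are arbitrary illustrative choices rather than values from [18].

    import random
    import networkx as nx

    def seir_step(G, state, beta=0.1, sigma=0.2, gamma=0.1):
        # One discrete-time SEIR update; state maps node -> 'S'|'E'|'I'|'R'.
        new = dict(state)
        for v in G:
            if state[v] == 'S':
                # exposure from each infectious neighbor with probability beta
                if any(state[u] == 'I' and random.random() < beta for u in G[v]):
                    new[v] = 'E'
            elif state[v] == 'E' and random.random() < sigma:
                new[v] = 'I'
            elif state[v] == 'I' and random.random() < gamma:
                new[v] = 'R'
        return new

    G = nx.watts_strogatz_graph(1000, k=6, p=0.05)  # synthetic contact network
    state = {v: 'S' for v in G}
    for v in random.sample(list(G), 5):             # seed a few infections
        state[v] = 'I'
    for t in range(100):
        state = seir_step(G, state)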
The complexity of the dynamics in such a network, due to individual-level heterogeneity and interactions, makes it difficult to train a machine learning model that can be generalized to patterns not yet present in historical data. Completely data-driven models cannot discover higher resolution details (e.g. county level incidence) from lower resolution ground truth data (e.g. state level incidence).

Learning from observational and simulation data: Data sparsity is often a challenge for applying machine learning, especially deep learning methods, to forecasting problems in socio-technical systems. One example of such problems is to predict weekly incidence in future weeks of an influenza epidemic. In such socio-technical systems, we usually have only limited observational data, e.g. weekly incidence numbers reported to the Centers for Disease Control and Prevention (CDC). Such data is of low spatiotemporal resolution (weekly at state level), not real time (at least one week delay), incomplete (reported cases are only a small fraction of actual ones), and noisy (adjusted several times after being published), thus necessitating a hybrid framework for forecasting by learning from both observational and simulation data.

Observations need to be augmented with existing domain knowledge and behavior encapsulated in the agent-based model to inform the learning algorithm. In such a hybrid framework, the network dynamical system is used to guide the learning algorithm so that it conforms to the principles (consistency). At the same time, the learning algorithm will facilitate model selection in a principled manner. Moreover, the synthetic data goes beyond the observation data, thus helping avoid overfitting and making the learned model capable of processing patterns unseen in the observation data (generalizability). When the dynamical system is more detailed (e.g. individual level) than the observation data, the hybrid framework allows detailed forecasting (high resolution).

Epidemic Forecasting: Simulation-trained machine learning methods can be used for epidemic forecasting. An example of such a framework is DEFSI (Deep Learning Based Epidemic Forecasting with Synthetic Information), proposed in [19]. It consists of (i) a model configuration module that estimates a distribution for each parameter in an agent-based epidemic model based on coarse surveillance data; (ii) a simulation-generated synthetic training data module, which generates high-resolution training data by running HPC simulations parameterized from the distributions estimated in the previous module; and (iii) a two-branch deep neural network trained on the synthetic training dataset and used to make detailed forecasts with coarse surveillance data as inputs. Experimental results show that DEFSI performs comparably to or better than other methods for state-level forecasting, and it outperforms the EpiFast method for county-level forecasting. See Ref. [1] and citations therein for details.
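To show what a two-branch network of this kind can look like, here is a purely illustrative Keras sketch; the branch inputs (recent in-season weeks versus a historical season signal), the layer types and sizes, and the county count are all assumptions for exposition, not the actual DEFSI architecture, which is specified in [19].

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    n_counties = 133            # assumed number of county-level outputs

    within = layers.Input(shape=(4, 1), name="within_season")     # recent weeks
    between = layers.Input(shape=(52, 1), name="between_season")  # past seasons
    merged = layers.concatenate([layers.LSTM(32)(within),
                                 layers.LSTM(32)(between)])
    hidden = layers.Dense(64, activation="relu")(merged)
    outputs = layers.Dense(n_counties)(hidden)                    # detailed forecast

    model = Model([within, between], outputs)
    model.compile(optimizer="adam", loss="mse")
    # Trained on high-resolution synthetic data from HPC epidemic simulations,
    # then driven by coarse surveillance inputs at forecast time.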
B. ML for Virtual Tissue and Cellular Simulations

1) Virtual Tissue Models: Virtual Tissue (VT) simulations [20] are mechanism-based multiscale spatial simulations of living tissues that address questions about development, maintenance, damage and repair. They also find application in the design of tissues (tissue engineering) and the development of medical therapies, especially personalized therapies. VT simulations are computationally challenging for a number of reasons: 1) VT simulations are agent-based, with the core agent often representing biological cells; the number of cells in a real tissue is often of the order of 10^8 or more. 2) Agents are often hierarchical, with agents composed of multiple agents at smaller scales. 3) Agents interact strongly with each other, often over significant ranges [21]. 4) Individual agents typically contain complex sub-models that control their properties and behaviors. 5) Material properties may be complex, like the shear thickening or thinning or the swelling or contraction of fiber networks. 6) Modeling transport and diffusion is compute intensive. 7) Models are typically stochastic, so predictivity requires many replicas. 8) Simulations involve uncertainty in both model parameters and model structure. 9) Biological and medical time-series data are often qualitative, semi-quantitative or differential, making their use in classical optimization difficult. 10) VT models often produce movies of configurations over time. 11) Finally, simulating populations can add several orders of magnitude to the computational challenge. It is possible that ML techniques can be used to short-circuit implementations at and between scales.

2) Virtual Tissue Modelling and AI + MLandHPC: AI can directly benefit VT applications in a number of ways (a sketch of item 1 appears at the end of this subsection):
1) Short-circuiting: the replacement of computationally costly modules with learned analogues.
2) Parameter fitting in high-dimensional parameter spaces.
3) Treating stochasticity in results as information rather than noise.
4) Prediction of bifurcations in models.
5) Design of maximally discriminatory experiments: predict the parameter sets by which two models can be differentiated.
6) Running time backwards, to determine initial conditions that lead to observed endpoints.
7) The elimination of short time scales, e.g., short-circuiting the calculations of advection-diffusion.
8) Generating additional spatial data sets from experimental images.

Representative prior work by Karniadakis [13], Kevrekidis [22] and Nemenman [23] shows that neural networks can reproduce the temporal behaviors of biochemical regulatory and signaling networks. Ref. [24] has shown that networks can learn nonlinear biomechanics simulations of the aorta, being able to predict the stress and strain distribution in the human aorta from the morphology observable with MRI or CT.
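Returning to item 1 above (short-circuiting), the sketch below trains a small convolutional network to emulate many explicit finite-difference diffusion steps in a single evaluation, in the spirit of item 7; the grid size, step count, diffusion coefficient, and architecture are arbitrary illustrative assumptions.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    def diffuse(u, steps=50, a=0.1):
        # Explicit finite-difference diffusion with periodic boundaries:
        # the computationally costly module to be replaced.
        for _ in range(steps):
            u = u + a * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                         np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)
        return u

    # Training pairs: random initial fields -> fields after 50 solver steps.
    X = np.random.rand(500, 32, 32)
    Y = np.array([diffuse(u) for u in X])

    # Small CNN surrogate that jumps 50 steps in one forward pass.
    model = tf.keras.Sequential([
        layers.Conv2D(16, 3, padding="same", activation="relu",
                      input_shape=(32, 32, 1)),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.Conv2D(1, 3, padding="same"),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X[..., None], Y[..., None], epochs=10, verbose=0)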
C. Machine Learning and Molecular Simulations

1) Nanoscale simulation: Despite employing the optimal parallelization techniques suited to the size and complexity of the system, nanoscale simulations remain time consuming. In research settings, simulations can take up to several days, and it is often desirable to foresee expected overall trends in key quantities; for example, how does the contact density vary as a function of ion concentration in nanoscale confinement, or how do the peak positions of the pair correlation functions characterizing nanoparticle assembly evolve as the environmental parameters are tuned. Given the dramatic rise in ML and HPC technologies, it is not a question of if, but when, ML can be integrated with HPC to enhance nanoscale simulation methods. Recent years have seen a surge in the use of ML to accelerate materials simulation techniques: ML has been used to predict parameters, generate configurations in material simulations, and classify material properties (see Ref. [1] and citations therein). At this time, it is critical to understand and develop the software frameworks to build ML layers around HPC to 1) enhance simulation performance, 2) enable real-time and anytime engagement, and 3) broaden the applicability of simulations for both research and education (in-classroom) usage.

In the context of nanoscale simulation, an initial set of applications for the MLaroundHPC framework can be the prediction of the structure or correlation functions (outputs) characterizing the nanoscale system over a broad range of experimental control parameters (inputs). MLaroundHPC can enable the following outcomes:
1) Learn pre-identified critical features associated with the simulation output.
2) Generate accurate predictions for un-simulated state points (by entirely bypassing simulations).
3) Exhibit auto-tunability (with new simulation runs, the ML layer gets better at making predictions).
4) Enable real-time, anytime, and anywhere access to simulation results (particularly important for education use).
5) No run is wasted: training needs both successful and unsuccessful runs.

To illustrate these outcomes, we discuss nanoscale simulations aimed at the computation of the structure of ions confined by surfaces that are nanometers apart, which has been the focus of recent experiments and computational studies (see Ref. [1] and citations therein). Typically, the entire ionic distribution, averaged over a sufficient number of independent samples generated during the simulation, is a quantity of interest. However, in many important cases, average values of the contact density or center density directly relate to important experimentally measured quantities such as the osmotic pressure [25]. Further, it is often useful to visualize expected trends in the behavior of the contact or mid-point density as a function of solution conditions or ionic attributes before running simulations to explore specific system conditions. It is thus desirable that a "smart" simulation framework provide rapid estimates of these critical output features with high accuracy. MLaroundHPC can enable precisely this: we recently showed that an artificial neural network successfully learns, from completed simulation results, the desired features associated with the output ionic density profiles, rapidly generating predictions for contact, peak, and center densities in excellent agreement with the results from explicit simulations [26].
2) Biomolecular simulations: The use of ML, and in particular DL, approaches for biomolecular simulations [27] lags behind other areas such as nano-science and materials science [28]. This might be partly due to the difficulty of accounting for large heterogeneous systems with important interactions at short and long length scales. But it might also indicate that the commonly used classical empirical force fields are surprisingly successful [29], and it is not easy to outperform them at this level of approximation. Therefore, one primary direction of research in this area is to improve the accuracy of the simulation while maintaining the performance of empirical energy functions.

One promising approach is based on work by Behler and Parrinello [30], who devised a NN-based potential that was trained on quantum mechanical DFT energies; their key insight was to represent the total energy as a sum of atomic contributions, and to represent the chemical environment around each atom by an identically structured NN, which takes as input appropriate symmetry functions that are rotation and translation invariant, as well as invariant to the exchange of atoms, while correctly reflecting the local environment that determines the energy [31]. Based on this work, Gastegger et al. [32] used ML to accelerate ab-initio MD (AIMD) to compute accurate IR spectra for organic molecules, including the biological Ala3+ tripeptide in the gas phase. Interestingly, the ML model was able to reproduce anharmonicities and incorporate proton transfer reactions between different Ala3+ tautomers without having been explicitly trained on such a chemical event, highlighting the promise of such an approach to incorporate a wide range of physically relevant effects with the right training data. The ML model was more than 1000 times faster than the traditional evaluation of the underlying quantum mechanical physical equations.

Roitberg et al. [33] trained a NN on QM DFT calculations, based on modified Behler-Parrinello symmetry functions. The resulting ANI-1 model was shown to be chemically accurate and transferable, with a performance similar to a classical force field, thus enabling ab-initio molecular dynamics (AIMD) at a fraction of the cost of "true" DFT AIMD. Extensions of their work with an active learning (AL) approach demonstrated that proteins in an explicit water environment can be simulated with a NN potential at DFT accuracy [34]. The AL approach reduced the amount of required training data to 10% of the original model [34] by iteratively adding training data calculations for regions of chemical space where the current ML model could not make good predictions. Using transfer learning, the ANI-1 potential was also extended to predict energies at the highest level of quantum chemistry calculations (coupled cluster CCSD(T)), with speedups in the billions.

In general, the focus has been on achieving DFT-level accuracy, because NN potentials are not cheaper to evaluate than most classical empirical potentials. However, replacing the solvent-solvent and solvent-solute interactions, which typically make up 80%-90% of the computational effort in a classical all-atom, explicit solvent simulation, with a NN potential promises large performance gains at a fraction of the cost of traditional implicit solvent models and with an accuracy comparable to the explicit simulations [35], as also discussed above in the case of electrolyte solutions. Furthermore, the inclusion of polarization, which is expensive (a factor of 3-10 in current classical polarizable force fields [36]) but of great interest when studying the interaction of multivalent ions with biomolecules, might be easily achievable with appropriately trained ML potentials.
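The structural idea behind the Behler-Parrinello construction can be sketched in a few lines of Keras: one shared per-atom network maps precomputed symmetry-function descriptors to atomic energy contributions, which are summed to the total energy, making the result invariant to exchanging like atoms. The atom count, descriptor length, and layer widths here are arbitrary assumptions; real implementations such as ANI-1 [33] use element-specific networks and carefully designed descriptors.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    n_atoms, n_sym = 16, 32   # assumed system size and descriptor length

    # One shared network evaluates every atomic environment identically.
    atomic_net = tf.keras.Sequential([
        layers.Dense(64, activation="tanh", input_shape=(n_sym,)),
        layers.Dense(64, activation="tanh"),
        layers.Dense(1),                       # per-atom energy contribution
    ])

    sym_funcs = layers.Input(shape=(n_atoms, n_sym))  # precomputed descriptors
    per_atom_E = layers.TimeDistributed(atomic_net)(sym_funcs)
    total_E = layers.Lambda(lambda e: tf.reduce_sum(e, axis=1))(per_atom_E)

    model = Model(sym_funcs, total_E)
    model.compile(optimizer="adam", loss="mse")  # fit to reference DFT energies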
III. INTEGRATING ML AND HPC: BACKGROUND AND OPPORTUNITIES

A primary contribution of this paper is the categorization, description and examples of the different ways in which ML can enhance HPC (MLforHPC). Before we expound upon MLforHPC and open research issues, we provide a summary of the status of HPC for ML (beyond the obvious and well-studied use of GPUs for ML).

A. HPC for Machine Learning

There has been substantial community progress here, with the industry-supported MLPerf [37] machine learning benchmark activity and Uber's Horovod open source distributed deep learning framework for TensorFlow [38]. We have studied different parallel patterns (kernels) of machine learning applications, looking in particular at Gibbs Sampling, Stochastic Gradient Descent (SGD), Cyclic Coordinate Descent (CCD) and K-means clustering [39]. These algorithms are fundamental for large-scale data analysis and cover several important categories: Markov Chain Monte Carlo (MCMC), Gradient Descent, and Expectation Maximization (EM). We show that parallel iterative algorithms can be categorized into four types of computation models, (a) Locking, (b) Rotation, (c) Allreduce, and (d) Asynchronous, based on the synchronization patterns and the effectiveness of the model parameter update. A major challenge of scaling is that the computation is irregular and the model size can be huge. At the same time, parallel workers need to synchronize the model continually. By investigating collective vs. asynchronous methods of model synchronization, we discover that optimized collective communication can improve the model update speed, thus allowing the model to converge faster. The performance improvement derives not only from accelerated communication but also from reduced iteration computation time, as the model size may change during model convergence. To foster faster model convergence, we need to design new collective communication abstractions. We identify five classes of data-intensive computation [2], from pleasingly parallel to machine learning and simulations, and aim to re-design a modular software stack with native kernels to effectively utilize scale-up servers for machine learning and data analytics applications. We are investigating how simulations and Big Data can use common programming environments with a runtime based on a rich set of collectives and libraries for a model-centric approach [40], [41].
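Of the four computation models above, the Allreduce pattern is the simplest to sketch. The mpi4py fragment below averages per-worker gradients so that every worker applies the same synchronous update; the gradient itself is a random placeholder standing in for the real per-shard computation.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    # Each worker computes a gradient on its own data shard
    # (a random placeholder stands in for the real computation).
    local_grad = np.random.rand(1_000_000)

    # Synchronous collective: sum across all workers, then average.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= comm.Get_size()

    # Every worker now applies the identical averaged update, keeping
    # model replicas consistent; the Asynchronous model relaxes this barrier.

Run with, e.g., mpirun -n 4 python allreduce_sketch.py (the filename is arbitrary).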
Parallel Computing: We know that heterogeneity can lead to difficulty in parallel computing. This is extreme for MLaroundHPC, as the ML-learnt result can be huge factors (10^5 in our initial example [26]) faster than simulated answers. Further, learning can be dynamic within a job and within different runs of a given job. One can address this by load balancing the unlearnt and learnt cases separately, but this can lead to geometric issues, as it is quite likely that ML learning works more efficiently (for more potential simulations) in particular regions of phase space.

B. Uncertainty Quantification for Deep Learning

An important aspect of the use of a learned ML model is that one must learn not just the result of a simulation but also the uncertainty of the prediction, e.g. whether the learned result is valid enough to be used. This can be explained in terms of the bias-variance trade-off, which is based on the decomposition of the expected error into two parts: variance and bias. The variance part explains the uncertainty of the model training process due to randomness in the training algorithms or a lack of representativeness of the training set. A regularization scheme can reduce the variance, so that the model complexity is under control, and can result in a smoother model. However, the regularization approach comes at the cost of an increased amount of bias, the other term in the expected error decomposition, which explains the fitness of the model: by regularizing the model, the training algorithm can make only a limited effort to minimize the training error. On the contrary, an unregularized model with a higher model complexity than necessary can also result in a minimal training error, while it suffers from high variance.

Ideally, the bias-variance trade-off can be resolved to some degree by averaging trained instances of an originally complex model. Once these model instances are complex enough to fit the training data set, we can use the averaged predictions as the output of the model. However, averaging many different model instances implies a practical difficulty: one has to conduct multiple optimization tasks to secure a statistically meaningful sample distribution of the predictions. Given the assumption that the model might as well be a complex one to minimize the bias component (e.g. a deep neural network), the model averaging strategy is computationally challenging.

Dropout has been extensively used in deep learning as a regularization technique [42], but recent research revisits it as an uncertainty quantification (UQ) tool [43]. The dropout procedure can be seen as an efficient way to maintain a pool of multiple network instances for the same optimization task. It is an efficient ensemble technique, as it applies a randomly sampled Bernoulli mask to the layer-wise input units, thus exposing the optimization process to many differently structured instances of the network. A set of differently thinned versions of the network can form a sample distribution of predictions to be used as a UQ metric. The dropout-based UQ scheme can provide an opportunity for MLaroundHPC simulation experiments. For a data-driven model, it is reasonable to assume that a better ML surrogate can be found once the training routine sees more examples generated from the simulation experiment. However, creating more examples to train a better ML model is a conflicting requirement, as the purpose of training the ML surrogate is to avoid such computation. The UQ scheme can play a role here by providing the training routine with a way to quantify the uncertainty in the prediction: once it is low enough, the training routine is less likely to need more data.
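A minimal Keras sketch of this dropout-based UQ, following [43]: keeping dropout active at inference time samples differently thinned networks, and the spread of their predictions serves as the UQ metric. The layer sizes and dropout rate are illustrative assumptions.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    # Surrogate with dropout; sizes and rate are illustrative choices.
    model = tf.keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(5,)),
        layers.Dropout(0.2),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    # model.fit(X_train, y_train, ...)  # fit on completed simulation results

    def mc_dropout_predict(x, n_samples=100):
        # Calling the model with training=True keeps the Bernoulli masks
        # active, sampling differently thinned networks [43].
        preds = np.stack([model(x, training=True).numpy()
                          for _ in range(n_samples)])
        # The mean is the prediction; the spread is the UQ metric that can
        # tell the training routine whether more simulation data is needed.
        return preds.mean(axis=0), preds.std(axis=0)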
C. Machine Learning for HPC

Here we review the nature of the machine learning needed for MLforHPC in different application domains. The Machine Learning (ML) load depends on 1) the time interval between its invocations, which translates into the number of training samples S, and 2) the size D of the data set specifying each sample. This size could be as large as the number of degrees of freedom in the simulation, or could be (much) smaller if just a few parameters are needed to define the simulation. We note two general issues:
• There can be very important data transfer and storage issues in linking the Simulation and Machine Learning parts of the system. This could need carefully designed architectures for both hardware and software.
• The Simulation and Machine Learning subsystems are likely to require different node optimizations, as in different types and uses of accelerators.

D. Science Exemplar: Nanosimulations

In this subsection, using the example of nanosimulations, we show that progress in all areas at the intersection of HPC and ML is having an impact. In each of the two cases below, one uses scikit-learn, TensorFlow and the Keras wrapper for TensorFlow as the ML subsystem (an illustrative network of this scale is sketched at the end of this subsection). The papers [26], [9] use ML to learn from complete simulations. In [26], the ML learns results (the ionic density at a given location) of a complete simulation:
• D = 5, with the five specifying features being the confinement length h, the positive valency zp, the negative valency zn, the salt concentration c, and the diameter of the ions d.
• S = 4805, which is 70% of the total 6864 runs, with the remaining 30% of the runs used for testing.

In [9], one is not asking ML to predict a result as in [26], but rather training an Artificial Neural Net (ANN) to ensure that the simulation runs at its optimal speed (using, for example, the lowest allowable timestep dt and "good" simulation control parameters for high efficiency) while retaining the accuracy of the final result (e.g. the density profile of ions). For this particular application, we could get away with dividing a 10 million time-step run (~10 nanoseconds, a typical timescale to reach equilibrium and gather data in such systems) into 10 separate runs.
• Input data size D = 6 (1 input uses 64-bit floats and 5 inputs use 32-bit integers; 224 bits in total)
• Input number of samples S = 15640 (70% training, 30% test)
• Hidden layer 1 = 30
• Hidden layer 2 = 48
• Output variables = 3

Creation of the training dataset took 64 cores * 80 hrs * 5400 simulation runs, about 28 million CPU hours, on Indiana University's BigRed2 GPU compute nodes. Each run is 10 million steps long, and the ML is used/trained every 1 million steps (so the block size is a million), yielding 10 times more samples than runs.

Generalizing this, the hardware needs will depend on how often you block, i.e., stop and train the network, and then, either on-the-fly or post-simulation, use that training to accelerate the simulation or to evaluate structure, respectively. Blocking every timestep will not improve the training, as typically it won't produce a statistically independent data point for evaluating any structure you desire. So you want to block at a timescale that is at least greater than the autocorrelation time dc; this is, of course, dependent on the example you are looking at, and so your blocking and learning will depend on the application. In [26], it is small and dc is 3-5 dt; in glasses, it can be huge as the viscosity is high; and in biomolecular simulations, it will also depend on the level of coarse-graining and will be different in fully atomistic or very coarse-grained systems.

The training effort will also depend on the input data size D and on the complexity of the relationship you are trying to learn, which change the number of hidden layers and nodes per layer. For example, suppose you are tracking a particle (a side atom on a molecule in a typical nanoscale simulation) in order to come up with a metric (e.g. the distance between two side atoms on different molecules) to track the diversity of clusters of particles during the self-assembly process. This comes from the expectation that correlations between side atoms may be critical to a macroscopic property (such as the formation of these particles into an FCC crystal). In this case your D is huge, your ML objectives may be looking for a deep relationship, and you may have to invoke an ensemble of ANNs; this will change the hardware needs.
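For concreteness, a Keras network of the scale described in the case-study bullets above might look as follows; the 5 inputs and 3 regression outputs follow [26], the hidden-layer widths follow the [9] bullets, and the exact architectures in those papers may differ.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Illustrative surrogate mixing details from the two case studies.
    surrogate = tf.keras.Sequential([
        layers.Dense(30, activation="relu", input_shape=(5,)),
        layers.Dense(48, activation="relu"),
        layers.Dense(3),      # e.g. contact, peak, and center densities
    ])
    surrogate.compile(optimizer="adam", loss="mse")
    # surrogate.fit(X_sim, Y_sim, validation_split=0.3)
    # One surrogate.predict call then stands in for a multi-hour MD run.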
The relative values will Nlookup even vary over execution time of the application, as the no machine learning and in the limit of large Ntrain becomes Tseq amount of data generated as a ratio of training data will Tlookup which can be huge! vary. This requires runtime systems that are capable of There are many caveats and assumptions here. We are real-time performance tuning and adaptive execution for considering a simple case where one runs the Ntrain sim- workloads comprised of multiple heterogeneous tasks. ulations, followed by the learning and then all the Nlookup 9) The application of these ideas to statistical physics inferences. Further we assume the training simulations are problems may need different techniques than those used useful results and not just overhead. We also have not properly in deterministic time evolutions. considered how to build in the likelihood that training, learning 10) The existing UQ frameworks based on the dropout tech- and lookup phases are probably using different hardware nique can provide the level of certainty as a probabilistic configurations with different node counts. distribution in the prediction space. However, it does not always mean that the quality of the distribution is E. Opportunities and Research Issues dependent on the quality/quantity of data. For example, Research Issues: In addition to the six categories at the two models with different dropout rates can produce interface of ML and HPC, the research issues we identify different UQ results. If the goal of UQ in MLaroundHPC reflect the multiple interdisciplinary activities linked in our context is to supply only an adequate amount of data, study of MLforHPC, including application domains described we need a more reliable UQ method tailored for this in sections II-A, II-B, II-C1 and II-C2, as well as coarse purpose rather than the dropout technique that tends to graining studied in our case for network science and nano- manipulate the architecture of the model. bio areas. 11) Application agnostic description and defintion of effec- We have identified the following research areas, which can tive performance enhancement. be categorized into Algorithms and Methods (1-5), Applied Math (10), Software Systems (6,7), Performance Measurement C ONCLUSIONS and Engineering (8,11). Broken Abstractions, New Abstractions: In traditional 1) Where can application domains use MLaroundHPC and HPC the prevailing orthodoxy is Faster is Better has driven MLautotuning effectively and what science is enabled the quest for abstractions of hierarchical parallelism to speed- by this ing up single units of works. Relinquishing the orthodoxy 2) Which ML and DL approaches are most relevant and based upon hierarchical (vertical) parallelism as the only how can they be set up to enable broad user-friendly route to performance is necessary. The new paradigm in HPC MLaroundHPC and MLautotuning in domain science — Learning Everywhere, implies new performance, scaling 3) How can Uncertainty Quantification be enabled and and execution approaches. In this new paradigm, multiple, separately study ergodicity (bias) and accuracy issues? concurrent heterogeneous units of work replace single large 4) Is there new area of algorithmic research focusing on units of works, which thus require both hierarchical (vertical) finding algorithms that can be most effectively learnt? parallelism as well horizontal (many task) parallelism. 5) Is there a general multiscale approach using MLaroundHPC. 
E. Opportunities and Research Issues

Research Issues: In addition to the six categories at the interface of ML and HPC, the research issues we identify reflect the multiple interdisciplinary activities linked in our study of MLforHPC, including the application domains described in sections II-A, II-B, II-C1 and II-C2, as well as the coarse-graining studied in our case for the network science and nano-bio areas. We have identified the following research areas, which can be categorized into Algorithms and Methods (1-5), Applied Math (10), Software Systems (6, 7), and Performance Measurement and Engineering (8, 11).
1) Where can application domains use MLaroundHPC and MLautotuning effectively, and what science is enabled by this?
2) Which ML and DL approaches are most relevant, and how can they be set up to enable broad, user-friendly MLaroundHPC and MLautotuning in domain science?
3) How can Uncertainty Quantification be enabled, and how can ergodicity (bias) and accuracy issues be studied separately?
4) Is there a new area of algorithmic research focusing on finding algorithms that can be most effectively learnt?
5) Is there a general multiscale approach using MLaroundHPC?
6) What are appropriate systems frameworks for MLaroundHPC and MLautotuning? For example, should we wrap microservices invoked by a Function as a Service environment? Where and how should we enable learning systems? Is Dataflow useful?
7) The different characters of surrogate and real executions produce system challenges, as surrogate execution is much faster and invokes distinct software and hardware. This heterogeneity gives challenges for parallel computing, workload management and resource scheduling (heterogeneous and dynamic workflows). The implication for performance is briefly discussed in sections III-A and III-D.
8) Scaling applications that are composed of multiple heterogeneous computational (execution) units and have distinct forms of parallelism that need balanced performance. Consider a workload comprised of NL learning units and NS simulation units. The relative number of learning units to simulation units will vary with application and problem type. The relative values will even vary over the execution time of the application, as the amount of data generated as a ratio of training data will vary. This requires runtime systems that are capable of real-time performance tuning and adaptive execution for workloads comprised of multiple heterogeneous tasks.
9) The application of these ideas to statistical physics problems may need different techniques than those used in deterministic time evolutions.
10) The existing UQ frameworks based on the dropout technique can provide the level of certainty as a probabilistic distribution in the prediction space. However, the quality of that distribution is not always dependent on the quality or quantity of the data; for example, two models with different dropout rates can produce different UQ results. If the goal of UQ in the MLaroundHPC context is to supply only an adequate amount of data, we need a more reliable UQ method tailored for this purpose, rather than the dropout technique, which tends to manipulate the architecture of the model.
11) Application-agnostic description and definition of effective performance enhancement.

CONCLUSIONS

Broken Abstractions, New Abstractions: In traditional HPC, the prevailing orthodoxy that Faster is Better has driven the quest for abstractions of hierarchical parallelism to speed up single units of work. Relinquishing the orthodoxy based upon hierarchical (vertical) parallelism as the only route to performance is necessary. The new paradigm in HPC, Learning Everywhere, implies new performance, scaling and execution approaches. In this new paradigm, multiple concurrent heterogeneous units of work replace single large units of work, which thus requires both hierarchical (vertical) parallelism as well as horizontal (many-task) parallelism.

ACKNOWLEDGMENTS

This work was partially supported by NSF CIF21 DIBBS 1443054 and nanoBIO 1720625; the Indiana University Precision Health initiative; and Intel, through the Parallel Computing Center at Indiana University. JPS and JAG were partially supported by NSF 1720625, NIH U01 GM111243 and NIH GM122424. SJ was partially supported by ExaLearn, a DOE Exascale Computing project.

REFERENCES

[1] Geoffrey Fox, James A. Glazier, JCS Kadupitiya, Vikram Jadhao, Minje Kim, Judy Qiu, James P. Sluka, Endre Somogyi, Madhav Marathe, Abhijin Adiga, Jiangzhuo Chen, Oliver Beckstein, and Shantenu Jha. Learning Everywhere: Pervasive machine learning for effective High-Performance computation: Application background. Technical report, Indiana University, February 2019. http://dsc.soic.indiana.edu/publications/Learning Everywhere.pdf.
[2] Geoffrey Fox, Judy Qiu, Shantenu Jha, Saliya Ekanayake, and Supun Kamburugamuve. Big data, simulations and HPC convergence. In Springer Lecture Notes in Computer Science LNCS 10044, 2016.
[3] Peter M Kasson and Shantenu Jha. Adaptive ensemble simulations of biomolecules. Current Opinion in Structural Biology, 52:87–94, 2018.
[4] NSF1849625 workshop series BDEC2: Toward a common digital continuum platform for big data and extreme-scale computing (BDEC2). https://www.exascale.org/bdec/.
[5] José Miguel Hernández-Lobato, Klaus-Robert Müller, Brooks Paige, Matt J. Kusner, Stefan Chmiela, and Kristof T. Schütt. Machine learning for molecules and materials. In Proceedings of the NeurIPS 2018 Workshop, December 2018.
[6] Oliver Beckstein, Geoffrey Fox, Judy Qiu, David Crandall, Gregor von Laszewski, John Paden, Shantenu Jha, Fusheng Wang, Madhav Marathe, Anil Vullikanti, and Thomas Cheatham. Contributions to High-Performance big data computing. Technical report, Digital Science Center, September 2018.
[7] Jeff Dean. Machine learning for systems and systems for machine learning. Presentation at the 2017 Conference on Neural Information Processing Systems, 2017.
[8] Satoshi Matsuoka. Post-K: A game changing supercomputer for convergence of HPC and big data / AI. Multicore 2019, February 2019.
[9] JCS Kadupitiya, Geoffrey C. Fox, and Vikram Jadhao. Machine learning for parameter auto-tuning in molecular dynamics simulations: Efficient dynamics of ions near polarizable nanoparticles. Technical report, Indiana University, November 2018.
[10] Microsoft Research. AI for database and data analytic systems at Microsoft Faculty Summit. https://youtu.be/Tkl6ERLWAbA, 2018. Accessed: 2019-1-29.
[11] Microsoft Research. AI for AI systems at Microsoft Faculty Summit. https://youtu.be/MqBOuoLflpU, 2018. Accessed: 2019-1-29.
[12] Francis J. Alexander and Shantenu Jha. Objective driven computational experiment design: An ExaLearn perspective. In Terry Moore and Geoffrey Fox, editors, Online Resource for Big Data and Extreme-Scale Computing Workshop, November 2018.
[13] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (part I): Data-driven solutions of nonlinear partial differential equations. arXiv, November 2017.
[14] Mustafa Mustafa, Deborah Bard, Wahid Bhimji, Rami Al-Rfou, and Zarija Lukić. Creating virtual universes using generative adversarial networks. Technical report, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, June 2017.
[15] Shing Chan and Ahmed H Elsheikh. A machine learning approach for efficient uncertainty quantification using multiscale methods. J. Comput. Phys., 354:493–511, February 2018.
[16] A Townsend Peterson, Monica Papeş, and Jorge Soberón. Mechanistic and correlative models of ecological niches. European Journal of Ecology, 1(2):28–38, 2015.
[17] Anuj Karpatne, Gowtham Atluri, James H. Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, and Vipin Kumar. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data Engineering, 29(10):2318–2331, 2017.
[18] Mark EJ Newman. Spread of epidemic disease on networks. Physical Review E, 66(1):016128, 2002.
[19] Lijing Wang, Jiangzhuo Chen, and Madhav Marathe. DEFSI: Deep learning based epidemic forecasting with synthetic information. In Proceedings of the 30th Innovative Applications of Artificial Intelligence (IAAI), 2019.
[20] James M Osborne, Alexander G Fletcher, Joe M Pitt-Francis, Philip K Maini, and David J Gavaghan. Comparing individual-based approaches to modelling the self-organization of multicellular tissues. PLoS Comput. Biol., 13(2):e1005387, February 2017.
[21] James P. Sluka, Xiao Fu, Maciej Swat, Julio M. Belmonte, Alin Cosmanescu, Sherry G. Clendenon, John F. Wambaugh, and James A. Glazier. A liver-centric multiscale modeling framework for xenobiotics. PLoS ONE, 11(9), 2016.
[22] Qianxiao Li, Felix Dietrich, Erik M. Bollt, and Ioannis G. Kevrekidis. Extended dynamic mode decomposition with dictionary learning: A data-driven adaptive spectral decomposition of the Koopman operator. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(10):103111, 2017.
[23] Adam A Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo Dalla Favera, and Andrea Califano. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7 Suppl 1:S7, March 2006.
[24] Liang Liang, Minliang Liu, Caitlin Martin, John A Elefteriades, and Wei Sun. A machine learning approach to investigate the relationship between shape features and numerically predicted risk of ascending aortic aneurysm. Biomech. Model. Mechanobiol., 16(5):1519–1533, October 2017.
[25] Jos W. Zwanikken and Monica Olvera de la Cruz. Tunable soft structure in charged fluids confined by dielectric interfaces. Proceedings of the National Academy of Sciences, 110(14):5301–5308, 2013.
[26] JCS Kadupitiya, Geoffrey C. Fox, and Vikram Jadhao. Machine learning for performance enhancement of molecular dynamics simulations. Technical report, Indiana University, December 2018.
[27] Adrià Pérez, Gerard Martínez-Rosell, and Gianni De Fabritiis. Simulations meet machine learning in structural biology. Current Opinion in Structural Biology, 49:139–144, 2018.
[28] Keith T Butler, Daniel W Davies, Hugh Cartwright, Olexandr Isayev, and Aron Walsh. Machine learning for molecular and materials science. Nature, 559(7715):547–555, July 2018.
[29] Stefano Piana, John L Klepeis, and David E Shaw. Assessing the accuracy of physical models used in protein-folding simulations: quantitative evidence from long molecular dynamics simulations. Curr Opin Struct Biol, 24:98–105, February 2014.
[30] Jörg Behler and Michele Parrinello. Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces. Physical Review Letters, 98(14):146401, April 2007.
[31] Jörg Behler. First Principles Neural Network Potentials for Reactive Simulations of Large Molecular and Condensed Systems. Angewandte Chemie International Edition, 56(42):12828–12840, 2017.
[32] Michael Gastegger, Jörg Behler, and Philipp Marquetand. Machine learning molecular dynamics for the simulation of infrared spectra. Chemical Science, 8(10):6924–6935, 2017.
[33] J. S. Smith, O. Isayev, and A. E. Roitberg. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chemical Science, 8(4):3192–3203, 2017.
[34] Justin S. Smith, Ben Nebgen, Nicholas Lubbers, Olexandr Isayev, and Adrian E. Roitberg. Less is more: Sampling chemical space with active learning. The Journal of Chemical Physics, 148(24):241733, May 2018.
[35] Jiang Wang, Christoph Wehmeyer, Frank Noé, and Cecilia Clementi. Machine learning of coarse-grained molecular dynamics force fields. arXiv, 1812.01736v2, 2018.
[36] Pedro E. M. Lopes, Olgun Guvench, and Alexander D. MacKerell. Current Status of Protein Force Fields for Molecular Dynamics Simulations, pages 47–71. Springer New York, New York, NY, 2015.
[37] MLPerf benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms. https://mlperf.org/. Accessed: 2019-2-8.
[38] Uber Engineering. Horovod: Uber's open source distributed deep learning framework for TensorFlow. https://eng.uber.com/horovod/. Accessed: 2019-2-8.
[39] Intel Parallel Computing Center at Indiana University, led by Judy Qiu. http://ipcc.soic.iu.edu/. Accessed: 2018-9-30.
[40] Bingjing Zhang, Bo Peng, and Judy Qiu. Parallelizing big data machine learning applications with model rotation. In G. Fox, V. Getov, L. Grandinetti, G.R. Joubert, and T. Sterling, editors, Advances in Parallel Computing: New Frontiers in High Performance Computing and Big Data. IOS Press, 2017.
[41] Judy Qiu. Harp-DAAL for High-Performance big data computing. Intel Parallel Universe, issue 32, page 31. https://software.intel.com/sites/default/files/parallel-universe-issue-32.pdf, http://dsc.soic.indiana.edu/publications/Intel-Magazine-HarpDAAL10.pdf. Accessed: 2018-9-30.
[42] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8609–8613. IEEE, 2013.
[43] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.