MOLECULAR PHYSICS 2020, VOL. 118, NO. 5, e1737742
https://doi.org/10.1080/00268976.2020.1737742

RESEARCH ARTICLE

Machine learning for collective variable discovery and enhanced sampling in biomolecular simulation

Hythem Sidky (a), Wei Chen (b) and Andrew L. Ferguson (a)

(a) Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL, USA; (b) Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, USA

ABSTRACT: Classical molecular dynamics simulates the time evolution of molecular systems through the phase space spanned by the positions and velocities of the constituent atoms. Molecular-level thermodynamic, kinetic, and structural data extracted from the resulting trajectories provide valuable information for the understanding, engineering, and design of biological and molecular materials. The cost of simulating many-body atomic systems makes simulations of large molecules prohibitively expensive, and the high-dimensionality of the resulting trajectories presents a challenge for analysis. Driven by advances in algorithms, hardware, and data availability, there has been a flare of interest in recent years in the applications of machine learning – especially deep learning – to molecular simulation. These techniques have demonstrated great power and flexibility in both extracting mechanistic understanding of the important nonlinear collective variables governing the dynamics of a molecular system, and in furnishing good low-dimensional system representations with which to perform enhanced sampling or develop long-timescale dynamical models. It is the purpose of this article to introduce the key machine learning approaches, describe how they are married with statistical mechanical theory into domain-specific tools, and detail applications of these approaches in understanding and accelerating biomolecular simulation.

ARTICLE HISTORY: Received 12 December 2019; Accepted 21 February 2020

KEYWORDS: machine learning; molecular simulation; deep learning; enhanced sampling; collective variables

CONTACT: Andrew L. Ferguson, [email protected], Pritzker School of Molecular Engineering, University of Chicago, 5640 South Ellis Avenue, Chicago, IL 60637, USA

1. Introduction

Classical molecular dynamics (MD) simulation is a workhorse tool for the study of molecular and atomic systems to understand and predict their behaviour by integrating Newton's equations of motion at the molecular scale [1,2].
The essence of the technique is to simulate the dynamical evolution of a molecular system through its phase space spanned by the atomic positions and velocities under a Hamiltonian defining the many-body interaction potential. Analysis of the resulting simulation trajectories provides a means to estimate the structural, thermodynamic, and dynamical properties of the system. Performing a molecular dynamics simulation requires three chief ingredients: an initial system configuration, an
interaction potential, and a means to integrate the classical equations of motion. This approach was anticipated in 1812 by Pierre Simon de Laplace's Gedankenexperiment that posited 'an intelligence which could comprehend all the forces by which nature is animated and the respective positions of the beings which compose it, if moreover this intelligence were vast enough to submit these data to analysis . . . to it nothing would be uncertain, and the future as the past would be present to its eyes' [3]. Alder and Wainwright were the first to realise Laplace's 'clockwork universe' in 1957 through their pioneering molecular dynamics simulations employing state-of-the-art computers and simulation algorithms to approximate the role of the 'all-seeing intelligence' [4,5]. Modern advances in computational hardware and software and force fields constructed from quantum mechanical calculations and precise experimental measurements have enabled simulations of systems of billions [6,7] and even trillions [8] of atoms. However, validated force fields for arbitrary materials and conditions are still lacking, and the inherently serial nature of numerical integration and the requirement for short time steps on the order of femtoseconds to preserve numerical stability have largely limited simulations of non-trivial systems to millisecond time scales [9–11]. Karplus and Petsko elegantly articulated these deficiencies in their 1990 article with their assertion holding equally true today [12]: 'Two limitations in existing simulations are the approximations in the potential energy functions and the lengths of the simulations.
The first introduces systematic errors and the second statistical errors.' The continued success of MD is critically contingent on progress on both of these fronts and each is an important and active area of research in the field. The present review considers recent advances enabled by machine learning in general, and deep learning in particular, in engaging the second of these challenges.

The statistical error in structural, thermodynamic, and kinetic properties is fundamentally a sampling problem. Simulation trajectories furnished by standard MD do not offer sufficiently comprehensive sampling of the states or events of interest to provide robust estimations of the properties of interest [11,12]. Proper sampling of the relevant states and transition rates is critical for the success of biomolecular simulations in applications including identification of the native and metastable states of a protein, resolution of protein binding pockets and association free energies of ligands and drugs, prediction of the permeability of membrane modulating peptides, understanding of the mechanisms of protein allostery, prediction of the stable structures and aggregation pathways of self-assembling peptides, and modelling of the activation pathways and kinetics of membrane proteins. Enhanced sampling techniques presently engage this challenge with approaches that fall largely into one of four classes [13–20]: (I) path sampling techniques that efficiently sample reactive pathways between two pre-defined states; (II) tempering or generalised ensemble approaches that modify the system Hamiltonian to lower barrier heights and improve sampling of configurational space; (III) decomposition techniques that break the (configurational) phase space of the system into a number of disjoint metastable states and construct a kinetic model for the dynamical transitions between these states; and (IV) collective variable (CV) biasing techniques that accelerate sampling and barrier hopping along pre-specified order parameters.

The first class of approaches – path sampling – focuses on sampling the interconversion pathways between two defined states of interest, making it less well suited to the global exploration of a previously uncharted configurational space. Recent work by Bolhuis and co-workers combining path reweighting with transition path sampling has, however, demonstrated a means to estimate the underlying free energy surface in the vicinity of the barrier and terminal states [21].

The second class of approaches – tempering – is well suited to systems for which there is very little prior knowledge as to which collective variables are most important in governing the system dynamics, but suffers from the drawback that much computational effort is expended sampling modified Hamiltonians that are generally not of direct interest but serve only to support improved sampling [13].

The third class of approaches – discrete kinetic models – requires the definition of a partitioning of phase space into a set of disjoint metastable states and therefore requires sampling of these thermally-relevant configurations. As such, methods from the second or fourth classes of techniques are profitably employed to efficiently sample the configurational space rather than relying upon the exploration provided by unbiased simulations.

The fourth class of approaches – collective variable biasing – appears to suffer from the deficiency that it presupposes the availability of 'good' CVs along which to drive sampling. We define 'good' in the sense that driving sampling along these CVs leads to lower variance estimators of the structural, thermodynamic, or kinetic properties of interest [22–25]. As such, these CVs should typically be coincident with or closely related to the important dynamical motions of the system and drive sampling over free energy barriers connecting thermally relevant states that would rarely be surmounted in unbiased simulations. For all but the simplest systems it is not possible to intuit good CVs [26,27], and accelerating bad CVs that are irrelevant to the important molecular motions can lead to poorer sampling than standard unbiased MD. For this reason, the development of techniques to determine good CVs for enhanced sampling is of 'paramount concern in the continued evolution of such methods' [13]. In 2018 we published a review of nonlinear machine learning approaches for data-driven CV discovery [16]. It is the purpose of the present review to provide an update to this fast-moving field and illuminate some recent advances in employing tools from machine learning – deep learning in particular – for CV discovery and enhanced sampling. We also direct the interested reader to a number of other recent reviews of machine learning in molecular simulation [28], soft materials engineering [29,30], materials science [31], collective variable identification [32], and enhanced sampling [13].

The structure of this review is as follows. In Section 2, we present a brief survey of some of the most prevalent and powerful machine learning techniques that have found broad adoption within the molecular simulation community. Building upon these fundamentals, in Section 3 we detail recent advances in CV discovery and enhanced sampling enabled by these machine learning tools. We focus our discussion upon biomolecular simulations, and in particular protein folding, where many of these developments and successes have been demonstrated. We will largely focus on all-atom simulations where the sampling problem is most severe, but all techniques discussed may be equally well applied to coarse-grained calculations. Finally, in Section 4 we present our outlook upon emerging challenges and opportunities for the field.
2. Survey of popular machine learning techniques for CV discovery

The data-driven CVs sought for enhanced sampling are those which provide improved statistical estimates of the properties of interest. Typically, these CVs are correlated with the highest-variance or slowest-evolving collective degrees of freedom, and therefore can also provide molecular-level insight and understanding of system properties and behaviour. In principle, enhanced sampling could be conducted in all possible combinations of CVs and those which provide the statistically optimal estimates of the property we seek to estimate declared the 'best'. Of course the enormous computational cost associated with a blind search in the combinatorial space of all possible CVs entirely defeats the purpose of enhanced sampling to provide efficient and accelerated property estimation. Accordingly, a constructive criterion by which to define and determine a 'good' CV is required [33–35]. Putative CVs can be considered order parameters spanning a reduced-dimensional subspace of the molecular configurational space. The quality of the subspace defined by these CVs is frequently scored according to one of two common metrics: high-variance CVs parameterise a subspace that maximally preserves the configurational variance contained within a molecular simulation trajectory [36–38], whereas slow CVs (i.e. highly autocorrelated CVs) span a subspace that maximally preserves the kinetic content.

High-variance CV discovery is more straightforward and amenable to a wide array of established machine learning and dimensionality reduction techniques. Data-driven discovery of these CVs takes simulation trajectories as its input, and it is typically possible to apply these techniques to non-time-ordered data and data generated by biased sampling where the thermodynamic bias can be exactly cancelled by thermodynamic reweighting [39]. Conceptually, these techniques can be thought of as identifying and parameterising a low-dimensional subspace within the high-dimensional configurational phase space to which the simulation data are approximately restrained [40,41]. We note here the apparent 'chicken and egg' problem wherein CV discovery requires simulation trajectories that provide good sampling of the thermally relevant phase space, whereas the generation of such trajectories requires enhanced sampling in good CVs [32]. The solution, of course, is to iterate between rounds of CV discovery and enhanced sampling until convergence of the CVs and phase space exploration [32,36,42,43].

The application of machine learning for high-variance CV discovery was pioneered through the use of linear dimensionality reduction tools such as principal component analysis (PCA) and multidimensional scaling (MDS) [44,45]. However, the inherent linearity of these approaches limited their capabilities in identifying the important nonlinear CVs characteristic of complex molecular systems. In more recent years, nonlinear dimensionality reduction and manifold learning techniques have been employed, including locally linear embedding (LLE) [46,47], Isomap [48–51], local tangent space alignment [52], Hessian eigenmaps [53], Laplacian eigenmaps [54], diffusion maps (DMAPS) [40,43,55–57], sketch maps [37,58,59] and t-distributed Stochastic Neighbor Embedding (t-SNE) [60]. These more powerful techniques have largely superseded linear approaches but do tend to suffer from the absence of an explicit functional mapping of atomic coordinates to the CVs, which can present challenges in interpretability and implementing biased sampling [36,42,43,61,62]. Deep learning techniques based on artificial neural networks have recently emerged as a means to discover high-variance nonlinear CVs that are equipped with explicit and differentiable functional mappings to the atomic coordinates [36,63].
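To ground the preceding discussion, the following minimal NumPy sketch illustrates the simplest high-variance CV discovery technique, PCA, applied to a featurized trajectory. The function name, array shapes, and data are illustrative assumptions and not code from any of the cited works.

```python
import numpy as np

def pca_cvs(X, k=2):
    """High-variance linear CVs via PCA: project a featurized trajectory
    X (n_frames x n_features) onto the top-k eigenvectors of its covariance."""
    Xc = X - X.mean(axis=0)                      # mean-centre the features
    C = Xc.T @ Xc / (len(Xc) - 1)                # feature covariance matrix
    evals, evecs = np.linalg.eigh(C)
    W = evecs[:, np.argsort(evals)[::-1][:k]]    # leading principal components
    return Xc @ W                                # projections serve as the CVs

# Usage on placeholder trajectory data.
rng = np.random.default_rng(0)
cvs = pca_cvs(rng.normal(size=(1000, 30)), k=2)
```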
Slow CV discovery tends to be more challenging and approachable with a narrower class of machine learning tools. These approaches are also more restrictive in that they typically require (long) time-ordered trajectories that have been propagated under the unbiased system Hamiltonian. Depending on the particular dynamical propagator that is implemented, approaches do exist to relax the requirement for unbiased trajectories by performing dynamical reweighting of biased simulation trajectories [64–69]. Conceptually, these approaches seek linear or nonlinear functions of the configurational coordinates that are maximally autocorrelated and therefore parameterise the slowest-evolving molecular motions. In general, these approaches owe their mathematical foundations to the properties of the transfer operator (a.k.a. Perron-Frobenius operator or propagator) or its adjoint the Koopman operator [70–82] and associated variational principles such as the variational approach to conformational dynamics (VAC) [83,84] or the variational approach to Markov processes (VAMP) [79]. Classical techniques for slow CV discovery include time-lagged independent component analysis (TICA) [85,86], kernel TICA (kTICA) [87,88], dynamical mode decomposition (DMD) [73,89–95], extended dynamical mode decomposition (EDMD) [77,96,97], canonical correlation analysis (CCA) [79,98], Markov state models (MSMs) [19,20], and Ulam's method [70,99–102]. More recent approaches based on deep learning include deep CCA [103], variational approach to Markov processes networks (VAMPnets) [80,104], state-free reversible VAMPnets (SRVs) [105,106], time-lagged autoencoders (TAEs) [107,108], and variational dynamics encoders (VDEs) [108–110].

In the remainder of this section we survey four of the most popular machine learning techniques – ANNs, DMAPS, MSMs, and TICA – that serve as the foundations for many recent methodological developments in high-variance and slow CV discovery and enhanced sampling that we discuss in Section 3.

2.1. Artificial neural networks (ANNs)

Artificial neural networks (ANNs) are collections of activation functions, or neurons, which are composited together into layers in order to approximate a given function of interest [111]. Their utility and power can be largely attributed to the universal approximation theorem [112,113], which states that, under mild assumptions, there exists a finite-size neural network that is capable of approximating any continuous function to arbitrary precision. In a fully-connected ANN, the neurons in each layer take as their inputs the outputs from the previous layer, apply a nonlinear activation function, and pass on their outputs to the next layer.

Figure 1. Schematic diagram of a three-layer fully-connected feed-forward neural network. The output of neuron $i$ from layer $k$ is denoted $y_i^k$ and the bias node for layer $k$ is denoted $b^k$. The arrows connecting pairs of neurons are the trainable weights $w_{ji}$. The output of each layer is computed from a weighted sum of the outputs of the previous layer passed through a nonlinear activation function (Equation (1)). (Image constructed using code downloaded from http://www.texample.net/tikz/examples/neural-network with the permission of the author Kjell Magne Fauske.)
A schematic diagram of a three-layer feed-forward fully-connected neural network is presented in Figure 1. Mathematically, the output $y_i^k$ from neuron $i$ of fully connected layer $k$ is given by,

$$ y_i^k = f\Big( \sum_{j=1}^{N} w_{ji}^k\, y_j^{k-1} + b_i^k \Big), \qquad (1) $$

where $w_{ji}^k$ and $b_i^k$ define the layer weights and biases, respectively. The activation function $f(x)$ is an arbitrary nonlinear function but is often taken to be $\tanh(x)$ or some form of rectified linear unit (ReLU) and is applied element-wise to the input. ANNs are typically trained by minimising an objective function (also called a loss function) using some variant of stochastic gradient descent through a process known as backpropagation [114–116].
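To make Equation (1) concrete, here is a minimal NumPy sketch of a feed-forward pass through a fully-connected network. The layer sizes and random weights are placeholders; in practice the weights would be fitted by backpropagation rather than drawn at random.

```python
import numpy as np

def dense_layer(y_prev, W, b, f=np.tanh):
    """One fully-connected layer, Equation (1):
    y_i^k = f( sum_j w_ji^k y_j^{k-1} + b_i^k )."""
    return f(W.T @ y_prev + b)

# Three-layer feed-forward pass with random (untrained) weights.
rng = np.random.default_rng(0)
sizes = [10, 8, 8, 2]                      # input dim, two hidden layers, output dim
params = [(rng.normal(size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

y = rng.normal(size=sizes[0])              # e.g. a featurized molecular configuration
for W, b in params:
    y = dense_layer(y, W, b)               # y now holds the network output
```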
Many of the advances in deep learning have been driven by novel network topologies, activation functions, and loss functions adapted to particular tasks. For example, convolutional neural networks capture spatial invariance of local features, a useful feature for image analysis [117–119]. Autoassociative neural networks perform non-linear dimensionality reduction [120]. Generative models, such as variational autoencoders [121] and generative adversarial networks (GANs) [122], are capable of synthesising new, unobserved examples that resemble existing training data. In applications to molecular systems, ANNs have been used to build biasing potentials for enhanced sampling [123–127], fit ab initio potential energy surfaces [128,129] and determine quantum mechanical forces in MD simulations [130], perform coarse graining [131], and generate realistic molecular configurations [132–134]. To highlight a few specific examples, PotentialNet is a novel neural network architecture which uses graph convolutions to encode molecular structures, accommodating permutation invariance and molecular symmetries [135]. SchNet is a variant of a deep tensor neural network that eliminates rotational, translational, and permutational atomic symmetries by construction and has been used to fit molecular potential energy landscapes and molecular force fields [136]. PointNet [137] is a network designed to ingest and process point cloud data for object classification and part segmentation that eliminates permutational invariances by max pooling, and which recently found applications in local molecular structure analysis and crystal structure classification [138]. CGnets learn free energy functions and force fields for coarse-grained molecular representations by fitting against all-atom force data [131]. Boltzmann Generators (BG) employ a synthesis of deep learning, normalising flows, and statistical mechanics to train an invertible network capable of efficiently sampling molecular configurations from the equilibrium distribution [132]. As we will see below, many cutting-edge ML approaches rely upon some form of ANN, and, with increasing frequency, deep neural networks (DNNs) comprising many hidden layers.

We note that Boltzmann Generators [132] represent a particularly promising and powerful enhanced sampling technique for molecular systems, and although they do not inherently rely upon the discovery or definition of CVs for their operation we identify strong synergies between these techniques. First, training of BGs generally requires a number of examples of molecular structures from metastable states of the system, and CV enhanced sampling techniques may be used to efficiently furnish these training examples starting from nothing more than a single structure and a molecular force field. Second, since BGs can efficiently sample and estimate free energy differences between distantly separated states of the molecular system, they may be used to efficiently generate physically realistic transition pathways between metastable states identified by CV enhanced sampling. Third, one mode of BG deployment augments the network loss function with a 'reaction-coordinate loss' to promote sampling along a particular direction in phase space; CV discovery techniques can identify good reaction coordinates linking important metastable states of the system. Fourth, CV discovery and enhanced sampling may be conducted within the BG latent space to augment the power of BGs to explore previously unsampled regions of configurational space through the invertible transformation to molecular coordinates.

2.2. Diffusion maps (DMAPS)

Diffusion maps are a dimensionality reduction technique originally proposed by Coifman and Lafon that performs nonlinear dimensionality reduction by harmonic analysis of a discrete diffusion process (random walk) constructed over a high-dimensional dataset [56,139]. The first application of DMAPS to molecular simulations demonstrated its capacity to extract dynamically-relevant collective molecular motions [40], and it has since seen widespread adoption as a method for the analysis of molecular trajectories [27,57,140] and as a component of adaptive biasing methods [42,43,55,62]. Mathematically, DMAPS construct a random walk over the space of molecular configurations recorded over the course of a molecular simulation, which, in the continuum limit, can be shown to correspond to a Fokker-Planck (FP) diffusion process in the presence of potential wells [141]. The leading eigenvectors of the Markov matrix describing the dynamics of the discrete random walk approximate the leading eigenfunctions of the associated backward FP operator describing the most slowly relaxing modes of the diffusion process [139]. The algorithm proceeds by constructing a kernel matrix $K$ defined as,

$$ K_{ij} = \exp\left( -\frac{d(i,j)^2}{2\varepsilon^2} \right), \qquad (2) $$

where $i$ and $j$ index over molecular configurations, $d(i,j)$ is a user-defined distance metric such as the translationally and rotationally aligned root mean squared deviation (RMSD) between atomic coordinates, and $\varepsilon$ is the user-defined kernel bandwidth which represents the characteristic step size of the random walk over the data [41]. After row-normalising the kernel matrix to conserve hopping probabilities, a spectral decomposition gives eigenvector/eigenvalue pairs that are truncated at a gap in the eigenvalue spectrum.
The resultant top $k$ eigenvectors define the CVs spanning the low-dimensional embedding and parameterise the intrinsic manifold upon which the diffusion process is effectively restrained. The naïve implementation of DMAPS scales quadratically in the number of data points, and so variants with reduced memory and computational costs have been developed, including landmark diffusion maps (L-DMAPS) [142] and pivot diffusion maps (P-DMAPS) [143].
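The DMAPS construction above (Equation (2), row-normalisation, and spectral decomposition) can be sketched compactly in NumPy. Plain Euclidean distance stands in here for the aligned-RMSD metric, and the bandwidth and data are illustrative assumptions.

```python
import numpy as np

def diffusion_map(X, eps, k=2):
    """Minimal diffusion map: Gaussian kernel (Equation (2)),
    row-normalisation to a Markov matrix, and spectral decomposition."""
    # Pairwise squared Euclidean distances stand in for d(i, j)^2.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * eps ** 2))
    M = K / K.sum(axis=1, keepdims=True)       # row-stochastic Markov matrix
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(-evals.real)            # sort by decreasing eigenvalue
    # Skip the trivial constant eigenvector (eigenvalue 1); keep the next k.
    return evecs.real[:, order[1:k + 1]]

# Usage: embed 500 points from a placeholder high-dimensional trajectory.
rng = np.random.default_rng(1)
cvs = diffusion_map(rng.normal(size=(500, 30)), eps=5.0, k=2)
```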
Although a powerful nonlinear dimensionality reduction technique, DMAPS possess at least two limitations in their applications to molecular systems. The first is the assumption of diffusive dynamics over the high-dimensional data, which may or may not be a good approximation of the true molecular dynamics. The second is the absence of an explicit mapping from the atomic coordinates to the low-dimensional CVs. As a result, out-of-sample extension to new data points outside of the training set requires the use of approximate interpolation techniques such as the Nyström extension, Laplacian pyramids, or kriging [144–146]. Further, although the existence of an explicit function mapping is no guarantee of interpretability (consider ANNs), its absence can frustrate interpretability of the CVs. A degree of interpretability can be recovered by correlating the DMAPS CVs with candidate physical variables [40,55], perhaps within an automated search procedure [26,147,148], by projecting representative molecular configurations over the low-dimensional embedding, or by visualising the collective modes in the high-dimensional space [149,150]. The absence of an explicit mapping also precludes the calculation of exact derivatives, which renders diffusion maps incompatible with enhanced sampling methods such as umbrella sampling [151] or metadynamics [22] that require the gradients of the collective variables with respect to the atomic coordinates.

2.3. Markov state models (MSMs)

Markov state models (MSMs) are a powerful framework to gain insight from molecular simulation trajectories and guide efficient simulations [20,152]. MSMs are widely used for studying many biomolecular processes including protein folding, protein association, ligand binding, and forging connections with experiment [153,154]. Constructing MSMs typically follows these steps: feature extraction from the molecular simulation trajectory; feature transformation, engineering, and elimination of symmetries (e.g. translation, rotation, permutation); projection of engineered features into a low-dimensional subspace; clustering low-dimensional projections of configurations into microstates; construction of a microstate transition matrix; coarse-graining into macrostates; and validation and analysis of the microstate and macrostate kinetic models for thermodynamic and dynamic properties. A schematic illustration of this pipeline is presented in Figure 2.

Figure 2. Schematic diagram of the Markov state model (MSM) construction and analysis pipeline. (a) Many short molecular dynamics trajectories are collected. (b) The snapshots constituting each trajectory are featurized, projected into a low-dimensional space, and clustered into microstates. Each frame in each trajectory is assigned to a microstate. For illustrative purposes, four microstates are considered and coloured green, blue, purple, and pink. (c) Counting the number of transitions between microstates furnishes the transition counts matrix. (d) Assuming the system is at equilibrium and therefore follows detailed balance, the count matrix is symmetrised and normalised to generate the reversible transition matrix defining the conditional transition probabilities between microstates. (e) The equilibrium distribution over microstates is furnished by the leading eigenvector of the transition probability matrix, here illustrated in a pie chart. States with greater populations are more thermodynamically stable. (f) The higher eigenvectors correspond to a hierarchy of increasingly fast dynamical relaxations over the microstates. The first of these possesses a negative entry corresponding to the green state and positive entries for the other states, therefore characterising the net transport of probability distribution out of the green microstate and into the blue, purple, and pink. If desired, the microstates can be further coarse grained into macrostates, typically by clustering of the microstate transition matrix. Image reprinted with permission from Ref. [20]. Copyright (2018) American Chemical Society.
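A toy version of the count-matrix and transition-matrix stages of this pipeline (panels (c)–(e) of Figure 2) can be sketched as follows. The naive symmetrisation shown stands in for the maximum-likelihood reversible estimators used in practice, and the discrete trajectory is synthetic.

```python
import numpy as np

def msm_from_discrete_traj(dtraj, n_states, lag=1):
    """Toy MSM estimation following Figure 2: count transitions at a lag
    time, symmetrise the counts to impose detailed balance, row-normalise,
    and extract the stationary distribution from the leading eigenvector."""
    C = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        C[i, j] += 1.0                       # (c) transition counts
    C = 0.5 * (C + C.T)                      # (d) naive detailed-balance symmetrisation
    T = C / C.sum(axis=1, keepdims=True)     #     reversible transition matrix
    evals, evecs = np.linalg.eig(T.T)
    pi = np.abs(evecs[:, np.argmax(evals.real)].real)
    return T, pi / pi.sum()                  # (e) equilibrium distribution

# Usage on a synthetic microstate trajectory with four states.
rng = np.random.default_rng(2)
dtraj = rng.integers(0, 4, size=10_000)
T, pi = msm_from_discrete_traj(dtraj, n_states=4, lag=10)
```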
There are many important aspects in each of these steps, a detailed discussion of which can be found in Refs. [20,152,155]. Here we observe a few key points. The chief advantage of MSMs in furnishing long-time kinetic models is that the simulation data required for their construction need only have reached local equilibrium, in the sense that the transition probabilities between neighbouring microstates are memoryless, allowing the MSM to be constructed exclusively from conditional probabilities that the system appears in state $j$ at time $(t + \tau)$ given that the system appears in state $i$ at time $t$ [154–157]. This is an extremely valuable property since it alleviates the need for globally equilibrated simulation trajectories that can be exceedingly expensive to generate. As such, MSMs can be constructed from multiple relatively short trajectories that can be performed in parallel and initialised adaptively to provide good sampling of all relevant transitions [155,157]. We also observe that many steps in the MSM construction pipeline profitably employ other machine learning methods. In particular, TICA, SRVs, and VDEs are frequently used in the featurization and dimensionality reduction steps [86,106,109] (see Sections 2.4 and 3.7), spectral clustering is employed to lump microstates into macrostates [158], maximum likelihood estimation is used to enforce reversibility [155], and active learning is used to adaptively direct sampling of undersampled microstate transitions [152]. VAMPnets and Deep Generative Markov State Models are two recently-proposed approaches that employ deep learning to replace some or all of the MSM parameterisation pipeline for the construction of discrete kinetic models. We discuss these approaches in Section 3.8.

2.4. Time-lagged independent component analysis (TICA)

Time-lagged independent component analysis (TICA) (also known as second-order ICA or time-structure based ICA, and equivalent to CCA employing time-lagged data for reversible processes [103,104,107]) is a linear dimensionality reduction method that takes as input a featurization of a molecular simulation trajectory and identifies maximally autocorrelated linear projections along which the dynamical evolution of the system relaxes most slowly [70,83–86,159–163]. This stands in contrast to PCA, which identifies linear projections along which the configurational variance in the simulation trajectory is maximal [44,45,164]. The leading TICA components can be interpreted, within the linear approximation, as the leading 'slow modes' whereas the PCA components are the leading 'high variance modes'. It can be shown that given a (possibly nonlinear) mean-zeroed featurization $\xi(x) = \{\xi_k(x)\}$ of the snapshots of a molecular simulation trajectory $x$ with frames recorded at a time interval $\tau$, the expansion coefficients $u$ defining the hierarchy of TICA modes given by the linear projections $\nu_i(x) = \sum_k u_{ik}\, \xi_k(x)$ follow from the solution of the following generalised eigenvalue problem [29,83–85],

$$ C^{\tau} U = C^{0} U \Lambda, \qquad (3) $$

where $\Lambda$ is a diagonal matrix of ordered eigenvalues $\{\lambda_i\}$ that rank order the corresponding eigenvectors according to an implied time scale $t_i = -\tau / \ln \lambda_i$, $C^0$ is the covariance matrix with elements $C^0_{ij} = \mathrm{E}[\xi_i(t)\xi_j(t)]$, $C^{\tau}$ is the time-lagged covariance matrix with elements $C^{\tau}_{ij} = \mathrm{E}[\xi_i(t)\xi_j(t+\tau)]$, and the columns of $U$ corresponding to the $\{u_i\}$ hold the expansion coefficients. The identification of slow modes is particularly important when we are interested in understanding or accelerating kinetic processes in molecular systems, for example in protein folding [10] or ligand binding [165].
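A minimal NumPy/SciPy sketch of the TICA generalised eigenvalue problem in Equation (3) follows. It assumes a mean-free featurized trajectory and symmetrises the time-lagged covariance, as appropriate for reversible dynamics; the data and lag time are placeholders.

```python
import numpy as np
from scipy.linalg import eigh

def tica(xi, lag):
    """Minimal TICA: solve C^tau U = C^0 U Lambda (Equation (3)) for a
    featurized trajectory xi of shape (n_frames, n_features)."""
    xi = xi - xi.mean(axis=0)                  # mean-zero the featurization
    x0, xt = xi[:-lag], xi[lag:]
    C0 = x0.T @ x0 / len(x0)                   # instantaneous covariance
    Ct = x0.T @ xt / len(x0)                   # time-lagged covariance
    Ct = 0.5 * (Ct + Ct.T)                     # symmetrise for reversible dynamics
    evals, U = eigh(Ct, C0)                    # generalised eigenvalue problem
    order = np.argsort(evals)[::-1]
    evals, U = evals[order], U[:, order]
    timescales = -lag / np.log(np.clip(evals, 1e-12, 1 - 1e-12))
    return U, timescales                       # slow-mode projections: xi @ U

# Usage on a placeholder featurized trajectory.
rng = np.random.default_rng(3)
U, ts = tica(rng.normal(size=(5000, 10)), lag=10)
```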
TICA is commonly used within the MSM pipeline to define slow low-dimensional projections of simulation trajectory data for microstate clustering (Section 2.3). It is known that MSM models built on top of TICA components generally perform much better than those built upon structural metrics (e.g. root-mean-square deviation of atomic positions) [86]. TICA coordinates have also been used as collective variables in which to conduct enhanced sampling using metadynamics [166,167].

3. Machine learning-enabled advances in collective variable discovery and enhanced sampling

We now proceed to detail a selection of recent advances in collective variable (CV) discovery and enhanced sampling in biomolecular simulations that have been enabled by modern machine learning techniques. The selected applications are mainly taken from the field of biomolecular simulation and largely build upon the foundations established in Section 2.

3.1. Diffusion maps-based enhanced sampling

A number of enhanced sampling techniques have emerged that rely on DMAPS to learn a low-dimensional intrinsic manifold characterising the slowest motions of a macromolecular system, then use various schemes to expand the boundaries of the manifold into unexplored regions. One such method is known as diffusion-map-directed molecular dynamics (DM-d-MD). In DM-d-MD, an initial short simulation is carried out, after which the locally-scaled variant of diffusion maps is used to construct the intrinsic manifold [57]. The configuration with the largest value of the first diffusion coordinate is chosen as the new frontier, and a new short simulation is started from that state. This process is repeated until no new regions are discovered. Selection bias towards frontier points perturbs sampling away from the unbiased Boltzmann distribution. In order to reconstruct accurate free energies and sample densities, additional rounds of umbrella sampling are performed at the frontier points and reweighting is employed to recover equilibrium statistics. An extended version of DM-d-MD was subsequently proposed to eliminate the need for the additional umbrella sampling and improve the selection of frontier points [43]. In this extended version, swarms of simulations are initialised, terminated, and restarted over the course of the landscape exploration process to maintain an approximately uniform distribution in the first two DMAPS coordinates. By updating the statistical weights of the trajectories within this kill/spawn process, the necessary reweighting factors are available to correct for the bias introduced in the selection of simulation starting points and recover estimates of the unbiased free energy landscape. An application of extended DM-d-MD to alanine-12 demonstrated impressive speedups in exploring the thermally accessible phase space compared to unbiased calculations [43]. The heart of the DM-d-MD method is to accelerate sampling by the smart initialisation of unbiased simulations at the frontier of the explored phase space rather than through the imposition of artificial bias. On the one hand, this is advantageous in that all simulation trajectories evolve under the unbiased system Hamiltonian and therefore obey the true dynamics of the system. On the other hand, the absence of artificial bias means that simulations are reliant on favourable initialisation and thermal fluctuations to drive barrier crossing, so trajectories can be prone to tumble down steep free energy gradients and limit the efficiency of barrier crossing.

A second approach termed intrinsic map dynamics (iMapD) is due to Chiavazzo et al. [62]. Similar to DM-d-MD, short simulations are conducted and embedded using DMAPS.
The boundary of this 'intrinsic map' is detected and extended outwards by a certain amount using local PCA. This step is critical to iMapD, as it involves the projection of points on the intrinsic manifold into unexplored regions, effectively allowing the system to tunnel through free energy barriers. Since the projected points may lie off-manifold, a lifting step is performed where the new configurations are restrained and the remaining degrees of freedom are relaxed. Once lifting is complete, new rounds of unbiased simulation are initialised from the projected boundary points and the procedure repeated until convergence. An illustration of the operation of iMapD is presented in Figure 3. An application of iMapD to computationally challenging simulations of the dissociation of the Mga2 dimer demonstrated its capacity to efficiently drive dissociation in just three iterations of the technique where millisecond-long unbiased simulations failed to do so [62]. Whereas DM-d-MD initialised new simulations at the frontier of the currently sampled phase space, iMapD performs a local extrapolation to seed new points beyond the current frontier, offering improved sampling efficiency and the possibility to tunnel through free energy barriers. The optimal size of the outward step can, however, be difficult to determine and, like DM-d-MD, the absence of artificial bias can impair barrier crossing efficiency.

Figure 3. Schematic illustration of iMapD. The curved teal sheet is a cartoon representation of a low-dimensional manifold residing within the high-dimensional coordinate space of the molecular system (black background) and to which the system dynamics are effectively restrained. This manifold supports the low-dimensional molecular free energy surface of the system (red contours denote potential wells). The dimensionality of the manifold, good collective variables with which to parameterise it, and topography of the free energy surface are a priori unknown. iMapD commences by running short unbiased simulations to perform local exploration of the underlying manifold and define an initial cloud of points C(1). Boundary points are identified, here BP1(1) and BP2(1), and local PCA applied to define a locally-linear approximation to the manifold geometry that is valid in the vicinity of each point. An outward step is then taken within these linear subspaces, here from BP1(1), to expand the exploration frontier. The projected point may lie off the manifold due to the linear approximation inherent in the outward projection and so a short 'lifting' operation is employed to relax it back to the manifold. This point then seeds a new unbiased simulation that generates a new cloud of points C(2) and the process is repeated until the manifold is fully explored. In this manner iMapD explores the manifold by 'walking on clouds'. Image adapted with permission from Ref. [62].

The application of artificial biasing potentials in the collective variables identified by DMAPS is made challenging by the absence of an explicit and differentiable mapping between the atomic coordinates and the DMAPS CVs. The out-of-sample extension techniques discussed in Section 2.2 furnish approximate projections for new data and enable energy biases to be applied in Monte-Carlo simulations as perturbations to the unbiased Hamiltonian conditioned on the current value of the DMAPS CVs [168–170]. The approximations introduced by these extrapolations, however, typically render them too numerically unstable for reliable derivative calculation and the implementation of force biases in molecular dynamics simulation. One solution to this problem is offered by the diffusion nets (DNETS) approach of Mishne et al., who train an ANN encoder to learn a functional map from the atomic coordinates to the low-dimensional DMAPS embeddings [171]. By construction, this map is both explicit and differentiable, opening the door to its use within off-the-shelf molecular dynamics enhanced sampling techniques such as umbrella sampling or metadynamics. The authors also train an ANN decoder to reconstruct molecular configurations from the DMAPS manifold, which may also be useful in 'hallucinating' new molecular configurations outside the currently explored phase space that may then be lifted and used to initialise new simulations in the mould of iMapD.
3.2. Smooth and nonlinear data-driven CVs (SandCV)

In a similar spirit to the DNETS approach of Mishne et al. (Section 3.1), Hashemian et al. developed an approach termed smooth and nonlinear data-driven collective variables (SandCV) to estimate explicit and differentiable expressions for CVs discovered by nonlinear dimensionality reduction and then apply bias in these CVs to perform enhanced sampling [38]. In principle, SandCV is compatible with any nonlinear dimensionality reduction technique and enhanced sampling protocol, but it was originally developed to operate with Isomap [49] and the adaptive biasing force (ABF) method [172]. The heart of SandCV is estimation of the explicit and differentiable function $\mathcal{C}: r \in \mathbb{R}^D \rightarrow \xi = \mathcal{C}(r) \in \mathbb{R}^d$ that projects a $D$-dimensional all-atom Cartesian configuration $r$ into a point $\xi$ in the $d$-dimensional Isomap manifold, where $d \ll D$. The mapping $\mathcal{C}(r)$ can be conceived of as the composition of three functional maps,

$$ \mathcal{C}(r) = \mathcal{M}^{-1} \circ \mathcal{P} \circ \mathcal{A}(r), \qquad (4) $$

as illustrated in Figure 4. $\mathcal{A}(r)$ performs alignment of the atomic configuration to (some subset of) the atoms $x$ of a reference structure, $\mathcal{P}(x)$ performs a projection of the aligned configuration to the nearest neighbour point within the previously constructed Isomap manifold, and $\mathcal{M}^{-1}(x)$ performs projection of this point into the manifold and is itself the inverse of a function $\mathcal{M}(\xi)$ that maps points in the manifold back to the aligned molecular configurations through a basis function expansion in a small number of landmark points. Enhanced sampling is effected by applying biasing forces over the manifold $F(\xi)$ and propagating these to forces on atoms $F(r)$ through the Jacobian of the mapping $D\mathcal{C}(r)$.

Figure 4. Schematic illustration of SandCV. Molecular configurations $r$ are aligned to a reference configuration $\mathcal{A}(r)$ then projected onto the Isomap manifold using a nearest neighbour projection and a basis function expansion in a number of landmark points $\mathcal{M}^{-1} \circ \mathcal{P}(x)$. Enhanced sampling using adaptive biasing force (ABF) is effected by propagating biasing forces over the manifold $F(\xi)$ into forces on atoms $F(r)$ through the Jacobian of the explicit and differentiable composite mapping function $\mathcal{C}(r) = \mathcal{M}^{-1} \circ \mathcal{P} \circ \mathcal{A}(r)$. Image reprinted from Ref. [38], with the permission of AIP Publishing.

SandCV is demonstrated in applications to alanine dipeptide in vacuum and explicit water. In an instance of transfer learning, it is shown that data-driven CVs computed for a simpler system (alanine dipeptide in vacuum) can be applied to a more complex system (alanine dipeptide in water). In a followup publication, the authors propose an extension to SandCV that builds an atlas of locally-valid CVs that are subsequently stitched together, which can be valuable in parameterising complex free energy topologies where different regions of conformational space may require different CVs for their parameterisation [173].

SandCV relies on the availability of representative configurations covering the region of configurational phase space of interest since the projection of points onto the manifold is through projection onto nearest neighbours. When no such data are available, SandCV uses initial high-temperature simulations to provide seed configurations for the manifold learning. The subsequent enhanced sampling is then able to interpolatively bridge the gaps in the sparse initial landscape, but it remains undemonstrated as to whether the algorithm can extrapolatively drive sampling into new regions of configuration space.
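The force propagation common to SandCV and related explicit-mapping methods, $F(r) = -\nabla_\xi V \cdot D\mathcal{C}(r)$, can be sketched generically as follows. The toy CV map, harmonic bias, and finite-difference Jacobian below are illustrative assumptions, not the SandCV implementation.

```python
import numpy as np

def bias_forces(r, cv_map, dV_dxi, h=1e-5):
    """Propagate biasing forces from CV space to atoms: F(r) = -dV/dxi . DC(r).
    cv_map is an explicit mapping r -> xi (cf. Equation (4)); its Jacobian
    DC(r) is approximated here by central finite differences."""
    xi0 = cv_map(r)
    J = np.zeros((len(xi0), len(r)))
    for a in range(len(r)):
        dr = np.zeros_like(r)
        dr[a] = h
        J[:, a] = (cv_map(r + dr) - cv_map(r - dr)) / (2 * h)
    return -dV_dxi(xi0) @ J                    # force on each atomic coordinate

# Usage with a toy two-CV mapping and a harmonic bias centred at xi* = (1, 0).
cv_map = lambda r: np.array([r[0] + r[1] ** 2, np.sin(r[2])])
dV_dxi = lambda xi: 10.0 * (xi - np.array([1.0, 0.0]))   # dV/dxi of 5|xi - xi*|^2
F = bias_forces(np.zeros(3), cv_map, dV_dxi)
```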
3.3. Molecular enhanced sampling with autoencoders (MESA)

DNETS (Section 3.1) and SandCV (Section 3.2) furnish explicit and differentiable approximations linking the atomic coordinates to the low-dimensional CVs furnished by nonlinear dimensionality reduction, which can subsequently be used to conduct enhanced sampling. Chen et al. proposed an alternative nonlinear dimensionality reduction approach based on deep learning that learns nonlinear CVs possessing explicit and differentiable mappings by construction [36,63]. In doing so, the functional estimation step is eliminated and enhanced sampling may be conducted directly in the learned CVs without approximation error. This approach, termed molecular enhanced sampling with autoencoders (MESA), employs a deep neural network (DNN) with an autoencoding architecture or 'autoencoder' (AE) comprising an encoder $proj: z \in H \rightarrow \xi \in L$ that maps molecular configurations $z$ in a high-dimensional coordinate space $H$ to a nonlinear projection $\xi$ in a low-dimensional latent space $L$, and a decoder $rec: \xi \in L \rightarrow \hat{z} \in H$ that approximates the reverse mapping (Figure 5). The network is trained to reconstruct its own inputs (i.e. autoencode) such that $z \approx \hat{z}$ and therefore discover a low-dimensional latent space $\xi$ defined by the ANN activations in a bottleneck layer that preserves the salient information necessary to perform an approximate reconstruction. The appropriate dimensionality of the latent space, and therefore the number of nonlinear CVs required for reconstruction, can be tuned on-the-fly. Since the encoding $\xi = proj(z)$ is furnished by an ANN, it is explicit and differentiable by construction and can be used to propagate biasing forces in the CVs $F(\xi)$ to forces on atoms $F(z)$.

Figure 5. Molecular enhanced sampling with autoencoders (MESA). An autoencoding neural network (autoencoder) is trained to reconstruct molecular configurations via a low-dimensional latent space where the CVs are defined by neuron activations within the bottleneck layer. The encoder $proj$ performs the low-dimensional projection from molecular coordinates $z$ in the high-dimensional atomic coordinate space $H$ into the low-dimensional latent space $L$ and the decoder $rec$ performs the approximate reconstruction back to $\hat{z}$. The encoder furnishes, by construction, an exact, explicit, and differentiable mapping from the atomic coordinates to CVs that can be modularly incorporated into any off-the-shelf CV biasing enhanced sampling technique.
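A minimal PyTorch sketch of a MESA-style autoencoder is given below. The network sizes, training data, and loop are placeholders; the point is only that the bottleneck activations furnish explicit, differentiable CVs.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Minimal MESA-style autoencoder: the bottleneck activations xi
    serve as explicit, differentiable CVs (encoder = proj, decoder = rec)."""
    def __init__(self, n_input, n_cv=2, n_hidden=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(n_input, n_hidden), nn.Tanh(),
                                  nn.Linear(n_hidden, n_cv))
        self.rec = nn.Sequential(nn.Linear(n_cv, n_hidden), nn.Tanh(),
                                 nn.Linear(n_hidden, n_input))

    def forward(self, z):
        xi = self.proj(z)                  # latent CVs
        return self.rec(xi), xi            # reconstruction and CVs

# Training on placeholder featurized configurations z (n_frames x n_features).
z = torch.randn(1000, 30)
model = AutoEncoder(n_input=30)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
    zhat, xi = model(z)
    loss = ((zhat - z) ** 2).mean()        # reconstruction loss, z ~ zhat
    opt.zero_grad()
    loss.backward()
    opt.step()
# Biasing forces in the CVs follow from torch.autograd gradients of xi w.r.t. z.
```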
To encourage complete sampling of phase space and improvement of the data-driven CVs, rounds of CV discovery and enhanced sampling are interleaved in an iterative framework comprising successive: (i) learning of CVs from simulation trajectories (either the initial unbiased trajectory or biased trajectories obtained from previous iterations of CV biasing) and (ii) application of biasing with the learned CVs to push the frontier outwards and drive exploration of new regions of phase space using umbrella sampling (but arbitrary CV biasing approaches may be employed). The process is terminated when the CVs stabilise between successive rounds and the volume of phase space explored converges. Applications to alanine dipeptide and Trp-cage demonstrate the capacity of the technique to discover, sample, and determine free energy surfaces in nonlinear CVs starting from no prior knowledge of the system [36]. The iterative expansion of the frontier and refinement of CVs as a function of location in phase space is analogous to that in iMapD and DM-d-MD, but the application of accelerating biasing forces greatly enhances barrier crossing. The use of the explicit and differentiable mapping to perform enhanced sampling is similar to SandCV, but the mathematical framework enabled by ANNs is much simpler and the functional mapping is exact by construction. The use of biasing forces does, of course, corrupt the true dynamics of the system and so dynamical observables (e.g. Markov state models) cannot be straightforwardly extracted from the simulation data. A followup paper shows that tailoring the autoencoder architecture and error functions can help discover better CVs, improve sampling efficiency, and favour the discovery of more stable and interpretable CVs [63].

3.4. Reweighted autoencoded variational Bayes for enhanced sampling (RAVE)

Akin to MESA (Section 3.3), reweighted autoencoded variational Bayes for enhanced sampling (RAVE) due to Ribeiro et al. uses DNNs to learn nonlinear CVs for enhanced sampling [174]. It differs from MESA in that it makes use of variational autoencoders (VAEs) [121], seeks a 1D latent space encoding only the leading CV, and conducts sampling not directly in the discovered CV but in a proxy physical variable (or linear combination thereof) that maximally resembles the CV. The use of VAEs compared to AEs conveys advantages in producing better regularised and continuous latent space embeddings. Identification of a physical variable $\chi$ in which to perform sampling is very attractive from an interpretability standpoint, but means sampling is necessarily performed in a proxy for the data-driven CV. The quality of the proxy variable in approximating the discovered CV is contingent on the space of candidate physical variables considered. The probability distribution in the optimal physical variable $P(\chi)$ is then turned into a biasing potential $V_{bias}(\chi) = k_B T \ln P(\chi)$ which, by virtue of the physical nature of $\chi$ for which an explicit relation to the atomic coordinates is known, is straightforwardly converted into biasing forces.
An iterative procedure very similar to MESA is then applied to drive system exploration by interleaving rounds of biased simulation and CV learning.

The restriction to single CVs is limiting, but the framework can, in principle, be extended to multidimensional CVs. One way to do so may be to employ β-VAEs to encourage independence of the various CVs [175,176], but an alternative approach is adopted in an extension of the framework known as Multi-RAVE in which a set of locally valid one-dimensional CVs is constructed and the piecewise sum of these position-dependent components forms a single nonlinear CV spanning the relevant configurational space [177]. Numerical experiments with the dissociation of benzene from L99A lysozyme predict unbinding free energies in good agreement with experiment.

3.5. Reinforcement learning based adaptive sampling (REAP)

The CVs parameterising configurational space may vary substantially as a function of location over that space. For example, those CVs appropriate to parameterise and enhance configurational sampling in the vicinity of the native fold of a protein may differ significantly from those appropriate for the unfolded ensemble, and protein activation frequently involves two (or more) distinct molecular events parameterised by different CVs that occur in series and result in characteristic 'L-shaped' landscapes. By maintaining a sufficiently large ensemble of CVs, one may determine on-the-fly which subset of CVs constitutes the active space for enhanced sampling at any given location in phase space. This is the approach taken by reinforcement learning based adaptive sampling (REAP) introduced by Shamsi et al., which employs reinforcement learning (RL) to determine the relative importance of different candidate CVs as a system explores phase space [35]. REAP proceeds by running an initial round of short molecular simulations. The resulting configurations are then clustered, the least-populated clusters identified, and a reward function measuring the normalised absolute distance from the ensemble mean evaluated for each candidate CV for each cluster. An optimisation problem is solved to maximise the overall reward function as a weighted sum over the candidate CVs, and the clusters that offer the highest rewards are selected as those from which to harvest configurations to seed a new round of simulations. This process is repeated until sufficient sampling of the phase space is achieved. The key feature of any RL approach is the reward function, which in the case of REAP is designed to maximise discovery of new conformational states. Like RAVE, the success of REAP is contingent on the quality and size of the space of candidate CVs. RL remains one of the less explored areas of ML in applications to molecular simulation, and it remains to be seen what advantages it can bring to adaptive sampling relative to the unsupervised approaches discussed in Sections 3.1–3.4.
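The REAP reward evaluation described above can be caricatured as follows. This is a toy sketch under our own simplifying assumptions (fixed CV weights and a Gaussian-style normalisation), not the reference implementation of Ref. [35].

```python
import numpy as np

def reap_rewards(cluster_means, cluster_pops, weights):
    """Toy REAP-style reward: for each cluster, a weighted sum over candidate
    CVs of the normalised absolute distance from the ensemble mean."""
    mu = np.average(cluster_means, axis=0, weights=cluster_pops)    # ensemble mean
    sigma = np.sqrt(np.average((cluster_means - mu) ** 2, axis=0,
                               weights=cluster_pops))               # per-CV spread
    distances = np.abs(cluster_means - mu) / sigma                  # (n_clusters, n_cvs)
    return distances @ weights

# Usage: pick the highest-reward clusters to seed the next round of simulations.
rng = np.random.default_rng(4)
means = rng.normal(size=(8, 3))          # 8 clusters x 3 candidate CVs
pops = rng.integers(1, 100, size=8).astype(float)
weights = np.full(3, 1 / 3)              # CV importances, optimised in the real method
seeds = np.argsort(reap_rewards(means, pops, weights))[-2:]
```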
3.6. Determining collective variables through supervised learning

Supervised learning is also relatively under-explored in molecular CV discovery relative to unsupervised techniques since the output variables (a.k.a. dependent variables, labels) for which we wish to construct a model in terms of our input variables (a.k.a. independent variables, descriptors, features) are often not obvious or available. Sultan and Pande recently proposed that the (pre-defined) metastable states of a molecular system may be adopted as output variables and supervised learning deployed to construct a pairwise or one-vs-all decision function to discriminate between the states and serve as a CV for enhanced sampling [178]. Such a situation may arise in protein activation where crystal structures for the active and inactive states are available but the activation pathway and mechanism are unknown. The supervised learning task is cast as a classification problem taking as input the atomic coordinates of the molecule in the various states and as output the labels of the states, and is solved by support vector machines (SVM), logistic regression, or ANNs. The resulting decision function – distance to the separating hyperplane for SVM, probability or odds ratio for logistic regression, unnormalised network output for ANN – provides an explicit and differentiable CV that is deployed in metadynamics simulations to drive sampling between states. The approach is demonstrated in applications to alanine dipeptide and chignolin, where it is shown to effectively drive reactive transitions [178]. Success of the approach is predicated on prior knowledge of the relevant states, and, like path sampling, the decision function CVs are inherently interpolative and so can have difficulty driving sampling into unexplored regions of phase space.

Mendels et al. independently developed harmonic linear discriminant analysis (HLDA) and multi-class HLDA (MC-HLDA) as a supervised learning approach based on a generalisation of Fisher's linear discriminant [179,180]. The method takes as input the means and covariance matrices within a predefined set of descriptors for K metastable states as measured by short molecular simulations. An optimisation problem is formulated to find the (K − 1) linear projections within the descriptor space that maximise the ratio between the between-state and within-state scatter matrices, which corresponds to maximisation of a Fisher ratio and can be solved via a generalised eigenvalue problem. The linear projections within the descriptor space furnish CVs in which to perform metadynamics enhanced sampling. An application to chignolin demonstrates that the method successfully generates reactive pathways between the folded and unfolded states, although the efficiency of the approach can be sensitive to the user-defined selection of descriptors [181]. Again, this approach requires prior knowledge of the relevant metastable states, and is not designed to drive sampling into new configurational states.
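The discriminant construction in MC-HLDA reduces to a generalised eigenvalue problem, which can be sketched as follows. The harmonic within-state scatter follows the spirit of Refs. [179,180], but the details here are an illustrative assumption rather than the published algorithm.

```python
import numpy as np
from scipy.linalg import eigh

def hlda_cvs(means, covs):
    """Sketch of multi-class harmonic LDA: maximise the Fisher ratio via the
    generalised eigenproblem S_b u = lambda S_w u, where S_w is built from a
    harmonic average of the per-state covariances (cf. Refs. [179,180])."""
    means = np.asarray(means)
    mu = means.mean(axis=0)
    S_b = sum(np.outer(m - mu, m - mu) for m in means)         # between-state scatter
    S_w = np.linalg.inv(sum(np.linalg.inv(c) for c in covs))   # harmonic within-state scatter
    evals, U = eigh(S_b, S_w)
    return U[:, np.argsort(evals)[::-1][:len(means) - 1]]      # the K-1 discriminant CVs

# Usage with K = 3 synthetic metastable states in a 5D descriptor space.
rng = np.random.default_rng(5)
means = rng.normal(size=(3, 5))
covs = [np.eye(5) + 0.1 * np.diag(rng.random(5)) for _ in range(3)]
W = hlda_cvs(means, covs)   # project descriptors onto W to obtain the CVs
```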
3.7. Transfer operator theory and variational approaches to conformational dynamics

The CV discovery approaches discussed thus far have largely sought to discover high-variance CVs within the configurational phase space using some form of unsupervised, supervised, or reinforcement learning. We now proceed to discuss some recent developments in the data-driven discovery of slow (i.e. maximally autocorrelated) CVs that can often be more mechanistically meaningful and provide superior coordinates for the direct acceleration of the slowest dynamical processes. The theoretical foundation for the determination of these CVs lies in spectral analysis of the transfer operator that propagates probability distributions over molecular microstates through time [83,84,87,105,161,162]. In an important theoretical development, Noé and Nüske showed that the spectral analysis of the operator can be performed in a data-driven fashion within the variational approach to conformational dynamics (VAC) in the case of equilibrium systems [83,84] or the variational approach to Markov processes (VAMP) in the case of non-reversible and non-stationary dynamics [79,80]. These frameworks possess a pleasing parallel with the variational approach to approximating electronic wavefunctions within a given basis set through solution of the quantum mechanical Roothaan-Hall equations [182,183]. Full details of the VAC and VAMP can be found in Refs. [79,80,83,84,87,105,184]. Here we briefly survey a number of recently developed machine learning approaches for slow CV discovery that seek to perform data-driven diagonalisation of the transfer operator.

As discussed in Section 2.4, TICA adopts as a basis set a featurization ξ(x) of the atomic coordinates x (in the original TICA formulation ξ(x) = x) and solves a generalised eigenvalue problem (Equation (3)) to define maximally autocorrelated linear projections within this basis. Kernel TICA (kTICA) is a generalisation of the TICA algorithm that employs the kernel trick to apply the TICA machinery within a nonlinear transformation of the feature space [87]. The nonlinearity of the kernel function provides kTICA with greater expressive power and the capacity to learn nonlinear slow modes from time-series data with higher fidelity than TICA [87]. As is typical of kernel-based methods, kTICA is computationally expensive and sensitive to kernel selection and hyperparameter choice [87,88,105,109].
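As a concrete illustration of this machinery, here is a minimal NumPy/SciPy sketch of TICA; production implementations such as PyEmma add regularisation and more careful symmetrised estimators, and kTICA would apply the same eigenproblem to kernel evaluations between frames rather than to explicit features.

```python
import numpy as np
from scipy.linalg import eigh

def tica(features, lag):
    """Minimal TICA sketch: maximally autocorrelated linear projections.

    features: (n_frames, n_features) featurized trajectory xi(x).
    Solves the generalised eigenvalue problem C(tau) v = lambda C(0) v.
    """
    X = features - features.mean(axis=0)
    X0, Xt = X[:-lag], X[lag:]
    C0 = X0.T @ X0 / len(X0)        # instantaneous covariance
    Ct = X0.T @ Xt / len(X0)        # time-lagged covariance
    Ct = 0.5 * (Ct + Ct.T)          # symmetrised (reversible) estimator
    evals, evecs = eigh(Ct, C0)
    order = np.argsort(evals)[::-1] # slowest modes (largest lambda) first
    return evals[order], evecs[:, order]

# Usage on a toy trajectory of three independent random walks.
traj = np.cumsum(np.random.default_rng(1).normal(size=(5000, 3)), axis=0)
lam, V = tica(traj, lag=50)
slowest_cv = (traj - traj.mean(axis=0)) @ V[:, 0]
```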
Time-lagged autoencoders (TAEs) approximate slow CVs by performing nonlinear time-lagged regression using deep learning. Applied in the context of molecular simulation by Wehmeyer and Noé, TAEs employ an autoencoder architecture in which the encoder maps a configuration z_t at time t to a latent encoding e_t, and the decoder maps e_t to a time-lagged output ẑ_{t+τ} = D(e_t) that minimises the time-lagged reconstruction loss against the true time-lagged configuration, L = E‖D(e_t) − z_{t+τ}^{true}‖² [107] (Figure 6). The underlying principle of operation is that minimisation of this time-lagged reconstruction loss promotes the discovery of slow CVs as the latent space variables e_t [108]. The technique is demonstrated in applications to alanine dipeptide and villin protein, and is shown to perform favourably against TICA, particularly when suboptimal molecular featurizations are employed.
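A minimal PyTorch sketch of the TAE training objective is given below; it is an illustration rather than the reference implementation of Ref. [107], and the network sizes, learning rate, and lag time are placeholder choices.

```python
import torch
import torch.nn as nn

lag, n_features, n_cv = 10, 30, 1   # placeholder hyperparameters

encoder = nn.Sequential(nn.Linear(n_features, 64), nn.Tanh(), nn.Linear(64, n_cv))
decoder = nn.Sequential(nn.Linear(n_cv, 64), nn.Tanh(), nn.Linear(64, n_features))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(traj):
    """One gradient step of the time-lagged reconstruction loss.

    traj: (n_frames, n_features) tensor of featurized configurations z_t.
    """
    z_t, z_tau = traj[:-lag], traj[lag:]
    e_t = encoder(z_t)                             # latent CV e_t
    loss = ((decoder(e_t) - z_tau) ** 2).mean()    # E||D(e_t) - z_{t+tau}||^2
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# for epoch in range(100): train_step(torch.as_tensor(features, dtype=torch.float32))
```

Setting lag = 0 recovers a standard autoencoder, consistent with the reduction to MESA noted in the Figure 6 caption.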
Figure 6. Block diagram of a time-lagged autoencoder (TAE). The encoder projects a molecular configuration z_t at time t into a low-dimensional latent embedding e_t from which a time-lagged molecular configuration z_{t+τ} at time (t + τ) is subsequently reconstructed. For τ = 0 the TAE reduces to a standard AE and the CV discovery process is equivalent to MESA (Section 3.3). Image reprinted from Ref. [107], with the permission of AIP Publishing.

Variational dynamics encoders (VDEs) are a deep learning approach for slow CV discovery first introduced by Hernández et al. [109]. VDEs employ a DNN autoencoding architecture similar to that of TAEs, but differ in their use of a VAE, as opposed to a standard AE, and a mixed loss function,

L = λ E‖D(e_t) − z_{t+τ}‖² + L_KL − (1 − λ)A(e_t),   (5)

where E‖D(e_t) − z_{t+τ}‖² is the time-lagged reconstruction loss, A(e_t) is the autocorrelation of the learned 1D latent space CV e_t, L_KL is a regularisation term that measures the similarity of the distribution of e_t in the latent space to a Gaussian distribution, and 0 ≤ λ ≤ 1 is a linear mixing parameter [108]. In an application to the folding of villin protein, VDEs were shown to outperform TICA in the discovery of CVs capable of resolving metastable states, and the VDE latent coordinates produced superior MSMs with slower implied timescales [109].

TAEs and VDEs possess two key limitations. First, they are restricted to the discovery of 1D latent spaces and cannot be applied to learn multiple hierarchical slow modes due to the absence of orthogonality constraints in latent space [108]. Second, the incorporation of the time-lagged reconstruction loss within the loss function compromises the ability of the networks to discover the highly autocorrelated (i.e. slow) modes at the expense of high-variance modes [108]. In general, TAEs and VDEs discover mixtures of maximum variance modes and slow modes [108].

State-free reversible VAMPnets (SRVs) solve both of these deficiencies of TAEs and VDEs for equilibrium systems by employing a variational minimisation of a loss function that maximises the VAMP-2 (or more generally VAMP-r) score measuring the cumulative kinetic variance explained within the subspace of data-driven slow CVs [105]. The VAMP-2 score can be interpreted as the squared sum of the exponentials of the implied timescales of the slow CVs discovered by SRVs, and is guaranteed by the VAC to reach a maximum when the approximated slow CVs are coincident with the true slow CVs of the transfer operator [20,80,161]. SRVs can be conceived of as an application of TICA in which DNNs are employed to learn optimal nonlinear featurizations of the atomic coordinates as a learned basis set that is subsequently passed to the generalised eigenvalue problem (Equation (3)) [105]. The idea of learning an optimal basis to pass to a linear variational approach was first proposed by Andrew et al. in the context of deep CCA [103] and first applied to molecular simulations in Mardt et al.'s VAMPnets [80].

SRVs employ a twin-lobe neural network that transforms pairs of time-lagged molecular configurations {x(t), x(t + τ)} into a space of d learned nonlinear basis functions {ζ(x(t)), ζ(x(t + τ))} (Figure 7). These basis functions are passed to the linear VAC, where solution of the generalised eigenvalue problem furnishes approximations to the transfer operator eigenfunctions as orthogonal linear projections within this basis. The key to the entire approach is the definition of the negative VAMP-r score as a loss function under which the twin-lobed ANN is iteratively trained to learn the nonlinear basis within which linear approximations of the d leading transfer operator eigenfunctions ψ̃ = {ψ_1, ψ_2, …, ψ_d} are computed. Once trained, the ANN and generalised eigenvalue problem define an explicit and differentiable mapping between the atomic coordinates and slow CVs that can be straightforwardly deployed in CV biasing enhanced sampling routines [105]. SRVs have been demonstrated in applications to alanine dipeptide, WW domain, and Trp-cage, and proven to be a simple, efficient, and robust means for slow CV determination that possesses strong theoretical guarantees [105,106]. Moreover, SRVs have been shown to present an excellent and modular replacement for TICA within MSM construction pipelines: the nonlinear SRV latent space presents a kinetically superior space for microstate clustering than the linear embeddings furnished by TICA, with the resulting MSMs exhibiting faster implied timescale convergence and higher kinetic resolution than current state-of-the-art approaches [106]. Replacement of the VAC within the SRV with the more general VAMP principle serves to extend the approach to non-stationary and non-reversible processes, resulting in the more general state-free non-reversible VAMPnets (SNRV).
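The following condensed PyTorch sketch shows how a negative VAMP-2 loss may be assembled from the twin-lobe outputs; a complete SRV implementation (e.g. github.com/hsidky/srv) additionally handles batching, basis whitening, and the final generalised eigenvalue problem.

```python
import torch

def _inv_sqrt(c):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    evals, evecs = torch.linalg.eigh(c)
    return evecs @ torch.diag(evals.clamp_min(1e-10).rsqrt()) @ evecs.T

def neg_vamp2_loss(chi_t, chi_tau, eps=1e-6):
    """Negative VAMP-2 score of the learned basis (condensed sketch).

    chi_t, chi_tau: (n, d) twin-lobe network outputs for configuration
    pairs (x_t, x_{t+tau}); means are removed before forming covariances.
    """
    chi_t = chi_t - chi_t.mean(0)
    chi_tau = chi_tau - chi_tau.mean(0)
    n, d = chi_t.shape
    c00 = chi_t.T @ chi_t / n + eps * torch.eye(d)
    c11 = chi_tau.T @ chi_tau / n + eps * torch.eye(d)
    c01 = chi_t.T @ chi_tau / n
    k = _inv_sqrt(c00) @ c01 @ _inv_sqrt(c11)   # whitened time-lagged operator
    return -(k ** 2).sum()   # minimising this maximises the VAMP-2 score
```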
Figure 7. State-free reversible VAMPnets. Pairs of time-lagged molecular configurations {x(t), x(t + τ)} are featurized and transformed by a twin-lobe ANN into a space of nonlinear basis functions {ζ(x(t)), ζ(x(t + τ))}. These basis functions are employed within a linear VAC to furnish approximations ψ̃ to the leading eigenfunctions of the transfer operator. The twin-lobed ANN is trained to maximise a VAMP-r score measuring the cumulative kinetic variance explained, which reaches a maximum when the eigenfunction approximations are coincident with the true eigenfunctions of the transfer operator.

3.8. Deep learning based MSMs

Noé and co-workers recently proposed two variants of MSMs based on deep learning: VAMPnets [80] and deep generative MSMs (DeepGenMSMs) [134]. SRVs (Section 3.7, Figure 7) were inspired by VAMPnets, and both approaches share a similar twin-lobe network architecture to apply deep CCA [80,103,105]. They differ in two main respects. First, whereas SRVs pass the learned basis functions to a VAC analysis that is appropriate for approximating the transfer operator eigenfunctions for equilibrium data, VAMPnets pass them to a more general VAMP analysis to approximate the transfer operator singular functions for non-stationary and non-reversible data [80,104]. Second, whereas it is the goal of SRVs to furnish approximations to the leading modes of the transfer operator, it is the goal of VAMPnets to offer an end-to-end replacement for the entire MSM pipeline of featurization, dimensionality reduction, clustering, and kinetic model construction [80]. Integration of these steps within a single framework can be advantageous in helping to avoid the extensive parameter tuning that can plague the various steps in MSM model construction (Section 2.3). VAMPnets achieve this goal by employing softmax activations in the terminal layer of the twin ANN lobes that map a time-lagged pair of molecular configurations {x_t, x_{t+τ}} to fuzzy state assignments (χ_0(x_t), χ_1(x_{t+τ})), where χ_0 and χ_1 are k-dimensional vectors defined over the k softmax output nodes of the two ANN lobes and assign a probability that the molecular configuration belongs to each of k metastable macrostates. The instantaneous and time-lagged covariance matrices C_00 = E[χ_0(x_t)χ_0(x_t)^T] and C_01 = E[χ_0(x_t)χ_1(x_{t+τ})^T] are then computed and used to estimate the MSM transition matrix between states, K = C_00^{-1} C_01 [80]. VAMPnets are illustrated in an application to NTL9, where they discover a 5-state model with kinetic properties on par with a 40-state conventional MSM, thereby illustrating the value of the approach in furnishing more parsimonious, efficient, and interpretable models without compromising kinetic accuracy [80].
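In code, the covariance estimates and transition matrix take only a few lines; the sketch below assumes fuzzy assignments from a trained (hypothetical) network and omits the reversibility and normalisation refinements discussed in Ref. [80].

```python
import numpy as np

def transition_matrix(chi_t, chi_tau):
    """MSM transition matrix from fuzzy (softmax) state assignments.

    chi_t, chi_tau: (n_frames, k) outputs of the two ANN lobes for the
    time-lagged configuration pairs (x_t, x_{t+tau}).
    """
    n = len(chi_t)
    c00 = chi_t.T @ chi_t / n          # C_00, instantaneous covariance
    c01 = chi_t.T @ chi_tau / n        # C_01, time-lagged covariance
    return np.linalg.solve(c00, c01)   # K = C_00^{-1} C_01

# Hypothetical usage with a trained VAMPnet-style classifier `net`:
# K = transition_matrix(net(x[:-lag]), net(x[lag:]))
```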
DeepGenMSMs are a deep learning approach that seeks not only to learn an MSM defined by a discrete transition matrix between metastable states, but also to provide a means to generate realistic molecular trajectories, including previously unseen configurations not included in the training data [104,134]. DeepGenMSMs are based on the following representation of the transition density between a configuration (x_t = x) at time t and (x_{t+τ} = z) at time (t + τ),

P(x_{t+τ} = z | x_t = x) = χ(x)^T q(z; τ) = Σ_{i=1}^{m} χ_i(x) q_i(z; τ),   (6)

where χ(x) = [χ_1(x), …, χ_m(x)] is a normalised vector representing the probability that configuration x exists within each of the i = 1, …, m metastable macrostates, and q(z; τ) = [q_1(z; τ), …, q_m(z; τ)] is the vector of 'landing densities', where q_i(z; τ) = P(x_{t+τ} = z | x_t ∈ state i) defines the probability that a system in macrostate i at time t lands in molecular configuration z at time (t + τ).

The membership probabilities χ(x) and landing densities q(z; τ) are learned by training a two-lobe ANN architecture similar to VAMPnets to maximise the likelihood of the time-lagged pairs (x_t, x_{t+τ}) observed in simulation trajectories (Figure 8). The MSM transition matrix K between the m metastable states is furnished by the 'rewiring trick', wherein K = E[q(x; τ)χ(x)^T]. In order to generate molecular configurations outside of the training data, it is additionally necessary to train a generator to sample from the density specified by the learned landing densities q(z; τ),

G(i, ε; τ) = y ∼ q_i(y; τ),   (7)

where i indexes over the states and ε is i.i.d. random noise sampled from a Gaussian distribution that powers the generator.

Figure 8. Deep Generative MSM (DeepGenMSM) and the 'rewiring trick'. (left) The encoder χ(x) within the twin-lobe ANN is trained to learn mappings of molecular configurations x to probabilistic memberships y of one of m macrostates. The generator is trained against the learned 'landing probabilities' q_i(z; τ) that a system prepared in macrostate i will transition to molecular configuration z after a time τ. (right) The rewiring trick reconnects the generator and encoder to furnish a valid estimate K̃ for the MSM transition matrix over the m discrete states learned by the encoder. Image adapted from Ref. [134], with permission from the author Prof. Frank Noé (Freie Universität Berlin).

Applications of DeepGenMSMs to alanine dipeptide demonstrate their ability to accurately estimate the long-time kinetics and stationary distributions and also to generate molecularly realistic structures, including in regions of phase space where no training data was supplied [134].
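A schematic of the rewiring estimate is sketched below; the inputs are hypothetical outputs of the trained encoder and generator lobes evaluated on sampled configurations, and the final row normalisation is a practical safeguard rather than part of the formal estimator.

```python
import numpy as np

def rewired_transition_matrix(q_x, chi_x):
    """'Rewiring trick' estimate of K = E[q(x; tau) chi(x)^T] (sketch).

    q_x:   (n, m) learned landing densities q(x; tau) evaluated at sampled
           configurations x (hypothetical trained-generator outputs).
    chi_x: (n, m) membership probabilities chi(x) from the encoder lobe.
    """
    K = q_x.T @ chi_x / len(q_x)             # Monte Carlo estimate of the expectation
    return K / K.sum(axis=1, keepdims=True)  # row-normalise into a stochastic matrix
```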
3.9. Software

We list in Table 1 software packages and libraries implementing some of the CV discovery and enhanced sampling methods discussed in this review.

Table 1. Software packages and libraries available for some of the collective variable discovery and enhanced sampling techniques discussed in this review.

ANNs: Keras (keras.io), TensorFlow (www.tensorflow.org), PyTorch (pytorch.org)
DMAPS: github.com/hsidky/dmaps, github.com/DiffusionMapsAcademics/pyDiffMap
MSM, TICA: PyEmma (www.emma-project.org/latest/), MSMBuilder (msmbuilder.org/)
MESA: github.com/weiHelloWorld/accelerated_sampling_with_autoencoder
TAE, VAMPnets: github.com/markovmodel/deeptime
VDE: github.com/msmbuilder/vde
SRV: github.com/hsidky/srv
DeepGenMSM: github.com/markovmodel/deep_gen_msm
Enhanced sampling suites: SSAGES (github.com/MICCoM/SSAGES-public), PLUMED (www.plumed.org), Colvars (colvars.github.io)

4. Conclusion and outlook

It has been the goal of this review to offer a survey of some of the most exciting recent developments and applications of machine learning to collective variable discovery and enhanced sampling in biomolecular simulation. We sought to expose the essence of each method, its advantages and drawbacks, the systems in which it has been applied and demonstrated, and the availability of software implementations. We close with a retrospective assessment of the key milestones in the field and our outlook on emerging challenges and opportunities.

The origins of machine learning for CV discovery can be traced back to pioneering applications of linear dimensionality reduction techniques in the early 1990s. The first major development arrived in the early 2000s with the debut of powerful nonlinear dimensionality reduction tools. The mid-2000s witnessed the emergence of MSMs in the field. The late 2000s and early 2010s saw the introduction of techniques focussed on the discovery of slow as opposed to high-variance CVs. Advances in the past several years have been propelled in large part by deep learning methodologies coming to the fore. ANNs themselves are, of course, not a new idea, with roots dating back to Rosenblatt's perceptron in 1958 [185], but the availability of fast simulation codes (e.g. Gromacs, HOOMD, LAMMPS, NAMD, OpenMM), cheap storage, inexpensive high-performance GPU hardware, and user-friendly neural network libraries (e.g. PyTorch, TensorFlow, Keras) created ideal conditions for this flare of creative new applications and supercharged the field. There has been a tandem development of enhanced sampling techniques for accelerated sampling of configurational space. Umbrella sampling is one of the earliest techniques still in use today [151], and is itself based on ideas published some ten years earlier by McDonald and Singer [186]. There has been an enormous proliferation of techniques since that time, based on a variety of approaches to enhance sampling [187]. Metadynamics [22], itself based on ideas from local elevation [188] and conformational flooding [189], has emerged as one of the most popular, flexible, and robust enhanced sampling techniques [190]. Enhanced sampling approaches have also benefited from the proliferation of deep learning technologies, and there are now a number of examples of ANN-based approaches to build biasing potentials for enhanced sampling [123–127].

Looking forward, we see a number of new frontiers and important challenges for machine learning-enabled CV discovery and enhanced sampling. First, with relatively few exceptions, many of the new tools are developed and validated on relatively small systems and tend not to be tested in applications to larger ones. Of course it is vital to validate new tools in testbed problems where the ground truth is known a priori, but demonstrating the efficacy of these approaches in applications to large biomolecules of technological, biological, or biomedical import is crucial in proving their potential in the context of impactful and challenging problems.

Second, applications of these approaches tend to focus on single protein molecules (e.g. peptide folding, membrane protein activation). There are very good reasons for this privileging of protein folding from historical – the protein folding problem is a long-standing and alluring challenge [191,192] – biological – there are unquestionably critical problems in protein folding of great biological, biotechnological, and therapeutic value [193,194] – and practical – the best-validated computational force fields and experimental crystal structures are available for proteins – perspectives, but there are also compelling and important problems in related areas such as peptide assembly, peptoid engineering, and nucleic acid folding. It is important to develop methods in the context of diverse applications, since methods developed for proteins are not always directly transferable and must be adapted to the specific vicissitudes of each system. For example, peptoid amide bonds occupy both cis and trans configurations (in contrast to those of peptides, which are almost exclusively trans), but the transition between them is a notorious rare event with a high free-energy barrier [195]. To paraphrase the Persian poet Ibn Yamin (1286–1367), these slow CVs are 'known unknowns', and CV discovery and acceleration must explicitly account for these effects to achieve good sampling and enable CV discovery to identify the 'unknown unknowns'.

Third, recent years have witnessed the convergence of CV discovery and enhanced sampling into integrated frameworks that are not beholden to the initial choice of CVs, but perform iterative CV refinement in tandem with accelerated phase space exploration, either through judicious initialisation of unbiased simulations [43,57,62] or the direct application of artificial bias [36,63]. These approaches have only been demonstrated for high-variance CVs, and it remains to demonstrate these iterative strategies for slow CVs. In the case of unbiased sampling, the challenge is to recover estimates of CVs for the equilibrium system from many short non-equilibrium runs, which may be possible using Koopman reweighting [78]. In the case of biased sampling, the challenge is to estimate unbiased CVs from biased trajectories, which may be possible using Girsanov reweighting [64,65]. It may also be beneficial to 'deflate' out undesired slow modes [196].

Fourth, the field can benefit from two current waves in machine learning that have come to be referred to as eXplainable Artificial Intelligence (XAI) and Physics-aware Artificial Intelligence (PAI) [197,198]. The degree of interpretability that we require of our CVs is largely a matter of context and taste: interpretability may not be a primary concern if our CVs are simply viewed as a means to enhance sampling, but it may be extremely desirable if we wish to understand mechanisms or learn transferable CVs appropriate for larger classes of systems. One way to achieve interpretability is to use simple (usually linear) models that are interpretable by construction (e.g. linear regression, linear SVMs), but frequently we wish to exploit the power and flexibility of modern tools (e.g. ANNs) without sacrificing interpretability. Very recently developed XAI tools such as layer-wise relevance propagation offer a means to achieve this goal and simultaneously detect and avoid so-called 'clever Hans' solutions that formulate a seemingly correct answer but for the wrong reasons [199,200]. PAI seeks to incorporate physical constraints and knowledge into the CV discovery process, and is extremely attractive for many reasons: the machine learning algorithms are given a 'warm start' by building upon prior understanding and knowledge of the system; the algorithms can function more robustly and work with noisier and smaller quantities of data since the model space is physically constrained; discovered CVs may be made more transferable to related systems; and the CV predictions can be guaranteed to satisfy particular physical constraints. PAI has proven somewhat difficult to realise in a generalisable way, but there have been recent successes in particular applications [201]. The rigorous enforcement of particular constraints and symmetries can be attractive in guaranteeing that the discovered CVs are consistent with the invariances, equivariances, and symmetries of the physical system (e.g. translational invariance, permutational invariance, rotational equivariance, energy conservation) [202,203].

Fifth, in a similar vein to PAI, it can be valuable to incorporate experimental constraints within the CV discovery process. One may wish to promote CVs consistent with some physical prior knowledge (e.g. burial of hydrophobic residues, known tertiary contact pairs), or to require that ensemble averages over the sampled phase space be consistent with measured experimental observables. Unlike hard physical constraints that should be rigorously obeyed, these experimental constraints may likely be incorporated in a softer manner through, for example, regularising Bayesian priors [204].

Sixth, the implementation and dissemination of open-source software and libraries implementing the CV discovery and sampling methods is critical in lowering the barrier to adoption by new users, guaranteeing reproducibility, promoting transparency, enabling community development and collaboration, and offering valuable pedagogical materials for new entrants into the field. The rising popularity of user-friendly Jupyter notebooks (https://jupyter.org/) and repository hosting sites such as GitHub (https://github.com) and Bitbucket (https://bitbucket.org/) has made code sharing simpler and easier than ever, and there are encouraging trends that doing so is becoming a cultural norm within the field.

In closing, we see many exciting and innovative challenges and opportunities on the horizon for this fast-moving field, and we look forward to the new developments that are sure to emerge in the coming years.

Disclosure statement

A.L.F. is a consultant of Evozyne and a co-author of US Provisional Patents 62/853,919 and 62/900,420.

Funding

This work was supported by MICCoM (Midwest Center for Computational Materials), as part of the Computational Materials Science Program funded by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division, and by the National Science Foundation under Grant No. CHE-1841805. H.S. acknowledges support from the National Science Foundation Molecular Software Sciences Institute (MolSSI) Software Fellows program (Grant No. ACI-1547580) [205,206].

ORCID

Andrew L. Ferguson  http://orcid.org/0000-0002-8829-9726
References

[1] D. Frenkel and B. Smit, Understanding Molecular Simulation: From Algorithms to Applications (Academic Press, San Diego, 2001).
[2] E.H. Lee, J. Hsin, M. Sotomayor, G. Comellas and K. Schulten, Structure 17 (10), 1295–1306 (2009).
[3] P.S. de Laplace, Introduction to Oeuvres, Vol. VII, Theorie Analytique de Probabilites (Gauthier-Villars, Paris, 1812).
[4] B. Alder and T. Wainwright, J. Chem. Phys. 27 (5), 1208–1209 (1957).
[5] B.J. Alder and T.E. Wainwright, J. Chem. Phys. 31 (2), 459–466 (1959).
[6] F.F. Abraham, R. Walkup, H. Gao, M. Duchaineau, T.D. De La Rubia and M. Seager, Proc. Natl. Acad. Sci. 99 (9), 5777–5782 (2002).
[7] F.F. Abraham, R. Walkup, H. Gao, M. Duchaineau, T.D. De La Rubia and M. Seager, Proc. Natl. Acad. Sci. 99 (9), 5783–5787 (2002).
[8] N. Tchipev, S. Seckler, M. Heinen, J. Vrabec, F. Gratl, M. Horsch, M. Bernreuther, C.W. Glass, C. Niethammer, N. Hammer, B. Krischok, M. Resch, D. Kranzlmüller, H. Hasse, H.J. Bungartz and P. Neumann, Int. J. High Perform. Comput. Appl. 33 (5), 838–854 (2019).
[9] D.E. Shaw, P. Maragakis, K. Lindorff-Larsen, S. Piana, R.O. Dror, M.P. Eastwood, J.A. Bank, J.M. Jumper, J.K. Salmon, Y. Shan and W. Wriggers, Science 330 (6002), 341–346 (2010).
[10] K. Lindorff-Larsen, S. Piana, R.O. Dror and D.E. Shaw, Science 334 (6055), 517–520 (2011).
[11] F. Noé, Biophys. J. 108 (2), 228 (2015).
[12] M. Karplus and G.A. Petsko, Nature 347 (6294), 631–639 (1990).
[13] C. Abrams and G. Bussi, Entropy 16 (1), 163–199 (2013).
[14] Y. Miao and J.A. McCammon, Mol. Simul. 42 (13), 1046–1055 (2016).
[15] C. Chipot and A. Pohorille, Free Energy Calculations (Springer, Berlin, 2007).
[16] J. Wang and A.L. Ferguson, Mol. Simul. 44 (13-14), 1090–1107 (2018).
[17] R.J. Allen, C. Valeriani and P.R. ten Wolde, J. Phys. 21 (46), 463102 (2009).
[18] E.E. Borrero and F.A. Escobedo, J. Chem. Phys. 127 (16), 164101 (2007).
[19] C. Wehmeyer, M.K. Scherer, T. Hempel, B.E. Husic, S. Olsson and F. Noé, Living J. Comput. Mol. Sci. 1 (1), 5965 (2018).
[20] B.E. Husic and V.S. Pande, J. Am. Chem. Soc. 140 (7), 2386–2396 (2018).
[21] J. Rogal, W. Lechner, J. Juraszek, B. Ensing and P.G. Bolhuis, J. Chem. Phys. 133 (17), 174109 (2010).
[22] A. Laio and M. Parrinello, Proc. Natl. Acad. Sci. 99 (20), 12562–12566 (2002).
[23] A. Laio and F.L. Gervasio, Rep. Prog. Phys. 71 (12), 126601 (2008).
[24] O. Valsson, P. Tiwary and M. Parrinello, Annu. Rev. Phys. Chem. 67, 159–184 (2016).
[25] S. Singh, M. Chopra and J.J. de Pablo, Annu. Rev. Chem. Biomol. Eng. 3 (1), 369–394 (2012).
[26] A. Ma and A.R. Dinner, J. Phys. Chem. B 109 (14), 6769–6779 (2005).
[27] S.B. Kim, C.J. Dsilva, I.G. Kevrekidis and P.G. Debenedetti, J. Chem. Phys. 142 (8), 02B613_1 (2015).
[28] F. Noé, A. Tkatchenko, K.R. Müller and C. Clementi, arXiv preprint arXiv:1911.02792 (2019).
[29] A.L. Ferguson, J. Phys. 30 (4), 043002 (2018).
[30] N.E. Jackson, M.A. Webb and J.J. de Pablo, Curr. Opin. Chem. Eng. 23, 106–114 (2019).
[31] K.T. Butler, D.W. Davies, H. Cartwright, O. Isayev and A. Walsh, Nature 559 (7715), 547–555 (2018).
[32] M.A. Rohrdanz, W. Zheng and C. Clementi, Annu. Rev. Phys. Chem. 64, 295–316 (2013).
[33] P. Tiwary and B.J. Berne, Proc. Natl. Acad. Sci. 113 (11), 2839–2844 (2016).
[34] M.M. Sultan and V.S. Pande, J. Chem. Phys. 149 (9), 094106 (2018).
[35] Z. Shamsi, K.J. Cheng and D. Shukla, arXiv preprint arXiv:1710.00495 (2017).
[36] W. Chen and A.L. Ferguson, J. Comput. Chem. 39 (25), 2079–2102 (2018).
[37] G.A. Tribello, M. Ceriotti and M. Parrinello, Proc. Natl. Acad. Sci. 109 (14), 5196–5201 (2012).
[38] B. Hashemian, D. Millán and M. Arroyo, J. Chem. Phys. 139 (21), 12B601_1 (2013).
[39] A.L. Ferguson, J. Comput. Chem. 38 (18), 1583–1605 (2017).
[40] A.L. Ferguson, A.Z. Panagiotopoulos, P.G. Debenedetti and I.G. Kevrekidis, Proc. Natl. Acad. Sci. 107 (31), 13597–13602 (2010).
[41] A.L. Ferguson, A.Z. Panagiotopoulos, I.G. Kevrekidis and P.G. Debenedetti, Chem. Phys. Lett. 509 (1), 1–11 (2011).
[42] W. Zheng, M.A. Rohrdanz and C. Clementi, J. Phys. Chem. B 117 (42), 12769–12776 (2013).
[43] J. Preto and C. Clementi, Phys. Chem. Chem. Phys. 16 (36), 19181–19191 (2014).
[44] T. Ichiye and M. Karplus, Proteins 11 (3), 205–217 (1991).
[45] A.E. García, Phys. Rev. Lett. 68 (17), 2696 (1992).
[46] S.T. Roweis and L.K. Saul, Science 290 (5500), 2323–2326 (2000).
[47] Z. Zhang and J. Wang, 2006, in Proceedings of the 19th International Conference on Neural Information Processing Systems, pp. 1593–1600.
[48] P. Das, M. Moll, H. Stamati, L.E. Kavraki and C. Clementi, Proc. Natl. Acad. Sci. 103 (26), 9885–9890 (2006).
[49] J.B. Tenenbaum, V. De Silva and J.C. Langford, Science 290 (5500), 2319–2323 (2000).
[50] K.Q. Weinberger and L.K. Saul, Int. J. Comput. Vis. 70 (1), 77–90 (2006).
[51] C.G. Li, J. Guo, G. Chen, X.F. Nie and Z. Yang, 2006, in 2006 International Conference on Machine Learning and Cybernetics, pp. 3201–3206.
[52] J. Wang, Geometric Structure of High-Dimensional Data and Dimensionality Reduction (Springer, Berlin, 2011).
[53] D.L. Donoho and C. Grimes, Proc. Natl. Acad. Sci. 100 (10), 5591–5596 (2003).
[54] M. Belkin and P. Niyogi, 2002, in Advances in Neural Information Processing Systems, pp. 585–591.
[55] A.L. Ferguson, A.Z. Panagiotopoulos, P.G. Debenedetti and I.G. Kevrekidis, J. Chem. Phys. 134 (13), 04B606 (2011).
[56] R.R. Coifman and S. Lafon, Appl. Comput. Harmon. Anal. 21 (1), 5–30 (2006).
[57] M.A. Rohrdanz, W. Zheng, M. Maggioni and C. Clementi, J. Chem. Phys. 134 (12), 03B624 (2011).
[58] M. Ceriotti, G.A. Tribello and M. Parrinello, Proc. Natl. Acad. Sci. 108 (32), 13023–13028 (2011).
[59] M. Ceriotti, G.A. Tribello and M. Parrinello, J. Chem. Theory Comput. 9 (3), 1521–1532 (2013).
[60] L. van der Maaten and G. Hinton, J. Mach. Learn. Res. 9 (Nov), 2579–2605 (2008).
[61] G.A. Tribello, M. Ceriotti and M. Parrinello, Proc. Natl. Acad. Sci. 107 (41), 17509–17514 (2010).
[62] E. Chiavazzo, R. Covino, R.R. Coifman, C.W. Gear, A.S. Georgiou, G. Hummer and I.G. Kevrekidis, Proc. Natl. Acad. Sci. 114 (28), E5494–E5503 (2017).
[63] W. Chen, A.R. Tan and A.L. Ferguson, J. Chem. Phys. 149 (7), 072312 (2018).
[64] L. Donati, C. Hartmann and B.G. Keller, J. Chem. Phys. 146 (24), 244112 (2017).
[65] L. Donati and B.G. Keller, J. Chem. Phys. 149 (7), 072335 (2018).
[66] J. Quer, L. Donati, B.G. Keller and M. Weber, SIAM J. Sci. Comput. 40 (2), A653–A670 (2018).
[67] H. Wu, F. Paul, C. Wehmeyer and F. Noé, Proc. Natl. Acad. Sci. 113 (23), E3221–E3230 (2016).
[68] J.D. Chodera, W.C. Swope, F. Noé, J.H. Prinz, M.R. Shirts and V.S. Pande, J. Chem. Phys. 134 (24), 06B612 (2011).
[69] J.H. Prinz, J.D. Chodera, V.S. Pande, W.C. Swope, J.C. Smith and F. Noé, J. Chem. Phys. 134 (24), 06B613 (2011).
[70] S. Klus, F. Nüske, P. Koltai, H. Wu, I. Kevrekidis, C. Schütte and F. Noé, J. Nonlinear Sci. 28 (3), 985–1010 (2018).
[71] S. Klus, P. Koltai and C. Schütte, arXiv preprint arXiv:1512.05997 (2015).
[72] M.O. Williams, C.W. Rowley and I.G. Kevrekidis, arXiv preprint arXiv:1411.2260 (2014).
[73] K.K. Chen, J.H. Tu and C.W. Rowley, J. Nonlinear Sci. 22 (6), 887–915 (2012).
[74] C.W. Rowley, I. Mezić, S. Bagheri, P. Schlatter and D.S. Henningson, 2010, in Seventh IUTAM Symposium on Laminar-Turbulent Transition, pp. 43–50.
[75] M.S. Hemati, C.W. Rowley, E.A. Deem and L.N. Cattafesta, Theor. Comput. Fluid Dyn. 31 (4), 349–368 (2017).
[76] M.O. Williams, M.S. Hemati, S.T. Dawson, I.G. Kevrekidis and C.W. Rowley, IFAC-PapersOnLine 49 (18), 704–709 (2016).
[77] M.O. Williams, I.G. Kevrekidis and C.W. Rowley, J. Nonlinear Sci. 25 (6), 1307–1346 (2015).
[78] H. Wu, F. Nüske, F. Paul, S. Klus, P. Koltai and F. Noé, J. Chem. Phys. 146 (15), 154104 (2017).
[79] H. Wu and F. Noé, arXiv preprint arXiv:1707.04659 (2017).
[80] A. Mardt, L. Pasquali, H. Wu and F. Noé, Nat. Commun. 9 (1), 5 (2018).
[81] S. Hanke, S. Peitz, O. Wallscheid, S. Klus, J. Böcker and M. Dellnitz, arXiv preprint arXiv:1804.00854 (2018).
[82] S. Peitz and S. Klus, Automatica 106, 184–191 (2019).
[83] F. Noé and F. Nüske, Multiscale Model. Simul. 11 (2), 635–655 (2013).
[84] F. Nüske, B.G. Keller, G. Pérez-Hernández, A.S.J.S. Mey and F. Noé, J. Chem. Theory Comput. 10 (4), 1739–1752 (2014).
[85] G. Pérez-Hernández, F. Paul, T. Giorgino, G. De Fabritiis and F. Noé, J. Chem. Phys. 139 (1), 07B604_1 (2013).
[86] C.R. Schwantes and V.S. Pande, J. Chem. Theory Comput. 9 (4), 2000–2009 (2013).
[87] C.R. Schwantes and V.S. Pande, J. Chem. Theory Comput. 11 (2), 600–608 (2015).
[88] M.P. Harrigan and V.S. Pande, bioRxiv 123752 (2017).
[89] J.H. Tu, C.W. Rowley, D.M. Luchtenburg, S.L. Brunton and J.N. Kutz, arXiv preprint arXiv:1312.0041 (2013).
[90] M.S. Hemati, M.O. Williams and C.W. Rowley, Phys. Fluids 26 (11), 111701 (2014).
[91] J.L. Proctor, S.L. Brunton and J.N. Kutz, SIAM J. Appl. Dyn. Syst. 15 (1), 142–161 (2016).
[92] J.N. Kutz, S.L. Brunton, B.W. Brunton and J.L. Proctor, Dynamic Mode Decomposition: Data-Driven Modeling of Complex Systems (SIAM, Philadelphia, 2016).
[93] J.N. Kutz, X. Fu and S.L. Brunton, SIAM J. Appl. Dyn. Syst. 15 (2), 713–735 (2016).
[94] J.N. Kutz, X. Fu, S.L. Brunton and N.B. Erichson, 2015, in 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), pp. 921–929.
[95] S. Klus, P. Gelß, S. Peitz and C. Schütte, Nonlinearity 31 (7), 3359 (2018).
[96] Q. Li, F. Dietrich, E.M. Bollt and I.G. Kevrekidis, Chaos 27 (10), 103111 (2017).
[97] M. Korda and I. Mezić, J. Nonlinear Sci. 28 (2), 687–710 (2018).
[98] H. Hotelling, Biometrika 28, 321–377 (1936).
[99] G. Froyland, Nonlinearity 12 (1), 79 (1999).
[100] J. Ding and A. Zhou, Physica D 92 (1–2), 61–68 (1996).
[101] S.M. Ulam, A Collection of Mathematical Problems, Vol. 8 (Interscience Publishers, Geneva, 1960).
[102] G. Froyland, G.A. Gottwald and A. Hammerlindl, SIAM J. Appl. Dyn. Syst. 13 (4), 1816–1846 (2014).
[103] G. Andrew, R. Arora, J. Bilmes and K. Livescu, 2013, in International Conference on Machine Learning, pp. 1247–1255.
[104] F. Noé, arXiv preprint arXiv:1812.07669 (2018).
[105] W. Chen, H. Sidky and A.L. Ferguson, J. Chem. Phys. 150 (21), 214114 (2019).
[106] H. Sidky, W. Chen and A.L. Ferguson, J. Phys. Chem. B 123 (38), 7999–8009 (2019).
[107] C. Wehmeyer and F. Noé, J. Chem. Phys. 148 (24), 241703 (2018).
[108] W. Chen, H. Sidky and A.L. Ferguson, J. Chem. Phys. 151, 064123 (2019).
[109] C.X. Hernández, H.K. Wayment-Steele, M.M. Sultan, B.E. Husic and V.S. Pande, Phys. Rev. E 97 (6), 062412 (2018).
[110] M.M. Sultan, H.K. Wayment-Steele and V.S. Pande, J. Chem. Theory Comput. 14 (4), 1887–1894 (2018).
[111] K.L. Priddy and P.E. Keller, Artificial Neural Networks: An Introduction, Vol. 68 (SPIE Press, Bellingham, 2005).
[112] M.H. Hassoun, Fundamentals of Artificial Neural Networks (MIT Press, Cambridge, MA, 1995).
[113] T. Chen and H. Chen, IEEE Trans. Neural Netw. 6 (4), 911–917 (1995).
[114] R. Hecht-Nielsen, Neural Networks for Perception (Elsevier, Amsterdam, 1992), pp. 65–93.
[115] L. Bottou, Neural Networks: Tricks of the Trade (Springer, Berlin, 2012), pp. 421–436.
[116] D.P. Kingma and J. Ba, arXiv preprint arXiv:1412.6980 (2014).
[117] A. Krizhevsky, I. Sutskever and G.E. Hinton, 2012, in Advances in Neural Information Processing Systems, pp. 1097–1105.
[118] Y. LeCun, P. Haffner, L. Bottou and Y. Bengio, Shape, Contour and Grouping in Computer Vision (Springer, Berlin, 1999), pp. 319–345.
[119] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning (MIT Press, Cambridge, MA, 2016).
[120] M.A. Kramer, AIChE J. 37 (2), 233–243 (1991).
[121] D.P. Kingma and M. Welling, arXiv preprint arXiv:1312.6114 (2014).
[122] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio, in Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence and K.Q. Weinberger (Curran Associates, Inc., Red Hook, 2014), pp. 2672–2680.
[123] H. Sidky and J.K. Whitmer, J. Chem. Phys. 148 (10), 104111 (2018).
[124] R. Galvelis and Y. Sugita, J. Chem. Theory Comput. 13 (6), 2489–2500 (2017).
[125] E. Schneider, L. Dai, R.Q. Topper, C. Drechsel-Grau and M.E. Tuckerman, Phys. Rev. Lett. 119 (15), 150601 (2017).
[126] A.Z. Guo, E. Sevgen, H. Sidky, J.K. Whitmer, J.A. Hubbell and J.J. de Pablo, J. Chem. Phys. 148 (13), 134108 (2018).
[127] L. Bonati, Y.Y. Zhang and M. Parrinello, arXiv preprint arXiv:1904.01305 (2019).
[128] J. Behler and M. Parrinello, Phys. Rev. Lett. 98, 146401 (2007).
[129] R.Z. Khaliullin, H. Eshet, T.D. Kühne, J. Behler and M. Parrinello, Phys. Rev. B 81, 100103 (2010).
[130] Z. Li, J.R. Kermode and A. De Vita, Phys. Rev. Lett. 114, 096405 (2015).
[131] J. Wang, S. Olsson, C. Wehmeyer, A. Pérez, N.E. Charron, G. de Fabritiis, F. Noé and C. Clementi, ACS Cent. Sci. 5 (5), 755–767 (2019).
[132] F. Noé, S. Olsson, J. Köhler and H. Wu, Science 365 (6457), eaaw1147 (2019).
[133] D. Bhowmik, S. Gao, M.T. Young and A. Ramanathan, BMC Bioinformatics 19 (S18), 429 (2018).
[134] H. Wu, A. Mardt, L. Pasquali and F. Noé, 2018, in Advances in Neural Information Processing Systems, pp. 3975–3984.
[135] E.N. Feinberg, D. Sur, Z. Wu, B.E. Husic, H. Mai, Y. Li, S. Sun, J. Yang, B. Ramsundar and V.S. Pande, ACS Cent. Sci. 4 (11), 1520–1530 (2018).
[136] K.T. Schütt, H.E. Sauceda, P.J. Kindermans, A. Tkatchenko and K.R. Müller, J. Chem. Phys. 148 (24), 241722 (2018).
[137] C.R. Qi, H. Su, K. Mo and L.J. Guibas, 2017, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
[138] R.S. DeFever, C. Targonski, S.W. Hall, M.C. Smith and S. Sarupria, Chem. Sci. 10 (32), 7503–7515 (2019).
[139] R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner and S.W. Zucker, Proc. Natl. Acad. Sci. 102 (21), 7426–7431 (2005).
[140] A.W. Long and A.L. Ferguson, J. Phys. Chem. B 118 (15), 4228–4244 (2014).
[141] R.R. Coifman, Y. Shkolnisky, F.J. Sigworth and A. Singer, IEEE Trans. Image Process. 17 (10), 1891–1899 (2008).
[142] A.W. Long and A.L. Ferguson, Appl. Comput. Harmon. Anal. 47 (1), 190–211 (2019).
[143] J. Wang and A.L. Ferguson, Macromolecules 51 (2), 598–616 (2018).
[144] E.J. Nyström, Über die Praktische Auflösung von Linearen Integralgleichungen mit Anwendungen auf Randwertaufgaben der Potentialtheorie (Akademische Buchhandlung, Freiberg, 1929).
[145] N. Rabin and R.R. Coifman, 2012, in Proceedings of the 2012 SIAM International Conference on Data Mining, pp. 189–199.
[146] E. Chiavazzo, C. Gear, C. Dsilva, N. Rabin and I. Kevrekidis, Processes 2 (1), 112–140 (2014).
[147] B. Peters and B.L. Trout, J. Chem. Phys. 125 (5), 054108 (2006).
[148] B. Peters, G.T. Beckham and B.L. Trout, J. Chem. Phys. 127 (3), 034109 (2007).
[149] T. Berry, J.R. Cressman, Z. Greguric-Ferencek and T. Sauer, SIAM J. Appl. Dyn. Syst. 12 (2), 618–649 (2013).
[150] R.A. Mansbach and A.L. Ferguson, J. Chem. Phys. 142 (10), 03B607_1 (2015).
[151] G.M. Torrie and J.P. Valleau, J. Comput. Phys. 23 (2), 187–199 (1977).
[152] V.S. Pande, K. Beauchamp and G.R. Bowman, Methods 52 (1), 99–105 (2010).
[153] G.R. Bowman, V.S. Pande and F. Noé, An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation, Vol. 797 (Springer Science & Business Media, Berlin, 2013).
[154] J.D. Chodera and F. Noé, Curr. Opin. Struct. Biol. 25, 135–144 (2014).
[155] J.H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J.D. Chodera, C. Schütte and F. Noé, J. Chem. Phys. 134 (17), 174105 (2011).
[156] A.S. Mey, H. Wu and F. Noé, Phys. Rev. X 4 (4), 041018 (2014).
[157] F. Nüske, H. Wu, J.H. Prinz, C. Wehmeyer, C. Clementi and F. Noé, J. Chem. Phys. 146 (9), 094104 (2017).
[158] S. Röblitz and M. Weber, Adv. Data Anal. Classif. 7 (2), 147–179 (2013).
[159] L. Molgedey and H.G. Schuster, Phys. Rev. Lett. 72 (23), 3634 (1994).
[160] T. Blaschke, P. Berkes and L. Wiskott, Neural Comput. 18 (10), 2495–2508 (2006).
[161] F. Noé and C. Clementi, J. Chem. Theory Comput. 11 (10), 5002–5011 (2015).
[162] F. Noé, R. Banisch and C. Clementi, J. Chem. Theory Comput. 12 (11), 5620–5630 (2016).
[163] G. Pérez-Hernández and F. Noé, J. Chem. Theory Comput. 12 (12), 6118–6129 (2016).
[164] K. Pearson, Lond. Edinb. Dubl. Phil. Mag. J. Sci. 2 (11), 559–572 (1901).
[165] H.J. Woo and B. Roux, Proc. Natl. Acad. Sci. 102 (19), 6825–6830 (2005).
[166] M.M. Sultan and V.S. Pande, J. Chem. Theory Comput. 13 (6), 2440–2447 (2017).
[167] J. McCarty and M. Parrinello, J. Chem. Phys. 147 (20), 204109 (2017).
[168] C.R. Laing, T.A. Frewen and I.G. Kevrekidis, Nonlinearity 20 (9), 2127 (2007).
[169] A.W. Long and A.L. Ferguson, Mol. Syst. Des. Eng. 3 (1), 49–65 (2018).
[170] Y. Ma and A.L. Ferguson, Soft Matter 15 (43), 8808–8826 (2019).
[171] G. Mishne, U. Shaham, A. Cloninger and I. Cohen, Appl. Comput. Harmon. Anal. 1, 1–27 (2017).
[172] E. Darve, D. Rodríguez-Gómez and A. Pohorille, J. Chem. Phys. 128 (14), 144120 (2008).
[173] B. Hashemian, D. Millán and M. Arroyo, J. Chem. Phys. 145 (17), 174109 (2016).
[174] J.M.L. Ribeiro, P. Bravo, Y. Wang and P. Tiwary, J. Chem. Phys. 149 (7), 072301 (2018).
[175] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed and A. Lerchner, ICLR 2 (5), 6 (2017).
[176] C.P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins and A. Lerchner, arXiv preprint arXiv:1804.03599 (2018).
[177] J.M.L. Ribeiro and P. Tiwary, J. Chem. Theory Comput. 15 (1), 708–719 (2018).
[178] M.M. Sultan and V.S. Pande, arXiv preprint arXiv:1802.10510 (2018).
[179] D. Mendels, G. Piccini and M. Parrinello, J. Phys. Chem. Lett. 9 (11), 2776–2781 (2018).
[180] G. Piccini, D. Mendels and M. Parrinello, J. Chem. Theory Comput. 14 (10), 5040–5044 (2018).
[181] D. Mendels, G. Piccini, Z.F. Brotzakis, Y.I. Yang and M. Parrinello, J. Chem. Phys. 149 (19), 194113 (2018).
[182] D.J. Griffiths and D.F. Schroeter, Introduction to Quantum Mechanics (Cambridge University Press, Cambridge, 2018).
[183] A. Szabo and N.S. Ostlund, Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory (Courier Corporation, Chelmsford, MA, 2012).
[184] M.K. Scherer, B.E. Husic, M. Hoffmann, F. Paul, H. Wu and F. Noé, arXiv preprint arXiv:1811.11714 (2018).
[185] F. Rosenblatt, Psychol. Rev. 65 (6), 386 (1958).
[186] I. McDonald and K. Singer, J. Chem. Phys. 47 (11), 4766–4772 (1967).
[187] Y.I. Yang, Q. Shao, J. Zhang, L. Yang and Y.Q. Gao, J. Chem. Phys. 151 (7), 070902 (2019).
[188] T. Huber, A.E. Torda and W.F. van Gunsteren, J. Comput. Aided Mol. Des. 8 (6), 695–708 (1994).
[189] H. Grubmüller, Phys. Rev. E 52 (3), 2893 (1995).
[190] A. Barducci, M. Bonomi and M. Parrinello, Wiley Interdiscip. Rev. Comput. Mol. Sci. 1 (5), 826–843 (2011).
[191] K.A. Dill, S.B. Ozkan, M.S. Shell and T.R. Weikl, Annu. Rev. Biophys. 37, 289–316 (2008).
[192] K.A. Dill and J.L. MacCallum, Science 338 (6110), 1042–1046 (2012).
[193] G.A. Khoury, J. Smadbeck, C.A. Kieslich and C.A. Floudas, Trends Biotechnol. 32 (2), 99–109 (2014).
[194] N. Chennamsetty, V. Voynov, V. Kayser, B. Helk and B.L. Trout, Proc. Natl. Acad. Sci. 106 (29), 11937–11942 (2009).
[195] L.J. Weiser and E.E. Santiso, AIMS Mater. Sci. 4 (5), 1029–1051 (2017).
[196] B.E. Husic and F. Noé, J. Chem. Phys. 151 (5), 054103 (2019).
[197] A.L. Ferguson, ACS Cent. Sci. 4 (8), 938–941 (2018).
[198] W. Samek, Explainable AI: Interpreting, Explaining and Visualizing Deep Learning (Springer Nature, London, 2019).
[199] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.R. Müller and W. Samek, PLoS One 10 (7), e0130140 (2015).
[200] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek and K.R. Müller, Nat. Commun. 10 (1), 60 (2019).
[201] T. Beucler, M. Pritchard, S. Rasp, P. Gentine, J. Ott and P. Baldi, arXiv preprint arXiv:1909.00912 (2019).
[202] M. Weiler, M. Geiger, M. Welling, W. Boomsma and T. Cohen, 2018, in Advances in Neural Information Processing Systems, pp. 10381–10392.
[203] B. Anderson, T.S. Hy and R. Kondor, arXiv preprint arXiv:1906.04015 (2019).
[204] J.W. Pitera and J.D. Chodera, J. Chem. Theory Comput. 8 (10), 3445–3451 (2012).
[205] A. Krylov, T.L. Windus, T. Barnes, E. Marin-Rimoldi, J.A. Nash, B. Pritchard, D.G. Smith, D. Altarawy, P. Saxe, C. Clementi, T.D. Crawford, R.J. Harrison, S. Jha, V.S. Pande and T. Head-Gordon, J. Chem. Phys. 149 (18), 180901 (2018).
[206] N. Wilkins-Diehr and T.D. Crawford, Comput. Sci. Eng. 20 (5), 26–38 (2018).