MRP 2020: The Second Shared Task on Cross-Framework and Cross-Lingual Meaning Representation Parsing

Stephan Oepen[1], Omri Abend[2], Lasha Abzianidze[3], Johan Bos[4], Jan Hajič[5], Daniel Hershcovich[6], Bin Li[7], Tim O'Gorman[8], Nianwen Xue[9], and Daniel Zeman[5]

[1] University of Oslo, Department of Informatics
[2] The Hebrew University of Jerusalem, School of Computer Science and Engineering
[3] Utrecht University, UiL OTS
[4] University of Groningen, Center for Language and Cognition
[5] Charles University, Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
[6] University of Copenhagen, Department of Computer Science
[7] Nanjing Normal University, School of Chinese Language and Literature
[8] University of Massachusetts at Amherst, College of Information and Computer Sciences
[9] Brandeis University, Department of Computer Science
[email protected]Abstract resentation in graph form in a uniform training and evaluation setup. The 2020 Shared Task at the Conference for Key differences in the 2020 edition of the task Computational Language Learning (CoNLL) include the addition of a graph-based encoding was devoted to Meaning Representation Pars- ing (MRP) across frameworks and languages. of Discourse Representation Structures (dubbed Extending a similar setup from the previous DRG); a generalization of Prague Tectogrammati- year, five distinct approaches to the represen- cal Graphs (to include more information from the tation of sentence meaning in the form of di- original annotations); and a separate cross-lingual rected graphs were represented in the English track, introducing one extra language (beyond En- training and evaluation data for the task, pack- glish) for four of the frameworks involved.1 aged in a uniform graph abstraction and serial- Participants were invited to develop parsing ization; for four of these representation frame- works, additional training and evaluation data systems that support five distinct semantic graph was provided for one additional language per frameworks in four languages (see §3 below)— framework. The task received submissions all encoding core predicate–argument structure, from eight teams, of which two do not par- among other things—in the same implementation. ticipate in the official ranking because they ar- Ideally, these parsers predict sentence-level mean- rived after the closing deadline or made use of ing representations in all frameworks in parallel. additional training data. All technical informa- Architectures utilizing complementary knowledge tion regarding the task, including system sub- missions, official results, and links to support- sources (e.g. via parameter sharing) were encour- ing resources and software are available from aged, though not required. 
Learning from multiple the task web site at: flavors of meaning representation in tandem has hardly been explored (with notable exceptions, e.g. http://mrp.nlpl.eu the parsers of Peng et al., 2017; Hershcovich et al., 1 Background and Motivation 2018; Stanovsky and Dagan, 2018; or Lindemann et al., 2019). The 2020 Conference on Computational Language The task design aims to reduce framework- Learning (CoNLL) hosts a shared task (or ‘system specific ‘balkanization’ in the field of meaning bake-off’) on Cross-Framework Meaning Repre- representation parsing. Its contributions include sentation Parsing (MRP 2020), which is a revised 1 and extended re-run of a similar CoNLL shared task To reduce the threshold to participation, two of the target frameworks represented in MRP 2019 are not in focus this in the preceding year. The goal of these tasks is to year, viz. the purely bi-lexical DELPH-IN MRS Bi-Lexical De- advance data-driven parsing into graph-structured pendencies and Prague Semantic Dependencies (PSD). These representations of sentence meaning. For the first graphs largely overlap with the corresponding (but richer) frameworks in 2020, EDS and PTG, respectively, and the time, the MRP task series combines formally and original bi-lexical semantic dependency graphs remain inde- linguistically different approaches to meaning rep- pendently available (Oepen et al., 2015). 1 Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, pages 1–22 Online, Nov. 19-20, 2020. c 2020 Association for Computational Linguistics (a) a unifying formal model over different seman- tute ordered graphs. A natural way to visualize a tic graph banks (§2), (b) uniform representations bi-lexical dependency graph is to draw its edges and scoring (§4 and §6), (c) contrastive evaluation as semicircles in the halfplane above the sentence. 
across frameworks (§5), and (d) increased cross- An ordered graph is called noncrossing if in such a fertilization of parsing approaches (§7). drawing, the semicircles intersect only at their end- points (this property is a natural generalization of 2 Definitions: Graphs and Flavors projectivity as it is known from dependency trees). Reflecting different traditions and communities, A natural generalization of the noncrossing prop- there is wide variation in how individual meaning erty, where one is allowed to also use the halfplane representation frameworks think (and talk) about below the sentence for drawing edges is a prop- semantic graphs, down to the level of visual conven- erty called pagenumber two. Kuhlmann and Oepen tions used in rendering graph structures. Increased (2016) provide additional definitions and a quanti- terminological uniformity and guidance in how to tative summary of various formal graph properties navigate this rich and diverse landscape are among across frameworks. the desirable side-effects of the MRP task series. Hierarchy of Formal Flavors In the context of The following paragraphs provide semi-formal def- the MRP shared task series, we have previously de- initions of core graph-theoretic concepts that can fined different flavors of semantic graphs based on be meaningfully applied across the range of frame- the nature of the relationship they assume between works represented in the shared task. the linguistic surface signal (typically a written Basic Terminology Semantic graphs (across dif- sentence, i.e. a string) and the nodes of the graph ferent frameworks) can be viewed as directed (Oepen et al., 2019). We refer to this relation as graphs or digraphs. A semantic digraph is a anchoring (of nodes onto sub-strings); other com- triple (T, N, E) where N is a set of nodes and monly used terms include alignment, correspon- E ⊆ N × N is a set of edges. The in- and out- dence, or lexicalization. 
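The noncrossing property described here can be checked directly from the semicircle picture: two edges drawn above the sentence cross exactly when their endpoints strictly interleave. A minimal sketch of that test (the function name and edge encoding are our own, not part of the task's official tooling):

```python
def noncrossing(edges):
    """Check whether an ordered graph is noncrossing.

    `edges` are pairs of 0-based node positions; each edge is drawn as a
    semicircle above the sentence.  Two semicircles cross iff their
    endpoints strictly interleave: i < k < j < l.
    """
    spans = [tuple(sorted(e)) for e in edges]
    for a, (i, j) in enumerate(spans):
        for k, l in spans[a + 1:]:
            if i < k < j < l or k < i < l < j:
                return False
    return True

assert noncrossing([(0, 2), (2, 4)])       # edges sharing an endpoint
assert noncrossing([(0, 3), (1, 2)])       # nested edges do not cross
assert not noncrossing([(0, 2), (1, 3)])   # interleaved endpoints cross
```

Pagenumber two relaxes this by allowing each edge to be assigned to either half-plane, so a graph has pagenumber two when its edges can be 2-colored such that each color class is noncrossing.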
degree of a node count the number of edges arriving Flavor (0) is characterized by the strongest form at or leaving from the node, respectively. In con- of anchoring, obtained in bi-lexical dependency trast to the unique root node in trees, graphs can graphs, where graph nodes injectively correspond have multiple (structural) roots, which we define as to surface lexical units (i.e. tokens or ‘words’). In nodes with in-degree zero. The majority of seman- such graphs, each node is directly linked to one tic graphs are structurally multi-rooted. Thus, we specific token (conversely, there may be semanti- distinguish one or several nodes in each graph as cally empty tokens), and the nodes inherit the linear top nodes, T ⊆ N ; the top(s) correspond(s) to the order of their corresponding tokens. most central semantic entities in the graph, usually Flavor (1) includes a more general form of an- the main predication(s). chored semantic graphs, characterized by relaxing In a tree, every node except the root has in- the correspondence between nodes and tokens, al- degree one. In semantic graphs, nodes can have lowing arbitrary parts of the sentence (e.g. sub- in-degree two or higher (indicating shared argu- token or multi-token sequences) as node anchors, ments), which constitutes a reentrancy in the graph. as well as unanchored nodes, or multiple nodes In contrast to trees, general digraphs may contain anchored to overlapping sub-strings. These graphs cycles, i.e. a directed path leading from a node to afford greater flexibility in the representation of itself. Another central property of trees is that they meaning contributed by, for example, (derivational) are connected, meaning that there exists an undi- affixes or phrasal constructions and facilitate lexi- rected path between any pair of nodes. In contrast, cal decomposition (e.g. of causatives or compara- semantic graphs need not generally be connected. tives). 
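The basic notions above (structural roots via in-degree zero, top nodes, reentrancy via in-degree two or higher) are easy to operationalize over an edge list. A sketch using our own helper naming:

```python
from collections import Counter

def analyze(nodes, edges, tops):
    """Summarize a semantic digraph (T, N, E) in the MRP terminology:
    structural roots are nodes with in-degree zero; a node with
    in-degree >= 2 is reentrant (a shared argument)."""
    indeg = Counter()
    for _src, tgt in edges:
        indeg[tgt] += 1
    return {
        "roots": [n for n in nodes if indeg[n] == 0],
        "reentrant": [n for n in nodes if indeg[n] >= 2],
        "tops": list(tops),
    }

# Two predicates sharing one argument: node 2 is reentrant,
# and the graph is structurally multi-rooted.
summary = analyze(nodes=[0, 1, 2], edges=[(0, 2), (1, 2)], tops=[0])
assert summary["roots"] == [0, 1]
assert summary["reentrant"] == [2]
```

Note that tops are a designated subset T ⊆ N and need not coincide with the structural roots.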
Finally, in some semantic graph frameworks Finally, Flavor (2) semantic graphs do not con- there is a (total) linear order on the nodes, typi- sider the correspondence between nodes and the cally (though not necessarily) induced by the sur- surface string as part of the representation of mean- face order of corresponding tokens. Such graphs ing (thus backgrounding notions of derivation and are conventionally called bi-lexical dependencies compositionality). Such semantic graphs are sim- (probably deriving from a notion of lexicalization ply unanchored. articulated by Eisner, 1997) and formally consti- While different flavors refer to formally defined 2 _almost_a_1 〈23:29〉 ARG1 _impossible_a_for comp 〈30:40〉 〈2:9〉 TENSE pres ARG1 ARG1 _similar_a_to _a_q _apply_v_to udef_q _other_a_1 _such+as_p udef_q 〈2:9〉 〈0:1〉 〈44:49〉 〈53:100〉 〈53:58〉 〈66:73〉 〈74:100〉 ARG1 BV ARG2 ARG3 BV ARG1 ARG1 ARG2 BV _technique_n_1 _crop_n_1 implicit_conj udef_q udef_q 〈10:19〉 〈59:65〉 〈82:100〉 〈74:81〉 〈82:100〉 NUM sg NUM pl NUM pl BV L-INDEX R-INDEX BV _and_c _cotton_n_1 udef_q udef_q 〈91:94〉 〈74:81〉 〈82:90〉 〈95:100〉 NUM pl BV L-INDEX R-INDEX BV _soybean_n_unknown _rice_n_1 〈82:90〉 〈95:100〉 NUM pl Figure 1: Semantic dependency graphs for the running example A similar technique is almost impossible to apply to other crops, such as cotton, soybeans and rice: Elementary Dependency Structures (EDS). Node properties are indicated as two-column records below the node labels. sub-classes of semantic graphs, we reserve the 3 Meaning Representation Frameworks term framework for specific linguistic approaches to graph-based meaning representation (typically The shared task combines five distinct frameworks encoded in a particular graph flavor, of course). for graph-based meaning representation, each with However, the coarse classification into three dis- its specific formal and linguistic assumptions. 
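One way to read the flavor hierarchy operationally: Flavor (0) requires an injective node-to-token mapping, Flavor (2) no anchoring at all, and everything in between is Flavor (1). A heuristic sketch (our own simplification; in the shared task each framework is assigned its flavor by design, not detected from data):

```python
def flavor(anchors):
    """Heuristically classify a graph by its node anchoring.

    `anchors` maps each node id to a list of token indices (empty list =
    unanchored).  Flavor (0): every node anchored to exactly one token,
    injectively.  Flavor (2): no node anchored at all.  Otherwise (1).
    """
    if all(not a for a in anchors.values()):
        return 2
    single = [a[0] for a in anchors.values() if len(a) == 1]
    if (len(single) == len(anchors)              # one token per node
            and len(set(single)) == len(single)):  # injective
        return 0
    return 1

assert flavor({0: [], 1: []}) == 2        # unanchored, AMR/DRG-like
assert flavor({0: [0], 1: [1]}) == 0      # bi-lexical dependencies
assert flavor({0: [0, 1], 1: []}) == 1    # mixed anchoring, EDS/PTG-like
```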
This tinct flavors does not fully account for the variabil- section reviews the frameworks and presents En- ity of anchoring relations observed across frame- glish example graphs for sentence #20209013 from works. For example, graphs can be partially an- the venerable Wall Street Journal (WSJ) Corpus chored, meaning that only a subset of nodes are from the Penn Treebank (PTB; Marcus et al., explicitly linked to the surface string; the anchor- 1993): ing relations that are present, can in turn stand in (1) A similar technique is almost impossible to one-to-one correspondence to surface tokens, or apply to other crops, such as cotton, soybeans allow overlapping and sub-token or phrasal rela- and rice. tionships. At the same time, a framework may impose a total ordering of nodes independent (or The example exhibits some interesting linguis- possibly only partly dependent) on anchoring. We tic complexity, including what is called a tough will interpret Flavors (0) and (2) strictly, as fully adjective (impossible), a scopal adverb (almost), a lexically anchored and wholly unanchored, respec- tripartite coordinate structure, and apposition. The tively, leading to the categorization of mixed forms example graphs in Figures 1 through 4 are pre- of anchoring as Flavor (1), and allow for the pres- where unanchored nodes for unexpressed material beyond the ence of ordered graphs, in principle at least, at all surface string can be postulated (Schuster and Manning, 2016). levels of the hierarchy.2 Whether or not these nodes occupy a well-defined position in the otherwise total order of basic UD nodes remains an open 2 Albeit in the realm of syntactic structure, the popular Uni- question, but either way the presence of unanchored nodes versal Dependencies (UD; Nivre et al., 2020) initiative is cur- will take enhanced UD graphs beyond the bi-lexical Flavor (0) rently exploring the introduction of ‘enhanced’ dependencies, graphs in our terminology. 
3 sented in order of (arguably) increasing ‘abstrac- two-place such+as p relation, as well as the tion’ from the surface string, i.e. ranging from fully implicit conj(unction) relation (which reflects re- anchored Flavor (1) to unanchored Flavor (2). cursive decomposition of the coordinate structure into binary predications) do not correspond to indi- Elementary Dependency Structures The EDS vidual surface tokens (but are anchored on larger graphs (Oepen and Lønning, 2006) originally spans, overlapping with anchors from other nodes). derive from the underspecified logical forms Conversely, the two nodes associated with similar computed by the English Resource Grammar indicate lexical decomposition as a comparative (Flickinger et al., 2017; Copestake et al., 2005). predicate, where the second argument of the comp These logical forms are not in and of themselves relation (the ‘point of reference’) remains unex- semantic graphs (in the sense of §2 above) and pressed in Example (1). are often refered to as English Resource Semantics (ERS; Bender et al., 2015).3 Elementary Depen- Prague Tectogrammatical Graphs These dency Structures (EDS; Oepen and Lønning, 2006) graphs present a conversion from the multi-layered encode English Resource Semantics in a variable- (and somewhat richer) annotations in the tradition free semantic dependency graph—not limited to of Prague Functional Generative Description bi-lexical dependencies—where graph nodes corre- (FGD; Sgall et al., 1986), as adopted (among spond to logical predications and edges to labeled others) in the Prague Czech–English Dependency argument positions. The EDS conversion from Treebank (PCEDT; Hajič et al., 2012) and Prague underspecified logical forms to directed graphs dis- Dependency Treebank (PDT; Böhmová et al., cards partial information on semantic scope from 2003). 
For more details on how the graphs are the full ERS, which makes these graphs abstractly— obtained from the original annotations, see Zeman if not linguistically—similar to Abstract Meaning and Hajič (2020). Representation (see below). The PTG structures essentially recast core pred- Nodes in EDS are in principle independent of icate–argument structure in the form of mostly surface lexical units, but for each node there is an anchored dependency graphs, albeit introducing explicit, many-to-many anchoring onto sub-strings ‘empty’ (or generated, in FGD terminology) nodes, of the underlying sentence. Thus, EDS instanti- for which there is no corresponding surface token. ates Flavor (1) in our hierarchy of different formal Thus, these partially anchored representations in- types of semantic graphs and, more specfically, stantiate Flavor (1) in our hierarchy of different are fully anchored but unordered. Avoiding a one- formal types of semantic graphs, where anchoring to-one correspondence between graph nodes and relations can be discontinuous: For example, the surface lexical units enables EDS to adequately rep- technique node in Figure 2 is anchored to both the resent, among other things, lexical decomposition noun and its indefinite determiner a. PTG struc- (e.g. of comparatives), sub-lexical or construction tures assume a total order of nodes, which provides semantics (e.g. corresponding to morphological the foundation for an underlying theory of topic– derivation or syntactic compounding, respectively), focus articulation, as proposed by Hajičová et al. and covert (e.g. elided) meaning contributions. All (1998). nodes in the example EDS in Figure 1 make explicit The PTG structure for our running example has their anchoring onto sub-strings of the underlying many of the same dependency edges as the EDS input, for example span h2 : 9i for similar. 
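EDS anchors are character spans over the raw input (e.g. ⟨2:9⟩ for similar), and a node may anchor several, possibly overlapping sub-strings. A small illustration of recovering anchored material, assuming from/to character offsets in the style of the example figure (the function name is ours):

```python
def anchored_text(sentence, spans):
    """Recover the surface material a node is anchored to.
    `spans` are (from, to) character offsets into the raw sentence."""
    return " ".join(sentence[start:end] for start, end in spans)

sentence = ("A similar technique is almost impossible to apply "
            "to other crops, such as cotton, soybeans and rice.")
assert anchored_text(sentence, [(2, 9)]) == "similar"
assert anchored_text(sentence, [(10, 19)]) == "technique"
```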
graph (albeit using a different labeling scheme and In the EDS analysis for the running ex- inverse directionality in a few cases), but it ana- ample, nodes representing covert quantifiers lyzes the predicative copula as semantically con- (e.g. on bare nominals, labeled udef q4 ), the tentful and does not treat almost as ‘scoping’ over 3 The underlying grammar is rooted in the general linguistic the entire graph. In the example graph, there are theory of Head-Driven Phrase Structure Grammar (HPSG; two generated nodes to represent the unexpressed Pollard and Sag, 1994). BEN(efactive) of the impossible relation as well 4 In the EDS example in Figure 1, all nodes correspond- ing to instances of bare ‘nominal’ meanings are bound by a as the unexpressed ACT(or) argument of the three- covert quantificational predicate, including the group-forming place apply relation, respectively; these nodes are implicit conj and and c nodes that represent the nested, binary- related by an edge indicating grammatical coref- branching coordinate structure. This practice of uniform quan- tifier introduction in ERS is acknowledged as “particularly erence. In this graph, the indefinite determiner, exuberant” by Steedman (2011, p. 21). 
infinitival to, and the vacuous preposition marking 4 PRED be 〈20:22〉 sentmod enunc sempos v frame en-v#ev-w218f2 ACT PAT apply possible 〈41:43〉 〈44:49〉 〈30:40〉 sempos v sempos adj.denot frame en-v#ev-w119f2 APPS PAT BEN EXT such_as technique almost ADDR #Benef 〈66:70〉 〈71:73〉 〈0:1〉 〈10:19〉 ACT 〈23:29〉 effective sempos x sempos x sempos n.denot sempos adv.denot.grad.neg ADDR ADDR CONJ ADDR ADDR RSTR coref.gram effective effective member member effective and crop similar #Gen 〈91:94〉 〈50:52〉 〈59:64〉 〈2:9〉 sempos x sempos x sempos n.denot sempos adj.denot ADDR ADDR ADDR RSTR member member member rice soybean cotton other 〈95:99〉 〈82:90〉 〈74:80〉 〈53:58〉 sempos n.denot sempos n.denot sempos n.denot sempos adj.denot Figure 2: Semantic dependency graphs for the running example A similar technique is almost impossible to apply to other crops, such as cotton, soybeans and rice: Prague Tectogrammatical Graphs (PTG). In addition to node properties, visualized similarly to the EDS in Figure 1, boolean edge attributes are abbreviated below edge labels, for true values. the deep object of apply can be argued to not have Universal Conceptual Cognitive Annotation a semantic contribution of their own. Universal Cognitive Conceptual Annotation The ADDR argument relation to the apply pred- (UCCA; Abend and Rappoport, 2013) is based icate has been recursively propagated to both el- on cognitive linguistic and typological theo- ements of the apposition and to all members of ries, primarily Basic Linguistic Theory (Dixon, the coordinate structure. Accordingly, edge labels 2010/2012). The shared task targets the UCCA in PTG are not always functional, in the sense of foundational layer, which focuses on argument allowing multiple outgoing edges from one node structure phenomena (where predicates may be with the same label. verbal, nominal, adjectival, or otherwise). 
This In FGD, role labels (called functors) ACT(or), coarse-grained level of semantics has been shown PAT(ient), ADDR(essee), ORIG(in), and EFF(ect) to be preserved well across translations (Sulem indicate ‘participant’ positions in an underlying va- et al., 2015). It has also been successfully used lency frame and, thus, correspond more closely to for improving text simplification (Sulem et al., the numbered argument positions in other frame- 2018c), as well as to the evaluation of a number works than their names might suggest.5 The PTG of text-to-text generation tasks (Birch et al., 2016; annotations are grounded in a machine-readable Sulem et al., 2018a; Choshen and Abend, 2018). valency lexicon (Urešová et al., 2016), and the The basic unit of annotation is the scene, denot- frame values on verbal nodes in Figure 2 indi- ing a situation mentioned in the sentence, typically cate specific verbal senses in the lexicon. involving a predicate, participants, and potentially modifiers. Linguistically, UCCA adopts a notion of semantic constituency that transcends pure depen- 5 Accordingly, multiple instances of the same core partic- dency graphs, in the sense of introducing separate, ipant role—as ADDR:member in Figure 2—will only occur with propagation of dependencies into paratactic construc- unlabeled nodes, called units. One or more labels tions. are assigned to each edge. Formally, UCCA has a 5 A F DF P A U is to apply . F E E C R E C U E A C almost impossible to other crops , S A R C U C N C similar technique such as cotton , soybeans and rice Figure 3: Universal Conceptual Cognitive Annotation (UCCA), foundational layer, for the running example A similar technique is almost impossible to apply to other crops, such as cotton, soybeans and rice. The dashed edge whose target is the node anchored to technique abbreviates a boolean remote edge attribute. 
Type (1) flavor, where leaf (or terminal) nodes of ‘aligns’ graph nodes with (possibly discontinuous) the graph are anchored to possibly discontinuous sets of tokens in the underlying input, this anchor- sequences of surface sub-strings, while interior (or ing is not part of the meaning representation proper. ‘phrasal’) graph nodes are formally unanchored. At the same time, AMR frequently invokes lexi- The UCCA graph for the running example (see cal decomposition and normalization towards ver- Figure 3) includes a single scene, whose main re- bal senses, such that AMR graphs often appear to lation is the Process (P) evoked by apply. It also ‘abstract’ furthest from the surface signal. Since contains a secondary relation labeled Adverbial the first general release of an AMR graph bank in (D), almost impossible, which is broken down into 2014, the framework has provided a popular tar- its Center (C) and Elaborator (E); as well as two get for data-driven meaning representation parsing complex arguments, labeled as Participants (A). Un- and has been the subject of two consecutive tasks like the other frameworks in the task, the UCCA at SemEval 2016 and 2017 (May, 2016; May and foundational layer integrates all surface tokens into Priyadarshi, 2017). the graph, possibly as the targets of semantically The AMR example graph in Figure 4 has a topo- bleached Function (F) and Punctuation (U) edges. UCCA graphs need not be rooted trees: Argument sharing across units will give rise to reentrant nodes possible-01 much like in the other frameworks. For example, polarity - technique in Figure 3 is both a Participant in the ARG1 mod (domain) scene evoked by similar and a Center in the parent unit. 
UCCA in principle also supports implicit (un- apply-02 almost expressed) units which do not correspond to any ARG1 ARG2 tokens, but these are currently excluded from pars- technique crop ing evaluation and, thus, suppressed in the UCCA (ARG1)-of mod (domain) (ARG1)-of graphs distributed in the context of the shared task. resemble-01 other exemplify-01 Abstract Meaning Representation The shared ARG0 task includes Abstract Meaning Representation and (AMR; Banarescu et al., 2013), which in the MRP op1 op2 op3 op4 hierarchy of different formal types of semantic graphs (see §2 above) is simply unanchored, i.e. cotton soybean rice et-cetera represents Flavor (2). The AMR framework is inde- pendent of particular approaches to derivation and Figure 4: Abstract Meaning Representation (AMR) for compositionality and, accordingly, does not make the running example A similar technique is almost im- explicit how elements of the graph correspond to possible to apply to other crops, such as cotton, soy- the surface utterance. Although most AMR pars- beans and rice. Edge labels in parentheses indicate nor- ing research presupposes a pre-processing step that malized (i.e. un-inverted) roles. 6 ATTRIBUTION PRESUPPOSITION in in in in in in in in technique.n.01 impossible.a.01 in time.n.08 apply.v.01 crop.n.01 entity.n.01 in in in in Attribute Theme Topic Degree Theme Time EQU Goal Instance NEQ Sub Sub Sub similar.a.01 proposition.n.01 almost.r.01 "now" rice.n.01 soybean.n.03 crop.n.01 cotton.n.01 Figure 5: Discourse Representation Graph (DRG) for the running example A similar technique is almost impossible to apply to other crops, such as cotton, soybeans and rice. Different node shapes are not formally part of the graph but serve as a visual aid to distinguish different types of the underlying DRS elements. logy broadly comparable to EDS, with some no- coordinate structure in Figure 4. Conversely, like table differences. 
Similar to the UCCA example in the other frameworks (except UCCA), some sur- graph (and unlike EDS), the AMR representation face tokens are analyzed as semantically vacuous. of the coordinate structure is flat. Although most For example, parallel to the PTG graph in Figure 2, lemmas are linked to derivationally related forms there is no meaning contribution annotated for the in the sense lexicon, this is not universal, as seen determiner a (let alone for covert determiners in by the nodes corresponding to similar and such as, bare nominals, as made explicit in EDS). which are labeled as resemble-01 and exemplify-01, respectively. These sense distinctions (primarily Discourse Representation Graphs Finally, Dis- for verbal predicates) are grounded in the inventory course Representation Graphs (DRG) provide a of predicates from the PropBank lexicon (Kings- graph encoding of Discourse Representation Struc- bury and Palmer, 2002; Hovy et al., 2006). ture (DRS), the meaning representations at the core of Discourse Representation Theory (DRT; Kamp Role labels in AMR encode semantic argument and Reyle, 1993; Van der Sandt, 1992; Asher, positions, with the particular roles defined accord- 1993). DRSs can model many challenging se- ing to each PropBank sense, though the counting in mantic phenomena including quantifiers, negation, AMR is zero-based such that the ARG1 and ARG2 scope, pronoun resolution, presupposition accom- roles in Figure 4 often correspond to ARG2 and modation, and discourse structure. Moreover, they ARG3, respectively, in the EDS of Figure 1. Prop- are directly translatable into first-order logic for- Bank distinguishes such numbered arguments from mulas to account for logical inference. non-core roles labeled from a general semantic in- DRG used in the shared task represents a type ventory, such as frequency, duration, or domain. 
of graph encoding of DRS that makes the graphs Figure 4 also shows the use of inverted edges structurally as close as possible to the structures in AMR, for example ARG1-of and mod. These found in other frameworks; Abzianidze et al. (2020) serve to allow annotators (and in principle also pars- provide more details on the design choices in the ing systems) to view the graph as a tree-like struc- DRG encoding. The source DRS annotations are ture (with occasional reentrancies) but are formally taken from data release 3.0.0 of the Parallel Mean- merely considered notational variants. Therefore, ing Bank (PMB; Abzianidze et al., 2017; Bos et al., the MRP rendering of the AMR example graph 2017).6 Although the annotations in the PMB are also provides an unambiguous indication of the compositionally derived from lexical semantics, underlying, normalized graph: Edges with a label anchoring information is not explicit in its DRSs; component shown in parentheses are to be reversed thus, (like AMR) the DRG framework formally in normalization, e.g. representing an actual ARG0 instantiates Flavor (2) of meaning representations. edge from resemble-01 to technique or a domain The DRG of the running example is given in Fig- edge from other to crop. ure 5. 
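Un-inverting AMR edges is mechanical: an edge labeled ROLE-of from x to y stands for a ROLE edge from y to x, and in AMR convention mod is the inverse of domain. A sketch of that normalization (naming is ours; the official MRP rendering marks invertible label components in parentheses instead):

```python
def normalize(edges):
    """Reverse AMR inverted edges: (x, 'ROLE-of', y) denotes an actual
    (y, 'ROLE', x) edge; 'mod' is conventionally the inverse of 'domain'."""
    out = []
    for src, label, tgt in edges:
        if label.endswith("-of"):
            out.append((tgt, label[:-3], src))
        elif label == "mod":
            out.append((tgt, "domain", src))
        else:
            out.append((src, label, tgt))
    return out

assert normalize([("technique", "ARG0-of", "resemble-01")]) == \
    [("resemble-01", "ARG0", "technique")]
assert normalize([("crop", "mod", "other")]) == \
    [("other", "domain", "crop")]
```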
The concepts (vissualized as oval shapes) are Given the non-compositionality of AMR anno- represented by WordNet 3.0 senses and semantic tation, AMR allows the introduction of semantic roles (in diamond shapes) by the adapted version concepts which have no explicit lexicalization in 6 the text, for example the et-cetera element in the https://pmb.let.rug.nl/data.php 7 EDS PTG UCCA AMR DRG Flavor 1 1 1 2 2 TRAIN Text Type newspaper newspaper mixed mixed mixed Sentences 37,192 42,024 6,872 57,885 6,606 Tokens 861,831 1,026,033 171,838 1,049,083 44,692 VALIDATE Text Type mixed mixed mixed mixed mixed Sentences 3,302 1,664 1,585 3,560 885 Tokens 65,564 40,770 25,982 61,722 5,541 Text Type mixed newspaper mixed mixed mixed TEST Sentences 4,040 2,507 600 2,457 898 Tokens 68,280 59,191 18,633 49,760 5,991 Table 1: Quantitative summary of English gold-standard training, validation, and evaluation data for the five frame- works in the cross-framework track; token counts reflect the morpho-syntactic companion parses, see § 4. of VerbNet roles. Nodes with quoted labels rep- 4 Task Setup resent entities which semantically behave as con- The following paragraphs summarize the ‘logistics’ stants. Such a node is used for the indexical “now”, of the MRP 2020 shared task. Except for the addi- modelling the time of speech, which is part of the tion of the new cross-lingual track, the overall task semantics of the present-tense copula is. setup mirrored that of the 2019 predecessor; please Explicit encoding of the scope is one of the main see Oepen et al. (2019) for additional background. differences between DRG and the other frame- works. Scopes can be triggered by discourse seg- Cross-Framework Track The English training, ments, negation, universal quantification, clause validation, and evaluation data are summarized in embedding (e.g. to apply . . . ), and presuppositions Table 1. For EDS, PTG, UCCA, and AMR the (e.g. other crops). 
The scopes are represented as provenance of these gold-standard annotations is unlabeled (square-shaped) nodes in DRG (UCCA the same as in the MRP 2019 setup (Oepen et al., also has unlabeled nodes, albeit for a different rea- 2019).8 The DRG target structures have been con- son). The node for the first discourse segment is verted using the procedure sketched in §3 above. treated as a root, which is connected to the scope Unlike in the 2019 edition of the task, designated of the embedded clause by the ATTRIBUTION dis- validation segments have been provided for all five course relation. The latter scope presupposes the frameworks in the cross-framework track; this data scope containing a crop which is different (with could be used during system development, e.g. for NEQ inequality) from the group of crops consist- parameter tuning, but not for training the final sys- ing of (with the Sub semantic role) rice, soybeans, tem submission. For EDS, UCCA, and AMR, the and cotton. Each concept, represented by a Word- 2020 validation data corresponds to the 2019 evalu- Net synset, has explicitly assigned its scope via in ation segments, thus allowing some comparability edges.7 across the two editions of the MRP shared task. As a common point of reference, the training Compared to the other frameworks, DRG struc- data includes a sample of 89 WSJ sentences an- tures are larger in size due to the number of se- notated in all five frameworks (twenty for DRG); mantic relations, explicit nodes for scope, scope for all frameworks but DRG, the evaluation data membership edges, role reification, and informa- further includes parallel annotations over the same tion about the time (which usually introduces at random selection of 100 sentences from the novel least four additional nodes). The Little Prince (by Antoine de Saint-Exupéry) as used in MRP 2019, dubbed LPPS. 
These parallel 7 Since in principle the scope of a semantic role cannot be subsets of the gold-standard data are available for uniquely determined by the scopes of its arguments, semantic public download from the task site (see §9 below). roles are reified as nodes and can have ingoing in edges. But 8 whenever the scopes of a role and its arguments coincide, There are slightly more EDS and PTG (compared to PSD the scope membership edge for the role is omitted and hence in 2019) graphs this year, because the two underlying re- recoverable. This decision decreases the number of edges in sources are no longer intersected; for UCCA, the 2020 release DRG. includes additional, recent gold-standard annotations. 8 EDS PTG UCCA AMR−1 DRG (02) Average Tokens per Graph 22.17 24.42 25.01 18.12 6.77 (03) Average Nodes per Token 1.26 0.74 1.33 0.64 2.09 (04) Distinct Edge Labels 10 72 15 101 16 (05) Percentage of top nodes 0.99 1.27 1.66 3.77 3.40 PROPORTIONS (06) Percentage of node labels 29.02 21.61 – 43.91 39.81 (07) Percentage of node properties 12.54 26.22 – 7.63 – (08) Percentage of node anchors 29.02 19.63 38.80 – – (09) Percentage of (labeled) edges 28.43 26.10 56.88 44.69 56.79 (10) Percentage of edge attributes – 5.17 2.66 – – (11) %g Rooted Trees 0.09 22.63 28.19 22.05 0.35 (12) %g Treewidth One 68.60 22.67 34.17 49.91 0.35 (13) Average Treewidth 1.317 2.067 1.691 1.561 2.131 TREENESS (14) Maximal Treewidth 3 7 4 5 5 (15) Average Edge Density 1.015 1.177 1.055 1.092 1.265 (16) %n Reentrant 32.77 16.23 4.90 19.89 25.92 (17) %g Cyclic 0.27 33.97 0.00 0.38 0.27 (18) %g Not Connected 1.90 0.00 0.00 0.00 0.00 (19) %g Multi-Rooted 99.93 0.00 0.00 71.64 32.32 Table 2: Contrastive graph statistics for the MRP 2020 English training data using a subset of the properties defined by Kuhlmann and Oepen (2016). 
Here, %g and %n indicate percentages of all graphs and nodes, respectively, in each framework; AMR−1 refers to the normalized form of the graphs, with inverted edges reversed, as discussed in § 3. The second block of statistics indicates the proportional distribution of different formal types of information in the graphs, according to the categorization used in the MRP cross-framework evaluation metric (see § 5). Table 2 provides a quantitative side-by-side com- separate cross-lingual track, which transcends the parison of the training data, using some of the MRP 2019 task setup. graph-theoretic properties discussed by Kuhlmann and Oepen (2016); see §2 for semi-formal def- initions. The table indicates clear differences Additional Resources For reasons of compara- among the frameworks. The underlying input bility and fairness, the shared task constrained strings for AMR (where text selection is more var- which additional data or pre-trained models (e.g. ied), for example, are shorter, and much shorter corpora, word embeddings, language models, lex- in turn for DRG. EDS, UCCA, and DRG have ica, or other annotations) can be legitimately many more nodes per token, on average, than the used besides the resources distributed by the task other frameworks—reflecting lexical decomposi- organizers—such that all participants should in tion, ‘phrasal’ grouping, and role reification, re- principle have access to the same range of data. spectively, as evident in Figures 1, 3, and 5. In However, to keep such constraints to the minimum some respects, the PTG and UCCA graphs are required, a ‘white-list’ of legitimate resources was more tree-like than graphs in the other frameworks, compiled from nominations by participants (with a for example in their proportions of actual rooted cut-off date eight weeks before the end of the eval- trees, the frequencies of reentrant nodes, and the lack of multi-rooted structures. 
At the same time, PTG exhibits comparatively high average and max- PTG UCCA AMR DRG imal treewidth and is the only framework with a non-trivial percentage of cyclic graphs. Language Czech German Chinese German Flavor 1 1 1 2 Cross-Lingual Track For four of the frame- Text Type newspaper mixed mixed mixed TRAIN Sentences 43,955 4,125 18,365 1,575 works (excluding EDS), gold-standard training and Tokens 740,466 95,634 428,054 9,088 evaluation data has been compiled in other lan- Text Type newspaper mixed mixed mixed guages than English: Mandarin Chinese for AMR, TEST Sentences 5,476 444 1,713 403 Czech for PTG, and German for UCCA and DRG. Tokens 92,643 10,585 39,228 2,384 For UCCA and in particular DRG, however, avail- able data is comparatively limited, as summarized Table 3: Quantitative summary of gold-standard data in Table 3. These target representations constitute a for the four frameworks in the cross-lingual track. 9 uation period).9 Thus, the task design reflects what EDS PTG UCCA AMR DRG is at times called a closed track, where participants Top Nodes 3 3 3 3 3 are constrained in which additional data and pre- Node Labels 3 3 7 3 3 trained models can be used in system development. Node Properties 3 3 7 3 7 Node Anchors 3 3 3 7 7 Labeled Edges 3 3 3 3 3 Companion Syntactic Parses At a technical Edge Attributes 7 3 3 7 7 level, training (and evaluation) data were dis- tributed in two formats, (a) as sequences of ‘raw’ Table 4: Different tuple types per framework. sentence strings and (b) in pre-tokenized, part- of-speech–tagged, lemmatized, and syntactically parsed form. For the latter, premium-quality on-line CodaLab infrastructure. Teams were al- morpho-syntactic dependency analyses were pro- lowed to make repeated submissions, but only the vided to participants, called the MRP 2020 compan- most recent successful upload to CodaLab within ion parses. 
These parses were obtained using a pre- the evaluation period was considered for the offi- release of the ‘future’ UDPipe architecture (Straka, cial, primary ranking of submissions. Task partici- 2018; Straka and Straková, 2020), trained on avail- pants were encouraged to process all inputs using able gold-standard UD 2.x treebanks, for English the same general parsing system, but—owing to augmented with conversions from PTB-style anno- inevitable fuzziness about what constitutes ‘one’ tations in the WSJ and OntoNotes corpora (Hovy parser—this constraint was not formally enforced. et al., 2006), using the UD-style CoreNLP 4.0 to- kenizer (Manning et al., 2014) and jack-knifing 5 Evaluation where appropriate (to avoid overlap with the texts Following the previous edition of the shared task, underlying the MRP semantic graphs). the official MRP metric for the task is the micro- average F1 score across frameworks over all tuple Rules of Participation While the various mean- types that encode ‘atoms’ of information in MRP ing representation frameworks and graph banks graphs. The cross-framework metric uniformly represented in the shared task inevitably present evaluates graphs of different flavors, regardless of considerable linguistic variation, all MRP 2020 a specific framework exhibiting (a) labeled or un- data was repackaged in a uniform and normalized labeled nodes or edges, (b) nodes with or without abstract representation with a common serializa- anchors, and (c) nodes and edges with optional tion, the same JSON Lines format as used in the properties and attributes, respectively (see Table 4). previous year (Oepen et al., 2019). 
Because some of the semantic graph banks involved in the shared The MRP metric generalizes earlier framework- task had originally been released by the Linguis- specific metrics (Dridan and Oepen, 2011; Cai tic Data Consortium (LDC), the training data was and Knight, 2013; Hershcovich et al., 2019a) in made available to task participants by the LDC terms of decomposing each graph into sets of under no-cost evaluation licenses. All task data (in- typed tuples, as indicated in Figure 6. To quantify cluding system submissions and evaluation results) graph similarity in terms of tuple overlap, a corre- is being prepared for general release through the spondence relation between the nodes of the gold- LDC, while subsets that are copyright-free will also standard and system graphs must be determined. become available for direct, open-source download. Adapting a search procedure for the NP-hard max- imum common edge subgraph (MCES) isomor- The shared task was first announced in March phism problem, the MRP scorer will search for the 2020, the initial release of the cross-framework node-to-node correspondence that maximizes the training data became available in late April, and intersection of tuples between two graphs, where the evaluation period ran between July 27 and Au- node identifiers (m and n in Figure 6) act like vari- gust 10, 2020; during this period, teams obtained ables that can be equated across the gold-standard the unannotated input strings for the evaluation and system graphs.10 This means that during eval- data and had available a little more than two weeks uation all information in the MRP graphs is con- to prepare and submit parser outputs. Submission of semantic graphs for evaluation was through the 10 Conceptually, the search expands both graphs into larger structures with ‘lightly labeled’ nodes and edges, e.g. 
treat- 9 See http://svn.nlpl.eu/mrp/2020/public/ ing node properties much like ‘pseudo-edges’ with globally resources.txt for the list of legitimate extra resources. unique constant-valued target nodes. 10 Cross-Framework Cross-Lingual Teams AMR DRG EDS PTG UCCA AMR DRG PTG UCCA Reference Hitachi 3 3 3 3 3 3 3 3 3 Ozaki et al. (2020) ÚFAL 3 3 3 3 3 3 3 3 3 Samuel and Straka (2020) HIT-SCIR 3 3 3 3 3 3 3 3 3 Dou et al. (2020) HUJI-KU 3 3 3 3 3 3 3 3 3 Arviv et al. (2020) ISCAS 3 3 3 3 3 7 7 7 7 TJU-BLCU 3 3 3 3 3 3 3 3 7 JBNU 3 7 7 7 7 7 7 7 7 Na and Min (2020) ÚFAL 3 3 3 3 3 3 3 3 3 Samuel and Straka (2020) ERG 7 7 3 7 7 7 7 7 7 Oepen and Flickinger (2019) Table 5: Overview of participating teams and the tracks they participated in. Columns correspond to tracks and frameworks, and rows correspond to teams. The top block represents ‘official’ submissions, which participated in the competition. The middle block represents ‘unofficial’ submissions, which were submitted after the closing deadline. The bottom row represents the ERG baseline. sidered with equal weight, i.e. tops, node and edge the greedy hill-climbing search of e.g. Smatch; labels, properties and attributes, and anchors. Cai and Knight, 2013). MRP scoring is robust MRP scoring is carried out using the open- with respect to equivalent variations of values, e.g. source mtool software—the Swiss Army Knife case and string vs. number type distinctions for of Meaning Representation11 —which implements all literals. Comparison of anchor values ignores a refinement of the MCES algorithm by McGre- whitespace character positions, internal segmen- gor (1982). Based on pre-computed per-node re- tation of adjacent anchors, and basic punctuation wards and upper bounds on adjacent edge corre- marks in the left or right periphery of a normalized spondences, candidate node-to-node mappings are anchor. Assuming the string Oh no! 
as a hypotheti- initialized and scheduled in decreasing order of cal parser input, the following anchorings will all expected similarity. For increased efficiency (in be considered equivalent: {h0 : 6i}, {h0 : 2i, h3 : 6i}, principle tractability, in fact), mtool will return {h0 : 1i, h1 : 6i}, and {h0 : 5i}. the best available solution when it exhausts its pre- set search space limits. This anytime behavior of 6 Submissions and Results the scores provides a distinction between exact vs. approximate solutions (which contrasts with Six teams submitted parser outputs to the shared task within the official evaluation period. In addi- 11 https://github.com/cfmrp/mtool tion, we received two submissions after the sub- mission deadline, which we mark as ‘unofficial’. We further include results from an additional ‘ref- tops: erence’ system by one of the task co-organizers, hmi namely EDS outputs from the grammar-based ERG ln labels: parser (Oepen and Flickinger, 2019). p vp hm, ln i Table 5 presents an overview of the participating properties: systems and the tracks and frameworks they sub- hm, p, vp i mitted results for. All official systems submitted le edges: results for the cross-framework track (across all hm, n, le i frameworks), and additionally five of them submit- a va attributes: ted results to the cross-lingual track as well (where hm, n, le , a, va i TJU-BLCU did not submit UCCA parser outputs anchors: in the cross-lingual track). We note that the shared 〈i:j〉 hn, i, . . . , ji task explicitly allowed partial submissions, in order to lower the bar for participation (which is no doubt Figure 6: Representing an abstractMRP graph as a set substantial). Two of the teams—ISCAS and TJU- of typed tuples, with m and n as node identifiers for the BLCU—declined the invitation to submit a system top and bottom node, respectively. description paper to the shared task proceedings. 
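The tuple decomposition of Figure 6 and the search for a node correspondence can be illustrated in a few lines. The sketch below is a toy stand-in, not mtool: it brute-forces all node bijections (feasible only for tiny graphs, and assuming equal node counts), whereas mtool's MCES refinement prunes candidates with per-node rewards and upper bounds. Field names (`nodes`, `edges`, `tops`, and anchors with `from`/`to`) follow the MRP JSON serialization; `normalized_anchors` mimics the anchor-equivalence rules described above and is our own approximation:

```python
from itertools import permutations


def tuples(graph, mapping):
    """Decompose a graph into (a subset of) the typed tuples of Figure 6;
    node identifiers are renamed through `mapping`, so that they behave
    like variables equatable across gold-standard and system graphs."""
    m = lambda i: mapping.get(i, i)
    out = set()
    for top in graph.get("tops", []):
        out.add(("top", m(top)))
    for node in graph.get("nodes", []):
        if "label" in node:
            out.add(("label", m(node["id"]), node["label"]))
        for anchor in node.get("anchors", []):
            out.add(("anchor", m(node["id"]), anchor["from"], anchor["to"]))
    for edge in graph.get("edges", []):
        out.add(("edge", m(edge["source"]), m(edge["target"]), edge["label"]))
    return out


def score(gold, system):
    """Best tuple overlap over all node-to-node correspondences (brute force;
    assumes the two graphs have the same number of nodes)."""
    gold_ids = [n["id"] for n in gold["nodes"]]
    system_ids = [n["id"] for n in system["nodes"]]
    gold_tuples = tuples(gold, {})
    best = 0
    for perm in permutations(gold_ids):
        best = max(best, len(gold_tuples & tuples(system, dict(zip(system_ids, perm)))))
    p = best / max(len(tuples(system, {})), 1)
    r = best / max(len(gold_tuples), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def normalized_anchors(text, anchors):
    """Normalize anchor values: ignore whitespace, internal segmentation,
    and basic punctuation at the periphery (cf. the 'Oh no!' example)."""
    chars = sorted({i for a in anchors for i in range(a["from"], a["to"])
                    if not text[i].isspace()})
    while chars and text[chars[0]] in ",.;:!?":
        chars.pop(0)
    while chars and text[chars[-1]] in ",.;:!?":
        chars.pop()
    return frozenset(chars)
```

Under this view, two graphs score perfect F1 whenever some renaming of node identifiers makes their tuple sets identical, regardless of how either parser happened to number its nodes.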
Official rankings:

                  Cross-Framework                 Cross-Lingual
Team        All  EDS  PTG  UCCA  AMR  DRG   All  PTG  UCCA  AMR  DRG
Hitachi  LPPS 1    1    2     1    1    –     –    –     –    –    –
         full 1    1    1     2    1    2     1    2     3    1    1
ÚFAL     LPPS 1    2    1     1    2    –     –    –     –    –    –
         full 1    2    2     1    1    1     1    1     1    2    2
HIT-SCIR LPPS 3    3    3     3    3    –     –    –     –    –    –
         full 3    3    3     2    3    3     3    3     2    3    3
HUJI-KU  LPPS 4    5    4     4    5    –     –    –     –    –    –
         full 4    5    4     4    5    5     4    4     4    4    4
ISCAS    LPPS 5    4    6     6    4    –     –    –     –    –    –
         full 5    4    6     6    4    4     –    –     –    –    –
TJU-BLCU LPPS 6    6    5     5    6    –     –    –     –    –    –
         full 6    6    5     5    6    6     5    5     –    5    5

MRP scores, cross-framework track:

Team           Tops         Labels       Properties   Anchors      Edges        Attributes   All
               P   R   F    P   R   F    P   R   F    P   R   F    P   R   F    P   R   F    P   R   F
Hitachi  LPPS .93 .93 .93  .65 .68 .66  .63 .62 .62  .71 .70 .70  .82 .80 .81  .39 .32 .34  .85 .85 .85
         full .95 .95 .95  .72 .72 .72  .54 .54 .54  .57 .55 .56  .83 .80 .82  .24 .23 .24  .88 .85 .86
ÚFAL     LPPS .93 .93 .93  .68 .68 .68  .61 .60 .60  .69 .71 .70  .80 .79 .80  .42 .33 .36  .85 .85 .85
         full .94 .94 .94  .74 .73 .74  .55 .54 .54  .56 .57 .56  .80 .80 .80  .23 .24 .24  .87 .86 .86
HIT-SCIR LPPS .94 .94 .94  .63 .64 .64  .45 .41 .43  .71 .71 .71  .77 .76 .77  .37 .30 .33  .80 .80 .80
         full .94 .94 .94  .70 .69 .69  .44 .37 .40  .57 .56 .57  .77 .75 .76  .22 .22 .22  .82 .80 .81
HUJI-KU  LPPS .87 .84 .85  .36 .36 .36  .29 .18 .20  .66 .67 .67  .67 .62 .64  .15 .07 .10  .73 .63 .67
         full .88 .83 .85  .29 .29 .29  .40 .24 .28  .51 .51 .51  .65 .62 .64  .07 .08 .07  .73 .58 .64
ISCAS    LPPS .70 .70 .70  .50 .49 .48  .22 .26 .24  .35 .41 .37  .52 .35 .39   –   –   –   .53 .43 .43
         full .75 .74 .74  .56 .55 .55  .22 .22 .21  .29 .31 .29  .57 .40 .44   –   –   –   .58 .46 .48
TJU-BLCU LPPS .83 .82 .83  .41 .29 .34   –   –   –   .45 .30 .35  .53 .30 .37   –   –   –   .57 .30 .39
         full .75 .74 .75  .54 .29 .38   –   –   –   .33 .14 .19  .44 .18 .24   –   –   –   .55 .22 .30
ÚFAL     LPPS .93 .93 .93  .68 .68 .68  .61 .60 .60  .71 .71 .71  .80 .80 .80  .43 .34 .37  .85 .85 .85
         full .94 .94 .94  .74 .73 .74  .55 .54 .54  .57 .57 .57  .80 .80 .80  .23 .24 .24  .87 .86 .87

MRP scores, cross-lingual track:

Team           Tops         Labels       Properties   Anchors      Edges        Attributes   All
               P   R   F    P   R   F    P   R   F    P   R   F    P   R   F    P   R   F    P   R   F
Hitachi       .96 .96 .96  .65 .65 .65  .44 .42 .43  .70 .68 .69  .80 .77 .78  .27 .27 .26  .86 .84 .85
ÚFAL          .95 .95 .95  .66 .66 .66  .43 .43 .43  .65 .72 .68  .78 .79 .79  .30 .33 .31  .84 .86 .85
HIT-SCIR      .95 .95 .95  .53 .52 .53  .21 .18 .20  .47 .47 .47  .66 .65 .66  .23 .24 .23  .72 .67 .69
HUJI-KU       .90 .84 .87  .15 .15 .15  .31 .32 .32  .42 .42 .42  .59 .58 .59  .08 .08 .08  .69 .54 .60
TJU-BLCU      .56 .55 .56  .41 .21 .27   –   –   –   .23 .12 .15  .28 .13 .18   –   –   –   .35 .15 .20
ÚFAL          .95 .95 .95  .66 .66 .66  .43 .43 .43  .71 .72 .72  .79 .79 .79  .30 .33 .31  .86 .86 .86

Table 6: Official rankings (top) for both tracks, and MRP scores for the cross-framework (middle) and cross-lingual (bottom) tracks. Each cross-framework submission is evaluated in two settings, where the top scores present results for the LPPS sub-corpus, and the bottom ones for the full English evaluation set. The rankings are presented both for the overall average scores (All), and separately per framework. Evaluation results are broken down by 'atomic' component pieces. For each component we report precision (P), recall (R), and F1 score (F). Entries in the two MRP tables are split into the same blocks as in Table 5: official (top) vs. unofficial (bottom) submissions, omitting the two highly partial unofficial submissions by JBNU and ERG.
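Both the per-component columns and the overall 'All' column in Table 6 are micro-averages: counts of matching, gold, and system tuples are pooled (across graphs and, for 'All', across frameworks) before precision, recall, and F1 are computed once. A minimal illustration of the arithmetic (our own helper, not the official scorer):

```python
def prf(match, gold, system):
    """Precision, recall, F1 from pooled tuple counts."""
    p = match / system if system else 0.0
    r = match / gold if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_average(counts):
    """Micro-averaging: pool per-item (match, gold, system) counts,
    then compute precision/recall/F1 once over the pooled totals."""
    match = sum(c[0] for c in counts)
    gold = sum(c[1] for c in counts)
    system = sum(c[2] for c in counts)
    return prf(match, gold, system)
```

Because counts are pooled rather than scores averaged, frameworks or components with more tuples carry proportionally more weight in the aggregate.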
12 EDS PTG UCCA AMR DRG P R F P R F P R F P R F P R F 0.97 0.97 0.97 0.80 0.84 0.82 0.86 0.80 0.83 0.78 0.79 0.79 – – – Hitachi 0.94 0.93 0.94 0.89 0.89 0.89 0.78 0.72 0.75 0.83 0.80 0.82 0.94 0.92 0.93 0.96 0.95 0.95 0.81 0.84 0.83 0.84 0.82 0.83 0.77 0.79 0.78 – – – ÚFAL 0.93 0.92 0.93 0.88 0.89 0.88 0.75 0.78 0.76 0.81 0.79 0.80 0.95 0.93 0.94 0.90 0.89 0.89 0.78 0.78 0.78 0.84 0.80 0.82 0.68 0.71 0.70 – – – HIT-SCIR 0.87 0.88 0.87 0.85 0.84 0.84 0.75 0.74 0.75 0.74 0.66 0.70 0.90 0.89 0.89 0.83 0.76 0.79 0.71 0.49 0.58 0.80 0.76 0.78 0.56 0.5 0.53 – – – HUJI-KU 0.83 0.76 0.80 0.69 0.44 0.54 0.73 0.73 0.73 0.57 0.49 0.52 0.84 0.5 0.63 0.86 0.90 0.88 0.12 0.25 0.16 0.45 0.08 0.13 0.68 0.47 0.56 – – – ISCAS 0.85 0.87 0.86 0.14 0.26 0.18 0.42 0.03 0.06 0.74 0.53 0.61 0.78 0.63 0.69 0.83 0.51 0.64 0.41 0.24 0.30 0.52 0.13 0.21 0.50 0.34 0.4 – – – TJU-BLCU 0.84 0.35 0.49 0.38 0.15 0.21 0.50 0.06 0.10 0.54 0.21 0.30 0.49 0.34 0.40 – – – – – – – – – 0.74 0.73 0.74 – – – JBNU – – – – – – – – – 0.71 0.62 0.66 – – – 0.96 0.95 0.95 0.83 0.84 0.84 0.84 0.81 0.83 0.77 0.79 0.78 – – – ÚFAL 0.93 0.92 0.93 0.89 0.89 0.89 0.75 0.78 0.76 0.81 0.79 0.80 0.95 0.93 0.94 0.95 0.96 0.96 – – – – – – – – – – – – ERG 0.94 0.91 0.93 – – – – – – – – – – – – Table 7: Per-framework results for the cross-framework track, using the same groupings as in Table 6. Table 6 presents the official rankings for the offi- missions share the first place for both tracks, and cial submissions (top), including an overall score together rank first or second for almost all the in- for each track and per-framework rankings. Rank- dividual frameworks (save for UCCA parsing in ings are given over the LPPS dataset, a sample the cross-lingual track, where Hitachi ranks third). from the Little Prince annotated by all frameworks HIT-SCIR further ranks second for UCCA parsing save for DRG, and over the entire test set. Results in both tracks. 
Interestingly, rankings in the per- are consequently more readily comparable for the framework track are similar across frameworks, LPPS sub-corpus, but should be more robust on the which may indicate some similarity in the parsing entire test corpus, due to its larger size (see §4). problem exhibited by different linguistic schemes, That said, LPPS and overall test results are very despite differences in structure and content. similar, both in terms of ranking and in terms of Per-framework scores using the official MRP bottom line scores. metric are given in Table 7 for the cross-framework The main task results are summarized in Ta- track and Table 8 for the cross-lingual track. Exam- ble 6 for both the cross-framwork (middle) and ining these results, we note that cross-framework cross-lingual (bottom) tracks. Results are broken and cross-lingual scores are quite similar, an en- down into component pieces. Edge attributes are couraging sign of cross-linguistic applicability. An- only present in PTG and UCCA. While they are other trend to note is that precision and recall are still predicted with fairly low results, this consti- surprisingly close to each other for many systems, tutes a notable improvement over the findings of often identical. MRP 2019 (the best score on the official track on UCCA edge attributes was 0.12 F1 then, as op- 7 Overview of Approaches posed to 0.36 now). Anchors are predicted with Compared with systems from MRP 2019, there has substantially lower scores compared to MRP 2019, been a fairly clear shift in approaches for partic- probably since we did not include in MRP 2020 ipating systems this year, resulting in significant the bi-lexical Flavor (0) frameworks. Edges and improvements in performance. The improvements tops are slightly more accurate, while labels and for some of the frameworks are fairly substantial. 
properties slightly less, but these are not directly For example, the Hitachi system, one of the two comparable since the frameworks and data are dif- winning systems, achieves a score of 0.82 F1 in ferent. See §8 for an overall discussion of the state AMR parsing, in comparison to 0.73 F1 achieved of the art, considering MRP 2019 and MRP 2020. by the top AMR parser in MRP 2019. This reflects Results show that the Hitachi and ÚFAL sub- an improvement of over eight points, reflecting a 13 PTG UCCA AMR DRG P R F P R F P R F P R F Hitachi .89 .86 .87 .79 .79 .79 .82 .79 .8 .93 .94 .93 ÚFAL .91 .91 .91 .79 .83 .81 .75 .81 .78 .90 .89 .90 HIT-SCIR .82 .75 .78 .78 .82 .80 .60 .42 .49 .68 .69 .68 HUJI-KU .65 .53 .58 .74 .76 .75 .55 .38 .45 .82 .50 .62 TJU-BLCU .51 .14 .22 – – – .46 .17 .25 .42 .28 .34 ÚFAL .93 .92 .92 .79 .83 .81 .81 .8 .81 .9 .89 .9 Table 8: Per-framework results for the cross-lingual track. number of innovations from the participants this very large, rather than predicting the node labels di- year, as well as contemporaneous developments rectly, the PERIN system reduces the search space outside the shared task (see §8). by predicting ‘relative rules’ that can be used to Broadly speaking, top performers at MRP 2020 map surface token strings to node labels in meaning have all adopted a system architecture that is based representation graphs, an idea that is similar to the on an encoder–decoder framework in which the use of Factored Concept Labels in Wang and Xue input sentence is encoded into contextualized token (2017). Another innovation of the PERIN system embeddings that are used as input to the decoder. is that it is trained with a permutation-invariant loss The system vary in the decoding strategies. function that returns the same value independently The Hitachi system adopts a transformer-based of how the nodes in the graph are ordered. This encoder–decoder architecture. 
The system uses captures the unordered nature of nodes in (most of the standard transformer encoder in which self- the MRP 2020) meaning representation graphs and attention and position embeddings are used to com- prevents situations in which the model is penalized pute the contextualized token embeddings. In its for generating the correct nodes in an order that is decoder, this system has a number of innovations, different from that in the training data. however. First of all, the system rewrites the mean- The HIT-SCIR and JBNU systems adopt the it- ing representation graphs into a reversible Plain erative inference framework first proposed by Cai Graph Notation (PGN), and enhances PGN with a and Lam (2020) for Flavor (2) meaning represen- number of pseudo-nodes that indicate the end of tation graphs that do not enforce strict correspon- node prediction, the end of label prediction, etc. dences between tokens in the input sentence and These correspond well with parsing actions com- the concepts in meaning representation graphs. The monly found in transition-based systems. In this iterative inference framework is also based on an sense, the systems combines the strengths of graph- encoder–decoder architecture. The encoder takes based parsing on the encoder side resulting from the sentence as input and computes contextualized self attention with efficiency of transition-based token embeddings that are used as text memory parsing on the decoder side. Another innovation by a decoder that iteratively predicts the next node is the use of a ‘hierarchical’ decoding process in given the text memory and a predicted parent node which the model first predicts a mode, and then pre- in the partially constructed graph memory at the dicts the next action conditioned on the mode. 
For previous time step, and then identifies the parent example, if the mode is G(raph), the decoder pre- node for the newly predicted node from the par- dicts a meta node, and if the mode is S(urface), the tially constructed graph. While the HIT-SCIR sys- decoder predicts the node label of a specific con- tem essentially uses the Cai and Lam (2020) archi- cept. This allows a fair competition among actions tecture with little modification, the JBNU system that are similar in nature. attempts to extend the work of Cai and Lam (2020) The PERIN system computes contextualized to- by using a shared state to make both predictions ken embeddings with XLM-R (Conneau et al., but did not observe substantial improvements. 2019) on the encoder side, and then on the de- Transition-based systems, which had achieved coder side, uses separate attention heads to predict strong performance in the 2019 shared task, are the node labels, identify anchors for nodes, and also represented in the competition this year. The predict edges between nodes, as well as edge la- HIT-SCIR team uses a transition-based system to bels. Because the label set for nodes is typically parse Flavor (1) meaning representations where 14 there is a stricter correspondence between tokens in Prince) is identical between the two years for EDS, the input sentence and concepts in the meaning rep- UCCA, and AMR. This allows a comparison on resentation graph. The HIT-SCIR transition-based nearly equal grounds: as Table 9 shows, in terms system is essentially the overall top performing sys- of LPPS F1 , the state-of-the-art has substantially tem they developed for MRP 2019. It uses Stack improved for EDS and AMR parsing, but stayed LSTM to compute transition states in the parsing the same for UCCA. However, as mentioned in §6, process, and the parsing actions are tailored to spe- remote edge detection for UCCA improved sub- cific meaning representation frameworks. 
In the stantially, though it carries only a small weight in training process, the system fine-tunes BERT con- terms of overall scores due to the scarcity of remote textualized encodings. edges. The HUJI-KU system also extends an entry in For EDS, the strongest results were obtained the 2019 MRP shared task (originally called TUPA) in the MRP 2019 official competition by SUDA– to parse additional frameworks and handle mean- Alibaba (Zhang et al., 2019c). However, in the ing representation parsing in a multilingual set- post-evaluation stage, they were outperformed by ting. TUPA is a transition-based system that sup- the Peking system (Chen et al., 2019). Both used ports general DAG parsing. TUPA applies separate factorization-based parsing with pre-trained contex- constraints tailored to each meaning representa- tualized language model embeddings (which has tion framework. When parsing cross-framework consistently proved to be very effective for other meaning representations for English, the system frameworks too). These parsers even approached is trained with a BERT-large-cased pretrained en- the performance of the carefully designed grammar- coder, and when parsing cross-lingual meaning rep- based ERG parser (Oepen and Flickinger, 2019). resentations, it is trained with multilingual BERT. English PTG has not been comprehensively ad- 8 On the State of the Art dressed by parsers prior to MRP 2020, but a bi- lexical framework called PSD is a subset of PTG. MRP 2019 (Oepen et al., 2019) yielded parsers It was included in the SDP shared tasks (Oepen for five frameworks in a uniform format, of et al., 2014, 2015) as well as in MRP 2019, and has which EDS, UCCA, and AMR are represented in been addressed by numerous parsers since (Kurita MRP 2020 again. Submissions included transition-, and Søgaard, 2019; Kurtz et al., 2019; Jia et al., factorization-, and composition-based systems, and 2020, among others). Wang et al. 
(2019) estab- gold-standard target structures in 2019 were solely lished the state of the art in supervised PSD us- for English. Comparability is limited by the fact ing a second-order factorization-based parser, and that two of the 2020 frameworks (PTG and DRG) Fernández-González and Gómez-Rodrı́guez (2020) are new, training and (in particular) evaluation sets matched it using a stack-pointer parser. for the others have been updated since MRP 2019, Czech PTG, in its original form as published and additional validation sets was introduced. How- in the Prague Dependency Treebank (Hajič et al., ever, the LPPS evaluation sub-corpus (Le Petite 2018), has been used in several version of the TectoMT machine translation system (Rosa et al., EDS UCCA AMR 2016); however, parsing results have not been pub- P R F P R F P R F lished separately. A (lossy) conversion has been included in the CoNLL 2009 Shared Task on Se- 2019 .92 .93 .93 .84 .82 .83 .74 .72 .73 2020 .97 .97 .97 .86 .80 .83 .78 .79 .79 mantic Role Labeling (Hajič et al., 2009), but the differences in task design are and conversion make Table 9: Per-framework cross-task comparison of top empirical comparison impossible. MRP metric scores on LPPS between the 2019 and UCCA parsing has been dominated by transition- 2020 editions of the MRP task, on the three frameworks based methods (Hershcovich et al., 2017, 2018; represented in both year, for English. The top systems in MRP 2019 for EDS, UCCA, and AMR were Peking Che et al., 2019). However, both English and Ger- (Chen et al., 2019), HIT-SCIR (Che et al., 2019), and man UCCA parsing featured in a SemEval shared Saarland (Donatelli et al., 2019), respectively; in MRP task (Hershcovich et al., 2019b), where the best 2020 the Hitachi system (Ozaki et al., 2020) was at the system, a composition-based parser (Jiang et al., top for all three frameworks, sharing the UCCA first 2019), treated the task as constituency tree parsing rank with ÚFAL (Samuel and Straka, 2020). 
with the recovery of remote edges as a postprocess- 15 ing task. Fancellu et al. (2019), among which the best re- Prior to MRP 2019, Lyu and Titov (2018) parsed sults (F1 = 0.85) were achieved by the word-level AMR using a joint probabilistic model with la- sequence-to-sequence model with Tranformer (Liu tent alignments, avoiding cascading errors due to et al., 2019). Note that the DRS shared task used F1 alignment inaccuracies and outperforming previ- calculated based on the DRS clausal forms, which ous approaches. Lyu et al. (2020) recently im- is not comparable to MRP F1 over DRGs. proved the latent alignment parser using stochas- Similarly to English DRG, German DRG has not tic softmax. Lindemann et al. (2019) trained a been used for semantic parsing prior to the shared composition-based parser on five frameworks in- task due to the new DRG format. Moreover, seman- cluding AMR and EDS, using the Apply–Modify tic parsing with German DRG is novel in the sense algebra, on which the third-ranked Saarland sub- that its DRS counterpart is also new. In German mission to MRP 2019 was based (Donatelli et al., DRG, concepts are grounded in English WordNet 2019). They employed multi-task training with 3.0 (Fellbaum, 2012) senses assuming that synsets all tackled semantic frameworks and UD, estab- are language-neutral. The mismatch between Ger- lishing the state of the art on all graph banks but man tokens and English lemmas of senses must be AMR 2017. Since then, a new state-of-the-art has expected to add additional complexity to German been established for English AMR, using sequence- DRG parsing. to-sequence transduction (Zhang et al., 2019a,b) Direct comparison to non-MRP results is impos- and iterative inference with graph encoding (Cai sible: we are using a new version of AMRbank. and Lam, 2019, 2020). Xu et al. (2020a) improved Gold-standard tokenization is not provided for any sequence-to-sequence parsing for AMR by using of the frameworks. We use the MRP scorer. 
How- pre-trained encoders, reaching similar performance ever, general trends appear consistent with recent to Cai and Lam (2020). Astudillo et al. (2020) in- developments. Pretrained embeddings and cross- troduced a stack-transformer to enhance transition- lingual transfer help; but multi-task learning less so. based AMR parsing (Ballesteros and Al-Onaizan, There is yet progress to be made in sharing infor- 2017), and Lee et al. (2020) improved it further, mation between parsers for different frameworks using a trained parser for mining oracle actions and making better use of their overlap. and combining it with AMR-to-text generation to outperform the state of the-art. 9 Reflections and Outlook Wang et al. (2018) parsed Chinese AMR with The MRP series of shared tasks has contributed to a transition-based system. For cross-lingual AMR general availability of accurate data-driven parsers parsing, Blloshmi et al. (2020) trained an AMR for a broad range of different frameworks, with parser similar to the approach of Zhang et al. performance levels ranging between 0.76 MRP F1 (2019b), using cross-lingual transfer learning, out- (English UCCA) and 0.94 F1 (English EDS). Pars- performing the transition-based cross-lingual AMR ing accuracies in the cross-lingual track present parser of Damonte and Cohen (2018) on German, comparable levels of performance, despite limited Spanish, Italian, and Chinese. training data in the case of UCCA and DRG. Fur- DRG is a novel graph representation format for thermore, the evaluation sets for most of the frame- DRS that was specially designed for MRP 2020 to works comprise different text types and subject make it structurally as close as possible to other matters—offering some hope of robustness to do- frameworks (Abzianidze et al., 2020). However, main variation. We expect that these parsers will en- several semantic parsers exist for DRS, which em- able follow-up experimentation on the utility of ex- ploy different encodings. 
Liu et al. (2018) used plicit meaning representation in downstream tasks a DRG format that dominantly labels edges com- like, for example, relation extraction, argumenta- pared to nodes. van Noord et al. (2018) process tion mining, summarization, or text generation. DRSs in a clausal form, sets of triples and quadru- Maybe equally importantly, the MRP task design ples. The latter format is more common among capitalizes on uniformity of representations and DRS parsers, as it was officially used by the shared evaluation, enabling resource creators and parser task on DRS parsing (Abzianidze et al., 2019). developers to more closely (inter)relate representa- The shared task gave rise to several DRS parsers: tions and parsing approaches across a diverse range Evang (2019); Liu et al. (2019); van Noord (2019); of semantic graph frameworks. This facilitates 16 both quantitative contrastive studies (e.g. the ‘post- safe-guarding against ‘neural meltdown’ (e.g. dis- mortem’ analysis by Buljan et al. (2020), which carding something as foundational as negation or observes that top-performing MRP 2019 parsers inadvertently altering a date expression in summa- have complementary strengths and weaknesses) rization or translation). In a similar vain, meaning but also more linguistic, qualitative comparison. representations are being successfully applied in General availability of parallel gold-standard anno- evaluation, e.g. to quantify system output vs. gold tations over the same text samples—drawing from standard similarity beyond surface n-grams (Sulem the WSJ and LPPS corpora—enables side-by-side et al., 2018b; Xu et al., 2020b, inter alios). comparison of linguistic design choices in the dif- All technical information regarding the ferent frameworks. 
This is an area of investigation that we hope will see increased interest in the aftermath of the MRP task series, to go well beyond the impressionistic observations from §3 and ideally lead to contrastive refinement across linguistic schools and traditions.

Despite uniformity in packaging and evaluation, the cumulative overall complexity and inherent diversity of the frameworks made participation in the shared task a formidable challenge. Of the sixteen teams who participated in MRP 2019, only four (predominantly strong performers from before) decided to submit parser outputs in 2020. The two 'newcomer' teams, by comparison, only made partial submissions in the cross-lingual track and ended up not competing for top ranks overall. Similar trends of 'competitive self-selection' and declining participant numbers across consecutive instances have been observed in earlier CoNLL shared tasks and similar benchmarking series.

On the upside, with the possible exception of English AMR (where there has been much contemporaneous progress recently), the MRP 2020 empirical results present a strong state-of-the-art benchmark for meaning representation parsing.

On the more foundational question of the relevance of explicit, discrete representations of sentence meaning, the past several years of breakthrough neural advances have been comparatively insensitive to syntactico-semantic structure. In our view, these developments have at least in part been reflective of the stark lack of general techniques for the encoding of hierarchical structure in end-to-end neural architectures. Increased adoption of Graph Convolutional Networks (Kipf and Welling, 2017) and other hierarchical modeling techniques suggests new opportunities for the exploration of structurally informed end-to-end architectures and, for example, multi-task learning setups. Beyond such ultimately performance-driven research, explicit encoding of syntactico-semantic structure in our view further bears promise in terms of model interpretability.

All technical information regarding the MRP 2020 shared task, including system submissions, detailed official results, and links to supporting resources and software, is available from the task web site at:

http://mrp.nlpl.eu

Acknowledgments

Several colleagues have assisted in designing the task and preparing its data and software resources. We thank Dotan Dvir (Hebrew University of Jerusalem) for leading the annotation efforts on UCCA. Dan Flickinger (Stanford University) created fresh gold-standard annotations of some 1,000 WSJ strings, which form part of the EDS evaluation graphs in 2020. Sebastian Schuster (Stanford University) advised on how to convert the gold-standard syntactic annotations from the venerable PTB and OntoNotes treebanks to Universal Dependencies, version 2.x, using 'modern' tokenization. Anna Nedoluzhko and Jiří Mírovský (Charles University in Prague) enhanced the PTG annotation of LPPS data with previously missing items, most notably coreference. Milan Straka (Charles University in Prague) made available an enhanced version of his UDPipe parser and assisted in training Czech, English, and German morpho-syntactic parsing models (for the MRP companion trees). Jayeol Chun (Brandeis University) provided invaluable assistance in the conversion of the Chinese AMR annotations, the preparation of the Chinese morpho-syntactic companion trees, and the provisioning of companion alignments for the English AMR graphs.

We are grateful to the Nordic Language Processing Laboratory (NLPL) and Uninett Sigma2, which provided technical infrastructure for the MRP 2020 task. Also, we warmly acknowledge the assistance of the Linguistic Data Consortium (LDC) in distributing the training data for the task to participants at no cost to anyone.

The work on UCCA and the HUJI-KU submission was partially supported by the Israel Science Foundation (grant No. 929/17). The work on PTG has been partially supported by the Ministry of Education, Youth and Sports of the Czech Republic (project LINDAT/CLARIAH-CZ, grant No. LM2018101) and partially by the Grant Agency of the Czech Republic (project LUSyD, grant No. GX20-16819X). The work on DRG was supported by the NWO-VICI grant (288-89-003) and the European Union Horizon 2020 research and innovation programme (under grant agreement No. 742204). The work on Chinese AMR data is partially supported by project 18BYY127 under the National Social Science Foundation of China and project 61772278 under the National Science Foundation of China.

References

Omri Abend and Ari Rappoport. 2013. UCCA. A semantics-based grammatical annotation scheme. In Proceedings of the 10th International Conference on Computational Semantics, pages 1–12, Potsdam, Germany.

Lasha Abzianidze, Johannes Bjerva, Kilian Evang, Hessel Haagsma, Rik van Noord, Pierre Ludmann, Duc-Duy Nguyen, and Johan Bos. 2017. The Parallel Meaning Bank: Towards a multilingual corpus of translations annotated with compositional meaning representations. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 242–247, Valencia, Spain. Association for Computational Linguistics.

Lasha Abzianidze, Johan Bos, and Stephan Oepen. 2020. DRS at MRP 2020: Dressing up Discourse Representation Structures as graphs. In Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, pages 23–32, Online.

Lasha Abzianidze, Rik van Noord, Hessel Haagsma, and Johan Bos. 2019. The first shared task on discourse representation structure parsing. In Proceedings of the IWCS Shared Task on Semantic Parsing, Gothenburg, Sweden. Association for Computational Linguistics.

Ofir Arviv, Ruixiang Cui, and Daniel Hershcovich. 2020. HUJI-KU at MRP 2020: Two transition-based neural parsers. In Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, pages 73–82, Online.

Nicholas Asher. 1993. Reference to Abstract Objects in Discourse. Kluwer Academic Publishers.

Ramon Fernandez Astudillo, Miguel Ballesteros, Tahira Naseem, Austin Blodgett, and Radu Florian. 2020. Transition-based parsing with stack-transformers. In Findings of EMNLP.

Miguel Ballesteros and Yaser Al-Onaizan. 2017. AMR parsing using stack-LSTMs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1269–1275, Copenhagen, Denmark. Association for Computational Linguistics.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria.

Emily M. Bender, Dan Flickinger, Stephan Oepen, Woodley Packard, and Ann Copestake. 2015. Layers of interpretation. On grammar and compositionality. In Proceedings of the 11th International Conference on Computational Semantics, pages 239–249, London, UK.

Alexandra Birch, Omri Abend, Ondřej Bojar, and Barry Haddow. 2016. HUME. Human UCCA-based evaluation of machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1264–1274, Austin, TX, USA.

Rexhina Blloshmi, Rocco Tripodi, and Roberto Navigli. 2020. XL-AMR: Enabling cross-lingual AMR parsing with transfer learning techniques. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.

Alena Böhmová, Jan Hajič, Eva Hajičová, and Barbora Hladká. 2003. The Prague Dependency Treebank: A three-level annotation scenario. In Anne Abeillé, editor, Treebanks. Building and Using Parsed Corpora, pages 103–127. Kluwer Academic Publishers, Dordrecht, The Netherlands.

Johan Bos, Valerio Basile, Kilian Evang, Noortje Venhuizen, and Johannes Bjerva. 2017. The Groningen Meaning Bank. In Nancy Ide and James Pustejovsky, editors, Handbook of Linguistic Annotation. Springer Netherlands.

Maja Buljan, Joakim Nivre, Stephan Oepen, and Lilja Øvrelid. 2020. A tale of three parsers: Towards diagnostic evaluation for meaning representation parsing. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1902–1909, Marseille, France. European Language Resources Association.

Deng Cai and Wai Lam. 2019. Core semantic first: A top-down approach for AMR parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3799–3809, Hong Kong, China. Association for Computational Linguistics.

Deng Cai and Wai Lam. 2020. AMR parsing via graph-sequence iterative inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1290–1301, Online. Association for Computational Linguistics.

Shu Cai and Kevin Knight. 2013. Smatch. An evaluation metric for semantic feature structures. In Proceedings of the 51st Meeting of the Association for Computational Linguistics, pages 748–752, Sofia, Bulgaria.

Wanxiang Che, Longxu Dou, Yang Xu, Yuxuan Wang, Yijia Liu, and Ting Liu. 2019. HIT-SCIR at MRP 2019: A unified pipeline for meaning representation parsing via efficient training and effective encoding. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Computational Natural Language Learning, pages 76–85, Hong Kong, China.

Yufei Chen, Yajie Ye, and Weiwei Sun. 2019. Peking at MRP 2019: Factorization- and composition-based parsing for Elementary Dependency Structures. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Computational Natural Language Learning, pages 166–176, Hong Kong, China.

Leshem Choshen and Omri Abend. 2018. Reference-less measure of faithfulness for grammatical error correction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, New Orleans, LA, USA.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Ann Copestake, Dan Flickinger, Carl Pollard, and Ivan A. Sag. 2005. Minimal Recursion Semantics. An introduction. Research on Language and Computation, 3(4):281–332.

Marco Damonte and Shay B. Cohen. 2018. Cross-lingual Abstract Meaning Representation parsing. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1146–1155, New Orleans, Louisiana. Association for Computational Linguistics.

Robert M. W. Dixon. 2010/2012. Basic Linguistic Theory. Oxford University Press.

Lucia Donatelli, Meaghan Fowlie, Jonas Groschwitz, Alexander Koller, Matthias Lindemann, Mario Mina, and Pia Weißenhorn. 2019. Saarland at MRP 2019: Compositional parsing across all graphbanks. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Computational Natural Language Learning, pages 66–75, Hong Kong, China.

Longxu Dou, Yunlong Feng, Yuqiu Ji, Wanxiang Che, and Ting Liu. 2020. HIT-SCIR at MRP 2020: Transition-based parser and iterative inference parser. In Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, pages 65–72, Online.

Rebecca Dridan and Stephan Oepen. 2011. Parser evaluation using elementary dependency matching. In Proceedings of the 12th International Conference on Parsing Technologies, pages 225–230, Dublin, Ireland.

Jason Eisner. 1997. Bilexical grammars and a cubic-time probabilistic parser. In Proceedings of the Fifth International Workshop on Parsing Technologies, pages 54–65, Boston/Cambridge, Massachusetts, USA. Association for Computational Linguistics.

Kilian Evang. 2019. Transition-based DRS parsing using stack-LSTMs. In Proceedings of the IWCS Shared Task on Semantic Parsing, Gothenburg, Sweden. Association for Computational Linguistics.

Federico Fancellu, Sorcha Gilroy, Adam Lopez, and Mirella Lapata. 2019. Semantic graph parsing with recurrent neural network DAG grammars. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2769–2778, Hong Kong, China. Association for Computational Linguistics.

Christiane Fellbaum. 2012. Wordnet. The Encyclopedia of Applied Linguistics.

Daniel Fernández-González and Carlos Gómez-Rodríguez. 2020. Transition-based semantic dependency parsing with pointer networks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7035–7046, Online. Association for Computational Linguistics.

Dan Flickinger, Stephan Oepen, and Emily M. Bender. 2017. Sustainable development and refinement of complex linguistic annotations at scale. In Nancy Ide and James Pustejovsky, editors, Handbook of Linguistic Annotation, pages 353–377. Springer, Dordrecht, The Netherlands.

Jan Hajič, Eduard Bejček, Alevtina Bémová, Eva Buráňová, Eva Hajičová, Jiří Havelka, Petr Homola, Jiří Kárník, Václava Kettnerová, Natalia Klyueva, Veronika Kolářová, Lucie Kučová, Markéta Lopatková, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko, Petr Pajas, Jarmila Panevová, Lucie Poláková, Magdaléna Rysová, Petr Sgall, Johanka Spoustová, Pavel Straňák, Pavlína Synková, Magda Ševčíková, Jan Štěpánek, Zdeňka Urešová, Barbora Vidová Hladká, Daniel Zeman, Šárka Zikánová, and Zdeněk Žabokrtský. 2018. Prague Dependency Treebank 3.5. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1–18, Boulder, Colorado. Association for Computational Linguistics.

Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2012. Announcing Prague Czech-English Dependency Treebank 2.0. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 3153–3160, Istanbul, Turkey.

Eva Hajičová, Barbara Partee, and Petr Sgall. 1998. Topic–Focus Articulation, Tripartite Structures, and Semantic Content. Kluwer, Dordrecht, The Netherlands.

Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2017. A transition-based directed acyclic graph parser for UCCA. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1127–1138, Vancouver, Canada. Association for Computational Linguistics.

Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2018. Multitask parsing across semantic representations. In Proceedings of the 56th Meeting of the Association for Computational Linguistics, pages 373–385, Melbourne, Australia.

Daniel Hershcovich, Zohar Aizenbud, Leshem Choshen, Elior Sulem, Ari Rappoport, and Omri Abend. 2019a. SemEval-2019 task 1. Cross-lingual semantic parsing with UCCA. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 1–10, Minneapolis, MN, USA.

Daniel Hershcovich, Zohar Aizenbud, Leshem Choshen, Elior Sulem, Ari Rappoport, and Omri Abend. 2019b. SemEval-2019 task 1: Cross-lingual semantic parsing with UCCA. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 1–10, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes. The 90% solution. In Proceedings of Human Language Technologies: The 2006 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 57–60, New York City, USA.

Zixia Jia, Youmi Ma, Jiong Cai, and Kewei Tu. 2020. Semi-supervised semantic dependency parsing using CRF autoencoders. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6795–6805, Online. Association for Computational Linguistics.

Wei Jiang, Zhenghua Li, Yu Zhang, and Min Zhang. 2019. HLT@SUDA at SemEval-2019 task 1: UCCA graph parsing as constituent tree parsing. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 11–15, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Hans Kamp and Uwe Reyle. 1993. From Discourse to Logic; An Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and DRT. Kluwer, Dordrecht.

Paul Kingsbury and Martha Palmer. 2002. From TreeBank to PropBank. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, pages 1989–1993, Las Palmas, Spain.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, Toulon, France.

Marco Kuhlmann and Stephan Oepen. 2016. Towards a catalogue of linguistic graph banks. Computational Linguistics, 42(4):819–827.

Shuhei Kurita and Anders Søgaard. 2019. Multi-task semantic dependency parsing with policy gradient for learning easy-first strategies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2420–2430, Florence, Italy. Association for Computational Linguistics.

Robin Kurtz, Daniel Roxbo, and Marco Kuhlmann. 2019. Improving semantic dependency parsing with syntactic features. In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pages 12–21, Turku, Finland. Linköping University Electronic Press.

Young-Suk Lee, Ramon Fernandez Astudillo, Tahira Naseem, Revanth Gangi Reddy, Radu Florian, and Salim Roukos. 2020. Pushing the limits of AMR parsing with self-learning. In Findings of EMNLP.

Matthias Lindemann, Jonas Groschwitz, and Alexander Koller. 2019. Compositional semantic parsing across graphbanks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4576–4585, Florence, Italy. Association for Computational Linguistics.

Jiangming Liu, Shay B. Cohen, and Mirella Lapata. 2018. Discourse representation structure parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 429–439, Melbourne, Australia. Association for Computational Linguistics.

Jiangming Liu, Shay B. Cohen, and Mirella Lapata. 2019. Discourse representation structure parsing with recurrent neural networks and the transformer model. In Proceedings of the IWCS Shared Task on Semantic Parsing, Gothenburg, Sweden. Association for Computational Linguistics.

Chunchuan Lyu, Shay B. Cohen, and Ivan Titov. 2020. A differentiable relaxation of graph segmentation and alignment for AMR parsing.

Chunchuan Lyu and Ivan Titov. 2018. AMR parsing as graph prediction with latent alignment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 397–407, Melbourne, Australia. Association for Computational Linguistics.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English. The Penn Treebank. Computational Linguistics, 19:313–330.

Jonathan May. 2016. SemEval-2016 Task 8. Meaning representation parsing. In Proceedings of the 10th International Workshop on Semantic Evaluation, pages 1063–1073, San Diego, CA, USA.

Jonathan May and Jay Priyadarshi. 2017. SemEval-2017 Task 9. Abstract Meaning Representation parsing and generation. In Proceedings of the 11th International Workshop on Semantic Evaluation, pages 536–545.

James J. McGregor. 1982. Backtrack search algorithms and the maximal common subgraph problem. Software: Practice and Experience, 12(1):23–34.

Seung-Hoon Na and Jinwoo Min. 2020. JBNU at MRP 2020: AMR parsing using a joint state model for graph-sequence iterative inference. In Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, pages 83–87, Online.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.

Rik van Noord. 2019. Neural Boxer at the IWCS shared task on DRS parsing. In Proceedings of the IWCS Shared Task on Semantic Parsing, Gothenburg, Sweden. Association for Computational Linguistics.

Rik van Noord, Lasha Abzianidze, Antonio Toral, and Johan Bos. 2018. Exploring neural methods for parsing discourse representation structures. Transactions of the Association for Computational Linguistics, 6:619–633.

Stephan Oepen, Omri Abend, Jan Hajič, Daniel Hershcovich, Marco Kuhlmann, Tim O'Gorman, Nianwen Xue, Jayeol Chun, Milan Straka, and Zdeňka Urešová. 2019. MRP 2019: Cross-framework Meaning Representation Parsing. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Computational Natural Language Learning, pages 1–27, Hong Kong, China.

Stephan Oepen and Dan Flickinger. 2019. The ERG at MRP 2019: Radically compositional semantic dependencies. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Computational Natural Language Learning, pages 40–44, Hong Kong, China.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, and Zdeňka Urešová. 2015. SemEval 2015 Task 18. Broad-coverage semantic dependency parsing. In Proceedings of the 9th International Workshop on Semantic Evaluation, pages 915–926, Denver, CO, USA.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Dan Flickinger, Jan Hajič, Angelina Ivanova, and Yi Zhang. 2014. SemEval 2014 Task 8. Broad-coverage semantic dependency parsing. In Proceedings of the 8th International Workshop on Semantic Evaluation, pages 63–72, Dublin, Ireland.

Stephan Oepen and Jan Tore Lønning. 2006. Discriminant-based MRS banking. In Proceedings of the 5th International Conference on Language Resources and Evaluation, pages 1250–1255, Genoa, Italy.

Hiroaki Ozaki, Gaku Morio, Yuta Koreeda, Terufumi Morishita, and Toshinori Miyoshi. 2020. Hitachi at MRP 2020: Text-to-graph-notation transducer. In Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, pages 40–52, Online.

Hao Peng, Sam Thomson, and Noah A. Smith. 2017. Deep multitask learning for semantic dependency parsing. In Proceedings of the 55th Meeting of the Association for Computational Linguistics, pages 2037–2048, Vancouver, Canada.

Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. Studies in Contemporary Linguistics. The University of Chicago Press, Chicago, USA.

Rudolf Rosa, Martin Popel, Ondřej Bojar, David Mareček, and Ondřej Dušek. 2016. Moses & treex hybrid MT systems bestiary. In Proceedings of the 2nd Deep Machine Translation Workshop, pages 1–10, Lisbon, Portugal. ÚFAL MFF UK.

David Samuel and Milan Straka. 2020. ÚFAL at MRP 2020: Permutation-invariant semantic parsing in PERIN. In Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, pages 53–64, Online.

Rob A. Van der Sandt. 1992. Presupposition projection as anaphora resolution. Journal of Semantics, 9(4):333–377.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English Universal Dependencies: An improved representation for natural language understanding tasks. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2371–2378, Portorož, Slovenia. European Language Resources Association (ELRA).

Petr Sgall, Eva Hajičová, and Jarmila Panevová. 1986. The Meaning of the Sentence and Its Semantic and Pragmatic Aspects. D. Reidel Publishing Company, Dordrecht, The Netherlands.

Gabriel Stanovsky and Ido Dagan. 2018. Semantics as a foreign language. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2412–2421, Brussels, Belgium.

Mark Steedman. 2011. Taking Scope. MIT Press, Cambridge, MA, USA.

Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207.

Milan Straka and Jana Straková. 2020. UDPipe at EvaLatin 2020: Contextualized embeddings and treebank embeddings. In Proceedings of LT4HALA 2020 – 1st Workshop on Language Technologies for Historical and Ancient Languages, pages 124–129, Marseille, France. European Language Resources Association (ELRA).

Elior Sulem, Omri Abend, and Ari Rappoport. 2015. Conceptual annotations preserve structure across translations. A French–English case study. In Proceedings of the 1st Workshop on Semantics-Driven Statistical Machine Translation, pages 11–22.

Elior Sulem, Omri Abend, and Ari Rappoport. 2018a. Semantic structural annotation for text simplification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, New Orleans, LA, USA.

Elior Sulem, Omri Abend, and Ari Rappoport. 2018b. Semantic structural evaluation for text simplification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 685–696, New Orleans, Louisiana. Association for Computational Linguistics.

Elior Sulem, Omri Abend, and Ari Rappoport. 2018c. Simple and effective text simplification using semantic and neural methods. In Proceedings of the 56th Meeting of the Association for Computational Linguistics, Melbourne, Australia.

Zdeňka Urešová, Eva Fučíková, and Jana Šindlerová. 2016. CzEngVallex. A bilingual Czech–English valency lexicon. The Prague Bulletin of Mathematical Linguistics, 105:17–50.

Chuan Wang, Bin Li, and Nianwen Xue. 2018. Transition-based Chinese AMR parsing. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 247–252, New Orleans, Louisiana. Association for Computational Linguistics.

Chuan Wang and Nianwen Xue. 2017. Getting the most out of AMR parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1257–1268, Copenhagen, Denmark.

Xinyu Wang, Jingxian Huang, and Kewei Tu. 2019. Second-order semantic dependency parsing with end-to-end neural networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4609–4618, Florence, Italy. Association for Computational Linguistics.

Dongqin Xu, Junhui Li, Muhua Zhu, Min Zhang, and Guodong Zhou. 2020a. Improving AMR parsing with sequence-to-sequence pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.

Jin Xu, Yinuo Guo, and Junfeng Hu. 2020b. Incorporate semantic structures into machine translation evaluation via UCCA. In Proceedings of the International Conference on Machine Translation, Online.

Daniel Zeman and Jan Hajič. 2020. FGD at MRP 2020: Prague Tectogrammatical Graphs. In Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, pages 33–39, Online.

Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019a. AMR parsing as sequence-to-graph transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 80–94, Florence, Italy. Association for Computational Linguistics.

Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019b. Broad-coverage semantic parsing as transduction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3786–3798, Hong Kong, China. Association for Computational Linguistics.

Yue Zhang, Wei Jiang, Qingrong Xia, Junjie Cao, Rui Wang, Zhenghua Li, and Min Zhang. 2019c. SUDA–Alibaba at MRP 2019: Graph-based models with BERT. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Computational Natural Language Learning, pages 149–157, Hong Kong, China.