A Latent-Variable Model for Intrinsic Probing

Karolina Stańczak*1  Lucas Torroba Hennigen*2  Adina Williams3  Ryan Cotterell4  Isabelle Augenstein1

*Equal contribution. 1University of Copenhagen, 2Massachusetts Institute of Technology, 3Facebook AI Research, 4ETH Zürich. Correspondence to: Karolina Stańczak <[email protected]>, Lucas Torroba Hennigen <[email protected]>.

arXiv:2201.08214v1 [cs.CL] 20 Jan 2022

Abstract

The success of pre-trained contextualized representations has prompted researchers to analyze them for the presence of linguistic information. Indeed, it is natural to assume that these pre-trained representations do encode some level of linguistic knowledge as they have brought about large empirical improvements on a wide variety of NLP tasks, which suggests they are learning true linguistic generalization. In this work, we focus on intrinsic probing, an analysis technique where the goal is not only to identify whether a representation encodes a linguistic attribute, but also to pinpoint where this attribute is encoded. We propose a novel latent-variable formulation for constructing intrinsic probes and derive a tractable variational approximation to the log-likelihood. Our results show that our model is versatile and yields tighter mutual information estimates than two intrinsic probes previously proposed in the literature. Finally, we find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.

Figure 1: The percentage overlap between the top-30 most informative number dimensions in BERT for the probed languages (ara, eng, por, pol, rus, fin). Statistically significant overlap, after Holm–Bonferroni family-wise error correction (Holm, 1979), with α = 0.05, is marked with an orange square.

1. Introduction

There have been considerable improvements to the quality of pre-trained contextualized representations in recent years (e.g., Peters et al., 2018; Devlin et al., 2019; Raffel et al., 2020). These advances have sparked an interest in understanding what linguistic information may be lurking within the representations themselves (Poliak et al., 2018; Zhang & Bowman, 2018; Rogers et al., 2020, inter alia). One philosophy that has been proposed to extract this information is called probing, the task of training an external classifier to predict the linguistic property of interest directly from the representations. The hope of probing is that it sheds light onto how much linguistic knowledge is present in representations and, perhaps, how that information is structured. Probing has grown to be a fruitful area of research, with researchers probing for morphological (Tang et al., 2020; Ács et al., 2021), syntactic (Voita & Titov, 2020; Hall Maudslay et al., 2020; Ács et al., 2021), and semantic (Vulić et al., 2020; Tang et al., 2020) information.
In this paper, we focus on one type of probing known as intrinsic probing (Dalvi et al., 2019; Torroba Hennigen et al., 2020), a subset of which specifically aims to ascertain how information is structured within a representation. This means that we are not solely interested in determining whether a network encodes the tense of a verb, but also in pinpointing exactly which neurons in the network are responsible for encoding the property. Unfortunately, the naïve formulation of intrinsic probing requires one to analyze all possible combinations of neurons, which is intractable even for the smallest representations used in modern-day NLP. For example, analyzing all combinations of 768-dimensional BERT word representations would require us to train 2^768 different probes, one for each combination of neurons, which far exceeds the estimated number of atoms in the observable universe.
To obviate this difficulty, we introduce a novel latent-variable probe for discriminative intrinsic probing. The core idea of this approach is that instead of training a different probe for each combination of neurons, we introduce a subset-valued latent variable. We approximately marginalize over the latent subsets using variational inference. Training the probe in this manner results in a set of parameters which work well across all possible subsets. We propose two variational families to model the posterior over the latent subset-valued random variables, both based on common sampling designs: Poisson sampling, which selects each neuron based on an independent Bernoulli trial, and conditional Poisson sampling, which first samples a subset size from a uniform distribution and then a subset of neurons of that size (Lohr, 2019). Conditional Poisson sampling offers the modeler more control over the distribution over subset sizes; they may pick the parametric distribution themselves.
We compare both variants to the two main intrinsic probing approaches we are aware of in the literature (§5). To do so, we train probes for 29 morphosyntactic properties across 6 languages (English, Portuguese, Polish, Russian, Arabic, and Finnish) from the Universal Dependencies (UD; Nivre et al. 2017) treebanks. We show that, in general, both variants of our method yield tighter estimates of the mutual information, though the model based on conditional Poisson sampling yields slightly better performance. This suggests that they are better at quantifying the informational content encoded in m-BERT representations (Devlin et al., 2019). We make two typological findings when applying our probe.
We show that there is a difference in how information is structured depending on the language, with certain language–attribute pairs requiring more dimensions to encode relevant information. We also analyze whether neural representations are able to learn cross-lingual abstractions from multilingual corpora. We confirm this statement and observe a strong overlap in the most informative dimensions, especially for number (Fig. 1). In an additional experiment, we show that our method supports training deeper probes (§5.4), though the advantages of non-linear probes over their linear counterparts are modest.

2. Intrinsic Probing

The success behind pre-trained contextual representations such as BERT (Devlin et al., 2019) suggests that they may offer a continuous analogue of the discrete structures in language, such as the morphosyntactic attributes number, case, or tense. Intrinsic probing aims to recognize the parts of a network (assuming they exist) which encode such structures. In this paper, we will operate exclusively at the level of the neuron—in the case of BERT, this is one component of the 768-dimensional vector the model outputs. However, our approach can easily generalize to other settings, e.g., the layers in a transformer or the filters of a convolutional neural network. Identifying individual neurons responsible for encoding linguistic features of interest has previously been shown to increase model transparency (Bau et al., 2019). In fact, knowledge about which neurons encode certain properties has also been employed to mitigate potential biases (Vig et al., 2020), for controllable text generation (Bau et al., 2019), and to analyze the linguistic capabilities of language models (Lakretz et al., 2019).

To formally describe our intrinsic probing framework, we first introduce some notation. We define Π to be the set of values that some property of interest can take, e.g., Π = {SINGULAR, PLURAL} for the morphosyntactic number attribute. Let D = {(π^(n), h^(n))}_{n=1}^N be a dataset of label–representation pairs: π^(n) ∈ Π is a linguistic property and h^(n) ∈ R^d is a representation. Additionally, let 𝒟 be the set of all neurons in a representation; in our setup, it is an integer range. In the case of BERT, we have 𝒟 = {1, ..., 768}. Given a subset of dimensions C ⊆ 𝒟, we write h_C for the subvector of h which contains only the dimensions present in C.

Let p_θ(π^(n) | h_C^(n)) be a probe—a classifier trained to predict π^(n) from a subvector h_C^(n). In intrinsic probing, our goal is to find the size-k subset of neurons C ⊆ 𝒟 which is most informative about the property of interest. This may be written as the following combinatorial optimization problem (Torroba Hennigen et al., 2020):

    C^\star = \operatorname{argmax}_{C \subseteq \mathcal{D},\, |C| = k} \; \sum_{n=1}^{N} \log p_\theta\big(\pi^{(n)} \mid \mathbf{h}_C^{(n)}\big)    (1)

To exhaustively solve Eq. (1), we would have to train a probe p_θ(π | h_C) for every one of the exponentially many subsets C ⊆ 𝒟 of size k. Thus, exactly solving Eq. (1) is infeasible, and we are forced to rely on an approximate solution, e.g., greedily selecting the dimension that maximizes the objective. However, greedy selection alone is not enough to make solving Eq. (1) manageable, because we would have to retrain p_θ(π | h_C) for every subset C ⊆ 𝒟 considered during the greedy selection procedure, i.e., we would end up training O(k |𝒟|) classifiers. As an example, consider what would happen if one used a greedy selection scheme to find the 50 most informative dimensions for a property on 768-dimensional BERT representations. To select the first dimension, one would need to train 768 probes. To select the second dimension, one would train an additional 767, and so forth. After 50 dimensions, one would have trained 37893 probes. To address this problem, our paper introduces a latent-variable probe, which identifies a θ that can be used for any combination of neurons under consideration, allowing a greedy selection procedure to work in practice.
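To make the cost contrast concrete, the sketch below shows how greedy selection proceeds once a single parameter vector θ can score any subset: each candidate dimension is evaluated rather than retrained. This is an illustrative sketch, not the authors' released code; `log_likelihood` is a hypothetical stand-in for evaluating the shared probe on the validation set.

```python
# Greedy forward selection of informative dimensions (illustrative sketch).
# `log_likelihood(dims)` is assumed to evaluate an already-trained, shared
# probe on held-out data using only the dimensions in `dims` (all other
# dimensions zeroed out). No training happens inside the loop, which is what
# makes greedy selection affordable: O(k|D|) probe evaluations instead of
# O(k|D|) freshly trained classifiers.
def greedy_select(log_likelihood, num_dims=768, k=50):
    selected, remaining = [], set(range(num_dims))
    for _ in range(k):
        best_dim = max(remaining, key=lambda d: log_likelihood(selected + [d]))
        selected.append(best_dim)
        remaining.remove(best_dim)
    return selected
```

The remainder of the paper is concerned with how to obtain such a θ that works well for any subset handed to it.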
3. A Latent-Variable Probe

The technical contribution of this work is a novel latent-variable model for intrinsic probing. Our method starts with a generic probabilistic probe p_θ(π | C, h) which predicts a linguistic attribute π given a subset C of the hidden dimensions; C is then used to subset h into h_C. To avoid training a unique probe p_θ(π | C, h) for every possible subset C ⊆ 𝒟, we propose to integrate a prior over subsets p(C) into the model and then to marginalize out all possible subsets of neurons:

    p_\theta(\pi \mid \mathbf{h}) = \sum_{C \subseteq \mathcal{D}} p_\theta(\pi \mid C, \mathbf{h})\, p(C)    (2)

Due to this marginalization, our likelihood is not dependent on any specific subset of neurons C. Throughout this paper we opted for a non-informative, uniform prior p(C), but other distributions are also possible.

Our goal is to estimate the parameters θ. We achieve this by maximizing the log-likelihood of the training data, \sum_{n=1}^{N} \log \sum_{C \subseteq \mathcal{D}} p_\theta(\pi^{(n)}, C \mid \mathbf{h}^{(n)}), with respect to the parameters θ. Unfortunately, directly computing this involves a sum over all possible subsets of 𝒟—a sum with an exponential number of summands. Thus, we resort to a variational approximation. Let q_φ(C) be a distribution over subsets, parameterized by parameters φ; we will use q_φ(C) to approximate the true posterior distribution. Then, the log-likelihood is lower-bounded as follows:

    \sum_{n=1}^{N} \log \sum_{C \subseteq \mathcal{D}} p_\theta\big(\pi^{(n)}, C \mid \mathbf{h}^{(n)}\big) \;\geq\; \sum_{n=1}^{N} \mathbb{E}_{q_\phi}\Big[\log p_\theta\big(\pi^{(n)}, C \mid \mathbf{h}^{(n)}\big)\Big] + \mathrm{H}(q_\phi)    (3)

which follows from Jensen's inequality, where H(q_φ) is the entropy of q_φ.[1]

[1] See App. A for the full derivation.

Our likelihood is general, and can take the form of any objective function. This means that we can use this approach to train intrinsic probes with any type of architecture amenable to gradient-based optimization, e.g., neural networks. However, in this paper, we use a linear classifier unless stated otherwise. Further, note that Eq. (3) is valid for any choice of q_φ. We explore two variational families for q_φ, each based on a common sampling technique. The first (herein POISSON) applies Poisson sampling (Hájek, 1964), which assumes each neuron to be subjected to an independent Bernoulli trial. The second (herein CONDITIONAL POISSON; Aires, 1999) corresponds to conditional Poisson sampling, which can be defined as conditioning a Poisson sample on a fixed sample size.

3.1. Parameter Estimation

As mentioned above, exact computation of the log-likelihood is intractable due to the sum over all possible subsets of 𝒟. Thus, we optimize the variational bound presented in Eq. (3). We optimize the bound through stochastic gradient descent with respect to the model parameters θ and the variational parameters φ, a technique known as stochastic variational inference (Hoffman et al., 2013). However, one final trick is necessary, since the variational bound still includes a sum over all subsets in the first term:

    \nabla_\theta\, \mathbb{E}_{q_\phi}\Big[\log p_\theta\big(\pi^{(n)}, C \mid \mathbf{h}^{(n)}\big)\Big]
      = \mathbb{E}_{q_\phi}\Big[\nabla_\theta \log p_\theta\big(\pi^{(n)}, C \mid \mathbf{h}^{(n)}\big)\Big]
      \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_\theta \log p_\theta\big(\pi^{(n)}, C^{(m)} \mid \mathbf{h}^{(n)}\big)    (4)

where we take M Monte Carlo samples C^(1), ..., C^(M) ~ q_φ to approximate the sum. In the case of the gradient with respect to φ, we also have to apply the REINFORCE trick (Williams, 1992):

    \nabla_\phi\, \mathbb{E}_{q_\phi}\Big[\log p_\theta\big(\pi^{(n)}, C \mid \mathbf{h}^{(n)}\big)\Big]
      = \mathbb{E}_{q_\phi}\Big[\log p_\theta\big(\pi^{(n)}, C \mid \mathbf{h}^{(n)}\big)\, \nabla_\phi \log q_\phi(C)\Big]
      \approx \frac{1}{M} \sum_{m=1}^{M} \log p_\theta\big(\pi^{(n)}, C^{(m)} \mid \mathbf{h}^{(n)}\big)\, \nabla_\phi \log q_\phi\big(C^{(m)}\big)    (5)

where we again take M Monte Carlo samples. This procedure leads to an unbiased estimate of the gradient of the variational approximation.
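The following sketch illustrates one stochastic update on the bound in Eq. (3), combining the Monte Carlo estimator in Eq. (4) for θ with the score-function (REINFORCE) estimator in Eq. (5) for φ. It is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: `probe`, `q`, and the masking convention are hypothetical interfaces, and the uniform prior p(C) is dropped because it is constant.

```python
import torch

def elbo_step(probe, q, h, pi, optimizer, num_samples=5):
    # One stochastic ascent step on the bound in Eq. (3) for a single example.
    # Assumed (hypothetical) interfaces:
    #   probe(x)      -> 1-D tensor of log-probabilities over the label set
    #   q.sample()    -> 0/1 mask over dimensions (no gradient through sampling)
    #   q.log_prob(m) -> log q_phi(m), differentiable in phi
    #   q.entropy()   -> H(q_phi), differentiable in phi
    # `optimizer` is assumed to hold both theta (probe) and phi (q) parameters.
    optimizer.zero_grad()
    surrogate = 0.0
    for _ in range(num_samples):
        mask = q.sample()                    # C ~ q_phi
        log_p = probe(h * mask)[pi]          # log p_theta(pi | C, h); dims outside C zeroed (cf. §4)
        # Pathwise term trains theta (Eq. 4); the detached score-function term
        # trains phi via REINFORCE (Eq. 5) without differentiating the sampling step.
        surrogate = surrogate + log_p + log_p.detach() * q.log_prob(mask)
    surrogate = surrogate / num_samples + q.entropy()
    (-surrogate).backward()                  # gradient ascent on the bound
    optimizer.step()
```

Note that `surrogate` is a training objective whose gradients match Eqs. (4)–(5); its value is not itself an estimate of the bound. A concrete choice of `q` is sketched in the next subsection.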
3.2. Choice of Variational Family

We consider two choices of variational family q_φ(C), both based on sampling designs (Lohr, 2019). Each defines a parameterized distribution over all subsets of 𝒟.

Poisson Sampling. Poisson sampling is one of the simplest sampling designs. In our setting, each neuron d is given a unique non-negative weight w_d = exp(φ_d). This gives us the following parameterized distribution over subsets:

    q_\phi(C) = \prod_{d \in C} \frac{w_d}{1 + w_d} \; \prod_{d \notin C} \frac{1}{1 + w_d}    (6)

The formulation in Eq. (6) shows that taking a sample corresponds to |𝒟| independent coin flips—one for each neuron—where the probability of heads is w_d / (1 + w_d). The entropy of a Poisson sampling distribution may be computed in O(|𝒟|) time:

    \mathrm{H}(q_\phi) = \log Z - \sum_{d=1}^{|\mathcal{D}|} \frac{w_d}{1 + w_d} \log w_d    (7)

where \log Z = \sum_{d=1}^{|\mathcal{D}|} \log(1 + w_d). The gradient of Eq. (7) may be computed automatically through backpropagation. Poisson sampling automatically modulates the size of the sampled set C ~ q_φ(·), with expected size E[|C|] = \sum_{d=1}^{|\mathcal{D}|} w_d / (1 + w_d).
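Because the Poisson family factorizes over neurons, every quantity above reduces to independent Bernoulli computations. The class below is a small PyTorch sketch of such a family (our illustration, not the authors' code); it exposes exactly the operations the training-step sketch in §3.1 assumes.

```python
import torch

class PoissonFamily(torch.nn.Module):
    # Poisson-sampling variational family over subsets of the neurons
    # (an illustrative sketch of Eqs. 6-7, not the authors' implementation).
    # Neuron d carries weight w_d = exp(phi_d) and is kept independently
    # with probability w_d / (1 + w_d) = sigmoid(phi_d).

    def __init__(self, num_dims):
        super().__init__()
        self.phi = torch.nn.Parameter(torch.zeros(num_dims))

    def keep_prob(self):
        return torch.sigmoid(self.phi)                 # w_d / (1 + w_d)

    def sample(self):
        # |D| independent coin flips, returned as a 0/1 mask over dimensions.
        return torch.bernoulli(self.keep_prob())

    def log_prob(self, mask):
        # log q_phi(C) under Eq. (6), differentiable in phi.
        p = self.keep_prob()
        return (mask * torch.log(p) + (1 - mask) * torch.log1p(-p)).sum()

    def entropy(self):
        # Sum of independent Bernoulli entropies, computable in O(|D|);
        # algebraically equal to Eq. (7): log Z - sum_d (w_d/(1+w_d)) log w_d.
        p = self.keep_prob()
        return -(p * torch.log(p) + (1 - p) * torch.log1p(-p)).sum()

    def expected_size(self):
        return self.keep_prob().sum()                  # E[|C|]
```

An instance of this class can be plugged directly into the `elbo_step` sketch above as `q`.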
Conditional Poisson Sampling. We also consider a variational family that factors as follows:

    q_\phi(C) = q_\phi^{\mathrm{CP}}(C \mid |C| = k)\; q_\phi^{\mathrm{size}}(k)    (8)

where the first factor is a conditional Poisson sampling design. In this paper, we take q_φ^size(k) = Uniform(𝒟), but a more complex distribution, e.g., a Categorical, could be learned. Similarly to Poisson sampling, conditional Poisson sampling starts with a unique positive weight associated with every neuron, w_d = exp(φ_d). However, an additional cardinality constraint is introduced. This leads to the following distribution:

    q_\phi^{\mathrm{CP}}(C) = \mathbb{1}\{|C| = k\}\, \frac{\prod_{d \in C} w_d}{Z^{\mathrm{CP}}}    (9)

A dynamic program which runs in O(k |𝒟|) may be used to compute Z^CP efficiently (Aires, 1999). We may further compute the entropy H(q_φ) and its gradient in O(|𝒟|^2) time using the expectation semiring (Eisner, 2002; Li & Eisner, 2009). Sampling from q_φ^CP can be done efficiently using quantities computed when running the dynamic program used to compute Z^CP (Kulesza, 2012).[2]

[2] In practice, we use the semiring implementations by Rush (2020).
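The normalizer Z^CP in Eq. (9) is a sum over all size-k subsets, but it collapses to an elementary symmetric polynomial of the weights and can be computed with the O(k|𝒟|) dynamic program mentioned above. The snippet below is a plain PyTorch sketch of that recursion; the paper's implementation instead uses the expectation semiring via Rush (2020), and this simplified version also ignores numerical issues such as working in log space.

```python
import torch

def conditional_poisson_log_normalizer(phi, k):
    # log Z^CP, where Z^CP = sum over subsets C with |C| = k of prod_{d in C} w_d
    # and w_d = exp(phi_d), i.e. the k-th elementary symmetric polynomial of w.
    # Illustrative O(k|D|) dynamic program, not the authors' semiring code.
    w = torch.exp(phi)
    # dp[j] holds e_j(w_1, ..., w_d) after processing the first d weights.
    dp = [torch.ones(())] + [torch.zeros(()) for _ in range(k)]
    for d in range(w.shape[0]):
        # Iterate j downwards so that each weight enters a subset at most once.
        for j in range(min(k, d + 1), 0, -1):
            dp[j] = dp[j] + dp[j - 1] * w[d]
    return torch.log(dp[k])
```

The same table of partial sums is the kind of quantity that can be reused to draw exact size-k samples from q_φ^CP, as noted above (Aires, 1999; Kulesza, 2012).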
4. Experimental Setup

Our setup is virtually identical to the morphosyntactic probing setup of Torroba Hennigen et al. (2020). This consists of first automatically mapping treebanks from UD v2.1 (Nivre et al., 2017) to the UniMorph (McCarthy et al., 2018) schema.[3] Then, we compute multilingual BERT (m-BERT) representations[4] for every sentence in the UD treebanks. After computing the m-BERT representations for the entire sentence, we extract representations for individual words in the sentence and pair them with the UniMorph morphosyntactic annotations. We estimate our probes' parameters using the UD training set and conduct greedy selection to approximate the objective in Eq. (1) on the validation set; finally, we report the results on the test set, i.e., we test whether the set of neurons we found on the development set generalizes to held-out data. Additionally, we discard values that occur fewer than 20 times across splits. When feeding h_C as input to our probes, we set any dimensions that are not present in C to zero. We select M = 5 as the number of Monte Carlo samples since we found this to work adequately in small-scale experiments. We compare the performance of the probes on 29 different language–attribute pairs (listed in App. B).

[3] We use the code available at: https://github.com/unimorph/ud-compatibility.
[4] We use the implementation by Wolf et al. (2020).

Since the performance of a probe on a specific subset of dimensions is related to both the subset itself (e.g., whether it is informative or not) and the number of dimensions being evaluated (e.g., if a probe is trained to expect 768 dimensions as input, it might work best when few or no dimensions are filled with zeros), we sample 100 subsets of dimensions with 5 different possible sizes (we considered 10, 50, 100, 250, 500 dim.) and compare every model's performance on each of those subset sizes. Further details about training and hyperparameter settings are provided in App. C.

4.1. Baselines

We compare our latent-variable probe against two other recently proposed intrinsic probing methods as baselines.

• Torroba Hennigen et al. (2020): Our first baseline is a generative probe that models the joint distribution of representations and their properties p(h, π) = p(h | π) p(π), where the representation distribution p(h | π) is assumed to be Gaussian. Torroba Hennigen et al. (2020) report that a major limitation of this probe is that if certain dimensions of the representations are not distributed according to a Gaussian distribution, then probe performance will suffer.

• Dalvi et al. (2019): Our second baseline is a linear classifier, where dimensions not under consideration are zeroed out during evaluation (Dalvi et al., 2019; Durrani et al., 2020).[5] Their approach is a special case of our proposed latent-variable model, where q_φ is fixed, so that on every training iteration the entire set of dimensions is sampled.

[5] We note that they do not conduct intrinsic probing via dimension selection: instead, they use the absolute magnitude of the weights as a proxy for dimension importance. In this paper, we adopt the approach of Torroba Hennigen et al. (2020) and use the performance-based objective in Eq. (1).

Additionally, we compare our methods to a naïve approach: a probe that is re-trained on every set of dimensions under consideration when selecting the dimension that maximizes the objective (herein UPPER BOUND).[6] Due to computational cost, we limit our comparisons with UPPER BOUND to 6 randomly chosen morphosyntactic attributes,[7] each in a different language.

[6] The UPPER BOUND yields the tightest estimate on the mutual information; however, as mentioned in §2, this is infeasible since it requires retraining for every different combination of neurons. For comparison, for English number on an Nvidia RTX 2070 GPU, our POISSON, GAUSSIAN and LINEAR experiments take a few minutes or even seconds to run, compared to UPPER BOUND, which takes multiple hours.
[7] English–Number, Portuguese–Gender and Noun Class, Polish–Tense, Russian–Voice, Arabic–Case, Finnish–Tense.

4.2. Metrics

We compare our proposed method to the baselines above under two metrics: accuracy and mutual information (MI). We report mutual information, which has recently been proposed as an evaluation metric for probes (Pimentel et al., 2020). Here, mutual information (MI) is a function between a Π-valued random variable P and an R^|C|-valued random variable H_C over masked representations:

    \mathrm{MI}(P;\, \mathbf{H}_C) = \mathrm{H}(P) - \mathrm{H}(P \mid \mathbf{H}_C)    (10)

where H(P) is the inherent entropy of the property being probed and is constant with respect to H_C; H(P | H_C) is the entropy over the property given the representations H_C. Exact computation of the mutual information is intractable; however, we can lower-bound the MI by approximating H(P | H_C) using our probe's average negative log-likelihood, -\frac{1}{N}\sum_{n=1}^{N} \log p_\theta(\pi^{(n)} \mid C, \mathbf{h}^{(n)}), on held-out data. See Brown et al. (1992) for a derivation.

We normalize the mutual information (NMI) by dividing the MI by the entropy H(P), which turns it into a percentage and is, arguably, more interpretable. We refer the reader to Gates et al. (2019) for a discussion of the normalization of MI.

We also report accuracy, which is a standard measure for evaluating probes, as it is for evaluating classifiers in general. However, accuracy can be a misleading measure, especially on imbalanced datasets, since it considers solely correct predictions.
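As a concrete illustration of how these metrics are obtained, the sketch below turns a probe's held-out log-likelihoods into the MI lower bound of Eq. (10) and its normalized variant. It is our sketch with hypothetical inputs, not the authors' evaluation code: `gold_log_probs[n]` is assumed to be the probe's natural-log probability of the gold label for held-out example n, computed from the masked representation h_C.

```python
import math
from collections import Counter

def nmi_lower_bound(gold_log_probs, labels):
    # Plug-in estimate of the bound on MI(P; H_C) from Eq. (10), divided by
    # H(P) to give NMI. Sketch only; see Brown et al. (1992) for the bound.
    n = len(labels)
    # H(P): entropy of the empirical distribution over property values.
    h_p = -sum((c / n) * math.log(c / n) for c in Counter(labels).values())
    # H(P | H_C) is approximated by the probe's average negative log-likelihood.
    h_p_given_hc = -sum(gold_log_probs) / n
    return (h_p - h_p_given_hc) / h_p
```

Because the conditional entropy is only approximated from above, the resulting estimate can be negative when the probe fits the held-out data poorly, which is exactly the behavior discussed for the GAUSSIAN baseline in §5.1.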
4.3. What Makes a Good Probe?

Since we report a lower bound on the mutual information (§4.2), we deem the best probe to be the one that yields the tightest mutual information estimate, or, in other words, the one that achieves the highest mutual information estimate; this is equivalent to having the best cross-entropy on held-out data, which is the standard evaluation metric for language modeling.

However, in the context of intrinsic probing, the topic of primary interest is what the probe reveals about the structure of the representations. For instance, does the probe reveal that the information encoded in the embeddings is focalized or dispersed across many neurons? Several prior works (e.g., Lakretz et al., 2019) focus on the single-neuron setting, which is a special, very focal case. To engage with this prior work, we compare probes not only with respect to their performance (MI and accuracy), but also with respect to the size of the subset of dimensions being evaluated, i.e., the size of set C.

We acknowledge that there is a disparity between the quantitative evaluation we employ, in which probes are compared based on their MI estimates, and the qualitative nature of intrinsic probing, which aims to identify the substructures of a model that encode a property of interest. However, it is non-trivial to evaluate fundamentally qualitative procedures in a large-scale, systematic, and unbiased manner. Therefore, we rely on the quantitative evaluation metrics presented in §4.2, while also qualitatively inspecting the implications of our probes.

5. Results

In this section, we present the results of our empirical investigation. First, we address our main research question: does our latent-variable probe presented in §3 outperform previously proposed intrinsic probing methods (§5.1)? Second, we analyze the structure of the most informative m-BERT neurons for the different morphosyntactic attributes we probe for (§5.2). Finally, we investigate whether knowledge about morphosyntax encoded in neural representations is shared across languages (§5.3). In §5.4, we show that our latent-variable probe is flexible enough to support deep neural probes.

5.1. How Do Our Methods Perform?

The main question we ask is how the performance of our models compares to existing intrinsic probing approaches. To investigate this research question, we compare the performance of the POISSON and CONDITIONAL POISSON probes to LINEAR (Dalvi et al., 2019) and GAUSSIAN (Torroba Hennigen et al., 2020). Refer to §4.3 for a discussion of the limitations of our method.

In general, CONDITIONAL POISSON tends to outperform POISSON at lower dimensions; however, POISSON tends to catch up as more dimensions are added. Our results suggest that both variants of our latent-variable model from §3 are effective and generally outperform the LINEAR baseline, as shown in Tab. 1. The GAUSSIAN baseline tends to perform similarly to CONDITIONAL POISSON when we consider subsets of 10 dimensions, and it outperforms POISSON substantially. However, for subsets of size 50 or more, both CONDITIONAL POISSON and POISSON are preferable. We believe that the robust performance of GAUSSIAN in the low-dimensional regime can be attributed to its ability to model non-linear decision boundaries (Murphy, 2012, Chapter 4).

Number of dimensions           10     50     100    250    500
vs. GAUSSIAN   C. POISSON      0.50   0.58   0.70   0.99   1.00
               POISSON         0.21   0.49   0.66   0.98   1.00
vs. LINEAR     C. POISSON      0.99   1.00   1.00   1.00   0.98
               POISSON         0.95   0.99   1.00   1.00   0.97

Table 1: Proportion of experiments where CONDITIONAL POISSON (C. POISSON) and POISSON beat the benchmark models LINEAR and GAUSSIAN in terms of NMI. For each of the subset sizes, we sampled 100 different subsets of BERT dimensions at random.

Figure 2: Comparison of the POISSON, CONDITIONAL POISSON, LINEAR (Dalvi et al., 2019) and GAUSSIAN (Torroba Hennigen et al., 2020) probes. We use the greedy selection approach in Eq. (1) to select the most informative dimensions, and average across all language–attribute pairs we probe for. (Top panel: NMI; bottom panel: accuracy; x-axis: number of sampled dimensions.)

The trends above are corroborated by a comparison of the mean NMI (Tab. 2, top) achieved by each of these probes for different subset sizes. However, in terms of accuracy (see Tab. 3 in App. D), while both CONDITIONAL POISSON and POISSON generally outperform LINEAR, GAUSSIAN tends to achieve higher accuracy than our methods. Notwithstanding, GAUSSIAN's performance (in terms of NMI) is not stable and can yield low or even negative mutual information estimates across all subsets of dimensions. Adding a new dimension can never decrease the mutual information, so the observable decreases occur because the generative model deteriorates upon adding another dimension, which validates Torroba Hennigen et al.'s claim that some dimensions are not adequately modeled by the Gaussian assumption. While these results suggest that GAUSSIAN may be preferable if performing a comparison based on accuracy, the instability of GAUSSIAN when considering NMI suggests that this edge in terms of accuracy comes at a hefty cost in terms of calibration (Guo et al., 2017).[8]

[8] While accuracy only cares about whether predictions are correct, NMI penalizes miscalibrated predictions. This is the case because it is proportional to the negative log-likelihood (Guo et al., 2017).

Further, we compare the POISSON and CONDITIONAL POISSON probes to the UPPER BOUND baseline. This is expected to be the highest performing since it is re-trained for every subset under consideration, and indeed this assumption is confirmed by the results in Tab. 2 (bottom). The difference between our probes' performance and the UPPER BOUND baseline's performance can be seen as the cost of sharing parameters across all subsets of dimensions, and an effective intrinsic probe should minimize this.

We also conduct a direct comparison of LINEAR, GAUSSIAN, POISSON and CONDITIONAL POISSON when used to identify the most informative subsets of dimensions. The average MI reported by each model across all 29 morphosyntactic language–attribute pairs is presented in Fig. 2. On average, CONDITIONAL POISSON offers comparable performance to GAUSSIAN at low dimensionalities for both NMI and accuracy, though the latter tends to yield a slightly higher (and thus a tighter) bound on the mutual information. However, as more dimensions are taken into consideration, our models vastly outperform GAUSSIAN. POISSON and CONDITIONAL POISSON perform comparably at high dimensions, but CONDITIONAL POISSON performs slightly better for 1–20 dimensions. POISSON outperforms LINEAR at high dimensions, and CONDITIONAL POISSON outperforms LINEAR for all dimensions considered. These effects are less pronounced for accuracy, which we believe to be due to accuracy's insensitivity to a probe's confidence in its prediction.

Probe         10             50             100            250              500                768
C. POISSON    0.04 ± 0.03    0.18 ± 0.10    0.31 ± 0.14    0.54 ± 0.17      0.69 ± 0.15        0.71 ± 0.15
POISSON      −0.18 ± 0.28    0.03 ± 0.24    0.22 ± 0.21    0.53 ± 0.17      0.69 ± 0.16        0.71 ± 0.19
LINEAR       −0.28 ± 0.35   −0.18 ± 0.36   −0.06 ± 0.35    0.24 ± 0.33      0.59 ± 0.21        0.78 ± 0.14
GAUSSIAN     −0.15 ± 0.43   −1.20 ± 2.82   −3.97 ± 8.62   −61.70 ± 186.15  −413.80 ± 1175.31  −1067.08 ± 2420.08

C. POISSON    0.04 ± 0.03    0.21 ± 0.11    0.35 ± 0.16    0.58 ± 0.20      0.77 ± 0.19        0.74 ± 0.16
POISSON      −0.10 ± 0.10    0.11 ± 0.13    0.28 ± 0.17    0.57 ± 0.20      0.73 ± 0.20        0.76 ± 0.18
UPPER BOUND   0.10 ± 0.06    0.36 ± 0.16    0.52 ± 0.19    0.70 ± 0.20      0.79 ± 0.17        0.81 ± 0.13

Table 2: Mean and standard deviation of NMI for the POISSON, CONDITIONAL POISSON, LINEAR (Dalvi et al., 2019) and GAUSSIAN (Torroba Hennigen et al., 2020) probes for all language–attribute pairs (top) and mean NMI and standard deviation for the CONDITIONAL POISSON, POISSON and UPPER BOUND probes for 6 selected language–attribute pairs (bottom). For each subset size considered, we take our averages over 100 randomly sampled subsets of BERT dimensions.
5.2. Information Distribution

We compare the performance of the CONDITIONAL POISSON probe for each attribute across all available languages in order to better understand the relatively high NMI variance across results (see Tab. 2). In Fig. 3 we plot the average NMI for gender and observe that languages with two genders present (Arabic and Portuguese) achieve higher performance than languages with three genders (Russian and Polish), which is an intuitive result due to increased task complexity. Further, we see that the slopes for both Russian and Polish are flatter, especially at lower dimensions. This implies that the information for Russian and Polish is more dispersed and more dimensions are needed to capture the typological information.

Figure 3: Comparison of the average NMI for gender dimensions in BERT for each of the available languages (ara, pol, por, rus). We use the greedy selection approach in Eq. (1) to select the most informative dimensions, and average across all language–attribute pairs we probe for.

5.3. Cross-Lingual Overlap

We compare the most informative m-BERT dimensions recovered by our probe across languages and find that, in many cases, the same set of neurons may express the same morphosyntactic phenomena across languages. For example, we find that Russian, Polish, Portuguese, English and Arabic all have statistically significant overlap in the top-30 most informative neurons for number (Fig. 1). Similarly, we observe the presence of statistically significant overlap for gender (Fig. 5, left). This effect is particularly strong between Russian and Polish, where we additionally find statistically significant overlap between the top-30 neurons for case (Fig. 5, right). These results might indicate that BERT may be leveraging data from other languages to develop a cross-lingually entangled notion of morphosyntax (Torroba Hennigen et al., 2020), and that this effect may be particularly strong between typologically similar languages.[9]

[9] In concurrent work, Antverg & Belinkov (2021) find evidence supporting a similar phenomenon.

5.4. How Do Deeper Probes Perform?

Multiple papers have promoted the use of linear probes (Tenney et al., 2018; Liu et al., 2019), in part because they are ostensibly less likely to memorize patterns in the data (Zhang & Bowman, 2018; Hewitt & Liang, 2019), though this is subject to debate (Voita & Titov, 2020; Pimentel et al., 2020). Here we verify our claim from §3 that our probe can be applied to any kind of discriminative probe architecture, as our objective function can be optimized using gradient descent. We follow the setup of Hewitt & Liang (2019), and test MLP-1 and MLP-2 CONDITIONAL POISSON probes alongside a linear CONDITIONAL POISSON probe. The MLP-1 and MLP-2 probes are multilayer perceptrons (MLPs) with one and two hidden layers, respectively, and a Rectified Linear Unit (ReLU; Nair & Hinton, 2010) activation function.
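For reference, such probe bodies are standard feed-forward classifiers; the sketch below builds an MLP-1 or MLP-2 probe of the kind just described, with a linear probe corresponding to zero hidden layers. The hidden width is a hypothetical placeholder, as the paper does not state it here, and this is our illustration rather than the authors' code.

```python
import torch

def mlp_probe(input_dim, num_labels, hidden_layers=1, hidden_dim=100):
    # Sketch of an MLP-k probe body: k hidden layers with ReLU activations,
    # followed by a log-softmax over the property values. hidden_dim=100 is
    # an assumed placeholder, not a value reported in the paper.
    layers, dim = [], input_dim
    for _ in range(hidden_layers):
        layers += [torch.nn.Linear(dim, hidden_dim), torch.nn.ReLU()]
        dim = hidden_dim
    layers += [torch.nn.Linear(dim, num_labels), torch.nn.LogSoftmax(dim=-1)]
    return torch.nn.Sequential(*layers)
```

Any of these bodies can serve as `probe` in the training-step sketch of §3.1, since the latent-variable objective only requires gradient-based optimization.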
In Fig. 4, we can see that our method not only works well for deeper probes, but also outperforms the linear probe in terms of NMI. However, at higher dimensionalities, the advantage of a deeper probe diminishes. We also find that the difference in performance between MLP-1 and MLP-2 is negligible.

Figure 4: Comparison of a linear CONDITIONAL POISSON probe to non-linear MLP-1 and MLP-2 CONDITIONAL POISSON probes for selected language–attribute pairs (Number (eng), Gender and Noun Class (por), Tense (pol), Voice (rus), Case (ara), Case (fin)). For each of the subset sizes shown on the x-axis, we sampled 100 different subsets of BERT dimensions at random.

6. Related Work

A growing interest in interpretability has led to a flurry of work in trying to assess exactly what pre-trained representations know about language. To this end, diverse methods have been employed, such as the construction of specific challenge sets that seek to evaluate how well representations model particular phenomena (Linzen et al., 2016; Gulordava et al., 2018; Goldberg, 2019; Goodwin et al., 2020), and visualization methods (Kádár et al., 2017; Rethmeier et al., 2020). Work on probing comprises a major share of this endeavor (Belinkov & Glass, 2019; Belinkov, 2021). This has ranged from focused studies on particular linguistic phenomena (e.g., subject–verb number agreement, Giulianelli et al., 2018) to broad assessments of contextual representations in a wide array of tasks (Şahin et al., 2020; Tenney et al., 2018; Conneau et al., 2018; Liu et al., 2019; Ravichander et al., 2021, inter alia).

Efforts have ranged widely, but most of these focus on extrinsic rather than intrinsic probing. Most work on the latter has focused primarily on ascribing roles to individual neurons through methods such as visualization (Karpathy et al., 2015; Li et al., 2016a) and ablation (Li et al., 2016b). For example, recently Lakretz et al. (2019) conduct an in-depth study of how long short-term memory networks (LSTMs; Hochreiter & Schmidhuber, 1997) capture subject–verb number agreement, and identify two units largely responsible for this phenomenon.

More recently, there has been a growing interest in extending intrinsic probing to collections of neurons. Bau et al. (2019) utilize unsupervised methods to identify important neurons, and then attempt to control a neural network's outputs by selectively modifying them. Bau et al. (2020) pursue a similar goal in a computer vision setting, but ascribe meaning to neurons based on how their activations correlate with particular classifications in images, and are able to control these manually with interpretable results. Aiming to answer questions on interpretability in computer vision and natural language inference, Mu & Andreas (2020) develop a method to create compositional explanations of individual neurons and investigate abstractions encoded in them. Vig et al. (2020) analyze how information related to gender and societal biases is encoded in individual neurons and how it is being propagated through different model components.
7. Conclusion

In this paper, we introduce a new method for training discriminative intrinsic probes that can perform well across any subset of dimensions. To do so, we train a probing classifier with a subset-valued latent variable and demonstrate how the latent subsets can be marginalized using variational inference. We propose two variational families, based on common sampling designs, to model the posterior over subsets: Poisson sampling and conditional Poisson sampling. We demonstrate that both variants outperform our baselines in terms of mutual information, and that using a conditional Poisson variational family generally gives optimal performance. Further, we investigate how information is distributed for each attribute across all available languages. Finally, we find empirical evidence for overlap in the specific neurons used to encode morphosyntactic properties across languages.

References

Linguistics. doi: 10.18653/v1/P18-1198. URL https: //aclanthology.org/P18-1198. Ács, J., Kádár, Á., and Kornai, A. Subword pooling makes a difference. In Proceedings of the 16th Conference of Dalvi, F., Durrani, N., Sajjad, H., Belinkov, Y., Bau, A., the European Chapter of the Association for Computa- and Glass, J. What is one grain of sand in the desert? tional Linguistics: Main Volume, pp. 2284–2295, Online, Analyzing individual neurons in deep NLP models. Pro- April 2021. Association for Computational Linguistics. ceedings of the AAAI Conference on Artificial Intelli- URL https://www.aclweb.org/anthology/2021. gence, 33:6309–6317, July 2019. ISSN 2374-3468, 2159- eacl-main.194. 5399. doi: 10.1609/aaai.v33i01.33016309. URL https: //doi.org/10.1609/aaai.v33i01.33016309. Aires, N. Algorithms to find exact inclusion probabilities for conditional Poisson sampling and Pareto πs sampling Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: designs. Methodology And Computing In Applied Pre-training of deep bidirectional transformers for lan- Probability, 1(4):457–469, Dec 1999. ISSN 1573- guage understanding. In Proceedings of the 2019 Confer- 7713. doi: 10.1023/A:1010091628740. URL https: ence of the North American Chapter of the Association for //EconPapers.repec.org/RePEc:spr:metcap:v: Computational Linguistics: Human Language Technolo- 1:y:1999:i:4:d:10.1023_a:1010091628740. gies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Antverg, O. and Belinkov, Y. On the pitfalls of analyzing in- Computational Linguistics. doi: 10.18653/v1/N19-1423. dividual neurons in language models. arXiv:2110.07483 URL https://aclanthology.org/N19-1423. [cs.CL], 2021. Durrani, N., Sajjad, H., Dalvi, F., and Belinkov, Y.
Analyz- Bau, A., Belinkov, Y., Sajjad, H., Durrani, N., Dalvi, F., and ing individual neurons in pre-trained language models. In Glass, J. Identifying and controlling important neurons in Proceedings of the 2020 Conference on Empirical Meth- neural machine translation. In International Conference ods in Natural Language Processing (EMNLP), pp. 4865– on Learning Representations, 2019. URL https:// 4880, Online, November 2020. Association for Computa- arxiv.org/abs/1811.01157. tional Linguistics. doi: 10.18653/v1/2020.emnlp-main. Bau, D., Zhu, J.-Y., Strobelt, H., Lapedriza, A., Zhou, B., 395. URL https://www.aclweb.org/anthology/ and Torralba, A. Understanding the role of individual 2020.emnlp-main.395. units in a deep neural network. Proceedings of the Na- Eisner, J. Parameter estimation for probabilistic finite-state tional Academy of Sciences, September 2020. ISSN 0027- transducers. In Proceedings of the 40th Annual Meet- 8424, 1091-6490. doi: 10.1073/pnas.1907375117. URL ing of the Association for Computational Linguistics, pp. https://www.pnas.org/content/117/48/30071. 1–8, Philadelphia, Pennsylvania, USA, July 2002. As- Belinkov, Y. Probing classifiers: Promises, shortcomings, sociation for Computational Linguistics. doi: 10.3115/ and alternatives. arXiv preprint arXiv:2102.12452, 2021. 1073083.1073085. URL https://www.aclweb.org/ URL https://arxiv.org/abs/2102.12452. anthology/P02-1001. Belinkov, Y. and Glass, J. Analysis methods in neu- Gates, A. J., Wood, I. B., Hetrick, W. P., and Ahn, ral language processing: A survey. Transactions of Y.-Y. Element-centric clustering comparison uni- the Association for Computational Linguistics, 7:49– fies overlaps and hierarchy. Scientific Reports, 9(1): 72, March 2019. doi: 10.1162/tacl_a_00254. URL 8574, June 2019. ISSN 2045-2322. doi: 10.1038/ https://doi.org/10.1162/tacl_a_00254. s41598-019-44892-y. URL https://doi.org/10. 1038/s41598-019-44892-y. Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., Lai, J. C., and Mercer, R. L. An estimate of an upper bound Giulianelli, M., Harding, J., Mohnert, F., Hupkes, D., and for the entropy of English. Computational Linguistics, Zuidema, W. Under the hood: Using diagnostic clas- 18(1):31–40, 1992. URL https://www.aclweb.org/ sifiers to investigate and improve how language mod- anthology/J92-1002.pdf. els track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing Conneau, A., Kruszewski, G., Lample, G., Barrault, L., and Interpreting Neural Networks for NLP, pp. 240– and Baroni, M. What you can cram into a single 248, Brussels, Belgium, 2018. Association for Compu- $&!#* vector: Probing sentence embeddings for lin- tational Linguistics. doi: 10.18653/v1/W18-5426. URL guistic properties. In Proceedings of the 56th Annual https://aclanthology.org/W18-5426. Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers), pp. 2126–2136, Mel- Goldberg, Y. Assessing BERT’s syntactic abilities. bourne, Australia, 2018. Association for Computational arXiv:1901.05287 [cs], January 2019. A Latent-Variable Model for Intrinsic Probing Goodwin, E., Sinha, K., and O’Donnell, T. J. Probing Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, linguistic systematicity. In Proceedings of the 58th An- J. Stochastic variational inference. Journal of Ma- nual Meeting of the Association for Computational Lin- chine Learning Research, 14(4):1303–1347, 2013. ISSN guistics, pp. 1958–1969, Online, July 2020. Association 1533-7928. 
URL https://www.jmlr.org/papers/ for Computational Linguistics. doi: 10.18653/v1/2020. volume14/hoffman13a/hoffman13a.pdf. acl-main.177. URL https://aclanthology.org/ 2020.acl-main.177. Holm, S. A simple sequentially rejective multiple test proce- dure. Scandinavian Journal of Statistics, 6(2):65–70, Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., and 1979. ISSN 0303-6898. URL http://www.jstor. Baroni, M. Colorless green recurrent networks dream org/stable/4615733. hierarchically. In Proceedings of the 2018 Conference Kádár, Á., Chrupała, G., and Alishahi, A. Representation of of the North American Chapter of the Association for linguistic form and function in recurrent neural networks. Computational Linguistics: Human Language Technolo- Computational Linguistics, 43(4):761–780, December gies, Volume 1 (Long Papers), pp. 1195–1205, New Or- 2017. doi: 10.1162/COLI_a_00300. URL https:// leans, Louisiana, June 2018. Association for Computa- aclanthology.org/J17-4003. tional Linguistics. doi: 10.18653/v1/N18-1108. URL https://aclanthology.org/N18-1108. Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizing and understanding recurrent networks. In 4th International Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On Conference on Learning Representations, ICLR 2016, calibration of modern neural networks. In Proceedings of San Juan, Puerto Rico, May 2-4, 2016, Workshop Pro- the 34th International Conference on Machine Learning ceedings, November 2015. URL https://arxiv.org/ - Volume 70, ICML’17, pp. 1321–1330. JMLR.org, Aug pdf/1506.02078. 2017. URL http://proceedings.mlr.press/v70/ guo17a/guo17a.pdf. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learn- Hájek, J. Asymptotic theory of rejective sampling with ing Representations, San Diego, CA, May 2015. URL varying probabilities from a finite population. The Annals https://arxiv.org/abs/1412.6980. of Mathematical Statistics, 35(4):1491–1523, 1964. ISSN 0003-4851. URL https://www.jstor.org/stable/ Kulesza, A. Determinantal point processes for machine 2238287. learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012. ISSN 1935-8245. doi: 10.1561/ Hall Maudslay, R., Valvoda, J., Pimentel, T., Williams, 2200000044. URL http://dx.doi.org/10.1561/ A., and Cotterell, R. A tale of a probe and a parser. 2200000044. In Proceedings of the 58th Annual Meeting of the Lakretz, Y., Kruszewski, G., Desbordes, T., Hupkes, D., Association for Computational Linguistics, pp. 7389– Dehaene, S., and Baroni, M. The emergence of number 7395, Online, July 2020. Association for Computa- and syntax units in LSTM language models. In Pro- tional Linguistics. doi: 10.18653/v1/2020.acl-main.659. ceedings of the 2019 Conference of the North American URL https://www.aclweb.org/anthology/2020. Chapter of the Association for Computational Linguis- acl-main.659. tics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 11–20, Minneapolis, Minnesota, June Hewitt, J. and Liang, P. Designing and interpreting probes 2019. Association for Computational Linguistics. doi: with control tasks. In Proceedings of the 2019 Confer- 10.18653/v1/N19-1002. URL http://aclweb.org/ ence on Empirical Methods in Natural Language Pro- anthology/N19-1002. cessing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. Li, J., Chen, X., Hovy, E., and Jurafsky, D. Visualiz- 2733–2743, Hong Kong, China, November 2019. As- ing and understanding neural models in NLP. 
In Pro- sociation for Computational Linguistics. doi: 10.18653/ ceedings of the 2016 Conference of the North Ameri- v1/D19-1275. URL https://nlp.stanford.edu/ can Chapter of the Association for Computational Lin- pubs/hewitt2019control.pdf. guistics: Human Language Technologies, pp. 681–691, San Diego, California, 2016a. Association for Compu- Hochreiter, S. and Schmidhuber, J. Long Short-Term Mem- tational Linguistics. doi: 10.18653/v1/N16-1082. URL ory. Neural Computation, 9(8):1735–1780, November https://aclanthology.org/N16-1082. 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8. Li, J., Monroe, W., and Jurafsky, D. Understanding neu- 1735. ral networks through representation erasure. CoRR, A Latent-Variable Model for Intrinsic Probing abs/1612.08220, 2016b. URL http://arxiv.org/ Nair, V. and Hinton, G. E. Rectified linear units improve re- abs/1612.08220. stricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference Li, Z. and Eisner, J. First- and second-order expecta- on Machine Learning, pp. 807–814, Madison, WI, USA, tion semirings with applications to minimum-risk train- June 2010. ISBN 978-1-60558-907-7. URL https: ing on translation forests. In Proceedings of the 2009 //dl.acm.org/doi/10.5555/3104322.3104425. Conference on Empirical Methods in Natural Language Processing, pp. 40–51, Singapore, August 2009. Asso- ciation for Computational Linguistics. URL https: Nivre, J., Agić, Ž., Ahrenberg, L., Antonsen, L., Aranzabe, //www.aclweb.org/anthology/D09-1005. M. J., Asahara, M., Ateyah, L., Attia, M., Atutxa, A., Augustinus, L., Badmaeva, E., Ballesteros, M., Banerjee, Linzen, T., Dupoux, E., and Goldberg, Y. Assessing the E., Bank, S., Barbu Mititelu, V., Bauer, J., Bengoetxea, ability of LSTMs to learn syntax-sensitive dependencies. K., Bhat, R. A., Bick, E., Bobicev, V., Börstell, C., Bosco, Transactions of the Association for Computational Lin- C., Bouma, G., Bowman, S., Burchardt, A., Candito, M., guistics, 4:521–535, 2016. doi: 10.1162/tacl_a_00115. Caron, G., Cebiroğlu Eryiğit, G., Celano, G. G. A., Cetin, URL https://aclanthology.org/Q16-1037. S., Chalub, F., Choi, J., Cinková, S., Çöltekin, Ç., Con- nor, M., Davidson, E., de Marneffe, M.-C., de Paiva, V., Liu, N. F., Gardner, M., Belinkov, Y., Peters, M. E., Diaz de Ilarraza, A., Dirix, P., Dobrovoljc, K., Dozat, and Smith, N. A. Linguistic knowledge and trans- T., Droganova, K., Dwivedi, P., Eli, M., Elkahky, A., ferability of contextual representations. In Proceed- Erjavec, T., Farkas, R., Fernandez Alcalde, H., Foster, ings of the 2019 Conference of the North American J., Freitas, C., Gajdošová, K., Galbraith, D., Garcia, M., Chapter of the Association for Computational Linguis- Gärdenfors, M., Gerdes, K., Ginter, F., Goenaga, I., Go- tics: Human Language Technologies, Volume 1 (Long jenola, K., Gökırmak, M., Goldberg, Y., Gómez Guino- and Short Papers), pp. 1073–1094, Minneapolis, Min- vart, X., Gonzáles Saavedra, B., Grioni, M., Grūzı̄tis, nesota, June 2019. Association for Computational Lin- N., Guillaume, B., Habash, N., Hajič, J., Hajič jr., J., guistics. doi: 10.18653/v1/N19-1112. URL https: Hà Mỹ, L., Harris, K., Haug, D., Hladká, B., Hlaváčová, //aclanthology.org/N19-1112. J., Hociung, F., Hohle, P., Ion, R., Irimia, E., Jelínek, T., Johannsen, A., Jørgensen, F., Kaşıkara, H., Kanayama, Lohr, S. L. Sampling: Design and Analysis. CRC Press, 2 H., Kanerva, J., Kayadelen, T., Kettnerová, V., Kirch- edition, 2019. 
URL https://www.routledge.com/ ner, J., Kotsyba, N., Krek, S., Laippala, V., Lambertino, Sampling-Design-and-Analysis/Lohr/p/book/ L., Lando, T., Lee, J., Lê Hồng, P., Lenci, A., Lertpra- 9780367273415. dit, S., Leung, H., Li, C. Y., Li, J., Li, K., Ljubešić, N., Loginova, O., Lyashevskaya, O., Lynn, T., Macketanz, McCarthy, A. D., Silfverberg, M., Cotterell, R., Hulden, V., Makazhanov, A., Mandl, M., Manning, C., Mărăn- M., and Yarowsky, D. Marrying universal dependencies duc, C., Mareček, D., Marheinecke, K., Martínez Alonso, and universal morphology. In Proceedings of the Sec- H., Martins, A., Mašek, J., Matsumoto, Y., McDonald, ond Workshop on Universal Dependencies (UDW 2018), R., Mendonça, G., Miekka, N., Missilä, A., Mititelu, C., pp. 91–101, Brussels, Belgium, November 2018. Asso- Miyao, Y., Montemagni, S., More, A., Moreno Romero, ciation for Computational Linguistics. doi: 10.18653/ L., Mori, S., Moskalevskyi, B., Muischnek, K., Müürisep, v1/W18-6011. URL https://aclanthology.org/ K., Nainwani, P., Nedoluzhko, A., Nešpore-Bērzkalne, W18-6011. G., Nguyễn Thi., L., Nguyễn Thi. Minh, H., Nikolaev, Mu, J. and Andreas, J. Compositional explanations of V., Nurmi, H., Ojala, S., Osenova, P., Östling, R., Øvre- neurons. In Larochelle, H., Ranzato, M., Hadsell, R., lid, L., Pascual, E., Passarotti, M., Perez, C.-A., Per- Balcan, M. F., and Lin, H. (eds.), Advances in Neural rier, G., Petrov, S., Piitulainen, J., Pitler, E., Plank, B., Information Processing Systems, volume 33, pp. 17153– Popel, M., Pretkalnin, a, L., Prokopidis, P., Puolakainen, 17163. Curran Associates, Inc., 2020. URL https: T., Pyysalo, S., Rademaker, A., Ramasamy, L., Rama, //proceedings.neurips.cc/paper/2020/file/ T., Ravishankar, V., Real, L., Reddy, S., Rehm, G., Ri- c74956ffb38ba48ed6ce977af6727275-Paper. naldi, L., Rituma, L., Romanenko, M., Rosa, R., Rovati, pdf. D., Sagot, B., Saleh, S., Samardžić, T., Sanguinetti, M., Saulı̄te, B., Schuster, S., Seddah, D., Seeker, W., Ser- Murphy, K. P. Machine Learning: A Probabilistic Per- aji, M., Shen, M., Shimada, A., Sichinava, D., Silveira, spective. Adaptive Computation and Machine Learning N., Simi, M., Simionescu, R., Simkó, K., Šimková, M., Series. MIT Press, Cambridge, MA, 2012. ISBN 978- Simov, K., Smith, A., Stella, A., Straka, M., Strnadová, 0-262-01802-9. URL https://mitpress.mit.edu/ J., Suhr, A., Sulubacak, U., Szántó, Z., Taji, D., Tanaka, books/machine-learning-1. T., Trosterud, T., Trukhina, A., Tsarfaty, R., Tyers, F., A Latent-Variable Model for Intrinsic Probing Uematsu, S., Urešová, Z., Uria, L., Uszkoreit, H., Vajjala, Ravichander, A., Belinkov, Y., and Hovy, E. Probing the S., van Niekerk, D., van Noord, G., Varga, V., Villemonte probing paradigm: Does probing accuracy entail task de la Clergerie, E., Vincze, V., Wallin, L., Washington, relevance? In Proceedings of the 16th Conference of J. N., Wirén, M., Wong, T.-s., Yu, Z., Žabokrtský, Z., the European Chapter of the Association for Computa- Zeldes, A., Zeman, D., and Zhu, H. Universal depen- tional Linguistics: Main Volume, pp. 3363–3377, Online, dencies 2.1, 2017. URL http://hdl.handle.net/ April 2021. Association for Computational Linguistics. 11234/1-2515. LINDAT/CLARIAH-CZ digital library URL https://www.aclweb.org/anthology/2021. at the Institute of Formal and Applied Linguistics (ÚFAL), eacl-main.295. Faculty of Mathematics and Physics, Charles University. Rethmeier, N., Saxena, V. K., and Augenstein, I. 
TX-Ray: Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Quantifying and explaining model-knowledge transfer in Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, (un-)supervised NLP. In Adams, R. P. and Gogate, V. L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Rai- (eds.), Proceedings of the Thirty-Sixth Conference on Un- son, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, certainty in Artificial Intelligence, pp. 197. AUAI Press, L., Bai, J., and Chintala, S. PyTorch: An imperative style, 2020. URL http://dblp.uni-trier.de/db/conf/ high-performance deep learning library. In Advances uai/uai2020.html#RethmeierSA20. in Neural Information Processing Systems 32, pp. Rogers, A., Kovaleva, O., and Rumshisky, A. A primer 8024–8035. Curran Associates, Inc., 2019. URL https: in BERTology: What we know about how BERT works. //proceedings.neurips.cc/paper/2019/file/ Transactions of the Association for Computational Lin- bdbca288fee7f92f2bfa9f7012727740-Paper. guistics, 8:842–866, 2020. doi: 10.1162/tacl_a_00349. pdf. URL https://www.aclweb.org/anthology/2020. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, tacl-1.54. C., Lee, K., and Zettlemoyer, L. Deep contextualized Rush, A. Torch-Struct: Deep structured prediction li- word representations. In Proceedings of the 2018 Confer- brary. In Proceedings of the 58th Annual Meeting of ence of the North American Chapter of the Association the Association for Computational Linguistics: System for Computational Linguistics: Human Language Tech- Demonstrations, pp. 335–342, Online, July 2020. As- nologies, Volume 1 (Long Papers), pp. 2227–2237, New sociation for Computational Linguistics. URL https: Orleans, Louisiana, June 2018. Association for Compu- //aclanthology.org/2020.acl-demos.38. tational Linguistics. doi: 10.18653/v1/N18-1202. URL https://aclanthology.org/N18-1202. Şahin, G. G., Vania, C., Kuznetsov, I., and Gurevych, I. LINSPECTOR: Multilingual probing tasks for word rep- Pimentel, T., Valvoda, J., Hall Maudslay, R., Zmigrod, R., resentations. Computational Linguistics, 46(2):335–385, Williams, A., and Cotterell, R. Information-theoretic 2020. doi: 10.1162/coli\\_a\\_00376. URL https: probing for linguistic structure. In Proceedings of the //doi.org/10.1162/coli_a_00376. 58th Annual Meeting of the Association for Computa- tional Linguistics, pp. 4609–4622, Online, July 2020. As- Tang, G., Sennrich, R., and Nivre, J. Understanding pure sociation for Computational Linguistics. doi: 10.18653/ character-based neural machine translation: The case v1/2020.acl-main.420. URL https://aclanthology. of translating Finnish into English. In Proceedings of org/2020.acl-main.420. the 28th International Conference on Computational Lin- guistics, pp. 4251–4262, Barcelona, Spain (Online), De- Poliak, A., Haldar, A., Rudinger, R., Hu, J. E., Pavlick, E., cember 2020. International Committee on Computational White, A. S., and Van Durme, B. Collecting diverse natu- Linguistics. doi: 10.18653/v1/2020.coling-main.375. ral language inference problems for sentence representa- URL https://www.aclweb.org/anthology/2020. tion evaluation. In Proceedings of the 2018 Conference on coling-main.375. Empirical Methods in Natural Language Processing, pp. 67–81, Brussels, Belgium, October 2018. Association for Tenney, I., Xia, P., Chen, B., Wang, A., Poliak, A., Mc- Computational Linguistics. doi: 10.18653/v1/D18-1007. Coy, R. T., Kim, N., Durme, B. V., Bowman, S. R., URL https://aclanthology.org/D18-1007. Das, D., and Pavlick, E. 
What do you learn from con- text? Probing for sentence structure in contextualized Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., word representations. In International Conference on Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring Learning Representations, September 2018. URL https: the limits of transfer learning with a unified text-to-text //openreview.net/forum?id=SJzSgnRcKX. transformer. Journal of Machine Learning Research, 21 (140):1–67, 2020. URL http://jmlr.org/papers/ Torroba Hennigen, L., Williams, A., and Cotterell, R. In- v21/20-074.html. trinsic probing through dimension selection. In Proceed- A Latent-Variable Model for Intrinsic Probing ings of the 2020 Conference on Empirical Methods in Royal Statistical Society. Series B (Statistical Method- Natural Language Processing (EMNLP), pp. 197–216, ology), 67(2):301–320, 2005. ISSN 1369-7412. Online, November 2020. Association for Computational URL https://www.jstor.org/stable/3647580? Linguistics. doi: 10.18653/v1/2020.emnlp-main.15. seq=1#metadata_info_tab_contents. URL https://www.aclweb.org/anthology/2020. emnlp-main.15. Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. Investigating gender bias in language models using causal mediation anal- ysis. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 12388– 12401. Curran Associates, Inc., 2020. URL https: //proceedings.neurips.cc/paper/2020/file/ 92650b2e92217715fe312e6fa7b90d82-Paper. pdf. Voita, E. and Titov, I. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 183–196, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.14. URL https://www. aclweb.org/anthology/2020.emnlp-main.14. Vulić, I., Ponti, E. M., Litschko, R., Glavaš, G., and Ko- rhonen, A. Probing pretrained language models for lex- ical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7222–7240, Online, November 2020. As- sociation for Computational Linguistics. doi: 10.18653/ v1/2020.emnlp-main.586. URL https://www.aclweb. org/anthology/2020.emnlp-main.586. Williams, R. J. Simple statistical gradient-following al- gorithms for connectionist reinforcement learning. Ma- chine Learning, 8:229–256, 1992. URL https://link. springer.com/article/10.1007/BF00992696. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. HuggingFace’s Transformers: State-of-the- art natural language processing. arXiv:1910.03771 [cs], February 2020. Zhang, K. and Bowman, S. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and In- terpreting Neural Networks for NLP, pp. 359–361, Brus- sels, Belgium, November 2018. Association for Compu- tational Linguistics. doi: 10.18653/v1/W18-5448. URL https://www.aclweb.org/anthology/W18-5448. Zou, H. and Hastie, T. Regularization and vari- able selection via the Elastic Net. Journal of the A Latent-Variable Model for Intrinsic Probing A. 
Variational Lower Bound

The derivation of the variational lower bound is shown below:

    \sum_{n=1}^{N} \log \sum_{C \subseteq \mathcal{D}} p_\theta\big(\pi^{(n)}, C \mid \mathbf{h}^{(n)}\big)    (11)
      = \sum_{n=1}^{N} \log \sum_{C \subseteq \mathcal{D}} q_\phi(C)\, \frac{p_\theta\big(\pi^{(n)}, C \mid \mathbf{h}^{(n)}\big)}{q_\phi(C)}
      = \sum_{n=1}^{N} \log \mathbb{E}_{q_\phi}\!\left[\frac{p_\theta\big(\pi^{(n)}, C \mid \mathbf{h}^{(n)}\big)}{q_\phi(C)}\right]
      \geq \sum_{n=1}^{N} \mathbb{E}_{q_\phi}\!\left[\log \frac{p_\theta\big(\pi^{(n)}, C \mid \mathbf{h}^{(n)}\big)}{q_\phi(C)}\right]    (12)
      = \sum_{n=1}^{N} \mathbb{E}_{q_\phi}\!\Big[\log p_\theta\big(\pi^{(n)}, C \mid \mathbf{h}^{(n)}\big)\Big] + \mathrm{H}(q_\phi)

where the inequality follows from Jensen's inequality.

B. List of Probed Morphosyntactic Attributes

The 29 language–attribute pairs we probe for in this work are listed below:

• Arabic: Aspect, Case, Definiteness, Gender, Mood, Number, Voice
• English: Number, Tense
• Finnish: Case, Number, Person, Tense, Voice
• Polish: Animacy, Case, Gender, Number, Tense
• Portuguese: Gender, Number, Tense
• Russian: Animacy, Aspect, Case, Gender, Number, Tense, Voice

C. Training and Hyperparameter Tuning

We train our probes for a maximum of 2000 epochs using the Adam optimizer (Kingma & Ba, 2015). We add early stopping with a patience of 50 as a regularization technique. Early stopping is conducted by holding out 10% of the training data; our development set is reserved for the greedy selection of subsets of neurons. Our implementation is built with PyTorch (Paszke et al., 2019). To execute a fair comparison with Dalvi et al. (2019), we train all probes other than the Gaussian probe using ElasticNet regularization (Zou & Hastie, 2005), which consists of combining both L1 and L2 regularization, where the regularizers are weighted by tunable regularization coefficients λ1 and λ2, respectively. We follow the experimental set-up proposed by Dalvi et al. (2019), where we set λ1, λ2 = 10^−5 for all probes. In a preliminary experiment, we performed a grid search over these hyperparameters to confirm that the probe is not very sensitive to the tuning of these values (unless they are extreme), which aligns with the claim presented in Dalvi et al. (2019). For GAUSSIAN, we take the MAP estimate, with a weak data-dependent prior (Murphy, 2012, Chapter 4). In addition, we found that a slight improvement in the performance of POISSON and CONDITIONAL POISSON was obtained by scaling the entropy term in Eq. (3) by a factor of 0.01.

D. Supplementary Results

Tab. 3 compares the accuracy of our two models, POISSON and CONDITIONAL POISSON, to the LINEAR, GAUSSIAN and UPPER BOUND baselines. The table reflects the trend observed in Tab. 2: POISSON and CONDITIONAL POISSON generally outperform the LINEAR baseline. However, GAUSSIAN achieves higher accuracy, with the exception of the high-dimension regime.

Probe         10             50             100            250            500            768
C. POISSON    0.66 ± 0.15    0.73 ± 0.13    0.78 ± 0.11    0.86 ± 0.08    0.92 ± 0.06    0.93 ± 0.05
POISSON       0.62 ± 0.15    0.70 ± 0.13    0.77 ± 0.12    0.86 ± 0.08    0.92 ± 0.06    0.94 ± 0.04
LINEAR        0.51 ± 0.15    0.59 ± 0.15    0.65 ± 0.14    0.77 ± 0.12    0.88 ± 0.08    0.95 ± 0.04
GAUSSIAN      0.69 ± 0.14    0.80 ± 0.11    0.84 ± 0.09    0.88 ± 0.08    0.88 ± 0.08    0.87 ± 0.10

C. POISSON    0.55 ± 0.10    0.65 ± 0.13    0.72 ± 0.12    0.83 ± 0.10    0.90 ± 0.08    0.93 ± 0.06
POISSON       0.51 ± 0.13    0.63 ± 0.14    0.72 ± 0.12    0.83 ± 0.10    0.90 ± 0.08    0.93 ± 0.07
UPPER BOUND   0.58 ± 0.12    0.75 ± 0.12    0.80 ± 0.10    0.89 ± 0.08    0.93 ± 0.06    0.94 ± 0.05

Table 3: Mean and standard deviation of accuracy for the POISSON, CONDITIONAL POISSON, LINEAR (Dalvi et al., 2019) and GAUSSIAN (Torroba Hennigen et al., 2020) probes for all language–attribute pairs (above) and for the CONDITIONAL POISSON, POISSON and UPPER BOUND for 6 selected language–attribute pairs (below) for each of the subset sizes.
We sampled 100 different subsets of BERT dimensions at random.

Figure 5: The percentage overlap between the top-30 most informative gender (left; ara, por, pol, rus) and case (right; ara, por, pol, rus, fin) dimensions in BERT for the probed languages. Statistically significant overlap, after Holm–Bonferroni family-wise error correction (Holm, 1979), with α = 0.05, is marked with an orange square.