WORKING PAPER 1 Autocorrelation in Sparse Distributed Memory Marcelo S. Brogliato, Member, IEEE, and Alexandre Linhares, Member, IEEE, Abstract—Sparse Distributed Memory (SDM) is a neuroscien- These are some of the questions touched by SDM. It is tific and psychologically plausible model of human memory. In a theory of long-term memory that constructs the process of this paper we present an anomaly between its previously theo- recall from a very microscopic level of neuron-firings. retized behavior and its actual behavior, and demonstrate that this anomaly is due to the autocorrelation involved in the model. Our findings have systemwide implications for those willing to implement and apply SDM in computational intelligence settings. Index Terms—Sparse Distributed Memory, Autocorrelation, Theoretical Neuroscience, Memory, Systemwide properties. I. I NTRODUCTION PARSE DISTRIBUTED MEMORY (SDM) [9] (see also S [3; 4; 12; 11; 14; 16; 17; 15; 13; 27; 28]) is a math- ematical model of long-term memory that has a number of (a) Q3 (b) Q7 neuroscientific and psychologically plausible dynamics. This model is used in all sort of applications due to its incredible ability to closely reflect the human capacity to remember past experiences from the subtlest of clues. Applications range from call admission control [19; 18], to behavior-based robotics [25; 22; 8], to noise filtering [23], etc. A. Desiderata for a theory of memory SDM reads like a desiderata of a theory for human long- term memory. To understand the breadth of topics that SDM encompasses, consider the following questions: 1) Why are most concepts orthogonal, unrelated to each other? 2) Why is there Miller’s magic number, i.e., we can’t hold too many things in mind at once? 3) Why do we at times instantly recall an experience; other times we can’t recall anything at all; and still other times we get into this strange tip-of-the-tongue situation... it is clearly ‘there’... but where? (and ‘what’ is ‘there’?1 .) 4) How does this recall process work? What is remember- ing? (c) Q10 5) Why do neurons die and we still remember most every- Fig. 1. Kanerva models the space of possible incoming neural signals as an thing? n-dimensional hypercube, or Qn . Here we have Qn , for n ∈ {3, 7, 10}. 6) What do neurons actually do? What is their primary Each node corresponds to a bitstring in {0, 1}n , and two nodes are linked function? iff the bitstrings differ by a single dimension. A number of observations can be made here. First, the number of nodes grows as O(2n ); which makes the Dr. Brogliato is with sdm.ai and with Behavioral and Decision Sci- space rapidly intractable. Another interesting observation is that most of the ences, EBAPE, Fundação Getulio Vargas, Rio de Janeiro, Brazil, e-mail: space lies ‘at the core’ of the hypercube, at a distance of around n/2 from
[email protected]. any given vantage point. Dr. Linhares is with sdm.ai and with Behavioral and Decision Sciences, EBAPE, Fundação Getulio Vargas, Rio de Janeiro, Brazil, e-mail: lin-
[email protected]. 1 Psychologists have documented interesting properties about this state: it B. Review of the model happens, for instance, around every 1-week cycle; and it happens mostly over IN SDM, the data — and address space on which it is proper names, subjects have access to the first letter and to the number of syllables better than chance, etc. Also interesting is their clever way stored — are represented by large sequences of bits, called bit- of triggerring the state: through definitions of rarely-used words. For some strings. The Hamming distance provides comparisons between psychological studies about the tip-of-tongue, see (author?) [24, 1, 2]. bitstrings and is used as a metric for the system. The Hamming 0000–0000/00$00.00distance is defined for two bitstrings of equal length as the c 2012 IEEE WORKING PAPER 2 number of positions in which bits differ. For example, 00110b and 01100b are bitstrings of length 5 and their Hamming distance is 2. The space studied by Kanerva is also called the hypercube graph, or Qn , as in Figure 1. For a fixed n ∈ Z, the graph ⌘ G = (V, E), in which v ∈ V iff there is a bijective function b : V → {0, 1}n and (vi , vj ) ∈ E iff H(b(vi ), b(vj )) = 1, where H is the Hamming distance. That is, n-sized bitstrings correspond to nodes, and edges exist between nodes if f they flip a single bit. Though Kanerva has derived many combinatorial properties of the space, additional results have been found by the graph-theoretical community: it is a bipartite graph with chromatic number 2, and it is planar only if Fig. 2. Activated addresses inside accessradius r around the center address. n ≤ 3. The proofs, though elementary, illuminate some of its properties: vertices connect bitstrings with an even number of 1’s with bitstrings with an odd number of 1’s, therefore (i) SDM memory has robustness against loss of addresses (e.g., it is bipartite and (ii) it may be colored with two colors. As death of a neuron). for planarity, the case n = 3 is the largest planar one. A good In traditional memory, each datum is stored in an address survey is provided by [7] — and some interesting results may and every lookup of a specific datum requires a search through be found in [5; 29; 20; 26]. the memory. In spite of computer scientists having developed One has to be careful when thinking intuitively about beautiful algorithms to perform fast searches, almost all of distance in SDM because the Hamming distance does not have them do a precise search. That is, if you have an imprecise the same properties of, say, our 3-dimensional space. clue of what you need, these algorithms will simply fail. In SDM, the data space is the same as the address space, A major difference is that in the 3-d world, there exist which amounts to a vectorial, binary space, that is, a {0, 1}n very far points, while in Qn the largest possible distance is space. This way, the addresses where the data will be written n. Another one is that the number of directions in Qn is are the same as the data themselves. For example, the datum enormous: at every point there are n directions to choose from. η = 00101b ∈ {0, 1}5 will be written to the address ξ = η = Though both follow the triangle inequality (d(A, C) ≤ 00101b . If one chooses a radius of 1, the SDM will activate d(A, B) + d(B, C)), which in 3-d Euclidean distance may be all addresses one bit away or less from the center address. 
loosely interpreted as “if A is close to B, and B is close to So, the datum 00101b will be written to the addresses 00101b , C, then A is also close to C” — d(A, B) ≤ r and d(B, C) ≤ 10101b , 01101b , 00001b , 00111b , and 00100b . r ⇒ d(A, C) ≤ 2r —, but in SDM, although the inequality is In this case, when one needs to retrieve the data, one also valid, two bitstrings would be close when, for instance, could have an imprecise cue at most one bit away from η, r = 430, so 2r = 860 would cover all other bitstrings. Hence, since all addresses one bit away have η stored in themselves. it makes no sense to say that A is also close to C. Extending this train of thought for larger dimensions and There are numerous, beautiful, counter-intuitive notions radius, exponential numbers of addresses are activated and one involved in this space. This difference in intuition may trick can see why SDM is a distributed memory. even experienced researchers when analyzing some situations. When reading a cue ηx that is x bits away from η, the cue Unlike traditional memory used by computers, SDM per- shares many addresses with η. The number of shared addresses forms read and write operations in a multitude of addresses, decreases as the cue’s distance to η increases, in other words, also called neurons. That is, the data is not written, or it is not as x increases. This is shown in Figure 3. The target datum η read in a single address spot, but in many addresses. These was written in all shared addresses, thus they will bias the read are called activated addresses, or activated neurons. output in the direction of η. If the cue is sufficiently near the The activation of addresses takes place according to their target datum η, the read output will be closer to η than ηx was. distances from the datum. Suppose one is writing datum η Repeating the read operation increasingly gets results closer at address ξ, then all addresses inside a circle with center ξ to η, until it is precisely the same. So, it may be necessary and radius r are activated. So, η will be stored in all these to perform more than one read operation to converge to the activated addresses, which are around address ξ, such as in target data η. Figure 2. An address ξ 0 is inside the circle if its Hamming The addresses of the {0, 1}n space grow exponentially with distance to the center ξ is less than or equal to the radius r, the number of dimensions n, i.e., N = 2n . For n = 100 we i.e. distance(ξ, ξ 0 ) ≤ r. have N ≈ 1030 , which is incredibly large when related to Every write or read in SDM memory activates a number computer memory. Furthermore, [9] suggests n between 100 of addresses with close distance. The data is written in these and 10,000. To solve the feasibility problem of implementing activated addresses or read from them. These issues will be this memory, Kanerva made a random sample of {0, 1}n , in his addressed in due detail further on, but a major difference from work, having N 0 elements. All these addresses in the sample a traditional computer memory is that the data are always are called hard locations. Other elements of {0, 1}n , not in stored and retrieved in a multitude of addresses. This way N 0 , are called virtual neurons. This is represented in Figure 4. WORKING PAPER 3 converging. Convergence is simple to handle, just read again and again, until it converges to the target η. If this distance is greater than x we are diverging. Finally, if this distance equals x we are in a tip-of-the-tongue process. 
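The mechanics described so far (activation of every hard location within the access radius, storage in per-dimension counters, a majority-rule read, and iterative reading until the output stabilizes) can be made concrete in a few lines of code. The sketch below is only an illustration of the scheme, not the implementation used for the experiments reported in this paper; it assumes NumPy, uses the small configuration n = 256 and r = 103 mentioned later in the text, and picks an arbitrary number of hard locations and a fixed random seed so the run is repeatable.

# Minimal, illustrative Sparse Distributed Memory: distributed write/read
# over hard locations within a Hamming radius, plus iterative reading.
# Assumptions (not from the paper's code): NumPy, n = 256, r = 103,
# 20,000 hard locations, and a fixed seed for repeatability.
import numpy as np

rng = np.random.default_rng(42)

N_DIM = 256        # n, number of dimensions
RADIUS = 103       # access radius r for n = 256 (see text)
N_HARD = 20_000    # sampled hard locations (a tiny fraction of {0,1}^n)

hard_addresses = rng.integers(0, 2, size=(N_HARD, N_DIM), dtype=np.int8)
counters = np.zeros((N_HARD, N_DIM), dtype=np.int32)  # one counter per dimension

def hamming(a, b):
    """Hamming distance between two 0/1 vectors."""
    return int(np.count_nonzero(a != b))

def activated(address):
    """Indices of hard locations within RADIUS of `address`."""
    dists = np.count_nonzero(hard_addresses != address, axis=1)
    return np.flatnonzero(dists <= RADIUS)

def write(datum):
    """Autoassociative write: store `datum` at the address equal to itself.
    Counters are incremented for 1-bits and decremented for 0-bits."""
    delta = np.where(datum == 1, 1, -1).astype(np.int32)
    counters[activated(datum)] += delta

def read(address):
    """Sum the counters of the activated locations and threshold at zero;
    ties are broken at random."""
    sums = counters[activated(address)].sum(axis=0)
    out = (sums > 0).astype(np.int8)
    ties = sums == 0
    out[ties] = rng.integers(0, 2, size=int(ties.sum()), dtype=np.int8)
    return out

def iterative_read(cue, max_iters=10):
    """Read repeatedly; a shrinking distance to the target means convergence,
    a growing one divergence (the cue lies beyond the critical distance)."""
    current = cue.copy()
    for _ in range(max_iters):
        nxt = read(current)
        if np.array_equal(nxt, current):
            break
        current = nxt
    return current

# Write a handful of random items, then read from a noisy cue of one of them.
items = rng.integers(0, 2, size=(20, N_DIM), dtype=np.int8)
for eta in items:
    write(eta)

eta = items[0]
cue = eta.copy()
cue[rng.choice(N_DIM, size=20, replace=False)] ^= 1   # a cue 20 bits away

print("distance before:", hamming(cue, eta))
print("distance after :", hamming(iterative_read(cue), eta))

With these toy parameters the noisy cue typically falls back onto the stored item within one or two readings; the exact behavior depends on the random sample of hard locations and on how full the memory is.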
A tip-of-the-tongue ⌘ ⌘x psychologically happens when you know that you know, but you can’t say what exactly it is. In SDM mathematical model, a tip-of-the-tongue process takes infinite time to converge. (author?) [9] called this x distance, where the read’s output averages x, the critical distance. Intuitively, it is the distance from which smaller distances converge and greater distances diverge. In Figure 5, the circle has radius equal to the critical distance and every ηx inside the circle should converge. The Fig. 3. Shared addresses between thetarget datum η and the cue ηx . figure also shows convergence in four readings. All properties of reading and write operations presented before remain valid but limited to hard locations. Kanerva suggests taking a sample of about one million hard locations. critical distance Using this sample of binary space, our data space does not ⌘ exist completely. That is, the binary space has 2n addresses, but the memory is far away from having these addresses avail- ⌘x,3 able. In fact, only a fraction of this vectorial space is actually ⌘x,2 instantiated. Following Kanerva’s suggestion of one million ⌘x,1 ⌘x hard locations, for n = 100, only 100 · 106 /2100 = 7 · 10−23 percent of the whole space exists, and for n = 1, 000 only 100 · 106 /21000 = 7 · 10−294 percent. Kanerva also suggests the selection of a radius that will Fig. 5. In this example, four iterative readings were activate, on average, one-thousandth of the sample, which is required to converge from ηx to η. 1,000 hard locations for a sample of one million addresses. In order to achieve his suggestion, a 1,000-dimension memory The {0, 1}n space has N = 2n locations from which we uses an access radius r = 451, and a 256-dimensional memory, instantiate N 0 samples. Each location in our sample is called r = 103. We think that a 256-dimensional memory may be a hard location. On these hard locations, we do operations important because it presents conformity to Miller’s magic of read and write. One of the insights of SDM is exactly number [21]. the way we read and write: using data as addresses in a distributed fashion. Each datum η is written in every activated hard location inside the access radius centered on the address, N = {0, 1}n virtual neurons that equals datum, ξ = η. Kanerva suggested using an access radius r having about one-thousandth of N 0 . As an imprecise cue ηx shares hard locations with the target bitstring η, it is possible to retrieve η correctly. (Actually, probably more than N0 one read is necessary to retrieve exactly η.). Moreover, if some ⇠1 neurons are lost, only a fraction of the datum is lost and it is ⇠3 ⇠2 possible that the memory can still retrieve the right datum. A random bitstring is generated with equal probability of hard-locations 0’s and 1’s for each bit. One can readily see that the average distance between two random bitstrings follows the pbinomial distribution with mean n/2 and standard deviation n/4. For Fig. 4. Hard-locations randomly sampled from binary space. a large n, most of the space lies close to the mean and has fewer shared hard locations. As two bitstrings with distance Since a cue ηx near the target bitstring η shares many hard far from n/2 are very improbable, (author?) [9] defined that locations with η, SDM can retrieve data from imprecise cues. two bitstrings are orthogonal when their distance is n/2. 
Despite this feature, it is very important to know how impre- The write operation needs to store, for each dimension cise this cue could be while still giving accurate results. What bit which happened more (0’s or 1’s). This way, each hard is the maximum distance from our cue to the original data location has n counters, one for each dimension. The counter that still retrieves the right answer? An interesting approach is incremented for each bit 1 and decremented for each bit is to perform a read operation with a cue ηx , that is x bits 0. Thus, if the counter is positive, there have been more 1’s away from the target η. Then measure the distance from the than 0’s, if the counter is negative, there have been more 0’s read output and η. If this distance is smaller than x we are than 1’s, and if the counter is zero, there have been an equal WORKING PAPER 4 η 0 1 1 0 1 0 0 ξbefore 6 -3 12 -1 0 2 concepts 4 together. (author?) [21] approached this concept via red⇓ -1 red⇓ +1 red⇓ +1 red⇓ -1 red⇓ +1 red⇓ -1 “chunking red⇓ -1 through averaging”. ξafter 5 -2 13 -2 1 1 Due3 to the distribution of hard locations between two TABLE I W RITE OPERATION EXAMPLE IN A 7- DIMENSIONAL MEMORY OF DATA η random bitstrings, the vast majority of concepts is orthogo- BEING WRITTEN TO ξ, ONE OF THE ACTIVATED ADDRESSES . nal to all others. Consider a non-scientific survey during a cognitive science seminar, where students asked to mention ideas unrelated to the course brought up terms like birthdays, boots, dinosaurs, fever, executive order, x-rays, and so on. Not number of 1’s and 0’s. Table I shows an example of a write only are the items unrelated to cognitive science, the topic of operation being performed in a 7-dimensional memory. the seminar, but they are also unrelated to each other. The read is performed polling each activated hard location For any two memory items, one can readily find a stream and statistically choosing the most written bit for each dimen- of thought relating two such items (“Darwin gave dinosaurs sion. It consists of adding all n counters from the activated the boot”; “she ran a fever on her birthday”; “isn’t it time for hard locations and, for each bit, choosing bit 1 if the counter is the Supreme Court to x-ray that executive order?”, ... and so positive, choosing bit 0 if the counter is negative, and randomly forth). Robert French presents an intriguing example in which choose bit 0 or 1 if the counter is zero. one suddenly creates a representation linking the otherwise unrelated concepts of “coffee cups” and “old elephants” [6]. II. N EURONS AS POINTERS This mapping from concepts to bitstrings brings us two One interesting view is that neurons in SDM work like central questions: (i) Suppose we have a bitstring that is pointers. As we write bitstrings in memory, the hard locations’ linking two major concepts. How do we know which concepts counters are updated and some bits are flipped. Thus, the are linked together? (ii) From a concept bitstring, how can we activated hard locations do not necessarily point individually to list all concepts that are somehow linked to it? This second the bitstring that activated it, but together they point correctly. question is called the problem of spreading activation. In other words, the read operation depends on many hard lo- cations to be successful. This effect is represented in Figure 6: IV. R EAD OPERATION where all hard locations inside the circle are activated and they, individually, do not point to η. 
But, like vectors, adding them In his work, Kanerva proposed and analyzed a read algo- up points to η. If another datum ν is written into the memory rithm called here Kanerva’s read. His read takes all activated near η, the shared hard locations will have information from hard locations counters and sum them. The resulting bitstring both of them and would not point to either. All hard locations has bit 1 where the result is positive, bit 0 where the result outside of the circle are also pointing somewhere (possibly is negative, and a random bit where the result is zero. In a other data points). This is not shown, however, in order to word, each bit is chosen according to all written bitstrings in keep the picture clean and easily understandable. all hard locations, being equal to the bit more appeared. Table ?? shows an example of Kanerva’s read result bitstring. 1.0 P (xi = yi |d(x, y) ≤ r) ⌘ 0.9 0.8 0.7 0.6 Fig. 6. Hard-locations pointing, approximately, to the target bitstring. 0.5 0 100 200 300 400 500 600 700 800 900 III. C ONCEPTS Fig. 7. The threshold size r of hard-locations bring about the autocorrelation Although Kanerva does not mention concepts directly in of Theorem 1. Consider, for instance, the extremes {0, n}. Let us start his book [9], the author’s interpretation is that each bitstring with n (in our case, n = 1000 dimensions): given the information that may be mapped to a concept. Thus, unrelated concepts are d(x, y) ≤ n, the probability of a bitmatch in dimension i is 1/2; as ∀x, y ∈ {0, 1}N , d(x, y) ≤ n. At the other extreme, consider that we have orthogonal and concepts could be linked through a bitstring the information that d(x, y) ≤ 0: in this case x = y and the probability near both of them. For example, “beauty” and “woman” have of a bitmatch, in any dimension, is 1. The autocorrelation hence drops distance n/2, but a bitstring that means “beautiful woman” monotonically until convergence at 1/2 as the distance grows. Numerically, our results converge to precisely 1/2 only after d(x, y) ≥ 607 for n = 1000; could have distance n/4 to both of them. As a bitstring d(x, y) ≥ 5332 for n = 10000, and d(x, y) ≥ 83 for n = 100. This will with distance n/4 is very improbable, it is linking those be seen in Lemma 2. WORKING PAPER 5 In this other paragraph we explain what the equator distance Proof. The left-hand expression P (xi = yi |d(x, y) ≤ r) is. computes the probability of a bitmatch in i, given that we Then, this introductory section breaks and moves on to the know that x and y are in the access radius defined by r, i.e., main findings of this paper. d(x, y) ≤ r. Applying the law of total probability to the left-hand ex- V. A DEVIATION FROM THE EQUATOR DISTANCE ? pression we obtain (author?) [10] Kanerva writes2 : r X You have done an incredibly thorough analysis of P (xi = yi |d(x, y) = k ≤ r)P (d(x, y) = k|d(x, y) ≤ r) SDM. I like the puzzle in your message and believe k=0 that your simulations are correct and to be learned (1) from. So what to make of the difference compared to my Figure 7.3 (and your Figure ??)? I think the We also know that difference comes from my not having accounted fully for the effect of the other 9,999 vectors that n−k are stored in the memory. You say in it P (xi = yi |d(x, y) = k) = (2) n n k “Our results show that the theoretical prediction P (d(x, y) = k|d(x, y) ≤ r) = Pr n (3) is not accurate. 
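The numbers quoted in the caption of Figure 7 follow from the combinatorial expression that is formalized as Theorem 1 below. As a cross-check, the short sketch that follows (standard library only; an illustration, not the authors' code) evaluates P(x_i = y_i | d(x, y) ≤ r) exactly and reports the smallest distance bound at which the value becomes indistinguishable from 1/2 in double precision; the n = 10,000 case is omitted because the exact sums become slow.

# Exact evaluation of the bitmatch probability of Theorem 1 (below):
# P(x_i = y_i | d(x, y) <= r) = sum_{k<=r} C(n-1, k) / sum_{k<=r} C(n, k).
from math import comb

def bitmatch_probability(n, r):
    num = sum(comb(n - 1, k) for k in range(r + 1))
    den = sum(comb(n, k) for k in range(r + 1))
    return num / den

# Access radii used in the paper: r = 451 for n = 1000, r = 4845 for n = 10000.
print(bitmatch_probability(1000, 451))    # about 0.5529
print(bitmatch_probability(10000, 4845))  # about 0.5169

def first_bound_at_half(n):
    """Smallest distance bound at which the probability rounds to exactly 0.5
    in double precision (the 'convergence' distances quoted in Figure 7)."""
    num = den = 0
    for r in range(n + 1):
        num += comb(n - 1, r)
        den += comb(n, r)
        if num / den == 0.5:
            return r
    return None

for n in (100, 1000):
    print(n, first_bound_at_half(n))  # the caption reports 83 and 607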
There are interaction effects from j=0 j one or more of the attractors created by the 10,000 writes, and these attractors seem to raise the Hence, distance beyond 500 bits (Figure ??).” Pr n−k n I think that is correct. It also brings to mind a k=0 n k P (xi = yi |d(x, y) ≤ r) = P r n (4) comment Louis Jaeckel made when we worked at j=0 j NASA Ames. He pointed out that autoassociative storage (each vector is stored with itself as the Finally, the combinatorial identity address) introduces autocorrelation that my formula for Figure 7.2 did not take into account. When we n−k n (n − k) (n − 1)! n−1 n! read from memory, each stored vector exerts a pull = = = toward itself, which also means that each bit of a n k n (n − k)!k! k!(n − 1 − k)! k retrieved vector is slightly biased toward the same bit (5) of the read address, regardless of the read address. closes the theorem. We never worked out the math. This is an important observation. A hard location is activated because it shares many dimensions with the items read from or written onto it. Imagine the ‘counter’s eye view’: each Theorem 1 is valid for both “x written at x” (autoassociative individual counter ‘likes’ to write on its own corresponding memory) and “random written at x” (heteroassociative mem- bit-address value more than it likes the opposite; as each hard- ory). When n = 1, 000 and r = 451, P (xi = yi |d(x, y) ≤ location has a say in its own area — and nowhere else. r) = p = 0.552905. Each bit of a hard location does indeed Let x and y be random bitstrings and n be the number of have a small pull bias. What is meant by this is that each dimensions in the memory; let xi and yi be the i-th bit of x and particular dimension has a small preference toward positive y, respectively; and d(x, y) be the Hamming distance. Whilst values if its address bit is set to 1, and negative values if set the probability of a shared bit-value between same dimension- to 0—an intuition developed in Figure 8. bits in two random addresses is 1/2, an address only activates Lemma 2. Let r be the access radius given that f percent of hard-locations close to it. Let us call these shared bitvalues a the hard locations are activated. Then, limn→∞ r/n = 1/2. bitmatch in dimension i. So, what is the probability of bitmatches given that we know Proof. As the bits of the hard locations’ addresses are ran- the access radius r between the address and a hard-location? domly chosen, the distance between two hard locations follow a Binomial distribution with n samples and probability 0.5, B(n, 0.5). For n sufficiently large, the Binomial distribu- Theorem 1. Each dimension i has a small pull Pbias, which tion can be approximated by a Normal distribution, i.e., r n−1 can be measured by P (xi = yi |d(x, y) ≤ r) = Pk=0 r k n . B(n, 0.5) → N (µ = n/2, σ 2 = n/4). k=0 k Let Φ(x) be the cdf of the standard normal distribution. 2 Email thread ‘SDM: A puzzling issue and an invitation’, started March Let z = r−n/2 √ n/2 . Thus, P (d(x, y) ≤ r) = Φ(z). As f = 16th 2018, in which we discussed the aforementioned discrepancy. To think that some centuries ago, all scientific publishing was the exchange of such P (d(x, y) ≤ r), then, f = Φ(z). letters. Calculating the inverse, z = Φ−1 (f ). Then, WORKING PAPER 6 Theorem 4. The autocorrelation vanishes when n → ∞, i.e., limn→∞ P (xi = yi |d(x, y) ≤ r) = 1/2. 1.0 P (xi = yi |d(x, y) ≤ r) Proof. From Lemma 2, we know that r = n/2 for n 0.9 sufficiently large. 
Thus, replacing r = n/2 in Lemma 3, Φ(z1 ) 1 P (xi = yi |d(x, y) ≤ r) = 2Φ(z 2) , where z1 = √n−1 and 0.8 z2 = 0. As n → ∞, z1 → 0, and P (xi = yi |d(x, y) ≤ r) = Φ(0) 2Φ(0) = 1/2. 0.7 Another way to prove is to divide into two cases: 0.6 Suppose that n is an even integer, then, 0.5 r n/2 n 0 100 200 300 400 500 600 700 800 900 X n X n 1X n 2n = = = = 2n−1 k k 2 k 2 k=0 k=0 k=0 Fig. 8. The threshold size r of hard-locations bring about the autocorrelation And, also, of Theorem 1. Consider, for instance, the extremes {0, n}. Let us start with n (in our case, n = 1000 dimensions): given the information that d(x, y) ≤ n, the probability of a bitmatch in dimension i is 1/2; as r n/2 n−1 n−1 ∀x, y ∈ {0, 1}N , d(x, y) ≤ n. At the other extreme, consider that we have X X the information that d(x, y) ≤ 0: in this case x = y and the probability = of a bitmatch, in any dimension, is 1. The autocorrelation hence drops k k k=0 k=0 monotonically until convergence at 1/2 as the distance grows. Numerically, "n−1 # 1 X n−1 n−1 our results converge to precisely 1/2 only after d(x, y) ≥ 607 for n = 1000; d(x, y) ≥ 5332 for n = 10000, and d(x, y) ≥ 83 for n = 100. This will = − 2 k n/2 be seen in Lemma 2. k=0 n−1 1 n−1 = 2 − 2 n/2 1 n−1 = 2n−2 − z = Φ−1 (f ) (6) 2 n/2 r − n/2 Finally, √ = Φ−1 (f ) (7) n/2 √ n n Pr n−1 r = + Φ−1 (f ) (8) k=0 k 2 2 P (xi = yi |d(x, y) ≤ r) = P r n (10) r 1 1 k=0 k = + Φ−1 (f ) √ (9) 2n−2 − 12 n−1 n 2 2 n n/2 = (11) Therefore, n → ∞ ⇒ r/n → 1/2. 2n−1 n−2 1 n−1 2 = n−1 − n (12) Lemma 3. Let Φ(x) be the cdf of the standard normal distri- 2 2 n/2 Φ(z1 ) bution. Then, n → ∞ ⇒ P (xi = yi |d(x, y) ≤ r) = 12 Φ(z 1 n−1 2) , 1 2r−n+1 2r−n = − n (13) where z1 = √n−1 and z2 = √n . 2 2 n/2 Proof. From the approximation of the Binomial distribution n Stirling’s n n for n approximation yields that, √ sufficiently large, 2 ∼ √ 2 . Thus, 21n n/2 B(a, 0.5) by the Normal distribution N (µ = a/2, σ 2 = a/4), n/2 πn/2 ∼ √πn , which yields we conclude that, for a sufficiently large, the cdf of the 1 n limn→∞ 2n n/2 = 0. Finally, as n/2 ≤ n/2 n−1 n , by the Binomial is approximately equal to the cdf of the Normal 1 n−1 squeeze theorem, limn→∞ 2n n/2 = 0, which closes the distribution. Thus, proof for n even. b When n is an odd integer, the steps of the proof are similar. b − a/2 2b − a 1 X a = Φ √ = Φ √ Therefore, the proof is complete. 2a k a/2 a k=0 Thus, In Figure 9, with n = 10, 000 and r = 4, 845, we can notice that the autocorrelation has reduced significantly as predicted b 2b − a X a a by Theorem 4. In fact, in this case, using Lemma 3, P (xi = =2 Φ √ k a yi |d(x, y) ≤ r) = 0.516876. k=0 So far we have looked only at a single pair of bitstrings, The result comes directly from applying the equation above the probability of a single bitmatch between bitstrings within in P (xi = yi |d(x, y) ≤ r). the access radius distance. Now let us consider the number of activated hard locations exhibiting this bitmatch. WORKING PAPER 7 Fig. 10. Given an address x and a dimension i, how many hard locations with bitmatches in i are activated by reading at x? The histogram was obtained through numerical simulation. The red curve is the theoretical normal distribution found in Theorem 5. Fig. 9. The same setup as in Figure ??, but for n = 10, 000. It shows that the interference has almost gone away when n is sufficiently large. Let us analyze the ith counter of a hard location. 
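(As a brief numerical aside before the counter analysis: the sketch below, which assumes SciPy and is only an illustration, checks Lemma 2's radius formula and the trend of Theorem 4, namely that the bitmatch probability drifts toward 1/2 as n grows when the access radius activates a fixed fraction of the hard locations.)

# Numerical check of Lemma 2 (radius from the activation fraction) and of
# the vanishing autocorrelation of Theorem 4, comparing the normal
# approximation of Lemma 3 against the exact expression of Theorem 1.
# Assumes SciPy; an illustration, not the paper's own code.
from math import comb, sqrt
from scipy.stats import binom, norm

def exact_p(n, r):
    """Theorem 1, exact."""
    return sum(comb(n - 1, k) for k in range(r + 1)) / \
           sum(comb(n, k) for k in range(r + 1))

def approx_p(n, r):
    """Lemma 3: (1/2) * Phi(z1) / Phi(z2)."""
    z1 = (2 * r - n + 1) / sqrt(n - 1)
    z2 = (2 * r - n) / sqrt(n)
    return 0.5 * norm.cdf(z1) / norm.cdf(z2)

def radius_for_fraction(n, f):
    """Lemma 2: r = n/2 + Phi^{-1}(f) * sqrt(n)/2."""
    return n / 2 + norm.ppf(f) * sqrt(n) / 2

f = 1 / 1000  # Kanerva's suggested activation fraction
for n in (256, 1000, 10000):          # the last case takes a few seconds
    r = int(binom.ppf(f, n, 0.5))     # exact binomial quantile, for comparison
    print(f"n={n}: r_normal={radius_for_fraction(n, f):.1f}, r_binomial={r}, "
          f"p_exact={exact_p(n, r):.4f}, p_approx={approx_p(n, r):.4f}, r/n={r/n:.3f}")
# Expected behavior: r/n creeps toward 1/2 (Lemma 2) and p drops toward 1/2 (Theorem 4).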
Let s be the number of bitstrings written into memory (in our case, s = 10, 000) and addri be the ith bit of the hard location’s Let h be the number of activated hard locations. As the address. probability of activating a specific hard location is a constant Let θ be the average number of bitstrings written in each h ∼ Binomial(H, p1 ). Thus, E[h] = P µh = Hp1 and V[h] = hard location. As there are s bitstrings written into the mem- r σh2 = Hp1 (1 − p1 ), where p1 = 2−n k=0 nk . ory, and the probability of activating a specific hard location Let Z be the number of activated hard locations Ph with the is constant, θ ∼ Binomial(s, p1 ). Thus, E[θ] = µθ = sp1 and same bit as the reading address. Then, Z = i=1 Xi , where V[θ] = σθ2 = sp1 (1 − p1 ). Xi ∼ Bernoulli(p), where p = P (xi = yi |d(x, y) ≤ r). Let us introduce another random variable. Let Yi be the Theorem 5. Given a reading address x and a dimension i, number of bitmatches in the ith dimension of aP hard location’s θ the number of activated hard-locations with bitmatches at i address after s written bitstrings. Then, Yi = k=1 Xk . follows a normal distribution with E[Z] = µZ = pµh and Theorem 6. Given the number of written bitstrings s, E[Yi ] = 2 V[Z] = σZ = p(1 − p)µh + p2 σh2 . µY = pµθ and V[Yi ] = σY2 = p(1 − p)µθ + p2 σθ2 . Proof. By the central limit theorem, Z is normally distributed. Proof. Applying the law of total expectation, E[Y ] = Applying the law of total averages and the law of total E[E[Y |θ]] = E[pθ] = pE[θ] = pµθ . variance, E[Z] = E[E[Z|h]] = E[ph] = pE[h] = ph, and Applying the law of total variance, V[Z] = E[V[Z|h]] + V[E[Z|h]] = E[hp(1 − p)] + V[ph] = V[Y ] = E[V[Y |θ]] + V[E[Y |θ]] = E[θp(1 − p)] + V[pθ] = p(1 − p)E[h] + p2 V[h] = hp(1 − p) + p2 Hp1 (1 − p1 ). p(1 − p)E[θ] + p2 V[θ] = p(1 − p)µθ + p2 σθ2 . Applying the law of total variance, V[Z] = E[V[Z|h]] + V[E[Z|h]] = E[hp(1−p)]+V[ph] = p(1−p)E[h]+p2 V[h] = During a write operation, the counters are incremented for p(1 − p)µh + p2 σh2 . every bit 1 and decremented for every bit 0. So, after s writes, As, in our case, P (973 < h < 1170) = 0.997, by the there will be θ bitstrings written in each hard location with Yi central limit theorem, Z may be approximated by a normal bitmatches and θ − Yi non-bitmatches. Thus, [cnti |addri = distribution. 1] = (Yi ) − (θ − Yi ) = 2Yi − θ and [cnti |addri = 0] = θ − 2Yi . See Figure 10 for a comparison between the theoretical Theorem 7. E[cnti |addri = 1] = µcnt = (2p − 1)µθ and model and a simulation. 2 V[cnti |addri = 1] = σcnt = 4p(1 − p)µθ + (2p − 1)2 σθ2 . VI. C OUNTER BIAS Proof. E[cnti |addri = 1] = E[2Yi − θ] = E[2Yi ] − E[θ] = The previous theorems show that there is bias in the 2E[Yi ] − µθ = 2pµθ − µθ = (2p − 1)µθ . counters. In this section we show that there is a small counter Applying the law of total variance, V[cnti |addri = 1] = bias, as a function of the bitmatch in dimension i. If the Hard V[2Yi − θ] = E[V[2Yi − θ|θ]] + V[E[2Yi − θ|θ]]. location’s address bit is set to 1, the mean will be above zero. Let us solve each part independently. Thus, Pθ And the mean will be below zero if the address bit is set to V[2Yi − θ|θ] = V[2Yi |θ] = 4V[Yi |θ] = 4V[ k=1 Xk ] = 0. 4θp(1 − p). WORKING PAPER 8 E[V[2Yi −θ|θ]] = E[4θp(1−p)] = 4p(1−p)E[θ] = 4p(1− p)µθ . Finally, E[2Yi − θ|θ] = 2E[Yi |θ] − E[θ|θ] = 2pθ − θ = (2p − 1)θ. V[E[2Yi − θ|θ]] = V[(2p − 1)θ] = (2p − 1)2 V[θ] = (2p − 2 2 1) σθ . Theorem 8. E[cnti |addri = 0] = −µcnt and V[cnti |addri = 2 1] = σcnt . Proof. 
Notice that [cnt_i | addr_i = 0] = −[cnt_i | addr_i = 1]. Thus, E[cnt_i | addr_i = 0] = −E[cnt_i | addr_i = 1] and V[cnt_i | addr_i = 0] = V[cnt_i | addr_i = 1].

In summary, what we learn is that

[cnt_i | addr_i = 1] ~ N(µ_cnt, σ²_cnt)    (14)

[cnt_i | addr_i = 0] ~ N(−µ_cnt, σ²_cnt)    (15)

In our case, p = 0.5529, s = 10,000, and H = 1,000,000, so [cnt_i | addr_i = 1] ~ N(µ = 1.1341, σ² = 10.7184). For "random at x", p = 0.5, so µ = 0 and σ² = 10.7185. The slight bias above or below 0 can be seen in Figure 11.

Fig. 11. The value of the counters after s = 10,000 writes shows the autocorrelation in the counters in autoassociative memories ("x at x"): (a) addr_i = 1; (b) addr_i = 0. The histogram was obtained through simulation; the red curve is the theoretical normal distribution of equations (14) and (15).

Finally,

P(cnt_i ≥ 0 | addr_i = 1) = P(cnt_i ≤ 0 | addr_i = 0) = 1 − N.cdf(0)    (16)

For "random written at x", p = 0.5 implies µ_cnt = 0, which implies P(cnt_i ≥ 0 | addr_i = 1) = P(cnt_i ≤ 0 | addr_i = 0) = 0.5 regardless of the other parameters, since they only affect the variance and the normal distribution is symmetric around its mean.

However, for "x written at x", p = 0.5529 and the probabilities depend on s. For s = 10,000 they equal 0.6354; for s = 20,000, 0.6867; and for s = 30,000, 0.7232. The more random bitstrings are written into the memory, the more the hard locations point to themselves.

Let D be the number of counters aligned with addr_i. The standard deviation was calculated using the fact that [D | θ] ~ Binomial(1000, q), with q = P(cnt_i > 0 | addr_i = 1, θ). Applying the law of total variance, V[D] = E[V[D | θ]] + V[E[D | θ]] = E[1000·q(1 − q)] + V[1000·q] = 1000·E[q − q²] + 1000²·V[q] = 1000·E[q](1 − E[q]) + 1000·(1000 − 1)·V[q], where E[q] = Σ_θ P(cnt_i > 0 | addr_i = 1, θ)·P(θ) and E[q²] = Σ_θ [P(cnt_i > 0 | addr_i = 1, θ)]²·P(θ). Doing the math, E[q] = 0.402922 and E[q²] = 0.634433. Thus, V[q] = E[q²] − (E[q])² = 0.0004166. Hence, V[D] = 648.2041. See Figure 12; we still have to work out why the mean is correct but the standard deviation is not.

Fig. 12. Autocorrelation in the counters in autoassociative memories ("x written at x"): (a) "random at x"; (b) "x at x". The histogram was obtained through simulation; the red curve is the theoretical distribution.

VII. READ BIAS

Now that we know the distribution of cnt_i | addr_i, we may move on to the read operation. During a read, on average h hard locations are activated and their counters are summed. So, for the i-th bit,

acc_i = Σ_{k=1}^{h} cnt_k    (17)

Let η be the reading address and η_i its i-th bit. Let us split the h activated hard locations into two groups: (i) those whose i-th address bit equals η_i, comprising Z hard locations, and (ii) those with the opposite bit, comprising h − Z hard locations. Then

[acc_i | η_i] = Σ_{k=1}^{Z} [cnt_k | addr_k = η_i] + Σ_{k=1}^{h−Z} [cnt_k | addr_k ≠ η_i]    (18)

Each sum is a sum of normally distributed random variables, so

Σ_{k=1}^{Z} [cnt_k | addr_k = η_i] ~ N(µ = µ_cnt·µ_Z, σ² = σ²_cnt·µ_Z + µ²_cnt·σ²_Z)    (19)

Σ_{k=1}^{h−Z} [cnt_k | addr_k ≠ η_i] ~ N(µ = −µ_cnt·(1 − p)·µ_h, σ² = σ²_cnt·(1 − p)·µ_h + µ²_cnt·σ²_{h−Z})    (20)

In our case, Σ_{k=1}^{Z} [cnt_k | addr_k = 1] ~ N(µ = 672.12, σ² = 7113.87) and Σ_{k=1}^{h−Z} [cnt_k | addr_k ≠ 1] ~ N(µ = −543.49, σ² = 5752.54). See Figure 13: the average is correct but the variance is too small.

Fig. 13. (a) Equation 19 (addr_k = 1); (b) Equation 20 (addr_k = 0). The histogram was obtained through simulation; the red curve is the theoretical normal distribution.

Hence,

[acc_i | η_i = 1] ~ N(µ = (2p − 1)²·µ_θ·µ_h, σ² = σ²_cnt·µ_h + 2·µ²_cnt·σ²_h)    (21)

[acc_i | η_i = 0] ~ N(µ = −(2p − 1)²·µ_θ·µ_h, σ² = σ²_cnt·µ_h + 2·µ²_cnt·σ²_h)    (22)

In our case, [acc_i | η_i = 1] ~ N(µ = 128.62, σ² = 12865.69) and [acc_i | η_i = 0] ~ N(µ = −128.62, σ² = 12865.69).
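As a numerical cross-check of the counter and accumulator moments derived in this section, the following sketch (assuming SciPy; parameter values as in the text: n = 1000, r = 451, H = 10^6 hard locations, s = 10,000 autoassociative writes) evaluates equations (14) to (16), (21), (22) and the error probability of equation (26) below. As noted below, the accumulator variance predicted by equations (21) and (22) underestimates the empirical variance, so the P(wrong) printed here is smaller than the 0.2204 obtained with the empirical variance.

# Numerical evaluation of the counter moments (equations (14)-(16)), the
# accumulator moments (equations (21)-(22)) and the error probability of
# equation (26). A sketch assuming SciPy; parameters follow the text.
from math import comb
from scipy.stats import norm

n, r = 1000, 451          # dimensions and access radius
H, s = 1_000_000, 10_000  # hard locations and written bitstrings

tail = sum(comb(n, k) for k in range(r + 1))
p = sum(comb(n - 1, k) for k in range(r + 1)) / tail   # Theorem 1
p1 = tail / 2 ** n                                     # activation probability

mu_h, var_h = H * p1, H * p1 * (1 - p1)      # activated hard locations h
mu_th, var_th = s * p1, s * p1 * (1 - p1)    # writes per hard location, theta

mu_cnt = (2 * p - 1) * mu_th                                   # Theorem 7
var_cnt = 4 * p * (1 - p) * mu_th + (2 * p - 1) ** 2 * var_th

p_agree = 1 - norm.cdf(0, loc=mu_cnt, scale=var_cnt ** 0.5)    # equation (16)

mu_acc = (2 * p - 1) ** 2 * mu_th * mu_h                       # equation (21)
var_acc = var_cnt * mu_h + 2 * mu_cnt ** 2 * var_h
p_wrong = norm.cdf(0, loc=mu_acc, scale=var_acc ** 0.5)        # equation (26)

print(f"p = {p:.6f}, p1 = {p1:.6f}, mu_cnt = {mu_cnt:.4f}, var_cnt = {var_cnt:.4f}")
print(f"P(cnt_i >= 0 | addr_i = 1) = {p_agree:.4f}")
print(f"mu_acc = {mu_acc:.2f}, var_acc = {var_acc:.2f}, P(wrong) = {p_wrong:.4f}")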
See Figure 14: the variance issue from Figure 13 has propagated to these plots.

Fig. 14. (a) Equation 21 (η_i = 1); (b) Equation 22 (η_i = 0). The histogram was obtained through simulation; the red curve is the theoretical normal distribution.

Finally,

P(wrong) = P(acc_i < 0 | η_i = 1)·P(η_i = 1) + P(acc_i > 0 | η_i = 0)·P(η_i = 0)    (23)
         = N_{η_i=1}.cdf(0)/2 + (1 − N_{η_i=0}.cdf(0))/2    (24)
         = N_{η_i=1}.cdf(0)/2 + N_{η_i=1}.cdf(0)/2    (25)
         = N_{η_i=1}.cdf(0)    (26)

Using the empirical variance of σ² = 27838.3029, we calculate P(wrong) = 0.220377.

In order to check this probability, we ran a simulation reading from 1,000 random bitstrings (which had never been written into memory) and calculated the distance between each bitstring and the result of a single read. As P(wrong) = 0.22037, we expected an average distance of 220.37 with a standard deviation of 13.10. See Figure 15 for the comparison between the simulated and the theoretical outcomes.

Fig. 15. The histogram was obtained through simulation. The red curve is the theoretical normal distribution.

Figure 16 shows the new distance between η_d and read(η_d), where η_d is d bits away from η. As there is no intersection between η and η_d for d ≥ 520, our model applies and explains the horizontal line around distance 220.

Fig. 16. New distance after a single read operation on a bitstring η_d, which is d bits away from η. The new distance was calculated between η_d and read(η_d). Notice that when d ≥ 520 the intersection between η and η_d is zero, which means that only random bitstrings were written into the activated hard locations. The distance of 220 equals 1000 · 0.220, where 0.220 is the probability found in Figure 15.
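For completeness, here is a sketch of the simulation protocol behind Figures 15 and 16: write s random bitstrings autoassociatively, read once from probes that were never written, and compare the resulting Hamming distances with the prediction n · P(wrong). It assumes NumPy and SciPy, uses a reduced configuration (n = 256, r = 103, 50,000 hard locations, 5,000 writes) so it runs quickly, and therefore computes its own, smaller, theoretical prediction rather than the 220-bit figure above. As in the text, the simulated distances can be expected to sit somewhat above the purely theoretical value, because equations (21) and (22) underestimate the accumulator variance.

# Sketch of the simulation behind Figures 15 and 16, at reduced size.
# Assumptions: NumPy and SciPy; n = 256, r = 103, 50,000 hard locations,
# 5,000 autoassociative writes, fixed random seed.
from math import comb
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, r = 256, 103
H, s, probes = 50_000, 5_000, 500

hard = rng.integers(0, 2, size=(H, n), dtype=np.int8)
counters = np.zeros((H, n), dtype=np.int32)

# For 0/1 vectors, d(a, b) = |a| + |b| - 2 a.b, so activation reduces to a
# matrix-vector product, which keeps the loops over writes and probes fast.
hard_f = hard.astype(np.float32)
hard_ones = hard_f.sum(axis=1)

def activated(x):
    xf = x.astype(np.float32)
    dists = hard_ones + xf.sum() - 2.0 * (hard_f @ xf)
    return np.flatnonzero(dists <= r)

def read_once(x):
    sums = counters[activated(x)].sum(axis=0)
    out = (sums > 0).astype(np.int8)
    ties = sums == 0
    out[ties] = rng.integers(0, 2, size=int(ties.sum()), dtype=np.int8)
    return out

data = rng.integers(0, 2, size=(s, n), dtype=np.int8)
for eta in data:                                   # "x written at x"
    counters[activated(eta)] += np.where(eta == 1, 1, -1).astype(np.int32)

# Figure 15 protocol: one read from fresh random addresses, never written.
dists = []
for _ in range(probes):
    x = rng.integers(0, 2, size=n, dtype=np.int8)
    dists.append(int(np.count_nonzero(read_once(x) != x)))
dists = np.array(dists)

# Theoretical prediction n * P(wrong) for these parameters (equations (21)-(26)).
tail = sum(comb(n, k) for k in range(r + 1))
p = sum(comb(n - 1, k) for k in range(r + 1)) / tail
p1 = tail / 2 ** n
mu_th, var_th = s * p1, s * p1 * (1 - p1)
mu_h, var_h = H * p1, H * p1 * (1 - p1)
mu_cnt = (2 * p - 1) * mu_th
var_cnt = 4 * p * (1 - p) * mu_th + (2 * p - 1) ** 2 * var_th
mu_acc = (2 * p - 1) ** 2 * mu_th * mu_h
var_acc = var_cnt * mu_h + 2 * mu_cnt ** 2 * var_h
p_wrong = norm.cdf(0, loc=mu_acc, scale=var_acc ** 0.5)

print("simulated distance: mean", dists.mean(), "std", dists.std())
print("theoretical n * P(wrong):", n * p_wrong)

# Figure 16 protocol: single read from cues d bits away from a stored item.
eta = data[0]
for d in (0, 32, 64, 96, 128, 160):
    cue = eta.copy()
    cue[rng.choice(n, size=d, replace=False)] ^= 1
    print("d =", d, "-> distance after one read:",
          int(np.count_nonzero(read_once(cue) != cue)))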
VIII. CONCLUSION

A. Effects on system-wide properties

The skeptical reader may ask: 'Beyond theory, what are the consequences of this work? How does it apply, if at all, to building systems that incorporate SDM?'

Our conclusion is that this work is of special relevance as it brings out systemwide properties that models need to take into account. The first design decision involved in building a system with SDM regards the size of n, the number of dimensions of the memory. Here, complementing the work of XXXXXXXX, we show that as n grows, the autocorrelation tends to 0. But recall that as n grows, the size of the critical distance also tends to 0. It is therefore crucial to select a value for n at which the model will retain its convergence properties, with an understanding of the possible effects brought about by autocorrelation. Systemwide properties such as these affect the entire behavior of the model and, as such, should always be taken into account by theoreticians and modelers alike.

B. Future Research

BPs, System 1 vs. System 2, HRRs, etc.

ACKNOWLEDGMENT

The authors would like to thank Pentti Kanerva, Eric Paul Nichols, José Ricardo de Almeida Torreão, Moacyr Alvim Horta Barbosa da Silva, Flavio Codeço Coelho, Horacio Hideki Yanasse, and Paulo Murilo Castro de Oliveira for their careful, comprehensive reviews of the first author's theses.

REFERENCES

[1] Alan S. Brown. A review of the tip-of-the-tongue experience. Psychological Bulletin, 109(2):204–223, 1991.
[2] Roger Brown and David McNeill. The tip of the tongue phenomenon. Journal of Verbal Learning and Verbal Behavior, 5:325–337, 1966.
[3] Peter J. Denning. Sparse distributed memory. American Scientist, 77:333–335, July 1989.
[4] M. J. Flynn, P. Kanerva, and N. Bhadkamkar. Sparse distributed memory: Principles and operation. Technical Report CSL-TR-89-400, NASA Ames Research Center, 1989.
[5] S. Foldes. A characterization of hypercubes. Discrete Mathematics, 17(1):155–159, 1977.
[6] R. M. French. When coffee cups are like old elephants, or why representation modules don't make sense. In A. Riegler and M. Peschl, editors, Proceedings of the 1997 International Conference on New Trends in Cognitive Science, pages 158–163. Austrian Society for Cognitive Science, 1997.
[7] Frank Harary, John P. Hayes, and Horng-Jyh Wu. A survey of the theory of hypercube graphs. Computers & Mathematics with Applications, 15(4):277–289, 1988.
[8] S. Jockel, F. Lindner, and Jianwei Zhang. Sparse distributed memory for experience-based robot manipulation. In Proceedings of the 2008 IEEE International Conference on Robotics and Biomimetics, pages 1298–1303. IEEE, February 2009.
[9] P. Kanerva. Sparse Distributed Memory. MIT Press, 1988.
[10] P. Kanerva. The spatter code for encoding concepts at many levels. In M. Marinaro and P. G. Morasso, editors, ICANN '94, Proceedings of the International Conference on Artificial Neural Networks, volume 1, pages 226–229, London, 1994. Springer-Verlag.
[11] Pentti Kanerva. Parallel structures in human and computer memory. Cognitiva, 85, June 1985.
[12] Pentti Kanerva. Sparse distributed memory and related models. In Associative Neural Memories: Theory and Implementation, page 41. Oxford University Press, 1993.
[13] Pentti Kanerva. The spatter code for encoding concepts at many levels. In Maria Marinaro and Pietro G. Morasso, editors, ICANN '94, pages 226–229. Springer London, London, 1994.
[14] Pentti Kanerva. Binary spatter-coding of ordered K-tuples. In Gerhard Goos, Juris Hartmanis, Jan Leeuwen, Christoph Malsburg, Werner Seelen, Jan C. Vorbrüggen, and Bernhard Sendhoff, editors, Artificial Neural Networks — ICANN 96, volume 1112, pages 869–873. Springer Berlin Heidelberg, Berlin, Heidelberg, 1996.
[15] Pentti Kanerva. Fully distributed representation. In Proceedings RWC Symposium, pages 358–365, 1997.
[16] Pentti Kanerva. Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation, 1(2):139–159, 2009.
[17] James D. Keeler. Comparison between Kanerva's SDM and Hopfield-type neural networks. Cognitive Science, 12(3):299–329, 1988.
[18] Hee-Yong Kwon. ATM call admission control using sparse distributed memory. In Neural Networks, 1997. International Conference on, volume 2, pages 1321–1325. IEEE, 1997.
[19] Hee-Yong Kwon, Dong-Keyu Kim, Seung-Jun Song, Je-U. Choi, In-Heang Lee, and Hee-Yeung Hwang. ATM call admission control using sparse distributed memory. II. In Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on, volume 3, pages 1799–1803. IEEE, 1998.
[20] Jean-Marie Laborde and Surya Prakash Rao Hebbare. Another characterization of hypercubes. Discrete Mathematics, 39(2):161–166, January 1982.
[21] Alexandre Linhares, Daniel M. Chada, and Christian N. Aranha. The emergence of Miller's magic number on a sparse distributed memory. PLOS One, 6(1):e15592, January 2011.
[22] Mateus Mendes, Manuel Crisóstomo, and A. Paulo Coimbra. Robot navigation using a sparse distributed memory. In Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pages 53–58. IEEE, 2008.
[23] Hongying Meng, Kofi Appiah, Andrew Hunter, Shigang Yue, Mervyn Hobden, Nigel Priestley, Peter Hobden, and Cy Pettit. A modified sparse distributed memory model for extracting clean patterns from noisy inputs. In Proceedings of the International Joint Conference on Neural Networks, pages 2084–2089. IEEE, June 2009.
[24] Antje Meyer and Kathryn Bock. The tip-of-the-tongue phenomenon: Blocking or partial activation? Memory & Cognition, 20(6):715–726, 1992.
[25] Rajesh Rao and Olac Fuentes. Hierarchical learning of navigational behaviors in an autonomous robot using a predictive sparse distributed memory. Autonomous Robots, pages 297–316, 1998.
[26] Frank Ruskey. Combinatorial Generation. University of Victoria, working version (1j-csc 425/520) edition, 2003.
[27] Magnus Sahlgren and Pentti Kanerva. Permutations as a means to encode order in word space. page 6.
[28] Yoshinori Uesaka, Pentti Kanerva, and Hideki Asoh, editors. Foundations of Real-World Intelligence. Number 125 in CSLI Lecture Notes. CSLI Publications, Stanford, Calif., 2001.
[29] A. Wagner and D. Corneil. Embedding trees in a hypercube is NP-complete. SIAM Journal on Computing, 19(3):570–590, June 1990.