Attention, Perception, & Psychophysics
2010, 72 (6), 1614-1625
doi:10.3758/APP.72.6.1614

Alignment to visual speech information

RACHEL M. MILLER, KAUYUMARI SANCHEZ, AND LAWRENCE D. ROSENBLUM
University of California, Riverside, California

Speech alignment is the tendency for interlocutors to unconsciously imitate one another's speaking style. Alignment also occurs when a talker is asked to shadow recorded words (e.g., Shockley, Sabadini, & Fowler, 2004). In two experiments, we examined whether alignment could be induced with visual (lipread) speech and with auditory speech. In Experiment 1, we asked subjects to lipread and shadow out loud a model silently uttering words. The results indicate that shadowed utterances sounded more similar to the model's utterances than did subjects' nonshadowed read utterances. This suggests that speech alignment can be based on visual speech. In Experiment 2, we tested whether raters could perceive alignment across modalities. Raters were asked to judge the relative similarity between a model's visual (silent video) utterance and subjects' audio utterances. The subjects' shadowed utterances were again judged as more similar to the model's than were read utterances, suggesting that raters are sensitive to cross-modal similarity between aligned words.

Starting from infancy, humans show an amazing ability to imitate one another (see Meltzoff & Moore, 1997, for a review). As adults, we unconsciously imitate facial expressions, body posture, and mannerisms of a conversational partner in a social context (e.g., Chartrand & Bargh, 1999; Shockley, Santana, & Fowler, 2003). Chartrand and Bargh suggest that imitation is often passive and can occur without volition. They propose the chameleon effect, an unconscious tendency toward mimicking facial expressions, body posture, and mannerisms of another person. Although this imitation is typically unintentional, it can be influenced by multiple factors, including the social relationship between the conversational partners.

Imitation also occurs in speech communication. During conversational interaction, interlocutors subtly align to each other's speech rate, intonation, and vocal intensity (Giles, Coupland, & Coupland, 1991; Natale, 1975). This alignment is considered to have important linguistic and social functions, allowing interlocutors to be more effectively and efficiently understood (Giles et al., 1991; and see Chartrand & Bargh, 1999). But even outside of a social setting, talkers will imitate aspects of the speech of a recorded model producing individual words (Goldinger, 1998; Goldinger & Azuma, 2004; Namy, Nygaard, & Sauerteig, 2002; Pardo, 2006; Shockley, Sabadini, & Fowler, 2004). Goldinger implemented a shadowing paradigm in which talkers uttered isolated words immediately after a recorded model. In the shadowing paradigm, talkers are asked to say the words they hear out loud quickly, but clearly. Talkers are never instructed to imitate what they hear. In the typical shadowing experiment, subjects first read a series of words off a computer monitor. These read words act as baseline stimuli for later perceptual ratings of alignment. The subjects then perform the shadowing task.

In order to assess imitation of a model's speech, the baseline (control) and shadowed words of each subject are typically compared with the model's words in an AXB perceptual matching task (Goldinger, 1998). The presence of imitation is indicated when raters choose the shadowed words as sounding more similar to the model's words than do the baseline words. The results of Goldinger's experiment indicated that immediate shadowing produced greater perceived imitation than delayed shadowing; that over the two conditions, low-frequency words were considered better imitations than high-frequency words; and that the strength of perceived imitation for raters increased with the number of repetitions that the shadower heard of the model's utterances. According to Goldinger, this evidence suggests that episodic traces of words that we hear are present and accessible in lexical memory. Alignment during shadowing emerges as a byproduct of how words are accessed from memory.

Alignment also shows that perceivers are sensitive to a talker's articulatory style and unconsciously incorporate that style into their own speech productions. In this sense, speech alignment phenomena are consistent with other results showing that perceivers are sensitive to talker-specific phonetic information and use this talker information to facilitate later speech perception (for a review, see Nygaard, 2005).

In order to evaluate possible acoustic dimensions imitated during shadowing, Shockley et al. (2004) digitally extended voice onset time (VOT) durations in the initial consonants of a model's words before presentation to the shadowers. The results of an AXB rating task showed that the shadowers tacitly imitated the lengthened VOTs at better-than-chance levels. An acoustical analysis also revealed that the VOTs of the subjects' shadowed tokens were significantly longer than the VOTs of their baseline

[email protected]

© 2010 The Psychonomic Society, Inc. 1614 VISUAL SPEECH ALIGNMENT 1615 tokens. The fact that the talkers show evidence of align- these stimuli, talkers can be recognized in both match- ment to VOT in a shadowing task is important in providing ing and identification contexts (Rosenblum, Niehus, & evidence that phonetically relevant dimensions of speech Smith, 2007; Rosenblum et al., 2002). Point-light research are imitated (see also Pardo, 2004). has also shown that a talker’s isolated speech movements In summary, speech alignment occurs on both phonetic can be matched to their voice at better-than-chance levels and extraphonetic levels, with conversational interaction (e.g., Lachs & Pisoni, 2004c; Rosenblum, Smith, Nichols, or without. As with other types of chameleon effects, Hale, & Lee, 2006). These findings suggest that observ- speech alignment, although typically unconscious, can ers are sensitive to the articulatory style of a talker as it is be influenced by outside social factors (e.g., Namy et al., reflected in both auditory and visual modalities. 2002; Pardo, 2006). One missing piece of alignment re- In summary, research suggests that the visual speech search is the determination of whether auditory speech is signal provides not only phonetic information, but also in- the only type of speech information that can induce align- formation about the talker-specific articulatory style—or ment. The next section discusses whether visuall speech idiolect—of t a talker. If talker-specific articulatory infor- information may also have this ability. mation is conveyed in visual speech, visual speech stimuli could have the potential to induce the type of speech align- Visual Speech Information for ment shown for auditory speech. Indeed, Gentilucci and Talker-Specific Characteristics Bernardis (2007) recently reported initial evidence that It is well known that visual speech information plays a visual speech information might have the potential to in- vital role in face-to-face communication (see Rosenblum, duce speech alignment. These researchers asked women 2005, for a review). When the auditory signal is degraded to lipread and shadow two male and two female talkers either by hearing loss or by a noisy environment, indi- silently uttering />?>/ bisyllables. Kinematic and acoustic viduals are aided by seeing the articulating face of a talker analyses of the women’s utterances showed that their lip (e.g., Grant & Seitz, 2000; Sumby & Pollack, 1954). Even movements were larger and their voice spectra were lower when the auditory signal is clear, visual speech informa- when shadowing the male than when shadowing the fe- tion can help perceivers recover a complicated message or male talkers. Gentilucci and Bernardis suggested that this understand messages spoken with a heavy foreign accent would be expected from (what they claim) are the known (e.g., Arnold & Hill, 2001; Reisberg, McLean, & Gold- differences in articulatory movements between the gen- field, 1987). Visual speech information also facilitates ders (with male talkers having larger excursions). These language acquisition in infants (e.g., Mills, 1987). In fact, results suggest that the visual information for the male blind infants show a delay in acquiring certain phonetic talker’s utterances induced the female subjects to produce distinctions that are acoustically similar but visually dis- shadowed utterances that were more male-like in their tinct (e.g., /J/ vs. /K/). 
Visual speech information can fa- movements: The women aligned to talker gender. cilitate second language perception and learning as well The research by Gentilucci and Bernardis (2007) pro- (Davis & Kim, 2001; Navarra & Soto-Faraco, 2007). vided initial evidence that perceivers can align to some as- Notably, visual speech also influences the perception pects of visible speech utterances, but a number of impor- of heard syllables when discrepant auditory and visual tant questions about visual speech alignment remain. For syllables are presented synchronously (i.e., the McGurk example, although Gentilucci and Bernardis used a single effect; McGurk & MacDonald, 1976). The automatic and />?>/ stimulus to induce alignment, auditory alignment ubiquitous nature of audiovisual speech perception has led researchers have typically used word lists (e.g., Gold- some theorists to argue that the primary mode of speech inger, 1998; Namy et al., 2002; Shockley et al., 2004). It perception is multimodal, typically relying on both audi- has proven important to test words in auditory alignment tory and visual input. Spoken communication may in fact research in order to examine the role of lexical access have evolved to take advantage of visuofacial, as well as (e.g., word frequency, neighborhood density) in speech auditory, sensitivities (e.g., Rosenblum, 2005). This per- alignment (e.g., Goldinger, 1998; Goldinger & Azuma, spective is consistent with neurophysiological findings 2004; Shockley et al., 2004). This makes it essential to suggesting that visual speech information modulates audi- determine whether alignment to visual speech can occur tory cortex activity as if the brain is responding to heard with words, as well as with the bisyllables />?>/ tested by speech (e.g., Calvert et al., 1997; MacSweeney et al., Gentilucci and Bernardis. 2000; MacSweeney et al., 2002). Gentilucci and Bernardis (2007) tested only female Given the importance of visual speech information, subjects in their experiment, whereas in most of the audi- the question arises whether it can induce the unconscious tory alignment research both male and female shadowers imitation, or alignment, that has been shown for auditory have been tested. In fact, there is some evidence that male speech. For visual speech to do so, it must convey informa- and female subjects do align differently. For example, tion about a talker’s speaking style to the perceiver. In fact, Namy et al. (2002) found that female shadowers tended there is evidence that visual speech can provide talker- to align more than male shadowers (but see Pardo, 2006). specific characteristics. For example, perceivers can rec- Namy et al. attributed the finding to gender differences in ognize talkers from simply seeing their isolated speech perceptual sensitivity. They speculated that women may movements. Speech movements can be isolated by using be more sensitive to talker-specific information than men a point-light technique, in which only moving dots placed and that this information influences their own productions. on the face are seen against a dark background. From If this is true, the visual alignment reported by Gentilucci 1616 MILLER, SANCHEZ, AND ROSENBLUM and Bernardis may have been a result of the fact that only shadowed utterances sounded more like the model’s utter- female shadowers were tested. The putatively less sensi- ances than did their baseline utterances. tive male observers may not align to visual speech. 
This makes it critical to test visual speech alignment with both EXPERIMENT 1 male and female subjects. Method A third question arising from the research of Gentilucci Participants and Bernardis (2007) is whether visual alignment will occur Two graduate students (1 male, 1 female) acted as models in the with shadowers not asked to repeat the utterances that they experiment and produced the original word list to be shadowed perceive. A majority of auditory alignment researchers (e.g., Shockley et al., 2004). These models had no noticeable ac- have intentionally instructed subjects to say the perceived cents or speech impediments. Sixteen undergraduates (8 male, 8 utterances out loud, thereby avoiding any suggestion that female) acted as subjects who were asked to shadow the models’ the subjects should imitate. However, the subjects in the words. Thirty-two undergraduates acted as raters in an AXB match- ing task. All of the models, subjects, and raters were native speakers Gentilucci and Bernardis study were instructed to repeat of American English with normal hearing and normal or corrected the utterances that they saw, possibly biasing them toward vision. The graduate student models were paid for their participa- imitation. Although this may be a more minor concern, tion. The undergraduate subjects and raters participated in order to testing visual alignment with subjects who are instructed to partially fulfill a course requirement. simply say the words out loud could provide a more rigor- ous test of inadvertent (unconscious) alignment and would Materials and Apparatus A list of 74 bisyllabic, low-frequency English words were used be more consistent with the existing alignment research. as stimuli (see the Appendix). These words were derived from the A final question is whether visual speech alignment list used by Shockley et al. (2004). The words had frequencies of occurs in a perceptually relevant way. Gentilucci and Ber- less than 75 occurrences per million (Kukera & Francis, 1967), and nardis’s (2007) evaluation of alignment involved measur- r they all began with the voiceless stop consonants (/M/, /Q/, or /H/). ing movement kinematics and voice spectra. In contrast, This allowed us to ensure that our subjects were shadowing to a auditory speech researchers most often evaluate alignment degree comparable to those of Shockley et al. (2004). In addition, low-frequency words were selected because it has been shown that using the aforementioned naive rater matching task. By they generally induce greater alignment in shadowers (e.g., Gold- having naive perceivers judge the relative similarity be- inger, 1998). In that this experiment constituted a first attempt to tween utterances, researchers use this method to determine induce alignment with visual speech, it was thought that using low- whether shadowed speech alignment occurs in a perceptu- frequency words would provide the best chance of doing so. How- ally relevant manner (e.g., Goldinger, 1998; Goldinger & ever, it must be acknowledged that using low-frequency words does Azuma, 2004; Namy et al., 2002; Pardo, 2006; Shockley limit the scope of the study. et al., 2004) . Recall that one proposed function of speech All of the stimuli were presented using PsyScope software. Text (baseline) and visual speech stimuli were presented on a 20-in. video alignment is to facilitate communicative and social interac- monitor positioned 3 ft in front of the subjects. Auditory stimuli were tions. 
If this assertion is true, speech should align in a way presented through Sony MDR-V6 headphones. A Sony DSR-11 that is perceptible. Although the use of rater matching tasks camcorder was used to videotape the models. The models and sub- has established this to be true for auditory speech align- jects responded verbally into a Shure SM57 microphone and were ment, it has not yet been determined for visual speech. audio recorded at 44 kHz (16 bits) using Amadeus II software. To address whether visual speech alignment can occur Procedure in a way comparable to that in which auditory speech The experiment took place in three phases. For all three phases, alignment does—in a way that is lexical, gender-relevant, the individuals sat in a sound-attenuating chamber. unconscious, and perceptible—visual speech was tested Phase 1. In Phase 1, two models (1 male, 1 female) were video- using the alignment methods of the auditory speech re- taped producing the 74 bisyllabic words. The word list was presented search. In Experiment 1, we borrowed the shadowing to the models as text on a video monitor. The words were randomly methodology and AXB rating measure used by Goldinger presented at a rate of one word per second. The models were asked to speak the words quickly but clearly into the microphone. These (1998) and others. We tested both male and female sub- utterances were filmed using the camcorder, and these recordings jects on an auditory and visual speech alignment task. were edited on a computer to produce tokens for later presentation The auditory task was borrowed directly from the method to the subjects. The audiovisual recordings were digitized and edited of Goldinger and involved shadowing of a word list ad- using FinalCut Pro software into 74 audio and 74 silent video tokens. opted from Shockley et al. (2004). The visual speech task The silent video showed the entire head and a portion of the models’ adapted these methods for lipreading. On each visual shoulders. speech trial, subjects were asked to say out loud a word Phase 2. Phase 2 of the experiment consisted of the 16 subjects (8 male, 8 female) participating in three tasks: baseline word pro- that they had just lipread from a model. The model was duction (text reading), audio shadowing, and silent video shadowing of the same gender as the subjects (Shockley et al., 2004). (lipreading). Each task was presented in its own block (e.g., Gold- In order to make the lipreading task easier, each trial first inger, 1998; Shockley et al., 2004), and all of the subjects performed included a presentation of two text words, one of which the baseline word production first. The order of the remaining two was the same as the word that they were to lipread. The tasks was counterbalanced across subjects. subjects’ utterances were recorded and presented to raters For the baseline word task, the subjects were audio recorded pro- ducing the original word list, which they read from a video monitor. along with the model’s auditory words and baseline (read) The words were presented individually at 1-sec intervals. The sub- words spoken by the subjects before the shadowing task. If jects were asked to say the words that they saw quickly but clearly visual speech can induce the type of alignment induced by into the microphone. These utterances were later edited on a com- auditory speech, the raters should find that the subjects’ puter to create 74 baseline tokens for the ratings in Phase 3. 
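As a concrete illustration of the stimulus-selection criteria described in the Materials and Apparatus section above (bisyllabic words, fewer than 75 occurrences per million, beginning with a voiceless stop such as /p/, /t/, or /k/), the following is a minimal sketch. The lexicon, frequency values, and syllable counts are hypothetical placeholders rather than the Kučera and Francis (1967) counts used by the authors.

```python
# Illustrative sketch only: selects candidate stimuli like those described in
# the Materials section (bisyllabic, fewer than 75 occurrences per million,
# beginning with a voiceless stop). The lexicon below is a made-up placeholder.

VOICELESS_STOP_ONSETS = ("c", "k", "p", "t")   # orthographic stand-ins for /k/, /p/, /t/
MAX_FREQ_PER_MILLION = 75

# Hypothetical lexicon: word -> (occurrences per million, syllable count)
lexicon = {
    "cabbage": (4, 2),
    "package": (20, 2),
    "tailor": (2, 2),
    "common": (500, 2),   # excluded: too frequent
    "dog": (10, 1),       # excluded: wrong onset and monosyllabic
}

def is_candidate(word: str, freq: float, syllables: int) -> bool:
    """Apply the three stimulus criteria described in the Materials section."""
    return (
        syllables == 2
        and freq < MAX_FREQ_PER_MILLION
        and word.lower().startswith(VOICELESS_STOP_ONSETS)
    )

stimuli = [w for w, (f, s) in lexicon.items() if is_candidate(w, f, s)]
print(stimuli)  # -> ['cabbage', 'package', 'tailor']
```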
VISUAL SPEECH ALIGNMENT 1617 For the audio shadowing task, the subjects were audio recorded task ranged from 86% to 96% for the 16 shadowing subjects, with shadowing 1 of the model’s 74 audio words, which they heard over a mean of 90% correct. Because only correctly lipread utterances headphones. The male subjects shadowed the male model, and the could be used in the matching task performed by the raters, each female subjects shadowed the female model (e.g., Shockley et al., subject’s incorrectly lipread words were not including in the AXB 2004). The shadowing task required the subjects to say each word sets for that subject. Furthermore, to ensure that comparisons across that they heard quickly but clearly into the microphone (e.g., Shock- shadowing of audio and visual presentations were fair, the words in- ley et al., 2004). The subjects were never asked to imitate, or even correctly lipread by a subject were also removed from that subject’s repeat, the model. All shadowed utterances were recorded and later audio shadowed lists. Thus, if the word cabbage had been incor- edited to create 74 audio shadowed tokens for comparison purposes rectly lipread by a subject, cabbage would also be removed from that in Phase 3. subject’s audio shadowed list, baseline list, and model’s list so that For the silent video shadowing task, the subjects were again audio cabbage would not be part of the AXB stimuli for that subject. This recorded shadowing a model’s 74 words. However, in this condition, accounts for the differential number of triads judged by the raters. the subjects were asked to lipread the words from the model. Be- The triads based on the auditorily and visually derived utterances cause of the difficulty that some individuals have with lipreading, a of a shadower were completely randomized together for presentation low-uncertainty forced choice task was used. The subjects were first to the raters. The raters listened to the triads through headphones presented with two text words—a target and distractor—shown side and were asked to choose which of the words—the first or third— by side on the video monitor (e.g., cabbage, camel ). These words sounded more similar to the second. The raters were instructed to were presented for 2 sec. Immediately afterward, the subjects would press the key labeled “1” on the keyboard if the first word sounded see the face of the model silently saying 1 of the words (e.g., cab- more similar to the second or to press the key labeled “3” on the bage). The subjects’ task was to produce out loud into the micro- keyboard if the third word sounded more similar to the second. phone quickly but clearly the word that they had lipread. Again, they were never asked to imitate the model. Results and Discussion Each distractor word was chosen to be similar in initial segments to its paired target word (e.g., cabbage, camel ). Pilot tests showed Means were calculated for the number of shadowed ut- that this forced the subjects to pay attention to the articulated target terances chosen as sounding more like those of the model words but allowed the subjects to correctly lipread the target words for each rater and each subject. These individual means a majority of the time. for male and female subjects, for both the audio and video All shadowed utterances were audio recorded and later edited to shadow responses (averaged across words), are presented create video shadowed tokens for comparison purposes in Phase 3. Phase 3. 
In Phase 3, naive raters were asked to judge the similar- graphically in Figure 1. The overall mean proportion of ity between the models’ words and the subjects’ shadowed words the subjects’ shadowed tokens considered better imita- relative to that between the models’ words and the subjects’ baseline tions of the models’ tokens (than were the baseline read words. For these purposes, we used an AXB matching task, which is tokens) was .573 (SE .017) for audio shadowing and commonly used in speech alignment experiments (Goldinger, 1998; .564 (SE .015) for visual (lipread) shadowing. These Namy et al., 2002; Pardo, 2006; Shockley et al., 2004). Rating meth- proportions were compared with chance (.50) using ods were chosen over acoustical analysis for a number of reasons. First, rating methods provide a perceptually valid way of establish- t tests, which revealed that the subjects’ shadowed tokens ing similarities across stimuli and, thus, alignment across utterances were judged to be better imitations of the models’ tokens (Goldinger, 1998; Namy et al., 2002; Pardo, 2006; Shockley et al., than were the baseline tokens for both the audio shadowed 2004). In addition, the method avoids the difficulty in determining to words [t(31) 4.892, p .0001, Cohen’s d effect size which of the many possible acoustical dimensions subjects are align- .87] and the visually shadowed words [t(31) 3.704, p ing (Goldinger, 1998). Finally, the method has been used to evaluate .0008, Cohen’s d .66] (Thalheimer & Cook, 2002). A alignment in a majority of the studies in which the phenomenon was investigated (e.g., Goldinger, 1998; Namy et al., 2002; Pardo, 2006; paired samples t test revealed that there was no significant Shockley et al., 2004).1 difference in rater matching between the audio and visual Thirty-two naive raters (23 female) judged whether a subject’s shadowing tasks [t(31) 0.604, p .5500]. shadowed token was more similar to the model’s token than was the The effects of gender (between subjects; male vs. fe- subject’s baseline token. Two raters were assigned to judge the words male) and modality (within subjects; audio vs. video) produced by a given shadowing subject (16 shadowing subjects were evaluated (on the basis of values averaged across 2 raters each 32 raters) (e.g., Shockley et al., 2004). words and raters) using a factorial ANOVA. The results A separate AXB triad was created for each word that the subjects produced. Each triad included presentations of the same word (e.g., indicate a marginal main effect of gender [F(1,30)F cabbage) produced once by the model and twice by the subject. The 3.524, p .07], with the female subjects aligning more model’s spoken utterance always appeared as the middle (X) token. than the male subjects. Still, t tests conducted for the male The subjects’ shadowed tokens appeared either in the A (first) or B and female subjects revealed that the utterances for both (third) position and the subjects’ baseline (the read token) appeared gender groups were matched to their respective models at in the remaining A or B position. Each subject’s word, from each better-than-chance levels for both the audio and the video shadowing block (audio and visual) actually appeared in two triads: once when presented in the A position and once when presented in shadowing conditions ( p .05). No main effect of modal- the B position. 
This means that, in principle, raters would judge a ity was found [F(1,30) F 0.357, p .55], and there was total of 296 separate triads: 74 words 2 shadowing modalities no significant gender modality interaction [F(1,30) F (audio, video) 2 triad orderings (once with the shadow word in 0.328, p .57]. Finally, a paired t test of the effect of the A position, once with it in the B position). presentation block was conducted and revealed that on the However, the number of triads derived from each shadowing sub- basis of the AXB ratings, more alignment occurred during ject’s responses actually ranged between 256 and 284 (M 268). The reason for this was as follows. Although the subjects were gen- the second block than during the first (M ( .586, SE erally quite accurate at lipreading the model in the two-alternative .017, and M .551, SE .015, respectively) [t(31) forced choice task, all of the subjects lipread words wrong a few 2.47, p .019]. This is not surprising, because the same times during the session. The percentage correct on the lipreading 74 words were used in the two blocks. Past research has 1618 MILLER, SANCHEZ, AND ROSENBLUM .8 Proportion Alignment Judgments .7 .6 .5 Audio .4 Video .3 .2 .1 0 F1 F2 F3 F4 F5 F6 F7 F8 Female Shadowers .8 .7 Proportion Alignment Judgments .6 .5 Audio .4 Video .3 .2 .1 0 M1 M2 M3 M4 M5 M6 M7 M8 Male Shadowers Figure 1. Mean proportion of model’s words sounding more similar to subjects’ shadowed words than did baseline words for audio and visual shadowing conditions for female subjects (top panel) and male subjects (bottom panel). shown that the degree of alignment to a talker increases and visual shadowed conditions. This suggests that the with the number of repetitions of a word spoken by that two modalities provided a comparable amount of informa- talker (Goldinger, 1998). tion to drive speech alignment. These results reveal that on the basis of the auditory Although the results portrayed in Figure 1 suggest that judgments of naive raters, the subjects did align to the some subjects aligned more than others, the range of these words that they both heard and saw the models say. In fact, values is similar to those of other alignment studies (e.g., the values were statistically equivalent for the auditory Goldinger, 1998; Namy et al., 2002; Pardo, 2006; Shockley VISUAL SPEECH ALIGNMENT 1619 et al., 2004). Also, the effect sizes for both the audio and although gender may play an intricate role in alignment, the visual conditions were in the high-medium-to-large it could also be that other factors distinguishing our two range (Thalheimer & Cook, 2002). Thus, although the re- models (e.g., speech clarity, attractiveness, expression) sults show that the alignment for both the audio and the vi- drove these marginal effects. sual conditions was often subtle, it was statistically sound and comparable to that of other alignment research. EXPERIMENT 2 Although recent evidence has shown that visual speech provides indexical information (Kaufmann & Schwein- As was stated above, there is evidence in the literature berger, 2005; Schweinberger & Soukup, 1998; Sheffert that perceivers are sensitive to the articulatory-style in- & Fowler, 1995; Yakel, Rosenblum, & Fortier, 2000; and formation of a talker as it is reflected in both the auditory see also Sheffert & Olson, 2004), it was unknown whether and the visual modalities (see Nygaard, 2005, and Rosen- this visually specified information could unconsciously blum, 2005, for reviews). 
In fact, this information allows alter speech production responses. Prior research (e.g., perceivers to match heard speech to lipread speech on the Goldinger, 1998; Goldinger & Azuma, 2004; Namy et al., basis of talker identity (e.g., Kamachi, Hill, Lander, & 2002; Pardo, 2006; Shockley et al., 2004) has shown that Vatikiotis-Bateson, 2003; Lachs & Pisoni, 2004a, 2004b, auditory speech has this potency in both conversational 2004c; Rosenblum et al., 2006). This suggests that speak- and shadowing contexts. In showing that lipread shadowed ing style can be perceived across modalities. Furthermore, words are rated as auditorily similar to those of the model, the results of Experiment 1 show that raters are sensitive the present results provide evidence that visually specified to the similarity in models’ and shadowers’ utterances indexical talker information can modulate speech produc- whether the shadowing is based on audio or visual infor- tion responses. mation of the model. This suggests that speaking style These results go beyond those reported by Gentilucci can, to some degree, be perceived across talkers. and Bernardis (2007) by showing that visual speech align- If speaking style can be perceived across modalities ment can occur with spoken words. Furthermore, the pres- and across talkers, an interesting prediction arises. Rat- ent results show that both female and male subjects align ers should be able to match aligned utterances across a to visible speech and do so even when they are simply model and shadower when each utterance is presented in instructed to say out loud, rather than to repeat, the utter- a different modality. Put differently, if shadowers are tak- ances that they perceive. Finally, Gentilucci and Bernardis ing on some of the articulatory style of the models, and evaluated alignment using acoustic and kinematic mea- articulatory style can be perceived across modalities, then surements of shadowed responses; the present results add observers should also be able to match a shadower’s voice evidence that visually induced alignment is robust enough to the visible articulating face of the model that had been to be perceived by naive raters in a matching task. shadowed. The results also revealed a marginal main effect of gen- In Experiment 2, we tested this prediction using the der. Research findings on the impact of gender on speech audio and video recordings obtained in Experiment 1. Rat- shadowing have been inconsistent (Namy et al., 2002; ers were asked to make cross-modal AXB matches. The Pardo, 2006). Using a shadowing paradigm, Namy et al. raters were presented AXB trials on which a shadower’s compared the alignment of male and female subjects shad- utterances (the A and B positions) were presented audito- owing models of the same or of a different gender. The re- rily, whereas the model’s words (X) were presented visu- searchers found that female shadowers tended to align more ally without sound. Thus, the raters were asked to match than male shadowers, although the shadowers, in general, the similarity of utterances across two talkers (a model and tended to align more to the male models. This difference in a shadower) and two modalities (auditory and visual). alignment was attributed to gender differences in percep- In this sense, the raters of Experiment 2 were actually tual sensitivity. In other words, women may attend better the subjects whose perceptual sensitivity was tested. If the to talker-specific properties than men. 
This interpretation information for talker alignment can be conveyed across is consistent with the results of Experiment 1. modalities, these subjects should be able to match a mod- However, Pardo (2006) found results that suggested el’s silent video token to the shadower’s audio token (vs. that male talkers aligned more than female talkers. This baseline) at better-than-chance levels. difference may stem in part from differences in the ex- In addition, we incorporated the shadowed responses perimental design. Rather than using shadowing to assess derived from both the audio and visual shadowing condi- alignment, Pardo (2006) opted to use an interactive map tions of Experiment 1. In this sense, in Experiment 2, we task to induce alignment in the context of live conversa- tested a modified replication of Experiment 1 by exam- tion. Pardo (2006) attributes her observed gender effects ining whether matches (in this case cross-modal) can be in alignment to attentional differences with the task, rather made between a model’s and a shadower’s utterances when than to differences in perceptual sensitivity, as such. that shadow is based on lipread or auditory information. Future research can examine why women aligned mar- ginally more than men in the present experiment. Because Method subject (shadower) gender was matched to model gender Participants in the present experiment (following Shockley et al., The graduate student models and undergraduate shadowers were 2004), the degree to which the subjects’ versus the mod- the same as those used in Experiment 1. Thirty-two new undergradu- els’ gender played a role in these effects is uncertain. Also, ates (23 female) acted as subjects in a modified AXB matching task. 1620 MILLER, SANCHEZ, AND ROSENBLUM These undergraduate subjects participated in order to partially fulfill matched the female shadowers’ utterances to those of the a course requirement. None had participated in Experiment 1. model more often than they did the male shadowers’ utter- ances when these shadowed utterances were based on the Materials and Apparatus video stimuli. It is unclear why this interaction occurred, All materials and apparati were the same as those in Experiment 1. However, in this experiment, the models’ silent video utterances re- but the fact that the subjects in this experiment found that corded in Phase 1 of Experiment 1 were used for comparison with the female shadowers’ shadowed utterances more often the shadowers’ baseline and shadowed tokens recorded (auditorily) matched those of the model (when shadowing a video in Phase 2. All three types of shadowers’ utterances (from Experi- utterance) is consistent with the marginal gender effects ment 1) were used in Experiment 2: baseline productions, shadows reported in Experiment 1. of the model’s audio tokens, and shadows of the model’s video to- The results portrayed in Figure 2 suggest that, again, kens. Shadowed modality (audio vs. video) from Experiment 1 was therefore considered a factor in Experiment 2. some subjects aligned more than others. Still, the range of these values is comparable to those of other alignment Procedure studies (e.g., Goldinger, 1998; Namy et al., 2002; Pardo, The subjects judged whether a shadower’s shadowed tokens were 2006; Shockley et al., 2004). The effect sizes for both the more similar to the model’s silent video tokens than were the shad- audio and the visual conditions were in the medium range ower’s baseline tokens. 
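The AXB sequences in both experiments follow the construction laid out in Phase 3 of Experiment 1: the model's token always occupies the X position, the subject's shadowed and baseline tokens are counterbalanced across the A and B positions, and any word a subject lipread incorrectly is dropped from all of that subject's lists. A minimal sketch of that bookkeeping appears below; the token labels and misread items are hypothetical placeholders, not the actual recordings.

```python
# Minimal sketch of the AXB triad construction described in Phase 3 of
# Experiment 1 (and adapted cross-modally in Experiment 2).

from itertools import product

def build_triads(words, misread, modalities=("audio", "video")):
    """Return AXB triads: X is always the model's token; the subject's
    shadowed and baseline tokens fill A and B in both orders. Words the
    subject lipread incorrectly are dropped entirely."""
    usable = [w for w in words if w not in misread]
    triads = []
    for word, modality in product(usable, modalities):
        model = f"model_{word}"                 # X token
        shadow = f"{modality}_shadow_{word}"    # subject's shadowed token
        baseline = f"baseline_{word}"           # subject's read (baseline) token
        triads.append((shadow, model, baseline))   # shadowed token in A
        triads.append((baseline, model, shadow))   # shadowed token in B
    return triads

words = [f"word{i:02d}" for i in range(74)]     # stand-in for the 74-item list
triads = build_triads(words, misread={"word05", "word31"})
print(len(triads))   # 74 x 2 modalities x 2 orders = 296 if none misread;
                     # here (74 - 2) x 2 x 2 = 288
```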
A separate AXB sequence was created for (Thalheimer & Cook, 2002). each shadower from Experiment 1 and was presented to 2 subjects (of Experiment 2). For each triad, a model’s silent video token always The results of Experiment 2 show that when subjects appeared in the X position. The shadower’s shadowed and baseline are asked to match shadowed utterances to a model’s ut- audio tokens appeared in the A and B positions, which were counter- terances, they can do so across modalities at better-than- balanced to create two orders for each triad. As was stated above, the chance levels. On each trial, the subjects in Experiment 2 shadowed tokens were taken from Phase 2 of Experiment 1, in which were presented with an audio utterance from a shadower, the subjects were asked to shadow both the audio (heard) and video a silent video utterance from a model, and then another (lipread) tokens of the model. This means that in principle, the full audio utterance from the shadower. One of the shadower’s matching sequence would consist of 296 tokens: 74 words 2 shad- owing modalities (audio and video shadows from Experiment 1) utterances was produced when the shadower simply read 2 AXB orderings. However, again, the total number of triads differed the word (baseline), whereas the other was produced when between sequences because of incorrect lipread responses for each the shadower shadowed the model. The subjects in Ex- shadower in Experiment 1 (see above). periment 2 were able to determine, at better-than-chance The AXB triads were presented auditorily over Sony MDR-V6 levels, which of the shadower’s utterances were produced headphones. The video tokens were presented on a 20-in. monitor when shadowing the model. In this sense, the subjects 3 ft in front of the subjects. These tokens did not include sound. The subjects were asked to choose which of the utterances—the first or were able to detect speech alignment both across talkers the third—was more similar to the video utterance presented as the and across modalities. This suggests that the indexical second. The subjects were instructed to press the key labeled “1” on characteristics that are passed from one talker to another the keyboard if the first word was more similar to the second or to are perceptible across auditory and visual information. press the key labeled “3” on the keyboard if the third word was more The implications of this finding will be addressed in the similar to the second. General Discussion section. The results of Experiment 2 also showed that these Results and Discussion matches could be made at better-than-chance levels when The individual means for the male and female subjects, the shadowers of Experiment 1 shadowed either the visual for both the audio and the video shadowed responses, are or the auditory speech of the model. This finding is con- presented graphically in Figure 2. The overall mean pro- sistent with the results of Experiment 1 in again showing portions of shadowers’ shadowed tokens considered bet- that speech alignment can be induced by either visual or ter imitations of the models’ video tokens (than were the auditory speech information. baseline read tokens) were .538 (SE .013) for audio shadows and .559 (SE .016) for visual (lipread) shad- GENERAL DISCUSSION ows. 
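The chance-level analysis applied to such proportions in both experiments (a one-sample t test against .50, with Cohen's d expressing the distance from chance) can be sketched as follows. The per-rater proportions below are invented for illustration and are not the reported data.

```python
# Illustrative sketch of the chance comparison applied to rater proportions:
# a one-sample t test against .50 plus a one-sample Cohen's d.
# The proportions below are made-up example values, not the reported results.

import numpy as np
from scipy import stats

proportions = np.array([0.52, 0.61, 0.55, 0.58, 0.49, 0.63, 0.57, 0.54])
chance = 0.50

t_stat, p_value = stats.ttest_1samp(proportions, popmean=chance)
cohens_d = (proportions.mean() - chance) / proportions.std(ddof=1)

print(f"t({len(proportions) - 1}) = {t_stat:.3f}, p = {p_value:.4f}, d = {cohens_d:.2f}")
```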
A comparison to chance (.50) revealed that the sub- jects judged the shadowers’ shadowed tokens to be better The experiments reveal that shadowers align to a mod- imitations of the models’ video tokens for audio shadowed el’s spoken words whether those words are presented au- words [t(31) 3.008, p .01, Cohen’s d .535] and ditorily or visually. Although the results suggest that this for visually shadowed words [t(31) 3.658, p .001, alignment can be subtle, it seems comparable to that ob- Cohen’s d .648]. Again, a paired samples t test revealed served in previous alignment research (e.g., Namy et al., that there was no significant difference in these judgments 2002; Pardo, 2006; Shockley et al., 2004). The present between words shadowed auditorily in Experiment 1 and research also shows that this alignment between shadow- those shadowed visually [t(31) 1.383, p .177]. ers and models is perceivable across auditory tokens (Ex- An ANOVA on the factors of shadower gender and periment 1, Phase 3), as well as across auditory and visual shadowed modality (on the basis of values averaged across tokens (Experiment 2). words and raters) did not reveal a main effect of gender The finding of auditory alignment is consistent with past [F(1,30) F 2.485, p .05] or modality [F(1,30) F 2.229, research showing alignment to auditory speech both during p .05]. However, there was a significant gender mo- live conversation (Pardo, 2006) and when shadowing iso- dality interaction [F(1,30) F 6.143, p .019]. Pairwise lated tokens (Goldinger, 1998; Namy et al., 2002; Shockley comparisons revealed that the subjects of Experiment 2 et al., 2004). Indeed, our auditory results closely replicate VISUAL SPEECH ALIGNMENT 1621 .8 Proportion Alignment Judgments .7 .6 .5 Audio .4 Video .3 .2 .1 0 F1 F2 F3 F4 F5 F6 F7 F8 Female Shadowers .8 .7 Proportion Alignment Judgments .6 .5 Audio .4 Video .3 .2 .1 0 M1 M2 M3 M4 M5 M6 M7 M8 Male Shadowers Figure 2. Mean proportion of model’s visible words rated as more similar to subjects’ shadowed words than were baseline words for audio and visual shadowing conditions for female subjects (top panel) and male subjects (bottom panel). the findings of Shockley et al. (2004, Experiment 1), on alignment occurs for multiple wordd stimuli, instead of for which the method of the present study was partly based. bisyllabic, nonsense stimuli, and occurs to the degree that With regard to visual speech, our findings that shad- it is perceivable by naive raters, not simply by measures owers align to visually presented stimuli are consistent of lip kinematics and acoustics. In this sense, the present with the initial report of Gentilucci and Bernardis (2007). findings show that alignment to visual speech can work in As was noted above, however, our findings reach further a methodological context comparable to that used for most than those researchers’ work in showing that visual speech auditory alignment demonstrations. 1622 MILLER, SANCHEZ, AND ROSENBLUM The results of Experiment 2 show that in a variation of media. It could be that detection of amodal idiolectical the AXB task, matchers are perceptually sensitive to shad- properties also provided the basis for the cross-modal owers’ aligned speech, despite this speech’s being pre- matching in our Experiment 2. sented to them in different modalities. 
The results comple- If shadowers align to properties of a talker’s amodal ment those of Experiment 1, wherein raters judged both idiolect, it still must be determined which of these prop- auditory and visual speech alignment to occur when the erties are most salient. These properties could range in tokens were compared auditorily. The ability to perceive complexity from simple articulatory rate to more nuanced alignment across modalities suggests that the indexical coarticulatory style. If the imitated dimensions are simi- information carried across aligning talkers is available lar to those found salient for cross-modal matching, it is cross-modally. The conceptual implications of our find- unlikely that the shadowers are imitating simple duration ings will be discussed in the following sections. (e.g., Lachs & Pisoni, 2004b). Future research can be designed to determine which amodal and/or modality- Informational Basis of Alignment specific properties shadowers imitate. Finding evidence for visual, as well as auditory, speech alignment poses interesting questions about the informa- Indexical Influences on Word Perception tional dimensions to which talkers align. Although the AXB The present findings may also have implications for matching method allows for confirmation of the percep- theories of word recognition. As was stated above, Gold- tual salience of the aligned information between model and inger (1998) interpreted his auditory alignment findings shadower, the method cannot easily determine the informa- as supporting an episodic lexicon. Goldinger’s theory tion to which talkers align. Still, speculation is warranted. proposes that episodic traces of heard words are present Previous results in the auditory alignment literature have and accessible in lexical memory. Alignment is thought to suggested that shadowers imitate models’ produced acous- emerge as a byproduct of responding to a particular talker tic dimensions, including intonational contour, acoustic whose indexical information contributes most recently to vowel space, and VOT (e.g., Gentilucci & Bernardis, 2007; these episodes. Goldinger, 1998; Pardo, 2004; Shockley et al., 2004). In finding evidence for alignment in shadowed re- With regard to visual speech alignment, it is unclear sponses to visible speech, the present results suggest a which talker-specific visible attributes shadowers might broadening of the form of episodic traces. Assuming that imitate. Gentilucci and Bernardis (2007) measured lip ki- alignment phenomena reflect the nature of the episodic nematics as their female subjects shadowed visible />?>/ lexicon, the results indicate that episodic traces retain not bisyllables produced by male and female models. These only auditory, but also visible indexical dimensions. In researchers reported that the subjects produced greater lip fact, broadening the traces to include visual speech dimen- excursions when shadowing male than when shadowing sions could allow Goldinger’s (1998) theory to explain the female model syllables, thereby aligning toward the ex- talker-facilitation effects observed for visual speech per- tent of lip excursions produced by the models themselves ception. These effects show that visual familiarity with a (which were also measured). 
Although our own subjects speaking face can facilitate visible vowel identification may have also aligned to the model’s lip excursions to (Kaufmann & Schweinberger, 2005; Schweinberger & some degree, it is unclear whether this dimension, con- Soukup, 1998), word lipreading (Lander & Davies, 2008), sidered by Gentilucci and Bernardis to distinguish talker sentence lipreading from single- versus multiple-talker gender, would be sufficient to induce the talker alignment lists (Yakel et al., 2000), and—most germane to Gold- observed in Experiment 1. It could very well be that other inger’s theory—memory for lipread words (Sheffert & aspects of a model’s visible articulatory movements also Fowler, 1995; see also Sheffert & Olson, 2004). induced the alignment observed in Experiment 1. However, following from the discussion in the preceding Note that, unsurprisingly, the imitated dimensions for section, it could be that Goldinger’s (1998) theory would auditory alignment have been considered acoustic in na- be best served by considering the traces as composed of ture and, for visual alignment, optic in nature. However, amodal idiolectic information. As was stated above, this the results of Experiment 2 suggest an alternative formula- could account for the results of the cross-modal and cross- tion. In that subjects could match aligned utterances across talker alignment results observed in Experiment 2. More- modalities, the imitated information likely includes some over, considering the retained talker information as amodal dimensions that are instantiated in both the visible and the could explain results recently reported in our laboratory audible streams. In supporting cross-modal matches, this (Rosenblum, Miller, & Sanchez, 2007). We observed ev- information might best be construed as amodal or modal- idence that familiarity with a talker gained through one ity neutral. In fact, the notion of amodal talker-specific in- modality can carry over to facilitate speech in another formation has been used to explain the cross-modal talker modality. In this experiment, subjects were asked to lip- matching findings described earlier (Kamachi et al., 2003; read sentences from a single talker for 1 h. Afterward, they Lachs & Pisoni, 2004a, 2004b, 2004c; Rosenblum et al., were asked to listen to auditory speech-in-noise sentences 2006). The authors of those reports suggest that cross- produced by a talker who was either the same as or differ- modal matching could be based on the extraction of com- ent from the talker that they had just lipread. The subjects mon idiolectic information available across modalities. who listened to the same talker that they had just lipread Idiolect is, after all, an amodal articulatory property that were better able to recover the speech-in-noise sentences can potentially structure both the acoustic and the visual than were the subjects who lipread one talker and heard a VISUAL SPEECH ALIGNMENT 1623 different talker. We interpret this finding as evidence that As was argued above, this finding, along with findings amodal idiolectic information is extracted from the visual reported on cross-modal talker recognition, call for a con- speech signal of a talker and is then used to facilitate audi- sideration of the relevant indexical information as amodal. tory speech recovery from that talker. 
If episodic traces To the degree that this amodal information takes a gestural also contain amodal idiolectic information, rather than form, these results are also consistent with the common simply auditory details, it could explain these cross-modal currency proposal. talker facilitation results, as well as the results presented in Before we conclude, an important caveat must be ac- the present report. In fact, Goldinger himself entertains the knowledged. The auditory alignment literature has re- possibility that the episodes might take a gestural, rather vealed that although these phenomena can appear rapid, than simply auditory, form (Goldinger, 1998). unconscious, and inadvertent, it would be wrong to consider alignment as a reflexive, direct, or automatic Perceptual Regulation of phenomenon (see Pardo & Remez, 2006, for a review). Speech Production Responses Intervening variables such as interlocutor gender and role, Speech alignment effects have also been interpreted as as well as the lexical frequency and presentation repeti- demonstrating perceptual regulation of produced speech tions of word stimuli, strongly affect auditory alignment based on the input from an interlocutor (Fowler, 2004; phenomena (Goldinger, 1998; Pardo, 2004, 2006; Pardo Pardo, 2006; Pardo & Remez, 2006). As was stated & Remez, 2006). It is likely that these same factors will above, alignment research has shown that auditory speech bear on visual speech alignment, as was hinted by the information can influence a talker’s rate, accent, and into- marginal gender effects found in Experiment 1, as well as national contour (e.g., Giles et al., 1991; Gregory, 1990; the intersubject variability observed in both experiments Natale, 1975; Sancier & Fowler, 1997), as well as pho- (see Figures 1 and 2). In fact, it is easy to imagine that netic dimensions (Shockley et al., 2004). These alignment the visual information for an interlocutor could bear even phenomena show that the perceptual effects on the self- more strongly on the social aspects of alignment. Future regulation of produced speech can be impressively fast. research can be designed to test for this possibility, as well In fact, there is evidence that speech production responses as to further examine the claims of the amodal and com- to perceived speech can be disproportionately faster than mon currency theses. speech responses to nonspeech stimuli (Fowler, Brown, AUTHOR NOTE Sabadini, & Weihing, 2003; Kozhevnikov & Chistovich, 1965; Porter & Castellanos, 1980; Porter & Lubker, 1980). This research was supported by NIDCD Grant 1R01DC008957-01. Similarly, the reaction time differences between simple The authors thank three anonymous reviewers and editor Lynne Nygaard single and choice response types are especially small for for helpful comments on an earlier version of this article. Correspon- dence concerning this article should be addressed to L. D. Rosenblum, shadowed speech (Fowler et al., 2003; Porter & Castel- Department of Psychology, University of California, 900 University lanos, 1980; Porter & Lubker, 1980). Ave., Riverside, CA 92521 (e-mail:

[email protected]

). These reaction time findings have been interpreted as evidence for an exceptionally close connection between the speech production and perception functions and that a common currency is shared between the functions (Fowler, 2004; Fowler et al., 2003; Sancier & Fowler, 1997; Shockley et al., 2004). Fowler and her colleagues argued that this common currency takes the form of articulatory gestures that are both perceived and produced, and they cited alignment phenomena as supporting this claim (e.g., Fowler, 2004). These researchers further suggested that the common currency thesis is consistent with neurophysiological evidence for mirror neuron-type functions in human speech perception (e.g., Fadiga, Craighero, Buccino, & Rizzolatti, 2002; Sundara, Namasivayam, & Chen, 2001).

The present visual speech alignment results seem consistent with the common currency thesis (see also Kerzel & Bekkering, 2000). For the currency to be truly common between production and perception, the primitives for perception would need to be gestural, not auditory, in nature. In showing that these gestures, visually conveyed, can modulate a production response, the present results demonstrate that the primitives need not be auditory. In fact, there is evidence that, as for auditory speech, visual speech stimuli can induce mirror-type activity in articulatory musculature (Sundara et al., 2001).

Furthermore, the results of Experiment 2 show evidence for the presence of cross-modal alignment information.

REFERENCES

Arnold, P., & Hill, F. (2001). Bisensory augmentation: A speechreading advantage when speech is clearly audible and intact. British Journal of Psychology, 92, 339-355.
Calvert, G. A., Bullmore, E., Brammer, M. J., Campbell, R., Iversen, S. D., Woodruff, P., et al. (1997). Silent lipreading activates the auditory cortex. Science, 276, 593-596.
Chartrand, T. L., & Bargh, J. A. (1999). The chameleon effect: The perception–behavior link and social interaction. Journal of Personality & Social Psychology, 76, 893-910.
Davis, C., & Kim, J. (2001). Repeating and remembering foreign language words: Implications for language teaching systems. Artificial Intelligence Review, 16, 37-47.
Fadiga, L., Craighero, L., Buccino, G., & Rizzolatti, G. (2002). Speech listening specifically modulates the excitability of tongue muscles: A TMS study. European Journal of Neuroscience, 15, 399-402.
Fowler, C. A. (2004). Speech as a supermodal or amodal phenomenon. In G. A. Calvert, C. Spence, & B. E. Stein (Eds.), The handbook of multisensory processing (pp. 189-201). Cambridge, MA: MIT Press.
Fowler, C. A., Brown, J. M., Sabadini, L., & Weihing, J. (2003). Rapid access to speech gestures in perception: Evidence from choice and simple response time tasks. Journal of Memory & Language, 49, 396-413.
Gentilucci, M., & Bernardis, P. (2007). Imitation during phoneme production. Neuropsychologia, 45, 608-615.
Giles, H., Coupland, N., & Coupland, J. (1991). Accommodation theory: Communication, context, and consequences. In H. Giles, N. Coupland, & J. Coupland (Eds.), Contexts of accommodation: Developments in applied sociolinguistics (pp. 1-68). Cambridge: Cambridge University Press.
Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105, 251-279.
Goldinger, S. D., & Azuma, T. (2004). Episodic memory reflected in printed word naming. Psychonomic Bulletin & Review, 11, 716-722.
Grant, K. W., & Seitz, P. F. (2000). The use of visible speech cues for improving auditory detection of spoken sentences. Journal of the Acoustical Society of America, 108, 1197-1208.
Gregory, S. W. (1990). Analysis of fundamental frequency reveals covariation in interview partners' speech. Journal of Nonverbal Behavior, 14, 237-251.
Kamachi, M., Hill, H., Lander, K., & Vatikiotis-Bateson, E. (2003). "Putting the face to the voice": Matching identity across modality. Current Biology, 13, 1709-1714.
Kaufmann, J. M., & Schweinberger, S. R. (2005). Speaker variations influence speechreading speed for dynamic faces. Perception, 34, 595-610.
Kerzel, D., & Bekkering, H. (2000). Motor activation from visible speech: Evidence from stimulus–response compatibility. Journal of Experimental Psychology: Human Perception & Performance, 26, 634-647.
Kozhevnikov, V., & Chistovich, L. (1965). Speech: Articulation and perception (JPRS Publication 50, 543). Washington, DC: Joint Publications Research Service.
Kučera, H., & Francis, W. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
Lachs, L., & Pisoni, D. B. (2004a). Crossmodal source identification in speech perception. Ecological Psychology, 16, 159-187.
Lachs, L., & Pisoni, D. B. (2004b). Cross-modal source information and spoken word recognition. Journal of Experimental Psychology: Human Perception & Performance, 30, 378-396.
Lachs, L., & Pisoni, D. B. (2004c). Specification of cross-modal source information in isolated kinematic displays of speech. Journal of the Acoustical Society of America, 116, 507-518.
Lander, K., & Davies, R. (2008). Does face familiarity influence speechreadability? Quarterly Journal of Experimental Psychology, 61, 961-967.
MacSweeney, M., Amaro, E., Calvert, G. A., Campbell, R., David, A. S., McGuire, P., et al. (2000). Silent speechreading in the absence of scanner noise: An event-related fMRI study. NeuroReport, 11, 1729-1733.
MacSweeney, M., Calvert, G. A., Campbell, R., McGuire, P. K., David, A. S., Williams, S. C. R., et al. (2002). Speechreading circuits in people born deaf. Neuropsychologia, 40, 801-807.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.
Meltzoff, A. N., & Moore, M. K. (1997). Explaining facial imitation: A theoretical model. Early Development & Parenting, 6, 179-192.
Mills, A. E. (1987). The development of phonology in the blind child. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lip-reading (pp. 145-162). Hillsdale, NJ: Erlbaum.
Nakamura, M., Iwano, K., & Furui, S. (2008). Differences between acoustic characteristics of spontaneous and read speech and their effects on recognition performance. Computer Speech & Language, 22, 171-184.
Namy, L. L., Nygaard, L. C., & Sauerteig, D. (2002). Gender differences in vocal accommodation: The role of perception. Journal of Language & Social Psychology, 21, 422-432.
Natale, M. (1975). Convergence of mean vocal intensity in dyadic communication as a function of social desirability. Journal of Personality & Social Psychology, 32, 790-804.
Navarra, J., & Soto-Faraco, S. (2007). Hearing lips in a second language: Visual articulatory information enables the perception of L2 sounds. Psychological Research, 71, 4-12.
Nygaard, L. C. (2005). The integration of linguistic and non-linguistic properties of speech. In D. Pisoni & R. Remez (Eds.), Handbook of speech perception (pp. 390-414). Malden, MA: Blackwell.
Pardo, J. S. (2004). Acoustic–phonetic convergence among interacting talkers. Journal of the Acoustical Society of America, 115, 2608.
Pardo, J. S. (2006). On phonetic convergence during conversational interaction. Journal of the Acoustical Society of America, 119, 2382-2393.
Pardo, J. S., & Remez, R. E. (2006). The perception of speech. In M. Traxler & M. A. Gernsbacher (Eds.), The handbook of psycholinguistics (2nd ed., pp. 201-248). New York: Academic Press.
Porter, R. J., Jr., & Castellanos, F. X. (1980). Speech production measures of speech perception: Rapid shadowing of VCV syllables. Journal of the Acoustical Society of America, 67, 1349-1356.
Porter, R. J., Jr., & Lubker, J. F. (1980). Rapid reproduction of vowel–vowel sequences: Evidence for a fast and direct acoustic–motoric linkage. Journal of Speech & Hearing Research, 23, 593-602.
Reisberg, D., McLean, J., & Goldfield, A. (1987). Easy to hear but hard to understand: A lip-reading advantage with intact auditory stimuli. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lip-reading (pp. 97-114). Hillsdale, NJ: Erlbaum.
Rosenblum, L. D. (2005). The primacy of multimodal speech perception. In D. Pisoni & R. Remez (Eds.), Handbook of speech perception (pp. 51-78). Malden, MA: Blackwell.
Rosenblum, L. D., Miller, R. M., & Sanchez, K. (2007). Lip-read me now, hear me better later: Cross-modal transfer of talker-familiarity effects. Psychological Science, 18, 392-396.
Rosenblum, L. D., Niehus, R. P., & Smith, N. M. (2007). Look who's talking: Recognizing friends from visible articulation. Perception, 36, 157-159.
Rosenblum, L. D., Smith, N. M., Nichols, S. M., Hale, S., & Lee, J. (2006). Hearing a face: Cross-modal speaker matching using isolated visible speech. Perception & Psychophysics, 68, 84-93.
Rosenblum, L. D., Yakel, D. A., Baseer, N., Panchal, A., Nordarse, B. C., & Niehus, R. P. (2002). Visual speech information for face recognition. Perception & Psychophysics, 64, 220-229.
Sancier, M. L., & Fowler, C. A. (1997). Gestural drift in a bilingual speaker of Brazilian Portuguese and English. Journal of Phonetics, 25, 421-436.
Schweinberger, S. R., & Soukup, G. R. (1998). Asymmetric relationships among perceptions of facial identity, emotion, and facial speech. Journal of Experimental Psychology: Human Perception & Performance, 24, 1748-1765.
Sheffert, S. M., & Fowler, C. A. (1995). The effects of voice and visible speaker change on memory for spoken words. Journal of Memory & Language, 34, 665-685.
Sheffert, S. M., & Olson, E. (2004). Audiovisual speech facilitates voice learning. Perception & Psychophysics, 66, 352-362.
Shockley, K., Sabadini, L., & Fowler, C. A. (2004). Imitation in shadowing words. Perception & Psychophysics, 66, 422-429.
Shockley, K., Santana, M. V., & Fowler, C. A. (2003). Mutual interpersonal postural constraints are involved in cooperative conversation. Journal of Experimental Psychology: Human Perception & Performance, 29, 326-332.
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212-215.
Sundara, M., Namasivayam, A. K., & Chen, R. (2001). Observation–execution matching system for speech: A magnetic stimulation study. NeuroReport, 12, 1341-1344.
Thalheimer, W., & Cook, S. (2002, August). How to calculate effect sizes from published research articles: A simplified methodology. Retrieved November 31, 2002, from http://work-learning.com/effect_sizes.htm
Yakel, D. A., Rosenblum, L. D., & Fortier, M. A. (2000). Effects of talker variability on speechreading. Perception & Psychophysics, 62, 1405-1412.

NOTE

1. In theory, there are some possible shortcomings of the AXB rating measure used here and in the extant speech alignment research. For example, in each AXB triad, two of the utterances are based on read speech (the model's token and the subject's baseline), whereas the third is based on shadowed speech (the subject's shadowed token). It is known that there are audible differences between read and shadowed speech (Nakamura, Iwano, & Furui, 2008), and these differences could influence the raters' matches. Note, however, that if the raters made matches on the basis of the similarity of the two read tokens, they would more often match the model's tokens to the subjects' baseline tokens, since both are read. This is an outcome opposite to that hypothesized and typically observed in the alignment literature.
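As an illustration of the scoring logic discussed in this note, the minimal sketch below tallies a set of hypothetical AXB responses and compares the proportion of shadowed-token choices with the .50 chance level. The response data and variable names are illustrative assumptions only; they are not materials, procedures, or results from the present experiments.

```python
# Illustrative sketch only (hypothetical data; not taken from the article).
# In each AXB triad, the model's token (X) is flanked by a subject's shadowed
# token and baseline (read) token; the rater chooses the flank that sounds
# more similar to X. Perceived alignment is inferred when the proportion of
# "shadowed" choices reliably exceeds the .50 chance level.
from collections import Counter

# Hypothetical rater responses, one entry per AXB trial.
responses = ["shadowed", "shadowed", "baseline", "shadowed",
             "shadowed", "baseline", "shadowed", "shadowed"]

counts = Counter(responses)
p_shadowed = counts["shadowed"] / len(responses)
print(f"Proportion of shadowed-token choices: {p_shadowed:.2f} (chance = .50)")

# The concern raised in this note: if raters instead matched on read vs.
# shadowed speaking style, baseline (read) tokens would be chosen more often,
# pushing this proportion below .50 -- the opposite of the alignment pattern.
```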
APPENDIX
English Bisyllable Words

/k/          Frequency (per million)    /p/          Frequency (per million)    /t/          Frequency (per million)
cabbage       4                         package      20                         tailor        2
cable         7                         panther       1                         tamper        1
camel         1                         pardon        8                         target       45
campus       33                         parrot        1                         taxi         16
canyon       12                         partner      32                         teaspoon      4
capture      17                         passion      28                         temper       12
carpet       13                         patience     22                         temple       38
cartridge     6                         payment      53                         tender       11
castle        7                         pedal         4                         tennis       15
cocoa         2                         pencil       34                         terrace       9
combat       27                         penny        25                         ticket       16
comet         2                         perfect      58                         tidy          1
compass      13                         pester        1                         tiger         7
concert      39                         pigeon        3                         timber       19
contact      63                         pillow        8                         timing       11
contest      26                         pizza         3                         token        10
copper       13                         poison       10                         tonic         1
cottage      19                         poker         6                         topic         9
courage      32                         poodle        2                         towel         6
culture      58                         poster        4                         tuba          1
curtain      13                         posture      13                         tulip         4
cushion       8                         punish        3                         tumble        3
custom       14                         puppy         2                         tunnel       10
kennel        3                         puzzle       10                         turkey        9
kitten        5                                                                 turtle        8
M            17.5                                    14.6                                    10.7

Note—Adapted from "Imitation in Shadowing Words," by K. Shockley, L. Sabadini, and C. A. Fowler, 2004, Perception & Psychophysics, 66, p. 428. Copyright 2004 by the Psychonomic Society, Inc. Word frequencies are based on Kučera and Francis (1967).

(Manuscript received October 28, 2008; revision accepted for publication April 4, 2010.)