Session: Audio Interaction CHI 2014, One of a CHInd, Toronto, ON, Canada The BoomRoom: Mid-air Direct Interaction with Virtual Sound Sources ¨ J¨org Muller 1,2 Matthias Geier3 Christina Dicke2 Sascha Spors3 1 Alexander von Humboldt Institute for Internet and Society, Berlin, Germany 2 Quality and Usability Lab, Telekom Innovation Laboratories, TU Berlin, Germany 3 Institut f¨ur Nachrichtentechnik, Universit¨at Rostock, Germany Figure 1. The BoomRoom allows to “touch”, grab and manipulate sounds in mid-air. Further, real objects can seem to emit sound (a), even when being moved (b). Sounds can be picked up (c) and placed in mid-air (d). We use real world objects to augment the auditory feedback. For example, by using a bowl as filter object (e). Finally, sounds can be dropped into objects to be found more quickly (f). Sounds can be heard anywhere in the room, and appear to originate from the location of the virtual sound source regardless of the listener’s position. ABSTRACT INTRODUCTION In this paper we present a system that allows to “touch”, grab As a hobby DJ, Marc has a BoomRoom installed in his living and manipulate sounds in mid-air. Further, arbitrary objects room. He invites his friend Laura, who is an amateur mu- can seem to emit sound. We use spatial sound reproduction sician, for a jam session. Laura brings a few loops she has for sound rendering and computer vision for tracking. Us- recorded with different instruments and uploads them into the ing our approach, sounds can be heard from anywhere in the system. Each instrument gets captured in a bottle on the ta- room and always appear to originate from the same (pos- ble (Figure 1 a), so that Marc can pick up bottles and listen sibly moving) position, regardless of the listener’s position. to them (b). He finds a sound that he likes and takes it out We demonstrate that direct “touch” interaction with sound of the bottle (c). Meanwhile, Laura has taken a sound she is an interesting alternative to indirect interaction mediated particularly likes out of another bottle and hands it to Marc. through controllers or visual interfaces. We show that sound Marc drops his sound in mid-air for later use and picks up localization is surprisingly accurate (11.5 cm), even in the the sound from Laura (d). He likes the sound, but explains to presence of distractors. We propose to leverage the ventrilo- Laura that with a little bit of effect it could be even cooler. He quist effect to further increase localization accuracy. Finally, walks over to his effect bowl, holds the sound over the bowl we demonstrate how affordances of real objects can create and stretches it with the other hand to distort it (e). They take synergies of auditory and visual feedback. As an application a few of the other sounds and choose different variants, vol- of the system, we built a spatial music mixing room. umes, filters, etc. They place some sounds in mid-air, while they drop others into bottles (f) to create an interesting and engaging soundscape. They will continue to play with this ACM Classification Keywords soundscape at the party they are giving later that night. H.5.m. Information Interfaces and Presentation (e.g. HCI): Miscellaneous In this paper we present the BoomRoom. The BoomRoom allows for direct manipulation of virtual sound sources hov- ering in mid-air. It also enables ordinary objects or body Author Keywords parts to appear to emit sounds. To accomplish this, Boom- Mid-air; Spatial Sound Reproduction; Gestural Interaction. Room uses a combination of spatial sound reproduction, in our case Wave Field Synthesis (WFS), and optical tracking. Loudspeakers and cameras can be at a distance from where the actual interaction takes place. We envision loudspeakers Permission to make digital or hard copies of part or all of this work for personal or and cameras to be embedded into the walls and ceilings of classroom use is granted without fee provided that copies are not made or distributed arbitrary rooms. Further, we envision users to be completely for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. uninstrumented, using the system as they are. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s). CHI 2014, Apr 26 - May 01 2014, Toronto, ON, Canada ACM 978-1-4503-2473- 1/14/04. http://dx.doi.org/10.1145/2556288.2557000 247 Session: Audio Interaction CHI 2014, One of a CHInd, Toronto, ON, Canada In our prototype we approximated this vision while simplify- impression that sounds are emerging from the users’ hands or ing our installation and increasing robustness. We created from physical objects. Fourth, spatial audio presents a limited a small room (3 m diameter) where a circular array of 56 bandwidth for feedback of gestural interaction. Therefore, af- loudspeakers is hidden behind curtains. Further, we used a fordances of real objects should be used to provide additional marker-based optical tracking system to simplify the com- visual feedback for more complex interactions. puter vision part of gesture recognition, user tracking and ob- ject recognition and tracking. SCENARIOS In contrast to previous indirect interaction with audio, in this We believe that the ability to “touch” sound sources in mid-air paper we propose to merge the location of the sound and of and to make objects “speak” opens many new opportunities the interaction, enabling auditory direct manipulation of vir- for human-computer interaction (HCI). As a simple exam- tual sound sources hovering in mid-air. ple, the marble answering machine [11] could be taken to a new dimension. An ordinary bowl with marbles could be pro- We determine constraints and design issues of our system in grammed to serve as an answering machine, making an occa- three steps. First, we conducted two laboratory studies to de- sional clicking sound by which the number of new messages termine important parameters of our system. We find the ab- is audible if a user is nearby. When a marble is taken out of solute pointing accuracy of our system to be 11.5 cm. Using the bowl, the marble itself could play the recorded message, distractors different from the stimulus, the accuracy degrades while being carried through the room. If the user wants to only insignificantly, but with distractors similar to the stim- delete the message, she could simply pull it out of the mar- ulus, accuracy degrades to up to 48 cm. To our knowledge ble and drop it into the bin. She could even speak a reply this is the first direct pointing accuracy evaluation of a WFS into the marble that would be returned to the caller. If she system. Second, to showcase the capabilities and limitations wants to keep the message, she could simply drop the mar- of BoomRoom, we implemented a spatial music mixing ap- ble into another bowl. More generally, there would be no plication. Third, we provide learnings from theoretical con- need for any devices in the room, like alarm clocks, bells, siderations, our own experience, and a user-centered design egg boilers, phones or computers, to have loudspeakers for process with invited novices and musicians. themselves. Extending the vision of Audio Aura [16], unread This paper makes both a technical and a scientific contribu- emails could be a flock of birds that sit or fly somewhere in tion. On the technical side, we present the first system that the room, with new mails flying in through the door and mak- allows to “touch”, grab and manipulate sounds in mid-air. ing a pass around the user. Urgent mails could occasionally Further, arbitrary objects can seem to emit sound, even when fly over the user. By the chirp of the birds different senders moving. This is also the first WFS system that allows users could be recognized. If the user wants to read one of the mes- to walk through a landscape of multiple, possibly moving, sages, she could walk over to the bird, “touch” it, and the sounds in mid-air while always coping with the users’ cur- message would be read to her. By grabbing and manipulat- rent head position. ing the chirp, she could reply to or forward emails. As an- other example, smart rooms could finally become accessible On the scientific side, we present the first experiment using for the blind. If a person comes into the room and wants to get WFS that investigates how accurately users can “touch” a an overview of the present objects, she could simply call the source. To our knowledge, previous experiments have only announce function and all objects would quickly announce investigated how exactly users can point into the direction of themselves (keys, table, chair, etc.). Similarly, dropped ob- a source (i.e. give the angle). Finally, we present the first jects on the floor or spilled liquids could make an appropriate investigation of accuracy of WFS reproduction with distrac- sound to be detected by the user. Blind users could also sim- tors. These basic studies are necessary for a wide variety of ply attach their own sounds to objects by putting them into interactions with sounds hovering in mid-air. the objects, or leave messages for others in mid-air. We took four main learnings from this project. First, we learned that direct “touch” interaction with sound is an in- SPATIAL AUDIO teresting alternative to indirect interaction mediated by con- The capabilities of the human auditory system to analyze trollers or visual interfaces. It avoids a modality switch be- acoustic scenes rely on the acoustic scattering of the outer tween auditory and visual modality. Further, it is very easy to ear including the upper torso, head and pinnae [2]. These learn by observation, and users describe it as natural and fun. acoustic properties are represented by so-called head-related Second, sound localization is surprisingly accurate (11.5 cm), transfer functions (HRTFs), which are dependent on distance even in the presence of distractors. However, the simultane- and angle of incidence of a sound source. They are individual ous presentation of very similar sounds should be avoided. for each person. Humans can perceive sound coming from Third, the localization cues of the visual and proprioceptive any direction; however, the localization accuracy depends on senses are stronger than the auditory cue. The ventriloquist the spatial origin of the sound in relation to the position of effect describes that if there is a discrepancy between auditory the listener. The angular resolution is about 1–5 degrees of and visual localization cues, the perception is biased towards azimuth in front of the listener and up to 20 degrees for more the visual cue [12]. For sources close to the loudspeakers, the peripheral and rear positions depending on the characteristics ventriloquist effect can create an unwanted bias towards the of the source and the presence of distractors [2, 24]. Locali- loudspeakers, however, the same effect helps to improve the zation in the median plane is much less accurate than in the horizontal plane. 248 Session: Audio Interaction CHI 2014, One of a CHInd, Toronto, ON, Canada 2 of loudspeakers is used, which constitutes a spatial sampling process. Typical loudspeaker spacings of 10 to 30 centimeters 1.5 result in spatial sampling artifacts emerging for frequencies above 1 kHz. The results of psychoacoustic studies [20] and 1 the considerable number of realized systems [6] prove that WFS allows for accurate spatial perception of sound scenes. 0.5 The perceptual properties of focused sources have been in- y/m 0 vestigated e.g. in [23, 22]. One problem that had to be tackled in the presented work −0.5 is the limited listening area for focused sources. A listener −1 who is located in the converging part of the sound field does not perceive the intended spatial impression [23]. While the −1.5 sound itself is reproduced without major artifacts, it is located towards the loudspeakers. The propagation direction of a fo- −2 cused source can be controlled by sensible selection of active −2 −1 0 1 2 secondary sources [19]. Choosing the propagation direction x/m towards the actual position of the listener, resolves the issue Figure 2. Synthesis of a focused source with WFS using a circular loud- of the limited listening area. This holds also for multiple fo- speaker array of 56 loudspeakers. The array has a diameter of 3 m, the cused sources and for multiple listeners as long as no listener focused source emits a monochromatic sound field of 1 kHz and is lo- cated at (0, 0.75) m as indicated by the white ring. The sound impression is located in the converging part of the sound field. is correct for any user in the lower area delimited by the dashed line. RELATED WORK We review the state of the art regarding interaction with spa- Sound field synthesis (SFS) techniques aim at physically syn- tial audio on headphones and using loudspeakers. thesizing a desired sound field within a defined listening area using surrounding loudspeakers. Well known representatives Headphones are Wave Field Synthesis [1] and higher-order Ambisonics The vast majority of spatial audio work in HCI uses head- [5]. By synthesizing a sound field, the individual HRTFs of phones. As an example, Brewster et al. [3] present two spa- the listeners are preserved even when moving throughout the tial audio interaction techniques to be used with headphones listening area. This is especially of interest in the presented while walking. One is nodding into the direction of the sound work, since distance perception for nearby sources is related source, while the second consists of gestural commands on to individual spectral changes in the HRTFs [4]. a belt-mounted PDA. Audio Aura [16] augmented an of- fice with non-spatial audio on headphones, such as sonifying In the following, we will focus on the foundations and prac- emails or reminders. Strengths of using headphones are 1) tical limitations of WFS since this technique is used in the mobility, 2) isolation from ambient noise, and 3) ability to BoomRoom. The physical foundations of WFS are related render different sounds to different users. The major draw- to the Kirchhoff–Helmholtz integral from which WFS can be back is the necessity to wear headphones in the first place, derived as stationary-phase approximation [25]. WFS allows which may be cumbersome and influence hearing, thereby for the synthesis of recorded or prescribed sound fields. The separating the user from real-world sounds and other people. latter case is used commonly in model-based rendering where spatio-temporal models of virtual sources are used to derive Loudspeakers the driving signals for the loudspeakers. Typical models are The majority of research on WFS and related techniques con- plane waves, point sources and focused sources. The former cerns non-interactive spatial audio rendering. Typical appli- two constitute an acoustic source at an infinite respectively fi- cations are television and cinema, where they could provide nite distance outside of the listening area. The latter one is the the next generation of surround-sound which is not dependent special case of an acoustic source located within the listening on a sweet spot. In this section we discuss the few examples area. The BoomRoom uses only focused sources. Figure 2 of gestural interactive WFS systems we could find. illustrates the synthesis of a monochromatic focused source with a circular loudspeaker array in the same configuration Grainstick [13] is a gesture-controlled musical instrument us- as used in the BoomRoom. A sound field is synthesized that ing WFS. Two users stand in front of a linear loudspeaker converges towards a focus point and diverges after the focus array with optically tracked Wiimote controllers. The rela- point at (0, 0.75) m as a point source located at the focus. tive height of the two controllers controls virtual grains ren- Hence, from a physical point of view the synthesized sound dered as focused sources which move from one direction to field is only correct for y <= 0.75 m within the loudspeaker the other in front of the WFS system. array. Every listener located anywhere inside this area gets the impression of a sound emerging from the focus point. The application of WFS in the context of visual Augmented Reality is discussed in [14]. The user is wearing video see- The implementation of WFS faces some practical limitations. through AR glasses while standing inside a WFS system. The The theoretical background assumes a spatially continuous user can use a large paddle with a visual marker attached to distribution of secondary sources. In practice, a finite number the end to position a sound source. Seemingly, the sound 249 Session: Audio Interaction CHI 2014, One of a CHInd, Toronto, ON, Canada source is permanently attached to the end of the paddle. In We also 3) remove the physical prop, yielding the first system the same paper, also an exocentric perspective is presented, where users can walk with their head through a focused sound where users look at the room as World-in-Miniature and can source. We discuss the consequences of these extensions in use a miniature paddle to position the sound source. the paper. Springer et al. [21] present a system that combines WFS with In summary, while a few interactive WFS applications have a multi-viewer stereo display. Users stand in front of a large been implemented, interaction is always indirect. In partic- two-viewer stereo projection wearing a combination of shut- ular, we are not aware of any system where the users could ter and polarization glasses. Behind the screen is a WFS sys- “touch” the sound sources. This is partially due to the fact tem comprising three linear array segments. In one applica- that in order to correctly render focused sources when the tion, users hold an optically tracked controller to operate a user is moving around, the approach of Melchior et al. is billard cue on the screen. They can hit balls which bounce off necessary. Further, in order to create sources anywhere in cushions and other balls and thereby emit sounds. In another a room, a closed loudspeaker array is needed that encircles application, users hold a hand-held trackball. On the screen, the entire room. Naturally, because “touching” sources has a forest with a brook flowing from left to right is shown and not been possible before, we present the first evaluations of audible. Users control a 3D cursor with the trackball, and “touch” accuracy in WFS. Finally, we present the first system when they press a button, a stone is dropped from the cursor that allows more interactions than mere indirect translation of into the brook. sources. Fohl et al. [7] present a gesture control interface for WFS. Users can point into the direction of a source and raise their THE IMPLEMENTATION OF BOOMROOM hand to select it after a certain dwell time. When the hand is The BoomRoom was realized in a room with a size of 4.3 m moved towards or away from the source, the source moves × 4.5 m. The room is equipped with absorber panels and cur- away from or towards the user. When the hand is moved tains which reduce the reverberation time T60 to 0.5 seconds. sideways, the source rotates around the user. When the hand In the room, a ring of 56 loudspeakers (Elac 301) is suspended is dropped, the source is released. It is not stated in the from the ceiling. The ring has a diameter of 3 m which yields paper whether focused sources are used, but since the head a loudspeaker spacing of about 17 cm. The height of the ring is not tracked, apparently users cannot walk around focused of loudspeakers can be varied; for the BoomRoom it was po- sources. sitioned at ear level of a standing person. The main difference between these systems and our work is The loudspeaker driving signals are generated in real-time by that users cannot “touch” the sources, but interact with them a computer running the Debian GNU/Linux operating sys- indirectly, by a) lifting two controllers [13] b) a paddle [14], tem. The model-based spatial audio reproduction was re- c) a controller for pointing or a trackball [21] or d) the point- alized with the open-source software SoundScape Renderer ing direction of their hand [7]. Also different in our system is (SSR) [8]. The SSR provides, among several other repro- that users can walk freely around sources hovering in mid-air. duction methods, a very efficient real-time implementation of The sources can even be moved around the user’s own head, WFS. The WFS algorithm is implemented using a delay line while a correct listening experience is maintained. This is ap- and a weighting factor for each source–loudspeaker pair and parently not possible in these previous systems. The ability a static filter per source [6]. This so-called pre-equalization to “touch” and move sources in mid-air is difficult to achieve filter must be used to compensate the inherent low-pass char- without being able to walk around them. Finally, these sys- acteristic of a loudspeaker array. It depends only on the layout tems provide mostly translation of sources, while we enable of the loudspeaker array and is applied to each source signal richer interaction, e.g., manipulation. before applying time delays and amplitude weighting factors. With a loudspeaker setup limited to the horizontal plane, the One critical extension of WFS that enables systems allow- amplitude of the sound field cannot be synthesized correctly ing to “touch” sounds is presented by Melchior et al. [15]. for the whole listening area. Therefore, a certain point inside They do not present an interactive system, but rather a tech- the listening area is chosen as a reference point for the cal- nique to select loudspeakers for focused sources based on culation of the amplitude. This reference point is typically the tracked listener position. This is a critical feature to en- located in the center of the loudspeaker array. For the Boom- able users to walk around focused sources while continuously Room, the SSR was extended with a feature called reference maintaining a correct listening impression. In the experiment, offset. This extension is publicly available in the latest release a physical (inactive) loudspeaker was placed in the center of of the SSR.1 The reference offset is bound to the tracked po- a WFS system and users walked around it. They were asked sition of the listener’s head, therefore the amplitude is always whether they had the impression that the sound was coming correctly calculated for the actual position of the listener. As from the loudspeaker (while it was actually rendered by the mentioned above, for a given source only a subset of loud- WFS system). We use the same approach to enable users speakers is used. This selection is also controlled by the ref- to walk around sound sources in the BoomRoom. We ex- erence offset and updated in real-time. tend the approach by 1) applying it to multiple sources simul- taneously, so users can walk through an auditory landscape The interactive playback and looping of audio files and their and by 2) enabling dynamically moving sources, so users can routing to virtual sound sources in the SSR was realized with hold a source in their hand and move it around their head. 1 http://spatialaudio.net/ssr/ 250 Session: Audio Interaction CHI 2014, One of a CHInd, Toronto, ON, Canada the visual programming language Pure Data (Pd). For the Absolute Error in Space second experiment and the music mixing application (see 1.5 below), several audio tracks have to be played back syn- chronously. This is done by loading a multi-channel audio 1.0 file in Pd. At the end of the file, the playback is seamlessly Y dimension of room (m) started from the beginning. In addition to the instrumental 0.5 loops, further audio files can be loaded for providing audio feedback in the music mixing application. These files can be 0.0 started on demand and their position in the virtual room can be specified separately. -0.5 For the music mixing application two sound effects were im- -1.0 plemented in Pd. One effect is a band-pass filter with reso- nance, where the cutoff frequency and the sharpness can be -1.5 remote-controlled. The other effect is a simple distortion ef- fect realized by wave-shaping the signal with a tangens hy- -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 perbolicus curve. The amount of distortion can be remote- X dimension of room (m) controlled. Figure 3. Error in absolute space. Arrows point from click to source. For optical tracking an OptiTrack system with 16 cameras Red arrows indicate clicking towards the loudspeakers, while blue ar- suspended from the ceiling is used. We use pinch gloves to ro- rows indicate clicking towards the center of the room. As can be seen bustly detect pinches. Cap, gloves, and objects are equipped from the plot, there is a systematic bias for sources to be perceived as being closer to the loudspeakers (black circle, green circle denotes area with reflective markers. The main system logic was imple- where sources appeared). mented in Processing. The communication between applica- tions was realized using TCP/IP sockets. For routing audio signals to the soundcard and between applications the JACK plane) was continuously recorded over time. After each trial, Audio Connection Kit was used. the following stimulus was selected randomly. Users com- pleted 3 stimuli × 15 repetitions = 45 trials. After all trials, EXPERIMENTAL EVALUATION 1: ACCURACY OF LO- a semi-structured interview regarding the user’s strategy was CALIZATION performed. The purpose of this study was to determine the accuracy with We recruited 17 participants (6 male) to participate in the which users can locate virtual sound sources within our appa- study. Participants were not associated with our laboratory ratus. This information is used to determine the radius within and had no experience with spatial listening experiments or which a sound can be selected. WFS before. They were aged between 25 and 68 (mean = 36). No participants reported any hearing impairments. Procedure We used the WFS apparatus described in the previous sec- tion. The head of the user was tracked with an optical marker. Results The selector was a Logitech Presenter with an optical marker The median accuracy (horizontal distance from click location attached. The participants selected using a button on the pre- to sound source) was 11.5 cm (mean = 14 cm). We did senter. not find significant differences between stimuli (repeated- measures ANOVA, F(2,715) = 1.39, p < .25). We did Participants began a trial standing within the circle of loud- find significant differences between participants, however speakers by clicking the button. They heard a sound placed at (ANOVA, F(16,748) = 21.9, p < .01). Mean error of the a random position with at least 30 cm distance from the loud- most accurate participant was only 6.4 cm, while the least speakers. Participants were able to move freely around the accurate participant had a mean error of 23.7 cm. Median room within the loudspeaker array. Their task was to place selection time was 8 s. the selector directly at the location where the sound appeared to be coming from and click the button. When they clicked When the stimulus appeared, participants needed about one the button, the sound was placed in a new random location, second to localize it. Then they walked quickly towards it, beginning the next trial. Users trained for 8 trials at the be- reaching the vicinity of 30 cm after 3 s. They finally per- ginning of the experiment. The independent variable was the formed a fine search, where they improved their accuracy stimulus. only slightly to 20 cm after 6 s. As stimuli, pulsed noise, speech, and guitar tones were used. The actual locations of the sound sources and the clicks are All stimuli are available on the project website.2 As depen- plotted in Figure 3. It can be seen that there is a systematic dent variables, the selection time from stimulus presentation error for sources which are perceived to be closer to the cur- to button press was measured and the distance between the tain than they really are (t-test t(764) = 13.7, p < .01). The selector and the sound source (projected to the horizontal error from a user perspective is plotted in Figure 4. In this ex- periment, a slight tendency for overshooting (two-sided t-test 2 http://joergmueller.info/boomroom/ t(764) = -5.7, p < .01) and bias to the right (t(764) = 5.9, p 251 Session: Audio Interaction CHI 2014, One of a CHInd, Toronto, ON, Canada Error relative to source DISCUSSION Localization of sound sources in mid-air is surprisingly accu- Error on axis orthogonal to head-source axis (m) rate even for novice users (11.5 cm), and there are no signif- + 0.4 + icant differences between our stimuli. Determining the loca- + + + + + + + + tion of a sound from a distance or while walking around or ++ ++ + + ++ + +++ ++ + ++ + + + + + even through sounds are different techniques, and we cannot 0.2 + ++ + + ++ ++ +++ + +++ ++ + + + + + ++ ++ ++ ++++ ++ +++ + + ++++ ++ + +++ ++ + + +++++++++++++ + + + + + + + + + ++ + + + +++ + + + + distinguish their accuracy in this experiment. The two sys- + + ++ + +++++++++++ ++++++ ++++++++ +++ + + + +++ ++++++++++++++++++++++++++++++++++++++ +++ ++ ++++++++++++++++++ +++ ++ + + + ++ + tematic misperceptions of a higher variance on the axis be- + + + ++++++ +++++++++++++++++++++++++ + + + + + + ++ + + ++++++++ ++ + + + ++++++++++++++++++++++++++ ++ + ++++++++ ++ + + + + 0.0 + + + + + ++++++ + + + +++++++++++++++++ + +++++++ +++++++++ + +++++++++++++++ + + +++++ ++ + + + + + ++++ + ++++++++++++++++++++++++++++++++++++++++ ++ + + + + + ++++ ++ + +++ + ++ +++ + + + ++ + +++ +++++++ + +++++++ + +++ + ++ + + + + + + + + tween head and source than the orthogonal axis, and the bias + ++ ++++ + + +++++ ++++++ ++ +++ + +++ ++ + + ++ + + +++ + + +++ ++ + + ++ + + + + towards the loudspeakers, can be explained by psychoacous- + + ++ +++ + + + tics. Humans are much better in determining angles than dis- -0.2 + + + + + + +++ + + + + + + tances. In our case, this effect is much less pronounced than + + + + in experiments where the listener is stationary [2]. Users em- -0.4 + ployed active hearing, they translated and rotated their head. + + Even with this strategy, the effect is present. The perception -0.4 -0.2 0.0 0.2 0.4 of sources close to the curtains as coming from the curtains Error on axis between head and source (m) can be explained with visual dominance and the ventriloquist effect. If there is a plausible visual sound source (e.g., a cur- Figure 4. Error on an axis between source and head. Each trial is trans- tain) close to an audible sound source, the visual perception lated so that the source is at the center (red circle) and rotated so that the head is on the horizontal axis (mean head position blue diamond). may dominate the auditory perception, and the sound may be Clicks are shown as grey + signs, the mean click location as black + sign. perceived as coming from the curtain. It can be seen that there is greater variance along the axis between head and source (distance estimation). Further, in our experiment partici- pants tend to slightly overshoot the target and have a slight bias to their EXPERIMENTAL EVALUATION 2: LOCALIZATION IN THE right. PRESENCE OF DISTRACTORS The purpose of this study is to investigate the impact of vary- ing numbers of distractors (both similar and dissimilar to the stimulus) on the accuracy of target acquisition. < .01) can be observed (all participants were using their right hand). Note also that the variance along the axis between head and source (SD = 14.6 cm) is significantly larger than Procedure the variance on the orthogonal axis (SD = 8.9 cm, p < .01, The same apparatus as in the previous experiment was used Bartlett test). It can also be seen that participants tend to have and the experiment was conducted immediately after the pre- their head close to the source (median = 21.7 cm). vious one with the same participants. Due to the application scenario, a prototypical implementation of a music mixer, In interviews, participants described their strategy as first lis- musical instruments were chosen as stimuli for this study. In- tening to the stimulus, rotating their head a bit, and then walk- dividual tracks from REM’s song “It happened today”, which ing towards it. For the final approach, strategies differed. are freely available under a Creative Commons license (CC Some participants simply streched out their hand in the di- BY-NC-SA 3.0), served as source material. An acoustic gui- rection of the source when they believed to be close. Oth- tar was chosen as stimulus. As dissimilar distractors, differ- ers described it as walking around the source. Others rotated ent instruments (percussion, synth, mallets, bass, etc.) were their head to see on which ear the sound was louder. During used, and as similar distractors, different sounds from electric the experiment a few users developed the strategy of walking guitar, mandolin, and banjo. All sound files are available for through the source and determining when it jumped to the download on the project website.2 other side of their head. Many users tried to find the source with their head first. Then they either moved their head away In the beginning of the experiment, participants could listen and moved their hand to their previous head position, or they to the stimulus and all distractors separately. A trial began simply moved the hand very close to the head. Regarding the by listening to the stimulus in isolation. When the subject perceived location of the sound, many participants initially pressed a button, the stimulus changed location and 1, 3, 5 or perceived the sound coming from the curtain. After a few 7 concurrent distractors (either similar or dissimilar) added at trials, however, they reported to perceive the sound to origi- random locations (at least 20 cm distance from the curtain) nate in mid-air. Some participants also reported to perceive became audible. The task was to place the selector directly sources close to the curtains as coming from the curtains, and at the location where the stimulus appeared to be coming sources further towards the center of the room as originating from and click the button. After the click, all distractors were in mid-air. Most participants expressed astonishment about muted, so that the subjects could estimate their own accuracy. their first ever experience of walking through a sound source. 2 distractor categories × 4 different numbers of distractors Some described the experience as a strange feeling in their × 3 repetitions = 24 trials were performed. After all trials, head, as if the sound had entered their head. Others described a semi-structured interview regarding the user’s strategy was it as the sound evading them, resulting in the perception of a performed. As dependent variables, accuracy and time were moving sound source. measured. 252 Session: Audio Interaction CHI 2014, One of a CHInd, Toronto, ON, Canada Accuracy with Distractors do not see an issue presenting large numbers of dissimilar 2.5 Distractors sounds concurrently, the concurrent presentation of similar Similar Dissimilar sounds should be avoided if possible. 2.0 APPLICATIONS 1.5 In order to explore the capabilities of the BoomRoom, we im- Error (m) plemented four different applications. We like to think about the BoomRoom to provide capabilities for consumption and 1.0 creation. Regarding consumption, we implemented an appli- cation to augment the music listening experience. Instead of 0.5 simply listening to a prefabricated stereo mix, the instruments and voices are distributed throughout the living room. E.g., 0.0 1 3 5 7 violins may be situated close to the sofa, while flutes may Number of Distractors hover above the table. Users can rearrange the spatial layout of the musical scenario at will. We have also implemented a lightsaber game where users can hold a small controller with Figure 5. Error with similar and dissimilar distractors, for 1, 3, 5 and 7 distractors, respectively. For dissimilar distractors (white), the error is two buttons. They can switch the lightsaber on and off, which almost unaffected by the number of distractors. In contrast, with similar then emits a lightsaber sound as if the saber would be about a distractors (yellow) the system quickly becomes unusable when adding meter long. Invisible enemies (which one can hear breathing) distractors. attack the players from all sides. The players have to defend themselves using the lightsaber. Third, we implemented an Results immersive audio drama experience where the voices occur The average error for similar and dissimilar distractors by from around the user. Here too, users can rearrange the scene number of distractors is given in Figure 5. With a two-way at will. ANOVA, we found significant main effects of kind of distrac- In order to explore the creative capabilities of the Boom- tors (F(1,78.7) = 63.08, p < .01) and number of distractors Room, we implemented a spatial music mixing application. (F(3,78.7) = 7.89, p < .01) on the performance time. We also We were inspired by Ishii et al.’s examples of tangible com- found a significant interaction of kind of distractors and num- puting [11], in particular the musicBottles [10]. With the ber of distractors (F(3,78.7) = 3.4, p < .05). A Tukey’s pair- musicBottles, sounds are confined within bottles placed on wise comparison revealed the significant differences between a dedicated table. Many users were seen to lift the bottles to similar and dissimilar distractors as well as between 1-5, 1-7 their ears to hear whether the sound was literally coming from and 3-7 distractors (p < .05). Bartlett’s test found the variance the bottle. However, this did not work, since the sound was in error to be higher for similar than for dissimilar distractors coming from loudspeakers below the table. (χ2 (1) = 201.5, p < .01). With dissimilar distractors, median errors for 1, 3, 5 and 7 distractors were 11 cm, 12 cm, 14 cm We decided to take these concepts a step further. As with the and 16 cm, respectively. Median selection time was around musicBottles, musical instruments reside within bottles. In 10 s and independent of the number of distractors. With simi- contrast to the musicBottles, the sound itself can be grasped lar distractors, median errors for 1, 3, 5 and 7 distractors were and positioned somewhere else in the room. Sounds can also 13 cm, 23 cm, 28 cm, and 48 cm, respectively. Further, me- be dropped into bottles. In addition to bottles, there are a dian selection times rose sharply from 7.9 s for 1 distractor to number of bowls in the room. The bowls can be programmed 16.4 s for 7 distractors. Notably, with similar distractors, par- with arbitrary sound effects, and when a sound is held above ticipants reported difficulties when multiple distractors were a bowl, it can be altered. The sounds are explained to be very close to each other or close to the stimulus. A two- elastic, so they can be held in place in one hand and stretched way ANOVA found significant differences between partici- with the other hand in horizontal and/or vertical direction to pants (F(16,374) = 4.37, p < .01) and an interaction between be altered. We have implemented bowls for changing volume, participant and kind of distractors (F(16,375) = 2.15, p < selecting different variants of the instruments, applying filters .01). With similar distractors, the highest performing partici- and effects, making sounds play solo and muting them. For pant had a median error of 8.4 cm, while the least performing example, the volume of any sound can be changed simply by participant had a median error of 131.0 cm. With dissimilar placing it above the volume bowl, holding it with one hand distractors, this span was only 6.3 cm vs. 30.3 cm. and stretching it vertically with the other hand. Discussion EXPERIENCES WITH MID-AIR AUDITORY DIRECT MANIP- While dissimilar distractors have little effect on performance, ULATION similar distractors make the system quickly unusable, both We invited a dozen users to explore interaction with the mu- in terms of speed and accuracy. Participants did not re- sic mixing application. Users came from different back- port problems distinguishing the stimulus from distractors grounds (from no musical experience over audio experts to when presented separately. However, especially some par- professional musicians). In this section, we provide learn- ticipants were prone to confusing the stimulus with distrac- ings from theoretical considerations, from our own experi- tors when presented simultaneously. Concluding, while we ences and from this user-centered design process. 253 Session: Audio Interaction CHI 2014, One of a CHInd, Toronto, ON, Canada be beneficial if the user could select a region for items to an- nouce themselves via a torch metaphor. A torch could high- light sounds either in a cone (direction) or circle (position). We use overlapping announcement of sounds within 200 ms. Third, we assigned sounds to objects or body parts (the hands), such that they can simply be found via the visual or Figure 6. Design space of primitive interactions with sounds hovering in mid-air. proprioceptive senses. The user needs to remember which source is attached to which object, but then these sounds can be found efficiently using the visual or proprioceptive senses We present our results in form of a design space of primitive regardless whether they are currently audible. interactions with sounds hovering in mid-air. We identified five primitive interactions: Finding, selecting, grabbing, ma- Selecting nipulating and dropping sounds. We argue that the combina- Selecting involves the determination of one sound for further tion of these interaction primitives enables direct manipula- interaction. In our case, selecting is performed by positioning tion of sound sources in mid-air, inspired by the concept of one hand in an area around the sound. When the selection direct manipulation for graphical user interfaces. Direct ma- area is entered or left, a feedback sound is played. Impor- nipulation interfaces [18] are charactized by 1) Continuous tant choices are the size of the selection area and feedback representation of the object of interest; 2) Physical actions or sounds. There is a general tradeoff between ease of selection labeled button presses instead of complex syntax; 3) Rapid and inadvertent selections, either when multiple sounds are incremental reversible operations whose impact on the ob- close or e.g., while walking through the room. We observed ject of interest is immediately visible. They are argued to be that users often walk in the direction of the source and then beneficial in particular because of the closeness of the user’s sweep their hand in front of them until they hear the feedback mental model and the physical requirements of the system, that the source is selected. We currently use a horizontal ra- and because of the user’s feeling of interacting with the ob- dius of 20 cm for source size, which is well above the average jects themselves rather than via a tool [9]. Direct manipula- localization accuracy of 11.5 cm and works well. In cases tion is also argued to provide benefits like ease of learning where users were unable to select a source, their head was by imitation, the immediate feedback whether the actions are often very close to the source, making it difficult to select. furthering one’s goals, and the ability to simply change the Equally important is the vertical size of the selection area. It direction of one’s activity otherwise [18]. We discuss each of can be very annoying to get a large number of entered/left the interaction primitives in turn. feedbacks when one is walking around in the room. There- fore, it should be taken care that sounds are not selected when Finding hands are not raised. We observed that when a sound is not at- Finding involves determining the location of a specific sound, tached to an object, users tend to lift their hands to the height possibly within a complex scene. We observed that users ro- of the loudspeaker array for selection. The vertical localiza- tate their head a lot and walk around in the room. In order tion accuracy is enough for people to experience the vertical to support these strategies, especially moving around sources, position of the sounds at the height of the loudspeakers. For sounds hovering in mid-air, we currently define the vertical the extension of the method of Melchior [15], which was used selection area as starting 10 cm below the loudspeaker array. in our implementation, is necessary. One particularly interesting aspect of finding sounds relates For sounds that are attached to objects such as bottles, we to the first principle of direct manipulation, continuous rep- leverage the ventriloquist effect for selection. The effect resentation of the object of interest. While objects in visual works well for angular deviations between auditory and interfaces are usually visible continuously, audio is often not visual cues of up to 20◦ –30◦ even if there is no apparent causal relationship between visual and auditory events [12]. continuous. In silent moments, it may be difficult or impossi- It should be noted that in the literature only perceived differ- ble to either find a source or hear the effects when it is manip- ences in azimuthal angles are investigated. We are not aware ulated. We propose three solutions to this problem: (1) Mod- ifying the source signal, (2) adding an announce function and of any experiments which investigate localization with regard (3) attaching the source to visual objects or body parts. to different elevation angles and different distances. One can assume, that the bias towards the visual cue is even greater First, in our music mixing application, we simply removed in elevation and distance because auditory localization on its most moments of silence in the source signal. The central own is much less accurate in these cases [2]. downside of this approach is the modification of the audio In our experience, users have the perception that the sound is scene itself, which may be inappropriate in the case of music or speech. coming directly from the opening of the bottles, and that this experience is quite robust against vertical angular deviations Second, we implemented an announce function. Sounds an- between bottle and loudspeakers. For steep angles, as when nounce themselves via auditory icons (musicons in our case) standing close to the bottles, however, they have the feeling or speech labels when the user lifts a hand above the head that the sound is hovering above the bottle. For sounds in- without pinching. In our case, all sounds in the room an- side objects, we define a selection area that starts immediately nounce themselves. For a larger number of sounds, it would above the object. 254 Session: Audio Interaction CHI 2014, One of a CHInd, Toronto, ON, Canada Grabbing regardless of the current value, such as filter or effect. For For grabbing, we use the pinch gesture because it is robust these parameters, we use an absolute selection style, where to recognize using computer vision and easy to understand. both hands close to each other set the value to zero, and the When a user pinches while a sound is selected, that sound distance in horizontal direction sets one parameter and the is grabbed and can be moved. We currently give a general distance in vertical direction the other. Note that it is diffi- feedback that a sound is grabbed. However, users also need cult to cross both hands, therefore we use the horizontal axis to verify which source they have grabbed. The behaviour we for parameters that have only positive values (such as filter have observed most often was to move the sound close to the steepness or effect strength) and the vertical direction for pa- ear and back and forth, or to move it around the head. Using rameters that have both positive and negative values (such aus this technique, users were able to verify quickly which source frequency relative to 440 Hz). For discrete parameters, such they had grabbed, so that we see no need for additional feed- as variant, we quickly noticed that it was not always audi- back. Users also reported that the spatial impression from the ble when the value/variant had changed. We therefore added audio was strongest when they had grabbed a sound. This can discrete feedback for these events. be explained with the proprioceptive ventriloquist effect. This effect explains that if there is a discrepancy between auditory Dropping and proprioceptive cues, the perception is biased towards the Dropping simply involves releasing the pinch when a sound proprioceptive cues [17]. is grabbed. Sounds can be dropped in mid-air, into bottles or bowls. Dropping sounds into bowls allows to apply an effect Manipulating to multiple sounds simultaneously (useful, e.g., for solo). A For manipulating sound, our first approach was to use mid- different feedback is given when sounds are dropped in mid- air gestures. In the first iteration of the volume adjustment, air or into objects, to give users the chance to verify that they sounds could be grabbed in mid-air and then be moved up have successfully dropped a sound into an object. In our first and down to raise and lower the volume. This approach, iteration, sounds were dropped into objects when they were however, clashed with the proprioceptive ventriloquist effect. close to the object and not grabbed. This lead to the phe- When the sound was grabbed and the hand lowered, there was nomenon of inadvertently “collecting” all sounds along the a strong expectation that the sound should move vertically, way when one carried an object across the room. In our cur- too. Because the angular difference between loudspeakers rent implementation, sounds are only dropped into objects if and hand was so large, it became audible that the sound was they are explicitly released above them. still coming from the same location, breaking the illusion that the sound was held in hand. Further, we had problems with LIMITATIONS the limited feedback bandwidth of spatial audio compared to The experiments were conducted with a single user at a time. visual interfaces, making it difficult to communicate in which The tracked position of the user’s head was used as reference manipulation mode the user currently was. Our second ap- point for calculating the WFS driving signals. With a few proach involved physical objects, like a pepper caster, that limitations, the BoomRoom is also multiuser capable. In this could be held in hand and be “put inside the sound”. This case, the reference point is chosen between users [15]. There- however strongly reduced the manual dexterity for gestural fore, a focused source located directly between users cannot manipulations, in particular it was difficult to use pinch as a be rendered correctly for all users as some of them would be delimiter when holding an object. In our current approach, outside the “allowed” area (see Figure 2). Nevertheless, all we use bowls, which have the affordance that sounds can be other positions work for all users. Furthermore, when one “put into” and “held over” them. The hands are now free for user is standing still while others are moving, the sound per- gestures and the bowls provide a visual feedback for the zone ception changes also for the stationary user because the refer- where each action can be performed. In order to maintain ence point is changing. Finally, like all practical sound field consistency with the proprioceptive ventriloquist effect, the reproduction systems, the BoomRoom suffers from more or sound always needs to be held in one hand above the bowl. less audible artifacts caused by spatial aliasing [20]. The other hand can then define two parameters by moving These are all physical limitations and not shortcomings of the horizontally and vertically. current implementation. However, future research based on We support three different manipulation styles: relative, ab- sound perception may lead to methods that allow to elude solute, and discrete manipulation. Relative manipulation is these physical limitations. used for parameters that users usually want to manipulate rel- ative to their current value, such as volume. In our initial CONCLUSION implementation volume was defined by the relative height of We presented a system that allows users to “touch”, grab and the two hands. When both hands were at the same height, manipulate sounds in mid-air. We took four main learnings there was no change in volume. We quickly noticed that users from this project. We learned that direct “touch” interaction had difficulty understanding this concept. Instead, most users with sound is an interesting alternative to indirect interaction grabbed the sound with one hand, pinched with the other hand mediated by controllers or visual interfaces. Sound locali- at an arbitrary location, and moved the second hand up and zation is surprisingly accurate (11.5 cm), even in the pres- down in the expectation for the volume to change accord- ence of distractors. The ventriloquist effect can be leveraged ingly. We subsequently implemented this behavior. For other by assigning sounds to real objects or holding them in the parameters, it is important that they can be easily set to zero hands. Finally, affordances of real objects should be used 255 Session: Audio Interaction CHI 2014, One of a CHInd, Toronto, ON, Canada to enrich the limited feedback bandwidth of spatial audio for 13. Leslie, G., Zamborlin, B., Jodlowski, P., and Schnell, N. more complex interactions. We believe that mid-air auditory Grainstick: A collaborative, interactive sound direct manipulation has significant potential beyond what we installation. In International Computer Music explored in this paper. Conference (2010). 14. Melchior, F., Laubach, T., and de Vries, D. Authoring ACKNOWLEDGEMENTS and user interaction for the production of Wave Field We want to thank Sean Gustafson and Patrick Baudisch for Synthesis content in an augmented reality system. In extensive discussions and support on this project. This work Fourth IEEE and ACM International Symposium on was supported by the ICT Labs of the European Institute of Mixed and Augmented Reality (2005). Innovation and Technology. 15. Melchior, F., Sladeczek, C., de Vries, D., and Fr¨ohlich, REFERENCES B. User-dependent optimization of Wave Field Synthesis 1. Berkhout, A. A holographic approach to acoustic reproduction for directive sound fields. In 124th control. Journal of the Audio Engineering Society 36, 12 Convention of the Audio Engineering Society (2008). (1988), 977–995. 16. Mynatt, E. D., Back, M., Want, R., Baer, M., and Ellis, 2. Blauert, J. Spatial Hearing: The Psychophysics of J. B. Designing Audio Aura. In SIGCHI Conference on Human Sound Localization, revised ed. MIT Press, Human Factors in Computing Systems (1998). 1996. 17. Pick, H. L., Warren, D. H., and Hay, J. C. Sensory 3. Brewster, S., Lumsden, J., Bell, M., Hall, M., and conflict in judgments of spatial direction. Perception & Tasker, S. Multimodal ‘eyes-free’ interaction techniques Psychophysics 6, 4 (1969), 203–205. for wearable devices. In SIGCHI Conference on Human 18. Shneiderman, B. The future of interactive systems and Factors in Computing Systems (2003). the emergence of direct manipulation. Behaviour & 4. Brungart, D. S., and Rabinowitz, W. M. Auditory Information Technology 1, 3 (1982), 237–256. localization of nearby sources. Head-related transfer 19. Spors, S. Extension of an analytic secondary source functions. Journal of the Acoustical Socienty of America selection criterion for Wave Field Synthesis. In 123rd 106, 3 (1999), 1465–1479. Convention of the Audio Engineering Society (2007). 5. Daniel, J. Spatial sound encoding including near field 20. Spors, S., Wierstorf, H., Raake, A., Melchior, F., Frank, effect: Introducing distance coding filters and a viable, M., and Zotter, F. Spatial sound with loudspeakers and new Ambisonic format. In 23rd International its perception: A review of the current state. IEEE Conference of the Audio Engineering Society (2003). Proceedings 101, 9 (2013), 1920–1938. 6. de Vries, D. Wave Field Synthesis. AES Monograph. 21. Springer, J. P., Sladeczek, C., Scheffler, M., Hochstrate, Audio Engineering Society, 2009. J., Melchior, F., and Fr¨ohlich, B. Combining Wave Field Synthesis and multi-viewer stereo displays. In IEEE 7. Fohl, W., and Nogalski, M. A gesture control interface Virtual Reality Conference (2006). for a Wave Field Synthesis system. In International Conference on New Interfaces for Musical Expression 22. V¨olk, F., M¨uhlbauer, U., and Fastl, H. Minimum audible (2013). distance (MAD) by the example of Wave Field Synthesis. In German Annual Conference on Acoustics 8. Geier, M., and Spors, S. Spatial audio reproduction with (DAGA) (2012). the SoundScape Renderer. In 27th Tonmeistertagung – VDT International Convention (2012). 23. Wierstorf, H., Raake, A., Geier, M., and Spors, S. Perception of focused sources in Wave Field Synthesis. 9. Hutchins, E. L., Hollan, J. D., and Norman, D. A. Direct Journal of the Audio Engineering Society 61, 1/2 (2013), manipulation interfaces. Human–Computer Interaction 5–16. 1, 4 (1985), 311–338. 24. Yost, W. A., Dye, R. H., and Sheft, S. A simulated 10. Ishii, H., Mazalek, A., and Lee, J. Bottles as a minimal cocktail party with up to three sound sources. Perception interface to access digital information. In SIGCHI & Psychophysics 58 (1996), 1026–1036. Conference on Human Factors in Computing Systems (2001). 25. Zotter, F., and Spors, S. Is sound field control determined at all frequencies? How is it related to 11. Ishii, H., and Ullmer, B. Tangible bits: towards seamless numerical acoustics? In 52nd Conference of the Audio interfaces between people, bits and atoms. In SIGCHI Engineering Society (2013). Conference on Human Factors in Computing Systems (1997). 12. Jackson, C. V. Visual factors in auditory localization. Quarterly Journal of Experimental Psychology 5, 2 (1953), 52–65. 256
US