A Javascript Library for Flexible Visualization of Audio Descriptors

Gerard Roma, CeReNeM, University of Huddersfield ([email protected])
Anna Xambó, Centre for Digital Music, Queen Mary University of London ([email protected])
Owen Green, CeReNeM, University of Huddersfield ([email protected])
Pierre Alexandre Tremblay, CeReNeM, University of Huddersfield ([email protected])

ABSTRACT

Research in audio analysis has provided a large number of ways to describe audio recordings, which can be used to enhance their visual representation in web applications. In this paper we present fav.js, a Javascript library for flexible visualization of audio descriptors. We explain the proposed design and demonstrate its potential for web audio applications through several visualization examples.

1. INTRODUCTION

Decades of research on automatic speech recognition, environmental sound recognition, and particularly music information retrieval (MIR) have contributed to establishing the notion of audio descriptors [13]. Audio descriptors can generally be seen as metadata related to a given audio recording. While descriptors can be obtained in many different ways, a large effort has been devoted to the automatic extraction of descriptors using signal processing techniques. Such descriptors (often known as acoustic features) are typically inspired, albeit sometimes loosely, by current understanding of human perception of sound, or by established concepts in music theory. In this sense, it is common to distinguish between low-level, mid-level and high-level descriptors, depending on how close they are to common language. Many toolboxes are available for automatic extraction [11].

While descriptors produced by automatic analysis have been extensively used in machine learning research, their use for visualization of sound in interactive applications is a promising direction, as evidenced by early work [6]. This direction has unfortunately received relatively little attention. Among other issues, some descriptors may be difficult to understand, or may need further processing or scaling. Some of them may not be relevant in the absence of musical sound or in noisy conditions.

In this paper, we propose a flexible framework for visualization of audio descriptors in web applications. The framework is implemented in a Javascript library that allows processing and combining audio descriptors and drawing them in different styles. Our focus is not real-time visualization but displaying time series of descriptors obtained from existing recordings. While we are interested in the representation of sound collections, we do not focus on layout algorithms for positioning multiple sounds. Our hope is to allow web developers and researchers to experiment with currently available methods for obtaining sound descriptors (both client- and server-side), and use them for visualization in novel web audio prototypes. Potential application areas include audio content creation and distribution, as well as education.

The rest of the paper is organized as follows. In the next section we review existing research related to web-based audio visualization and visualization of audio descriptors in general. We then describe the design of the framework. Finally, we show some examples of visualization with fav.js, and reflect on future work.

2. RELATED WORK

Method chaining is a popular technique for designing concise object-oriented APIs. This pattern can be used to build embedded Domain Specific Languages (DSLs) [8]. This design was popularized in the domain of web-based data visualization by the D3 library [3] and has been followed by many other libraries. A notable example related to our work is the DataToMusic API [16]. Among other things, this library implements several transformations of time series with a focus on sonification.

Visualization of audio and other time series data in web applications has been implemented in the WavesJS library.1 An early version [15] was based on D3, and thus followed the method chaining approach. In [10], the data model was extended to integrate audio playback and interaction.
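The method-chaining style mentioned above can be illustrated with a minimal sketch. The `Series` class and its operators below are hypothetical and do not belong to D3, DataToMusic or fav.js; the point is only that each operator returns an object of the same type, so processing steps compose into a small embedded DSL.

```javascript
// Minimal sketch of method chaining as an embedded DSL.
// "Series" and its operators are hypothetical illustrations,
// not the API of any library discussed in this paper.
class Series {
  constructor(data) {
    this.data = data;
  }
  // Each operator returns a new Series, so calls can be chained.
  scale(k) {
    return new Series(this.data.map(x => x * k));
  }
  offset(c) {
    return new Series(this.data.map(x => x + c));
  }
  threshold(t) {
    return new Series(this.data.map(x => (x > t ? 1 : 0)));
  }
}

// A whole processing route reads as a single expression:
const s = new Series([0.1, 0.5, 0.9]).scale(2).offset(-0.5).threshold(0.5);
```

Because every operator is pure and returns a fresh object, intermediate results can also be stored and reused, which is how the Figure 5 example below branches off `partRMS` twice.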
The general library includes many useful components for web audio development, whilst waves-ui and waves-blocks expose the functionality for interactive visualizations. The semantics of these components mostly focus on the configuration of layers, including interaction. As in D3, WavesJS visualizations leverage Scalable Vector Graphics (SVG) and the Document Object Model (DOM), which allow for complex application-oriented data models and abstractions. Our approach is different in that we focus on the specific task of rendering time series, which allows us to use the HTML canvas. At the same time, our system aims to help visualize arbitrary audio descriptors, and thus offers a variety of drawing functions. As an example, two-dimensional features such as spectrograms can be rendered as bitmap images in the canvas.

Outside the web platform, the visualization of audio descriptors has been a common topic of research within the area of MIR, as discussed in [12], where a range of techniques for feature extraction (from either symbolic or audio representations) is presented, and common issues (e.g. forms, formats, dimensions of music) are discussed. Analyzing audio recordings has also become an important task in musicology, where automatic audio analysis has helped to gain a better aural understanding [5].

One of the best known desktop applications for visualizing audio descriptors is Sonic Visualiser [4]. This program allows configuring several panes and layers with waveform and spectrogram displays. Many audio descriptors can be displayed in different layers thanks to vamp plugins.2 EAnalysis [7] is a similar application that provides several pre-configured templates for laying out the visualization. Descriptors can also be obtained using vamp plugins. Both programs enable the use of audio descriptors in specific applications such as musicological analysis. By providing an open-ended Javascript library, as opposed to a user-level application, we hope to enable their use in new applications related to audio and music.

Computing the descriptors themselves is, however, beyond the scope of our library. Our goal is to leverage existing and future efforts for this task. At the time of writing, two libraries have been presented for client-based feature extraction in Javascript: JS-Xtract [9] and Meyda [14]. C++ libraries such as Essentia [2] can be compiled to Javascript via Emscripten. We used JS-Xtract in our examples. Given the computational cost, another possibility is to compute the descriptors on the server. For example, the Freesound API3 provides access to descriptors computed for any sound in the Freesound database. While the default setting returns statistics of the descriptor time series, obtaining the full series is also possible.

Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: owner/author(s). Web Audio Conference WAC-2018, September 19–21, 2018, Berlin, Germany. © 2018 Copyright held by the owner/author(s).

1 https://github.com/wavesjs
2 https://www.vamp-plugins.org
3 https://freesound.org/docs/api

3. DESIGN

The design of the library stems from our research on manipulating large collections of audio for music creation. In this context, we hope that improving on existing tools for interactive audio analysis can provide new opportunities for creative segmentation and manipulation of sounds. At the same time, visualization of audio descriptors has potential for applications to browsing audio in a broader sense. We hope to facilitate making results from signal processing research available to musicians, creative coders and web developers, who may all have varying dispositions towards technology. In this sense, our design priority is ease of use, as well as easy integration with other tools. Thus, we focused on a lightweight codebase with no dependencies.

The library implements an internal DSL through method chaining. The main goal is to offer a concise way to express a route from a Signal object to a Display object (Figure 1). A Signal is basically a wrapper around one or several Float32Array objects, along with a sample rate and a type identifier. The type identifier distinguishes between binary, integer or real signals, but for simplicity these are always encoded as floats. Each Float32Array is assumed to vary in time according to the sample rate, and two-dimensional signals are represented by an Array of Float32Array objects (e.g. frequency channels). Operators modify each element in the signal, as is common in array languages and scientific computing environments. A summary of currently implemented functionality can be seen in Table 1. While the system could be used for visualization of other kinds of time series, the choice of operators and drawing functions is motivated by our focus on audio descriptors.

Feature extraction -> Signal -> OP1 -> OP2 -> ... -> Display

Figure 1: Visualization process

Table 1: Summary of functionality

Signal, unary ops.: threshold, slice, norm, offset, log, square, pow, exp, sqrt, abs, scale, diff, delay, smooth, schmitt, sample, draw, slide, reflect
Signal, binary ops.: add, subtract, multiply, over, and, or, xor
Display, drawing functions: wave, line, fill, range, image, errorbar

We define a number of unary operators to allow transformations of a signal, such as scaling, smoothing, thresholding, slicing or applying simple mathematical operations. A special unary operator is "sample", which upsamples or downsamples the Signal. This is obviously an important part of the visualization process, as the desired width of the display will only rarely coincide with the length of the signal. The resampling process does not follow common audio resampling techniques, as the goal is to efficiently produce a visual representation. In order to accommodate non-integer ratios, the descriptor is split into subsequences of potentially different sizes. Then several statistics (e.g. mean, median, standard deviation) can be computed for each subsequence. The same statistics are available for smoothing without resampling. For upsampling, the mean statistic is used, so the signal is linearly interpolated.

Binary operators allow combining two Signals by sample-wise arithmetic operations, including boolean arithmetic. This may be useful when one descriptor is not useful in parts of the sequence. For instance, a measure of pitch confidence, or an amplitude measure, can be used to select when a pitch descriptor is displayed. Descriptors from different channels of a recording can be combined to visualize the spatial image. Binary operators are mostly intended for combining descriptors of the same sound, so two Signals of the same length are required.

Finally, a Display is a container for several Layer objects. For an efficient notation, Layers can be accessed using array operators. A Display owns a DOM container element, and attaches an HTML canvas for each Layer. All Layers in a Display share the same position and dimensions, which are specified in the constructor. If the width is not specified, it will be determined for all Layers the first time a Signal is drawn. Otherwise, the sample operator is used internally to scale the Signal to the desired width. Different Layer types can be used for different kinds of drawings, some restricted to particular dimensions. Available Layer types are "wave", for drawing waveforms; "line", "errorbar" and "fill", for unipolar time series; and "image", for two-dimensional signals such as spectrograms. For one-dimensional signals, a second signal can be provided to control the color. In this case, as well as with two-dimensional signals, the color mapping can be controlled using hue, saturation and lightness (HSL), which is available for the HTML5 canvas in all major browsers. Each of the three parameters allows for intuitive visual mappings; however, for simplicity we focus on lightness, allowing the user to specify the hue. Drawing is triggered by an operator on the Signal, which potentially results in several operations. Display objects offer minimal interaction capabilities, allowing the developer to attach a function to click and drag events. Unlike other frameworks such as WavesJS, the goal for fav.js is to focus on transformation and visualization. This obviously does not preclude the development of interactions such as zooming or scrolling.

4. EXAMPLES

In this section we show some examples of the utilization of the library. All examples assume an "audio" array that has been obtained by decoding a buffer, and a "getSignal" function that returns a Signal object with some descriptor. Descriptors were computed using JS-Xtract but, as long as a Float32Array and a sample rate can be provided, they can be obtained in many other ways. The sample rate for the loaded audio is obtained from the AudioContext.

One example of descriptor-based visualization is provided by the Freesound project [1], where the waveform is colored by the spectral centroid.

let sc = getSignal(audio, "spec_centroid");
let wave = new fav.Signal(audio, sampleRate);
let display = new fav.Display("container", "wave", 800, 200);
wave.draw(display, [sc.smooth(20)
                      .normalize()
                      .scale(360),
                    70, 50]);

Figure 2: Waveform coloring with Spectral Centroid

let rms = getSignal(audio, "rms");
let wave = new fav.Signal(audio, sampleRate);
let display = new fav.Display("container", "wave", 800, 200);
wave.draw(display, [237, 100, rms.normalize()
                                 .reflect()
                                 .scale(70)
                                 .offset(30)]);

Figure 3: Waveform luminance coloring using RMS
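As an aside, the display-oriented resampling described in the design section (splitting the descriptor into subsequences, one per output point, and reducing each with a statistic) can be sketched as follows. This is a standalone illustration of the idea for the downsampling case with the mean statistic, not the library's actual code; `resampleMean` is a hypothetical name.

```javascript
// Reduce a descriptor to a target display width by splitting it into
// proportional buckets and taking the mean of each bucket. Buckets may
// differ in size, which accommodates non-integer ratios.
// Hypothetical sketch; not fav.js's actual "sample" implementation.
function resampleMean(data, width) {
  const out = new Float32Array(width);
  for (let i = 0; i < width; i++) {
    // Proportional bucket boundaries, always at least one sample wide.
    const start = Math.floor((i * data.length) / width);
    const end = Math.max(start + 1, Math.floor(((i + 1) * data.length) / width));
    let sum = 0;
    for (let j = start; j < end; j++) sum += data[j];
    out[i] = sum / (end - start);
  }
  return out;
}

// A 6-sample descriptor reduced to 3 points: buckets [0,1], [2,3], [4,5].
const px = resampleMean(Float32Array.from([0, 1, 2, 3, 4, 5]), 3);
```

Other statistics (median, standard deviation) would follow the same bucketing scheme with a different reduction, which is also how the errorbar display in Figure 6 can summarize a whole descriptor sequence.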
The idea can be traced back to Timbregrams, described in [6]. With the proposed framework, any descriptor (or combination of descriptors) can be used to color waveforms. Figure 2 shows an example using the spectral centroid with a drum pattern. The descriptor is mapped to the hue, which has a range of 360 degrees. As a result, each drum instrument gets a different color. Another combination is shown in Figure 3. Here the root mean square (RMS) amplitude descriptor is used to control the lightness. The information is somewhat redundant with the waveform, so the visual effect reinforces the decay of the notes.

Another common application is segmentation. Thresholding the RMS signal yields a binary signal that can be used to visually identify sound objects. In Figure 4, this technique is applied to color a grayscale spectrogram.

let spgm = getSignal(audio, "spectrum");
let rms = getSignal(audio, "rms");
let display = new fav.Display("container", "image", 800, 200);
display.addLayer("fill");
spgm.log()
    .normalize()
    .draw(display[0]);
rms.smooth(20)
   .threshold(0.05)
   .draw(display[1], "rgba(100,0,0,0.3)");

Figure 4: Spectrogram with basic RMS thresholding

On the other hand, the ability to transform and combine descriptors makes it possible to use the library interactively to develop more complex forms of object selection in order to obtain better accuracy. An example is shown in Figure 5. We show only part of the code to save space; the full example can be obtained with the library code. Here, the slice operator is used to zoom into the signal in a second display and observe the effect of the different operations. The original RMS is shown in the pale blue shade, and the first-order derivative in the dark blue line. The smoothed derivative is thresholded, then combined with the original RMS and thresholded again. The resulting signal is able to segment the decay of a snare drum sound.

let part = wave.slice(0, 0.5, "seconds");
part.draw(displayZoom, "black");
let partRMS = rmsM.slice(0, 0.5, "seconds");
partRMS
  .draw(displayZoom, "rgba(9, 123, 255, 0.4)", 1)
  .diff().normalize().slide(1, 10)
  .draw(displayZoom, "rgba(0, 0, 255, 0.7)", 2)
  .schmitt(0.6, 0.3).or(partRMS.slide(1, 10).threshold(thresh))
  .draw(displayZoom, "rgba(0, 0, 0, 0.1)", 3);

Figure 5: Custom RMS-based segmentation

Finally, Figure 6 shows an experimental example that uses errorbar layers to illustrate a selection of related, yet eclectic, sounds of bowed cardboard. Several descriptors (spectral centroid, zero-crossing rate, spectral skew and RMS energy) are used as a visual fingerprint for assessing the relatedness of somewhat disparate sounds from the same corpus. It is interesting to see that features which are strongly correlated in one sound may not be in another.

let zcr = getSignal(audio, "zcr");
let rms = getSignal(audio, "rms");
let centroid = getSignal(audio, "spectral_centroid");
let skew = getSignal(audio, "spectral_skewness");
let wave = new fav.Signal(audio, sampleRate);
let display = new fav.Display(container, "errorbar", 300, 225);
for (let i = 0; i < 3; i++) display.addLayer("errorbar");
zcr.draw(display[0], "rgba(255,9,123,0.4)");
rms.draw(display[1], "rgba(123,9,255,0.4)");
centroid.draw(display[2], "rgba(255,140,0,0.4)");
skew.draw(display[3], "rgba(0,255,255,0.3)");

Figure 6: Multiple errorbar patterns

5. CONCLUSIONS

The combination of existing signal processing techniques for the description of sound and music with the visualization power of current web technologies creates a great opportunity for interactive web audio applications. In this paper we have proposed a framework to make this possible, implemented in a lightweight Javascript library. We have shown some examples of potential applications. We plan to continue this work by experimenting with other sources of audio descriptors. Also, since the Javascript language is available in the Max/MSP environment for user interface graphics, we plan to adapt our library to this environment. The library can be obtained from https://github.com/flucoma/fav.js.

6. ACKNOWLEDGEMENT

This research was part of the Fluid Corpus Manipulation project (FluCoMa), which has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 725899).

7. REFERENCES

[1] V. Akkermans, F. Font Corbera, J. Funollet, B. De Jong, G. Roma Trepat, S. Togias, and X. Serra. Freesound 2: An Improved Platform for Sharing Audio Clips. In Proceedings of the 12th Conference of the International Society for Music Information Retrieval (ISMIR), 2011.
[2] D. Bogdanov, N. Wack, E. Gómez Gutiérrez, S. Gulati, P. Herrera Boyer, O. Mayor, G. Roma Trepat, J. Salamon, J. R. Zapata González, and X. Serra. Essentia: An Audio Analysis Library for Music Information Retrieval. In Proceedings of the 14th Conference of the International Society for Music Information Retrieval (ISMIR), 2013.
[3] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-Driven Documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309, 2011.
[4] C. Cannam, C. Landone, and M. Sandler. Sonic Visualiser: An Open Source Application for Viewing, Analysing, and Annotating Music Audio Files. In Proceedings of the 18th ACM International Conference on Multimedia, 2010.
[5] N. Cook. Methods for Analysing Recordings. In The Cambridge Companion to Recorded Music, pages 221–245. Cambridge University Press, 2009.
[6] M. Cooper, J. Foote, E. Pampalk, and G. Tzanetakis. Visualization in Audio-Based Music Information Retrieval. Computer Music Journal, 30(2):42–62, 2006.
[7] P. Couprie. EAnalysis : Aide à l'Analyse de la Musique Électroacoustique. In Actes des Journées d'Informatique Musicale, 2012.
[8] M. Fowler. Domain-Specific Languages. Addison-Wesley Professional, 2010.
[9] N. Jillings, J. Bullock, and R. Stables. JS-Xtract: A Realtime Audio Feature Extraction Library for the Web. In Proceedings of the 17th Conference of the International Society for Music Information Retrieval (ISMIR), 2016.
[10] B. Matuszewski, N. Schnell, and S. Goldszmidt. Interactive Audiovisual Rendering of Recorded Audio and Related Data with the WavesJS Building Blocks. In Proceedings of the 2nd Web Audio Conference (WAC), 2016.
[11] D. Moffat, D. Ronan, J. D. Reiss, et al. An Evaluation of Audio Feature Extraction Toolboxes. In Proceedings of the 18th International Conference on Digital Audio Effects, 2015.
[12] N. Orio. Music Retrieval: A Tutorial and Review. Foundations and Trends in Information Retrieval, 1(1):1–90, 2006.
[13] G. Peeters. A Large Set of Audio Features for Sound Description (Similarity and Classification) in the CUIDADO Project. Technical report, IRCAM, 2004.
[14] H. Rawlinson, N. Segal, and J. Fiala. Meyda: An Audio Feature Extraction Library for the Web Audio API. In Proceedings of the 1st Web Audio Conference (WAC), 2015.
[15] V. Saiz, B. Matuszewski, and S. Goldszmidt. Audio Oriented UI Components for the Web Platform. In Proceedings of the 1st Web Audio Conference (WAC), 2015.
[16] T. Tsuchiya, J. Freeman, and L. W. Lerner. Data-to-Music API: Real-time Data-Agnostic Sonification with Musical Structure Models. In Proceedings of the 21st International Conference on Auditory Display (ICAD), 2015.