Live Sports Event Detection Based on Broadcast Video and Web-casting Text

Changsheng Xu, Jinjun Wang, Kongwah Wan, Yiqun Li and Lingyu Duan
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
{xucs, stuwj2, kongwah, yqli, lingyu}@i2r.a-star.edu.sg

ABSTRACT
Event detection is essential for sports video summarization, indexing and retrieval, and extensive research efforts have been devoted to this area. However, previous approaches rely heavily on the video content itself and require the whole video content for event detection. Due to the semantic gap between low-level features and high-level events, it is difficult to come up with a generic framework that achieves high event detection accuracy. In addition, the dynamic structures of different sports domains further complicate the analysis and impede the implementation of live event detection systems. In this paper, we present a novel approach for event detection from live sports games using web-casting text and broadcast video. Web-casting text is a text broadcast source for sports games and can be captured live from the web. Incorporating web-casting text into sports video analysis significantly improves event detection accuracy. Compared with previous approaches, the proposed approach is able to: (1) detect events live, based only on the partial content captured so far from the web and TV; (2) extract detailed event semantics and detect exact event boundaries, which are very difficult or impossible to handle with previous approaches; and (3) create personalized summaries related to a certain event, player or team according to the user's preference. We present the framework of our approach and the details of text analysis, video analysis and text/video alignment. We conducted experiments on both live games and recorded games. The results are encouraging and comparable to manually detected events. We also give scenarios to illustrate how to apply the proposed solution to professional and consumer services.

1. INTRODUCTION
With the proliferation of sports content broadcasting, sports fans often find themselves unable to watch live games for reasons such as region and time difference. Usually, only a small portion of a sports game is exciting and highlight-worthy for most of the audience. Therefore, the ability to access (especially in a live manner) events/highlights from lengthy and voluminous sports video programs, and to skip the less interesting parts of the videos, is of great value and highly demanded by the audience. However, current event/highlight generation from sports video is very labor-intensive and inflexible. One limitation is that the events/highlights are determined and generated manually by studio professionals in a traditional one-to-many broadcast mode, which may not meet the appetites of audiences who are only interested in the events related to a certain player or team. Another limitation is that these events/highlights are usually only accessible to the audience during the section breaks (e.g. the half-time of a soccer game). Clearly, with the advent of mobile devices and the need for instant gratification, it would be helpful for sports fans who are unable to watch the live broadcast to nonetheless be kept updated on the live proceedings of the game through personalized events/highlights. Therefore, the availability of automatic tools to detect and generate personalized events from broadcast sports videos live, and to send the generated events to users' mobile devices live, will not only improve production efficiency for broadcast professionals but also provide better game viewership for sports fans. This trend necessitates the development of automatic event detection from live sports games.
In this paper, we present a generic framework and methodology to automatically detect events from live sports videos. In particular, we use soccer video as our initial target because soccer is not only a globally popular sport but also presents many challenges for video analysis due to its loose and dynamic structure compared with other sports such as tennis. Our framework is generic and can be extended to other sports domains. We also discuss scenarios for deploying the proposed solution onto various devices.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – abstract methods, indexing methods.

General Terms
Algorithms, Measurement, Performance, Experimentation.

Keywords
Event Detection, Broadcast Video, Web-casting Text.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MM'06, October 23–27, 2006, Santa Barbara, California, USA.
Copyright 2006 ACM 1-59593-447-2/06/0010…$5.00.

1.1 Related Work
Extensive research efforts have been devoted to sports video event detection in recent years. The existing approaches can be classified into event detection based on video content only and event detection based on external sources.

1.1.1 Event detection based on video content only
Most of the previous work on event detection in sports video is based on audio/visual/textual features directly extracted from the video content itself. The basic idea of these approaches is to use low-level or mid-level audio/visual/textual features and rule-based or statistical learning algorithms to detect events in sports video. These approaches can be further classified into single-modality based approaches and multi-modality based approaches. Single-modality based approaches use only a single stream in sports video for event detection. For example, audio features were used for baseball highlight detection [1] and soccer event detection [2]; visual features were used for soccer event detection [3][4]; and textual features (caption text overlaid on the video) were utilized for event detection in baseball [5] and soccer [6] videos. Single-modality based approaches have a low computational load and thus can achieve real-time performance in event detection [7][8], but their event detection accuracy is low: broadcast sports video is an integration of different multi-modal information, and a single modality cannot fully characterize the events in sports video.

In order to improve the robustness of event detection, multi-modality based approaches were utilized for sports video analysis. For example, audio/visual features were utilized for event detection in tennis [9], soccer [10] and basketball [11]; and audio/visual/textual features were utilized for event detection in baseball [12], basketball [13] and soccer [14]. Compared with single-modality based approaches, multi-modality based approaches are able to obtain better event detection accuracy but have a high computational cost, so it is difficult for them to achieve real-time performance.

Both single-modality and multi-modality based approaches rely heavily on audio/visual/textual features directly extracted from the video itself. Due to the semantic gap between low-level features and high-level events, as well as the dynamic structures of different sports games, it is difficult to use these approaches to address the following challenges: (1) ideal event detection accuracy (~100%); (2) extraction of the event semantics, e.g. who scores the goal and how the goal is scored for a "goal" event in soccer video; (3) detection of the exact event boundaries; (4) generation of personalized summaries based on a certain event, player or team; (5) a generic event detection framework for different sports games; and (6) robust performance as the test dataset grows and on live videos. In order to address these challenges, we have to seek available external sources for help.

1.1.2 Event detection based on external sources
Currently there are two external sources that can be used for sports video analysis: closed caption and the web. Both are text sources. Incorporating text into sports video analysis helps bridge the semantic gap between low-level features and high-level events, and thus facilitates sports video semantic analysis.

Closed caption is a manually tagged transcript from speech to text and is encoded into the video signal. It can be used separately to identify semantic event segments in sports video [15], or combined with other features (audio/visual) for sports video semantic analysis [16][17]. Since closed caption is a direct transcript from speech to text, it contains a lot of information irrelevant to the game and lacks a well-defined structure. Moreover, closed caption is currently only available for certain sports videos and in certain countries.

In addition to closed caption, some researchers attempt to use information on the web to assist sports video analysis. Xu et al. [18][19] proposed an approach that utilizes match reports and game logs obtained from the web to assist event detection in soccer video. They still used audio/visual features extracted from the video itself to detect some events (e.g. goal), while for the events that are very difficult or impossible to detect using audio/visual features, they used text from the match report and game log. The text events and video events were fused based on the sports video structure using rule-based, aggregation and Bayesian inference schemes. Since some events were still detected using audio/visual features, the accuracy is much lower than event detection using text, and the proposed event detection model is also difficult to apply to other sports domains. Moreover, the framework has to structure the whole video into phases (Break, Draw, Attack) before event detection, hence it is not able to achieve live event detection.

1.2 Our Contribution
In this paper, we present a novel approach for semantic event detection from live sports games based on analysis and alignment of web-casting text and broadcast sports video. Compared with previous approaches, the contributions of our approach include:

(1) We propose a generic framework combining analysis and alignment of web-casting text and broadcast sports video for event detection from live sports games. In particular, the incorporation of web-casting text significantly improves event detection accuracy and helps extract event semantics.

(2) We propose novel approaches for game start detection and game time recognition from live broadcast sports video, which enable exact matching between the time tag of an event in the web-casting text and the event moment in the broadcast video. The previous methods [18][19] assumed that the time tag of an event in the web-casting text corresponded to the video time, which is not true in broadcast video and will bias the text/video alignment.

(3) We propose a robust approach based on a finite state machine to align the web-casting text and broadcast sports video and detect the exact event boundaries in the video.

(4) We propose several scenarios to illustrate how to deploy the proposed solution in current professional and consumer services.

The rest of the paper is organized as follows. The framework of the proposed approach is described in Section 2. The technical details of text analysis, video analysis and text/video alignment are presented in Sections 3, 4 and 5 respectively. Experimental results are reported in Section 6. Potential applications of the proposed solution are discussed in Section 7. We conclude the paper with future work in Section 8.
2. FRAMEWORK
The framework of our proposed approach is illustrated in Figure 1. The framework contains four live modules: live text/video capturing, live text analysis, live video analysis, and live text/video alignment. The live text/video capturing module captures the web-casting text from the web and the broadcast video from TV. Then, for the captured text and video, the live text analysis module detects the text event and formulates the detected event with proper semantics; the live video analysis module detects the game start point and the game time by recognizing the clock digits overlaid on the video. Based on the detected text event and the recognized game time, the live text/video alignment module detects the event with exact boundaries in the video. This is done by defining a video segment that contains the event moment, structuring the video segment, and detecting the start and end boundaries of the event in the video. The detected video events with text semantics can be sent to different devices based on users' preferences. The proposed framework is generic and can be used for different sports domains. The technical details of each module are described in the following sections.

Figure 1. Framework of proposed approach (live capture of web-casting text and broadcast video; live text analysis with text event detection and formulation; live video analysis with game start detection and game time recognition; and text/video alignment with event moment and event boundary detection, assisted by domain knowledge)

3. TEXT ANALYSIS
There are two external text sources that can be used to assist sports video analysis: closed caption (Figure 2) and web-casting text (Figure 3 and Figure 4). Compared with closed caption, which is a transcript from speech to text and only available for certain sports games and in certain countries, the content of web-casting text is more focused on the events of sports games, has a well-defined structure, and is available on many sports websites [20,21]. Therefore we choose web-casting text as the external source in our approach. An event in the web-casting text contains information such as the time of the event moment, the development of the event, and the players and team involved in the event, which is very difficult to obtain directly from the broadcast video using previous approaches; it therefore greatly helps the detection of events and event semantics from a live sports game. The live capturing and analysis of web-casting text are discussed in the following subsections.

Figure 2. Closed caption [16]
Figure 3. Web-casting text (Flash) [21]
Figure 4. Web-casting text (HTML) [20]

3.1 Text Capturing
The web-casting text [20,21] serves as a text broadcast for live sports games. The live text describes an event that happened in the game with a time stamp and is updated every few minutes. The first step of text analysis is to capture the text live from the web in either HTML [20] or Flash [21] format.
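Concretely, the capture step described above can be sketched as a parse-and-diff cycle: each polled snapshot is parsed into time-stamped lines, and only the lines not seen in the previous snapshot are emitted as new text events. This is a minimal illustration of ours, not the paper's code; the time-stamp line format, the regular expression and all function names are assumptions.

```python
import re

# Assumed line format for illustration, e.g. "23' Goal scored by ..."
EVENT_LINE = re.compile(r"^(\d+)'\s+(.*)$")

def parse_snapshot(text):
    """Extract (minute, description) pairs from one fetched snapshot."""
    events = []
    for line in text.splitlines():
        m = EVENT_LINE.match(line.strip())
        if m:
            events.append((int(m.group(1)), m.group(2)))
    return events

def diff_events(previous, current):
    """Return the events present in the current snapshot but not the previous one."""
    seen = set(previous)
    return [e for e in current if e not in seen]

# Simulated polling: two successive snapshots of the live text page.
snap1 = "12' Foul by Smith\n23' Goal scored by Jones"
snap2 = "12' Foul by Smith\n23' Goal scored by Jones\n31' Yellow card, Brown"

new_events = diff_events(parse_snapshot(snap1), parse_snapshot(snap2))
print(new_events)  # [(31, 'Yellow card, Brown')]
```

In a live deployment the two snapshots would come from repeated HTTP requests to the live text server rather than from string literals.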
Figure 5 illustrates the process of text capturing and extraction, which can be summarized in the following steps: (1) our program keeps sending requests to the website server at regular intervals to get the HTML/Flash file; (2) the text describing the game events is extracted using a rule-based keyword matching method; and (3) the program checks for differences or updates in the text events between the current file and the previous one, extracts any new text event, and adds it to the event database.

Figure 5. Text capturing and event extraction (the program repeatedly requests the HTML file from the live text server, extracts the event text, and adds newly appearing events to the event database)

3.2 Text Event Detection
The types of events differ across sports, but the number of event types for each sport is limited. Therefore, in order to detect the events of a certain sport, we need to construct a database that contains all the event types for that sport. For example, for soccer we select the event types listed in Table 1 for event detection. This may not cover all the event types in a soccer game, but the database is extensible and the selected events are interesting to most soccer fans.

To detect these events from the captured web-casting text, we observed from our database that each type of sports event features one or several unique nouns, which we define as the keywords related to that event. This is because the web-casting text is tagged by sports professionals and has a fixed structure. Hence, by detecting these keywords, the relevant event can be recognized. We have also observed that some keywords may correspond to different events; for example, goal and goal kick are two different events if goal is defined as a keyword. In our approach, we therefore also conduct context analysis before and after the keywords to eliminate false alarms in event detection. Usually, the sources that provide web-casting text can be classified into two groups: one with a well-defined syntax structure [20] and the other with freestyle text [24]. Our approach uses the well-structured web-casting text; we also present the freestyle web-casting text for comparison. To achieve accurate text event detection, we give different sets of keyword definitions for the well-structured web-casting text (Table 1) and the freestyle web-casting text (Table 2). Such definitions are extendable.

Table 1. Keyword definition for well-structured web text
  Goal: goal, scored          Red card: dismissed, sent off
  Shot: shot, header          Yellow card: booked, booking
  Save: save, blocked         Foul: foul
  Offside: offside            Free kick: free kick, free-kick
  Corner: corner kick         Sub.: substitution, replaced

Table 2. Keyword definition for freestyle web text
  Goal: g-o-a-l or scores or goal or equalize
  Card: "yellow card" or "red card" or "yellowcard" or "redcard" or "yellow-card" or "red-card"
  Foul: (commits or by or booked or ruled or yellow) w/5 foul
  Offside: (flag or adjudge or rule) w/4 (offside or "off side" or "off-side")
  Save: (make or produce or bring or dash or "pull off") w/5 save
  Injury: injury and not "injury time" (the phrase "injury time" is not included in the defined keywords)
  Free kick: (take or save or concede or deliver or fire or curl) w/6 ("free-kick" or "free kick" or freekick)
  Sub.: substitution

Once proper keywords are defined, an event can be detected by finding the sentences that contain the relevant keyword and analyzing the context information before and after the keyword. A simple keyword-based text search technique is enough for our task. For a detected text event, we record the type of the event and the players/team involved in the event for personalized summarization. The time stamp of each detected event is also logged; it is used by our text/video alignment module for video event boundary detection, as discussed in Subsection 5.2.

4. VIDEO ANALYSIS
From the detected text event, we can obtain the time stamp that indicates when the event occurs in the game. To detect the same event in the video, we have to know the event moment in the video. An intuitive way is to directly link the time stamp in the text event to the video time, but this does not hold for live broadcast video, where the game start time differs from the video start (broadcasting) time. Therefore, in order to detect the event moment in the video, we need to know the start point of the game in the video. We propose to combine two approaches to detect the game time: game start detection, and game time recognition from the digital clock overlaid on the video. Game start detection detects the start point of the game in the video and uses it as a reference point to infer the game time. Game time recognition detects the digital clock overlaid on the video and recognizes the time from the video clock. We combine these two schemes rather than use just one of them for the following reasons: (1) game start detection is suitable for games without clock stopping, such as soccer, but cannot work for games whose clock stops during play, such as basketball, while game time recognition works for both kinds of games; (2) sometimes the appearance of the digital clock in the video (especially for soccer) is delayed by 20 seconds to a minute after the game has started, so game start detection helps identify events that occur before the digital clock appears; and (3) the two schemes can verify each other to improve the accuracy.

[…] with the numeric characters on the clock. Obviously we cannot expect good reliability and accuracy from such a method. In our approach, we first locate the clock digits, so that only the numeric characters on the video clock are required to be recognized.
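The time inference described above, using the detected game start point as a reference, can be sketched as follows. This is our own illustration, not the paper's code; it assumes a clock that never stops, as in soccer, and ignores half-time offsets and clock-stopping sports, which would need extra handling.

```python
def text_time_to_video_seconds(minute, second, game_start_in_video):
    """Map a web-casting time stamp (game clock) to a position in the
    broadcast video, given the video time in seconds at which the game
    actually kicked off (from game start detection or the overlaid clock)."""
    return game_start_in_video + minute * 60 + second

# If the first half kicks off 17 min 25 s into the broadcast, an event
# logged at 23' 10" of game time lies at this position in the video:
pos = text_time_to_video_seconds(23, 10, 17 * 60 + 25)
print(pos)  # 2435 (seconds into the video)
```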
These numeric characters are uniform in font, size and color. Hence, the recognition accuracy is improved and the whole recognition process is simplified.

Figure 8. Overlaid video clock (clock digits region)

Since the clock digits change periodically, for each character ROI we observe its TNPS sequence, which is defined as follows:

  S(n) = Σ_{(x,y)∈I} B_{n-1}(x, y) ⊗ B_n(x, y)    (2)

where B(x, y) is the binarized image pixel value at position (x, y), n is the frame sequence number, I is the character region, and ⊗ is the XOR operation. S(n) shows the pattern change. If the pattern change is regular, the character is considered a clock digit character. For example, if the pattern changes once per second, it should be the SECOND digit pattern. In this way, the SECOND digit position is located. The other digits can then be located relative to the SECOND digit position.

4.2.2 Clock digits recognition
After the clock digits are located, we observe the TEN-SECOND digit pattern change using the TNPS. At the time the pattern change happens, we extract the pattern of "0" from the SECOND digit ROI. At the next second we extract the pattern of "1", then "2", "3", "4", and so on. Therefore, all the numeric digits from 0 to 9 are extracted automatically. Since the extracted digits may vary over time due to the low quality of the video, we may extract a few patterns for the same digit character. These sample digit patterns are used to recognize all 4 digits on the clock. Some sample digit patterns are shown in Figure 9.

Figure 9. Sample digit patterns

After the templates for each digit character from "0" to "9" are collected, every clock digit in every decoded frame is matched against the templates. The matching score of numeric character i is calculated as follows:

  S(i) = min_j { Σ_{(x,y)∈I} T_ij(x, y) ⊗ D(x, y) },  i = 0, 1, …, 9, 10    (3)

where T_ij(x, y) is the binarized image pixel value at position (x, y) of the jth template of numeric character i, D(x, y) is the binarized image pixel value at position (x, y) of the digit character to be recognized, and I is the ROI of the digit character. When i = 10, T_10j(x, y) is the template for a flat region without any character. The clock digits on every frame are recognized when a best match is found. The details of game time recognition can be found in [26].

5. TEXT/VIDEO ALIGNMENT
Once we know the game time, we can detect the event moment in the video by linking the time stamp in the text event to the game time in the video. However, an event should be a video segment that exhibits the whole process of the event (e.g. how the event developed, the players involved, the reaction of the players to the event, etc.) rather than just a moment. Therefore, in addition to the event moment, we also need to detect the start and end boundaries of the event to formulate a complete event. In this section, we present a novel approach to detect the event boundary live using video structure analysis and a finite state machine.

5.1 Feature Extraction
Based on the detected event moment in the video, we define a temporal range containing the event moment and detect the event boundary within this range. This is due to the following considerations: (1) since we are dealing with live video, we are only able to analyze the video close to the event moment; and (2) the event structure follows certain temporal patterns due to the production rules of sports game broadcasting. Thus, we first extract some generic features from the video and use them to model the event structure. The feature extraction is conducted in real time.

5.1.1 Shot boundary detection
The video broadcasting of a sports game generally adheres to a set of studio production rules. For example, hard-cut shot transitions are used during play to depict the fast-paced game action, while gradual transitions such as dissolves are used during game breaks or lulls. Most broadcasters also use flying logo wipes during game replays. The logo may be an emblem image of the tournament or of the teams/clubs/national flag. Game replays are important visual cues for significant game moments. Typically a few replay shots are shown in slow motion between the flying logo wipe transitions, and the shot transitions between these slow-motion shots are usually dissolves. This leads us to perform a rudimentary structure analysis of the video to detect these shot boundaries and their transition types. The basic idea is to locate clusters of successive gradual transitions in the video as candidate segment boundaries for significant game moments.

Detecting hard-cut shot changes is relatively easier than detecting gradual shot changes. For hard-cut detection, we compute the mean absolute difference (MAD) of gray-level pixels between successive frames and use an adaptive threshold to decide the frame boundaries of abrupt shot changes. To handle gradual shot changes, we additionally compute multiple pair-wise MADs. Specifically, for each frame k, we calculate its pair-wise MAD with frame k-s, where s = 7, 14, 20 is set empirically. Hence, we buffer about 1 second's worth of video frames and maintain 3 MAD profiles. Figure 10 shows an example of the MAD profiles. Shot changes are usually areas where all MAD values show significant changes (deltas) in the same direction; that is, they either increase or decrease simultaneously.

In spite of this, we still observe a fair number of false positives. These usually occur during a close-up shot of a moving player amidst complex background clutter. Other causes of false positives include foreground occlusion and fast camera movement. They can be reduced by applying rules that compare the detected shot with its adjacent shots.
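The multi-lag MAD computation above can be sketched as follows. This is a simplified illustration of ours: frames are flat gray-level lists rather than decoded images, the lags follow the values stated above, and the adaptive threshold is omitted.

```python
def mad(frame_a, frame_b):
    """Mean absolute difference between two equal-sized gray-level frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def mad_profiles(frames, lags=(7, 14, 20)):
    """For each frame k, the pair-wise MAD with frame k-s for each lag s."""
    profiles = {s: [] for s in lags}
    for k in range(len(frames)):
        for s in lags:
            if k >= s:
                profiles[s].append(mad(frames[k], frames[k - s]))
    return profiles

# Tiny synthetic sequence: 25 dark frames, then 10 bright frames.
frames = [[0, 0, 0, 0]] * 25 + [[100, 100, 100, 100]] * 10
profiles = mad_profiles(frames)
# The hard cut at frame 25 produces a simultaneous jump in every profile:
print(all(profiles[s][25 - s] == 100.0 for s in (7, 14, 20)))  # True
```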
Figure 10. Simultaneous deltas in the MAD

5.1.2 Shot classification
With the obtained shot boundaries, shot classification is conducted using a majority vote over the frame view types (Subsection 4.1) identified within a single shot. Since we have two frame view types, we accordingly produce two types of shot, the far-view shot and the non-far-view shot. We also log the start and end boundary type of each shot, i.e. a hard-cut boundary or a dissolve boundary, to generate an R^3 shot classification sequence S as

  S = {S_i} = {[sbt_i, st_i, ebt_i]^T},  i = 1, …, N    (4)

where the start boundary type sbt_i ∈ {hard cut, dissolve}, the shot type st_i ∈ {far-view, non-far-view}, and the end boundary type ebt_i ∈ {hard cut, dissolve}. N is the total number of shots in the sequence. Once the shot classification sequence S is generated, our system proceeds to text/video alignment to detect the event boundary.

5.2 Event Boundary Detection
In a typical event detected from the web text source, the related time stamp (denoted T_t) usually logs the most representative moment of the event (Figure 4). For example, in a goal-scoring event, the time stamp records the exact time instance when the ball crosses the goal line [20]. Starting from T_t, our event boundary detection module finds a suitable event boundary [T_t - D_s, T_t + D_e] that encapsulates all the available scenes of the stated event in the original broadcast recording. Here D_s and D_e denote the time duration between the event start/end boundary and the event moment, respectively.

To compute D_s and D_e, we observed from our database that the extracted S sequence features one of two patterns around an event boundary, as illustrated in Figure 11, and we have additionally observed two rules (Rule 1 and Rule 2, stated with the finite state machine in Figure 12).

Figure 11. Event boundary pattern modeling (the two observed shot patterns around an event, e.g. S_i = [hard-cut, non-far-view, dissolve]^T: far-view shots delimit the event boundary, with hard-cut and dissolve boundaries between the intervening non-far-view shots)

Hence our strategy to find D_s and D_e is as follows. When a text event is identified, our event boundary detection module first extracts a candidate segment from the video that includes the true event duration. In our current setup, the duration from the start of the candidate segment to T_t is empirically set to 1 minute, and from T_t to the end of the candidate segment to 2 minutes. Then feature extraction (Subsection 5.1) is carried out to obtain the sequence S (Eq. 4) from the candidate segment. Finally, S is sent to a finite state machine (FSM) to compute D_s and D_e.

The FSM first detects the event start boundary and then the event end boundary. To detect the start boundary, the FSM first identifies the shot in the S sequence into which the event time stamp falls (the "Start" state in Figure 12) and names it the reference shot S_r. Starting from S_r, the FSM performs a backward search along {S_i}, i = r-1, …, 1, to find a suitable event start shot S_s. In this backward search, the FSM changes states under the conditions listed in Figure 12. The FSM sometimes jumps into the "Hypothesis" state if it cannot verify whether a far-view shot is inside the event boundary (e.g. it is a replay) or outside the event. In the "Hypothesis" state, the FSM assumes the far-view shot to be inside the event boundary and checks whether this assumption violates any rules (e.g. it results in too long an event boundary). Note in Figure 11 that, as the event start boundary is not aligned with the start boundary of S_s, a "Start boundary refine" state is adopted to find a suitable starting frame as the exact event start boundary. This is achieved by thresholding the duration between the desired event start boundary frame and the ending frame of S_s. After the start boundary is detected, the FSM performs a forward search along {S_i}, i = s+1, …, N, to find the event end boundary. The algorithm for the forward search is similar to the backward search, except that it works in the forward direction.
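A much-reduced sketch of the backward search follows. This is our own reduction of the FSM, not the paper's implementation: only Rule 1 is modeled, the 12-second "too long" far-view threshold is an assumed value, and the Hypothesis and boundary-refine states are omitted.

```python
# Each shot is (shot_type, duration_seconds); shot_type is "far" or "non-far".
LONG_FAR_VIEW = 12.0  # assumed threshold for Rule 1 ("too long" far-view shot)

def backward_search(shots, r):
    """Walk backward from the reference shot r, extending the event until a
    far-view shot too long to be inside the event is met (Rule 1)."""
    s = r
    for i in range(r - 1, -1, -1):
        shot_type, duration = shots[i]
        if shot_type == "far" and duration > LONG_FAR_VIEW:
            break          # this far-view shot lies outside the event
        s = i              # otherwise keep extending the event backward
    return s               # index of the event start shot Ss

shots = [("far", 30.0), ("far", 8.0), ("non-far", 4.0),
         ("non-far", 5.0), ("far", 6.0)]
print(backward_search(shots, r=3))  # 1: the long far-view shot 0 is excluded
```

The forward search for the end boundary would mirror this loop in the opposite direction.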
Finite state machine for event boundary detection Rule 1: Any far-view shot that is too long is not inside an event. Transition condition A: far- view shot; B: non-far-view shot; C shot type Rule 2: Most events last longer than 20 seconds. unchanged; D: Rule 1 satisfied or Hypothesis failed; E: A far-view shot but does not satisfy Rule 1; F: same as A; G: Hypothesis failed. 227 Figure 13. Schematic Diagram of our live set-up clock digits can achieve above 99%. Some detailed results are listed 6. EXPERIMENTAL RESULTS in column “GTR” of Table 5. The inaccurate recognition for 2 EPL We conducted our experiments on both live games and recorded games are due to the small game clock in MPEG I video recording games. The recorded games are used to evaluate individual modules which leads to incorrect clock digits location identification. It also and the live games are used to evaluate the whole system. The can be seen that the result of GTR are more accurate and reliable dataset, experimental setup and results are described and reported than the result of GSD. in the following subsections. Table 3. Text event detection based on well-structured web text Event Precision/Recall Event Precision/Recall 6.1 Text Analysis Goal 100%/100% Red card 100%/100% We conducted text event detection experiment on 8 games (6 EPL Shot 97.1%/87.2% Yellow card 100%/100% and 2 UEFL). To give a comparison, we used two types of web text: Save 94.4%/100% Foul 100%/100% well-structured web text [20] and freestyle web text [24]. The former presents a well-defined syntax structure which significantly Free kick 100%/100% Offside 100%/100% facilities our keyword based text event detection method. The Corner 100%/100% Substitution 100%/100% freestyle web text lacks of decent structure for event description. Due to its dynamic structure and diverse presenting style, freestyle Table 4. Text event detection based on freestyle web text web text is more difficult for event detection. 
Table 3 and Table 4 list the event detection performance from well-structured web text and freestyle web text, respectively. The relatively lower precision/recall in Table 4 validates the advantage of using well-structured web text for text event detection.

Table 4. Text event detection based on freestyle web text
  Event     Precision/Recall    Event          Precision/Recall
  Card      97.6%/95.2%         Free kick      96.7%/100%
  Foul      96.9%/93.9%         Save           97.5%/79.6%
  Goal      81.8%/93.1%         Injury         100%/100%
  Offside   100%/93.1%          Substitution   86.7%/100%

6.2 Video Analysis
6.2.1 Game start detection
The performance of Game Start Detection (GSD) is tested using 8 EPL games, 2 UEFA games and 15 international friendly games. The starts (of both the first half and the second half) of 12 games are detected within 5 seconds, 6 games within 15 seconds, 5 games within 30 seconds, and 2 above 30 seconds. Some results are listed in column "GSD" of Table 5. It can be seen that some of the detected game starts are delayed, for two reasons: 1) the presence of captions, which causes incorrect frame view type classification; and 2) the occurrence of early events, which pause the game and interfere with game start detection.

6.2.2 Game time recognition
The performance of our Game Time Recognition (GTR) for game start detection is tested using 8 EPL games, 4 UEFA 2005 games, 4 Euro-Cup 2004 games and 10 World-Cup 2002 games. For most game videos the clock digits can be located without any false location, and the recognition accuracy after correct location of the clock digits is above 99%. Some detailed results are listed in column "GTR" of Table 5. The inaccurate recognition for 2 EPL games is due to the small game clock in the MPEG-I video recording, which leads to incorrect clock digit location identification. It can also be seen that the GTR results are more accurate and reliable than the GSD results.

Table 5. GSD and GTR results on 8 games (6 EPL, 2 UEFA)
  Game                    Actual start   GSD     GTR
  ManU-Sunderland         17:25          17:24   18:23
  Portsmouth-Arsenal      8:05           8:50    8:05
  Arsenal-WestBrom        1:55           2:35    1:55
  Bolton-Chelsea          8:01           8:02    7:59
  AstonVilla-Bolton       7:38           7:48    7:38
  Blackburn-Newcastle     3:26           3:29    3:26
  Chelsea-BayernMunich    12:08          12:11   12:08
  Liverpool-Chelsea       11:35          11:40   11:35

6.3 Text/Video Alignment
To assess the suitability of the automatically selected event boundaries, we use the Boundary Detection Accuracy (BDA) [14] to measure the detected event boundary against the manually labeled boundary:

  BDA = (τ_db ∩ τ_mb) / max(τ_db, τ_mb)    (5)

where τ_db and τ_mb are the automatically detected event boundary and the manually labeled event boundary, respectively. The higher the BDA score, the better the performance. Table 6 lists the BDA scores for 4 EPL games. It is observed that the boundary detection performance for free kick events is lower than for other events. This is because our selected web-casting text source usually includes another event (e.g. a foul) before the free kick event; hence the extracted time stamp is not accurate, which affects the alignment accuracy.

Table 6. Event boundary detection
  Event       BDA      Event          BDA
  Goal        90%      Red card       77.5%
  Shot        86.9%    Yellow card    77.5%
  Save        97.5%    Foul           77.7%
  Free kick   43.3%    Offside        80%
  Corner      40%      Substitution   75%

Table 7. Event boundary detection for 4 EPL games
  Event       BDA      Event          BDA
  Goal        75%      Red card       NA
  Shot        82.5%    Yellow card    83%
  Save        90%      Foul           77.7%
  Free kick   40%      Offside        85.3%
  Corner      66.7%    Substitution   NA

Table 8. Event boundary detection for 61 World-Cup games
  Event       BDA      Event          BDA
  Goal        76.7%    Red card       82%
  Shot        76.1%    Yellow card    84%
  Save        60%      Foul           77.7%
  Free kick   43.3%    Offside        70.5%
  Corner      75%      Substitution   78.1%
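Reading Eq. (5) as the overlap of the two boundary intervals divided by the longer of the two durations (our interpretation, following [14]), BDA is straightforward to compute:

```python
# Sketch of the Boundary Detection Accuracy (BDA) of Eq. (5), interpreting
# tau_db and tau_mb as (start, end) time intervals in seconds.
def bda(detected, labeled):
    """Overlap duration of the two intervals over the longer duration."""
    (ds, de), (ls, le) = detected, labeled
    overlap = max(0.0, min(de, le) - max(ds, ls))
    longest = max(de - ds, le - ls)
    return overlap / longest if longest > 0 else 0.0
```

A perfect detection scores 1.0; disjoint intervals score 0.0, so over-long detected boundaries are penalized as much as missed ones.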
6.4 Live Performance
6.4.1 Live setup
Figure 13 shows the schematic workflow of our live experimental setup. The system uses a Dell Optiplex GX620 PC (3.4 GHz dual-core CPU, 1 GB memory) with a Hauppauge PCI-150 TV capture card. Our main reason for selecting the Hauppauge PCI-150 encoder card is that we were able to read the output MPEG file while it is being written. Another key consideration is maintaining a balance of CPU resources to sustain both the live video capture and our highlight detection with minimum delay. The main delay comes from the live text availability on the target URL. Once the event is identified from the web text, the average processing delay for event boundary detection is around 10 seconds.

6.4.2 Live event boundary detection
Our system went "live" over the April 13th-15th, 2006 weekend EPL broadcast. An integration oversight restricted the system to completing its run for only the 1st half of 4 games: Bolton vs. Chelsea, Arsenal vs. WestBrom, Portsmouth vs. Arsenal and ManUnited vs. Sunderland. We improved our system, and a second live trial was conducted from June 10th to July 10th for all 64 World-Cup 2006 games. All processing modules executed seamlessly for the whole match in 61 games; 3 games were missed due to an erroneous system configuration. The live event boundary detection performance for the live EPL games and World-Cup games is listed in Table 7 and Table 8, respectively. Since we only dealt with the 1st half of 4 live EPL games, some events had not occurred in those games; these are listed as NA in Table 7.

7. APPLICATIONS
The proposed solution for live event detection will benefit both professional service providers and consumers. We give two scenarios below to illustrate how to deploy the proposed solution in professional and consumer services.

7.1 Professional Services
The delivery of sports video highlights over new media channels such as 3G is an attractive option for both service providers and consumers. Most existing commercial offerings are of three types: (1) live SMS updates, (2) live video streaming of the game over 3G, and (3) post-game 3G short highlight video clips. There is clearly a market gap for live 3G short highlight video updates. The main reasons for the gap are (a) concerns about dilution of TV rights, (b) the accuracy and acceptability of the video, and (c) cost. As regards rights dilution, this is a business issue and we will not dwell on it, apart from mentioning that there is increasing concern among EU regulatory bodies that overly restrictive business contracts for premium sports content are hindering the development of new media markets such as 3G. As for the accuracy and acceptability of the video, we argue that this is a mindset issue. Traditionally, video highlight creation is part of post-production: extensive video editing is required to put together different interesting segments, and an automatic system would not be as good. This point is valid, but our system is not trying to do that in the first place. It attempts to segment a short, continuous portion from the running broadcast which would hopefully encompass the key highlights of the sports event. This suffices for an instant video alert market; the traditional way of crafting post-production content can still continue. As for cost concerns, our system uses low-cost off-the-shelf equipment and is automatic. On occasions where it may require operator assistance, we expect this inspection effort to be minimal and expedited.

7.2 Consumer Services
We foresee a significant market in consumer client-based applications. With the advent of pervasive broadband/UMTS connectivity, IPTV, home media centers and the plummeting cost of set-top OEM hardware, we envision great demand for both time-shifting and place-shifting video services. In the latter, especially, relevant video can be detected from a broadcast, segmented and transcoded over an IP channel to a mobile device. IP rights may not be as big an issue as for professional service providers. We believe the computational footprint of our system can be further reduced to fit these scenarios.

8. CONCLUSION
Event detection from live sports games is a challenging task. We have presented a novel framework for live sports event detection that combines the live analysis and alignment of web-casting text and broadcast video. Within this framework, we have developed a live
event detection system for soccer games and conducted live trials on various soccer games. The experimental results are promising and validate the proposed framework.

We believe that the incorporation of web-casting text into sports video analysis, which combines the complementary strengths of low-level features and high-level semantics, will open up new possibilities for personalized sports video event detection and summarization and create a new business model for professional and consumer services. In this paper, our focus is on event detection from live sports games. Once we have the events and event semantics, it is not difficult to create a personalized summary related to a certain event, player, team or their combination according to the user's preference.

Web-casting texts for various sports games are accessible from many sports websites. They are generated by professionals or amateurs using various styles (well-structured or freestyle) and different languages. Our future work will focus on exploiting more web-casting text sources, investigating more advanced text mining approaches to deal with web-casting text of different styles and languages (e.g. automatic detection of event keywords), and conducting live trials on more sports domains.

9. REFERENCES
[1] Y. Rui, A. Gupta, and A. Acero, "Automatically extracting highlights for TV baseball programs", In Proc. of ACM Multimedia, Los Angeles, CA, pp. 105-115, 2000.
[2] M. Xu, N.C. Maddage, C. Xu, M.S. Kankanhalli, and Q. Tian, "Creating audio keywords for event detection in soccer video", In Proc. of IEEE International Conference on Multimedia and Expo, Baltimore, USA, Vol. 2, pp. 281-284, 2003.
[3] Y. Gong, L.T. Sin, C.H. Chuan, H.J. Zhang, and M. Sakauchi, "Automatic parsing of TV soccer programs", In Proc. of International Conference on Multimedia Computing and Systems, pp. 167-174, 1995.
[4] A.M. Ekin, A.M. Tekalp, and R. Mehrotra, "Automatic soccer video analysis and summarization", IEEE Trans. on Image Processing, Vol. 12, No. 7, pp. 796-807, 2003.
[5] D. Zhang and S.F. Chang, "Event detection in baseball video using superimposed caption recognition", In Proc. of ACM Multimedia, pp. 315-318, 2002.
[6] J. Assfalg, M. Bertini, C. Colombo, A. Del Bimbo, and W. Nunziati, "Semantic annotation of soccer videos: automatic highlights identification", Computer Vision and Image Understanding (CVIU), Vol. 92, pp. 285-305, November 2003.
[7] R. Radhakrishnan, Z. Xiong, A. Divakaran, and Y. Ishikawa, "Generation of sports highlights using a combination of supervised & unsupervised learning in audio domain", In Proc. of Pacific Rim Conference on Multimedia, Vol. 2, pp. 935-939, December 2003.
[8] K. Wan and C. Xu, "Robust soccer highlight generation with a novel dominant-speech feature extractor", In Proc. of IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, pp. 591-594, 27-30 Jun. 2004.
[9] M. Xu, L. Duan, C. Xu, and Q. Tian, "A fusion scheme of visual and auditory modalities for event detection in sports video", In Proc. of IEEE International Conference on Acoustics, Speech, & Signal Processing, Hong Kong, China, Vol. 3, pp. 189-192, 2003.
[10] K. Wan and C. Xu, "Efficient multimodal features for automatic soccer highlight generation", In Proc. of International Conference on Pattern Recognition, Cambridge, UK, Vol. 3, pp. 973-976, 23-26 Aug. 2004.
[11] M. Xu, L. Duan, C. Xu, M.S. Kankanhalli, and Q. Tian, "Event detection in basketball video using multi-modalities", In Proc. of IEEE Pacific Rim Conference on Multimedia, Singapore, Vol. 3, pp. 1526-1530, 15-18 Dec. 2003.
[12] M. Han, W. Hua, W. Xu, and Y. Gong, "An integrated baseball digest system using maximum entropy method", In Proc. of ACM Multimedia, pp. 347-350, 2002.
[13] S. Nepal, U. Srinivasan, and G. Reynolds, "Automatic detection of goal segments in basketball videos", In Proc. of ACM Multimedia, Ottawa, Canada, pp. 261-269, 2001.
[14] J. Wang, C. Xu, E.S. Chng, K. Wan, and Q. Tian, "Automatic generation of personalized music sports video", In Proc. of ACM International Conference on Multimedia, Singapore, pp. 735-744, 6-11 Nov. 2005.
[15] N. Nitta and N. Babaguchi, "Automatic story segmentation of closed-caption text for semantic content analysis of broadcasted sports video", In Proc. of 8th International Workshop on Multimedia Information Systems, pp. 110-116, 2002.
[16] N. Babaguchi, Y. Kawai, and T. Kitahashi, "Event based indexing of broadcasted sports video by intermodal collaboration", IEEE Trans. on Multimedia, Vol. 4, pp. 68-75, March 2002.
[17] N. Nitta, N. Babaguchi, and T. Kitahashi, "Generating semantic descriptions of broadcasted sports video based on structure of sports game", Multimedia Tools and Applications, Vol. 25, pp. 59-83, January 2005.
[18] H. Xu and T. Chua, "The fusion of audio-visual features and external knowledge for event detection in team sports video", In Proc. of Workshop on Multimedia Information Retrieval (MIR'04), Oct. 2004.
[19] H. Xu and T. Chua, "Fusion of multiple asynchronous information sources for event detection in soccer video", In Proc. of IEEE ICME'05, Amsterdam, Netherlands, pp. 1242-1245, 2005.
[20] http://news.bbc.co.uk/sport2/hi/football/teams/
[21] http://sports.espn.go.com/
[22] M. Bertini, R. Cucchiara, A. Del Bimbo, and A. Prati, "Object and event detection for semantic annotation and transcoding", In Proc. of IEEE Int. Conf. Multimedia and Expo, Baltimore, MD, Jul. 2003, pp. 421-424.
[23] R. Leonardi and P. Migliorati, "Semantic indexing of multimedia documents", IEEE Multimedia, Vol. 9, pp. 44-51, Apr.-June 2002.
[24] http://soccernet.espn.go.com/
[25] Y. Tan et al., "Rapid estimation of camera motion from compressed video with application to video annotation", IEEE Trans. on Circuits and Systems for Video Technology, Vol. 10, No. 1, pp. 133-146, 2000.
[26] Y. Li, C. Xu, K. Wan, X. Yan, and X. Yu, "Reliable video clock time recognition", In Proc. of Intl. Conf. Pattern Recognition, Hong Kong, 20-24 Aug. 2006.