This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2021.3118048, IEEE Access.

Video Index Point Detection and Extraction Framework using Custom YoloV4 Darknet Object Detection Model

MEHUL MAHRISHI 1 (Senior Member, IEEE), SUDHA MORWAL 2, ABDUL WAHAB MUZAFFAR 3, SURBHI BHATIA 4, PANKAJ DADHEECH 5, MOHAMMAD KHALID IMAM RAHMANI 6 (Senior Member, IEEE)
1 Department of Information Technology, Swami Keshvanand Institute of Technology, Management & Gramothan, Jaipur, Rajasthan, India (e-mail:
[email protected])
2 Department of Computer Science, Banasthali Vidyapith, Niwai, Rajasthan, India (e-mail:
[email protected])
3 College of Computing and Informatics, Saudi Electronic University, Riyadh 11673, Saudi Arabia (e-mail:
[email protected])
4 College of Computer Science and Information Technology, King Faisal University, Saudi Arabia (e-mail:
[email protected])
5 Department of Computer Science & Engineering, Swami Keshvanand Institute of Technology, Management & Gramothan, Jaipur, Rajasthan, India (e-mail:
[email protected])
6 College of Computing and Informatics, Saudi Electronic University, Riyadh 11673, Saudi Arabia (e-mail:
[email protected])
Corresponding author: Dr. Surbhi Bhatia (e-mail:
[email protected]).

ABSTRACT The trend of learning from videos instead of documents has increased. There can be hundreds or even thousands of videos on a single topic, with varying degrees of context, content, and depth. The literature claims that learners are nowadays less interested in viewing a complete video and instead prefer the topics of their interest. This creates the need for indexing of video lectures. Manual annotation or topic-wise indexing is not new in the case of videos. However, manual indexing is time-consuming due to the length of a typical video lecture and its intricate storyline. Automatic indexing and annotation is, therefore, a better and more efficient solution. This research establishes the need for automatic video indexing for better information retrieval and for easing users' navigation of topics inside a video. The automatically identified topics are referred to as "Index Points." A 137-layer YoloV4 Darknet neural network is used to create a custom object detection model. The model is trained on approximately 6000 video frames and then tested on a suite of 50 videos of around 20 hours of run time. Shot boundary detection is performed using Structural Similarity fused with a Binary Search Pattern algorithm, which outperforms the state-of-the-art SSIM technique by reducing the processing time by approximately 21% while providing around 96% accuracy. The quality of the generated index points, in terms of true positives and false negatives, is measured through precision, recall, and F1 score, which vary between 60% and 80% for every video. The results show that the proposed algorithm successfully generates a digital index with reasonable accuracy in topic detection.

INDEX TERMS Binary Search Pattern, Keyframe Identification, Neural Network, Optical Character Recognition, Structural Similarity, Shot Boundary Detection, Video Frames, YoloV4.

I. INTRODUCTION
The automatic generation of multimedia content from a captured video lecture is not new. Once stored, videos create an extensive knowledge repository. They are, in themselves, a disruptive source of education apart from textbooks and classroom study [1]. One can easily see the considerable attention paid to educational videos amongst all other video sources. The findings show that up to 65 percent of Berkeley students utilize videos to improve their understanding of subjects they missed in class [2]. The Cambridge International Global Education Census Survey 2019 states that approximately 41% of school-goers globally have taken an online course that was not part of their curriculum in the past 12 months [3].

The detailed literature survey in Section II clearly shows that there is a paradigm shift in the mode of online learning. Learners are increasingly turning to online video lectures; however, they are less interested in watching the entire video and prefer to focus on specific topics of interest. Therefore, indexing of video lectures is the need of the hour. Mathematically, video indexing and annotation is the probability of discovering a topic and its associated
text descriptor (keywords) in a video. However, in long video lectures, manually performing this task can be time-consuming. Additionally, new multimedia content demands comprehensive indexing and restructuring of information for a deeper understanding of a video. Therefore, automatic indexing and annotation is a better and more effective solution. The literature also claims that deep learning remains quite restrictive for such cases, even though much work has been done in video analysis and segmentation.

A video can be analyzed in two different ways. On the one hand, a video can be examined based on its underlying multimedia elements, i.e., speech and images. On the other hand, text inspection and keyword identification open a whole new dimension. This research intends to develop a technique for identifying automatic index points in a video lecture. These index points are nothing more than excerpts from a video's main headings and discussion topics. The current research involves studying all kinds of material, including recordings of variable replay duration, classroom lectures, online tutorials, open courseware, and other comparable resources.

A 137-layer YoloV4 Darknet neural network is used to create a custom object detector model. The model is trained on approximately 6000 video frames and then tested on a suite of 50 educational videos of around 20 hours of run time. The Structural Similarity Index algorithm fused with the Binary Search Pattern algorithm is used to improve the overall performance of shot boundary detection.

A. RESEARCH CONTRIBUTIONS
The motivation of this research comes from the fact that accessing the range of interest in a lecture video is not easy because there is no table of contents or index section. The temporal pace of the sequences only affords the user to explore the video linearly.
Through this research technique, a video lecture is automatically partitioned into segments based on image and text analysis. This process is called "Index Point Detection", and the identified keywords are termed "Index Points". Each index point also contains the timestamp of the corresponding video time instance. The textual content of the videos is obtained using Google's Tesseract OCR Engine. The research is limited to educational videos that contain presentation slides in the English language only. Using these indices, users can easily search topics to find a particular segment of a video file or create a digital database.

Learning blended with Natural Language Processing is undoubtedly one of the most gripping scientific frontiers, persuading researchers to explore novel dimensions in indexing and partitioning, thereby producing a coherent video document structure.

Thus, the main contributions of this research are:
• To study the existing methods of automatic video indexing and annotation to analyze the outcomes and gaps.
• To develop a methodology through which a video lecture is automatically partitioned into segments based on image and text analysis.
• To identify and implement an accurate shot boundary detection technique for abrupt transitions.
• To automatically detect the index points within a video by validating the trained YoloV4 custom object detector.
• To apply accuracy measures and provide a comparative benchmark for further research in this area and produce a significant impact.
• To provide a method for intelligent information retrieval; researchers can explore the applicability of this technique in videos of other domains.
• To support and enhance the research community's knowledge.

B. ARTICLE ORGANIZATION
This research work focuses on developing a technique for identifying automatic index points for a video lecture. Section II takes a close look at the theories and related research about video indexing to grasp the detailed literature. Section III discusses the technical perspective, the research methodology, and the technique that facilitates the algorithm's development based on the research gaps identified in the literature and the objectives of the investigation. Since we have used a subjective criterion to evaluate the success of our heading detection algorithm, the hypothesis and ground truth are discussed in Section IV. After successfully implementing the algorithm, it is tested, and the output is displayed graphically in Section V. Finally, the complete work is concluded in Section VI, and future work is suggested in Section VII.

II. INVESTIGATIONS IN AUTOMATIC VIDEO INDEXING
In a video, words have various meanings depending upon the context. As the saying goes, "sometimes, they are important, while others are just a recollection of a former viewpoint." When looking at the appearance of an index point in a video, it is essential to understand its importance [4]. The literature review tries to uncover the critical characteristics that decide a topic change in a video lecture. A categorical literature survey is done based on the techniques intended to be performed during the research, tabulating the performance, relative merits, and limitations of existing approaches.

A. INDEXING OF VIDEOS
The repository of educational videos is immense, and it is impossible for a learner to copiously go through it and narrow down the content of interest. Rosalind Picard proposed affective video indexing in 1990 while defining affective computing. As per her definition, the need for indexing a video can be summarised as: "Although affective annotations, like content annotation, will not be universal, they will still help reduce time searching for the right scene".

Automatic video indexing is indispensable to make a video interactive and autodidactic. The purpose of indexing is to divide a lecture video into parts that reflect distinct sub-topics. A successful index for a video lecture can only be created when the contents are automatically analyzed to extract relevant metadata. The first step is to identify all the timestamps in the video where the scene changes considerably. This allows for segment-wise browsing of the video and content analysis explicitly targeted to the category. Furthermore, a subset of these locations is chosen as index points, representing the start of a sub-topic, which subsequently serves as part of the index for the video. The specific problem this research caters to is extracting keywords from the video frames based on which a lecture can be segmented. Each such keyword is called an index point, which also contains the timestamp of the corresponding video time instance.
of the video and content analysis explicitly targeted to the There could be hundreds and thousands of videos on category. Furthermore, a subset of these locations is chosen a single topic— all of them with different points as index points, representing the start of a sub-topic, which of view and numerous instances of more profound subsequently serves as a part of the index for the video. insight. This sub-section claims that manual annotation or The specific problem this research will cater is to extract the topic-wise indexing is not new in videos; it is sometimes keywords from the video frames based on which a lecture can time-consuming due to the length of a typical video lecture. be segmented. Each such keyword is called an index point, Automatic indexing and annotation is, therefore, a better and which will also contain the timestamp of the relative video efficient solution. time instance. 2) Automatic Video Indexing using Deep Learning 1) Existing Approaches for Automatic Video Indexing Deep Learning for automatic video indexing opens a whole Many researchers have developed frameworks and systems new dimensions. Table 1 summarizes the review of research for video segmentation either for information retrieval or articles and their proposed idea for deep learning techniques, for content-wise indexing. Mottaleb et al. [5] suggested which are already used for video indexing. A method of one of the oldest pieces of literature in this field that use automated indexing of videos using a text-based query is a combination of content analysis and physical property suggested by [20]. The users are asked to provide the data to pick the shots shown to users. Riedl et. al [6] keywords as a query; the framework returns a list of videos. proposes a novel algorithm ’TopicTiling’ which is based Jothilakshmi [21] developed a novel method for content on the standard ’TextTiling’ algorithm for text segmentation identification of news scenes in the NEWS Dataset and where blocks of text are compared via bag-of-words vectors. suggested it for automated content recognition systems. Their work proves to be significant in terms of complexity When making a transcription, this method comes with and computationally less expensive than other LDA-based limitations. TV news sequences can only be processed using segmentation methods. F. Sauli [7] researched hyper videos a commercially available programme, classifying them into that introduce hyperlinks within a video. The work focuses one of the six predetermined categories (National Politics, on the educational domain and provides clickable access National News, World, Finance, Society & Culture and to video components like pictures, web pages, and text. Sports). Zhou et al. [22] proposes a technique proposed Biswas et al. [8] developed a rank model MmtoC for that includes automated video text panoramas to capture essential keywords (salient words). Both visual slides and natural scene text information and picture stitching. It has voice transcripts were utilized to create the sentence. Cost been observed that [23] has obtained a very substantial functions are derived depending on the terms used in 95.3% Recall and 98.6% Precision using CNN with Cosine a ranking. The topic-oriented indexing of the movie is Similarity. Lu et al. [24] indicate that deep convolutional generated using the programming optimization method, neural networks combined with OCR Tools are also used which accelerates processing. 
An analysis of the textual content is conducted, and lecture video segmentation is based on the linguistic characteristics of the text. Lin et al. [9] convert lecture videos into text and analyze the resulting text to find dissimilarities and ways to improve them. Distinguishing context-dependent information from content-based information is one of the main challenges for video indexing researchers. Uke and Thool [10] aim to provide a digital index to education videos by converting videos into images and extracting the text directly through the Google Tesseract OCR Engine. Merler et al. [11] go one step further: the concept of both types of research is the same, but the latter extracts and recognizes the text directly from the video rather than converting the whole video into images. Adcock et al. [12] worked on lecture webcasts and tried to provide a searchable text index so that users could access material within a video. The technique focuses on videos with presentation slides only and performs keyframe extraction using associativity of the textual content. Hui et al. [13] proposed a new way to improve the interpretability and manipulability of a video through a combined keyframe extraction and object-based video segmentation method. The similarity and redundancy between the frames are removed through the Kullback-Leibler distance (AIKLD) criterion.

There could be hundreds and thousands of videos on a single topic, all of them with different points of view and numerous instances of more profound insight. This sub-section shows that manual annotation or topic-wise indexing is not new in videos; it is, however, time-consuming due to the length of a typical video lecture. Automatic indexing and annotation is, therefore, a better and more efficient solution.

2) Automatic Video Indexing using Deep Learning
Deep learning for automatic video indexing opens a whole new dimension. Table 1 summarizes the review of research articles and their proposed ideas for deep learning techniques that are already used for video indexing. A method for automated indexing of videos using a text-based query is suggested by [20]. The users are asked to provide keywords as a query, and the framework returns a list of videos. Jothilakshmi [21] developed a novel method for content identification of news scenes in the NEWS dataset and suggested it for automated content recognition systems. When making a transcription, this method comes with limitations: TV news sequences can only be processed using a commercially available programme, classifying them into one of six predetermined categories (National Politics, National News, World, Finance, Society & Culture and Sports). Zhou et al. [22] propose a technique that includes automated video text panoramas to capture natural scene text information and picture stitching. It has been observed that [23] obtained a very substantial 95.3% recall and 98.6% precision using a CNN with cosine similarity. Lu et al. [24] indicate that deep convolutional neural networks combined with OCR tools are also used nowadays to detect and recognize video text.

Through the extensive literature survey, a conclusion can be drawn that there is a need for automatic video indexing in educational videos. To reduce the computational complexity, shot boundary detection and keyframe identification techniques are also indispensable. Even though much work has been done in video analysis and segmentation, deep learning for such cases is still restrictive.

TABLE 1: Comparison of various Deep Learning Models in Video Analysis
S. No. | Paper Title | Author & Year | Deep Learning Model | Conclusion
1 | An Innovative Video Searching Approach using Video Indexing | [14] 2021 | Recurrent Neural Network (RNN), ResNet-152, Bi-LSTM (Bidirectional LSTMs) | Video searching through indexing is proposed. The VIST (Visual Storytelling) dataset is used to generate video captions for preparing indices. The RNN model is used for audio-to-text conversion and speech recognition. An OCR technique is used for text extraction from pre-processed frames.
2 | Automatic Image and Video Caption Generation With Deep Learning: A Concise Review and Algorithmic Overlap | [15] 2020 | CNN+RNN/LSTM, GAN, CNN+LSTM and LSTM+GAN | A comprehensive review is presented for both image captioning and video captioning methodologies based on deep learning.
3 | A Browsing and Retrieval System for Broadcast Videos using Scene Detection and Automatic Annotation | [16] 2019 | Convolutional Neural Network (CNN) | The video browsing and retrieval system comprises a scene detection module, a concept detection algorithm and a retrieval algorithm with which users can search for scenes inside a video collection. A Siamese neural network is used, a CNN-based network whose penultimate layer is concatenated with features extracted from the transcript of the video.
4 | Deep Learning Based Semantic Video Indexing and Retrieval | [17] 2018 | GoogLeNet network structure, Convolutional Neural Network (CNN) | A content-based video retrieval system is proposed in which features are extracted by a CNN model.
5 | Beyond Short Snippets: Deep Networks for Video Classification | [18] 2015 | Convolutional Neural Network (CNN), Recurrent Neural Network derived from Long Short-Term Memory (LSTM) | Proposes CNN architectures for obtaining global video-level descriptors and a framework for video and action classification tasks.
6 | Large-scale Video Classification with Convolutional Neural Networks | [19] 2014 | Convolutional Neural Network (CNN) | CNN is used for obtaining global video-level descriptors, video classification and retrieval.

B. A REVIEW OF VIDEO STRUCTURE ANALYSIS AND SHOT BOUNDARY DETECTION TECHNIQUES
During the past decade, shot boundary detection techniques have been a sphere of influence for researchers worldwide working in video analysis and image processing. A large portion of the research community has been devoted to shot boundary detection using edges, color, motion cues, and object correlation, singly or in combination.
Numerous techniques have been developed, and several comprehensive surveys have been presented to summarise them. Table 2 compares the existing research approaches that are efficient in detecting either gradual transitions, cut transitions, or both. It can be observed that although almost all the approaches detect cut transitions efficiently, some recent approaches focus on detecting gradual transitions.

TABLE 2: Cut and Gradual Transition Detection
Year | Reference | CT Detection | GT Detection
2021 | [25] | Y | N
2021 | [26] | Y | Y
2021 | [27] | Y | Y
2020 | [28] | N | Y
2020 | [28] | Y | Y
2020 | [29] | Y | Y
2019 | [30] | Y | Y
2018 | [31] | Y | Y
2018 | [32] | Y | Y
2017 | [33] | Y | Y
2017 | [34] | Y | Y
2017 | [35] | Y | Y
2016 | [36] | Y | Y
2016 | [37] | Y | N
2016 | [38] | Y | Y
2016 | [39] | Y | Y
2015 | [40] | Y | Y
2013 | [41] | Y | Y
2009 | [42] | Y | Y

Several methods have been implemented to detect shot boundaries, producing highly accurate and acceptable results. We can infer from the literature that color histograms are the most used global features for video shot boundary detection. Histograms provide a good trade-off between accuracy and computational time. Color-based histograms (CBH) use the RGB space for the computation, while Hue Saturation Value (HSV) histograms are computed in the HSV color space [43]. The HSV color space is more intuitive and an alternative to the RGB color space. It also uses three dimensions to describe a color, which produces sturdy detection results [44]. Tuna et al. [45] discuss a detailed literature survey of shot boundary detection techniques that detect abrupt and gradual transitions. Ma et al. [46] detected gradual and abrupt transitions using a dual detection model by performing pre-detection. An uneven blocked mechanism based on the human visual system, histograms, and pixel differences is used for pre-detection. False detections were removed in the detection phase by employing the Scale Invariant Feature Transform (SIFT). Amiri and Fathy [47] developed a noise-robust algorithm using Generalized Eigenvalue Decomposition (GED) and obtained a distance function. Cut transitions were recognized when abrupt changes were noticed, and gradual transitions were recognized when semi-Gaussian behaviour was reflected in the distance function.
Multilevel Difference of Colour Histograms with a voting mechanism was used in [38], and [48] used Convolutional Neural Networks (CNN) to determine cut and gradual transitions. Hue Saturation Value is used for detecting abrupt shots, and a 3-dimensional convolution layer in the CNN is used to obtain gradual transitions; furthermore, the CNN avoids the disturbance caused by abrupt transitions [30]. In [35], the Frobenius norm with a double threshold and Singular Value Decomposition (SVD) updating were used for detecting cut and gradual transitions, respectively.

A Modified Artificial Bee Colony algorithm is used to identify candidate boundaries; a hybrid of the Fast Accelerated Segment Test and fuzzy histograms is further used to extract local and global features to verify the obtained boundaries [49]. A Symmetric Local Binary Pattern with histogram features was implemented on six TV series to handle illumination changes and detect hard cuts [2]. Temporal features were extracted from movies and trailers, followed by Dynamic Mode Decomposition to extract temporal background and foreground modes to detect hard cuts, fades, and dissolves [33]. A high-level Fuzzy Petri Net model with keypoint matching is used on commercial videos, movies, TV shows, and dramas to detect gradual transitions by removing false shots [32]. Audio and optical features were extracted from talk shows obtained from the Dailymotion and YouTube websites using clustering-based algorithms, showing that the number of frames is limited to two while the actor movement is more [50]. Candidate segments are obtained by comparing Oriented FAST and Rotated BRIEF (ORB) features; cut transitions are extracted by comparing structural similarity, and gradual transitions are detected in a gradual transition model on 106049 test frames from the Open Video Project, YouTube and YOUKU [51].

III. TECHNICAL PERSPECTIVE OF THE RESEARCH
Semantic characteristics can now be extracted from pictures and video sequences. There are still problems with video lectures because of their lengthy running duration, non-uniform material, and complex narratives. Automatic video indexing is a technique that presents a potential solution to these problems. This research is carried out in three phases, each of which has its significance and contribution towards better accuracy and less computational complexity, as depicted in Figure 1.

FIGURE 1: Technical Perspective of Custom YoloV4 for Heading Detection

1) Phase-I: Frame Extraction and Pre-processing: The important steps in this phase are identification of an appropriate dataset for the research, extraction of frames, and embedding of timestamps with frames.
2) Phase-II: Candidate Segment Extraction using Shot Boundary Detection: The significant steps in this phase are the amalgamation of the Binary Search Pattern Algorithm with the Structural Similarity Index, followed by the comparison of the results with the state-of-the-art SSIM algorithm.
3) Phase-III: Automatic Index Point Detection: This phase incorporates training of the YoloV4 neural network over 6000 image frames for heading detection. The model is then tested upon a tailor-made suite of 50 videos. Finally, the results are expressed in precision, recall, and F1 score using the Mean Opinion Score.
When a video is split into frames, numerous boundary and non-boundary frames are generated. The number of non-boundary frames is comparatively higher than that of boundary frames. Since the non-boundary frames are redundant, we need to eliminate them to improve the computational complexity. For this purpose, shot boundary detection is used. When shot boundaries are successfully identified, a single keyframe represents each shot. We call that keyframe a 'Candidate Segment'. We can conclude that in a non-boundary segment, the first frame and the last frame are highly correlated [52].

The significant phases in the proposed work are gathering a custom dataset, shot boundary detection and candidate segment identification, customizing the configuration files necessary to train the neural network model, validating the training data for accuracy, and finally testing the custom object detector to see the model's accuracy in real time.

A. FRAMES EXTRACTION AND PROCESSING
This study is based on a dataset of video lectures, and the model is trained on videos ranging in length from 30-45 minutes, for a total run time of 22 hours. Lecture videos often feature a frame rate of 30, which means that 30 frames are produced every second [53]. As a result, the average number of frames per video we have worked on is 35,000-40,000 frames, with just 25-30 boundary frames. Therefore, frame reduction is a significant challenge: the fewer the frames, the faster and more effective the video processing is. Hence, to reduce the number of frames initially, we assumed that, since a topic will be discussed for at least 10 seconds in a video lecture, a delay of the same length can be added in frame generation and subsequent redundant frames can be skipped. For example, the total number of frames initially produced for the video in our dataset entitled "What is JWT JSON" was 38,680. The frames are reduced to 19,015 after a 10-second delay, which is almost half the number.

Another critical aspect of frame generation is keeping track of the frame's arrival time w.r.t. video playback. This frame timestamp indicates the exact point in time when that frame appears in the video. Since the purpose of the research is to provide automatic video indexing, timestamps play a very vital role in video browsing and navigation.
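The extraction step described above can be realized with OpenCV. The following is a minimal sketch, assuming a fixed sampling gap of 10 seconds; the function name and structure are illustrative, since the authors' exact extraction script is not reproduced here.

```python
import cv2

def extract_frames(video_path, gap_seconds=10):
    """Sample one frame per `gap_seconds` and record its playback timestamp."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # lecture videos are typically 30 fps [53]
    step = max(1, int(fps * gap_seconds))
    sampled, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            # Timestamp (seconds) of this frame w.r.t. video playback,
            # later attached to every detected index point.
            ts = cap.get(cv2.CAP_PROP_POS_MSEC) / 1000.0
            sampled.append((ts, frame))
        index += 1
    cap.release()
    return sampled
```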
B. BINARY SEARCH PATTERN ALGORITHM (BSPA)
The Binary Search Pattern Algorithm (BSPA), also known as the half-interval algorithm, is a searching algorithm that searches for a targeted image frame in a sorted array of frames. In our case, the frames are sorted based on their index number. BSPA is required to reduce the computational time from O(n) to logarithmic time, O(log n), where n is the number of elements in the array. Algorithm 1 represents how BSPA is used to divide the array of n frames and select the pivot frame.

Algorithm 1 Binary Search Pattern Algorithm
1: procedure BSPA(left, right)
2:   A ← Array_of_frames
3:   n ← Size_of_array
4:   x ← Boundary_frame
5:   Set left = 0
6:   Set right = n-1
7:   Compare ssim_left and ssim_right
8:   if ssim_left ≠ ssim_right then Set ssim_mid = (ssim_left + ssim_right) / 2
9:   else ssim_pivot = ssim_mid
10:  Compare ssim_pivot and ssim_right
11:  if ssim_pivot = ssim_right then boundary_frame = ssim_right
End Procedure

C. STRUCTURAL SIMILARITY INDEX
Similarity is a computed value between two images that determines how pixel-wise or visually similar the images are. There are various methods to compute the similarity between two images, like Template Matching, image descriptors (such as SIFT, SURF and FAST) and the Structural Similarity Index Measure (SSIM). We have worked with the SSIM approach in this experiment because of its easy implementation with satisfactory results.

SSIM is a metric that looks for the similarity between the pixels of two images. If the pixel density is remarkably similar, SSIM will return approximately 1, and if it is vastly different, SSIM will return approximately -1. The assessment index of quality is based on the computation of three expressions: the luminance expression, the contrast expression and the structural expression, as per equation 1:

SSIM(a, b) = [l(a, b)]^α · [c(a, b)]^β · [s(a, b)]^γ    (1)

Here, l, c and s stand for the luminance, contrast and structural expressions, respectively. The expressions for luminance, contrast and structure can be observed in equations 2, 3 and 4:

l(a, b) = (2µ_a µ_b + z_1) / (µ_a² + µ_b² + z_1)    (2)

c(a, b) = (2σ_a σ_b + z_2) / (σ_a² + σ_b² + z_2)    (3)

s(a, b) = (σ_ab + z_3) / (σ_a σ_b + z_3)    (4)

In the equations above, µ_a, µ_b are the local means, σ_a, σ_b are the standard deviations, σ_ab is the cross-covariance of the two images, and z_1, z_2, z_3 are the SSIM constants. Here, the assumption is that α = β = γ = 1 and z_3 = z_2/2; then the above equations simplify to equation 5:

SSIM(a, b) = [(2µ_a µ_b + z_1)(2σ_ab + z_2)] / [(µ_a² + µ_b² + z_1)(σ_a² + σ_b² + z_2)]    (5)

Technically, SSIM is based on three utility functions that perform all the essential tasks:
• The gaussian function generates an array of numbers sampled from a Gaussian distribution. The length of the array is equal to the size of the window, and σ represents the standard deviation of the distribution.
• The create_window function creates a 2-dimensional array by multiplying the array generated by the gaussian function with its transpose.
• The ssim function performs the mathematical calculations, including luminance, contrast and the SSIM score, based on the standard formulae.

There is a plot twist while implementing SSIM: for image quality assessment, it is useful to apply the SSIM index locally rather than globally. Therefore, we have used the mean index value for Structural Similarity.
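A minimal NumPy/OpenCV sketch of the three utility functions described above is given below; it computes the mean local SSIM of equation 5. The window size, sigma and constants are standard defaults, not values stated by the paper.

```python
import numpy as np
import cv2

def gaussian(window_size, sigma):
    # 1-D Gaussian kernel, normalised to sum to 1.
    ax = np.arange(window_size) - window_size // 2
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def create_window(window_size=11, sigma=1.5):
    # 2-D window: outer product of the 1-D kernel with its transpose.
    g = gaussian(window_size, sigma)
    return np.outer(g, g)

def ssim(a, b, window_size=11):
    # Mean local SSIM between two grayscale frames (equation 5).
    a, b = a.astype(np.float64), b.astype(np.float64)
    k = create_window(window_size)
    z1, z2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2
    mu_a, mu_b = cv2.filter2D(a, -1, k), cv2.filter2D(b, -1, k)
    var_a = cv2.filter2D(a * a, -1, k) - mu_a ** 2
    var_b = cv2.filter2D(b * b, -1, k) - mu_b ** 2
    cov = cv2.filter2D(a * b, -1, k) - mu_a * mu_b
    num = (2 * mu_a * mu_b + z1) * (2 * cov + z2)
    den = (mu_a ** 2 + mu_b ** 2 + z1) * (var_a + var_b + z2)
    return float((num / den).mean())
```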
Algorithm 2 represents the state-of-the-art SSIM approach implementation. It can be seen in Algorithm 2 that the similarity score between two image frames is kept at 0.42 (i.e., 42% similarity for non-adjacent frames), which is the mean similarity score for the SSIM approach.

Algorithm 2 SSIM Threshold
conditions: left, mid, right; Function: F
if the subtracted value of left and right is greater than five then return Algo(left, mid)
else if the right value is greater than left+1
  if the SSIM score is less than 0.42 then return Algo(left, mid)
  return Algo(mid, right)
end function

D. CANDIDATE SEGMENT IDENTIFICATION USING STRUCTURAL SIMILARITY AND BINARY SEARCH PATTERN
Identifying correct and accurate shot boundaries is the backbone of successful video segmentation and content-based video retrieval. The major part of this research revolves around the optimizing technique to identify abrupt transitions for shot boundary detection based on Structural Similarity (SSIM) fused with the Binary Search Pattern Algorithm (BSPA). The algorithm for the same is represented in Algorithm 3. We tried to make the overall solution less costly and computationally more efficient.

Algorithm 3 Proposed Algorithm for Candidate Segment Identification
1: Score = Calculate SSIM(Left, Right)
2: mid = Calculate mid(Left, Right)
3: L_Score = Calculate SSIM(Left, mid)
4: R_Score = Calculate SSIM(mid, Right)
5: mod_L = mod(L_Score - Score)
6: mod_R = mod(R_Score - Score)
7: mod_LR = mod(L_Score - R_Score)
8: Sim, L_Sim, R_Sim = False
9: if mod_L ≤ 0.1 then L_Sim = True
10: if mod_R ≤ 0.1 then R_Sim = True
11: if L_Sim & R_Sim then Sim = True; return Index

The proposed approach and the state-of-the-art SSIM technique are examined for shot boundary detection on Python 3.7.1 with a 1.80 GHz CPU and 8 GB RAM, on 5900 image frames, to gauge the effectiveness. The results are quantified on the basis of precision, recall, and F1 score. The comparative results show that the proposed algorithm successfully detects abrupt transitions with reasonable detection accuracy and significantly decreases the complexity of the algorithm.

The input to the system is a video sequence that is converted to image frames. An index number from 0th to (N-1)th is assigned to each frame, where N is the total number of frames. A 5-frame window is created to check the structural similarity of two frames. The algorithm checks whether the window holds fewer than five frames. If the number is less than five, it is evident that the last frame is the boundary frame, and hence the algorithm returns the last frame; otherwise, the window is moved forward to the next set of five frames.

The locations of the frames that have to be compared are divided into left, mid and right frames. The 0th and (N-1)th frames are compared by the standard SSIM approach in the first iteration. If they are found to be similar, we can instantly conclude that the video contains only a single shot. However, this case is only possible for videos with a few seconds of running time, and such videos are not targeted in the proposed research. Otherwise, when the 0th and (N-1)th frames are dissimilar, the algorithm divides the frames in half. The execution of the 5-frame window is as follows:
• Check the SSIM score of the first and last frames inside the 5-frame window.
• If the SSIM score ≤ 0.42, the frames do not belong to the same shot, and therefore Algo(left, mid) is executed.
• Otherwise, it can be concluded that, since the SSIM score ≥ 0.42, the frames are adjacent, i.e., belong to the same shot.
• Move the 5-frame window to the next five frames by executing Algo(mid, right).
• Return the left frame as the boundary frame.
• If not, the algorithm checks whether it is the last frame. If yes, it returns right as the boundary frame.
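A compact sketch of the fused idea, binary search over the frame index with SSIM deciding which half hides the cut, is shown below. It reuses the ssim() helper sketched earlier, assumes grayscale frames (e.g., via cv2.cvtColor(..., cv2.COLOR_BGR2GRAY)), and keeps only the core divide-and-conquer step of Algorithm 3, assuming at most one abrupt transition in the searched range.

```python
SIM_THRESHOLD = 0.42  # mean SSIM score reported for non-adjacent frames

def find_boundary(frames, left, right):
    """Return the index of the first frame of a new shot in frames[left..right],
    or None if the whole range belongs to a single shot."""
    if ssim(frames[left], frames[right]) >= SIM_THRESHOLD:
        return None                 # end frames similar: single shot, no cut
    if right - left <= 1:
        return right                # adjacent dissimilar frames: cut located
    mid = (left + right) // 2
    # Recurse into whichever half actually contains the dissimilarity.
    if ssim(frames[left], frames[mid]) < SIM_THRESHOLD:
        return find_boundary(frames, left, mid)
    return find_boundary(frames, mid, right)
```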
Table 3 shows the comparison between the state-of-the-art SSIM and the proposed approach for abrupt shot boundary detection. The results are compared based on the average percentages of precision, recall and F1 score, as depicted in Figure 2, Figure 3 and Figure 4, respectively. The comparative analysis of existing state-of-the-art techniques against the proposed method is displayed in Table 4. The only limitation of our proposed approach is that if a shot is repeated, the proposed method considers it a new shot and fetches its boundary frame, so the boundary frames might be similar, hence increasing the redundancy.

TABLE 3: Result comparison between the standard SSIM algorithm and the proposed approach
Approach | Average Precision | Average Recall | Average F1 Score
SSIM | 0.86 | 0.93 | 0.89
Proposed approach | 0.82 | 0.97 | 0.88
Average computational time improvement using the proposed approach: 21.4%

TABLE 4: Comparing the proposed work with state-of-the-art approaches
METHOD | PROS | CONS
Convolutional Neural Networks [54] | 120x real-time | Misses long dissolves, partial scene changes and scenes with motion blur
ORB fused with Structural Similarity [55] | Device-independent, low computational burden and high accuracy | Some false positives caused by flicker and blur
Spatio-temporal regularity of video cube [56] | High accuracy, no false or missed detections | High-speed GPU and high computational cost
Hybrid keypoint detection [32] | High accuracy | Fails to identify the essential key points
Proposed method | High recall, low computational time and burden | Some boundary frames might be redundant

FIGURE 2: Comparison of Precision of the proposed Binary Search fused with SSIM and the state-of-the-art SSIM algorithm
FIGURE 3: Comparison of Recall of the proposed Binary Search fused with SSIM and the state-of-the-art SSIM algorithm
FIGURE 4: Comparison of F1 Score of the proposed Binary Search fused with SSIM and the state-of-the-art SSIM algorithm

The computational time for detecting shot boundaries through the proposed approach improves significantly compared to the traditional SSIM technique. The average detection time of the proposed approach is 129 seconds, which is 21.4% less than the basic approach. We can conclude that the proposed algorithm is 20-25 times faster. The average precision is 83%, which is the same for both approaches, but there is a significant change in average recall (97%) and average F1 score (89%), respectively. The results demonstrate that integrating binary search into the standard SSIM algorithm gradually improves detection performance and reduces the computational time from O(n) to logarithmic time, O(log n).

E. INDEX POINT DETECTION USING YOLOV4
After successfully identifying candidate segments, the next important task is to extract the index points from those image frames. A YoloV4 Darknet custom object detector is used for automatic extraction of index points. Darknet is a robust open-access neural network framework geared towards object detection, using CPU and GPU computation. Its unique architecture and high speed distinguish the Darknet framework from other existing neural network architectures. A specialized framework such as YOLO is found in Darknet. YOLO (an acronym for "You Only Look Once") can run on a CPU, but it gains up to 500 times more speed on GPUs because it uses CUDA and cuDNN. Another positive aspect of Darknet is that it is comparatively convenient to train on customized data sets compared with other heavily optimized and popular frameworks. Through this research, we conclude that the method of automatic keyword extraction using a deep neural network gives satisfactory results in terms of speed and accuracy.

1) Configuration
This technique uses a 137-layer YoloV4 custom object detector neural network model for identifying the headings/keywords in the image frames. The neural network model is trained on approximately 6000 video frames and then tested on a suite of 50 educational videos of around 20 hours of run time. The comparative results, viz. accuracy, precision, recall, and F1 score, show that the proposed algorithm successfully generates a digital index with reasonable accuracy in topic detection.
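Training such a single-class detector typically follows the standard Darknet workflow: a custom .cfg derived from yolov4-custom.cfg and pre-trained weights for the first 137 convolutional layers (yolov4.conv.137). The sketch below launches that workflow from Python; the file names obj.data and yolov4-heading.cfg are illustrative placeholders, not the authors' published configuration.

```python
import subprocess

# Fine-tune the single-class heading detector with Darknet, starting from the
# 137-layer pre-trained convolutional backbone (yolov4.conv.137). Assumes a
# compiled darknet binary in the working directory.
subprocess.run([
    "./darknet", "detector", "train",
    "data/obj.data", "cfg/yolov4-heading.cfg", "yolov4.conv.137",
    "-dont_show", "-map",
], check=True)
```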
The technical configuration of YoloV4 is mostly summarized by four Python files, as shown in Figure 1, viz.:
1) Detection.py
Detection.py uses SSIM and OpenCV as imports, gets the unique frames from SSIM.py, and gets the title frames from the YOLOv4 model, which is trained specifically for this purpose. After detection, it yields the data to UI.py in the following format:
'frame_no': frame_no, 'x': x, 'y': y, 'x + w': x + w, 'y + h': y + h
2) Reducer.py
This Python script gets the video, applies SSIM to the video frames, and yields a frame to Detection.py whenever a unique frame is found.
3) UI Runner.py
This file starts the UI, the program's starting point (uses UI and Detection as imports).
4) UI.py
This file initializes the UI, sets up the VLC media player, and uses the extracted title frames from detection. It gets the text image using OpenCV, applies SSIM-like reduction of redundant frames, converts them to text using PyTesseract, and timestamps the unique remaining frames (uses Detection, PySide2, python-vlc and SSIM as imports).
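The OCR step performed by UI.py can be sketched as follows; the helper name and the Otsu binarisation step are assumptions, since only the use of OpenCV and PyTesseract is stated.

```python
import cv2
import pytesseract

def heading_text(frame, box):
    """Crop a detected heading box from a frame and OCR it with Tesseract.
    `box` is the dictionary yielded by Detection.py."""
    crop = frame[box['y']:box['y + h'], box['x']:box['x + w']]
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    # Binarise to help Tesseract read slide text cleanly (assumed pre-processing).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary).strip()
```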
FIGURE 5: Heading Detection using Custom YoloV4

IV. DATASET, HYPOTHESIS AND GROUND TRUTH
The training and validation data comprise frames extracted from video lectures of various domains. The frames are manually labelled through the makesense.ai object detection toolkit. Since this is a single-class object detector, only the heading label is used, as shown in Figure 6, and each training image has a .txt file associated with it which contains the coordinates of the heading in the image.

FIGURE 6: Dataset labelling for training using makesense.ai
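Labels exported from such a toolkit for Darknet follow the standard YOLO .txt convention: one line per box, with a class id and the box centre and size normalised to the image dimensions. The file name and numbers below are purely illustrative of a heading near the top of a slide.

```
# frame_0042.txt - YOLO label for the single class "heading" (class id 0):
# <class-id> <x-center> <y-center> <width> <height>, all normalised to [0, 1]
0 0.500 0.115 0.820 0.090
```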
2) Heading Extraction
The research applies a single-class YOLOv4 custom object detector through Google Colaboratory. Approximately 6000 video frames from different videos were included in our dataset; 90% of the frames were used for training our model, and 10% were used for validation. Yolo performs unzipping and inflating of all our training images and our validation images. Testing of the model is done explicitly on a suite of 50 different video lectures. A sample result of heading detection can be seen in Figure 5. The number of index points varies from video to video.

The detected index points are evaluated against the Mean Opinion Scores of the three raters (MOS_1, MOS_2, MOS_3). The MOS-aggregated mean average precision, mean average recall and mean average F1 over the Q test videos are computed as per equations 6, 7 and 8:

mAP = [Σ_{q=1}^{Q} AveP(q) / Q]_{MOS_1} + [Σ_{q=1}^{Q} AveP(q) / Q]_{MOS_2} + [Σ_{q=1}^{Q} AveP(q) / Q]_{MOS_3}    (6)

mAR = [Σ_{q=1}^{Q} AveR(q) / Q]_{MOS_1} + [Σ_{q=1}^{Q} AveR(q) / Q]_{MOS_2} + [Σ_{q=1}^{Q} AveR(q) / Q]_{MOS_3}    (7)

mAF1 = 2 · (mAP · mAR) / (mAP + mAR)    (8)
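Read literally, equations (6)-(8) average the per-video scores for each rater and then sum over the three raters; a small sketch of that computation (under that reading, since the layout of the original equations leaves some ambiguity) is:

```python
def mos_scores(ave_p_by_rater, ave_r_by_rater):
    """ave_p_by_rater / ave_r_by_rater: dicts mapping each of the three MOS
    raters to a list of per-video AveP(q) / AveR(q) values (equations 6-8)."""
    mean = lambda xs: sum(xs) / len(xs)
    m_ap = sum(mean(p) for p in ave_p_by_rater.values())   # equation (6)
    m_ar = sum(mean(r) for r in ave_r_by_rater.values())   # equation (7)
    m_af1 = 2 * m_ap * m_ar / (m_ap + m_ar)                # equation (8)
    return m_ap, m_ar, m_af1
```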
The difference in precision, recall and F1 score for heading detection between the proposed approach and the Mean Opinion Score given by the external volunteers can be seen in Figure 7. Several spurious frames are produced during detection, some false negatives are generated, and some unnecessary headings are generated due to PyTesseract's efficiency. It can be observed from the figure that the range of precision, recall and F1 score of the proposed approach lies between 65-70%.

FIGURE 7: Difference of precision, recall and F1 for heading detection through the proposed approach and through the Mean Opinion Score

V. RESULTS AND DISCUSSIONS
This paper analyzes the results in two categories. The first category of results (as discussed in Subsection III-D) depicts the improvement in shot boundary detection when the state-of-the-art Structural Similarity Index (SSIM) algorithm is modified by fusing it with a binary search pattern algorithm. The results justify that there is a good enhancement in accuracy, computational complexity and average processing time. The second category of results (as discussed in this Section V) demonstrates the performance of the custom YoloV4 object detection model for heading detection. In order to gauge the effectiveness, apart from calculating the confusion matrix, the accuracy of the detected index points is also assessed based on the Mean Opinion Score. The model is trained on approximately 6000 video frames and then tested on a suite of 50 videos of around 20 hours of run time. We begin by determining the shot boundaries and keyframes using Algorithm 3. We then locate the cut transitions/heading changes by watching the videos and noting each shot's beginning and end times.

The performance and throughput of the YoloV4 custom object detector during training can be observed through Figure 8, which shows average loss vs. iterations. For a model to be 'accurate', we aim for a loss under 2, which is currently 1.83 after 8000 iterations.

FIGURE 8: Average Loss vs Iterations

The experimental setup is summarised in Table 5. Using this information as the ground truth, we prepared the confusion matrix to calculate precision and recall. The results show that the mean precision and mean F1 score are 70%, as depicted in Figure 9, whereas there is a subsequent increase in recall, with 76%, as shown in Figure 10. This shows that the number of true positives is better than that of false negatives. On the trained YoloV4 model, 81.24% keyframe detection efficiency was obtained after the first test images, and the range of precision, recall, and F1 score varies between 60-75%, which is considered reasonably good.

TABLE 5: Evaluation of the heading detection technique with respect to the videos. The results show a decrease in precision and an increase in recall. The efficiency of keyframe detection, however, increases.
S. No. | Video Name | True Positive (TP) | False Positive (FP) | False Negative (FN) | Precision | Recall | F1
1 | A Brief History of AI | 8 | 5 | 0 | 0.62 | 1.00 | 0.76
2 | A Brief Introduction of Micro-Sensors | 3 | 0 | 0 | 1.00 | 1.00 | 1.00
3 | Agriculture | 3 | 0 | 1 | 1.00 | 0.75 | 0.86
4 | Algorithm A_ | 8 | 2 | 6 | 0.80 | 0.57 | 0.67
5 | ANOVA - I | 5 | 3 | 7 | 0.63 | 0.42 | 0.50
6 | Artificial Intelligence | 18 | 10 | 3 | 0.64 | 0.86 | 0.73
7 | Artificial Neural Networks | 8 | 3 | 4 | 0.73 | 0.67 | 0.70
8 | Biological Neural Network | 3 | 3 | 0 | 0.50 | 1.00 | 0.67
9 | Chi - Square Test of Independence - I | 2 | 3 | 4 | 0.40 | 0.33 | 0.36
10 | Cloud Bigtable as a NoSQL Option | 6 | 0 | 1 | 1.00 | 0.86 | 0.92
11 | Computer Integrated Manufacturing | 2 | 9 | 0 | 0.18 | 1.00 | 0.31
12 | Data frames | 9 | 2 | 2 | 0.82 | 0.82 | 0.82
13 | Deterministic Search | 5 | 2 | 2 | 0.71 | 0.71 | 0.71
14 | Environmental and Occupational Hazzards | 1 | 1 | 5 | 0.50 | 0.17 | 0.25
15 | Introduction- Biosafety Aspects of GE Plants | 4 | 2 | 1 | 0.67 | 0.80 | 0.73
16 | Introduction to Mathematical Logic | 11 | 3 | 1 | 0.79 | 0.92 | 0.85
17 | Introduction to User-Centric Computing | 7 | 3 | 3 | 0.70 | 0.70 | 0.70
18 | Environmental Chemistry and Microbiology | 1 | 1 | 1 | 0.50 | 0.50 | 0.50
19 | Lec 6 _ MIPS Pipeline for Multi-Cycle Operations | 5 | 2 | 0 | 0.71 | 1.00 | 0.83
20 | Lec 17_ Introduction to DRAM System | 6 | 2 | 6 | 0.75 | 0.50 | 0.60
21 | Lecture 2 _ Introduction to Fuzzy Logic | 6 | 3 | 5 | 0.67 | 0.55 | 0.60
22 | Lecture 07_ Lexical Analysis | 8 | 4 | 2 | 0.67 | 0.80 | 0.73
23 | Lecture 07_ Projective Transformation | 11 | 4 | 3 | 0.73 | 0.79 | 0.76
24 | Lecture 7 | 11 | 5 | 2 | 0.69 | 0.85 | 0.76
25 | Lecture 9 _ Introducing Functions | 6 | 6 | 2 | 0.50 | 0.75 | 0.60
26 | Lecture 10 _ Misra-Gries sketch | 3 | 9 | 5 | 0.25 | 0.38 | 0.30
27 | Lecture 17 | 6 | 9 | 1 | 0.40 | 0.86 | 0.55
28 | Linear Regression vs Logistic Regression | 8 | 4 | 0 | 0.67 | 1.00 | 0.80
29 | Optical Fibres | 5 | 1 | 1 | 0.83 | 0.83 | 0.83
30 | Philosophical Foundations of Social Research | 2 | 2 | 0 | 0.50 | 1.00 | 0.67
31 | Phonetics and Phonology | 1 | 0 | 0 | 1.00 | 1.00 | 1.00
32 | Probability and NLP | 5 | 2 | 0 | 0.71 | 1.00 | 0.83
33 | Probability Distributions- Gaussian, Bernoulli | 7 | 0 | 1 | 1.00 | 0.88 | 0.93
34 | SAT Problem | 4 | 6 | 0 | 0.40 | 1.00 | 0.57
35 | SPM_Intro | 4 | 2 | 5 | 0.67 | 0.44 | 0.53
36 | Statistical Properties of Words - Part 01 | 3 | 3 | 8 | 0.50 | 0.27 | 0.35
37 | Terraform | 8 | 2 | 3 | 0.80 | 0.73 | 0.76
38 | What Is Project Management | 23 | 9 | 4 | 0.72 | 0.85 | 0.78
39 | What are microservices | 8 | 1 | 3 | 0.89 | 0.73 | 0.80
40 | What Is Jenkins | 11 | 8 | 0 | 0.58 | 1.00 | 0.73
41 | What is Machine Learning | 23 | 5 | 1 | 0.82 | 0.96 | 0.88
42 | What is Scrum | 5 | 3 | 3 | 0.63 | 0.63 | 0.63
43 | Introduction To DevOps | 11 | 6 | 1 | 0.65 | 0.92 | 0.76
44 | Network Security Tutorial | 33 | 13 | 7 | 0.72 | 0.83 | 0.77
45 | SOAP vs REST | 5 | 0 | 2 | 1.00 | 0.71 | 0.83
46 | Introduction to AWS Services | 7 | 0 | 4 | 1.00 | 0.64 | 0.78
47 | POP3 vs IMAP | 3 | 0 | 2 | 1.00 | 0.60 | 0.75
48 | Kubernetes for Beginners | 8 | 1 | 1 | 0.89 | 0.89 | 0.89
49 | Need of IoT and Application | 9 | 0 | 2 | 1.00 | 0.82 | 0.90
50 | What is JWT JSON | 3 | 4 | 3 | 0.43 | 0.50 | 0.46
Average Precision, Recall and F1 Score: 0.70 | 0.76 | 0.70

FIGURE 9: Precision representation for the proposed approach
FIGURE 10: Recall representation for the proposed approach

VI. CONCLUSION
The study establishes a basis for indexing digital videos through the YoloV4 Darknet neural network. Detection accuracy, precision, recall, and F1 score are used to reflect the overall performance and outcomes. The threshold values are based on experimental observations of multiple videos. As this is a new approach, much work can be done in the future. For instance, an adaptive threshold value can be used instead, and the performance is still poor in the case of handwritten text in the videos. The framework used a basic binary search algorithm infused with the Structural Similarity Index. The dataset used for training the neural network was made from 6000 video frames, and for testing, a suite of 50 videos was randomly selected.

It is important to note that the manual annotation used for training has limitations. For example, during the random data collection to create a training dataset, one cannot always ascertain the annotators' context, accuracy, and qualification. However, to confine the scope of the study to a specific domain, the experiments were structured to include only participants with an educational context and to cover videos from educational portals. Furthermore, manual indexing is only used for evaluation and not for the generation of index points. As a result, our subjective assessment of the proposed algorithm using three different types of experiments is sufficient to determine its efficacy.
VII. FUTURE ASPECTS OF THE RESEARCH
The present study tackled some aspects; nevertheless, more in-depth optimization and benchmarking are necessary to study the effect on the efficiency of index point detection for standardization and relevant summaries. In the future, we shall explore the applicability of this technique and any potentially needed extensions.

This research discovered that the frequency of words, n-grams, and the number of first-time words that appear in a video provide valuable information for video segmentation by topic. It is therefore expected to conduct a long-term, real-world effect analysis to evaluate keywords in instructional videos. The research findings can also be improved by synchronizing speech with textual data. The spoken words can be extracted using speech transcripts and, based on the context and semantics of the speech, can be associated with the image text.

REFERENCES
[1] G. D. Abowd, "Classroom 2000: An experiment with the instrumentation of a living educational environment," IBM Systems Journal, vol. 38, no. 4, pp. 508–530, 1999.
[2] T. Kar and P. Kanungo, "A novel method of shot boundary detection using center symmetric local binary pattern," International Journal of Engineering Research and Technology, vol. 4, 2018.
[3] A. Coombe, "Global education census report," Cambridge Assessment International Education, 2018.
[4] R. Lienhart, "Automatic text segmentation and text recognition for video indexing," Multimedia Systems, Jan 2000.
[5] M. Abdel-Mottaleb, N. Dimitrova, R. Desai, and J. Martino, "CONIVAS: Content-based image and video access system," in Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, (New York, NY, USA), pp. 427–428, Association for Computing Machinery, 1997.
[6] M. Riedl and C. Biemann, "TopicTiling: A text segmentation algorithm based on LDA," in Proceedings of the ACL 2012 Student Research Workshop, (Jeju Island, Korea), pp. 37–42, Association for Computational Linguistics, July 2012.
[7] F. Sauli, A. Cattaneo, and H. van der Meij, "Hypervideo for educational purposes: a literature review on a multifaceted technological tool," Technology, Pedagogy and Education, vol. 27, no. 1, pp. 115–134, 2018.
[8] A. Biswas, A. Gandhi, and O. Deshmukh, "MMToC: A multimodal method for table of content creation in educational videos," in Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, (New York, NY, USA), pp. 621–630, Association for Computing Machinery, 2015.
[9] M. Lin, J. Nunamaker, M. Chau, and H. Chen, "Segmentation of lecture videos based on text: a method combining multiple linguistic features," in 37th Annual Hawaii International Conference on System Sciences, 2004.
[10] N. J. Uke and R. Thool, "Segmentation and organization of lecture video based on visual contents," International Journal of e-Education, e-Business, e-Management and e-Learning, 2012.
[11] M. Merler and J. R. Kender, "Semantic keyword extraction via adaptive text binarization of unstructured unsourced video," in 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 261–264, 2009.
[12] J. Adcock, M. Cooper, L. Denoue, H. Pirsiavash, and L. A. Rowe, "TalkMiner: A lecture webcast search engine," in Proceedings of the 18th ACM International Conference on Multimedia, MM '10, (New York, NY, USA), pp. 241–250, Association for Computing Machinery, 2010.
[13] C. Hui, S. Yunyu, Y. Haisheng, G. Ming, and Y. Liu Xiang, Xia, "A fast and robust key frame extraction method for video copyright protection," Journal of Electrical and Computer Engineering, March 2017.
[14] J. J. S. I. V. P. Devassia, "An innovative video searching approach using video indexing," International Journal of Computer Science & Network, vol. 8, pp. 144–147, 2021.
[15] S. Amirian, K. Rasheed, T. R. Taha, and H. R. Arabnia, "Automatic image and video caption generation with deep learning: A concise review and algorithmic overlap," IEEE Access, vol. 8, pp. 218386–218400, 2020.
[16] S. Ilyas and H. Ur Rehman, "A deep learning based approach for precise video tagging," in 2019 15th International Conference on Emerging Technologies (ICET), pp. 1–6, 2019.
[17] A. Podlesnaya and S. Podlesnyy, "Deep learning based semantic video indexing and retrieval," in Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016, pp. 359–372, Springer International Publishing, 2018.
[18] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[19] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
[20] V. K. Kamabathula and S. Iyer, "Automated tagging to enable fine-grained browsing of lecture videos," in 2011 IEEE International Conference on Technology for Education, pp. 96–102, 2011.
[21] S. Jothilakshmi, "Spoken keyword detection using autoassociative neural networks," International Journal of Speech Technology, vol. 17, 2014.
[22] T. Zhou, K. Wang, J. Wu, and R. Li, "Video text processing method based on image stitching," in 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC), pp. 561–566, 2019.
[23] L. Zhang and Y. Lu, "Video object segmentation by latent outcome regression," IEEE Access, vol. 8, pp. 30355–30367, 2020.
[24] W. Lu, H. Sun, J. Chu, X. Huang, and J. Yu, "A novel approach for video text detection and recognition based on a corner response feature map and transferred deep convolutional neural network," IEEE Access, vol. 6, pp. 40198–40211, 2018.
[25] S. Chakraborty, A. Singh, and D. M. Thounaojam, "A novel bifold-stage shot boundary detection algorithm: invariant to motion and illumination," The Visual Computer, February 2021.
[26] B. S. Rashmi and H. S. Nagendraswamy, "Video shot boundary detection using block based cumulative approach," Multimedia Tools and Applications, vol. 80, January 2021.
[27] R. Mishra, "Video shot boundary detection using hybrid dual tree complex wavelet transform with Walsh Hadamard transform," Multimedia Tools and Applications, vol. 80, January 2021.
[28] F.-F. Duan and F. Meng, "Video shot boundary detection based on feature fusion and clustering technique," IEEE Access, vol. 8, pp. 214633–214645, 2020.
MEHUL MAHRISHI (IEEE Senior Member) is currently working as an Associate Professor in the Department of Information Technology at the Swami Keshvanand Institute of Technology, Management and Gramothan, Jaipur, India. He is a Senior Member of the IEEE Delhi Section, a Life Member of the Institution of Engineers, India (L-IEI), and a Life Member of the International Association of Engineers (L-IAENG). In addition, he was a member of an Indian contingent of the BRICS International Forum that travelled to Moscow, Russia, to attend Russian Energy Week 2018 and the BRICS Youth Energy Summit. He has published more than 25 research papers in national and international journals and conferences, including the 19th ISDA, the 4th SoCTA, the 3rd International Conference on Machine Learning and Computing (ICMLC), and the 3rd IACC, as well as books and book chapters. He has also participated in various forums by Infosys, TCS, WIPRO Mission 10X, and IBM. His research activities are currently twofold: the first explores developmental enhancements in computer vision applications, while the second focuses on the emerging capabilities of parallel and cloud computing. Mr. Mahrishi has been recognized on several occasions in various domains, including recognition as an active reviewer by the Journal of Parallel and Distributed Computing (JPDC, Elsevier; SCI & Scopus indexed), an IEEE continuing education certification for "Cloud Computing Enabling Technologies", and recognition for outstanding performance in the Campus Connect Program by Infosys, India.

DR. SUDHA MORWAL is currently working as an Associate Professor in the Department of Computer Science, Banasthali Vidyapith. She holds M.Sc., M.Tech., and Ph.D. degrees in Computer Science and has played a prominent role in the department's academic activities for the last decade. She is a co-author of many books on computer science in Hindi and English and has published many papers in national and international journals and conferences.
ABDUL WAHAB MUZAFFAR was born in Azad Jammu & Kashmir, Pakistan, in 1983. He received a B.S. degree in Computer Science from the University of Azad Jammu and Kashmir, an M.S. degree in Computing from SZABIST, Pakistan, in 2010, and a Ph.D. degree in Software Engineering from the National University of Sciences and Technology (NUST), Islamabad, Pakistan, in 2017. From 2010 to 2018, he worked in different roles at NUST, Pakistan. Since August 2018, he has been an Assistant Professor with the Saudi Electronic University, Saudi Arabia. He is the author of 27 conference and journal papers. His research interests include software engineering, data and text mining, bioinformatics, and machine learning. Dr. Muzaffar has presented his research at conferences in the UAE, Thailand, and the USA, and has reviewed papers for several journals, including PLOS ONE.

DR. PANKAJ DADHEECH received his Ph.D. degree in Computer Science & Engineering from Suresh Gyan Vihar University (accredited by NAAC with an 'A' grade), Jaipur, Rajasthan, India. He received his M.Tech. degree in Computer Science & Engineering from Rajasthan Technical University, Kota, and his B.E. in Computer Science & Engineering from the University of Rajasthan, Jaipur. He has more than 15 years of teaching experience and is currently working as an Associate Professor and Deputy HOD in the Department of Computer Science & Engineering (NBA accredited), Swami Keshvanand Institute of Technology, Management & Gramothan (SKIT), Jaipur, Rajasthan, India. He has published 15 patents with Intellectual Property India, Office of the Controller General of Patents, Designs and Trade Marks, Department of Industrial Policy and Promotion, Ministry of Commerce and Industry, Government of India, and 4 Australian patents with the Commissioner of Patents, Intellectual Property Australia, Australian Government. He has also registered and been granted a research copyright by the Registrar of Copyrights, Copyright Office, Department for Promotion of Industry and Internal Trade, Ministry of Commerce and Industry, Government of India. He has presented 51 papers at various national and international conferences, has 38 publications in international and national journals, and has published 2 books and 8 book chapters. He is a member of several professional organizations, including CSI, ACM, IAENG, and ISTE. He has been appointed as a Ph.D. research supervisor in the Department of Computer Science & Engineering at SKIT, Jaipur (a recognized research centre of Rajasthan Technical University, Kota), and has guided various M.Tech. research scholars. He has chaired technical sessions at international conferences, contributed as a resource person in various FDPs, workshops, STTPs, and conferences, and serves as a guest editor for reputed journals and conference proceedings and as a Bentham Ambassador for Bentham Science Publishers. His areas of interest include high performance computing, cloud computing, information security, big data analytics, intellectual property rights, and the Internet of Things.

MOHAMMAD KHALID IMAM RAHMANI (IEEE Senior Member) was born in Patherghatti, Kishanganj, Bihar, India, in 1975. He received the B.Sc. (Engg.) degree in Computer Engineering from Aligarh Muslim University, India, in 1998, the M.Tech. degree from Maharshi Dayanand University, Rohtak, in 2010, and the Ph.D. degree in Computer Science Engineering from Mewar University, India, in 2015. From 1999 to 2006, he was a Lecturer with Maulana Azad College of Engineering and Technology, Patna. From 2006 to 2008, he was a Lecturer and Senior Lecturer with Galgotias College of Engineering and Technology, Greater Noida. From 2010 to 2011, he was an Assistant Professor at GSMVNIET, Palwal. Since 2017, he has been an Assistant Professor in the Department of Computer Science, College of Computing and Informatics, Saudi Electronic University, Riyadh, Saudi Arabia. His research interests include algorithms, IoT, cryptography, image retrieval, machine learning, and deep learning. He has published more than 30 research papers in journals and conferences of international repute and holds one patent of innovation. He has also reviewed papers for several journals, including Sādhanā (Springer) and the International Journal of Advanced Computer Science and Applications.

DR. SURBHI BHATIA received her doctorate in Computer Science and Engineering from Banasthali Vidyapith, India, her Master's in Technology from Amity University in 2012, and her Bachelor's in Information Technology in 2010. She has earned the Project Management Professional certification from PMI, USA. She is currently an Assistant Professor in the Department of Information Systems, College of Computer Sciences and Information Technology, King Faisal University, Saudi Arabia, with eight years of teaching and academic experience. She is an editorial board member of the International Journal of Hybrid Intelligence (Inderscience Publishers) and SN Applied Sciences (Springer). She has published many research papers in reputed journals and conferences indexed in major databases and holds patents granted in the USA, Australia, and India. She is currently serving as a guest editor of special issues in reputed journals and has delivered keynote talks at IEEE conferences and in faculty development programs. She has authored two books and edited seven books with Springer, Wiley, and Elsevier. She has completed two funded research projects from the Deanship of Scientific Research, King Faisal University, and the Ministry of Education, Saudi Arabia. Her research interests are machine learning, sentiment analysis, and information retrieval.