International Journal of Engineering Research & Technology (IJERT) ISSN: 2278-0181 Vol. 2 Issue 8, August - 2013

Load Balancer Scheduling Over Streaming Data in Federated Databases

A. Sreeja 1, I. V. Sailaxmiharitha 2, N. Bhaskar 3
1 M-Tech in CSE, CMRTC, Hyderabad, India
2 M-Tech in CSE, CMRTC, Hyderabad, India
3 Associate Professor, CSE, CMRTC, Hyderabad, India

Abstract

This work models the streaming data warehouse update problem as a scheduling problem, where jobs correspond to the processes that load new data into tables and the objective is to minimize data staleness over time. The proposed scheduling framework handles the complications encountered by a stream warehouse: view hierarchies and priorities, data consistency, the inability to pre-empt updates, heterogeneity of update jobs caused by different inter-arrival times and data volumes among different sources, and transient overload. Update scheduling arises in streaming data warehouses, which combine the features of traditional data warehouses and data stream systems.

Traditional data warehouses are updated during downtimes and store layers of complex materialized views over terabytes of historical data. On the other hand, Data Stream Management Systems (DSMS) support simple analyses of recently arrived data in real time. Streaming warehouses such as DataDepot combine the features of these two systems by maintaining a unified view of current and historical data. This enables real-time decision support for business-critical applications that receive streams of append-only data from external sources. Applications include:

- Online stock trading, where recent transactions generated by multiple stock exchanges are compared against historical trends in nearly real time to identify profit opportunities;
- Credit card or telephone fraud detection, where streams of point-of-sale transactions or call details are collected in nearly real time and compared with past customer behavior;
- Network data warehouses maintained by Internet Service Providers (ISPs), which collect various system logs and traffic summaries to monitor network performance and detect network attacks.

The need for on-line warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead to the warehouse processes. The problem with this approach is that new data may arrive on multiple streams, but there is no mechanism for limiting the number of tables that can be updated simultaneously.

Keywords: Online Scheduling, Data Warehouse, Data Modules, Web Database.

1. Introduction

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.

A load balancer can be used to increase the capacity of a server farm beyond that of a single server. It can also allow the service to continue even in the face of server downtime due to server failure or maintenance. A load balancer consists of a virtual server which, in turn, consists of an IP address and port. This virtual server is bound to a number of physical services running on the physical servers in a server farm. A client sends a request to the virtual server, which in turn selects a physical server in the server farm and directs the request to it. Load balancers are sometimes referred to as "directors"; while originally a marketing name chosen by various companies, the term also reflects the load balancer's role in managing connections between clients and servers.
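The virtual-server dispatch described above can be sketched in a few lines. This is a minimal illustration, assuming a round-robin selection policy (one common choice) and hypothetical class and server names; real load balancers add health checks, weighting, and session affinity.

```python
from dataclasses import dataclass
from itertools import cycle

@dataclass
class PhysicalServer:
    host: str
    port: int

class VirtualServer:
    # A virtual server is an (IP address, port) pair bound to a pool of
    # physical services, as described above. The round-robin policy here
    # is an illustrative assumption.
    def __init__(self, ip, port, pool):
        self.ip, self.port = ip, port
        self._next = cycle(pool)

    def direct(self, request):
        # Select a physical server and direct the request to it.
        server = next(self._next)
        return (server.host, server.port, request)

vs = VirtualServer("10.0.0.1", 80,
                   [PhysicalServer("srv-a", 8080), PhysicalServer("srv-b", 8080)])
assert vs.direct("GET /")[0] == "srv-a"
assert vs.direct("GET /")[0] == "srv-b"
assert vs.direct("GET /")[0] == "srv-a"   # rotation wraps around
```

Each client request reaches the virtual address only; the pool membership can change without clients noticing, which is what allows maintenance and failover to be transparent.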
Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Traditional data warehouses are typically refreshed during downtimes and store layers of complex materialized views over terabytes of historical data, whereas streaming warehouses are updated as new data arrive. We then propose a scheduling framework that handles the complications encountered by a stream warehouse: view hierarchies and priorities, data consistency, the inability to pre-empt updates, heterogeneity of update jobs caused by different inter-arrival times and data volumes among different sources, and transient overload.

The goal of a streaming warehouse is to propagate new data across all the relevant tables and views as quickly as possible. Once new data are loaded, the applications and triggers defined on the warehouse can take immediate action. This allows businesses to make decisions in nearly real time, which may lead to increased profits, improved customer satisfaction, and prevention of serious problems that could develop if no action was taken. Recent work on streaming warehouses has focused on speeding up the Extract-Transform-Load (ETL) process.

2. Extract Transform Load

The term ETL, which stands for extract, transform, and load, denotes a three-stage process in database usage and data warehousing. It enables integration and analysis of data stored in different databases and heterogeneous formats. After it is collected from multiple sources (extraction), the data is reformatted and cleansed for operational needs (transformation). Finally, it is loaded into a target database, data warehouse, or data mart to be analyzed. Most of the numerous extraction and transformation tools also enable loading of the data into the end target. Beyond data warehousing and business intelligence, ETL tools can also be used to move data from one operational system to another.

2.1 Extraction.

The extraction step is conceptually the simplest task of all, with the goal of identifying the correct subset of source data that has to be submitted to the ETL workflow for further processing. As with the rest of the ETL process, extraction takes place at idle times of the source system, typically at night. In practice, the task is of considerable difficulty, due to two technical constraints:

- The source must suffer minimum overhead during the extraction, since other administrative activities also take place during that period; and
- Both for technical and political reasons, administrators are quite reluctant to accept major interventions to their system's configuration; therefore, there must be minimum interference with the software configuration at the source side.

The purpose of the extraction process is to reach the source systems and collect the data needed for the data warehouse. Usually data is consolidated from different source systems that may use a different data organization or format, so the extraction must convert the data into a format suitable for transformation processing. The complexity of the extraction process may vary, and it depends on the type of source data. The extraction process also includes selection of the data, as the source usually contains redundant data or data of little interest. For the ETL extraction to be successful, it requires an understanding of the data layout. A good ETL tool additionally enables storage of an intermediate version of the data being extracted. This is called a "staging area" and makes reloading raw data possible in case of a later loading problem, without re-extraction. The raw data should also be backed up and archived.

2.2 Transformation.

The transform stage of an ETL process involves the application of a series of rules or functions to the extracted data. It includes validation of records and their rejection if they are not acceptable, as well as an integration part. The amount of manipulation needed in the transformation process depends on the data. Good data sources will require little transformation, whereas others may require one or more transformation techniques to meet the business and technical requirements of the target database or data warehouse. The most common processes used for transformation are conversion, clearing duplicates, standardizing, filtering, sorting, translating, and looking up or verifying whether the data sources are inconsistent. A good ETL tool must enable building up of complex processes and extending a tool library so that custom user functions can be added.
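The common transformation steps named above (conversion, clearing duplicates, standardizing, filtering, sorting) can be sketched as a single pass over toy records. The field names and validation rule here are illustrative assumptions, not taken from any particular ETL tool.

```python
def transform(records):
    """One pass applying common ETL transform steps to toy records."""
    seen, out = set(), []
    for r in records:
        row = {"name": r["name"].strip().upper(),   # standardize text
               "amount": float(r["amount"])}        # convert type
        key = (row["name"], row["amount"])
        if key in seen:                             # clear duplicates
            continue
        seen.add(key)
        if row["amount"] > 0:                       # validate / filter rejects
            out.append(row)
    return sorted(out, key=lambda row: row["name"])  # sort

rows = [{"name": " acme ", "amount": "10.5"},
        {"name": "ACME",   "amount": "10.5"},   # duplicate once standardized
        {"name": "beta",   "amount": "-1"}]     # rejected by validation
assert transform(rows) == [{"name": "ACME", "amount": 10.5}]
```

Note that deduplication only works after standardization; ordering the steps is itself a design decision the ETL developer must make.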
2.3 Load.

Loading is the last stage of the ETL process; it loads the extracted and transformed data into a target repository. There are various ways in which ETL tools load the data. Some of them physically insert each record as a new row into the table of the target warehouse using a built-in SQL insert statement, whereas others link the extraction, transformation, and loading processes for each record from the source. The loading part is usually the bottleneck of the whole process. To increase efficiency with larger volumes of data, we may need to bypass SQL and data recovery, or apply an external high-performance sort that additionally improves performance.

In a hard real-time system, jobs must be completed before their deadlines, a metric that is simple to understand and to prove results about.
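Deadline-based metrics can be contrasted with the staleness measure adopted later in this paper, roughly the difference between the current time and the timestamp of the newest record in a table. A minimal sketch, assuming tables are represented simply as lists of record timestamps (an illustrative assumption):

```python
# Sketch of the staleness measure used in this paper: the gap between
# "now" and the timestamp of the most recent record loaded into a table.
def staleness(table_timestamps, now):
    if not table_timestamps:
        return float("inf")   # an empty table is maximally stale
    return now - max(table_timestamps)

# Average staleness over several tables, as in the scheduling metric:
def average_staleness(tables, now):
    return sum(staleness(t, now) for t in tables) / len(tables)

assert staleness([100, 140, 155], now=160) == 5
assert average_staleness([[100, 155], [150]], now=160) == 7.5
```

Minimizing this quantity over time, rather than counting met deadlines, is what distinguishes the warehouse-update setting from classical real-time scheduling.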
In a firm real-time system, jobs can miss their deadlines, and if they do, they are discarded; the performance metric is the fraction of jobs that meet their deadlines. However, a streaming warehouse must load all of the data that arrive, so no updates can be discarded. In a soft real-time system, late jobs are allowed to stay in the system, and the performance metric is lateness, the difference between the completion times of late jobs and their deadlines. However, lateness does not capture the properties of update jobs that matter here. Instead, we define a scheduling metric in terms of data staleness, roughly defined as the difference between the current time and the timestamp of the most recent record in a table.

3. Existing System

The closest work to ours finds the best way to schedule updates of tables and views in order to maximize data freshness. Traditional data warehouses are typically refreshed during downtimes, whereas streaming warehouses are updated as new data arrive; a traditional data warehouse stores layers of complex materialized views over terabytes of historical data. The existing system does not support making decisions immediately and in real time, and it is not suitable for streaming data warehouse maintenance. The problem with this approach is that new data may arrive on multiple streams, but there is no mechanism for limiting the number of tables that can be updated simultaneously.

4. Proposed System

In this paper, we motivate, formalize, and solve the problem of non-preemptively scheduling updates in a real-time streaming warehouse. We propose the notion of average staleness as a scheduling metric and present scheduling algorithms designed to handle the complex environment of a streaming data warehouse. We then propose a scheduling framework that assigns jobs to processing tracks and uses basic algorithms to schedule jobs within a track. The main feature of our framework is the ability to reserve resources for short jobs that often correspond to important frequently refreshed tables, while avoiding the inefficiencies associated with partitioned scheduling techniques. Aside from using a different definition of staleness, our Max Benefit basic algorithm is analogous to the max-impact algorithm, as is our "Sum" priority inheritance technique. Our main innovation is the Multi-track Proportional algorithm for scheduling the large and heterogeneous job sets encountered by a streaming warehouse; additionally, we propose update chopping to deal with transient overload.

4.1 Proposed System Architecture

Figure 1: Proposed System Architecture

Every time the seller sends details about a share, those details are automatically streamed and updated at the top of the form before the buyer buys the particular share. Share details such as company name, shares sold, and available quantity are updated from the database and shown in streaming format, so users do not need to refresh the page every time.

5. Literature Survey

5.1 Soft Real-Time Database System

Prior work has proposed how to efficiently export a materialized view, but to our knowledge none has studied how to efficiently import one. To install a stream of updates, a real-time database system must process new updates in a timely fashion to keep the database fresh, but at the same time must process transactions and meet their time constraints. Various properties of updates and views affect this trade-off. This work examines, through simulation, four algorithms for scheduling transactions and installing updates in a soft real-time database [1].

5.2 Multiple View Consistency for Data Warehouse

A data warehouse stores integrated information from multiple distributed data sources. In effect, the warehouse stores materialized views over the source data. The problem of ensuring data consistency at the warehouse can be divided into two components: ensuring that each view reflects a consistent state of the base data, and ensuring that multiple views are mutually consistent. This work addresses guaranteeing multiple view consistency (MVC) and identifies and formally defines three layers of consistency for materialized views in a distributed environment [2].

5.3 Synchronizing a Database to Improve Freshness

This work proposes a method to refresh a local copy of an autonomous data source to keep the copy up-to-date. As the size of the data grows, it becomes difficult to maintain a fresh copy, making it crucial to synchronize the copy effectively. The work defines freshness metrics, change models of the underlying data, and synchronization policies [3].

5.4 Operator Scheduling for Memory Minimization

In many applications involving continuous data streams, data arrival is bursty and the data rate fluctuates over time. Systems that seek to give rapid or real-time query responses in such an environment must be prepared to deal gracefully with bursts in data arrival without compromising system performance. One strategy for processing bursty streams is adaptive, load-aware scheduling of query operators to minimize resource consumption during times of peak load. Chain scheduling is an operator scheduling strategy for data stream systems that is near-optimal in minimizing run-time memory usage for any collection of single-stream queries involving selections, projections, and foreign-key joins with stored relations. Chain scheduling also performs well for queries with sliding-window joins over multiple streams, and for multiple queries of the above types [4].
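The Chain strategy summarized above can be caricatured as a greedy rule: among operators with queued input, run the one that sheds the most memory per unit of processing time. The operator model below is an illustrative assumption, not the full algorithm of [4]'s cited Chain paper.

```python
def pick_operator(operators):
    """Greedy Chain-style choice: highest memory shed per unit time."""
    ready = [op for op in operators if op["queued"] > 0]
    if not ready:
        return None            # nothing has queued input
    # "drop" is the assumed fraction of tuple volume the operator
    # eliminates; dividing by its per-tuple cost ranks operators by
    # how quickly they release memory.
    return max(ready, key=lambda op: op["drop"] / op["cost_ms"])

ops = [{"name": "filter",  "queued": 4, "drop": 0.9, "cost_ms": 1.0},
       {"name": "join",    "queued": 2, "drop": 0.5, "cost_ms": 2.0},
       {"name": "project", "queued": 0, "drop": 0.8, "cost_ms": 0.5}]
assert pick_operator(ops)["name"] == "filter"   # 0.9/1.0 beats 0.5/2.0
```

Under bursty arrivals, repeatedly applying such a rule keeps queued tuple volume, and hence run-time memory, low.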
6. Modules

6.1 New Share Entry

The user uploads new share details into the database, entering information such as the company id number, company name, date of submission of the share, product code, product name, quantity, shares sold, last and current year profit, and term period.

6.2 Seller Product Details

The company registers its product by entering the product code, brand name, and a description of the product. This is called registration of the particular product. After entering these data, the seller submits them to the database. Once the details are stored, the buyer can view them and buy the particular share.

6.3 View Share Details

The buyer can see the details given by the seller regarding each share, such as the product name, product code, and brand name. The data are collected from the relevant database.

6.4 List Out Data in Streaming

This is the main operation between seller and buyer. Every time the seller enters details about a share, they are automatically streamed and updated at the top of the form before the buyer buys the particular share. Share details such as company name, shares sold, and available quantity are updated from the database. These modules show all details about a particular share in various companies in streaming format, so users do not need to refresh the page every time.

6.5 Buyer View Stock Details

This module lets a buyer or customer view the details of a particular product. Before buying the product, they can view all the information about it; the data continue to stream, and for more information the buyer goes to the view stock details page.

6.6 Buyer Buying Process

In this module, the buyer sends data to the seller, giving information such as the total cost of the share, buyer id, buyer name, and date of buying, and finally submits it to the database. When the buying process completes, the record enters the streaming data in FIFO (First In, First Out) order. If any share price or quantity is updated, the updated share is added to the stream in place of the old data. The streaming data are displayed based on ranking and priorities. The buyer analyzes the share details history and, if satisfied with the share details, purchases the share.

7. Conclusion

We formalized and solved the problem of non-preemptively scheduling updates in a real-time streaming warehouse. We proposed the notion of average staleness as a scheduling metric and presented scheduling algorithms designed to handle the complex environment of a streaming data warehouse. We then proposed a scheduling framework that assigns jobs to processing tracks and uses basic algorithms to schedule jobs within a track. The main feature of the framework is the ability to reserve resources for short jobs that often correspond to important frequently refreshed tables, while avoiding the inefficiencies associated with partitioned scheduling techniques.

8. Acknowledgement

The successful completion of any task would be incomplete without an expression of simple gratitude to the people who encouraged our work. Words are not enough to express our sense of gratitude towards everyone who directly or indirectly helped in this task. We are thankful to our organization, CMR Technical Campus, which provided good facilities to accomplish this work, and we would like to sincerely thank our Chairman Gopal Reddy, Director Dr. A. Raji Reddy, Dean Dr. Purna Chandra Rao, our HOD K. Srujan Raju, and the faculty members for their great support, valuable suggestions, and guidance in every aspect of this work.

9. References

[1] B. Adelberg, H. Garcia-Molina, and B. Kao, "Applying Update Streams in a Soft Real-Time Database System," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 245-256, 1995.
[2] Y. Zhuge, J. Wiener, and H. Garcia-Molina, "Multiple View Consistency for Data Warehousing," Proc. IEEE 13th Int'l Conf. Data Eng. (ICDE), pp. 289-300, 1997.
[3] J. Cho and H. Garcia-Molina, "Synchronizing a Database to Improve Freshness," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 117-128, 2000.
[4] L. Golab, T. Johnson, and V. Shkapenyuk, "Scheduling Updates in a Real-Time Stream Warehouse," Proc. IEEE 25th Int'l Conf. Data Eng. (ICDE), pp. 1207-1210, 2009.
[5] B. Babcock, S. Babu, M. Datar, and R. Motwani, "Chain: Operator Scheduling for Memory Minimization in Data Stream Systems," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 253-264, 2003.
[6] A. P. Sheth and J. A. Larson, "Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases," ACM Computing Surveys, Vol. 22, No. 3, September 1990.
[7] A. Burns, "Scheduling Hard Real-Time Systems: A Review," Software Eng. J., vol. 6, pp. 116-128, 1991.