Data Mining Middleware for Wide Area High Performance Networks

Robert L. Grossman*, Yunhong Gu, David Hanley, and Michal Sabala
National Center for Data Mining, University of Illinois at Chicago, USA

Joe Mambretti
International Center for Advanced Internet Research, Northwestern University, USA

Alex Szalay and Ani Thakar
Johns Hopkins University, USA

Kazumi Kumazoe and Oie Yuji
Kitakyushu JGN II Research Center, Japan

Minsun Lee, Yoonjoo Kwon, and Woojin Seok
Korea Institute of Science and Technology Information, Korea

* Contact Author: [email protected]. Robert L. Grossman is also with Open Data Partners.

ABSTRACT
In this paper, we describe two distributed, data intensive applications that were demonstrated at iGrid 2005 (iGrid Demonstration US109 and iGrid Demonstration US121). One involves transporting astronomical data from the Sloan Digital Sky Survey (SDSS) and the other involves computing histograms from multiple high volume data streams. Both rely on newly developed data transport and data mining middleware. Specifically, we describe a new version of the UDT network protocol called Composible-UDT, a file transfer utility based upon UDT called UDT-Gateway, and an application for building histograms on high volume data flows called BESH, for Best Effort Streaming Histogram. For both demonstrations, we include a summary of the experimental studies performed at iGrid 2005.

Keywords
High performance network protocols, high performance networks, high performance data mining, data mining middleware.

1. INTRODUCTION
High-speed (1 Gb/s and 10 Gb/s) wide area networks provide us the opportunity to deploy data intensive applications over large geographic areas. Until recently, distributed data intensive applications were usually designed to minimize inter-process data communications; if large data transfers could not be avoided, large data sets were sometimes loaded onto disks or tapes and physically sent to remote sites. As a consequence, there were usually substantial delays when analyzing large distributed data sets, especially when two or more such data sets had to be integrated.

For example, telescopes in the Sloan Digital Sky Survey (SDSS) [19] collect gigabytes of data per day. This data is currently stored locally, and a data release is made periodically, e.g., every quarter of a year. The data is then sent to astronomers around the world via disks or tapes. Analysis results that produce large data sets are difficult to exchange among the astronomers. Also, overlaying a second data set on top of the SDSS data in order to discover astronomical objects that are too faint to be identified from one data set alone requires a substantial effort.

With high-speed wide area optical links connecting the observation stations, processing centers, and astronomers, these data sets and the analysis results can now be shared in near real time. Thus the processing delay can be significantly reduced and different data sets can be more easily combined.

However, existing applications cannot automatically make use of the emerging high-speed networks. First, the de facto Internet transport protocol, TCP as usually deployed, significantly underutilizes the network bandwidth in high-speed long distance environments. Several alternatives and enhancements to TCP have been developed over the past several years [8], including UDT [12]. Second, the current generation of data mining software and middleware was not designed to process data at high speeds across distributed computing sites and data sources.

At iGrid 2005, we demonstrated three middleware applications designed to address these issues. One is a new version of the UDT protocol we have previously described [12] that is composible in the sense that it is designed to easily support different congestion control algorithms [10]. The application is called Composible-UDT. The remaining two middleware applications are built over Composible-UDT. The first of these is a file transfer utility called UDT-Gateway that provides access to UDT-based data services using TCP-based applications for the "last mile." This greatly expands the population of end users that can use UDT-based data services. The second of these is a best effort online histogram application called BESH, for Best Effort Streaming Histogram.

At iGrid 2005, we demonstrated Composible-UDT, UDT-Gateway and BESH in two demonstrations. The first was iGrid Demonstration US121, which transported data from the SDSS. The second was iGrid Demonstration US109, which computed streaming histograms using data from web logs for web servers that provided results of the 1998 World Cup.

In Section 2 we describe the experimental setup. In Section 3, we describe the data transport middleware: UDT and UDT-Gateway. In Section 4, we describe the data mining middleware for computing histograms on high volume streaming data. The experimental results are described in Section 5. Section 6 briefly reviews related work. Section 7 contains a summary and conclusion.

2. EXPERIMENTAL SETUP
In this section we describe the hardware and network infrastructure used in our iGrid 2005 demonstrations. We also describe the data sets we used.
2.1 Hardware Infrastructure
We had four dual Opteron machines with 10GE NICs at the iGrid 2005 conference, each connected to a 10GE port on one of the iGrid 2005 switches.

Located outside the conference, as part of a testbed we operate called the Teraflow Testbed [22], we had four dual Opteron machines with 10GE NICs. Two of the machines were in Chicago, one in Tokyo, Japan, and one in Daejeon, Korea. Each was connected to one of the four machines at the conference via a routed 10 Gb/s link (Figure 1).

All the machines have dual AMD Opterons running at 2.4 GHz, 4 GB of physical memory, and 1.5 TB of RAID 5 or RAID 0 disk space, except for the Korean machine, which uses dual Xeon processors. Debian Linux 2.6 SMP is installed on each of these systems.

Figure 1. iGrid 2005 NCDM Booth Network Diagram. This figure illustrates the 10 Gb/s network infrastructure between several nodes on the Teraflow Testbed and the Linux cluster at iGrid 2005. All links have 10 Gb/s capacity.

2.2 Network Infrastructure
The two machines in Chicago (Figure 1) were located at StarLight and were each connected to their counterparts at iGrid via 10 Gb/s optical paths. The third machine at iGrid was connected to its counterpart in Daejeon, Korea via the PWave and KREONet2 (GLORIAD) networks. The fourth machine at iGrid was connected to its counterpart in one of the JGN II research centers located in Tokyo, Japan via the Abilene, StarLight, and JGN II networks. Each of the connections was 10 Gb/s. The RTT between Chicago and iGrid was 66 ms; the RTT between Tokyo, Japan and iGrid was 180 ms; and the RTT between Daejeon, Korea and iGrid was 139 ms.

2.3 SDSS Data Set
In the US121 demonstration, we preloaded 797 GB of compressed Sloan Digital Sky Survey (SDSS) [19] data onto our four machines at iGrid. This was the BESTDR3 release of the data. The data was divided into 64 files, each of which was about 12.463 GB. When uncompressed, the data was 1.5 TB. During our demonstration, we moved this data to the machine in Korea.

After iGrid, we moved SDSS data from our machines at StarLight to our machines in Japan and applied the streaming histogram algorithm (used in US109) to the SDSS data to analyze the distribution of the brightness of the stars. The results are reported in Section 5.

2.4 World Cup 98 Data Set
The US109 demonstration used web log data from four web servers located in Paris, California, Texas and Virginia that provided information about the 1998 (soccer) World Cup. For the demonstration, we replicated this data onto four of our machines located in Chicago (2), Japan, and Korea.

The original data set contained 8 attributes, including the number of bytes of the data transfer. This attribute ranged in value from 0 to 2^31. Additional derived attributes were added for this demonstration, bringing the number of attributes to 13. Using the iGrid machines, a separate histogram was built on each of the four streams, and then the four histograms were merged to produce a single summary histogram for all four streams.

3. DATA TRANSPORT MIDDLEWARE
It is well known that TCP substantially underutilizes the network bandwidth in high bandwidth-delay product environments. We have analyzed this problem and developed practical solutions since 2001 [9, 10, 11, 12, 13, 14].

As mentioned above, at iGrid 2005, we demonstrated the current version of UDT (called Composible-UDT) and a file transfer utility based upon Composible-UDT called UDT-Gateway. In this section we give a brief review of Composible-UDT and UDT-Gateway.

3.1 UDT
UDT is an application level data transport protocol designed for emerging applications that require the transfer of large amounts of data distributed over high-speed wide area networks (e.g., 1 Gb/s or above). UDT uses UDP to transfer data, but unlike simple UDP it has its own reliability control and congestion control mechanisms. UDT is intended not only for private or QoS-enabled links, but also for shared networks. Furthermore, the current version of UDT that was used at iGrid is designed using a composible framework that supports multiple congestion control algorithms. For more information about UDT, see [12].

Figure 2 shows the memory-memory data transfer performance using UDT between a pair of machines, one in Chicago and one in Tokyo. Because CPU usage limits the throughput of a single UDT flow to about 5 Gb/s, we started two parallel UDT flows between the machines to take advantage of the dual processors on each machine. Using two UDT flows, we can reach a peak throughput of more than 7 Gb/s, which is near the hardware limit. In the experiment shown in Figure 2, UDT reached an average throughput of 6.21 Gb/s, with a maximum throughput of 7.10 Gb/s. The standard deviation of the throughput per one-second interval was 0.81 Gb/s.

Figure 2. UDT Memory-Memory Data Transfer Performance. This performance was obtained between two dual AMD Opteron machines in Chicago and Tokyo, via a 10 Gb/s link with 180 ms RTT.

3.2 Composible-UDT
Composible-UDT is not only a transport protocol library, but also a framework that supports many protocol configurations, in particular different congestion control algorithms.

Previously, we have provided an overview of Composible-UDT's congestion control framework, which is called CCC [10]. UDT/CCC allows user defined congestion control algorithms to be easily implemented, and the UDT/CCC library enables easy implementation of a large variety of congestion control algorithms. The overhead of the framework, compared to a direct implementation, is minimal. When using the default algorithm, Composible-UDT has essentially the same performance as our direct UDT implementation.

Composible-UDT is still an ongoing project. Future releases will also include more configuration abilities, such as limited data reliability.
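To make the CCC framework concrete, the following is a minimal sketch of a user-defined congestion control algorithm. The class and option names (CCC, CCCFactory, UDT_CC, m_dCWndSize, m_dPktSndPeriod) follow the open-source UDT distribution [23], but the exact interface may vary between releases, so the sketch should be read as illustrative rather than definitive. It implements a simple AIMD scheme.

    #include <udt.h>
    #include <sys/socket.h>

    // A user-defined congestion control algorithm plugged into the CCC
    // framework: additive increase on each ACK, multiplicative decrease
    // on loss. Names follow the open-source UDT distribution [23].
    class CSimpleAIMD : public CCC
    {
    public:
        virtual void init()
        {
            m_dCWndSize = 2.0;        // start with a two-packet window
            m_dPktSndPeriod = 0.0;    // purely window based; no rate pacing
        }
        virtual void onACK(int32_t ack)
        {
            m_dCWndSize += 1.0;       // additive increase per acknowledgment
        }
        virtual void onLoss(const int32_t* losslist, int size)
        {
            m_dCWndSize *= 0.5;       // multiplicative decrease on loss
            if (m_dCWndSize < 2.0)
                m_dCWndSize = 2.0;
        }
    };

    // The algorithm is attached to an individual socket before connecting.
    void useSimpleAIMD(UDTSOCKET u)
    {
        UDT::setsockopt(u, 0, UDT_CC, new CCCFactory<CSimpleAIMD>,
                        sizeof(CCCFactory<CSimpleAIMD>));
    }

Because the control algorithm is supplied per socket at the application level, different flows can run different congestion control algorithms without any kernel modification.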
3.3 UDT Gateway
For many end users, it is easier to use a file transfer utility employing TCP, or a web application employing HTTP and TCP, than to use UDT directly. To support this requirement, we developed the UDT-Gateway utility. To the user, it appears that they are accessing data using a TCP-based application on the gateway machine; in fact, the data resides on a data server that is connected to the gateway machine using a high performance network and UDT. The data server can serve multiple gateway machines.

Specifically, the UDT gateway behaves exactly like an HTTP file server and serves clients files via ordinary HTTP/TCP channels. However, the gateway server does not host the files it serves locally. When a request arrives for a file, the file is streamed from a central repository via UDT and then streamed to the end consumer via TCP. In other words, the gateway machine allows the user to access large data sets using UDT and high performance networks for all except the "last mile," which is handled using more standard networks and TCP. HTTP/TCP access for the last mile solves many practical issues related to firewalls that are still a problem for many end users.

The idea is that gateways can be placed on high-speed backbones, and end-users will simply use gateways close to their actual location. This system also provides the benefit of enabling what we call "lightweight routes." The UDT gateways sit on the high-speed backbone, as does the central data repository. The user fetches the data from the closest gateway, and indirectly retrieves the data from the central repository. This not only makes the best use of high-speed backbones, but also hides the procedure to locate the actual data repository, which increases the security and flexibility of the system.
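At the heart of such a gateway is a relay loop that reads a requested file from the repository connection over UDT and writes it to the client's ordinary TCP socket. The sketch below illustrates the idea; the function name relay_file is hypothetical, error handling is elided, and this is a schematic rather than the actual UDT-Gateway source.

    #include <udt.h>
    #include <sys/socket.h>

    // Relay one file of known size from the central repository (reached
    // over UDT) to the end user (reached over plain TCP).
    void relay_file(UDTSOCKET repo, int tcp_client, long long file_size)
    {
        char buf[65536];
        long long done = 0;
        while (done < file_size) {
            // Pull the next chunk from the repository via UDT.
            int n = UDT::recv(repo, buf, sizeof(buf), 0);
            if (n <= 0) break;                      // UDT error or EOF
            // Push the chunk to the client over the TCP connection.
            int sent = 0;
            while (sent < n) {
                int m = send(tcp_client, buf + sent, n - sent, 0);
                if (m <= 0) return;                 // TCP error
                sent += m;
            }
            done += n;
        }
    }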
4. DATA MINING MIDDLEWARE
In this section, we describe the data mining middleware we demonstrated at iGrid 2005.

Simply developing high-speed data transport middleware will not enable wide-area data intensive applications. We also need data mining middleware that scales to high volume data flows. At iGrid, we demonstrated data mining middleware supporting a streaming model for processing data. This model places two requirements on the algorithm: first, the data is examined only once; second, only a fixed amount of storage (independent of the size of the data stream) is available.

There is quite a bit of prior work on streaming algorithms. However, we are not aware of any prior work addressing how to scale streaming algorithms to 1 Gb/s or 10 Gb/s streams.

A key component of several data mining algorithms is computing histograms. We recently developed a binary partition dynamic histogram algorithm that is designed to scale to high volume data flows. As mentioned above, the algorithm is called BESH, for Best Effort Streaming Histogram.

At iGrid, we used BESH to solve the following problem. Suppose there is an infinite data stream consisting of integers between 0 and 2^31. We want to keep a histogram H that records the total number of occurrences of each value that has appeared so far.

Because the value space of the attributes of the data stream is too large to maintain an accurate histogram on it (due to the limitation of memory space), we employ an approximate histogram algorithm in which each bucket covers a range of multiple values, rather than a single value. For example, a bucket [x, y] tracks the number of records whose specific attribute has a value between x and y. Because we must process the data stream in a single pass, we do not know in advance either the number of buckets or the size of each bucket. For this reason, we dynamically split and merge the buckets as we process the stream. There are three major kinds of approximate histograms: equi-width, equi-depth, and V-Optimal [4, 5]. The histogram that results from our algorithm is similar to the equi-depth histogram, but not the same.

The binary partitioning algorithm is illustrated in Figure 3. At the very beginning, there is only one node in the system: [0, 1024] (supposing the values are between 0 and 1024). Once a new value arrives, the root node is split evenly into two nodes. As values continue to arrive, a node that contains more values than a threshold is split further. Similarly, if a node contains a very small number of values, all of its sub-nodes are merged. The leaf nodes constitute the histogram. For example, in Figure 3, the histogram on the stream with values varying between 0 and 1024 is: (0-256: 1000), (257-768: 2500), and (769-1024: 1500).

Figure 3. The BESH Dynamic Binary Partition Histogram. Each node contains the lower and upper limit of the values in one bucket, and the number of values in the bucket. The leaf nodes constitute the histogram. [The figure shows a binary tree: the root (0-1024: 5000) has children (0-256: 1000) and (257-1024: 4000); the latter splits into (257-768: 2500) and (769-1024: 1500).]

In order to describe the detailed algorithm, we define the following four parameters:

N: the total number of values scanned so far.
V[Bi]: the number of values in bucket Bi.
Max-thresh: the upper percentage limit on the size of a bucket. Buckets exceeding this limit must be split.
Min-thresh: the lower percentage limit on the size of a bucket. Back-to-back buckets smaller than this size must be merged.

BESH Initialization: There is one bucket in the histogram, covering values from 0 to 2^31.

Computing a histogram using BESH. For each new record:
1. Locate the bucket B that the new value belongs to; update V[B], all of its parent nodes, and N.
2. If (V[B] / N > max-thresh), equally split the node into two leaf nodes.
3. If new nodes are generated in Step 2, check every node in the tree. For any node B such that (V[B] / N < min-thresh), remove all of its children nodes.

The leaf nodes are not the final result. Once a partial histogram of the stream is needed, a merge process on the leaf nodes is executed:

BESH Merge: For every leaf node B from left to right, if (V[B] / N > min-thresh), then it is one of the buckets in the final result. On the other hand, if the last bucket Bp in the current final histogram satisfies (V[Bp] / N < min-thresh), then B is merged with Bp and no new bucket is inserted into the final histogram; otherwise B is inserted into the final histogram.
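To make the update path concrete, here is a compact sketch of the per-stream binary partition tree, written under the assumptions above (integer values in [0, 2^31], counts propagated along the path from root to leaf). It covers Steps 1 and 2; the pruning of Step 3 and the BESH Merge pass are only noted in comments. The structure is an illustration, not the demonstration code, and it simplifies one detail: newly created leaves start empty, with historical counts remaining on the parent.

    #include <cstdint>

    struct Node {
        uint32_t lo, hi;                  // value range [lo, hi] of this bucket
        uint64_t count = 0;               // V[B]: values seen in this range
        Node *left = nullptr, *right = nullptr;
        Node(uint32_t l, uint32_t h) : lo(l), hi(h) {}
        bool leaf() const { return left == nullptr; }
    };

    struct BESH {
        Node root{0, 1u << 31};           // one initial bucket: 0 .. 2^31
        uint64_t N = 0;                   // total values scanned so far
        double max_thresh, min_thresh;    // split / merge thresholds

        BESH(double maxt, double mint) : max_thresh(maxt), min_thresh(mint) {}

        void add(uint32_t v) {
            ++N;
            Node* b = &root;
            while (true) {                // Step 1: walk to the leaf,
                ++b->count;               // updating V on the whole path
                if (b->leaf()) break;
                b = (v <= b->left->hi) ? b->left : b->right;
            }
            // Step 2: split an over-full leaf evenly into two leaves.
            if ((double)b->count / N > max_thresh && b->lo < b->hi) {
                uint32_t mid = b->lo + (b->hi - b->lo) / 2;
                b->left  = new Node(b->lo, mid);
                b->right = new Node(mid + 1, b->hi);
            }
            // Step 3 (removing the children of under-full nodes) would
            // scan the tree here whenever Step 2 created new nodes.
        }
    };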
Merging histograms from different streams. Finally, because there is one histogram per stream, the final histogram H is produced by merging all the sub-histograms Hi:
1. For every bucket in every sub-histogram Hi, put the boundaries of the bucket into a boundary list.
2. For every interval along the boundary list constructed in Step 1, a new bucket B is allocated. For each bucket B' in every histogram Hi, if B' overlaps with B, then update B with its proportional share of the contribution from B', assuming that the values in B' are evenly distributed.
3. Apply the BESH Merge described above to the histogram constructed in Step 2.

The dominant cost of computing a BESH histogram is the cost of the updates. Because there is an upper limit and a lower limit on the size of each bucket, the total number of leaf buckets is less than M, where M = 1/min-thresh. For each new record, in the worst case, all the existing buckets may have to be scanned to check if there are any buckets to be merged. For this reason the cost of computing a BESH histogram on N records is O(MN).
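As an illustration of this cross-stream merge, the sketch below operates on simple (lo, hi, count) buckets. The names are hypothetical, fractional counts stand in for the proportional shares computed in Step 2, and the final BESH Merge pass is only noted in a comment.

    #include <algorithm>
    #include <vector>

    struct Bucket { double lo, hi, count; };

    std::vector<Bucket> merge_streams(const std::vector<std::vector<Bucket>>& subs)
    {
        // Step 1: collect every bucket boundary into one sorted list.
        std::vector<double> bounds;
        for (const auto& h : subs)
            for (const auto& b : h) { bounds.push_back(b.lo); bounds.push_back(b.hi); }
        std::sort(bounds.begin(), bounds.end());
        bounds.erase(std::unique(bounds.begin(), bounds.end()), bounds.end());

        // Step 2: one new bucket per interval; each overlapping source bucket
        // contributes in proportion to the overlap, assuming its values are
        // evenly distributed.
        std::vector<Bucket> out;
        for (size_t i = 0; i + 1 < bounds.size(); ++i) {
            Bucket nb{bounds[i], bounds[i + 1], 0.0};
            for (const auto& h : subs)
                for (const auto& b : h) {
                    double ov = std::min(nb.hi, b.hi) - std::max(nb.lo, b.lo);
                    if (ov > 0 && b.hi > b.lo)
                        nb.count += b.count * ov / (b.hi - b.lo);
                }
            out.push_back(nb);
        }
        return out;   // Step 3: the BESH Merge would now coalesce small buckets.
    }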
5. EXPERIMENTAL RESULTS
For each of our demonstrations, we had two official time slots. We also tested our applications during the nighttime, especially the BESH algorithm.

US109 Experimental Results. The performance of US109 is recorded in Figures 4, 6 and 7, which were taken from the real time display during one of our demos. Figure 4 shows the real time dynamic histogram on the four data streams. Figure 6 shows the aggregate throughput of the streaming data mining application. The average speed is around 8 Gb/s, with a peak speed of 14 Gb/s. The per-flow throughput is shown in Figure 7. Each flow realized an average throughput of 2 Gb/s.

Figure 4. Real Time Histogram on High Speed Data Streams (US109). This figure is a snapshot of the histogram on the four real time web traffic streams used in the US109 demo.

Figure 5 shows the counterpart histogram obtained by a standard histogram algorithm (i.e., without any of the limitations imposed by the streaming model) using the same buckets as those in Figure 4. The two figures are almost identical, and, at least for this data, there is very little loss in accuracy when using the best effort BESH algorithm.

Figure 5. This figure contains a histogram computed from the same data sets that were used for US109, but computed as usual with a standard histogram algorithm. The same buckets were used as for the streaming BESH algorithm in US109.

Figure 6. Aggregate Throughput of Streaming Data Mining (US109). This figure shows the aggregate throughput of computing histograms over four parallel flows during one official US109 demo slot.

Figure 7. Per Flow Throughput of Streaming Data Mining (US109). This figure shows the per flow throughput of computing histograms on four parallel flows during one official US109 demo slot.

As we mentioned previously, after iGrid 2005 we applied the BESH algorithm to the SDSS data between Chicago and Tokyo, which were connected via a 10 Gb/s link. The data was transferred from Chicago to Tokyo in one stream. The SDSS histogram and the data transfer and processing speed are shown in Figure 8 and Figure 9, respectively. Figure 9 shows that an average throughput of 3 Gb/s was reached.

Figure 8. Analysis of SDSS data using BESH. We transferred SDSS data from Chicago to Tokyo via a 10 Gb/s link and used BESH to analyze the distribution of the brightness of the stars. This figure shows the brightness histogram.

Figure 9. Analysis of SDSS data using BESH. We transferred SDSS data from Chicago to Tokyo via a 10 Gb/s link and used BESH to analyze the distribution of the brightness of the stars. This figure shows the data transfer and processing throughput.

US121 Experimental Results. During the US121 demo, we transferred the entire BESTDR3 release of the SDSS data from the iGrid floor to nodes in Daejeon (Korea), Tokyo (Japan), and Chicago. The results are reported in Table 1. As mentioned above, the data consisted of 64 compressed files, each about 12.463 GB, comprising 797 GB of compressed data in total. When uncompressed, the data was about 1.5 TB.

For example, we transferred this data from San Diego to Daejeon in approximately 2.5 hours. The average transfer speed was 1027 Mb/s and the peak speed was over 1200 Mb/s. This was the first time that an astronomy data set of this size was transferred from disk to disk at this speed across the Pacific. With conventional networks and network protocols this transfer would not have been practical. A portion of the SDSS transfer throughput is shown in Figure 10. Note that in this demonstration the disk I/O speed was one of the major bottlenecks.

Table 1. This table summarizes three transfers of the Sloan Digital Sky Survey (SDSS) Release 3 data. The data consisted of 64 files, each about 12.463 GB in size, comprising about 797 GB in total. All results are reported in Mb/s. The mean, median, standard deviation, minimum and maximum are computed from the 64 different transfers.

    Transfer from iGrid to | Mean | Median | Standard Deviation | Min | Max
    Chicago                |  653 |    712 |                255 | 128 | 1008
    KISTI                  | 1027 |   1160 |                229 | 312 | 1280
    Tokyo                  |  398 |    416 |                 96 |  88 |  448

Figure 10. Throughput of SDSS data transfer from San Diego to KISTI, Korea. This snapshot illustrates the disk-disk data transfer throughput (Mb/s) over time (seconds).
6. RELATED WORK
Moving large data sets over high-speed wide area networks has been recognized as a challenging task for many years. During iGrid 2002, various groups demonstrated prototypes of several different tools for high performance data transport [2, 3, 9, 16, 18, 21]. Since then, various new data transport protocols and related congestion control algorithms [8, 10, 12] have been designed and developed. Comparison between different protocols is now commonly regarded as a complicated topic, as each protocol has both advantages and disadvantages and no single protocol has proved superior [15].

Since 1999, we have continued to develop a high performance data transport protocol based upon UDP. The first version was called SABUL [14]. SABUL used TCP as a control channel and was demonstrated at iGrid 2002. The next version [12] was implemented completely over UDP in order to gain efficiency. The current version [10] is called Composible-UDT and was used at iGrid 2005. One of the advantages of UDT is that it is easy to deploy, since it operates at the application level and does not require changing the kernel.

Perhaps the most widely deployed tool for bulk data transport is GridFTP [1], which uses parallel TCP to transfer data. In contrast, UDT uses UDP to transport the data and adds reliability and congestion control. We note that an upcoming release of GridFTP is expected to be integrated with a UDT driver, enabling GridFTP to transfer data using UDT.

Tools for high performance data transport that have been widely adopted have tended to provide a more convenient user interface than that provided by a raw socket API. For this reason, UDT-Gateway provides an HTTP interface and hides the details of the UDT protocol.

We turn next to related work involving high volume data streams. Although streaming data mining is an area of active research, most of the work focuses on sensor networks and traditional Internet environments, where the data transfer speed is much lower than what we saw during iGrid 2005. Various approaches have been proposed for histogram computation on streaming data [5, 6, 17, 20]. These methods basically fall into two classes. One is to use different strategies to dynamically split and merge the buckets. The other is to construct a summary structure on the data stream and build histograms from the summary structure. Our method belongs to the first class. As far as we are aware, our implementation of the BESH algorithm over UDT is the first time that histograms have been computed on streaming data at the speeds seen at iGrid 2005. More detailed analysis of histograms and streaming data processing is beyond the scope of this paper. Some general streaming data processing issues are discussed in [4, 7].

7. CONCLUSIONS
In this paper, we have described two demonstrations at iGrid 2005 that use data transport middleware and data mining middleware tools that we have developed.

For the first demonstration, we used the UDT-Gateway file transfer utility to transfer astronomical data from the iGrid 2005 conference to Korea. We transferred over 797 GB of data at a mean rate of 1027 Mb/s. This was the first time that we are aware of that astronomical data of this size has been transported across the Pacific.

For the second demonstration, we computed histograms on four high volume data flows that were streamed from Chicago, Korea, and Japan to the iGrid conference. We used an algorithm we designed and implemented called BESH. The average processing rate was about 8 Gb/s, with a peak speed of 14 Gb/s. This is one of the highest rates of which we are aware at which histograms have been computed on data distributed around the world.

Both applications were built over Composible-UDT [10], a recent implementation of the UDT protocol that is composible in the sense that different congestion control algorithms can be easily implemented.
Finally, both of these demonstrations show the practicality of building useful, distributed data intensive applications using UDT-enabled middleware.

8. ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation under grant ANI 9977868, the Department of Energy under grant DE-FG02-04ER25639, and the U.S. Army Pantheon Project.

9. REFERENCES
[1] W. Allcock, J. Bester, J. Bresnahan, A. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnel and S. Tuecke. Data Management and Transfer in High Performance Computational Grid Environments. Parallel Computing, 2001.
[2] W. Allcock, J. Bresnahan, J. Bunn, S. Hegde, J. Insley, R. Kettimuthu, H. Newman, S. Ravot, T. Rimovsky, C. Steenberg and L. Winkler. Grid-enabled particle physics event analysis: experiences using a 10 Gb, high-latency network for a high-energy physics application. Future Generation Computer Systems, 19(6):983-997, August 2003.
[3] A. Antony, J. Blom, C. de Laat, J. Lee and W. Sjouw. Microscopic Examination of TCP Flows over Transatlantic Links. Future Generation Computer Systems, 19(6), 2003.
[4] B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and Issues in Data Stream Systems. In Proc. of the 2002 ACM Symp. on Principles of Database Systems (PODS 2002), June 2002.
[5] D. Donjerkovic, Y. E. Ioannidis and R. Ramakrishnan. Dynamic Histograms: Capturing Evolving Data Sets. In Proceedings of the 16th International Conference on Data Engineering, San Diego, California, USA, February 2000.
[6] F. Furfaro, G. M. Mazzeo, D. Saccà and C. Sirangelo. Hierarchical binary histograms for summarizing multi-dimensional data. SAC 2005, pages 598-603.
[7] M. M. Gaber, A. B. Zaslavsky and S. Krishnaswamy. Mining data streams: a review. SIGMOD Record, 34(2):18-26, 2005.
[8] M. Goutelle (editor), Y. Gu, E. He (editor), S. Hegde, R. Kettimuthu, J. Leigh, P. Vicat-Blanc Primet, M. Welzl (editor) and C. Xiong. A Survey of Transport Protocols other than Standard TCP. Global Grid Forum Data Transport Research Group, February 2004.
[9] R. L. Grossman, Y. Gu, D. Hanley, X. Hong, D. Lillethun, J. Levera, J. Mambretti, M. Mazzucco and J. Weinberger. Experimental Studies Using Photonic Data Services at iGrid 2002. Journal of Future Computer Systems, 19(6):945-955, 2003.
[10] Y. Gu and R. Grossman. Supporting Configurable Congestion Control in Data Transport Services. SC 2005, Seattle, WA, USA, November 2005.
[11] Y. Gu and R. Grossman. Optimizing UDP-based Protocol Implementation. PFLDNet 2005, Lyon, France, February 2005.
[12] Y. Gu, X. Hong and R. Grossman. Experiences in Design and Implementation of a High Performance Transport Protocol. SC 2004, Pittsburgh, PA, USA, November 6-12, 2004.
[13] Y. Gu, X. Hong and R. Grossman. An Analysis of AIMD Algorithms with Decreasing Increases. First Workshop on Networks for Grid Applications (Gridnets 2004), San Jose, CA, USA, October 29, 2004.
[14] Y. Gu and R. L. Grossman. SABUL: A Transport Protocol for Grid Computing. Journal of Grid Computing, 1(4):377-386, 2003.
[15] S. Ha, Y. Kim, L. Le, I. Rhee and L. Xu. A Step toward Realistic Performance Evaluation of High-Speed TCP Variants. PFLDNet 2006, Nara, Japan.
[16] C. de Laat, E. Radius and S. Wallace. The rationale of current optical networking initiatives. Future Generation Computer Systems, 19(6):999-1008, August 2003.
[17] G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB 2002 (28th VLDB), pages 346-357, August 2002.
[18] J. Mambretti, J. Weinberger, J. Chen, E. Bacon, F. Yeh, D. Lillethun, R. Grossman, Y. Gu and M. Mazzucco. The Photonic TeraStream: Enabling Next Generation Applications Through Intelligent Optical Networking at iGrid 2002. Journal of Future Computer Systems, 19(6):897-908, 2003.
[19] A. Szalay, J. Gray, A. Thakar, P. Kuntz, T. Malik, J. Raddick, C. Stoughton and J. Vandenberg. The SDSS SkyServer: Public Access to the Sloan Digital Sky Server Data. ACM SIGMOD 2002.
[20] N. Thaper, S. Guha, P. Indyk and N. Koudas. Dynamic multidimensional histograms. In Proc. SIGMOD, 2002.
[21] C. Zhang, J. Leigh, T. A. DeFanti, M. Mazzucco and R. Grossman. TeraScope: Distributed Visual Data Mining of Terascale Data Sets Over Photonic Networks. Journal of Future Computer Systems, 19(6):935-943, 2003.
[22] Teraflow Testbed, http://www.teraflowtestbed.net.
[23] UDT: UDP-based Data Transfer Protocol, http://udt.sf.net.