Massive Datasets and Data Streams
RadixZip: Linear Time Compression of Token Streams
by B D Vo and G S Manku
VLDB 2007 (33rd International Conference on Very Large Data Bases), p 1162-1172, Sep 2007
Abstract: RadixZip is a block compression technique for token streams. It introduces RadixZip Transform, a linear time algorithm that rearranges bytes using a technique inspired by radix sorting. For appropriate data, RadixZip Transform is analogous to Burrows-Wheeler Transform used in bzip2, but is both simpler in operation and more effective in compression. In addition, RadixZip Transform can take advantage of correlations between token streams with no computational overhead. Experiments over practical data show that for common token streams, RadixZip is superior to bzip2.
Detecting Near-Duplicates for Web Crawling
by G S Manku, A Jain and A D Sarma
WWW 2007 (16th International World Wide Web Conference), p 141-149, May 2007
Abstract: Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not.
In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
Approximate Counts and Quantiles over Sliding Windows
by A Arasu and G S Manku
PODS 2004 (23rd ACM Symposium on Principles of Database Systems), p 286-296, June 2004
Abstract: We consider the problem of maintaining ε-approximate counts and quantiles over a stream sliding window using limited space. We consider two types of sliding windows depending on whether the number of elements N in the window is fixed (fixed-size sliding window) or variable (variable-size sliding window). In a fixed-size sliding window, both the ends of the window slide synchronously over the stream. In a variable-size sliding window, an adversary slides the window ends independently, and therefore has the ability to vary the number of elements N in the window.
We present various deterministic and randomized algorithms for approximate counts and quantiles. All of our algorithms require O(1/ε polylog(1/ε, N)) space. For quantiles, this space requirement is an improvement over the previous best bound of O(1/ε² polylog(1/ε, N)). We believe that no previous work on space-efficient approximate counts over sliding windows exists.
Query Processing, Resource Management and Approximation in a Data Stream Management System
by R Motwani, J Widom, A Arasu, B Babcock, S Babu, M Datar, G S Manku, C Olston, J Rosenstein and R Varma
CIDR 2003 (1st Biennial Conference On Innovative Data Systems Research), p 245-254, Jan 2003
Abstract: This paper describes ongoing work developing the
Stanford Stream Data Manager (
STREAM), a system for executing continuous queries over multiple continuous data streams. The STREAM system supports a declarative query language and it copes with high data rates and query workloads by providing approximate answers when resources are limited. This paper describes specific contributions made so far and enumerates our next steps in developing a general Data Stream Management System.
Approximate Frequency Counts over Data Streams
by G S Manku and R Motwani
VLDB 2002 (28th VLDB), p 346-357, August 2002
Abstract: Consider a stream of elements. This paper proposes two simple algorithms for identifying all elements whose relative frequency (in the portion of the stream seen so far) exceeds a certain threshold (say, 0.01%). Both algorithms require provably small memory footprints. Although the output is approximate, the error is guaranteed not to exceed a user-specified parameter. The algorithms can easily be deployed for streams of singleton items like those found in IP network monitoring.
The latter half of the paper considers streams of variable-sized sets of items, e.g., sequence of market-basket transactions at a retail store. One of the above algorithms is adapted to compute frequent itemsets for such streams in a single pass. The implementation is currently the fastest program for computing frequent itemsets and association rules.
Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets
by G S Manku, S Rajagopalan and B G Lindsay
Abstract: In the followup paper in SIGMOD-99, we address two issues left open in our earlier paper. First, all known and space efficient algorithms for approximate quantile finding require advance knowledge of the length of the input sequence. Many important database applications employing quantiles cannot provide this information. In this paper, we present a novel non-uniform random sampling scheme and an extension of our framework. Together, they form the basis of a new algorithm which computes approximate quantiles without knowing the input sequence length. Second, if the desired quantile is an extreme value (e.g., within the top 1% of the elements), the space requirements of currently known algorithms are overly pessimistic. We provide a simple algorithm which estimates extreme values using less space than required by the earlier more general technique for computing all quantiles. Our principal observation here is that random sampling is quantifiably better when estimating extreme values than is the case with the median.
Approximate Medians and other Quantiles in One Pass and with Limited Memory
by G S Manku, S Rajagopalan and B G Lindsay
Abstract: We present new algorithms for computing approximate quantiles of large datasets in a single pass. The approximation guarantees are explicit and apply without regard to the value distribution or the arrival distributions of the dataset. The main memory requirements are smaller than those reported earlier by an order of magnitude. We also discuss methods that couple the approximation algorithms with random sampling to further reduce memory requirements. With sampling, the approximation guarantees are explicit but probabilistic, i.e., they apply with respect to a (user controlled) confidence parameter. We present the algorithms, their theoretical analysis and simulation results.
Peer to Peer Systems -- Distributed Hash Tables
Brief Announcement: Papillon: Greedy Routing in Rings
by I Abraham, D Malkhi and G S Manku
DISC 2005 (19th International Symposium on Distributed Computing), p 514-515, September 2005
Abstract: We construct the first n-node degree-d ring-based network with worst-case GREEDY routes of length Θ(log n / log d) hops.
Decentralized Algorithms using Both Local and Random Probes for P2P Load Balancing
by K Kenthapadi and G S Manku
SPAA 2005 (17th ACM Symposium on Parallelism in Algorithms and Architectures), p 135-144, July 2005
Abstract: We study randomized algorithms for placing a sequence of n nodes on a circle with unit perimeter. Nodes divide the circle into disjoint arcs. We desire that a newly-arrived node (which is oblivious of its index in the sequence) choose its position on the circle by learning the positions of as few existing nodes as possible. At the same time, we desire that that the variation in arc-lengths be small. To this end, we propose a new algorithm that works as follows: The k-th node chooses r random points on the circle, inspects the sizes of v arcs in the vicinity of each random point, and places itself at the mid-point of the largest arc encountered. We show that for any combination of r and v satisfying rv ≥ c log k, where c is a small constant, the ratio of the largest to the smallest arc-length is at most eight w.h.p., for an arbitrarily long sequence of n nodes. This strategy of node placement underlies a novel decentralized load-balancing algorithm that we propose for Distributed Hash Tables (DHTs) in peer-to-peer environments.
Underlying the analysis of our algorithm is Structured Coupon Collection over n/b disjoint cliques with b nodes per clique, for any n, b ≥ 1. Nodes are initially uncovered. At each step, we choose d nodes independently and uniformly at random. If all the nodes in the corresponding cliques are covered, we do nothing. Otherwise, from among the chosen cliques with at least one uncovered node, we select one at random and cover an uncovered node within that clique. We show that as long as bd ≥ c log n, O(n) steps are sufficient to cover all nodes w.h.p. and each of the first Ω(n) steps succeeds in covering a node w.h.p. These results are then utilized to analyze a stochastic process for growing binary trees that are highly balanced -- the leaves of the tree belong to at most four different levels with high probability.
Balanced Binary Trees for ID Management and Load Balance in Distributed Hash Tables
by G S Manku
PODC 2004 (23rd ACM Symposium on Principles of Distributed Computing), p 197-205, July 2004
Abstract: We present a low-cost, decentralized algorithm for ID management in distributed hash tables (DHTs) managed by a dynamic set of hosts. Each host is assigned an ID in the unit interval [0, 1). At any time, the set of IDs splits the interval into disjoint partitions. Hosts do not possess global knowledge of other IDs in the system. The challenge then is to design an efficient decentralized algorithm that maintains roughly equi-sized partitions, in the face of arrivals, departures and changes in the average number of hosts.
Our ID management algorithm is the first to enjoy all of the following properties: (a) both arrivals and departures of hosts are handled, (b) departure of a host causes at most one existing host to change its ID, (c) the ratio of the largest to the smallest partition is at most 4, with high probability, and (d) the expected cost per arrival/departure is Θ(R + log n) messages, where n denotes the current number of participants, and R denotes the cost of routing one message in the DHT. In fact, our algorithm is independent of the topology of the overlay network used for routing.
Variations of our algorithm diminish the ratio between the largest and the smallest partition to (1+ε), for any ε > 0, albeit at the cost of re-assigning the IDs of O(1/ε) existing hosts per arrival/departure. Ours is the first algorithm that allows such fine-tuning.
Finally, our ID management algorithm enables (a) estimation of the total number of hosts in the system by making only local measurements, and (b) emulation of a variety of deterministic and randomized families of routing topologies, in a straightforward fashion. Among these families are several networks that require O(log n / log k) routing hops in an n-node network with k links per node.
Know thy Neighbor's Neighbor: the Power of Lookahead in Randomized P2P Networks
by G S Manku, M Naor and U Wieder
STOC 2004 (36th ACM Symposium on Theory of Computing), p 54-63, June 2004
Abstract: Several peer-to-peer networks are based upon randomized graph topologies that permit efficient GREEDY routing, e.g., randomized hypercubes, randomized Chord, skip-graphs and constructions based upon small-world percolation networks. In each of these networks, a node has out-degree Θ(log n), where n denotes the total number of nodes, and GREEDY routing is known to take O(log n) hops on average. We establish lower-bounds for GREEDY routing for these networks, and analyze Neighbor-of-Neighbor (NoN)-GREEDY routing. The idea behind NoN, as the name suggests, is to take a neighbor's neighbors into account for making better routing decisions.
The following picture emerges: Deterministic routing networks like hypercubes and Chord have diameter Θ(log n) and GREEDY routing is optimal. Randomized routing networks like randomized hypercubes, randomized Chord, and constructions based on small-world percolation networks, have diameter Θ(log n / log log n) with high probability. The expected diameter of Skip graphs is also Θ(log n / log log n). In all of these networks, GREEDY routing fails to find short routes, requiring Ω(log n) hops with high probability. Surprisingly, the NoN-GREEDY routing algorithm is able to diminish route-lengths to Θ(log n / log log n) hops, which is asymptotically optimal.
Optimal Routing in Chord
by P Ganesan and G S Manku
SODA 2004 (15th Annual ACM-SIAM Symposium on Discrete Algorithms), p 169-178, January 2004
Abstract: We propose optimal routing algorithms for Chord, a popular topology for routing in peer-to-peer networks. Chord is an undirected graph on 2^b nodes arranged in a circle, with edges between pairs of nodes that are 2^k positions apart for any k ≥ 0. The standard Chord routing algorithm uses edges in only one direction. Our algorithms exploit the bidirectionality of edges for optimality. At the heart of the new protocols lie algorithms for writing a positive integer d as the difference of two non-negative integers d′ and d″ such that the total number of 1-bits in the binary representation of d' and d″ is minimized. Given that Chord is a variant of the hypercube, the optimal routes possess a surprising combinatorial structure.
Routing Networks for Distributed Hash Tables
by G S Manku
PODC 2003 (22nd ACM Symposium on Principles of Distributed Computing), p 133-142, June 2003
Abstract: Routing topologies for distributed hashing in peer-to-peer networks are classified into two categories: deterministic and randomized. A general technique for constructing deterministic routing topologies is presented. Using this technique, classical parallel interconnection networks can be adapted to handle the dynamic nature of participants in peer-to-peer networks. A unified picture of randomized routing topologies is also presented. Two new protocols are described which improve average latency as a function of out-degree. One of the protocols can be shown to be optimal with high probability. Finally, routing networks for distributed hashing are revisited from a systems perspective and several open design problems are listed.
Symphony: Distributed Hashing in a Small World
by G S Manku, M Bawa and P Raghavan
USITS 2003 (4th USENIX Symposium on Internet Technologies and Systems), p 127-140, March 2003
Abstract: Symphony is a protocol for maintaining distributed hash tables in a wide area network. The key idea is to arrange all participants along a ring and equip them with long distance contacts drawn from a family of harmonic distributions. Symphony is inspired by Kleinberg's Small World construction. We extend Kleinberg's theoretical result by showing that with k links per node, greedy routing can achieve an average latency of O((log² n) / k) hops. The paper builds upon this basic idea to engineer a practical protocol that is scalable, flexible, stable in the presence of frequent updates and offers small average latency with only a handful of long distance links per node. The cost of updates when hosts join and leave is small.
SETS: Search Enhanced by Topic Segmentation
by M Bawa, G S Manku, and P Raghavan
SIGIR 2003 (26th International ACM SIGIR 2003), p 306-313, July 2003
Abstract: We present SETS, an architecture for efficient search in peer-to-peer networks, building upon ideas drawn from machine learning and social network theory. The key idea is to arrange participating sites in a topic-segmented overlay topology in which most connections are short-distance, connecting pairs of sites with similar content. Topically focused sets of sites are then joined together into a single network by long-distance links. Queries are matched and routed to only the topically closest regions. We discuss a variety of design issues and tradeoffs that an implementor of SETS would face. We show that SETS is efficient in network traffic and query processing load.
Miscellaneous Subjects
A Loop-free Gray Code for Minimal Signed-Binary Representations
by G S Manku and J Sawada
ESA 2005 (13th Annual European Symposium on Algorithms), p 438-447, Oct 2005
Abstract: A string over an alphabet {-1, 0, 1} is said to be a minimal signed-binary representation of an integer n if for and the number of non-zero digits is minimal. We present a loopless (and hence a Gray code) algorithm for generating all minimal signed-binary representations of a given integer n.
Self-Similarity in File-System Traffic
by S D Gribble, G S Manku, D Roselli, E A Brewer, T J Gibson and E L Miller
Abstract: We demonstrate that high-level file system events exhibit self-similar behaviour, but only for short-term time scales of under a day. We do so through the analysis of four sets of traces that span time scales of milliseconds through months, and that differ in the trace collection method, the filesystems being traced, and the chronological times of the tracing. Two sets of detailed, short-term file system trace data are analyzed; both are shown to have self-similar like behaviour, with consistent Hurst parameters (a measure of self-similarity) for all file system traffic as well as individual classes of file system events. Long-term file system trace data is then analyzed, and we discover that the traces' high variability and self-similar behaviour does not persist across time scales of days, weeks, and months. Using the short-term trace data, we show that sources of file system traffic exhibit ON/OFF source behaviour, which is characterized by highly variably lengthed bursts of activity, followed by similarly variably lengthed periods of inactivity. This ON/OFF behaviour is used to motivate a simple technique for synthesizing a stream of events that exhibit the same self-similar short-term behaviour as was observed in the file-system traces.
Structural Symmetry and Model Checking
by G S Manku, R Hojati and R K Brayton
CAV 1998 (10th International Conference on Computer-Aided Verification), p p 159-171, July 1998
Abstract: We present a fully automatic framework for identifying symmetries in structural descriptions of digital circuits and CTL* formulas and using them in a model checker. The set of sub-formulas of a formula is partitioned into equivalence classes so that truth values for only one sub-formula in any class need be evaluated for model checking. Structural symmetries in net-list descriptions of digital circuits and CTL* formulas are formally defined and their relationship with Kripke structures is described. A technique for automatic identification of structural symmetries is described that requires computation of the automorphism group of a suitable directed labeled graph. A novel fast algorithm for this problem is presented. Finally, experimental results are reported for BLIF-MV net-lists derived from Verilog.
Object Tracking using Affine Structure for Point Correspondences
by G S Manku, P Jain, A Aggarwal, L Kumar and S Banerjee
CVPR 1997 (IEEE Conf. for Computer Vision and Pattern Recognition), p 704-709, June 1997
Abstract: A new object tracking algorithm based on affine structure has been developed and it is shown that the performance is better than that of a Kalman filter based correlation tracker. The algorithm is fast, reliable, viewpoint invariant, and insensitiver to occlusion and/or individual corner disappearance or reappearance. Detailed experimental analysis on a long real image sequence is also presented.
A New Voting Based Hardware Data Prefetch Scheme
by G S Manku, M R Prasad and D A Patterson
HiPC 1997 (Fourth International Conference on High Performance Computing), p 100-105, December 1997
Abstract: The dramatic increase in the processor-memory gap in recent years has led to the development of techniques like data prefetching that hide the latency of cache misses. Two such hardware techniques are the stream buffer and the stride predictor. These two have dissimilar architectures, are effective for different kinds of memory access patterns and require different amounts of extra memory bandwidth. We compare the performance of these two techniques and propose a scheme that unifies them. Simulation studies on six benchmark programs confirms that the combined scheme is more effective in reducing the average memory access time (AMAT) than either of the two individually.
A Linear Time Algorithm for the Bottleneck Biconnected Spanning Subgraph Problem
by G S Manku
IPL 1996 (Information Processing Letters), p 1-7, July 1996
Abstract: A linear time algorithm for the Bottleneck Biconnected Spanning Subgraph problem is presented. This improves the hitherto best-known solution, which has a running time of O(m + n log n), where m and n are the number of edges and vertices of the graph. Linear time is acheived by first mapping the problem onto a special kind of graphs (that we call B_Graphs), which are shown to have nice structural properties. Then, a binary search on the weights of edges is done. The main idea is to use the properties of the B_Graph to shrink it by removing redundant edges and vertices in each iteration, thereby reducing the size of the graph by at least a factor of two.
Circuit Partitioning with Partial Order for Mixed Simulation Emulation Environment
by G S Manku, A Kumar and S Kumar
RSP 1995 (Sixth Intl. Conf. on Rapid System Prototyping), p 201-207, June 1995
Abstract: A low-cost hybrid simulator for VLSI circuits has been under development at IIT Delhi. The simulator uses a Reconfigurable System (RS) consisting of a limited number of FPGAs for hardware emulation and blends the ideas of hardware emulation with conventional software simulation. A crucial preparatory step is to partition a given circuit into as few parts as possible. The parts are then downloaded onto the RS one by one and emulated in stand alone mode or in conjunction with software simulator. The hybrid simulation environment poses some unique requirements on the partitioner. This paper presents an efficient partitioning algorithm for this purpose. A study of performance of the algorithm on 32 benchmark circuits for various I/O and size constraints of FPGAs has been carried out and good results have been obtained.
Patents
System and Method for Searching Peer-to-Peer Computer Networks by Selecting a Computer Based on At Least a Number of Files Shared by the Computer
by W J Labio, G T Nguyen, W W Liu, G S Manku (assignee: Napster Inc.)
Abstract: A method and system for intelligently directing a search of a peer-to-peer network, in which a user performing a search is assisted in choosing a host which is likely to return fast, favorable results to the user. A host monitor monitors the peer-to-peer network and collects data on various characteristics of the hosts which make up the network. Thereafter, a host selector ranks the hosts using the data, and passes this information to the user. The user then selects one or more of the highly-ranked hosts as an entry point into the network. Additionally, a cache may collect a list of hosts based on the content on the hosts. In this way, a user may choose to connect to a host which is known to contain information relevant to the user's search. The host selector may be used to select from among the hosts listed in the cache.
Single Pass Space Efficient System and Method for Generating an Approximate Quantile in a Data Set Having an Unknown Size
by B G Lindsay, G S Manku, S Rajagopalan (assignee: IBM Corp.)
Abstract: A space-efficient system and method for generating an approximate φ-quantile data element of a data set in a single pass over the data set, without a priori knowledge of the size of the data set. The approximate φ-quantile is guaranteed to lie within a user-specified approximation error ε of the true quantile being sought with a probability of at least 1-δ, with δ being a user-defined probability of failure. B buffers, each having a capacity of k elements, initially are filled with elements from the data set, with the values of b and k depending on approximation error e and the probability δ. The buffers are then collapsed into an output buffer, with the remaining buffers then being refilled with elements, collapsed (along with the previous output buffer), and so on until the entire data set has been processed and a single output remains. The element of the output corresponding to the approximate quantile is then output as the approximate quantile. In later iterations (when the height of the tree is at least equal to a predetermined height that depends on δ and ε), the data is sampled non-uniformly to populate the buffers to render the desired performance. Parallel processors can be used, with the final output buffers of the processors being sent to a collecting processor P0 as input buffers to the collecting processor P0.
Single Pass Space Efficient System and Method for Generating Approximate Quantiles Satisfying an Apriori User-Defined Approximation Error
by B G Lindsay, G S Manku, S Rajagopalan (assignee: IBM Corp.)
Abstract: A system and method for finding an ε-approximate φ-quantile data element of a data set with N data elements in a single pass over the data set. The ε-approximate φ-quantile data element is guaranteed to lie within a user-specified approximation error ε of a true φ-quantile data element being sought. B buffers, each having a capacity of k elements, initially are filled with sorted data elements from the data set, with the values of b and k depending on ε and N. The buffers are then collapsed into an output buffer, with the remaining buffers then being refilled with data elements, collapsed (along with the previous output buffer), and so on until the entire data set has been processed and a single output buffer remains. A data element of the output buffer corresponding to the ε-approximate φ-quantile is then output as the approximate φ-quantile data element. If desired, the system and method can be practiced with sampling to even further reduce the amount of space required to find a desired ε-approximate φ-quantile data element.
Theses
Dipsea: A Modular Distributed Hash Table
by G S Manku
Abstract: Not available yet.
Structural Symmetries and Model Checking
by G S Manku
M. S. Thesis (U C Berkeley, Tech Report UCB/ERL M97/92), p 1-76, December 1997
Abstract: Not available yet.
Object Tracking using Affine Multiple Views Geomtry
by G S Manku and H Nautiyal
B. Tech. Thesis (IIT Delhi (won the Best B.Tech. Project Award)), p 1-56, May 1995
Abstract: Not available yet.