Estimation of abritary subset sum with priority sampling
Suppose we have a stream of object and each stream has a non-negative weight, we want to maintain a generic sample of a certain limited size which can be use to estimate the total weight of arbitary subsets.
Many scheme to solve that problem has a disadvantage at controlling the size of the sample.
A good way to solve that problem is use a scheme proposed by Duffield, Lund, and Thorup; called priority sampling.
This article is base on Duffield, Lund and Thorup paper “Priority Sampling for Estimation of Arbitary Subset Sums”.
Index
Introduction
This article is focusing on sampling from a high volume stream of weighted items. The items arrive faster and in larger quantities than can be saved, so only a sample that can be saved can be stored efficiently. We want to create a generic sample of a certain limitedsize that we can later use to estimate the total weight of arbitrary subsets.
Applied to internet traffic analysis, the items could be record summarizing the flows of packets streaming by a router, with say, a hundred records to be sampled each hour. A subset could be flow records of a worm attack whose signature is only determined after sampling has taken place. The samples taken in the past allow us to trace the history of the attack even though the worm was unknown at the time of sampling.
Finding a Subset
The point for selecting subsets is that an item besides the weight has any associated information, and selecting an item may be based with the associated information.
Internet trafic analysis
Internet routers transfer information about transmition of data that pass through. This transmission is called flows which could be a ftp transfer, email, or some other related data. A flow record is exported with statistics such as summary information which contain application type, source and destination IP addresses, and the number of packets and total bytes in the flow. Now, we can choose the byte size as the weight.
We might want to sample the flows such that we can aswer like how many bytes of traffic came from a given customer or how much traffic was generated by a certain application. If we knew which selection in advance of measurement were of interest, we could have a counter for each selection and increment these as flows passed by. The challange for that question is we must not be constraint to selection known in advance of measurement.
External information in the selection
Supposed a department store sample of all their sales which include information such as item, date, time, and price. Based on the sample we might as a question How many days of rain before we get a boom in the sale of rain gear?. The asnwer is we can find a database with historical rain information and then looked up at each sample sales record with rain gear, and check how many days it had rained before the sale. From this example, we can concluded that the selection can be based on external information that not even imagined relevant at the time when measurements are made.
Priority Sampling
Suppose we are give n items with weight w. The goal is to compute all the subset sum queries. In some situations, when n is huge, we can’t really store all the weights from the item. The way to compute that is take a sample from the universe and then calculated the weight of the sample.
There’s many techniques to take a sample; from randomly picked up sample - missed item with high weight to weighted sample - always give high weight items. Each of the schemes show a vulnarablity but there is an elegant scheme proposed by Duffield, Lund, and Thorup which called priority sampling.
Image1. Priority sampling of size 3 from a set of 10 weighted item
The priority sampling scheme as can seen bellow :
- For each item, generate a random number ai between 0 and 1. Assign a priority i to i of value wi/ai where qi is unique from others.
- S is the set of items with the k highest prorities.
- Take the top k elements (with respect to priority), and let the priority of the (k+1)th element be t. If k >= n set t = 0.
- If i is subset of S, set w’i = max(wi, t), else set w’i = 0.
Lemma 1
Lemma 2
Lemma 3
Lemma 4
References
- Nick Duffield, Carsten Lund, and Mikkel Thorup. Priority sampling for estimation of arbitrary subset sums. Journal of the ACM (JACM), 2007.