path: root/draft-summermatter-set-union.xml
author: Christian Grothoff <christian@grothoff.org>, 2021-06-12 17:19:59 +0200
committer: Christian Grothoff <christian@grothoff.org>, 2021-06-12 17:19:59 +0200
commit: 2e15bfdc45ea02d516126f2947e8d3ffb409d084
tree: fce7737425e9e5ac3039afb0694819450a502173 /draft-summermatter-set-union.xml
parent: da2d89045c2fc10ea393e4eb917c3dca480486bc
chapter 2
Diffstat (limited to 'draft-summermatter-set-union.xml')
-rw-r--r-- draft-summermatter-set-union.xml 114
1 files changed, 63 insertions, 51 deletions
diff --git a/draft-summermatter-set-union.xml b/draft-summermatter-set-union.xml
index 66fdfab..87793e6 100644
--- a/draft-summermatter-set-union.xml
+++ b/draft-summermatter-set-union.xml
@@ -174,16 +174,17 @@
174 <section anchor="background" numbered="true" toc="default"> 174 <section anchor="background" numbered="true" toc="default">
175 <name>Background</name> 175 <name>Background</name>
176 <section anchor="bf" numbered="true" toc="default"> 176 <section anchor="bf" numbered="true" toc="default">
177 <name>Bloom Filters</name> 177 <name>Bloom Filter</name>
178 <t> 178 <t>
179 A Bloom filter (BF) is a space-efficient datastructure to test if an element is part of a set of elements. 179 A Bloom filter (BF) is a space-efficient probabilistic
180 Elements are identified by an element ID. 180 datastructure to test if an element is part of a set of elements.
181 Since a BF is a probabilistic datastructure, it is possible to have false-positives: when asked 181 Elements are identified by an element ID.
182 if an element is in the set, the answer from a BF is either "no" or "maybe". 182 Since a BF is a probabilistic datastructure, it is possible to have false-positives: when asked
183 if an element is in the set, the answer from a BF is either "no" or "maybe".
183 </t> 184 </t>
184 <t> 185 <t>
185 A BF consists of L buckets. Every bucket is a binary value that can be either 0 or 1. All buckets are initialized 186 A BF consists of L buckets. Every bucket is a binary value that can be either 0 or 1. All buckets are initialized
186 to 0. A mapping function M is used to map each ID of each element from the set to a subset of k buckets. M is non-injective 187 to 0. A mapping function M is used to map each ID of each element from the set to a subset of k buckets. In the original proposal by Bloom, M is non-injective
187 and can thus map the same element multiple times to the same bucket. 188 and can thus map the same element multiple times to the same bucket.
188 The type of the mapping function can thus be described by the following mathematical notation: 189 The type of the mapping function can thus be described by the following mathematical notation:
189 </t> 190 </t>
@@ -211,13 +212,11 @@
211 To check if an element may be in the set, one tests if all buckets under the map M are set to 1. 212 To check if an element may be in the set, one tests if all buckets under the map M are set to 1.
212 </t> 213 </t>
213 <t> 214 <t>
214 Further in this document a bitstream output by the mapping function is represented by
215 a set of numeric values for example (0101) = (2,4).
216 In the BF the buckets are set to 1 if the corresponding bit in the bitstream is 1. 215 In the BF the buckets are set to 1 if the corresponding bit in the bitstream is 1.
217 If there is a collision and a bucket is already set to 1, the bucket stays 1. 216 If there is a collision and a bucket is already set to 1, the bucket stays at 1.
218 </t> 217 </t>
219 <t> 218 <t>
220 In the following example the element M(element) = (1,3) has been added: 219 In the following example the element e0 with M(e0) = {1,3} has been added:
221 </t> 220 </t>
222 <figure anchor="figure_bf_insert_0"> 221 <figure anchor="figure_bf_insert_0">
223 <artwork name="" type="" align="left" alt=""><![CDATA[ 222 <artwork name="" type="" align="left" alt=""><![CDATA[
@@ -228,8 +227,10 @@
228 ]]></artwork> 227 ]]></artwork>
229 </figure> 228 </figure>
230 <t> 229 <t>
231 It is easy to see that the M(element) = (0,3) could be in the BF below and M(element) = (0,2) cannot be 230 It is easy to see that an element e1 with M(e1) = {0,3}
232 in the BF below: 231 could have been added to the BF below, while an element e2
232 with M(e2) = {0,2} cannot be in the set represented by the
233 BF below:
233 </t> 234 </t>
234 235
235 <figure anchor="figure_bf_contains"> 236 <figure anchor="figure_bf_contains">
@@ -255,14 +256,18 @@
255 <section anchor="cbf" numbered="true" toc="default"> 256 <section anchor="cbf" numbered="true" toc="default">
256 <name>Counting Bloom Filter</name> 257 <name>Counting Bloom Filter</name>
257 <t> 258 <t>
258 A Counting Bloom Filter (CBF) is an extension of the <xref target="bf" format="title" />. In the CBF, buckets are 259 A Counting Bloom Filter (CBF) is a variation on the idea
259 unsigned numbers instead of binary values. This allows the removal of an element from the CBF. 260 of a <xref target="bf" format="title" />. With a CBF, buckets are
261 unsigned numbers instead of binary values.
262 This allows the removal of an element from the CBF.
260 </t> 263 </t>
261 <t> 264 <t>
262 Adding an element to the CBF is similar to the adding operation of the BF. However, instead of setting the bucket on hit to 1 the 265 Adding an element to the CBF is similar to the adding operation of the BF.
263 numeric value stored in the bucket is increased by 1. For example if two colliding elements M(element1) = (1,3) and 266 However, instead of setting the buckets to 1 the
264 M(element2) = (0,3) are added to the CBF, bucket 0 and 1 are set to 1 and bucket 3 (the colliding bucket) is set 267 numeric value stored in the bucket is increased by 1.
265 to 2: 268 For example, if two colliding elements M(e1) = {1,3} and
269 M(e2) = {0,3} are added to the CBF, bucket 0 and 1 are set
270 to 1 and bucket 3 (the colliding bucket) is set to 2:
266 </t> 271 </t>
267 <figure anchor="figure_cbf_insert_0"> 272 <figure anchor="figure_cbf_insert_0">
268 <artwork name="" type="" align="left" alt=""><![CDATA[ 273 <artwork name="" type="" align="left" alt=""><![CDATA[
@@ -273,13 +278,15 @@
273 ]]></artwork> 278 ]]></artwork>
274 </figure> 279 </figure>
275 <t> 280 <t>
276 The counter stored in the bucket is also called the order of the bucket. 281 The counter stored in the bucket is also called the order of the bucket.
277 </t> 282 </t>
278 <t> 283 <t>
279 To remove an element form the CBF the counters of all buckets the element is mapped to are decreased by 1. 284 To remove an element form the CBF the counters of all buckets
285 the element is mapped to are decreased by 1.
280 </t> 286 </t>
281 <t> 287 <t>
282 Removing M(element2) = (1,3) from the CBF above: 288 For example, removing M(e2) = {1,3} from the CBF above
289 results in:
283 </t> 290 </t>
284 <figure anchor="figure_cbf_remove_0"> 291 <figure anchor="figure_cbf_remove_0">
285 <artwork name="" type="" align="left" alt=""><![CDATA[ 292 <artwork name="" type="" align="left" alt=""><![CDATA[
@@ -290,15 +297,19 @@
290 ]]></artwork> 297 ]]></artwork>
291 </figure> 298 </figure>
292 <t> 299 <t>
293 In practice, the number of bits available for the counters is usually finite. For example, given a 4-bit 300 In practice, the number of bits available for the counters
294 counter, a CBF bucket would overflow 16 elements are mapped to the same bucket. To efficiently 301 is often finite. For example, given a 4-bit
295 handle this case, the maximum value (15 in our example) is considered to represent "infinity". Once the 302 counter, a CBF bucket would overflow 16 elements are mapped
303 to the same bucket. To handle this case, the maximum value
304 (15 in our example) is considered to represent "infinity". Once the
296 order of a bucket reaches "infinity", it is no longer incremented or decremented. 305 order of a bucket reaches "infinity", it is no longer incremented or decremented.
297 </t> 306 </t>
298 <t> 307 <t>
299 The parameters L and k and the number of bits allocated to the counters depend on the set size. 308 The parameters L and k and the number of bits allocated to the counters
300 An IBF will degenerate when subjected to insert and remove iterations of different elements, and eventually all 309 SHOULD depend on the set size.
301 buckets will reach "infinity". The speed of the degradation will depend on the choice of L and k in 310 A CBF will degenerate when subjected to insert and remove iterations of
311 different elements, and eventually all buckets will reach "infinity".
312 The speed of the degradation will depend on the choice of L and k in
302 relation to the number of elements stored in the IBF. 313 relation to the number of elements stored in the IBF.
303 </t> 314 </t>
304 </section> 315 </section>
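The Counting Bloom Filter behaviour described in the hunk above (increment on add, decrement on remove, and a saturated counter treated as "infinity") can be illustrated with a minimal Python sketch. It is not part of the draft: the class name `CountingBloomFilter`, its method names, and the constant `INFINITY` are hypothetical. The bucket indices are passed in directly, standing in for the output of the mapping function M, and the guard against decrementing a counter that is already 0 is an assumption the text does not spell out.

```python
INFINITY = 15  # maximum value of a 4-bit counter, treated as "infinity"


class CountingBloomFilter:
    """Illustrative CBF with saturating counters (hypothetical, not from the draft)."""

    def __init__(self, num_buckets):
        # All L buckets start with order 0.
        self.buckets = [0] * num_buckets

    def add(self, indices):
        # `indices` stands in for M(element), the set of bucket indices.
        for i in indices:
            if self.buckets[i] != INFINITY:  # "infinity" is never incremented
                self.buckets[i] += 1

    def remove(self, indices):
        for i in indices:
            # "infinity" is never decremented; the check against 0 is an
            # assumption to keep the unsigned counter from underflowing.
            if self.buckets[i] != INFINITY and self.buckets[i] > 0:
                self.buckets[i] -= 1

    def may_contain(self, indices):
        # "maybe" if every mapped bucket is non-zero, otherwise "no".
        return all(self.buckets[i] > 0 for i in indices)
```

With four buckets, adding the index sets {1,3} and {0,3} gives the orders 1, 1, 0, 2, matching the collision example in the text. Removing {1,3} afterwards leaves 1, 0, 0, 1.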
@@ -309,34 +320,34 @@
309 <t> 320 <t>
310 An Invertible Bloom Filter (IBF) is a further extension of the <xref target="cbf" format="title" />. 321 An Invertible Bloom Filter (IBF) is a further extension of the <xref target="cbf" format="title" />.
311 An IBF extends the <xref target="cbf" format="title" /> with two more operations: 322 An IBF extends the <xref target="cbf" format="title" /> with two more operations:
312 decode and set difference. This two extra operations are useful to efficiently extract 323 decode and set difference. This two extra operations are key to efficiently obtain
313 small differences between large sets. 324 small differences between large sets.
314 </t> 325 </t>
315 <section anchor="ibf_structure" numbered="true" toc="default"> 326 <section anchor="ibf_structure" numbered="true" toc="default">
316 <name>Structure</name> 327 <name>Structure</name>
317 <t> 328 <t>
318 An IBF consists of a mapping function M and 329 An IBF consists of an injective mapping function M mapping
319 L buckets that each store a signed 330 elements to k out of L buckets. Each of the L buckets stores
320 counter and an XHASH. An XHASH is the XOR of various 331 a signed COUNTER, an IDSUM and an XHASH.
321 hash values. As before, the 332 An IDSUM is the XOR of various element IDs.
322 values used for k, L and the number of bits used 333 An XHASH is the XOR of various hash values.
323 for the signed counter and the XHASH depend 334 As before, the values used for k, L and the number of bits used
324 on the set size and various other trade-offs, 335 for the signed counter and the XHASH depend
325 including the CPU architecture. 336 on the set size and various other trade-offs.
326 </t> 337 </t>
327 <t> 338 <t>
328 If the IBF size is too small or the mapping 339 If the IBF size is too small or the mapping
329 function does not spread out the elements 340 function does not spread out the elements
330 uniformly, the signed counter can overflow or 341 uniformly, the signed counter can overflow or
331 underflow. As with the CBF, the "maximum" value is 342 underflow. As with the CBF, the "maximum" value is
332 thus used to represent "infinite". As there is no 343 thus used to represent "infinite". As there is no
333 need to distinguish between overflow and 344 need to distinguish between overflow and
334 underflow, the most canonical representation of 345 underflow, the most canonical representation of
335 "infinite" would be the minimum value of the 346 "infinite" would be the minimum value of the
336 counter in the canonical 2-complement 347 counter in the canonical 2-complement
337 interpretation. For example, given a 4-bit 348 interpretation. For example, given a 4-bit
338 counter a value of -8 would be used to represent 349 counter a value of -8 would be used to represent
339 "infinity". 350 "infinity".
340 </t> 351 </t>
341 <figure anchor="figure_ibf_structure"> 352 <figure anchor="figure_ibf_structure">
342 <artwork name="" type="" align="left" alt=""><![CDATA[ 353 <artwork name="" type="" align="left" alt=""><![CDATA[
@@ -737,7 +748,8 @@ FUNCTION id_calculation (element,ibf_salt):
737 <section anchor="ibf_format_bucket_identification" numbered="true" toc="default"> 748 <section anchor="ibf_format_bucket_identification" numbered="true" toc="default">
738 <name>Mapping Function</name> 749 <name>Mapping Function</name>
739 <t> 750 <t>
740 The mapping function M as described above in the figure <xref target="bf_mapping_function_math" format="default" /> 751 For an IBF, it is beneficial to use an injective mapping function M.
752 The mapping function M as described above in the figure <xref target="bf_mapping_function_math" format="default" />
741 decides in which buckets the ID and HASH have to be binary XORed to. In practice 753 decides in which buckets the ID and HASH have to be binary XORed to. In practice
742 the following algorithm is used: 754 the following algorithm is used:
743 </t> 755 </t>