author | Christian Grothoff <christian@grothoff.org> | 2021-06-12 17:19:59 +0200 |
---|---|---|
committer | Christian Grothoff <christian@grothoff.org> | 2021-06-12 17:19:59 +0200 |
commit | 2e15bfdc45ea02d516126f2947e8d3ffb409d084 (patch) | |
tree | fce7737425e9e5ac3039afb0694819450a502173 /draft-summermatter-set-union.xml | |
parent | da2d89045c2fc10ea393e4eb917c3dca480486bc (diff) | |
download | lsd0003-2e15bfdc45ea02d516126f2947e8d3ffb409d084.tar.gz lsd0003-2e15bfdc45ea02d516126f2947e8d3ffb409d084.zip |
chapter 2
Diffstat (limited to 'draft-summermatter-set-union.xml')
-rw-r--r-- | draft-summermatter-set-union.xml | 114 |
1 files changed, 63 insertions, 51 deletions
diff --git a/draft-summermatter-set-union.xml b/draft-summermatter-set-union.xml
index 66fdfab..87793e6 100644
--- a/draft-summermatter-set-union.xml
+++ b/draft-summermatter-set-union.xml
@@ -174,16 +174,17 @@ | |||
174 | <section anchor="background" numbered="true" toc="default"> | 174 | <section anchor="background" numbered="true" toc="default"> |
175 | <name>Background</name> | 175 | <name>Background</name> |
176 | <section anchor="bf" numbered="true" toc="default"> | 176 | <section anchor="bf" numbered="true" toc="default"> |
177 | <name>Bloom Filters</name> | 177 | <name>Bloom Filter</name> |
178 | <t> | 178 | <t> |
179 | A Bloom filter (BF) is a space-efficient datastructure to test if an element is part of a set of elements. | 179 | A Bloom filter (BF) is a space-efficient probabilistic |
180 | Elements are identified by an element ID. | 180 | datastructure to test if an element is part of a set of elements. |
181 | Since a BF is a probabilistic datastructure, it is possible to have false-positives: when asked | 181 | Elements are identified by an element ID. |
182 | if an element is in the set, the answer from a BF is either "no" or "maybe". | 182 | Since a BF is a probabilistic datastructure, it is possible to have false-positives: when asked |
183 | if an element is in the set, the answer from a BF is either "no" or "maybe". | ||
183 | </t> | 184 | </t> |
184 | <t> | 185 | <t> |
185 | A BF consists of L buckets. Every bucket is a binary value that can be either 0 or 1. All buckets are initialized | 186 | A BF consists of L buckets. Every bucket is a binary value that can be either 0 or 1. All buckets are initialized |
186 | to 0. A mapping function M is used to map each ID of each element from the set to a subset of k buckets. M is non-injective | 187 | to 0. A mapping function M is used to map each ID of each element from the set to a subset of k buckets. In the original proposal by Bloom, M is non-injective |
187 | and can thus map the same element multiple times to the same bucket. | 188 | and can thus map the same element multiple times to the same bucket. |
188 | The type of the mapping function can thus be described by the following mathematical notation: | 189 | The type of the mapping function can thus be described by the following mathematical notation: |
189 | </t> | 190 | </t> |
@@ -211,13 +212,11 @@ | |||
211 | To check if an element may be in the set, one tests if all buckets under the map M are set to 1. | 212 | To check if an element may be in the set, one tests if all buckets under the map M are set to 1. |
212 | </t> | 213 | </t> |
213 | <t> | 214 | <t> |
214 | Further in this document a bitstream output by the mapping function is represented by | ||
215 | a set of numeric values for example (0101) = (2,4). | ||
216 | In the BF the buckets are set to 1 if the corresponding bit in the bitstream is 1. | 215 | In the BF the buckets are set to 1 if the corresponding bit in the bitstream is 1. |
217 | If there is a collision and a bucket is already set to 1, the bucket stays 1. | 216 | If there is a collision and a bucket is already set to 1, the bucket stays at 1. |
218 | </t> | 217 | </t> |
219 | <t> | 218 | <t> |
220 | In the following example the element M(element) = (1,3) has been added: | 219 | In the following example the element e0 with M(e0) = {1,3} has been added: |
221 | </t> | 220 | </t> |
222 | <figure anchor="figure_bf_insert_0"> | 221 | <figure anchor="figure_bf_insert_0"> |
223 | <artwork name="" type="" align="left" alt=""><![CDATA[ | 222 | <artwork name="" type="" align="left" alt=""><![CDATA[ |
@@ -228,8 +227,10 @@ | |||
228 | ]]></artwork> | 227 | ]]></artwork> |
229 | </figure> | 228 | </figure> |
230 | <t> | 229 | <t> |
231 | It is easy to see that the M(element) = (0,3) could be in the BF below and M(element) = (0,2) cannot be | 230 | It is easy to see that an element e1 with M(e1) = {0,3} |
232 | in the BF below: | 231 | could have been added to the BF below, while an element e2 |
232 | with M(e2) = {0,2} cannot be in the set represented by the | ||
233 | BF below: | ||
233 | </t> | 234 | </t> |
234 | 235 | ||
235 | <figure anchor="figure_bf_contains"> | 236 | <figure anchor="figure_bf_contains"> |
@@ -255,14 +256,18 @@ | |||
255 | <section anchor="cbf" numbered="true" toc="default"> | 256 | <section anchor="cbf" numbered="true" toc="default"> |
256 | <name>Counting Bloom Filter</name> | 257 | <name>Counting Bloom Filter</name> |
257 | <t> | 258 | <t> |
258 | A Counting Bloom Filter (CBF) is an extension of the <xref target="bf" format="title" />. In the CBF, buckets are | 259 | A Counting Bloom Filter (CBF) is a variation on the idea |
259 | unsigned numbers instead of binary values. This allows the removal of an element from the CBF. | 260 | of a <xref target="bf" format="title" />. With a CBF, buckets are |
261 | unsigned numbers instead of binary values. | ||
262 | This allows the removal of an element from the CBF. | ||
260 | </t> | 263 | </t> |
261 | <t> | 264 | <t> |
262 | Adding an element to the CBF is similar to the adding operation of the BF. However, instead of setting the bucket on hit to 1 the | 265 | Adding an element to the CBF is similar to the adding operation of the BF. |
263 | numeric value stored in the bucket is increased by 1. For example if two colliding elements M(element1) = (1,3) and | 266 | However, instead of setting the buckets to 1 the |
264 | M(element2) = (0,3) are added to the CBF, bucket 0 and 1 are set to 1 and bucket 3 (the colliding bucket) is set | 267 | numeric value stored in the bucket is increased by 1. |
265 | to 2: | 268 | For example, if two colliding elements M(e1) = {1,3} and |
269 | M(e2) = {0,3} are added to the CBF, bucket 0 and 1 are set | ||
270 | to 1 and bucket 3 (the colliding bucket) is set to 2: | ||
266 | </t> | 271 | </t> |
267 | <figure anchor="figure_cbf_insert_0"> | 272 | <figure anchor="figure_cbf_insert_0"> |
268 | <artwork name="" type="" align="left" alt=""><![CDATA[ | 273 | <artwork name="" type="" align="left" alt=""><![CDATA[ |
@@ -273,13 +278,15 @@ | |||
273 | ]]></artwork> | 278 | ]]></artwork> |
274 | </figure> | 279 | </figure> |
275 | <t> | 280 | <t> |
276 | The counter stored in the bucket is also called the order of the bucket. | 281 | The counter stored in the bucket is also called the order of the bucket. |
277 | </t> | 282 | </t> |
278 | <t> | 283 | <t> |
279 | To remove an element form the CBF the counters of all buckets the element is mapped to are decreased by 1. | 284 | To remove an element form the CBF the counters of all buckets |
285 | the element is mapped to are decreased by 1. | ||
280 | </t> | 286 | </t> |
281 | <t> | 287 | <t> |
282 | Removing M(element2) = (1,3) from the CBF above: | 288 | For example, removing M(e2) = {1,3} from the CBF above |
289 | results in: | ||
283 | </t> | 290 | </t> |
284 | <figure anchor="figure_cbf_remove_0"> | 291 | <figure anchor="figure_cbf_remove_0"> |
285 | <artwork name="" type="" align="left" alt=""><![CDATA[ | 292 | <artwork name="" type="" align="left" alt=""><![CDATA[ |
@@ -290,15 +297,19 @@ | |||
290 | ]]></artwork> | 297 | ]]></artwork> |
291 | </figure> | 298 | </figure> |
292 | <t> | 299 | <t> |
293 | In practice, the number of bits available for the counters is usually finite. For example, given a 4-bit | 300 | In practice, the number of bits available for the counters |
294 | counter, a CBF bucket would overflow 16 elements are mapped to the same bucket. To efficiently | 301 | is often finite. For example, given a 4-bit |
295 | handle this case, the maximum value (15 in our example) is considered to represent "infinity". Once the | 302 | counter, a CBF bucket would overflow 16 elements are mapped |
303 | to the same bucket. To handle this case, the maximum value | ||
304 | (15 in our example) is considered to represent "infinity". Once the | ||
296 | order of a bucket reaches "infinity", it is no longer incremented or decremented. | 305 | order of a bucket reaches "infinity", it is no longer incremented or decremented. |
297 | </t> | 306 | </t> |
298 | <t> | 307 | <t> |
299 | The parameters L and k and the number of bits allocated to the counters depend on the set size. | 308 | The parameters L and k and the number of bits allocated to the counters |
300 | An IBF will degenerate when subjected to insert and remove iterations of different elements, and eventually all | 309 | SHOULD depend on the set size. |
301 | buckets will reach "infinity". The speed of the degradation will depend on the choice of L and k in | 310 | A CBF will degenerate when subjected to insert and remove iterations of |
311 | different elements, and eventually all buckets will reach "infinity". | ||
312 | The speed of the degradation will depend on the choice of L and k in | ||
302 | relation to the number of elements stored in the IBF. | 313 | relation to the number of elements stored in the IBF. |
303 | </t> | 314 | </t> |
304 | </section> | 315 | </section> |
@@ -309,34 +320,34 @@ | |||
309 | <t> | 320 | <t> |
310 | An Invertible Bloom Filter (IBF) is a further extension of the <xref target="cbf" format="title" />. | 321 | An Invertible Bloom Filter (IBF) is a further extension of the <xref target="cbf" format="title" />. |
311 | An IBF extends the <xref target="cbf" format="title" /> with two more operations: | 322 | An IBF extends the <xref target="cbf" format="title" /> with two more operations: |
312 | decode and set difference. This two extra operations are useful to efficiently extract | 323 | decode and set difference. This two extra operations are key to efficiently obtain |
313 | small differences between large sets. | 324 | small differences between large sets. |
314 | </t> | 325 | </t> |
315 | <section anchor="ibf_structure" numbered="true" toc="default"> | 326 | <section anchor="ibf_structure" numbered="true" toc="default"> |
316 | <name>Structure</name> | 327 | <name>Structure</name> |
317 | <t> | 328 | <t> |
318 | An IBF consists of a mapping function M and | 329 | An IBF consists of an injective mapping function M mapping |
319 | L buckets that each store a signed | 330 | elements to k out of L buckets. Each of the L buckets stores |
320 | counter and an XHASH. An XHASH is the XOR of various | 331 | a signed COUNTER, an IDSUM and an XHASH. |
321 | hash values. As before, the | 332 | An IDSUM is the XOR of various element IDs. |
322 | values used for k, L and the number of bits used | 333 | An XHASH is the XOR of various hash values. |
323 | for the signed counter and the XHASH depend | 334 | As before, the values used for k, L and the number of bits used |
324 | on the set size and various other trade-offs, | 335 | for the signed counter and the XHASH depend |
325 | including the CPU architecture. | 336 | on the set size and various other trade-offs. |
326 | </t> | 337 | </t> |
327 | <t> | 338 | <t> |
328 | If the IBF size is too small or the mapping | 339 | If the IBF size is too small or the mapping |
329 | function does not spread out the elements | 340 | function does not spread out the elements |
330 | uniformly, the signed counter can overflow or | 341 | uniformly, the signed counter can overflow or |
331 | underflow. As with the CBF, the "maximum" value is | 342 | underflow. As with the CBF, the "maximum" value is |
332 | thus used to represent "infinite". As there is no | 343 | thus used to represent "infinite". As there is no |
333 | need to distinguish between overflow and | 344 | need to distinguish between overflow and |
334 | underflow, the most canonical representation of | 345 | underflow, the most canonical representation of |
335 | "infinite" would be the minimum value of the | 346 | "infinite" would be the minimum value of the |
336 | counter in the canonical 2-complement | 347 | counter in the canonical 2-complement |
337 | interpretation. For example, given a 4-bit | 348 | interpretation. For example, given a 4-bit |
338 | counter a value of -8 would be used to represent | 349 | counter a value of -8 would be used to represent |
339 | "infinity". | 350 | "infinity". |
340 | </t> | 351 | </t> |
341 | <figure anchor="figure_ibf_structure"> | 352 | <figure anchor="figure_ibf_structure"> |
342 | <artwork name="" type="" align="left" alt=""><![CDATA[ | 353 | <artwork name="" type="" align="left" alt=""><![CDATA[ |
@@ -737,7 +748,8 @@ FUNCTION id_calculation (element,ibf_salt): | |||
737 | <section anchor="ibf_format_bucket_identification" numbered="true" toc="default"> | 748 | <section anchor="ibf_format_bucket_identification" numbered="true" toc="default"> |
738 | <name>Mapping Function</name> | 749 | <name>Mapping Function</name> |
739 | <t> | 750 | <t> |
740 | The mapping function M as described above in the figure <xref target="bf_mapping_function_math" format="default" /> | 751 | For an IBF, it is beneficial to use an injective mapping function M. |
752 | The mapping function M as described above in the figure <xref target="bf_mapping_function_math" format="default" /> | ||
741 | decides in which buckets the ID and HASH have to be binary XORed to. In practice | 753 | decides in which buckets the ID and HASH have to be binary XORed to. In practice |
742 | the following algorithm is used: | 754 | the following algorithm is used: |
743 | </t> | 755 | </t> |
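
The Counting Bloom Filter behaviour described in the changed text (counters incremented on insert, decremented on removal, and frozen once they reach the "infinity" value) can be illustrated with a minimal Python sketch. The class name, the SHA-256-based `mapping` helper and the default parameters are assumptions made for illustration only; they are not taken from the draft, which specifies its own mapping function elsewhere.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative CBF: L buckets, k buckets per element, saturating counters."""

    def __init__(self, L=8, k=2, counter_bits=4):
        self.L = L
        self.k = k
        # The maximum counter value stands for "infinity" (15 for 4 bits).
        self.infinity = (1 << counter_bits) - 1
        self.buckets = [0] * L

    def mapping(self, element_id):
        """Hypothetical stand-in for M: derive k bucket indices from a hash.

        As in Bloom's original proposal, the same bucket may be chosen twice.
        """
        indices = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{element_id}".encode()).digest()
            indices.append(int.from_bytes(digest[:8], "big") % self.L)
        return indices

    def insert(self, indices):
        # Increase the order of each bucket unless it is already "infinity".
        for b in indices:
            if self.buckets[b] < self.infinity:
                self.buckets[b] += 1

    def remove(self, indices):
        # Decrease the order of each bucket; "infinity" and 0 stay unchanged.
        for b in indices:
            if 0 < self.buckets[b] < self.infinity:
                self.buckets[b] -= 1

    def may_contain(self, indices):
        # "maybe" if every mapped bucket is non-zero, otherwise a definite "no".
        return all(self.buckets[b] > 0 for b in indices)


# The colliding elements from the text, with bucket sets given directly.
cbf = CountingBloomFilter(L=4, k=2)
cbf.insert([1, 3])   # e1 with M(e1) = {1,3}
cbf.insert([0, 3])   # e2 with M(e2) = {0,3}
print(cbf.buckets)   # [1, 1, 0, 2]
```

Passing the bucket indices directly keeps the example aligned with the M(e1) and M(e2) values quoted in the diff; in real use the indices would come from the mapping function applied to the element ID.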