lsd0003

LSD0003: Set Union
Log | Files | Refs | README

commit 2e15bfdc45ea02d516126f2947e8d3ffb409d084
parent da2d89045c2fc10ea393e4eb917c3dca480486bc
Author: Christian Grothoff <christian@grothoff.org>
Date:   Sat, 12 Jun 2021 17:19:59 +0200

chapter 2

Diffstat:
Mdraft-summermatter-set-union.xml | 114++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
1 file changed, 63 insertions(+), 51 deletions(-)

diff --git a/draft-summermatter-set-union.xml b/draft-summermatter-set-union.xml @@ -174,16 +174,17 @@ <section anchor="background" numbered="true" toc="default"> <name>Background</name> <section anchor="bf" numbered="true" toc="default"> - <name>Bloom Filters</name> + <name>Bloom Filter</name> <t> - A Bloom filter (BF) is a space-efficient datastructure to test if an element is part of a set of elements. - Elements are identified by an element ID. - Since a BF is a probabilistic datastructure, it is possible to have false-positives: when asked - if an element is in the set, the answer from a BF is either "no" or "maybe". + A Bloom filter (BF) is a space-efficient probabilistic + datastructure to test if an element is part of a set of elements. + Elements are identified by an element ID. + Since a BF is a probabilistic datastructure, it is possible to have false-positives: when asked + if an element is in the set, the answer from a BF is either "no" or "maybe". </t> <t> A BF consists of L buckets. Every bucket is a binary value that can be either 0 or 1. All buckets are initialized - to 0. A mapping function M is used to map each ID of each element from the set to a subset of k buckets. M is non-injective + to 0. A mapping function M is used to map each ID of each element from the set to a subset of k buckets. In the original proposal by Bloom, M is non-injective and can thus map the same element multiple times to the same bucket. The type of the mapping function can thus be described by the following mathematical notation: </t> @@ -211,13 +212,11 @@ To check if an element may be in the set, one tests if all buckets under the map M are set to 1. </t> <t> - Further in this document a bitstream output by the mapping function is represented by - a set of numeric values for example (0101) = (2,4). In the BF the buckets are set to 1 if the corresponding bit in the bitstream is 1. - If there is a collision and a bucket is already set to 1, the bucket stays 1. + If there is a collision and a bucket is already set to 1, the bucket stays at 1. </t> <t> - In the following example the element M(element) = (1,3) has been added: + In the following example the element e0 with M(e0) = {1,3} has been added: </t> <figure anchor="figure_bf_insert_0"> <artwork name="" type="" align="left" alt=""><![CDATA[ @@ -228,8 +227,10 @@ ]]></artwork> </figure> <t> - It is easy to see that the M(element) = (0,3) could be in the BF below and M(element) = (0,2) cannot be - in the BF below: + It is easy to see that an element e1 with M(e1) = {0,3} + could have been added to the BF below, while an element e2 + with M(e2) = {0,2} cannot be in the set represented by the + BF below: </t> <figure anchor="figure_bf_contains"> @@ -255,14 +256,18 @@ <section anchor="cbf" numbered="true" toc="default"> <name>Counting Bloom Filter</name> <t> - A Counting Bloom Filter (CBF) is an extension of the <xref target="bf" format="title" />. In the CBF, buckets are - unsigned numbers instead of binary values. This allows the removal of an element from the CBF. + A Counting Bloom Filter (CBF) is a variation on the idea + of a <xref target="bf" format="title" />. With a CBF, buckets are + unsigned numbers instead of binary values. + This allows the removal of an element from the CBF. </t> <t> - Adding an element to the CBF is similar to the adding operation of the BF. However, instead of setting the bucket on hit to 1 the - numeric value stored in the bucket is increased by 1. For example if two colliding elements M(element1) = (1,3) and - M(element2) = (0,3) are added to the CBF, bucket 0 and 1 are set to 1 and bucket 3 (the colliding bucket) is set - to 2: + Adding an element to the CBF is similar to the adding operation of the BF. + However, instead of setting the buckets to 1 the + numeric value stored in the bucket is increased by 1. + For example, if two colliding elements M(e1) = {1,3} and + M(e2) = {0,3} are added to the CBF, bucket 0 and 1 are set + to 1 and bucket 3 (the colliding bucket) is set to 2: </t> <figure anchor="figure_cbf_insert_0"> <artwork name="" type="" align="left" alt=""><![CDATA[ @@ -273,13 +278,15 @@ ]]></artwork> </figure> <t> - The counter stored in the bucket is also called the order of the bucket. + The counter stored in the bucket is also called the order of the bucket. </t> <t> - To remove an element form the CBF the counters of all buckets the element is mapped to are decreased by 1. + To remove an element form the CBF the counters of all buckets + the element is mapped to are decreased by 1. </t> <t> - Removing M(element2) = (1,3) from the CBF above: + For example, removing M(e2) = {1,3} from the CBF above + results in: </t> <figure anchor="figure_cbf_remove_0"> <artwork name="" type="" align="left" alt=""><![CDATA[ @@ -290,15 +297,19 @@ ]]></artwork> </figure> <t> - In practice, the number of bits available for the counters is usually finite. For example, given a 4-bit - counter, a CBF bucket would overflow 16 elements are mapped to the same bucket. To efficiently - handle this case, the maximum value (15 in our example) is considered to represent "infinity". Once the + In practice, the number of bits available for the counters + is often finite. For example, given a 4-bit + counter, a CBF bucket would overflow 16 elements are mapped + to the same bucket. To handle this case, the maximum value + (15 in our example) is considered to represent "infinity". Once the order of a bucket reaches "infinity", it is no longer incremented or decremented. </t> <t> - The parameters L and k and the number of bits allocated to the counters depend on the set size. - An IBF will degenerate when subjected to insert and remove iterations of different elements, and eventually all - buckets will reach "infinity". The speed of the degradation will depend on the choice of L and k in + The parameters L and k and the number of bits allocated to the counters + SHOULD depend on the set size. + A CBF will degenerate when subjected to insert and remove iterations of + different elements, and eventually all buckets will reach "infinity". + The speed of the degradation will depend on the choice of L and k in relation to the number of elements stored in the IBF. </t> </section> @@ -309,34 +320,34 @@ <t> An Invertible Bloom Filter (IBF) is a further extension of the <xref target="cbf" format="title" />. An IBF extends the <xref target="cbf" format="title" /> with two more operations: - decode and set difference. This two extra operations are useful to efficiently extract + decode and set difference. This two extra operations are key to efficiently obtain small differences between large sets. </t> <section anchor="ibf_structure" numbered="true" toc="default"> <name>Structure</name> <t> - An IBF consists of a mapping function M and - L buckets that each store a signed - counter and an XHASH. An XHASH is the XOR of various - hash values. As before, the - values used for k, L and the number of bits used - for the signed counter and the XHASH depend - on the set size and various other trade-offs, - including the CPU architecture. - </t> - <t> - If the IBF size is too small or the mapping - function does not spread out the elements - uniformly, the signed counter can overflow or - underflow. As with the CBF, the "maximum" value is - thus used to represent "infinite". As there is no - need to distinguish between overflow and - underflow, the most canonical representation of - "infinite" would be the minimum value of the - counter in the canonical 2-complement - interpretation. For example, given a 4-bit - counter a value of -8 would be used to represent - "infinity". + An IBF consists of an injective mapping function M mapping + elements to k out of L buckets. Each of the L buckets stores + a signed COUNTER, an IDSUM and an XHASH. + An IDSUM is the XOR of various element IDs. + An XHASH is the XOR of various hash values. + As before, the values used for k, L and the number of bits used + for the signed counter and the XHASH depend + on the set size and various other trade-offs. + </t> + <t> + If the IBF size is too small or the mapping + function does not spread out the elements + uniformly, the signed counter can overflow or + underflow. As with the CBF, the "maximum" value is + thus used to represent "infinite". As there is no + need to distinguish between overflow and + underflow, the most canonical representation of + "infinite" would be the minimum value of the + counter in the canonical 2-complement + interpretation. For example, given a 4-bit + counter a value of -8 would be used to represent + "infinity". </t> <figure anchor="figure_ibf_structure"> <artwork name="" type="" align="left" alt=""><![CDATA[ @@ -737,7 +748,8 @@ FUNCTION id_calculation (element,ibf_salt): <section anchor="ibf_format_bucket_identification" numbered="true" toc="default"> <name>Mapping Function</name> <t> - The mapping function M as described above in the figure <xref target="bf_mapping_function_math" format="default" /> + For an IBF, it is beneficial to use an injective mapping function M. + The mapping function M as described above in the figure <xref target="bf_mapping_function_math" format="default" /> decides in which buckets the ID and HASH have to be binary XORed to. In practice the following algorithm is used: </t>