commit 2e15bfdc45ea02d516126f2947e8d3ffb409d084
parent da2d89045c2fc10ea393e4eb917c3dca480486bc
Author: Christian Grothoff <christian@grothoff.org>
Date: Sat, 12 Jun 2021 17:19:59 +0200
chapter 2
Diffstat:
1 file changed, 63 insertions(+), 51 deletions(-)
diff --git a/draft-summermatter-set-union.xml b/draft-summermatter-set-union.xml
@@ -174,16 +174,17 @@
<section anchor="background" numbered="true" toc="default">
<name>Background</name>
<section anchor="bf" numbered="true" toc="default">
- <name>Bloom Filters</name>
+ <name>Bloom Filter</name>
<t>
- A Bloom filter (BF) is a space-efficient datastructure to test if an element is part of a set of elements.
- Elements are identified by an element ID.
- Since a BF is a probabilistic datastructure, it is possible to have false-positives: when asked
- if an element is in the set, the answer from a BF is either "no" or "maybe".
+ A Bloom filter (BF) is a space-efficient probabilistic
+ datastructure to test if an element is part of a set of elements.
+ Elements are identified by an element ID.
+ Since a BF is a probabilistic datastructure, it is possible to have false-positives: when asked
+ if an element is in the set, the answer from a BF is either "no" or "maybe".
</t>
<t>
A BF consists of L buckets. Every bucket is a binary value that can be either 0 or 1. All buckets are initialized
- to 0. A mapping function M is used to map each ID of each element from the set to a subset of k buckets. M is non-injective
+ to 0. A mapping function M is used to map each ID of each element from the set to a subset of k buckets. In the original proposal by Bloom, M is non-injective
and can thus map the same element multiple times to the same bucket.
The type of the mapping function can thus be described by the following mathematical notation:
</t>
@@ -211,13 +212,11 @@
To check if an element may be in the set, one tests if all buckets under the map M are set to 1.
</t>
<t>
- Further in this document a bitstream output by the mapping function is represented by
- a set of numeric values for example (0101) = (2,4).
In the BF the buckets are set to 1 if the corresponding bit in the bitstream is 1.
- If there is a collision and a bucket is already set to 1, the bucket stays 1.
+ If there is a collision and a bucket is already set to 1, the bucket stays at 1.
</t>
<t>
- In the following example the element M(element) = (1,3) has been added:
+ In the following example the element e0 with M(e0) = {1,3} has been added:
</t>
<figure anchor="figure_bf_insert_0">
<artwork name="" type="" align="left" alt=""><![CDATA[
@@ -228,8 +227,10 @@
]]></artwork>
</figure>
<t>
- It is easy to see that the M(element) = (0,3) could be in the BF below and M(element) = (0,2) cannot be
- in the BF below:
+ It is easy to see that an element e1 with M(e1) = {0,3}
+ could have been added to the BF below, while an element e2
+ with M(e2) = {0,2} cannot be in the set represented by the
+ BF below:
</t>
<figure anchor="figure_bf_contains">
@@ -255,14 +256,18 @@
<section anchor="cbf" numbered="true" toc="default">
<name>Counting Bloom Filter</name>
<t>
- A Counting Bloom Filter (CBF) is an extension of the <xref target="bf" format="title" />. In the CBF, buckets are
- unsigned numbers instead of binary values. This allows the removal of an element from the CBF.
+ A Counting Bloom Filter (CBF) is a variation on the idea
+ of a <xref target="bf" format="title" />. With a CBF, buckets are
+ unsigned numbers instead of binary values.
+ This allows the removal of an element from the CBF.
</t>
<t>
- Adding an element to the CBF is similar to the adding operation of the BF. However, instead of setting the bucket on hit to 1 the
- numeric value stored in the bucket is increased by 1. For example if two colliding elements M(element1) = (1,3) and
- M(element2) = (0,3) are added to the CBF, bucket 0 and 1 are set to 1 and bucket 3 (the colliding bucket) is set
- to 2:
+ Adding an element to the CBF is similar to the adding operation of the BF.
+ However, instead of setting the buckets to 1 the
+ numeric value stored in the bucket is increased by 1.
+ For example, if two colliding elements M(e1) = {1,3} and
+ M(e2) = {0,3} are added to the CBF, bucket 0 and 1 are set
+ to 1 and bucket 3 (the colliding bucket) is set to 2:
</t>
<figure anchor="figure_cbf_insert_0">
<artwork name="" type="" align="left" alt=""><![CDATA[
@@ -273,13 +278,15 @@
]]></artwork>
</figure>
<t>
- The counter stored in the bucket is also called the order of the bucket.
+ The counter stored in the bucket is also called the order of the bucket.
</t>
<t>
- To remove an element form the CBF the counters of all buckets the element is mapped to are decreased by 1.
+ To remove an element form the CBF the counters of all buckets
+ the element is mapped to are decreased by 1.
</t>
<t>
- Removing M(element2) = (1,3) from the CBF above:
+ For example, removing M(e2) = {1,3} from the CBF above
+ results in:
</t>
<figure anchor="figure_cbf_remove_0">
<artwork name="" type="" align="left" alt=""><![CDATA[
@@ -290,15 +297,19 @@
]]></artwork>
</figure>
<t>
- In practice, the number of bits available for the counters is usually finite. For example, given a 4-bit
- counter, a CBF bucket would overflow 16 elements are mapped to the same bucket. To efficiently
- handle this case, the maximum value (15 in our example) is considered to represent "infinity". Once the
+ In practice, the number of bits available for the counters
+ is often finite. For example, given a 4-bit
+ counter, a CBF bucket would overflow 16 elements are mapped
+ to the same bucket. To handle this case, the maximum value
+ (15 in our example) is considered to represent "infinity". Once the
order of a bucket reaches "infinity", it is no longer incremented or decremented.
</t>
<t>
- The parameters L and k and the number of bits allocated to the counters depend on the set size.
- An IBF will degenerate when subjected to insert and remove iterations of different elements, and eventually all
- buckets will reach "infinity". The speed of the degradation will depend on the choice of L and k in
+ The parameters L and k and the number of bits allocated to the counters
+ SHOULD depend on the set size.
+ A CBF will degenerate when subjected to insert and remove iterations of
+ different elements, and eventually all buckets will reach "infinity".
+ The speed of the degradation will depend on the choice of L and k in
relation to the number of elements stored in the IBF.
</t>
</section>
@@ -309,34 +320,34 @@
<t>
An Invertible Bloom Filter (IBF) is a further extension of the <xref target="cbf" format="title" />.
An IBF extends the <xref target="cbf" format="title" /> with two more operations:
- decode and set difference. This two extra operations are useful to efficiently extract
+ decode and set difference. This two extra operations are key to efficiently obtain
small differences between large sets.
</t>
<section anchor="ibf_structure" numbered="true" toc="default">
<name>Structure</name>
<t>
- An IBF consists of a mapping function M and
- L buckets that each store a signed
- counter and an XHASH. An XHASH is the XOR of various
- hash values. As before, the
- values used for k, L and the number of bits used
- for the signed counter and the XHASH depend
- on the set size and various other trade-offs,
- including the CPU architecture.
- </t>
- <t>
- If the IBF size is too small or the mapping
- function does not spread out the elements
- uniformly, the signed counter can overflow or
- underflow. As with the CBF, the "maximum" value is
- thus used to represent "infinite". As there is no
- need to distinguish between overflow and
- underflow, the most canonical representation of
- "infinite" would be the minimum value of the
- counter in the canonical 2-complement
- interpretation. For example, given a 4-bit
- counter a value of -8 would be used to represent
- "infinity".
+ An IBF consists of an injective mapping function M mapping
+ elements to k out of L buckets. Each of the L buckets stores
+ a signed COUNTER, an IDSUM and an XHASH.
+ An IDSUM is the XOR of various element IDs.
+ An XHASH is the XOR of various hash values.
+ As before, the values used for k, L and the number of bits used
+ for the signed counter and the XHASH depend
+ on the set size and various other trade-offs.
+ </t>
+ <t>
+ If the IBF size is too small or the mapping
+ function does not spread out the elements
+ uniformly, the signed counter can overflow or
+ underflow. As with the CBF, the "maximum" value is
+ thus used to represent "infinite". As there is no
+ need to distinguish between overflow and
+ underflow, the most canonical representation of
+ "infinite" would be the minimum value of the
+ counter in the canonical 2-complement
+ interpretation. For example, given a 4-bit
+ counter a value of -8 would be used to represent
+ "infinity".
</t>
<figure anchor="figure_ibf_structure">
<artwork name="" type="" align="left" alt=""><![CDATA[
@@ -737,7 +748,8 @@ FUNCTION id_calculation (element,ibf_salt):
<section anchor="ibf_format_bucket_identification" numbered="true" toc="default">
<name>Mapping Function</name>
<t>
- The mapping function M as described above in the figure <xref target="bf_mapping_function_math" format="default" />
+ For an IBF, it is beneficial to use an injective mapping function M.
+ The mapping function M as described above in the figure <xref target="bf_mapping_function_math" format="default" />
decides in which buckets the ID and HASH have to be binary XORed to. In practice
the following algorithm is used:
</t>