chapter 2 - lsd0003 - LSD0003: Set Union

commit 2e15bfdc45ea02d516126f2947e8d3ffb409d084
parent da2d89045c2fc10ea393e4eb917c3dca480486bc
Author: Christian Grothoff <christian@grothoff.org>
Date:   Sat, 12 Jun 2021 17:19:59 +0200

chapter 2

Diffstat:
M draft-summermatter-set-union.xml  | 114 ++++++++++++++++++++++++++++++++++++++++++++-----------------------------------

1 file changed, 63 insertions(+), 51 deletions(-)
diff --git a/draft-summermatter-set-union.xml b/draft-summermatter-set-union.xml
@@ -174,16 +174,17 @@
         <section anchor="background" numbered="true" toc="default">
             <name>Background</name>
             <section anchor="bf" numbered="true" toc="default">
-                <name>Bloom Filters</name>
+                <name>Bloom Filter</name>
                 <t>
-                    A Bloom filter (BF) is a space-efficient datastructure to test if an element is part of a set of elements.
-                    Elements are identified by an element ID.
-                    Since a BF is a probabilistic datastructure, it is possible to have false-positives: when asked
-                    if an element is in the set, the answer from a BF is either "no" or "maybe".
+                  A Bloom filter (BF) is a space-efficient probabilistic
+                  datastructure to test if an element is part of a set of elements.
+                  Elements are identified by an element ID.
+                  Since a BF is a probabilistic datastructure, it is possible to have false-positives: when asked
+                  if an element is in the set, the answer from a BF is either "no" or "maybe".
                 </t>
                 <t>
                     A BF consists of L buckets. Every bucket is a binary value that can be either 0 or 1. All buckets are initialized
-                    to 0.  A mapping function M is used to map each ID of each element from the set to a subset of k buckets.  M is non-injective
+                    to 0.  A mapping function M is used to map each ID of each element from the set to a subset of k buckets.  In the original proposal by Bloom, M is non-injective
                     and can thus map the same element multiple times to the same bucket.
                     The type of the mapping function can thus be described by the following mathematical notation:
                 </t>
@@ -211,13 +212,11 @@
                     To check if an element may be in the set, one tests if all buckets under the map M are set to 1.
                 </t>
                 <t>
-                    Further in this document a bitstream output by the mapping function is represented by
-                    a set of numeric values for example (0101) = (2,4).
                     In the BF the buckets are set to 1 if the corresponding bit in the bitstream is 1.
-                    If there is a collision and a bucket is already set to 1, the bucket stays 1.
+                    If there is a collision and a bucket is already set to 1, the bucket stays at 1.
                 </t>
                 <t>
-                    In the following example the element M(element) = (1,3) has been added:
+                    In the following example the element e0 with M(e0) = {1,3} has been added:
                 </t>
                     <figure anchor="figure_bf_insert_0">
                         <artwork name="" type="" align="left" alt=""><![CDATA[
@@ -228,8 +227,10 @@
                      ]]></artwork>
                     </figure>
                 <t>
-                    It is easy to see that the M(element) = (0,3) could be in the BF below and M(element) = (0,2) cannot be
-                    in the BF below:
+                  It is easy to see that an element e1 with M(e1) = {0,3}
+                  could have been added to the BF below, while an element e2
+                  with M(e2) = {0,2} cannot be in the set represented by the
+                  BF below:
                 </t>
 
                 <figure anchor="figure_bf_contains">
@@ -255,14 +256,18 @@
             <section anchor="cbf" numbered="true" toc="default">
                 <name>Counting Bloom Filter</name>
                 <t>
-                  A Counting Bloom Filter (CBF) is an extension of the <xref target="bf" format="title" />. In the CBF, buckets are
-                  unsigned numbers instead of binary values.  This allows the removal of an element from the CBF.
+                  A Counting Bloom Filter (CBF) is a variation on the idea
+                  of a <xref target="bf" format="title" />. With a CBF, buckets are
+                  unsigned numbers instead of binary values.
+                  This allows the removal of an element from the CBF.
                 </t>
                 <t>
-                  Adding an element to the CBF is similar to the adding operation of the BF. However, instead of setting the bucket on hit to 1 the
-                  numeric value stored in the bucket is increased by 1. For example if two colliding elements M(element1) = (1,3) and
-                    M(element2) = (0,3) are added to the CBF, bucket 0 and 1 are set to 1 and bucket 3 (the colliding bucket) is set
-                    to 2:
+                  Adding an element to the CBF is similar to the adding operation of the BF.
+                  However, instead of setting the buckets to 1 the
+                  numeric value stored in the bucket is increased by 1.
+                  For example, if two colliding elements M(e1) = {1,3} and
+                  M(e2) = {0,3} are added to the CBF, bucket 0 and 1 are set
+                  to 1 and bucket 3 (the colliding bucket) is set to 2:
                 </t>
                 <figure anchor="figure_cbf_insert_0">
                     <artwork name="" type="" align="left" alt=""><![CDATA[
@@ -273,13 +278,15 @@
                      ]]></artwork>
                 </figure>
                 <t>
-                    The counter stored in the bucket is also called the order of the bucket.
+                  The counter stored in the bucket is also called the order of the bucket.
                 </t>
                 <t>
-                    To remove an element form the CBF the counters of all buckets the element is mapped to are decreased by 1.
+                  To remove an element form the CBF the counters of all buckets
+                  the element is mapped to are decreased by 1.
                 </t>
                 <t>
-                    Removing M(element2) = (1,3) from the CBF above:
+                  For example, removing M(e2) = {1,3} from the CBF above
+                  results in:
                 </t>
                 <figure anchor="figure_cbf_remove_0">
                     <artwork name="" type="" align="left" alt=""><![CDATA[
@@ -290,15 +297,19 @@
                      ]]></artwork>
                 </figure>
                 <t>
-                  In practice, the number of bits available for the counters is usually finite. For example, given a 4-bit
-                  counter, a CBF bucket would overflow 16 elements are mapped to the same bucket. To efficiently
-                  handle this case, the maximum value (15 in our example) is considered to represent "infinity". Once the
+                  In practice, the number of bits available for the counters
+                  is often finite. For example, given a 4-bit
+                  counter, a CBF bucket would overflow 16 elements are mapped
+                  to the same bucket. To handle this case, the maximum value
+                  (15 in our example) is considered to represent "infinity". Once the
                   order of a bucket reaches "infinity", it is no longer incremented or decremented.
                 </t>
                 <t>
-                  The parameters L and k and the number of bits allocated to the counters depend on the set size.
-                  An IBF will degenerate when subjected to insert and remove iterations of different elements, and eventually all
-                  buckets will reach "infinity".  The speed of the degradation will depend on the choice of L and k in
+                  The parameters L and k and the number of bits allocated to the counters
+                  SHOULD depend on the set size.
+                  A CBF will degenerate when subjected to insert and remove iterations of
+                  different elements, and eventually all buckets will reach "infinity".
+                  The speed of the degradation will depend on the choice of L and k in
                   relation to the number of elements stored in the IBF.
                 </t>
             </section>
@@ -309,34 +320,34 @@
             <t>
                 An Invertible Bloom Filter (IBF) is a further extension of the <xref target="cbf" format="title" />.
                 An IBF extends the <xref target="cbf" format="title" /> with two more operations:
-                decode and set difference. This two extra operations are useful to efficiently extract
+                decode and set difference. This two extra operations are key to efficiently obtain
                 small differences between large sets.
             </t>
             <section anchor="ibf_structure" numbered="true" toc="default">
                 <name>Structure</name>
                 <t>
-                    An IBF consists of a mapping function M and
-                    L buckets that each store a signed
-                    counter and an XHASH.  An XHASH is the XOR of various
-                    hash values.  As before, the
-                    values used for k, L and the number of bits used
-                    for the signed counter and the XHASH depend
-                    on the set size and various other trade-offs,
-                    including the CPU architecture.
-                </t>
-                <t>
-                    If the IBF size is too small or the mapping
-                    function does not spread out the elements
-                    uniformly, the signed counter can overflow or
-                    underflow. As with the CBF, the "maximum" value is
-                    thus used to represent "infinite".  As there is no
-                    need to distinguish between overflow and
-                    underflow, the most canonical representation of
-                    "infinite" would be the minimum value of the
-                    counter in the canonical 2-complement
-                    interpretation.  For example, given a 4-bit
-                    counter a value of -8 would be used to represent
-                    "infinity".
+                  An IBF consists of an injective mapping function M mapping
+                  elements to k out of L buckets. Each of the L buckets stores
+                  a signed COUNTER, an IDSUM and an XHASH.
+                  An IDSUM is the XOR of various element IDs.
+                  An XHASH is the XOR of various hash values.
+                  As before, the values used for k, L and the number of bits used
+                  for the signed counter and the XHASH depend
+                  on the set size and various other trade-offs.
+                </t>
+                <t>
+                  If the IBF size is too small or the mapping
+                  function does not spread out the elements
+                  uniformly, the signed counter can overflow or
+                  underflow. As with the CBF, the "maximum" value is
+                  thus used to represent "infinite".  As there is no
+                  need to distinguish between overflow and
+                  underflow, the most canonical representation of
+                  "infinite" would be the minimum value of the
+                  counter in the canonical 2-complement
+                  interpretation.  For example, given a 4-bit
+                  counter a value of -8 would be used to represent
+                  "infinity".
                 </t>
                     <figure anchor="figure_ibf_structure">
                         <artwork name="" type="" align="left" alt=""><![CDATA[
@@ -737,7 +748,8 @@ FUNCTION id_calculation (element,ibf_salt):
                 <section anchor="ibf_format_bucket_identification" numbered="true" toc="default">
                     <name>Mapping Function</name>
                     <t>
-                        The mapping function M as described above in the figure <xref target="bf_mapping_function_math" format="default" />
+                      For an IBF, it is beneficial to use an injective mapping function M.
+                      The mapping function M as described above in the figure <xref target="bf_mapping_function_math" format="default" />
                         decides in which buckets the ID and HASH have to be binary XORed to. In practice
                         the following algorithm is used:
                     </t>

	lsd0003 LSD0003: Set Union
	Log \| Files \| Refs \| README