chapter 4 - lsd0003 - LSD0003: Set Union

commit fa6c0d634c3d16e70eb57c589b3f32d8248e2f20
parent 4d240cd09d0d2adf4e6c755c2a5437e5c0f29463
Author: Christian Grothoff <christian@grothoff.org>
Date:   Sat, 12 Jun 2021 20:34:52 +0200

chapter 4

Diffstat:
M draft-summermatter-set-union.xml  | 201 +++++++++++++++++++++++++++++++++++++++++++------------------------------------

1 file changed, 110 insertions(+), 91 deletions(-)
diff --git a/draft-summermatter-set-union.xml b/draft-summermatter-set-union.xml
@@ -498,15 +498,18 @@ FUNCTION get_bucket_id (key, k, L)
                     <name>Insert Element</name>
                     <t>
                       To add an element to an IBF, the element is mapped to a subset of k buckets using
-                      the injective mapping function M as described in section <xref target="ibf_m" format="title" /> section introducing
-                      BFs. For the buckets selected by the mapping function, the counter is increased by one and the
-                      IDSUM field is set to the XOR of the element ID and the previously stored IDSUM. Furthermore,
-                      the HASHSUM is set to the XOR of the hash of the element ID and the previously stored HASHSUM.
+                      the injective mapping function M as described in section <xref target="ibf_m" format="title" />. For the buckets selected by the mapping function, the counter is increased by one and the
+                      IDSUM field is set to the XOR of the element ID
+                      computed as described in section <xref target="ibf_format_id_generation" format="title" />
+                      and the previously stored IDSUM. Furthermore,
+                      the HASHSUM is set to the XOR of the previously stored HASHSUM
+                      and the hash of the element ID computed as described
+                      in section <xref target="ibf_format_HASH_calculation" format="title" />.
                     </t>
                     <t>
                       In the following example, the insert operation is illustrated using an element with the
-                      ID 0x0102 and a hash of 0x4242, and a second element with the ID 0x0304 and
-                      a hash of 0x0101.
+                      ID 0x0102 mapped to {1,3} with a hash of 0x4242, and a second element with the
+                      ID 0x0304 mapped to {0,1} and a hash of 0x0101.
                     </t>
                     <t>Empty IBF:</t>
                     <figure anchor="figure_ibf_insert_0">
@@ -521,7 +524,7 @@ hashSum |    0x0000   |    0x0000   |    0x0000   |    0x0000   |
         +-------------+-------------+-------------+-------------+
                  ]]></artwork>
                     </figure>
-                    <t>Insert first element: [0101] with ID 0x0102 and hash 0x4242:</t>
+                    <t>Insert first element with ID 0x0102 and hash 0x4242 into {1,3}:</t>
                     <figure anchor="figure_ibf_insert_1">
                         <artwork name="" type="" align="left" alt=""><![CDATA[
             bucket-0     bucket-1       bucket-2      bucket-3
@@ -534,7 +537,7 @@ hashSum |    0x0000   |   0x4242    |    0x0000   |   0x4242    |
         +-------------+-------------+-------------+-------------+
                  ]]></artwork>
                     </figure>
-                    <t>Insert second element: [1100] with ID 0x0304 and hash 0101:</t>
+                    <t>Insert second element with ID 0x0304 and hash 0101 into {0,1}:</t>
                     <figure anchor="figure_ibf_insert_2">
                         <artwork name="" type="" align="left" alt=""><![CDATA[
             bucket-0     bucket-1       bucket-2      bucket-3
@@ -557,9 +560,9 @@ hashSum |    0x0101   |   0x4343    |    0x0000   |   0x4242    |
                         HASHSUM is similarly replaced with the XOR of the old HASHSUM and the hash of the ID.
                     </t>
                     <t>
-                        In the following example the remove operation for the element [1100] with the hash 0x0101 is demonstrated.
+                        In the following example the remove operation is illustrated.
                     </t>
-                <t>IBF with encoded elements:</t>
+                <t>IBF with two encoded elements:</t>
                 <figure anchor="figure_ibf_remove_0">
                     <artwork name="" type="" align="left" alt=""><![CDATA[
             bucket-0     bucket-1       bucket-2      bucket-3
@@ -572,7 +575,7 @@ hashSum |   0x0101    |   0x4343    |    0x0000   |   0x4242    |
         +-------------+-------------+-------------+-------------+
                  ]]></artwork>
                 </figure>
-                    <t>Remove element [1100] with ID 0x0304 and hash 0x0101 from the IBF:</t>
+                    <t>After removal of element with ID 0x0304 and hash 0x0101 mapped to {0,1} from the IBF:</t>
                     <figure anchor="figure_ibf_remove_1">
                         <artwork name="" type="" align="left" alt=""><![CDATA[
             bucket-0     bucket-1       bucket-2      bucket-3
@@ -599,9 +602,9 @@ hashSum |    0x0000   |   0x4242    |    0x0000   |   0x4242    |
                     </t>
                     <t>
                       Buckets in an IBF with a counter of 1 or -1 are crucial for decoding an IBF, as
-                      they might represent only a single element, with the IDSUM being the ID of that element.
-                      Following Eppstein (CITE), we will call buckets that only represent a single
-                      element pure buckets.
+                      they MIGHT represent only a single element, with the IDSUM being the ID of that element.
+                      Following Eppstein (FIXME-CITE), we will call buckets that only represent a single
+                      element <em>pure buckets</em>.
                       Note that due to the possibility of multiple insertion and removal operations
                       affecting the same bucket, not all buckets with a counter of 1 or -1 are
                       actually pure buckets.  Sometimes a counter can be 1 or -1 because N elements
@@ -611,51 +614,55 @@ hashSum |    0x0000   |   0x4242    |    0x0000   |   0x4242    |
                 </section>
 
                 <section anchor="ibf_operations_decode" numbered="true" toc="default">
-                    <name>Decode IBF</name>
+                    <name>Extracting elements</name>
                     <t>
-                      Decoding an IBF yields the HASH of an element from the IBF, or failure.
+                      Extracting elements from an IBF yields IDs of elements from the IBF.
+                      Elements are extracted from an IBF by repeatedly performing a
+                      decode operation on the IBF.
                     </t>
                     <t>
-                        A decode operation requires a pure bucket, that is a bucket to which M
-                        only mapped a single element, to succeed.  Thus, if there is no bucket with
-                        a counter of 1 or -1, decoding fails. However, as a counter of 1 or -1 is
-                        not a guarantee that the bucket is pure, there is also a chance that the
-                        decoder returns an IDSUM value that is actually the XOR of several IDSUMs.
-                        This is primarily detected by checking that the HASHSUM is the hash of the IDSUM.
-                        Only if the HASHSUM also matches, the bucket could be pure.  Additionally,
-                        one MUST check that the IDSUM value actually would be mapped by M to
-                        the respective bucket. If not, there was a hash collision.
+                      A decode operation requires a pure bucket, that is a bucket to which M
+                      only mapped a single element, to succeed.  Thus, if there is no bucket with
+                      a counter of 1 or -1, decoding fails. However, as a counter of 1 or -1 is
+                      not a guarantee that the bucket is pure, there is also a chance that the
+                      decoder returns an IDSUM value that is actually the XOR of several IDSUMs.
+                      This is primarily detected by checking that the HASHSUM is the hash of the IDSUM.
+                      Only if the HASHSUM also matches, the bucket could be pure.  Additionally,
+                      one MUST check that the IDSUM value actually would be mapped by M to
+                      the respective bucket. If not, there was a hash collision and the bucket
+                      is also not pure.
                     </t>
                     <t>
-                        The very rare case that after all these checks a bucket is still
-                        falsely identified as pure MUST be detected (say by determining that
-                        extracted element IDs do not match any actual elements), and addressed
-                        at a higher level in the protocol. As these failures are probabilistic
-                        and depend on element IDs and the IBF construction, they can typically
-                        be avoided by retrying with different parameters, such as a different
-                        way to assign element IDs to elements, using a larger value for L, or
-                        a different mapping function M.
-                        A more common scenario (especially if L was too small) is that
-                        IBF decoding fails because there is no pure bucket. In this case, the
-                        higher-level protocol SHOULD also retry using different parameters.
+                      The very rare case that after all these checks a bucket is still
+                      falsely identified as pure MUST be detected (say by determining that
+                      extracted element IDs do not match any actual elements), and addressed
+                      at a higher level in the protocol. As these failures are probabilistic
+                      and depend on element IDs and the IBF construction, they can typically
+                      be avoided by retrying with different parameters, such as a different
+                      way to assign element IDs to elements (by varying the IBF-salt),
+                      using a larger value for L, or a different mapping function M.
+                      A more common scenario (especially if L was too small) is that
+                      IBF decoding fails because there is no pure bucket. In this case, the
+                      higher-level protocol generally MUST also retry using different
+                      parameters (except if an attack is detected).
                     </t>
                     <t>
-                        Suppose the IBF contains a pure bucket. In this case, the IDSUM in the
-                        bucket identifies a single element.  Furthermore, it is then possible
-                        to remove that element from the IBF (by inserting it if the counter
-                        was negative, and by removing it if the counter was positive). This
-                        is likely to cause other buckets to become pure, allowing further
-                        elements to be decoded.  Eventually, decoding ought to succeed with
-                        all counters and IDSUM and HASHSUM values reach zero. However, it is also
-                        possible that an IBF only partly decodes and then decoding fails after
-                        obtaining some elements.
+                      Suppose the IBF contains a pure bucket. In this case, the IDSUM in the
+                      bucket is the ID of an element.  Furthermore, it is then possible
+                      to remove that element from the IBF (by inserting it if the counter
+                      was negative, and by removing it if the counter was positive). This
+                      is likely to cause other buckets to become pure, allowing further
+                      elements to be decoded.  Eventually, decoding ought to finish with
+                      all counters and IDSUM and HASHSUM values reach zero. However, it is also
+                      possible that an IBF only partly decodes and then decoding fails due
+                      to the lack of pure buckets after extracting some element IDs.
                     </t>
                     <t>
-                        In the following example the successful decoding of an IBF containing
-                        the two elements previously added in our running example.
+                      In the following example the successful decoding of an IBF containing
+                      the two elements previously added in our running example.
                     </t>
                     <t>
-                        IBF with the two encoded elements:
+                       We begin with an IBF with two elements added:
                     </t>
                     <figure anchor="figure_ibf_decode_0">
                         <artwork name="" type="" align="left" alt=""><![CDATA[
@@ -670,8 +677,11 @@ hashSum |   0x0101    |   0x4343    |    0x0000   |   0x4242    |
                  ]]></artwork>
                     </figure>
                     <t>
-                        In the IBF are two pure buckets to decode (bit-1 and bit-4) we choose to start with decoding bucket 1,
-                        we decode the element with the hash 1010 and we see that there is a new pure bucket created (bit-2)
+                      In the IBF are two pure buckets to decode (buckets 0 and 3) we choose to start with
+                      decoding bucket 0. This yields the element with the hash ID 0x0304 and
+                      hash 1010. This element ID is mapped to buckets
+                      {0,1}.
+                      Subtracting this element results in bucket 1 becoming pure:
                     </t>
                     <figure anchor="figure_ibf_decode_1">
                         <artwork name="" type="" align="left" alt=""><![CDATA[
@@ -686,9 +696,10 @@ hashSum |    0x0000   |   0x4242    |    0x0000   |   0x4242    |
                  ]]></artwork>
                     </figure>
                     <t>
-                        In the IBF only pure buckets are left, we choose to continue decoding bucket 2 and decode element
-                        with the hash 0x4242. Now the IBF is empty (all buckets have count 0) that means the IBF has been successfully
-                        decoded.
+                        We can now decoding bucket 2 and extract the element
+                        with the ID 0x0102 and hash 0x4242. Now the IBF is
+                        empty. Extraction completes with the status that
+                        the IBF has been successfully decoded.
                     </t>
                     <figure anchor="figure_ibf_decode_2">
                         <artwork name="" type="" align="left" alt=""><![CDATA[
@@ -710,10 +721,12 @@ hashSum |    0x0000   |    0x0000   |    0x0000   |    0x0000   |
                         Given addition and removal as defined above, it is possible to define an operation on IBFs
                         that computes an IBF representing the set difference.  Suppose IBF1 represents set A, and
                         IBF2 represents set B.  Then this set difference operation will compute IBF3 which
-                        represents the set A - B --- without having to transfer the elements from set A or B.
+                        represents the set A - B. Note that this computation can be done only on the IBFs,
+                        and does not require access to the elements from set A or B.
 
                         To calculate the IBF representing this set difference, both IBFs MUST have the same
-                        length L, the same number of buckets per element k and use the same map M.  Given this,
+                        length L, the same number of buckets per element k and use the same map M.
+                        Naturally, all IDs must have been computed using the same IBF-salt.  Given this,
                         one can compute the IBF representing the set difference by taking the XOR of the IDSUM and HASHSUM values
                         of the respective buckets and subtracting the respective counters.  Care MUST be taken
                         to handle overflows and underflows by setting the counter to "infinity" as necessary.
@@ -729,7 +742,8 @@ hashSum |    0x0000   |    0x0000   |    0x0000   |    0x0000   |
                         To demonstrate the set difference operation we compare IBF-A with IBF-B and generate as described
                         IBF-AB
                     </t>
-                    <t>IBF-A containing elements with hashes 0x0101 and 0x4242:</t>
+                    <t>IBF-A contains the elements with ID 0x0304 and hash 0x0101 mapped to {0,1},
+                       and ID 0x0102 and hash 0x4242 mapped to {1,3}:</t>
                     <figure anchor="figure_ibf_setdiff_A">
                         <artwork name="" type="" align="left" alt=""><![CDATA[
             bucket-0     bucket-1       bucket-2      bucket-3
@@ -742,36 +756,38 @@ hashSum |    0x0101   |   0x4343    |    0x0000   |   0x4242    |
         +-------------+-------------+-------------+-------------+
                  ]]></artwork>
                     </figure>
-                    <t>IBF-B containing elements with hashes 0x4242 and 0x5050</t>
+                    <t>IBF-B also contains the element with ID 0x0102 and
+                     and another element with ID 0x1345 and hash 0x5050
+                     mapped to {1,2}.</t>
                     <figure anchor="figure_ibf_setdiff_B">
                         <artwork name="" type="" align="left" alt=""><![CDATA[
             bucket-0     bucket-1       bucket-2      bucket-3
         +-------------+-------------+-------------+-------------+
   count |      0      |      1      |      1      |      1      |
         +-------------+-------------+-------------+-------------+
-  idSum |    0x0000   |    0x0102   |    0x1345   |    0x0102    |
+  idSum |    0x0000   |    0x1447   |    0x1345   |    0x0102   |
         +-------------+-------------+-------------+-------------+
-hashSum |    0x0000   |    0x4242   |    0x5050   |    0x4242   |
+hashSum |    0x0000   |    0x9292   |    0x5050   |    0x4242   |
         +-------------+-------------+-------------+-------------+
                  ]]></artwork>
                     </figure>
-                <t>IBF-AB XOR value and subtract count:</t>
+                <t>IBF-A minus IBF-B is then:</t>
                 <figure anchor="figure_ibf_setdiff_AB">
                     <artwork name="" type="" align="left" alt=""><![CDATA[
             bucket-0     bucket-1       bucket-2      bucket-3
         +-------------+-------------+-------------+-------------+
-  count |      1      |      1      |      -1     |      0      |
+  count |      1      |      0      |      -1     |      0      |
         +-------------+-------------+-------------+-------------+
-  idSum |    0x0304   |    0x0304   |    0x1345   |    0x0000   |
+  idSum |    0x0304   |    0x1049   |    0x1345   |    0x0000   |
         +-------------+-------------+-------------+-------------+
-hashSum |    0x0101   |    0x0101   |    0x5050   |    0x0000   |
+hashSum |    0x0101   |    0x5151   |    0x5050   |    0x0000   |
         +-------------+-------------+-------------+-------------+
                  ]]></artwork>
                 </figure>
                 <t>After calculating and decoding the IBF-AB shows clear that in IBF-A the element with the hash 0x5050
-                    is missing (-1 in bit-3) while in IBF-B the element with the hash 0101 is missing
-                    (1 in bit-1 and bit-2). The element with hash 0x4242 is present in IBF-A and IBF-B and is
-                    removed by the set difference operation (bit-4).
+                    is missing (-1 in bucket 2) while in IBF-B the element with the hash 0101 is missing
+                    (1 in bucket 0). The element with hash 0x4242 is present in IBF-A and IBF-B and is
+                    removed by the set difference operation.  Bucket 2 is not empty.
                 </t>
             </section>
             </section>
@@ -779,12 +795,18 @@ hashSum |    0x0101   |    0x0101   |    0x5050   |    0x0000   |
             <section anchor="ibf_format" numbered="true" toc="default">
                 <name>Wire format</name>
                 <t>
-                    The counter size transmitted over the wire
-                    varies between 1 and 64 bit, depending on the
-                    maximum counter in the IBF. This variable counter
-                    should cover most areas of application.
+                    For the counter field, we use a variable-size encoding to ensure
+                    that even for very large sets the counter should never reach
+                    "infinity", while also ensuring that the encoding is compact for
+                    small sets.
+                    Hence, the counter size transmitted over the wire
+                    varies between 1 and 64 bits, depending on the
+                    maximum counter in the IBF. A range of 1 to 64 bits
+                    should cover most areas of application and can be
+                    efficiently implemented on most contemporary CPU
+                    architectures and programming languages.
                     The bit length for the transmitted IBF
-                    is defined in the header of the
+                    will be communicated in the header of the
                     <em><xref target="messages_ibf" format="title" /></em> message
                     in the "IMCS" field as unsigned 8-bit integer.
                     For implementation details see section <xref target="performance_counter_variable_size" format="title" />.
@@ -806,18 +828,14 @@ hashSum |    0x0101   |    0x0101   |    0x5050   |    0x0000   |
                     For the "HASHSUM", we always use a 32-bit
                     representation.  Here, it is most important to
                     avoid collisions, where different elements are
-                    mapped to the same hash.  However, we note that
-                    by design only a few elements (certainly less than
-                    127) should ever be mapped
-                    to the same bucket, a small number of bits
-                    should suffice.  Furthermore, our protocol is designed
-                    to handle occasional collisions, so while with
-                    32-bits there remains a chance of
-                    accidental collisions, at 32 bit the chance is
-                    generally believed to be sufficiently small
-                    for the protocol to handle those cases efficiently
-                    for a wide range of use-cases.  Smaller hash
-                    values would safe bandwidth, but also drastically
+                    mapped to the same hash, possibly resulting in
+                    a bucket being falsely classified as pure.
+                    While with 32 bits there remains a non-negligible chance of
+                    accidental collisions, our protocol is designed
+                    to handle occasional collisions. Hence, at 32 bit the chance is
+                    believed to be sufficiently small enough
+                    for the protocol to handle those cases efficiently.  Smaller hash
+                    values would safe bandwidth, but also substantially
                     increase the chance of collisions. 32 bits are
                     also again a reasonable size for many CPU
                     architectures.
@@ -833,7 +851,7 @@ hashSum |    0x0101   |    0x0101   |    0x5050   |    0x0000   |
                     a good value for L.
                 </t>
                 <t>
-                    Basically a Strata Estimator (SE) is a series of IBFs (with a rather small value of L)
+                    Basically a Strata Estimator (SE) is a series of IBFs (with a rather small value of L=79)
                     in which increasingly large subsets of the full set
                     of elements are added to each IBF.  For the n-th IBF, the function selecting the
                     subset of elements MUST sample to select (probabilistically) 1/(2^n) of all
@@ -843,18 +861,19 @@ hashSum |    0x0101   |    0x0101   |    0x5050   |    0x0000   |
                     IBF being statistically expected to contain 1/(2^n) elements.
                 </t>
                 <t>
-                    Given two SEs, the set size difference can be estimated by trying to decode all of the
-                    IBFs.  Given that L was set to a rather small value, IBFs containing large strata
-                    will likely fail to decode.  For those IBFs that failed to decode, one simply
+                    Given two SEs, the set size difference can be estimated by attempting to decode all of the
+                    IBFs.  Given that L is set to a fixed and rather small value, IBFs containing large strata
+                    will likely fail to decode.  For IBFs that failed to decode, one simply
                     extrapolates the number of elements by scaling the numbers obtained from the
                     other IBFs that did decode.  If none of the IBFs of the SE decoded (which given
-                    a reasonable choice of L should be highly unlikely), one can retry using a different
-                    mapping function M.
+                    a reasonable number of IBFs in the SE should be highly unlikely), one can theoretically
+                    retry using a different IBF-salt.
                 </t>
                 <t>
-                    In addition, when decoding the IBFs in the strata estimator, it is possible to determine
+                    When decoding the IBFs in the strata estimator, it is possible to determine
                     on which side which part of the difference is. For this purpose, the pure buckets with
-                    counter 1 and -1 must be distinguished and assigned to the respective side when decoding the IBFs.
+                    counter 1 and -1 must be distinguished and assigned to the respective side when decoding
+                    the IBFs.
                 </t>
         </section>

	lsd0003 LSD0003: Set Union
	Log \| Files \| Refs \| README