Try to disentangle Bloom filters - lsd0004 - LSD0004: R5N Distributed Hash Table

commit 3f70faf531ebd101004f5b93e1c80ee1ec5bcdd1
parent 564e5fa775b0c0fdf947e857727412dfca4f5ac8
Author: Martin Schanzenbach <schanzen@gnunet.org>
Date:   Fri, 23 Dec 2022 20:36:03 +0900

Try to disentangle Bloom filters

Diffstat:
M draft-schanzen-r5n.xml  | 194 +++++++++++++++++++++++++++++++++++++++++++++++--------------------------------

1 file changed, 115 insertions(+), 79 deletions(-)
diff --git a/draft-schanzen-r5n.xml b/draft-schanzen-r5n.xml
@@ -349,7 +349,7 @@ Connectivity | |Underlay|  |Underlay|
       </figure>
     </section>
     <section anchor="overlay" numbered="true" toc="default">
-      <name>Overlay operations</name>
+      <name>Overlay Operations</name>
       <t>
         An implementation of this specification commonly exposes the two overlay
         operations "GET" and "PUT".
@@ -358,9 +358,9 @@ Connectivity | |Underlay|  |Underlay|
         picture of the protocol.
       </t>
       <section>
-        <name>The GET procedure</name>
+        <name>GET operation</name>
         <t>
-          A basic GET procedure may be exposed as:
+          A basic GET operation interface may be exposed as:
         </t>
         <t>
           <tt>GET(Query-Key, Block-Type) -> Results as List</tt>
@@ -462,9 +462,9 @@ Connectivity | |Underlay|  |Underlay|
         </dl>
       </section>
       <section>
-        <name>The PUT procedure</name>
+        <name>PUT operation</name>
         <t>
-          A PUT procedure may be exposed as:
+          A PUT operation interface may be exposed as:
         </t>
         <t>
           <tt>PUT(Key, Block-Type, Block-Expiration, Block-Data)</tt>
@@ -773,47 +773,6 @@ bchar = *(ALPHA / DIGIT)
         </t>
       </section>
     </section>
-    <section anchor="bloom_filters" numbered="true" toc="default">
-      <name>Bloom Filters</name>
-      <t>
-	R<sup>5</sup>N uses Bloom filters in several places.  This section
-	gives some general background on Bloom filters and defines functions
-	on this data structure shared by the various use-cases in R<sup>5</sup>N.
-      </t>
-      <t>
-        A Bloom filter (BF) is a space-efficient probabilistic datastructure
-        to test if an element is part of a set of elements.
-        Elements are identified by an element ID.
-        Since a BF is a probabilistic datastructure, it is possible to have
-        false-positives: when asked if an element is in the set, the answer from
-        a BF is either "no" or "maybe".
-      </t>
-      <t>
-        Bloom filters are defined as a string of <tt>L</tt> bits always
-        initially empty, consisting only of zeroes.
-        There are two functions which can be invoked on the Bloom filter:
-        BF-SET(bf, e) and BF-TEST(bf, e) where "e" is an element which is to
-        be added to the Bloom filter or queried against the set.
-        The mapping function <tt>M</tt> is defined as follows:
-      </t>
-      <t>
-        The element <tt>e</tt> is hashed using SHA-512.
-        The resulting byte string is interpreted as a string of 16
-        32-bit integers in network byte order.
-      </t>
-      <t>
-        When adding an element to the Bloom filter <tt>bf</tt> using
-        <tt>BF-SET(bf,e)</tt>, each integer <tt>n</tt> of the mapping
-        <tt>M(e)</tt> is interpreted as a bit offset <tt>n mod L</tt> within
-        <tt>bf</tt> and set to 1.
-      </t>
-      <t>
-        When testing if an element may be in the Bloom filter <tt>bf</tt> using
-        <tt>BF-TEST(bf,e)</tt>, each bit offset <tt>n mod L</tt> within
-        <tt>bf</tt> <bcp14>MUST</bcp14> have been set to 1.
-        Otherwise, the element is not considered to be in the Bloom filter.
-      </t>
-    </section>
     <section anchor="routing" numbered="true" toc="default">
       <name>Routing</name>
       <t>
@@ -920,9 +879,19 @@ bchar = *(ALPHA / DIGIT)
           traversed peers when searching for the next hops in the routing table.
         </t>
         <t>
-          The peer Bloom filter is always a 128 byte field. The elements
-	  hashed into the Bloom filter are the 32 byte peer IDs.  We note
-	  that the peer Bloom filter may exclude peers due to false-postive
+          The peer Bloom filter (see <xref target="bloom_filters"/>) consists of
+          L=1024 buckets.
+          The mapping function <tt>M</tt> is defined as follows:
+        </t>
+        <t>
+          The element <tt>e</tt> is a 32 byte peer ID and is hashed using
+          SHA-512.
+          The resulting byte string is interpreted as a string of k=16
+          32-bit integers in network byte order which are used to set the bucket bits
+          accordingly.
+        </t>
+        <t>
+	  We note that the peer Bloom filter may exclude peers due to false-postive
 	  matches.  This is acceptable as routing should nevertheless
 	  terminate (with high probability) in close vicinity of the key.
         </t>
@@ -2395,10 +2364,33 @@ gnunet+tcp://12.3.4.5/ \
           <dd>
 	  <t>
 	    The RESULT_FILTER for HELLO blocks is implemented using a
-      Bloom filter.
-    </t>
-     <figure anchor="hello_rf" title="The HELLO Block Result Filter.">
-        <artwork name="" type="" align="left" alt=""><![CDATA[
+            Bloom filter (see <xref target="bloom_filters"/>).
+            The size <tt>S = L/8</tt> of the Bloom filter in bytes depends on
+	    the number of elements <tt>F</tt> known to be filtered at the
+	    initiator. If <tt>F</tt> is zero, the size <tt>S</tt> is just 8 (bytes).
+	    Otherwise, <tt>S</tt> is set to the minimum of
+	    2<sup>15</sup> and the lowest power
+	    of 2 that is strictly larger than <tt>K*F/4</tt>
+	    (in bytes).
+          </t>
+          <t>
+            The <tt>k</tt>-value for the HELLO_BF
+            Bloom filter is always 16.
+            <tt>k</tt> is never transmitted.
+            The elements used in the Bloom filter
+            consist of an XOR between the <tt>H_ADDRS</tt> field (as computed using
+            SHA-512 over the <tt>ADDRESSES</tt>) and the SHA-512
+            hash of the <tt>MUTATOR</tt> field from a given HELLO block.
+            The mapping function M(<tt>H_ADDRS XOR MUTATOR</tt>) is an identity
+            function and returns the 512-bit XOR result unmodified.
+            This resulting byte string is interpreted as k=16 32-bit
+            integers in network byte order and used to set or check the bucket bits
+            accordingly.
+            This function returns an empty Bloom filter of length <tt>S= L/8</tt> as
+            part of RF:
+          </t>
+          <figure anchor="hello_rf" title="The HELLO Block Result Filter.">
+            <artwork name="" type="" align="left" alt=""><![CDATA[
 0     8     16    24    32    40    48    56
 +-----+-----+-----+-----+-----+-----+-----+-----+
 |        MUTATOR        |  HELLO_BF             /
@@ -2407,7 +2399,6 @@ gnunet+tcp://12.3.4.5/ \
 +-----+-----+-----+-----+-----+-----+-----+-----+
 ]]></artwork>
           </figure>
-
           <t>where:</t>
           <dl>
             <dt>MUTATOR</dt>
@@ -2416,24 +2407,16 @@ gnunet+tcp://12.3.4.5/ \
             </dd>
             <dt>HELLO_BF</dt>
             <dd>
-              The <tt>K</tt>-value for the HELLO_BF Bloom filter
-	            is always 16.  The size <tt>S</tt> of the Bloom filter in bytes depends on
-	            the number of elements <tt>F</tt> known to be filtered at the
-	            initiator. If <tt>F</tt> is zero, the size <tt>S</tt> is just 8 (bytes).
-	            Otherwise, <tt>S</tt> is set to the minimum of
-	            2<sup>15</sup> and the lowest power
-	            of 2 that is strictly larger than <tt>K*F/4</tt>
-	            (in bytes).  The wire format of HELLO_BF is the resulting byte
-              array. In particular, <tt>K</tt> is never transmitted.
+              The Bloom filter byte array.
             </dd>
           </dl>
-<t>
-	     R<sup>5</sup>N HELLO blocks use a <tt>MUTATOR</tt> value
-       to additionally "randomize" the computation of the bloom filter while remaining
-	    deterministic across peers.  The 32-bit <tt>MUTATOR</tt>
-	    value is set by the peer initiating the GET request, and changed
-	    every time the GET request is repeated by the initiator. Peers
-	    forwarding GET requests <bcp14>MUST</bcp14> not change the
+          <t>
+	    The <tt>MUTATOR</tt> value is used
+            to additionally "randomize" the computation of the Bloom filter while
+            remaining deterministic across peers.
+            It is only ever set by the peer initiating the GET
+            request, and changed every time the GET request is repeated.
+            Peers forwarding GET requests <bcp14>MUST</bcp14> not change the
 	    mutator value included in the <tt>RESULT_FILTER</tt> as they might not
 	    be able to recalculate the result filter with a different <tt>MUTATOR</tt>
 	    value.
@@ -2457,12 +2440,9 @@ gnunet+tcp://12.3.4.5/ \
           </dd>
             <dt>FilterResult(Block, Key, RF, XQuery) -&gt; (FilterEvaluationResult, RF')</dt>
             <dd>
-              To filter results of HELLO blocks using the Bloom filter, the
-             <tt>H_ADDRS</tt> field (as computed using SHA-512 over
-             the <tt>ADDRESSES</tt>) and XORed with the SHA-512
-             hash of the <tt>MUTATOR</tt> (in network byte order).
-             The resulting value is then used when hashing into the
-             Bloom filter as described in <xref target="bloom_filters" />.
+             The <tt>H_ADDRS</tt> field is XORed with the SHA-512
+             hash of the <tt>MUTATOR</tt> field from the HELLO block and the resulting
+             value is checked against the Bloom filter in RF.
              Consequently, HELLOs with completely identical sets of
              addresses will be filtered and FILTER_DUPLICATE is returned.
              Any small variation in the set of addresses will cause the block
@@ -2833,8 +2813,64 @@ Type    | Name            | References | Description
         </front>
       </reference>
     </references>
-    <!-- Change Log
-      v00 2017-07-23  MS   Initial version
-    -->
+    <section anchor="bloom_filters" numbered="true" toc="default">
+      <name>Bloom filters in R<sup>5</sup>N</name>
+      <t>
+	R<sup>5</sup>N uses Bloom filters in several places.  This section
+	gives some general background on Bloom filters and defines functions
+	on this data structure shared by the various use-cases in R<sup>5</sup>N.
+      </t>
+      <t>
+        A Bloom filter (BF) is a space-efficient probabilistic datastructure
+        to test if an element is part of a set of elements.
+        Elements are identified by an element ID.
+        Since a BF is a probabilistic datastructure, it is possible to have
+        false-positives: when asked if an element is in the set, the answer from
+        a BF is either "no" or "maybe".
+      </t>
+      <t>
+        Bloom filters are defined as a string of <tt>L</tt> bits called "buckets".
+        The buckets are initially always empty, meaning that the bits are set to
+        zero.
+        There are two functions which can be invoked on the Bloom filter "bf":
+        BF-SET(bf, e) and BF-TEST(bf, e) where "e" is an element that is to
+        be added to the Bloom filter or queried against the set.
+      </t>
+      <t>
+        A mapping function M is used to map each ID of each element from the set to a
+        subset of k buckets.
+        In the original proposal by Bloom, M is non-injective and can thus map the same
+        element multiple times to the same bucket.
+        The type of the mapping function can thus be described by the following
+        mathematical notation:
+      </t>
+      <figure anchor="figure_bf_func" title="Bloom filter mapping function.">
+        <artwork><![CDATA[
+        ------------------------------------
+        # M: E->B^k
+        ------------------------------------
+        # L = Number of buckets
+        # B = 0,1,2,3,4,...L-1 (the buckets)
+        # k = Number of buckets per element
+        # E = Set of elements
+        ------------------------------------
+        Example: L=256, k=3
+        M('element-data') = {4,6,255}
+]]>
+        </artwork>
+      </figure>
+      <t>
+        When adding an element to the Bloom filter <tt>bf</tt> using
+        <tt>BF-SET(bf,e)</tt>, each integer <tt>n</tt> of the mapping
+        <tt>M(e)</tt> is interpreted as a bit offset <tt>n mod L</tt> within
+        <tt>bf</tt> and set to 1.
+      </t>
+      <t>
+        When testing if an element may be in the Bloom filter <tt>bf</tt> using
+        <tt>BF-TEST(bf,e)</tt>, each bit offset <tt>n mod L</tt> within
+        <tt>bf</tt> <bcp14>MUST</bcp14> have been set to 1.
+        Otherwise, the element is not considered to be in the Bloom filter.
+      </t>
+    </section>
   </back>
 </rfc>

	lsd0004 LSD0004: R5N Distributed Hash Table
	Log \| Files \| Refs