more on Bloom filters - lsd0004 - LSD0004: R5N Distributed Hash Table

commit d1a71233594235411e1a4da75caa71c2bb31f902
parent a8d0df2b972d5a6d8938286e3758aecf874580ac
Author: Christian Grothoff <grothoff@gnunet.org>
Date:   Fri, 11 Mar 2022 05:12:48 +0100

more on Bloom filters

Diffstat:
M draft-schanzen-r5n.xml  | 41 ++++++++++++++++++++++++++++++++---------

1 file changed, 32 insertions(+), 9 deletions(-)
diff --git a/draft-schanzen-r5n.xml b/draft-schanzen-r5n.xml
@@ -864,7 +864,10 @@ bchar = *(ALPHA / DIGIT)
         </t>
         <t>
           The peer Bloom filter is always a 128-bit field. The elements
-	  hashed into the Bloom filter are the 32 byte peer IDs.
+	  hashed into the Bloom filter are the 32 byte peer IDs.  We note
+	  that the peer Bloom filter may exclude peers due to false-postive 
+	  matches.  This is acceptable as routing should nevertheless
+	  terminate (with high probability) in close vicinity of the key.
         </t>
       </section>
       <section anchor="routing_functions">
@@ -970,16 +973,36 @@ bchar = *(ALPHA / DIGIT)
           Any peer which is processing GET messages and has a result
           which matches the query key <bcp14>MUST</bcp14> check the result Bloom filter
           and only send a reply message if the block key is not in the
-          Bloom filter set.
+          Bloom filter set. <!-- FIXME: is this strictly always the block key?
+	  I think this is block-type dependent! -->
         </t>
         <t>
-          The Bloom filter is a 128-bit field.
-          It is initially empty, consisting only of zeroes.
-          There are two functions which can be invoked on the Bloom filter:
-          BF-SET(bf, e) and BF-TEST(bf, e) where "e" is an element which is to
-          be added to the Bloom filter or queried against the set.
-          Any Bloom filter uses k=16 different hash functions each of which is
-          defined as follows:
+          FIXME: say something about the size of the result Bloom filter here!
+        </t>
+        <t>
+	  Bloom filters are probabilistic data structures. Thus, especially
+	  given the small size of the result Bloom filter, it is always possible
+	  that a desireable result is filtered by the Bloom filter because of 
+	  a false-positive match created by a collision in the hash values
+	  between the desireable result and filtered items. 
+        </t>
+        <t>
+	  To address this problem, R<sup>5</sup>N uses a mutator value
+          to additionally randomize the process
+          when hashing results into the result Bloom filter. The mutator
+	  value is set by the peer initiating the GET request, and changed
+	  every time the GET request is repeated by the initiator. Peers
+	  forwarding GET requests <bcp14>MUST</bcp14> not change the 
+	  mutator value included in the GET message as they could not 
+	  recalculate the Bloom filter with the new mutator value.
+        </t>
+	<t>
+	  By including the mutator value in the hashing process, repeated
+	  requests have statistically independent probabilities of creating
+	  collisions in the Bloom filter. Thus, even if for one request
+	  a Bloom filter collision may exclude a result as a false-positive
+	  match, subsequent requests are likely to not have the same 
+	  false-positives.
         </t>
       </section>
       <section anchor="p2p_xq" numbered="true" toc="default">

	lsd0004 LSD0004: R5N Distributed Hash Table
	Log \| Files \| Refs