lsd0003

LSD0003: Set Union
Log | Files | Refs | README

commit ff4b147870e751f9712cd26aad77a73c76acac11
parent 64eb8f92d160be6384333705d094328f3d0aa702
Author: Christian Grothoff <christian@grothoff.org>
Date:   Wed, 13 Jan 2021 23:20:12 +0100

fix BF/CBF section

Diffstat:
Mdraft-summermatter-set-union.xml | 96++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------
1 file changed, 58 insertions(+), 38 deletions(-)

diff --git a/draft-summermatter-set-union.xml b/draft-summermatter-set-union.xml @@ -177,35 +177,42 @@ <section anchor="background" numbered="true" toc="default"> <name>Background</name> <section anchor="bf" numbered="true" toc="default"> - <name>Bloom Filter</name> + <name>Bloom Filters</name> <t> - A Bloom Filter (BF) is a space-efficient datastructure to test if am element is part of a set of elements. - Since a Bloom filter is a probabilistic datastructure its possible to have false positives but false negatives - are not possible. + A Bloom filter (BF) is a space-efficient datastructure to test if am element is part of a set of elements. + Since a BF is a probabilistic datastructure, it is possible to have false-positives: when asked + if an element is in the set, the answer from a BF is either "no" or "maybe". </t> <t> - A BF consists of multiple buckets, every bucket can be set to 0 or 1. In the beginning all buckets are set - to 0. To add an element to the BF the corresponding buckets are set to 1. To map an element on the array of buckets - a mapping function M is required. The mapping function is non-injective an takes an element as input and outputs a - deterministic bit stream of the length of the BF. The mapping function is described by the following mathematical equation: + A BF consists of L buckets. Every bucket is a binary value that can be either 0 or 1. All buckets are initialized + to 0. A mapping function M is used to map each element of the set to a subset of k buckets. M is non-injective + and can thus map the same element multiple times to the same bucket. + The type of the mapping function can thus be described by the following mathematical notation: </t> <figure anchor="bf_mapping_function_math"> <artwork name="" type="" align="left" alt=""><![CDATA[ - -------------------------------- - # M: E->B^k (B=Z/l) - -------------------------------- - # l = Number of bits per element - # B = 0,1,2,3,4,...l - # k = Number of buckets - # E = Element from the set - # Z = Natural Numbers Mod l - -------------------------------- - Example: l=256, k=3 - M(E) = {4,6,255} + ------------------------------------ + # M: E->B^k + ------------------------------------ + # L = Number of buckets + # B = 0,1,2,3,4,...L-1 (the buckets) + # k = Number of buckets per element + # E = Set of elements + ------------------------------------ + Example: L=256, k=3 + M('element-data') = {4,6,255} ]]></artwork> </figure> <t> + A typical mapping function is constructed by hashing the element, for example + using the well-known HKDF construction (FIXME: cite HKDF RFC!). + </t> + <t> + To add an element to the BF, the corresponding buckets under the map M are set to 1. + To check if an element may be in the set, one tests if all buckets under the map M are set to 1. + </t> + <t> Further in this document a bitstream outputted by the mapping function is represented by a set of numeric values for example (0101) = (2,4). In the BF the buckets are set to 1 if the corresponding bit in the bitstream is 1. @@ -236,8 +243,13 @@ ]]></artwork> </figure> <t> - Its not possible to remove an element from the BF because buckets can only be set to 1 or 0 so its not possible to - differentiate between buckets containing one or multiple elements. To remove elements from the BF a <xref target="cbf" format="title" /> + The parameters L and k depend on the set size and must be + chosen carefully to ensure that the BF does not return too + many false-positives. + </t> + <t> + It is not possible to remove an element from the BF because buckets can only be set to 1 or 0. Hence it is impossible to + differentiate between buckets containing one or more elements. To remove elements from the BF a <xref target="cbf" format="title" /> is required. </t> </section> @@ -245,13 +257,13 @@ <section anchor="cbf" numbered="true" toc="default"> <name>Counting Bloom Filter</name> <t> - A Counting Bloom Filter (CBF) is an extension of the <xref target="bf" format="title" />. In the CBF the binary filed of the bucket is replaced by - an unsigned counter. This allows the removal of an elements from the CBF. + A Counting Bloom Filter (CBF) is an extension of the <xref target="bf" format="title" />. In the CBF, buckets are + unsigned numbers instead of binary values. This allows the removal of an elements from the CBF. </t> <t> - Adding an element to the CBF is similar to the adding operation of the BF but instead of setting the bucket on hit to 1 the bucket - is increased by 1. For example if two colliding elements M(element1) = (1,3) and - M(element2) = (0,3) are added to the CBF bucket 0 and 1 are set to 1 and bucket 3 (the colliding bucket) is set + Adding an element to the CBF is similar to the adding operation of the BF. However, instead of setting the bucket on hit to 1 the + numeric value stored in the bucket is increased by 1. For example if two colliding elements M(element1) = (1,3) and + M(element2) = (0,3) are added to the CBF, bucket 0 and 1 are set to 1 and bucket 3 (the colliding bucket) is set to 2: </t> <figure anchor="figure_cbf_insert_0"> @@ -263,11 +275,10 @@ ]]></artwork> </figure> <t> - The order of a bucket is defined by the counter, if a bucket contains two elements the counter is set to 2 so the order of - the bucket is two. + The counter stored in the bucket is also called the order of the bucket. </t> <t> - To remove an element form the CBF the counter is decreased by 1. + To remove an element form the CBF the counters of all buckets the element is mapped to are decreased by 1. </t> <t> Removing M(element2) = (1,3) from the CBF above: @@ -280,19 +291,28 @@ +-------------+-------------+-------------+-------------+ ]]></artwork> </figure> + <t> + In practice, the number of bits available for the counters is usually finite. For example, given a 4-bit + counter, a CBF bucket would overflow once 16 elements are mapped to the same bucket. To efficiently + handle this case, the maximum value (15 in our example) is considered to represent "infinity". Once the + order of a bucket reaches "infinity", it is no longer incremented or decremented. + </t> + <t> + The parameters L and k and the number of bits allocated to the counters should depend on the set size. + An IBF will degenerate when subjected to insert and remove iterations of different elements, and eventually all + buckets will reach "infinity". The speed of the degradation will depend on the choice of L and k in + relation to the number of elements stored in the IBF. + </t> </section> </section> <section anchor="ibv" numbered="true" toc="default"> <name>Invertible Bloom Filter</name> <t> - A Invertible Bloom Filter (IBF) is a further extension of the <xref target="cbf" format="title" />. The IBF extends the <xref target="cbf" format="title" /> with two more operations, - beside insert and remove the IBF supports a decode and set difference operation. This two extra operations allows the - IBF to calculate small differences in big sets very efficiently. - - IBF are useful when there is the need to check a set of elements for the presence or absence of some - elements in a big set efficiently. - + An Invertible Bloom Filter (IBF) is a further extension of the <xref target="cbf" format="title" />. + An IBF extends the <xref target="cbf" format="title" /> with two more operations: + decode and set difference. This two extra operations are useful to efficiently extract + small differences between large sets. </t> <section anchor="ibf_format" numbered="true" toc="default"> <name>Format</name> @@ -556,7 +576,7 @@ <section anchor="modeofoperation" numbered="true" toc="default"> <name>Mode of operation</name> <t> - The set union protocol uses the above discussed topics Invertible Bloom Filter and Strata Estimators. + The set union protocol uses the above discussed topics IBF and Strata Estimators. Depending on the state of the two sets there are different strategies or operation modes how to efficiently determinate missing elements in two sets of two peers. </t> @@ -848,7 +868,7 @@ <section anchor="messages_ibf_description" numbered="true" toc="default"> <name>Description</name> <t> - The Invertible Bloom Filter message contains a slices of the IBF. + The IBF message contains a slice of the IBF. </t> <t> The <em>IBF</em> message is sent at the start of the protocol from the initiating peer in the transaction