* Description
The testbed subsystem's barriers API facilitates coordination among the peers
run by the testbed and the experiment driver.  The concept is similar to the
barrier synchronisation mechanism found in parallel programming or
multi-threading paradigms - a peer waits at a barrier upon reaching it until the
barrier is reached by a predefined number of peers.  This predefined number of
peers required to cross a barrier is called the quorum.  We say a peer has
reached a barrier if the peer is waiting for the barrier to be crossed.
Similarly, a barrier is said to be reached once the required quorum of peers
has reached it.  A barrier which is reached is deemed crossed after all the
peers waiting on it are notified.
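
The three terms defined above (a peer reaching a barrier, a barrier being
reached, a barrier being crossed) can be illustrated with a small state model.
This is only a sketch: the struct and the model_* functions are hypothetical
names invented for illustration and are not part of the testbed API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a single barrier; not the testbed's data structure. */
struct barrier_model
{
  unsigned int quorum;   /* number of peers required to cross the barrier */
  unsigned int nreached; /* peers currently waiting at the barrier */
  bool crossed;          /* set once all waiting peers have been notified */
};

/* A peer reaches the barrier: it starts waiting for it to be crossed. */
static void
model_peer_reached (struct barrier_model *b)
{
  assert (! b->crossed);
  b->nreached++;
}

/* The barrier is reached once the quorum of peers is waiting on it. */
static bool
model_is_reached (const struct barrier_model *b)
{
  return b->nreached >= b->quorum;
}

/* Notify every waiting peer; only then is the barrier deemed crossed.
   Returns the number of peers notified. */
static unsigned int
model_notify_waiters (struct barrier_model *b)
{
  unsigned int notified;

  assert (model_is_reached (b));
  notified = b->nreached;
  b->crossed = true;
  return notified;
}
```

With a quorum of 3, the barrier is not reached after two calls to
model_peer_reached(), is reached after the third, and becomes crossed only
when model_notify_waiters() has run.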

The barriers API provides the following functions:
1) GNUNET_TESTBED_barrier_init(): function to initialise a barrier in the
   experiment
2) GNUNET_TESTBED_barrier_cancel(): function to cancel a barrier which has been
   initialised before
3) GNUNET_TESTBED_barrier_wait(): function to signal the barrier service that
   the caller has reached a barrier and is waiting for it to be crossed
4) GNUNET_TESTBED_barrier_wait_cancel(): function to stop waiting for a barrier
   to be crossed

Among the above functions, the first two, namely GNUNET_TESTBED_barrier_init()
and GNUNET_TESTBED_barrier_cancel(), are used by experiment drivers.  All
barriers should be initialised by the experiment driver by calling
GNUNET_TESTBED_barrier_init().  This function takes a name to identify the
barrier, the quorum required for the barrier to be crossed and a notification
callback for notifying the experiment driver when the barrier is crossed.  The
function GNUNET_TESTBED_barrier_cancel() cancels an initialised barrier and
frees the resources allocated for it.  This function can be called upon an
initialised barrier before it is crossed.
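
The driver-side lifecycle described above can be sketched as a stand-alone
mock.  The drv_* names, the callback type and the rule that a zero quorum is
rejected are assumptions made for this illustration; the real
GNUNET_TESTBED_barrier_init() and GNUNET_TESTBED_barrier_cancel() have their
own signatures, which are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback type: invoked when the barrier is crossed. */
typedef void (*drv_crossed_cb) (void *cls, const char *name);

/* Hypothetical driver-side barrier handle. */
struct drv_barrier
{
  char *name;
  unsigned int quorum;
  drv_crossed_cb cb;
  void *cb_cls;
  bool crossed;
};

/* Initialise a barrier from a name, a quorum and a notification callback.
   Rejecting a zero quorum is an assumption of this sketch. */
static struct drv_barrier *
drv_barrier_init (const char *name, unsigned int quorum,
                  drv_crossed_cb cb, void *cb_cls)
{
  struct drv_barrier *b;
  size_t len;

  if ((NULL == name) || (0 == quorum))
    return NULL;
  b = malloc (sizeof (*b));
  if (NULL == b)
    return NULL;
  len = strlen (name) + 1;
  b->name = malloc (len);
  if (NULL == b->name)
  {
    free (b);
    return NULL;
  }
  memcpy (b->name, name, len);
  b->quorum = quorum;
  b->cb = cb;
  b->cb_cls = cb_cls;
  b->crossed = false;
  return b;
}

/* Release the resources held by a barrier handle. */
static void
drv_barrier_free (struct drv_barrier *b)
{
  free (b->name);
  free (b);
}

/* Mark the barrier as crossed and notify the experiment driver. */
static void
drv_barrier_crossed (struct drv_barrier *b)
{
  assert (! b->crossed);
  b->crossed = true;
  if (NULL != b->cb)
    b->cb (b->cb_cls, b->name);
}

/* Cancel a barrier that has not been crossed yet and free it.
   Returns false (leaving the handle valid) if it was already crossed. */
static bool
drv_barrier_cancel (struct drv_barrier *b)
{
  if (b->crossed)
    return false;
  drv_barrier_free (b);
  return true;
}

/* Example callback: counts crossings in the int that cls points to. */
static void
drv_count_crossings (void *cls, const char *name)
{
  (void) name;
  (*(int *) cls)++;
}
```

In this mock, cancelling before the crossing frees the handle, while a
cancel attempt after the callback has fired is refused.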

The remaining two functions, GNUNET_TESTBED_barrier_wait() and
GNUNET_TESTBED_barrier_wait_cancel(), are used in the peers' processes.
GNUNET_TESTBED_barrier_wait() connects to the local barrier service running on
the same host the peer is running on and registers that the caller has reached
the barrier and is waiting for the barrier to be crossed.  Note that this
function can only be used by peers which are started by the testbed, as it
tries to access the local barrier service which is part of the testbed
controller service.  Calling GNUNET_TESTBED_barrier_wait() on an uninitialised
barrier results in failure.  GNUNET_TESTBED_barrier_wait_cancel() cancels the
notification registered by GNUNET_TESTBED_barrier_wait().


* Implementation
Since barriers involve coordination between the experiment driver and peers,
the barrier service in the testbed controller is split into two components.
The first component responds to the messages generated by the barrier API used
by the experiment driver (functions GNUNET_TESTBED_barrier_init() and
GNUNET_TESTBED_barrier_cancel()) and the second component to the messages
generated by the barrier API used by peers (functions
GNUNET_TESTBED_barrier_wait() and GNUNET_TESTBED_barrier_wait_cancel()).

Calling GNUNET_TESTBED_barrier_init() sends a BARRIER_INIT message to the master
controller.  The master controller then registers a barrier and calls
GNUNET_TESTBED_barrier_init() for each of its subcontrollers.  In this way
barrier initialisation is propagated through the controller hierarchy.  While
propagating initialisation, any errors at a subcontroller, such as a timeout
during further propagation, are reported up the hierarchy back to the
experiment driver.
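
The propagation of initialisation down the controller hierarchy, with errors
travelling back up, can be sketched as a recursive walk over a tree.  The
init_* names and the reachable flag (which stands in for a timeout) are
hypothetical; the sketch does not model message passing or the clean-up of
controllers that had already registered the barrier when a failure occurs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define INIT_MAX_SUBS 4

/* Hypothetical controller node in the hierarchy. */
struct init_controller
{
  struct init_controller *subs[INIT_MAX_SUBS];
  unsigned int nsubs;
  bool reachable;  /* false simulates a timeout when contacting this node */
  bool registered; /* true once the barrier is registered here */
};

/* Attach a subcontroller to its parent. */
static void
init_add_sub (struct init_controller *parent, struct init_controller *sub)
{
  assert (parent->nsubs < INIT_MAX_SUBS);
  parent->subs[parent->nsubs++] = sub;
}

/* Register the barrier at this controller, then at each subcontroller.
   Returns false as soon as any controller in the subtree fails, so that
   the error surfaces at the caller (ultimately the experiment driver). */
static bool
init_propagate (struct init_controller *c)
{
  unsigned int i;

  if (! c->reachable)
    return false;
  c->registered = true;
  for (i = 0; i < c->nsubs; i++)
    if (! init_propagate (c->subs[i]))
      return false;
  return true;
}
```

Calling init_propagate() on the master returns true only if every controller
below it registered the barrier.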

Similar to GNUNET_TESTBED_barrier_init(), GNUNET_TESTBED_barrier_cancel()
propagates a BARRIER_CANCEL message which causes controllers to remove an
initialised barrier.

The second component is implemented as a separate service in the binary
`gnunet-service-testbed' which already has the testbed controller service.
Although this deviates from the gnunet process architecture of having one
service per binary, it is needed in this case as this component needs access to
barrier data created by the first component.  This component responds to
BARRIER_WAIT messages from local peers when they call
GNUNET_TESTBED_barrier_wait().  Upon receiving a BARRIER_WAIT message, the
service checks whether the requested barrier has been initialised.  If it has
not, an error status is sent through a BARRIER_STATUS message to the local
peer and the connection from the peer is terminated.  If the barrier has been
initialised, the barrier's counter for reached peers is incremented and a
notification is registered to notify the peer when the barrier is reached.  The
connection from the peer is left open.
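
The handling of a BARRIER_WAIT message described above can be sketched as a
lookup followed by one of two outcomes.  All wait_* names are hypothetical and
the fixed-size table is a simplification; the sketch returns a result value
instead of actually sending a BARRIER_STATUS message or closing a connection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define WAIT_MAX_BARRIERS 8

/* Hypothetical outcome of handling one BARRIER_WAIT message. */
enum wait_status
{
  WAIT_STATUS_ERROR,     /* barrier unknown: error status goes to the peer */
  WAIT_STATUS_REGISTERED /* peer counted and notification registered */
};

struct wait_barrier
{
  const char *name;
  unsigned int nreached; /* counter of peers that reached this barrier */
};

struct wait_service
{
  struct wait_barrier barriers[WAIT_MAX_BARRIERS];
  unsigned int nbarriers;
};

struct wait_result
{
  enum wait_status status;
  bool connection_open; /* false: the peer's connection is terminated */
};

/* Record a barrier as initialised (done by the first component). */
static void
wait_service_add_barrier (struct wait_service *s, const char *name)
{
  assert (s->nbarriers < WAIT_MAX_BARRIERS);
  s->barriers[s->nbarriers].name = name;
  s->barriers[s->nbarriers].nreached = 0;
  s->nbarriers++;
}

/* Handle a BARRIER_WAIT for the barrier called name. */
static struct wait_result
wait_handle_barrier_wait (struct wait_service *s, const char *name)
{
  struct wait_result r = { WAIT_STATUS_ERROR, false };
  unsigned int i;

  for (i = 0; i < s->nbarriers; i++)
  {
    if (0 != strcmp (s->barriers[i].name, name))
      continue;
    s->barriers[i].nreached++;  /* one more peer has reached the barrier */
    r.status = WAIT_STATUS_REGISTERED;
    r.connection_open = true;   /* kept open for the later notification */
    return r;
  }
  return r; /* uninitialised barrier: error status, connection closed */
}
```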

When enough peers to attain the quorum have sent BARRIER_WAIT messages, the
controller sends a BARRIER_STATUS message to its parent informing it that the
barrier is reached.  If the controller has started further subcontrollers, it
delays this message until it receives a similar notification from each of those
subcontrollers.  Finally, the barriers API at the experiment driver receives the
BARRIER_STATUS when the barrier is reached at all the controllers.
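
The upward propagation can be sketched as follows: a controller reports to its
parent only when its own quorum is attained and every subcontroller has
reported.  The up_* names are hypothetical, and the per-controller quorum
field is an assumption of this sketch about how the quorum is checked at each
level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-controller view of one barrier. */
struct up_controller
{
  struct up_controller *parent; /* NULL for the master controller */
  unsigned int quorum;          /* BARRIER_WAIT messages needed locally */
  unsigned int nwaiting;        /* BARRIER_WAIT messages received so far */
  unsigned int nsubs;           /* subcontrollers started by this node */
  unsigned int nsubs_reported;  /* subcontrollers that sent BARRIER_STATUS */
  bool reported;                /* BARRIER_STATUS already sent upwards */
};

/* Attach a subcontroller to its parent. */
static void
up_link (struct up_controller *parent, struct up_controller *sub)
{
  sub->parent = parent;
  parent->nsubs++;
}

/* Send BARRIER_STATUS upwards if this controller is ready to do so.
   Returns true if the status travelled all the way to the driver. */
static bool
up_try_report (struct up_controller *c)
{
  if (c->reported)
    return false;
  if (c->nwaiting < c->quorum)
    return false;               /* local quorum not attained yet */
  if (c->nsubs_reported < c->nsubs)
    return false;               /* delay until all subcontrollers report */
  c->reported = true;
  if (NULL == c->parent)
    return true;                /* master: the driver receives the status */
  c->parent->nsubs_reported++;
  return up_try_report (c->parent);
}

/* A local peer sent BARRIER_WAIT to controller c. */
static bool
up_on_barrier_wait (struct up_controller *c)
{
  c->nwaiting++;
  return up_try_report (c);
}
```

With a master (quorum 1) and one subcontroller (quorum 2), the driver sees
the BARRIER_STATUS only after the second BARRIER_WAIT at the subcontroller,
even if the master's own peer arrived first.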

The barriers API at the experiment driver responds to the BARRIER_STATUS message
by echoing it back to the master controller and notifying the experiment
driver through the notification callback that a barrier has been crossed.
The echoed BARRIER_STATUS message is propagated by the master controller to the
controller hierarchy.  This propagation triggers the notifications registered by
peers at each of the controllers in the hierarchy.  Note the difference between
the downward propagation of the BARRIER_STATUS message and its upward
propagation -- the upward propagation is needed for ensuring that the barrier is
reached by all the controllers and the downward propagation signals to the
waiting peers that the barrier is crossed.
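
The downward phase can be sketched in the same style: the driver echoes the
status to the master, which forwards it through the hierarchy, and each
controller fires the notifications of its waiting peers.  The down_* names
are hypothetical and the driver's notification callback is reduced to a
boolean flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DOWN_MAX_SUBS 4

/* Hypothetical controller node holding peers that wait at a barrier. */
struct down_controller
{
  struct down_controller *subs[DOWN_MAX_SUBS];
  unsigned int nsubs;
  unsigned int nwaiting;  /* local peers with a registered notification */
  unsigned int nnotified; /* local peers told that the barrier is crossed */
};

/* Attach a subcontroller to its parent. */
static void
down_add_sub (struct down_controller *parent, struct down_controller *sub)
{
  assert (parent->nsubs < DOWN_MAX_SUBS);
  parent->subs[parent->nsubs++] = sub;
}

/* Propagate the echoed BARRIER_STATUS: notify local waiters, then forward
   to every subcontroller.  Returns the total number of peers notified. */
static unsigned int
down_propagate_status (struct down_controller *c)
{
  unsigned int total = c->nwaiting;
  unsigned int i;

  c->nnotified += c->nwaiting;
  c->nwaiting = 0;
  for (i = 0; i < c->nsubs; i++)
    total += down_propagate_status (c->subs[i]);
  return total;
}

/* Driver side: echo the status back to the master controller and notify
   the experiment driver (modelled here by setting a flag). */
static unsigned int
down_driver_on_status (struct down_controller *master, bool *driver_notified)
{
  unsigned int total = down_propagate_status (master);

  *driver_notified = true;
  return total;
}
```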