Diffstat (limited to 'RATIONALE')
-rw-r--r-- RATIONALE 246
1 files changed, 246 insertions, 0 deletions
diff --git a/RATIONALE b/RATIONALE
new file mode 100644
index 000000000..e68dcb883
--- /dev/null
+++ b/RATIONALE
@@ -0,0 +1,246 @@
This document is a summary of why we're moving to GNUnet NG and what
this major redesign tries to address.

First of all, the redesign does not (intentionally) change anything
fundamental about the application-level protocols or how files are
encoded and shared. However, it is not protocol-compatible due to
other changes that do not relate to the essence of the application
protocols.


The redesign tries to address the following major problem groups
describing issues that apply more or less to all GNUnet versions
prior to 0.9.x:


PROBLEM GROUP 1 (scalability):
* The code was modular, but bugs were not. Memory corruption
  in one plugin could cause crashes in others and it was not
  always easy to identify the culprit. This approach
  fundamentally does not scale (in the sense of GNUnet being
  a framework and a GNUnet server running hundreds of
  different application protocols -- and the result still
  being debuggable, secure and stable).
* The code was heavily multi-threaded, resulting in complex
  locking operations. GNUnet 0.8.x had over 70 different
  mutexes and almost 1000 lines of lock/unlock operations.
  It is challenging for even good programmers to program or
  maintain good multi-threaded code with this complexity.
  The excessive locking essentially prevents GNUnet from
  actually doing much in parallel on multicores.
* Despite efforts like Freeway, it was virtually
  impossible to contribute code to GNUnet that was not
  written in C/C++.
* Changes to the configuration almost always required restarts
  of gnunetd; the existence of change-notifications does not
  really change that (how many users are even aware of SIGHUP,
  and how few options worked with that -- and at what expense
  in code complexity!).
* Valgrinding could only be done for the entire gnunetd
  process. Given that gnunetd does quite a bit of
  CPU-intensive crypto, this could not be done for a system
  under heavy (or even moderate) load.
* Stack overflows with threads, while rare under Linux these
  days, result in really nasty and hard-to-find crashes.
* Structs of function pointers in service APIs were
  needlessly adding complexity, especially since in
  most cases there was no polymorphism.

SOLUTION:
* Use multiple, loosely-coupled processes and one big select
  loop in each (supported by a powerful library to eliminate
  code duplication for each process).
* Eliminate all threads, manage the processes with a
  master-process (gnunet-arm, for automatic restart manager)
  which also ensures that configuration changes trigger the
  necessary restarts.
* Use continuations (with timeouts) as a way to unify
  cron-jobs and other event-based code (such as waiting
  on network IO).
  => Using multiple processes ensures that memory corruption
     stays localized.
  => Using multiple processes will make it easy to contribute
     services written in other language(s).
  => Individual services can now be subjected to valgrind.
  => Process priorities can be used to schedule the CPU better.
     Note that we cannot just use one process with a big
     select loop because we have blocking operations (and the
     blocking is outside of our control, thanks MySQL,
     sqlite, gethostbyaddr, etc.). So in order to perform
     reasonably well, we need some construct for parallel
     execution.

     RULE: If your service contains blocking functions, it
           MUST be a process by itself.
* Eliminate structs with function pointers for service APIs;
  instead, provide a library (still ending in _service.h) API
  that transmits the requests nicely to the respective
  process (easier to use, no need to "request" service
  in the first place; API can cause process to be started/stopped
  via ARM if necessary).
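
The idea of unifying cron-jobs and event-based code through
continuations with timeouts can be illustrated with a minimal sketch.
This is not GNUnet's scheduler: all names below (struct scheduler,
sched_add_delayed, sched_next_timeout, sched_run_due, count_call) are
hypothetical, and time is passed in explicitly so the logic can be
followed without real IO. A real loop would pass the value returned by
sched_next_timeout() to select() as its timeout and call
sched_run_due() once select() returns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS 16

/* A continuation: a function plus the closure it is called with. */
typedef void (*continuation_fn) (void *cls);

struct task
{
  uint64_t deadline_ms; /* absolute time at which the task is due */
  continuation_fn fn;   /* NULL marks a free slot */
  void *cls;
};

struct scheduler
{
  struct task tasks[MAX_TASKS];
};

/* Schedule fn(cls) to run delay_ms after now_ms.
   Returns 0 on success, -1 if the table is full. */
int
sched_add_delayed (struct scheduler *s, uint64_t now_ms, uint64_t delay_ms,
                   continuation_fn fn, void *cls)
{
  assert (NULL != fn);
  for (size_t i = 0; i < MAX_TASKS; i++)
  {
    if (NULL != s->tasks[i].fn)
      continue; /* slot in use */
    /* Saturate instead of wrapping around on overflow. */
    s->tasks[i].deadline_ms =
      (delay_ms > UINT64_MAX - now_ms) ? UINT64_MAX : now_ms + delay_ms;
    s->tasks[i].fn = fn;
    s->tasks[i].cls = cls;
    return 0;
  }
  return -1;
}

/* Milliseconds until the earliest task is due (0 if one is overdue),
   or UINT64_MAX if nothing is scheduled. */
uint64_t
sched_next_timeout (const struct scheduler *s, uint64_t now_ms)
{
  uint64_t best = UINT64_MAX;
  for (size_t i = 0; i < MAX_TASKS; i++)
  {
    const struct task *t = &s->tasks[i];
    if (NULL == t->fn)
      continue;
    uint64_t wait = (t->deadline_ms > now_ms) ? t->deadline_ms - now_ms : 0;
    if (wait < best)
      best = wait;
  }
  return best;
}

/* Run every task whose deadline has passed; returns how many ran. */
unsigned int
sched_run_due (struct scheduler *s, uint64_t now_ms)
{
  unsigned int ran = 0;
  for (size_t i = 0; i < MAX_TASKS; i++)
  {
    struct task t = s->tasks[i];
    if ((NULL == t.fn) || (t.deadline_ms > now_ms))
      continue;
    s->tasks[i].fn = NULL; /* free the slot so the task may reschedule */
    t.fn (t.cls);
    ran++;
  }
  return ran;
}

/* Example continuation: increments the int that cls points to. */
void
count_call (void *cls)
{
  (*(int *) cls)++;
}
```

A periodic job in this model is simply a continuation that schedules
itself again before returning, so no separate cron mechanism is needed.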


PROBLEM GROUP 2 (UTIL-APIs causing bugs):
* The existing logging functions were awkward to use and
  their expressive power was never really used for much.
* While we had some rules for naming functions, there
  were still plenty of inconsistencies.
* Specification of default values in configuration could
  result in inconsistencies between defaults in
  config.scm and defaults used by the program; also,
  different defaults might have been specified for the
  same option in different parts of the program.
* The TIME API did not distinguish between absolute
  and relative time, requiring users to know which
  type of value some variable contained and to
  manually convert properly. Combined with the
  possibility of integer overflows this is a major
  source of bugs.
* The TIME API for seconds has a theoretical problem
  with a 32-bit overflow on some platforms which is
  only partially fixed by the old code with some
  hackery.

SOLUTION:
* Logging was radically simplified.
* Functions are now more consistently named.
* Configuration has no more defaults; instead,
  we load a global default configuration file
  before the user-specific configuration (which
  can be used to override defaults); the global
  default configuration file will be generated
  from config.scm.
* Time now distinguishes between
  struct GNUNET_TIME_Absolute and
  struct GNUNET_TIME_Relative. We use structs
  so that the compiler won't coerce for us
  (forcing the use of specific conversion
  functions which have checks for overflows, etc.).
  Naturally the need to use these functions makes
  the code a bit more verbose, but that's a good
  thing given the potential for bugs.
* There is no more TIME API function to do anything
  with 32-bit seconds.
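
Why wrapping time values in structs helps can be shown with a small
sketch. The types and functions below (struct time_absolute,
struct time_relative, time_absolute_add and so on) are hypothetical
stand-ins chosen for illustration; they are not the definitions of
struct GNUNET_TIME_Absolute or struct GNUNET_TIME_Relative, and the
millisecond unit is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Distinct struct types: the compiler rejects mixing them up. */
struct time_absolute { uint64_t abs_ms; }; /* a point in time */
struct time_relative { uint64_t rel_ms; }; /* a duration */

/* Largest value, used here to mean "never" / "forever". */
#define TIME_FOREVER_MS UINT64_MAX

/* start + delay, saturating at "forever" instead of wrapping. */
struct time_absolute
time_absolute_add (struct time_absolute start, struct time_relative delay)
{
  struct time_absolute ret;
  if (start.abs_ms > TIME_FOREVER_MS - delay.rel_ms)
    ret.abs_ms = TIME_FOREVER_MS;
  else
    ret.abs_ms = start.abs_ms + delay.rel_ms;
  return ret;
}

/* end - start, clamped to zero if end is not after start. */
struct time_relative
time_absolute_get_difference (struct time_absolute start,
                              struct time_absolute end)
{
  struct time_relative ret;
  if (TIME_FOREVER_MS == end.abs_ms)
    ret.rel_ms = TIME_FOREVER_MS;
  else if (end.abs_ms <= start.abs_ms)
    ret.rel_ms = 0;
  else
    ret.rel_ms = end.abs_ms - start.abs_ms;
  return ret;
}

/* Convert seconds to a duration, saturating on overflow. */
struct time_relative
time_relative_from_seconds (uint64_t seconds)
{
  struct time_relative ret;
  if (seconds > TIME_FOREVER_MS / 1000)
    ret.rel_ms = TIME_FOREVER_MS;
  else
    ret.rel_ms = seconds * 1000;
  return ret;
}

/* Usage example. */
void
time_example (void)
{
  struct time_absolute now = { 1000 };
  struct time_relative timeout = time_relative_from_seconds (5);
  struct time_absolute deadline = time_absolute_add (now, timeout);
  assert (6000 == deadline.abs_ms);

  /* time_absolute_add (timeout, now) would not compile:
     the argument types differ. */
  struct time_relative huge = { TIME_FOREVER_MS - 1 };
  assert (TIME_FOREVER_MS == time_absolute_add (now, huge).abs_ms);
}
```

With plain integers, swapping the two arguments or adding two absolute
times would compile silently; with structs both mistakes are type
errors, and the overflow checks live in one place.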


PROBLEM GROUP 3 (statistics):
* Databases and others needed to store capacity values
  similar to what stats was already doing, but
  across process lifetimes ("state"-API was a partial
  solution for that, but using it was clunky).
* Only gnunetd could use statistics, but other
  processes in the GNUnet system might have had
  good uses for it as well.

SOLUTION:
* New statistics library and service that offer
  an API to inspect and modify statistics.
* Statistics are distinguished by service name
  in addition to the name of the value.
* Statistics can be marked as persistent, in
  which case they are written to disk when
  the statistics service shuts down.
  => One solution for existing stats uses,
     application stats, database stats and
     versioning information!
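
The data model described above (values keyed by service name plus
value name, with an optional persistent flag) can be sketched as
follows. Everything here is hypothetical: struct stat_table, stat_set,
stat_get, stat_format_persistent and the "service/name=value" text
format are illustrative assumptions, not the statistics service's API
or on-disk format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_STATS 32
#define MAX_NAME 64

struct stat_entry
{
  char service[MAX_NAME]; /* which service owns the value */
  char name[MAX_NAME];    /* name of the value within that service */
  uint64_t value;
  int persistent;         /* non-zero: keep across restarts */
};

struct stat_table
{
  struct stat_entry entries[MAX_STATS];
  size_t count;
};

/* Index of the entry for (service, name), or -1 if there is none. */
static int
stat_find (const struct stat_table *t, const char *service, const char *name)
{
  for (size_t i = 0; i < t->count; i++)
    if ((0 == strcmp (t->entries[i].service, service)) &&
        (0 == strcmp (t->entries[i].name, name)))
      return (int) i;
  return -1;
}

/* Set a value, creating the entry if needed. Returns 0 or -1. */
int
stat_set (struct stat_table *t, const char *service, const char *name,
          uint64_t value, int persistent)
{
  assert (NULL != t);
  int i = stat_find (t, service, name);
  if (i < 0)
  {
    if ((t->count >= MAX_STATS) || (strlen (service) >= MAX_NAME) ||
        (strlen (name) >= MAX_NAME))
      return -1;
    i = (int) t->count++;
    strcpy (t->entries[i].service, service);
    strcpy (t->entries[i].name, name);
  }
  t->entries[i].value = value;
  t->entries[i].persistent = persistent;
  return 0;
}

/* Current value, or 0 if the entry does not exist. */
uint64_t
stat_get (const struct stat_table *t, const char *service, const char *name)
{
  int i = stat_find (t, service, name);
  return (i < 0) ? 0 : t->entries[i].value;
}

/* Write all persistent entries into buf, one per line; returns the
   number of entries written (stops early if buf is too small). */
size_t
stat_format_persistent (const struct stat_table *t, char *buf, size_t size)
{
  size_t used = 0;
  size_t written = 0;
  if (0 == size)
    return 0;
  buf[0] = '\0';
  for (size_t i = 0; i < t->count; i++)
  {
    const struct stat_entry *e = &t->entries[i];
    if (! e->persistent)
      continue;
    int n = snprintf (buf + used, size - used, "%s/%s=%llu\n", e->service,
                      e->name, (unsigned long long) e->value);
    if ((n < 0) || ((size_t) n >= size - used))
    {
      buf[used] = '\0'; /* discard the truncated line */
      break;
    }
    used += (size_t) n;
    written++;
  }
  return written;
}
```

Because the service name is part of the key, two services can use the
same value name without clashing, and only the entries marked
persistent need to be saved at shutdown.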


PROBLEM GROUP 4 (Testing):
* The existing structure of the code with modules
  stored in places far away from the test code
  resulted in tools like lcov not giving good results.
* The codebase had evolved into a complex, deeply
  nested hierarchy often with directories that
  then only contained a single file. Some of these
  files had the same name making it hard to find
  the source corresponding to a crash based on
  the reported filename/line information.
* Non-trivial portions of the code lacked good testcases,
  and it was not always obvious which parts of the code
  were not well-tested.

SOLUTION:
* Code that should be tested together is now
  in the same directory.
* The hierarchy is now essentially flat, each
  major service having one directory under src/;
  naming conventions help to make sure that
  files have globally-unique names.
* All code added to the new repository must
  come with testcases with reasonable coverage.


PROBLEM GROUP 5 (core/transports):
* The new DV service requires session key exchange
  between DV-neighbours, but the existing
  session key code cannot be used to achieve this.
* The core requires certain services
  (such as identity, pingpong, fragmentation,
  transport, traffic, session) which makes it
  meaningless to have these as modules
  (especially since there is really only one
  way to implement these).
* HELLOs are larger than necessary since we need
  one for each transport (and hence often have
  to pick a subset of our HELLOs to transmit).
* Fragmentation is done at the core level but only
  required for a few transports; future versions of
  these transports might want to be aware of fragments
  and do things like retransmission.
* Autoconfiguration is hard since we have no good
  way to detect (and then use securely) our external IP address.
* It is currently not possible for multiple transports
  between the same pair of peers to be used concurrently
  in the same direction(s).
* We're using lots of cron-based jobs to periodically
  try (and fail) to build and transmit

SOLUTION:
* Rewrite core to integrate most of these services
  into one "core" service.
* Redesign HELLO to contain the addresses for
  all enabled transports in one message (avoiding
  having to transmit the public key and signature
  many, many times).
* With discovery being part of the transport service,
  it is now also possible to "learn" our external
  IP address from other peers (we just add plausible
  addresses to the list; other peers will discard
  those addresses that don't work for them!).
* New DV will consist of a "transport" and a
  high-level service (to handle encrypted DV
  control- and data-messages).
* Move expiration from one field per HELLO to one
  per address.
* Require signature in PONG, not in HELLO (and confirm
  one address at a time).
* Move fragmentation into helper library linked
  against by UDP (and others that might need it).
* Link-to-link advertising of our HELLO is transport
  responsibility; global advertising/bootstrap remains
  responsibility of higher layers.
* Change APIs to be event-based (transports pull for
  transmission data instead of core pushing and failing).
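
The redesigned HELLO (one public key, many transport addresses, one
expiration time per address) might be modelled roughly as below. This
is an in-memory sketch under stated assumptions, not the HELLO wire
format: struct hello, struct hello_address, the field sizes and both
functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_ADDRESSES 8
#define MAX_ADDR_LEN 32
#define MAX_TRANSPORT_NAME 8

/* One address of the peer, reachable via one transport. */
struct hello_address
{
  char transport[MAX_TRANSPORT_NAME]; /* e.g. "tcp" or "udp" */
  uint8_t addr[MAX_ADDR_LEN];         /* transport-specific bytes */
  size_t addr_len;
  uint64_t expires_ms;                /* expiration is per address */
};

/* The public key is stored once, no matter how many addresses. */
struct hello
{
  uint8_t public_key[32]; /* placeholder size, an assumption */
  size_t count;
  struct hello_address addresses[MAX_ADDRESSES];
};

/* Append an address; returns 0 on success, -1 if it does not fit. */
int
hello_add_address (struct hello *h, const char *transport, const void *addr,
                   size_t addr_len, uint64_t expires_ms)
{
  assert (NULL != h);
  if ((h->count >= MAX_ADDRESSES) || (addr_len > MAX_ADDR_LEN) ||
      (strlen (transport) >= MAX_TRANSPORT_NAME))
    return -1;
  struct hello_address *a = &h->addresses[h->count++];
  strcpy (a->transport, transport);
  memcpy (a->addr, addr, addr_len);
  a->addr_len = addr_len;
  a->expires_ms = expires_ms;
  return 0;
}

/* Remove addresses that expired at or before now_ms, keeping the
   order of the rest; returns the number of addresses left. */
size_t
hello_drop_expired (struct hello *h, uint64_t now_ms)
{
  size_t kept = 0;
  for (size_t i = 0; i < h->count; i++)
  {
    if (h->addresses[i].expires_ms <= now_ms)
      continue;
    if (kept != i)
      h->addresses[kept] = h->addresses[i];
    kept++;
  }
  h->count = kept;
  return kept;
}
```

With expiration stored per address, a stale address can be dropped
while the remaining addresses of the same peer stay usable.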


PROBLEM GROUP 6 (FS-APIs):
* As with gnunetd, the FS-APIs are heavily threaded,
  resulting in hard-to-understand code (slightly
  better than gnunetd, but not much).
* GTK in particular does not like this, resulting
  in complicated code to switch to the GTK event
  thread when needed (which may still be causing
  problems on Gnome, not sure).
* If GUIs die (or are not properly shut down), state
  of current transactions is lost (FSUI only
  saves to disk on shutdown).

SOLUTION (draft, not done yet, details missing...):
* Eliminate threads from FS-APIs
  => Open question: how to best write the APIs to
     allow integration with diverse event loops
     of GUI libraries?
* Store FS-state always also on disk
  => Open question: how to do this without
     compromising state/scalability?
