author | Christian Grothoff <christian@grothoff.org> | 2009-05-29 00:46:26 +0000 |
---|---|---|
committer | Christian Grothoff <christian@grothoff.org> | 2009-05-29 00:46:26 +0000 |
commit | 0a217a8df1657b4334b55b0e4a6c7837a8dbcfd9 (patch) | |
tree | 6b552f40eb089db96409a312a98d9b12bd669102 /RATIONALE | |
ng
Diffstat (limited to 'RATIONALE')
-rw-r--r-- | RATIONALE | 246 |
1 files changed, 246 insertions, 0 deletions
diff --git a/RATIONALE b/RATIONALE new file mode 100644 index 000000000..e68dcb883 --- /dev/null +++ b/RATIONALE | |||
@@ -0,0 +1,246 @@ | |||
1 | This document is a summary of why we're moving to GNUnet NG and what | ||
2 | this major redesign tries to address. | ||
3 | |||
4 | First of all, the redesign does not (intentionally) change anything | ||
5 | fundamental about the application-level protocols or how files are | ||
6 | encoded and shared. However, it is not protocol-compatible due to | ||
7 | other changes that do not relate to the essence of the application | ||
8 | protocols. | ||
9 | |||
10 | |||
11 | The redesign tries to address the following major problem groups | ||
12 | describing issues that apply more or less to all GNUnet versions | ||
13 | prior to 0.9.x: | ||
14 | |||
15 | |||
16 | PROBLEM GROUP 1 (scalability): | ||
17 | * The code was modular, but bugs were not. Memory corruption | ||
18 | in one plugin could cause crashes in others and it was not | ||
19 | always easy to identify the culprit. This approach | ||
20 | fundamentally does not scale (in the sense of GNUnet being | ||
21 | a framework and a GNUnet server running hundreds of | ||
22 | different application protocols -- and the result still | ||
23 | being debuggable, secure and stable). | ||
24 | * The code was heavily multi-threaded resulting in complex | ||
25 | locking operations. GNUnet 0.8.x had over 70 different | ||
26 | mutexes and almost 1000 lines of lock/unlock operations. | ||
27 | It is challenging for even good programmers to program or | ||
28 | maintain good multi-threaded code with this complexity. | ||
29 | The excessive locking essentially prevents GNUnet from | ||
30 | actually doing much in parallel on multicores. | ||
31 | * Despite efforts like Freeway, it was virtually | ||
32 | impossible to contribute code to GNUnet that was not | ||
33 | written in C/C++. | ||
34 | * Changes to the configuration almost always required restarts | ||
35 | of gnunetd; the existence of change-notifications does not | ||
36 | really change that (how many users are even aware of SIGHUP, | ||
37 | and how few options worked with that -- and at what expense | ||
38 | in code complexity!). | ||
39 | * Valgrinding could only be done for the entire gnunetd | ||
40 | process. Given that gnunetd does quite a bit of | ||
41 | CPU-intensive crypto, this could not be done for a system | ||
42 | under heavy (or even moderate) load. | ||
43 | * Stack overflows with threads, while rare under Linux these | ||
44 | days, result in really nasty and hard-to-find crashes. | ||
45 | * structs of function pointers in service APIs were | ||
46 | needlessly adding complexity, especially since in | ||
47 | most cases there was no polymorphism | ||
48 | |||
49 | SOLUTION: | ||
50 | * Use multiple, loosely-coupled processes and one big select | ||
51 | loop in each (supported by a powerful library to eliminate | ||
52 | code duplication for each process). | ||
53 | * Eliminate all threads, manage the processes with a | ||
54 | master-process (gnunet-arm, for automatic restart manager) | ||
55 | which also ensures that configuration changes trigger the | ||
56 | necessary restarts. | ||
57 | * Use continuations (with timeouts) as a way to unify | ||
58 | cron-jobs and other event-based code (such as waiting | ||
59 | on network IO). | ||
60 | => Using multiple processes ensures that memory corruption | ||
61 | stays localized. | ||
62 | => Using multiple processes will make it easy to contribute | ||
63 | services written in other language(s). | ||
64 | => Individual services can now be subjected to valgrind | ||
65 | => Process priorities can be used to schedule the CPU better | ||
66 | Note that we can not just use one process with a big | ||
67 | select loop because we have blocking operations (and the | ||
68 | blocking is outside of our control, thanks MySQL, | ||
69 | sqlite, gethostbyaddr, etc.). So in order to perform | ||
70 | reasonably well, we need some construct for parallel | ||
71 | execution. | ||
72 | |||
73 | RULE: If your service contains blocking functions, it | ||
74 | MUST be a process by itself. | ||
75 | * Eliminate structs with function pointers for service APIs; | ||
76 | instead, provide a library (still ending in _service.h) API | ||
77 | that transmits the requests nicely to the respective | ||
78 | process (easier to use, no need to "request" service | ||
79 | in the first place; API can cause process to be started/stopped | ||
80 | via ARM if necessary). | ||
81 | |||
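The "one big select loop" with continuations and timeouts described above can be sketched roughly as follows. This is a minimal illustration in Python, not GNUnet's scheduler (which is written in C); the class `Scheduler` and the methods `add_delayed` and `add_read` are hypothetical names chosen for this example. It shows how a timed job and a wait on network IO both reduce to "call this continuation when the event happens or the deadline passes".

```python
import heapq
import itertools
import selectors
import socket
import time


class Scheduler:
    """Single-threaded event loop; every task is a continuation with a deadline."""

    def __init__(self):
        self._sel = selectors.DefaultSelector()
        self._timers = []              # heap of (deadline, seq, callback)
        self._seq = itertools.count()  # tie-breaker so callbacks are never compared

    def add_delayed(self, delay, callback):
        """Run callback() once after `delay` seconds (the cron-job case)."""
        deadline = time.monotonic() + delay
        heapq.heappush(self._timers, (deadline, next(self._seq), callback))

    def add_read(self, sock, timeout, callback):
        """Run callback(True) when sock is readable, or callback(False) on timeout."""
        state = {"done": False}

        def fire(ready):
            if state["done"]:
                return  # the other event already ran the continuation
            state["done"] = True
            self._sel.unregister(sock)
            callback(ready)

        self._sel.register(sock, selectors.EVENT_READ, lambda: fire(True))
        self.add_delayed(timeout, lambda: fire(False))

    def run(self):
        """Loop until no timers and no registered sockets remain."""
        while self._timers or self._sel.get_map():
            wait = None
            if self._timers:
                wait = max(0.0, self._timers[0][0] - time.monotonic())
            for key, _ in self._sel.select(wait):
                key.data()  # IO continuation
            now = time.monotonic()
            while self._timers and self._timers[0][0] <= now:
                heapq.heappop(self._timers)[2]()  # timer continuation


events = []
sched = Scheduler()
a, b = socket.socketpair()
idle, idle_peer = socket.socketpair()  # nothing is ever written to this pair

sched.add_delayed(0.01, lambda: b.send(b"ping"))
sched.add_read(a, 0.3, lambda ok: events.append(("a", ok, a.recv(16) if ok else None)))
sched.add_read(idle, 0.1, lambda ok: events.append(("idle", ok, None)))
sched.run()
print(events)
```

Note that nothing in this loop may block: a continuation that calls a blocking function stalls every other task, which is why the RULE above requires such services to run as separate processes.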
82 | |||
83 | PROBLEM GROUP 2 (UTIL-APIs causing bugs): | ||
84 | * The existing logging functions were awkward to use and | ||
85 | their expressive power was never really used for much. | ||
86 | * While we had some rules for naming functions, there | ||
87 | were still plenty of inconsistencies. | ||
88 | * Specification of default values in configuration could | ||
89 | result in inconsistencies between defaults in | ||
90 | config.scm and defaults used by the program; also, | ||
91 | different defaults might have been specified for the | ||
92 | same option in different parts of the program. | ||
93 | * The TIME API did not distinguish between absolute | ||
94 | and relative time, requiring users to know which | ||
95 | type of value some variable contained and to | ||
96 | manually convert properly. Combined with the | ||
97 | possibility of integer overflows this is a major | ||
98 | source of bugs. | ||
99 | * The TIME API for seconds has a theoretical problem | ||
100 | with a 32-bit overflow on some platforms which is | ||
101 | only partially fixed by the old code with some | ||
102 | hackery. | ||
103 | |||
104 | SOLUTION: | ||
105 | * Logging was radically simplified. | ||
106 | * Functions are now more consistently named. | ||
107 | * Configuration has no more defaults; instead, | ||
108 | we load a global default configuration file | ||
109 | before the user-specific configuration (which | ||
110 | can be used to override defaults); the global | ||
111 | default configuration file will be generated | ||
112 | from config.scm. | ||
113 | * Time now distinguishes between | ||
114 | struct GNUNET_TIME_Absolute and | ||
115 | struct GNUNET_TIME_Relative. We use structs | ||
116 | so that the compiler won't coerce for us | ||
117 | (forcing the use of specific conversion | ||
118 | functions which have checks for overflows, etc.). | ||
119 | Naturally the need to use these functions makes | ||
120 | the code a bit more verbose, but that's a good | ||
121 | thing given the potential for bugs. | ||
122 | * There is no more TIME API function to do anything | ||
123 | with 32-bit seconds | ||
124 | |||
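The layered configuration described above (global defaults first, user-specific file second, no defaults in the program) can be illustrated with a small sketch. It uses Python's standard `configparser` purely for illustration; the section name, option names and values are made up for this example and are not taken from GNUnet's actual default configuration.

```python
import configparser

# Hypothetical global default configuration (stands in for the generated file).
GLOBAL_DEFAULTS = """
[statistics]
port = 2088
autostart = yes
"""

# Hypothetical user-specific configuration; it only overrides what it mentions.
USER_CONFIG = """
[statistics]
port = 12088
"""


def load_config(defaults_text, user_text):
    """Load defaults first, then the user file, so later values win."""
    cfg = configparser.ConfigParser()
    cfg.read_string(defaults_text)
    cfg.read_string(user_text)
    return cfg


cfg = load_config(GLOBAL_DEFAULTS, USER_CONFIG)
# The program never supplies a fallback value of its own:
print(cfg["statistics"]["port"], cfg.getboolean("statistics", "autostart"))
```

Because every default lives in one file, two parts of the program can no longer disagree about the default for the same option.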
125 | |||
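The separation of absolute and relative time can be sketched as below. In C the distinct struct types make the compiler reject accidental mixing; Python has no such compile-time check, so this sketch shows the runtime analogue. The type and function names, the millisecond unit and the 64-bit "forever" value are assumptions made for this example, not the actual GNUnet API.

```python
from dataclasses import dataclass

UINT64_MAX = 2**64 - 1  # assumed 64-bit counter; the maximum doubles as "forever"


@dataclass(frozen=True)
class Relative:
    """A duration, in milliseconds (unit assumed for this sketch)."""
    ms: int


@dataclass(frozen=True)
class Absolute:
    """A point in time, in milliseconds since some epoch."""
    ms: int


def absolute_add(start, delta):
    """Return start + delta, saturating at "forever" instead of wrapping around."""
    if not isinstance(start, Absolute) or not isinstance(delta, Relative):
        raise TypeError("absolute_add needs (Absolute, Relative)")
    if start.ms > UINT64_MAX - delta.ms:
        return Absolute(UINT64_MAX)
    return Absolute(start.ms + delta.ms)


def get_remaining(deadline, now):
    """Return the time left until deadline; zero once the deadline has passed."""
    if not isinstance(deadline, Absolute) or not isinstance(now, Absolute):
        raise TypeError("get_remaining needs (Absolute, Absolute)")
    return Relative(max(0, deadline.ms - now.ms))


print(absolute_add(Absolute(1000), Relative(500)))
print(absolute_add(Absolute(UINT64_MAX - 1), Relative(10)))  # saturates
print(get_remaining(Absolute(100), Absolute(250)))           # already expired
```

Writing `Absolute(5) + Relative(3)` raises a `TypeError` here because neither type defines `+`, so every combination has to go through a function that performs the overflow check.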
126 | PROBLEM GROUP 3 (statistics): | ||
127 | * Databases and others needed to store capacity values | ||
128 | similar to what stats was already doing, but | ||
129 | across process lifetimes ("state"-API was a partial | ||
130 | solution for that, but using it was clunky) | ||
131 | * Only gnunetd could use statistics, but other | ||
132 | processes in the GNUnet system might have had | ||
133 | good uses for it as well | ||
134 | |||
135 | SOLUTION: | ||
136 | * New statistics library and service that offer | ||
137 | an API to inspect and modify statistics | ||
138 | * Statistics are distinguished by service name | ||
139 | in addition to the name of the value | ||
140 | * Statistics can be marked as persistent, in | ||
141 | which case they are written to disk when | ||
142 | the statistics service shuts down. | ||
143 | => One solution for existing stats uses, | ||
144 | application stats, database stats and | ||
145 | versioning information! | ||
146 | |||
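The statistics design above (values keyed by service name plus value name, with an optional persistent flag) can be sketched as follows. This is an in-process toy in Python, whereas the document describes a separate service reached through a library; the class `StatisticsService`, its methods, the JSON file format and the example service and value names are all assumptions made for illustration.

```python
import json
import os
import tempfile


class StatisticsService:
    """Toy statistics store: values are keyed by (service, name)."""

    def __init__(self, path):
        self._path = path
        self._values = {}         # (service, name) -> int
        self._persistent = set()  # keys that survive a restart
        if os.path.exists(path):
            with open(path) as f:
                for service, name, value in json.load(f):
                    self._values[(service, name)] = value
                    self._persistent.add((service, name))

    def set(self, service, name, value, persistent=False):
        """Set a value and decide whether it is kept across restarts."""
        key = (service, name)
        self._values[key] = value
        if persistent:
            self._persistent.add(key)
        else:
            self._persistent.discard(key)

    def update(self, service, name, delta):
        """Add delta to a value; the persistence flag is left unchanged."""
        key = (service, name)
        self._values[key] = self._values.get(key, 0) + delta

    def get(self, service, name):
        return self._values.get((service, name), 0)

    def shutdown(self):
        """Write only the persistent values to disk."""
        rows = [[s, n, self._values[(s, n)]] for (s, n) in sorted(self._persistent)]
        with open(self._path, "w") as f:
            json.dump(rows, f)


path = os.path.join(tempfile.mkdtemp(), "statistics.json")
stats = StatisticsService(path)
stats.set("datastore", "bytes stored", 4096, persistent=True)
stats.set("core", "peers connected", 3)
stats.update("datastore", "bytes stored", 1024)
stats.shutdown()

restarted = StatisticsService(path)
print(restarted.get("datastore", "bytes stored"))  # persistent value survives
print(restarted.get("core", "peers connected"))    # transient value is gone
```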
147 | |||
148 | PROBLEM GROUP 4 (Testing): | ||
149 | * The existing structure of the code with modules | ||
150 | stored in places far away from the test code | ||
151 | resulted in tools like lcov not giving good results. | ||
152 | * The codebase had evolved into a complex, deeply | ||
153 | nested hierarchy often with directories that | ||
154 | then only contained a single file. Some of these | ||
155 | files had the same name making it hard to find | ||
156 | the source corresponding to a crash based on | ||
157 | the reported filename/line information. | ||
158 | * Non-trivial portions of the code lacked good testcases, | ||
159 | and it was not always obvious which parts of the code | ||
160 | were not well-tested. | ||
161 | |||
162 | SOLUTION: | ||
163 | * Code that should be tested together is now | ||
164 | in the same directory. | ||
165 | * The hierarchy is now essentially flat, each | ||
166 | major service having one directory under src/; | ||
167 | naming conventions help to make sure that | ||
168 | files have globally-unique names | ||
169 | * All code added to the new repository must | ||
170 | come with testcases with reasonable coverage. | ||
171 | |||
172 | |||
173 | PROBLEM GROUP 5 (core/transports): | ||
174 | * The new DV service requires session key exchange | ||
175 | between DV-neighbours, but the existing | ||
176 | session key code can not be used to achieve this. | ||
177 | * The core requires certain services | ||
178 | (such as identity, pingpong, fragmentation, | ||
179 | transport, traffic, session) which makes it | ||
180 | meaningless to have these as modules | ||
181 | (especially since there is really only one | ||
182 | way to implement these) | ||
183 | * HELLOs are larger than necessary since we need | ||
184 | one for each transport (and hence often have | ||
185 | to pick a subset of our HELLOs to transmit) | ||
186 | * Fragmentation is done at the core level but only | ||
187 | required for a few transports; future versions of | ||
188 | these transports might want to be aware of fragments | ||
189 | and do things like retransmission | ||
190 | * Autoconfiguration is hard since we have no good | ||
191 | way to detect (and then use securely) our external IP address | ||
192 | * It is currently not possible for multiple transports | ||
193 | between the same pair of peers to be used concurrently | ||
194 | in the same direction(s) | ||
195 | * We're using lots of cron-based jobs to periodically | ||
196 | try (and fail) to build and transmit | ||
197 | |||
198 | SOLUTION: | ||
199 | * Rewrite core to integrate most of these services | ||
200 | into one "core" service. | ||
201 | * Redesign HELLO to contain the addresses for | ||
202 | all enabled transports in one message (avoiding | ||
203 | having to transmit the public key and signature | ||
204 | many, many times) | ||
205 | * With discovery being part of the transport service, | ||
206 | it is now also possible to "learn" our external | ||
207 | IP address from other peers (we just add plausible | ||
208 | addresses to the list; other peers will discard | ||
209 | those addresses that don't work for them!) | ||
210 | * New DV will consist of a "transport" and a | ||
211 | high-level service (to handle encrypted DV | ||
212 | control- and data-messages). | ||
213 | * Move expiration from one field per HELLO to one | ||
214 | per address | ||
215 | * Require signature in PONG, not in HELLO (and confirm | ||
216 | one address at a time) | ||
217 | * Move fragmentation into helper library linked | ||
218 | against by UDP (and others that might need it) | ||
219 | * Link-to-link advertising of our HELLO is transport | ||
220 | responsibility; global advertising/bootstrap remains | ||
221 | responsibility of higher layers | ||
222 | * Change APIs to be event-based (transports pull for | ||
223 | transmission data instead of core pushing and failing) | ||
224 | |||
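The redesigned HELLO (one message carrying the addresses of all enabled transports, each with its own expiration) can be modelled with a small sketch. The data layout below is an assumption made for illustration and is not the wire format; the names `Address`, `Hello`, `merge` and `usable` are hypothetical, signatures are left out entirely, and the addresses come from the documentation range 192.0.2.0/24.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    transport: str  # e.g. "tcp" or "udp"
    address: str    # transport-specific address
    expires: int    # absolute expiration time of this one address


@dataclass
class Hello:
    public_key: bytes  # carried once, not once per transport
    addresses: list


def merge(first, second):
    """Combine two HELLOs of one peer, keeping the latest expiration per address."""
    if first.public_key != second.public_key:
        raise ValueError("HELLOs belong to different peers")
    latest = {}
    for addr in first.addresses + second.addresses:
        key = (addr.transport, addr.address)
        if key not in latest or addr.expires > latest[key].expires:
            latest[key] = addr
    return Hello(first.public_key, list(latest.values()))


def usable(hello, now):
    """Return the addresses that have not yet expired at time `now`."""
    return [a for a in hello.addresses if a.expires > now]


old = Hello(b"peer-key", [Address("tcp", "192.0.2.1:2086", 100),
                          Address("udp", "192.0.2.1:2086", 300)])
new = Hello(b"peer-key", [Address("tcp", "192.0.2.1:2086", 500),
                          Address("http", "192.0.2.1:80", 50)])
merged = merge(old, new)
print([(a.transport, a.expires) for a in usable(merged, now=200)])
```

With one expiration per address, a stale address can be dropped without invalidating the others in the same HELLO.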
225 | |||
226 | PROBLEM GROUP 6 (FS-APIs): | ||
227 | * As with gnunetd, the FS-APIs are heavily threaded, | ||
228 | resulting in hard-to-understand code (slightly | ||
229 | better than gnunetd, but not much). | ||
230 | * GTK in particular does not like this, resulting | ||
231 | in complicated code to switch to the GTK event | ||
232 | thread when needed (which may still be causing | ||
233 | problems on Gnome, not sure). | ||
234 | * If GUIs die (or are not properly shut down), state | ||
235 | of current transactions is lost (FSUI only | ||
236 | saves to disk on shutdown) | ||
237 | |||
238 | SOLUTION (draft, not done yet, details missing...): | ||
239 | * Eliminate threads from FS-APIs | ||
240 | => Open question: how to best write the APIs to | ||
241 | allow integration with diverse event loops | ||
242 | of GUI libraries? | ||
243 | * Store FS-state always also on disk | ||
244 | => Open question: how to do this without | ||
245 | compromising state/scalability? | ||
246 | |||