author | Christian Grothoff <christian@grothoff.org> | 2011-11-19 20:26:33 +0000 |
---|---|---|
committer | Christian Grothoff <christian@grothoff.org> | 2011-11-19 20:26:33 +0000 |
commit | 461cf105fd0e7bf0eb7b7ae0481b327db44e950c | |
tree | 21580e1c9b5c1e0d6c91c8d10c057188aee9b7df /RATIONALE | |
parent | dbaa827eaaeded675980aa2f8f8af87f82923df6 | |
moved to drupal
Diffstat (limited to 'RATIONALE')
-rw-r--r-- | RATIONALE | 316 |
1 files changed, 0 insertions, 316 deletions
diff --git a/RATIONALE b/RATIONALE
deleted file mode 100644
index 1851aeb78..000000000
--- a/RATIONALE
+++ /dev/null
@@ -1,316 +0,0 @@
This document is a summary of the changes made to GNUnet for version
0.9.x (from 0.8.x) and what this major redesign tries to address.

First of all, the redesign does not (intentionally) change anything
fundamental about the application-level protocols or how files are
encoded and shared. However, it is not protocol-compatible due to
other changes that do not relate to the essence of the application
protocols. This choice was made because productive development and
readable code were considered more important than compatibility at
this point.

The redesign tries to address the following major problem groups,
which describe issues that apply more or less to all GNUnet versions
prior to 0.9.x:


PROBLEM GROUP 1 (scalability):
* The code was modular, but bugs were not. Memory corruption
  in one plugin could cause crashes in others, and it was not
  always easy to identify the culprit. This approach
  fundamentally does not scale (in the sense of GNUnet being
  a framework, with a GNUnet server running hundreds of
  different application protocols and the result still
  being debuggable, secure and stable).
* The code was heavily multi-threaded, resulting in complex
  locking operations. GNUnet 0.8.x had over 70 different
  mutexes and almost 1000 lines of lock/unlock operations.
  It is challenging even for good programmers to write or
  maintain correct multi-threaded code of this complexity.
  The excessive locking essentially prevents GNUnet 0.8 from
  actually doing much in parallel on multicore systems.
* Despite efforts like Freeway, it was virtually
  impossible to contribute code to GNUnet 0.8 that was not
  written in C/C++.
* Changes to the configuration almost always required restarts
  of gnunetd; the existence of change-notifications does not
  really change that (few users are even aware of SIGHUP, few
  options actually took effect that way, and supporting it came
  at a considerable expense in code complexity).
* Running valgrind was only possible on the entire gnunetd
  process. Given that gnunetd does quite a bit of
  CPU-intensive crypto, this could not be done for a system
  under heavy (or even moderate) load.
* Stack overflows with threads, while rare under Linux these
  days, result in really nasty and hard-to-find crashes.
* Structs of function pointers in service APIs were
  needlessly adding complexity, especially since in
  most cases there was no actual polymorphism.

SOLUTION:
* Use multiple, loosely-coupled processes and one big select
  loop in each (supported by a powerful util library to eliminate
  code duplication for each process).
* Eliminate all threads and manage the processes with a
  master process (gnunet-arm, the automatic restart manager),
  which also ensures that configuration changes trigger the
  necessary restarts.
* Use continuations (with timeouts) as a way to unify
  cron-jobs and other event-based code (such as waiting
  on network IO).
=> Using multiple processes ensures that memory corruption
   stays localized.
=> Using multiple processes will make it easy to contribute
   services written in other languages.
=> Individual services can now be subjected to valgrind.
=> Process priorities can be used to schedule the CPU better.
Note that we cannot just use one process with a big
select loop because we have blocking operations (and the
blocking is outside of our control, thanks to MySQL,
sqlite, gethostbyaddr, etc.). So in order to perform
reasonably well, we need some construct for parallel
execution.

RULE: If your service contains blocking functions, it
      MUST be a process by itself. If your service
      is sufficiently complex, you MAY choose to make
      it a separate process.
* Eliminate structs with function pointers for service APIs;
  instead, provide a library API (still ending in _service.h)
  that forwards the requests to the respective process
  (easier to use, no need to "request" the service
  in the first place; the API can cause the process to be
  started/stopped via ARM if necessary).
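
The continuation idea can be sketched, under stated assumptions, as a single-threaded loop that always runs the pending task with the earliest deadline. A periodic job then re-arms itself instead of living in a cron thread. The names (struct Task, sched_add_delayed, sched_run, periodic_task) and the fixed-size task table are hypothetical and are not the GNUnet scheduler API. A virtual clock stands in for real time. A real loop would instead block in select() until the earliest deadline or until a socket became ready, and this sketch shows only the timeout half.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch; not the GNUnet scheduler API. */
typedef void (*TaskCallback) (void *cls);

struct Task
{
  uint64_t deadline;  /* virtual time (ms) at which the continuation runs */
  TaskCallback cb;    /* continuation to invoke */
  void *cls;          /* closure passed to the continuation */
};

#define MAX_TASKS 64

static struct Task pending[MAX_TASKS];
static size_t num_pending;
static uint64_t virtual_now;  /* stands in for the wall clock */

/* Schedule 'cb' to run 'delay_ms' after the current time.
   Returns 0 on success, -1 if the task table is full. */
static int
sched_add_delayed (uint64_t delay_ms, TaskCallback cb, void *cls)
{
  if (num_pending == MAX_TASKS)
    return -1;
  pending[num_pending].deadline = virtual_now + delay_ms;
  pending[num_pending].cb = cb;
  pending[num_pending].cls = cls;
  num_pending++;
  return 0;
}

/* Run until no tasks remain, always picking the earliest deadline.
   A real loop would call select() with the time left until that
   deadline as its timeout and also dispatch ready sockets. */
static void
sched_run (void)
{
  while (num_pending > 0)
  {
    size_t best = 0;
    for (size_t i = 1; i < num_pending; i++)
      if (pending[i].deadline < pending[best].deadline)
        best = i;
    struct Task t = pending[best];
    /* Remove the task before running it so it may reschedule itself. */
    pending[best] = pending[--num_pending];
    if (t.deadline > virtual_now)
      virtual_now = t.deadline;  /* "sleep" until the deadline */
    t.cb (t.cls);
  }
}

/* Example continuation replacing a cron job: it records when it ran
   and re-arms itself every 1000 ms until 'remaining' reaches zero. */
struct Periodic
{
  unsigned int remaining;  /* must be > 0 when first scheduled */
  unsigned int fired;
  uint64_t fired_at[8];
};

static void
periodic_task (void *cls)
{
  struct Periodic *p = cls;

  assert (p->fired < 8);
  p->fired_at[p->fired++] = virtual_now;
  if (--p->remaining > 0)
    sched_add_delayed (1000, &periodic_task, p);
}
```

Suppose periodic_task is scheduled with a 500 ms delay and remaining = 3. It then runs at virtual times 500, 1500 and 2500 ms. After that, sched_run returns because nothing is pending.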


PROBLEM GROUP 2 (UTIL-APIs causing bugs):
* The existing logging functions were awkward to use and
  their expressive power was never really used for much.
* While we had some rules for naming functions, there
  were still plenty of inconsistencies.
* Specifying default values in the configuration could
  result in inconsistencies between the defaults in
  config.scm and the defaults used by the program; also,
  different defaults might have been specified for the
  same option in different parts of the program.
* The TIME API did not distinguish between absolute
  and relative time, requiring users to know which
  type of value a variable contained and to convert
  it correctly by hand. Combined with the
  possibility of integer overflows, this is a major
  source of bugs.
* The TIME API for seconds had a theoretical 32-bit
  overflow problem on some platforms, which the old
  code only partially worked around with some
  hackery.

SOLUTION:
* Logging was radically simplified.
* Functions are now more consistently named.
* Configuration lookups no longer take built-in defaults;
  instead, we load a global default configuration file
  before the user-specific configuration (which
  can be used to override defaults); the global
  default configuration file will be generated
  from config.scm.
* Time now distinguishes between
  struct GNUNET_TIME_Absolute and
  struct GNUNET_TIME_Relative. We use structs
  so that the compiler will not silently convert between
  them (forcing the use of specific conversion
  functions which check for overflows, etc.).
  Naturally, the need to use these functions makes
  the code a bit more verbose, but that is a good
  thing given the potential for bugs.
* The TIME API no longer has any functions that
  operate on 32-bit seconds.
* There is now a bandwidth API to handle
  non-trivial bandwidth utilization calculations.
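
The following sketch shows why distinct structs help. It uses two hypothetical wrapper types, struct TimeAbsolute and struct TimeRelative, and assumes millisecond resolution. They mirror the idea described above but are not the actual GNUnet definitions. Every operation goes through a function that saturates at a "forever" value instead of silently wrapping around.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types; not the GNUnet definitions. A one-member struct
   costs nothing at run time but cannot be mixed up with the other kind. */
struct TimeAbsolute { uint64_t ms; };  /* point in time, ms since the epoch */
struct TimeRelative { uint64_t ms; };  /* duration in ms */

static_assert (sizeof (struct TimeAbsolute) == sizeof (uint64_t),
               "wrapper adds no size overhead");

#define TIME_FOREVER UINT64_MAX  /* saturation value meaning "never" */

/* absolute + relative -> absolute, saturating instead of overflowing */
static struct TimeAbsolute
time_absolute_add (struct TimeAbsolute start, struct TimeRelative duration)
{
  struct TimeAbsolute ret;

  if (start.ms == TIME_FOREVER || duration.ms > TIME_FOREVER - start.ms)
    ret.ms = TIME_FOREVER;
  else
    ret.ms = start.ms + duration.ms;
  return ret;
}

/* absolute - absolute -> relative; zero if 'to' is not after 'from' */
static struct TimeRelative
time_absolute_difference (struct TimeAbsolute from, struct TimeAbsolute to)
{
  struct TimeRelative ret;

  if (to.ms == TIME_FOREVER)
    ret.ms = TIME_FOREVER;
  else if (to.ms <= from.ms)
    ret.ms = 0;
  else
    ret.ms = to.ms - from.ms;
  return ret;
}

/* relative * factor, saturating on overflow */
static struct TimeRelative
time_relative_multiply (struct TimeRelative rel, uint64_t factor)
{
  struct TimeRelative ret;

  if (factor != 0 && rel.ms > TIME_FOREVER / factor)
    ret.ms = TIME_FOREVER;
  else
    ret.ms = rel.ms * factor;
  return ret;
}
```

With these types, the compiler rejects a call that passes a struct TimeAbsolute where a struct TimeRelative is expected, such as time_absolute_add (t, t). That is exactly the implicit coercion the design wants to rule out.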


PROBLEM GROUP 3 (statistics):
* Databases and others needed to store capacity values
  similar to what stats was already doing, but
  across process lifetimes (the "state" API was a partial
  solution for that, but using it was clunky).
* Only gnunetd could use statistics, but other
  processes in the GNUnet system might have had
  good uses for it as well.

SOLUTION:
* New statistics library and service that offer
  an API to inspect and modify statistics.
* Statistics are distinguished by service name
  in addition to the name of the value.
* Statistics can be marked as persistent, in
  which case they are written to disk when
  the statistics service shuts down.
=> One solution for existing stats uses,
   application stats, database stats and
   versioning information!
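
The two ideas in this solution can be sketched as an in-memory table. The first is keying each value by (service name, value name). The second is saving only the entries marked persistent at shutdown. Everything here is a hypothetical illustration, not the GNUnet statistics API or its on-disk format. That includes stat_set, stat_update, stat_save_persistent, the fixed table size and the tab-separated output.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory model; not the GNUnet statistics API. */
#define MAX_STATS 32
#define NAME_LEN 64

struct StatEntry
{
  char service[NAME_LEN];  /* owning service, e.g. "datastore" */
  char name[NAME_LEN];     /* value name, unique only within the service */
  int64_t value;
  int persistent;          /* non-zero: saved when the service shuts down */
};

static struct StatEntry stats[MAX_STATS];
static size_t num_stats;

/* Look up a value by (service, name); NULL if it does not exist yet. */
static struct StatEntry *
stat_find (const char *service, const char *name)
{
  for (size_t i = 0; i < num_stats; i++)
    if (0 == strcmp (stats[i].service, service)
        && 0 == strcmp (stats[i].name, name))
      return &stats[i];
  return NULL;
}

/* Set a value, creating the entry if needed. Returns 0, or -1 on error. */
static int
stat_set (const char *service, const char *name, int64_t value,
          int persistent)
{
  struct StatEntry *e;

  assert (NULL != service && NULL != name);
  e = stat_find (service, name);
  if (NULL == e)
  {
    if (num_stats == MAX_STATS
        || strlen (service) >= NAME_LEN || strlen (name) >= NAME_LEN)
      return -1;
    e = &stats[num_stats++];
    strcpy (e->service, service);
    strcpy (e->name, name);
  }
  e->value = value;
  e->persistent = persistent;
  return 0;
}

/* Add 'delta' to a value (missing values start at zero). */
static int
stat_update (const char *service, const char *name, int64_t delta,
             int persistent)
{
  const struct StatEntry *e = stat_find (service, name);

  return stat_set (service, name, (NULL == e ? 0 : e->value) + delta,
                   persistent);
}

/* At shutdown: write only persistent entries, one per line, and
   return how many were written. */
static size_t
stat_save_persistent (FILE *out)
{
  size_t written = 0;

  for (size_t i = 0; i < num_stats; i++)
    if (stats[i].persistent)
    {
      fprintf (out, "%s\t%s\t%lld\n", stats[i].service, stats[i].name,
               (long long) stats[i].value);
      written++;
    }
  return written;
}
```

The service name is part of the key, so "core" and "datastore" can each keep a value called "# bytes stored" without clashing. If only the datastore entry is marked persistent, only that entry is written at shutdown.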


PROBLEM GROUP 4 (Testing):
* The existing structure of the code, with modules
  stored in places far away from the test code,
  resulted in tools like lcov not giving good results.
* The codebase had evolved into a complex, deeply
  nested hierarchy, often with directories that
  contained only a single file. Some of these
  files had the same name, making it hard to find
  the source corresponding to a crash based on
  the reported filename/line information.
* Non-trivial portions of the code lacked good testcases,
  and it was not always obvious which parts of the code
  were not well-tested.

SOLUTION:
* Code that should be tested together is now
  in the same directory.
* The hierarchy is now essentially flat, with each
  major service having one directory under src/;
  naming conventions help to make sure that
  files have globally-unique names.
* All code added to the new repository must
  come with testcases with reasonable coverage.


PROBLEM GROUP 5 (core/transports):
* The new DV service requires session key exchange
  between DV-neighbours, but the existing
  session key code cannot be used to achieve this.
* The core requires certain services
  (such as identity, pingpong, fragmentation,
  transport, traffic, session), which makes it
  meaningless to have these as modules
  (especially since there is really only one
  way to implement these).
* HELLOs are larger than necessary since we need
  one for each transport (and hence often have
  to pick a subset of our HELLOs to transmit).
* Fragmentation is done at the core level but only
  required for a few transports; future versions of
  these transports might want to be aware of fragments
  and do things like retransmission.
* Autoconfiguration is hard since we have no good
  way to detect (and then use securely) our external IP address.
* It is currently not possible for multiple transports
  between the same pair of peers to be used concurrently
  in the same direction(s).
* We're using lots of cron-based jobs to periodically
  try (and fail) to build and transmit

SOLUTION:
* Rewrite core to integrate most of these services
  into one "core" service.
* Redesign HELLO to contain the addresses for
  all enabled transports in one message (avoiding
  having to transmit the public key and signature
  many, many times).
* With discovery being part of the transport service,
  it is now also possible to "learn" our external
  IP address from other peers (we just add plausible
  addresses to the list; other peers will discard
  those addresses that don't work for them!).
* New DV will consist of a "transport" and a
  high-level service (to handle encrypted DV
  control- and data-messages).
* Move expiration from one field per HELLO to one
  per address.
* Require a signature in PONG, not in HELLO (and confirm
  one address at a time).
* Move fragmentation into a helper library linked
  against by UDP (and others that might need it).
* Link-to-link advertising of our HELLO is the transport's
  responsibility; global advertising/bootstrap remains
  the responsibility of higher layers.
* Change APIs to be event-based (transports pull
  transmission data instead of core pushing and failing).
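
The HELLO redesign puts all addresses in one message, includes the public key only once and gives each address its own expiration time. That suggests a layout along the following lines. This is a hypothetical in-memory sketch: the field sizes, names and helper functions are assumptions, and it is not the GNUnet wire format. It carries no signature because, per the text, addresses are confirmed one at a time through signed PONGs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout; not the GNUnet wire format. */
#define PUBKEY_LEN 32
#define MAX_ADDRS 8

struct HelloAddress
{
  char transport[16];      /* transport plugin name, e.g. "tcp" */
  char address[64];        /* transport-specific address */
  uint64_t expiration_ms;  /* absolute expiration of this address only */
};

struct Hello
{
  unsigned char public_key[PUBKEY_LEN];  /* included once per HELLO */
  unsigned int num_addresses;
  struct HelloAddress addresses[MAX_ADDRS];
};

/* Append an address for any transport. Returns 0, or -1 if it does not fit. */
static int
hello_add_address (struct Hello *h, const char *transport,
                   const char *address, uint64_t expiration_ms)
{
  struct HelloAddress *a;

  assert (NULL != h);
  if (h->num_addresses == MAX_ADDRS
      || strlen (transport) >= sizeof (a->transport)
      || strlen (address) >= sizeof (a->address))
    return -1;
  a = &h->addresses[h->num_addresses++];
  strcpy (a->transport, transport);
  strcpy (a->address, address);
  a->expiration_ms = expiration_ms;
  return 0;
}

/* Drop expired addresses in place; the remaining ones stay usable.
   Returns the number of addresses kept. */
static unsigned int
hello_remove_expired (struct Hello *h, uint64_t now_ms)
{
  unsigned int kept = 0;

  for (unsigned int i = 0; i < h->num_addresses; i++)
    if (h->addresses[i].expiration_ms > now_ms)
    {
      if (kept != i)
        h->addresses[kept] = h->addresses[i];
      kept++;
    }
  h->num_addresses = kept;
  return kept;
}
```

When a UDP address expires, the TCP entry in the same HELLO is left intact. That is the practical benefit of per-address rather than per-HELLO expiration.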


PROBLEM GROUP 6 (FS-APIs):
* As with gnunetd, the FS-APIs are heavily threaded,
  resulting in hard-to-understand code (slightly
  better than gnunetd, but not much).
* GTK in particular does not like this, resulting
  in complicated code to switch to the GTK event
  thread when needed (this may still be causing
  problems on GNOME, but that is not certain).
* If GUIs die (or are not properly shut down), the state
  of current transactions is lost (FSUI only
  saves to disk on shutdown).
* FILENAME metadata is stripped by ECRS/FSUI to avoid
  exposing HOME, but what if the user set it manually?
* The DHT was a generic data structure with no
  support for ECRS-style block validation.

SOLUTION:
* Eliminate threads from the FS-APIs.
* Always store FS state incrementally on disk as well, using
  many small files instead of one big file.
* Have an API to manipulate the sharing tree before
  upload; have auto-construction modify FILENAME
  but allow user modifications afterwards.
* The DHT API was extended with a BLOCK API for content
  validation by block type; validators for FS and
  DHT block types were written; the BLOCK API is also
  used by the gap routing code.
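
Validation by block type amounts to a dispatch table. Code that receives a reply (the DHT or the routing code) asks one entry point whether the data is acceptable for the query, and a type-specific validator decides. The sketch below is hypothetical: its type numbers, result codes and function names are invented for illustration and are not the GNUnet BLOCK API. 64-bit FNV-1a stands in for the cryptographic hash that real content-addressed blocks would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; not the GNUnet BLOCK API. */
enum BlockResult
{
  BLOCK_VALID,
  BLOCK_INVALID,
  BLOCK_TYPE_NOT_SUPPORTED
};

enum BlockType
{
  BLOCK_TYPE_CONTENT = 1,  /* content-addressed: key = hash (data) */
  BLOCK_TYPE_OPAQUE = 2,   /* placeholder type: any non-empty payload */
  BLOCK_TYPE_MAX
};

typedef enum BlockResult (*BlockValidator) (uint64_t query,
                                            const void *data, size_t size);

/* 64-bit FNV-1a: a toy stand-in for a cryptographic hash. */
static uint64_t
toy_hash (const void *data, size_t size)
{
  const unsigned char *p = data;
  uint64_t h = 14695981039346656037ULL;

  for (size_t i = 0; i < size; i++)
  {
    h ^= p[i];
    h *= 1099511628211ULL;
  }
  return h;
}

static enum BlockResult
validate_content (uint64_t query, const void *data, size_t size)
{
  return (toy_hash (data, size) == query) ? BLOCK_VALID : BLOCK_INVALID;
}

static enum BlockResult
validate_opaque (uint64_t query, const void *data, size_t size)
{
  (void) query;
  (void) data;
  return (size > 0) ? BLOCK_VALID : BLOCK_INVALID;
}

/* One validator per block type; unused slots stay NULL. */
static const BlockValidator validators[BLOCK_TYPE_MAX] = {
  [BLOCK_TYPE_CONTENT] = validate_content,
  [BLOCK_TYPE_OPAQUE] = validate_opaque,
};

/* Single entry point shared by all callers (DHT, routing, ...). */
static enum BlockResult
block_evaluate (unsigned int type, uint64_t query,
                const void *data, size_t size)
{
  assert (NULL != data || 0 == size);
  if (type >= BLOCK_TYPE_MAX || NULL == validators[type])
    return BLOCK_TYPE_NOT_SUPPORTED;
  return validators[type] (query, data, size);
}
```

Here a table of function pointers is justified, because each block type really does need different behaviour. That is the kind of polymorphism the service-API structs criticised in problem group 1 usually lacked.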


PROBLEM GROUP 7 (User experience):
* Searches often do not return a sufficient / significant number of
  results.
* Sharing a directory with thousands of similar files (image/jpeg)
  creates thousands of search results for the mime-type keyword
  (problems with DB performance, network transmission, caching,
  end-user display, etc.).
* Users who wanted to share important content had no way to
  tell the system to replicate it more; replication was also
  inefficient (this desired feature was sometimes called
  "power" publishing or content pushing).

SOLUTION:
* Add an option to canonicalize keywords (see the suggestion on the
  mailing list at the end of June 2009: keep the consonants and sort
  them alphabetically); not fully implemented yet.
* When sharing directories, extract keywords first and then
  push keywords that are common to all files up to the
  directory level; when processing an AND-ed query and a directory
  is found to match, inspect the metadata of the files
  in the directory to possibly produce further results
  (requires downloading the directory in the background);
  needs more testing.
* A desired replication level can now be specified and is tracked
  in the datastore; migration prefers content with a high
  replication level (which decreases as replicas are created).
=> The datastore format changed; we also took out a size field
   that was redundant, so the overall overhead remains the same.
* Peers with a full disk (or disabled migration) can now notify
  other peers that they are not interested in migration right
  now; as a result, less bandwidth is wasted pushing content
  to these peers (and replication counters are not generally
  decreased based on copies that are just discarded; naturally,
  there is still no guarantee that the replicas will stay
  available).
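
The keyword canonicalization suggestion (keep the consonants and sort them alphabetically) is concrete enough to sketch. The text does not pin down the details, so this hypothetical canonicalize_keyword makes several assumptions:

* input is ASCII, and case is folded to lower case;
* every letter other than a, e, i, o, u counts as a consonant;
* non-letters are dropped, and duplicate consonants are kept.

The actual GNUnet option may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch of the suggested canonicalization; ASCII only. */
static int
cmp_char (const void *a, const void *b)
{
  return (int) *(const unsigned char *) a - (int) *(const unsigned char *) b;
}

/* Writes the canonical form of 'keyword' into 'out' (capacity 'out_size',
   including the terminating NUL). Returns its length, or -1 if 'out'
   is too small. */
static int
canonicalize_keyword (const char *keyword, char *out, size_t out_size)
{
  size_t len = 0;

  assert (NULL != keyword && NULL != out);
  if (0 == out_size)
    return -1;
  for (const char *p = keyword; '\0' != *p; p++)
  {
    char c = *p;

    if (c >= 'A' && c <= 'Z')
      c = (char) (c - 'A' + 'a');  /* fold to lower case */
    if (c < 'a' || c > 'z')
      continue;                    /* drop digits, punctuation, non-ASCII */
    if ('a' == c || 'e' == c || 'i' == c || 'o' == c || 'u' == c)
      continue;                    /* drop vowels */
    if (len + 1 >= out_size)
      return -1;                   /* no room for this char plus NUL */
    out[len++] = c;
  }
  qsort (out, len, 1, cmp_char);   /* sort consonants alphabetically */
  out[len] = '\0';
  return (int) len;
}
```

Under these assumptions, "Photograph" and the misspelling "photograhp" both map to "ghhpprt", so slightly different spellings of a keyword would match. The trade-off is that unrelated words with the same consonants also collide.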



SUMMARY:
* Features eliminated from util:
  - threading (goal: good riddance!)
  - complex logging features [ectx-passing, target-kinds] (goal: good riddance!)
  - complex configuration features [defaults, notifications] (goal: good riddance!)
  - network traffic monitors (goal: eliminate)
  - IPC semaphores (goal: d-bus? / eliminate?)
  - second timers
* New features in util:
  - scheduler
  - service and program boot-strap code
  - bandwidth and time APIs
  - buffered IO API
  - HKDF implementation (crypto)
  - load calculation API
  - bandwidth calculation API
* Major changes in util:
  - more expressive server (replaces selector)
  - DNS lookup replaced by async service