From 0a217a8df1657b4334b55b0e4a6c7837a8dbcfd9 Mon Sep 17 00:00:00 2001
From: Christian Grothoff
Date: Fri, 29 May 2009 00:46:26 +0000
Subject: ng

---
 RATIONALE | 246 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 246 insertions(+)
 create mode 100644 RATIONALE

diff --git a/RATIONALE b/RATIONALE
new file mode 100644
index 000000000..e68dcb883
--- /dev/null
+++ b/RATIONALE
@@ -0,0 +1,246 @@
+This document is a summary of why we're moving to GNUnet NG and what
+this major redesign tries to address.
+
+First of all, the redesign does not (intentionally) change anything
+fundamental about the application-level protocols or how files are
+encoded and shared. However, it is not protocol-compatible due to
+other changes that do not relate to the essence of the application
+protocols.
+
+
+The redesign tries to address the following major problem groups
+describing issues that apply more or less to all GNUnet versions
+prior to 0.9.x:
+
+
+PROBLEM GROUP 1 (scalability):
+* The code was modular, but bugs were not. Memory corruption
+  in one plugin could cause crashes in others and it was not
+  always easy to identify the culprit. This approach
+  fundamentally does not scale (in the sense of GNUnet being
+  a framework and a GNUnet server running hundreds of
+  different application protocols -- and the result still
+  being debuggable, secure and stable).
+* The code was heavily multi-threaded resulting in complex
+  locking operations. GNUnet 0.8.x had over 70 different
+  mutexes and almost 1000 lines of lock/unlock operations.
+  It is challenging for even good programmers to program or
+  maintain good multi-threaded code with this complexity.
+  The excessive locking essentially prevents GNUnet from
+  actually doing much in parallel on multicores.
+* Despite efforts like Freeway, it was virtually
+  impossible to contribute code to GNUnet that was not
+  written in C/C++.
+* Changes to the configuration almost always required restarts
+  of gnunetd; the existence of change-notifications does not
+  really change that (how many users are even aware of SIGHUP,
+  and how few options worked with that -- and at what expense
+  in code complexity!).
+* Valgrinding could only be done for the entire gnunetd
+  process. Given that gnunetd does quite a bit of
+  CPU-intensive crypto, this could not be done for a system
+  under heavy (or even moderate) load.
+* Stack overflows with threads, while rare under Linux these
+  days, result in really nasty and hard-to-find crashes.
+* Structs of function pointers in service APIs were
+  needlessly adding complexity, especially since in
+  most cases there was no polymorphism.
+
+SOLUTION:
+* Use multiple, loosely-coupled processes and one big select
+  loop in each (supported by a powerful library to eliminate
+  code duplication for each process).
+* Eliminate all threads, manage the processes with a
+  master-process (gnunet-arm, for automatic restart manager)
+  which also ensures that configuration changes trigger the
+  necessary restarts.
+* Use continuations (with timeouts) as a way to unify
+  cron-jobs and other event-based code (such as waiting
+  on network IO).
+  => Using multiple processes ensures that memory corruption
+     stays localized.
+  => Using multiple processes will make it easy to contribute
+     services written in other language(s).
+  => Individual services can now be subjected to valgrind.
+  => Process priorities can be used to schedule the CPU better.
+  Note that we cannot just use one process with a big
+  select loop because we have blocking operations (and the
+  blocking is outside of our control, thanks MySQL,
+  sqlite, gethostbyaddr, etc.). So in order to perform
+  reasonably well, we need some construct for parallel
+  execution.
+
+  RULE: If your service contains blocking functions, it
+  MUST be a process by itself.
+* Eliminate structs with function pointers for service APIs;
+  instead, provide a library (still ending in _service.h) API
+  that transmits the requests nicely to the respective
+  process (easier to use, no need to "request" service
+  in the first place; API can cause process to be started/stopped
+  via ARM if necessary).
+
+
+PROBLEM GROUP 2 (UTIL-APIs causing bugs):
+* The existing logging functions were awkward to use and
+  their expressive power was never really used for much.
+* While we had some rules for naming functions, there
+  were still plenty of inconsistencies.
+* Specification of default values in configuration could
+  result in inconsistencies between defaults in
+  config.scm and defaults used by the program; also,
+  different defaults might have been specified for the
+  same option in different parts of the program.
+* The TIME API did not distinguish between absolute
+  and relative time, requiring users to know which
+  type of value some variable contained and to
+  manually convert properly. Combined with the
+  possibility of integer overflows this is a major
+  source of bugs.
+* The TIME API for seconds has a theoretical problem
+  with a 32-bit overflow on some platforms which is
+  only partially fixed by the old code with some
+  hackery.
+
+SOLUTION:
+* Logging was radically simplified.
+* Functions are now more consistently named.
+* Configuration has no more defaults; instead,
+  we load a global default configuration file
+  before the user-specific configuration (which
+  can be used to override defaults); the global
+  default configuration file will be generated
+  from config.scm.
+* Time now distinguishes between
+  struct GNUNET_TIME_Absolute and
+  struct GNUNET_TIME_Relative. We use structs
+  so that the compiler won't coerce for us
+  (forcing the use of specific conversion
+  functions which have checks for overflows, etc.).
+  Naturally the need to use these functions makes
+  the code a bit more verbose, but that's a good
+  thing given the potential for bugs.
+* There is no more TIME API function to do anything
+  with 32-bit seconds.
+
+
+PROBLEM GROUP 3 (statistics):
+* Databases and others needed to store capacity values
+  similar to what stats was already doing, but
+  across process lifetimes ("state"-API was a partial
+  solution for that, but using it was clunky).
+* Only gnunetd could use statistics, but other
+  processes in the GNUnet system might have had
+  good uses for it as well.
+
+SOLUTION:
+* New statistics library and service that offer
+  an API to inspect and modify statistics.
+* Statistics are distinguished by service name
+  in addition to the name of the value.
+* Statistics can be marked as persistent, in
+  which case they are written to disk when
+  the statistics service shuts down.
+  => One solution for existing stats uses,
+     application stats, database stats and
+     versioning information!
+
+
+PROBLEM GROUP 4 (Testing):
+* The existing structure of the code with modules
+  stored in places far away from the test code
+  resulted in tools like lcov not giving good results.
+* The codebase had evolved into a complex, deeply
+  nested hierarchy often with directories that
+  then only contained a single file. Some of these
+  files had the same name making it hard to find
+  the source corresponding to a crash based on
+  the reported filename/line information.
+* Non-trivial portions of the code lacked good testcases,
+  and it was not always obvious which parts of the code
+  were not well-tested.
+
+SOLUTION:
+* Code that should be tested together is now
+  in the same directory.
+* The hierarchy is now essentially flat, each
+  major service having one directory under src/;
+  naming conventions help to make sure that
+  files have globally-unique names.
+* All code added to the new repository must
+  come with testcases with reasonable coverage.
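
Editorial illustration (not part of the patch): the absolute/relative
time split described under PROBLEM GROUP 2 can be sketched in C as
follows. The type and function names (SketchTimeAbsolute,
sketch_absolute_add, sketch_remaining) and the millisecond field are
assumptions made for this example; they are not the definitions used
by GNUnet. The sketch only shows why wrapping the value in distinct
structs forces callers through conversion functions, which can then
check for overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types for illustration only, not the GNUnet definitions. */
struct SketchTimeAbsolute { uint64_t ms; };  /* a point in time, in ms */
struct SketchTimeRelative { uint64_t ms; };  /* a duration, in ms */

/* absolute + relative -> absolute; saturates at UINT64_MAX ("forever")
   instead of silently wrapping around on overflow. */
struct SketchTimeAbsolute
sketch_absolute_add (struct SketchTimeAbsolute start,
                     struct SketchTimeRelative duration)
{
  struct SketchTimeAbsolute result;

  if (start.ms > UINT64_MAX - duration.ms)
    result.ms = UINT64_MAX;
  else
    result.ms = start.ms + duration.ms;
  return result;
}

/* Time left from 'now' until 'deadline'; zero if the deadline has
   passed (avoids unsigned underflow). */
struct SketchTimeRelative
sketch_remaining (struct SketchTimeAbsolute now,
                  struct SketchTimeAbsolute deadline)
{
  struct SketchTimeRelative result;

  result.ms = (deadline.ms > now.ms) ? deadline.ms - now.ms : 0;
  return result;
}

/* Usage example: mixing the two types up would not compile, e.g.
   sketch_absolute_add (timeout, now) is rejected by the compiler. */
void
sketch_time_example (void)
{
  struct SketchTimeAbsolute now = { 1000 };
  struct SketchTimeRelative timeout = { 500 };
  struct SketchTimeAbsolute deadline = sketch_absolute_add (now, timeout);

  assert (deadline.ms == 1500);
  assert (sketch_remaining (now, deadline).ms == 500);
}
```

Because both types are structs, passing a duration where a point in
time is expected is a compile-time error rather than a silent
coercion; the price is the slightly more verbose call sites that the
text above accepts.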
+
+
+PROBLEM GROUP 5 (core/transports):
+* The new DV service requires session key exchange
+  between DV-neighbours, but the existing
+  session key code cannot be used to achieve this.
+* The core requires certain services
+  (such as identity, pingpong, fragmentation,
+  transport, traffic, session) which makes it
+  meaningless to have these as modules
+  (especially since there is really only one
+  way to implement these).
+* HELLOs are larger than necessary since we need
+  one for each transport (and hence often have
+  to pick a subset of our HELLOs to transmit).
+* Fragmentation is done at the core level but only
+  required for a few transports; future versions of
+  these transports might want to be aware of fragments
+  and do things like retransmission.
+* Autoconfiguration is hard since we have no good
+  way to detect (and then use securely) our external IP address.
+* It is currently not possible for multiple transports
+  between the same pair of peers to be used concurrently
+  in the same direction(s).
+* We're using lots of cron-based jobs to periodically
+  try (and fail) to build and transmit
+
+SOLUTION:
+* Rewrite core to integrate most of these services
+  into one "core" service.
+* Redesign HELLO to contain the addresses for
+  all enabled transports in one message (avoiding
+  having to transmit the public key and signature
+  many, many times).
+* With discovery being part of the transport service,
+  it is now also possible to "learn" our external
+  IP address from other peers (we just add plausible
+  addresses to the list; other peers will discard
+  those addresses that don't work for them!).
+* New DV will consist of a "transport" and a
+  high-level service (to handle encrypted DV
+  control- and data-messages).
+* Move expiration from one field per HELLO to one
+  per address.
+* Require signature in PONG, not in HELLO (and confirm
+  one address at a time).
+* Move fragmentation into helper library linked
+  against by UDP (and others that might need it).
+* Link-to-link advertising of our HELLO is transport
+  responsibility; global advertising/bootstrap remains
+  responsibility of higher layers.
+* Change APIs to be event-based (transports pull for
+  transmission data instead of core pushing and failing).
+
+
+PROBLEM GROUP 6 (FS-APIs):
+* As with gnunetd, the FS-APIs are heavily threaded,
+  resulting in hard-to-understand code (slightly
+  better than gnunetd, but not much).
+* GTK in particular does not like this, resulting
+  in complicated code to switch to the GTK event
+  thread when needed (which may still be causing
+  problems on Gnome, not sure).
+* If GUIs die (or are not properly shut down), state
+  of current transactions is lost (FSUI only
+  saves to disk on shutdown).
+
+SOLUTION (draft, not done yet, details missing...):
+* Eliminate threads from FS-APIs
+  => Open question: how to best write the APIs to
+     allow integration with diverse event loops
+     of GUI libraries?
+* Store FS-state always also on disk
+  => Open question: how to do this without
+     compromising state/scalability?
+
--
cgit v1.2.3
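
Editorial illustration (not part of the patch): the HELLO redesign
from PROBLEM GROUP 5 (one message listing the addresses of all enabled
transports, with expiration tracked per address rather than per HELLO)
can be sketched as a data structure in C. Everything here is
hypothetical: the struct layout, the names SketchHello, SketchAddress,
sketch_hello_add and sketch_hello_expire, the fixed capacity and the
millisecond timestamps are assumptions for this example and say
nothing about the actual wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ADDRESSES 8  /* arbitrary capacity for the sketch */

/* One transport address with its own expiration time (hypothetical). */
struct SketchAddress
{
  const char *transport;   /* e.g. "tcp" or "udp" */
  const char *address;     /* transport-specific address string */
  uint64_t expires_ms;     /* absolute expiration time in ms */
};

/* One HELLO for the whole peer: the key appears once, followed by
   the addresses of all enabled transports. */
struct SketchHello
{
  const char *peer_public_key;
  size_t count;
  struct SketchAddress addresses[SKETCH_MAX_ADDRESSES];
};

/* Append an address; returns 0 on success, -1 if the HELLO is full. */
int
sketch_hello_add (struct SketchHello *hello, const char *transport,
                  const char *address, uint64_t expires_ms)
{
  assert (hello != NULL);
  if (hello->count >= SKETCH_MAX_ADDRESSES)
    return -1;
  hello->addresses[hello->count].transport = transport;
  hello->addresses[hello->count].address = address;
  hello->addresses[hello->count].expires_ms = expires_ms;
  hello->count++;
  return 0;
}

/* Drop every address whose expiration is not after 'now_ms', keeping
   the others in order; returns the number of addresses left. */
size_t
sketch_hello_expire (struct SketchHello *hello, uint64_t now_ms)
{
  size_t kept = 0;

  assert (hello != NULL);
  for (size_t i = 0; i < hello->count; i++)
    if (hello->addresses[i].expires_ms > now_ms)
      hello->addresses[kept++] = hello->addresses[i];
  hello->count = kept;
  return kept;
}
```

With per-address expiration, a stale address for one transport can be
removed while the rest of the HELLO stays valid, which a single
expiration field per HELLO could not express.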