|author||Christian Grothoff <email@example.com>||2009-05-29 00:46:26 +0000|
|committer||Christian Grothoff <firstname.lastname@example.org>||2009-05-29 00:46:26 +0000|
Diffstat (limited to 'RATIONALE')
1 files changed, 246 insertions, 0 deletions
diff --git a/RATIONALE b/RATIONALE
new file mode 100644
@@ -0,0 +1,246 @@
+This document is a summary of why we're moving to GNUnet NG and what
+this major redesign tries to address.
+First of all, the redesign does not (intentionally) change anything
+fundamental about the application-level protocols or how files are
+encoded and shared. However, it is not protocol-compatible due to
+other changes that do not relate to the essence of the application
+protocols.
+The redesign tries to address the following major problem groups
+describing issues that apply more or less to all GNUnet versions
+prior to 0.9.x:
+PROBLEM GROUP 1 (scalability):
+* The code was modular, but bugs were not. Memory corruption
+ in one plugin could cause crashes in others and it was not
+ always easy to identify the culprit. This approach
+ fundamentally does not scale (in the sense of GNUnet being
+ a framework and a GNUnet server running hundreds of
+ different application protocols -- and the result still
+ being debuggable, secure and stable).
+* The code was heavily multi-threaded, resulting in complex
+ locking operations. GNUnet 0.8.x had over 70 different
+ mutexes and almost 1000 lines of lock/unlock operations.
+ It is challenging for even good programmers to program or
+ maintain good multi-threaded code with this complexity.
+ The excessive locking essentially prevents GNUnet from
+ actually doing much in parallel on multicores.
+* Despite efforts like Freeway, it was virtually
+ impossible to contribute code to GNUnet that was not
+ written in C/C++.
+* Changes to the configuration almost always required restarts
+ of gnunetd; the existence of change-notifications does not
+ really change that (how many users are even aware of SIGHUP,
+ and how few options worked with that -- and at what expense
+ in code complexity!).
+* Valgrinding could only be done for the entire gnunetd
+ process. Given that gnunetd does quite a bit of
+ CPU-intensive crypto, this could not be done for a system
+ under heavy (or even moderate) load.
+* Stack overflows with threads, while rare under Linux these
+ days, result in really nasty and hard-to-find crashes.
+* Structs of function pointers in service APIs were
+ needlessly adding complexity, especially since in
+ most cases there was no polymorphism.
+SOLUTION:
+* Use multiple, loosely-coupled processes and one big select
+ loop in each (supported by a powerful library to eliminate
+ code duplication for each process).
+* Eliminate all threads, manage the processes with a
+ master-process (gnunet-arm, for automatic restart manager)
+ which also ensures that configuration changes trigger the
+ necessary restarts.
+* Use continuations (with timeouts) as a way to unify
+ cron-jobs and other event-based code (such as waiting
+ on network IO).
+ => Using multiple processes ensures that memory corruption
+ stays localized.
+ => Using multiple processes will make it easy to contribute
+ services written in other language(s).
+ => Individual services can now be subjected to valgrind
+ => Process priorities can be used to schedule the CPU better
+ Note that we cannot just use one process with a big
+ select loop because we have blocking operations (and the
+ blocking is outside of our control, thanks to MySQL,
+ sqlite, gethostbyaddr, etc.). So in order to perform
+ reasonably well, we need some construct for parallel
+ execution.
+ RULE: If your service contains blocking functions, it
+ MUST be a process by itself.
+* Eliminate structs with function pointers for service APIs;
+ instead, provide a library (still ending in _service.h) API
+ that transmits the requests nicely to the respective
+ process (easier to use, no need to "request" service
+ in the first place; API can cause process to be started/stopped
+ via ARM if necessary).
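The "one big select loop" with continuations and timeouts described above can be sketched roughly as follows. This is only an illustration of the idea, not GNUnet's scheduler: every name (`demo_schedule`, `demo_run`, `demo_count`, the fixed-size task table) is hypothetical, and the wall clock from `gettimeofday` is assumed to be good enough for a demonstration. A cron-style job is a task with no descriptor; a network read is a task with a descriptor and a timeout. Both are handled by the same loop.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* All names are hypothetical; this is not GNUnet's scheduler API. */
typedef void (*demo_continuation)(void *cls, int timed_out);

struct demo_task {
  int in_use;
  int fd;                 /* descriptor to wait on, or -1 for a pure timer */
  long long deadline_ms;  /* wall-clock time at which the task times out */
  demo_continuation cont;
  void *cls;              /* closure handed back to the continuation */
};

#define DEMO_MAX_TASKS 16
static struct demo_task demo_tasks[DEMO_MAX_TASKS];

static long long demo_now_ms(void) {
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return (long long)tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

/* Run `cont` once `fd` is readable or after `delay_ms`, whichever comes
   first.  Pass fd = -1 for a plain delayed ("cron") job. */
int demo_schedule(int fd, long long delay_ms, demo_continuation cont, void *cls) {
  assert(cont != NULL && delay_ms >= 0 && fd < FD_SETSIZE);
  for (int i = 0; i < DEMO_MAX_TASKS; i++) {
    struct demo_task *t = &demo_tasks[i];
    if (t->in_use) continue;
    t->in_use = 1; t->fd = fd; t->deadline_ms = demo_now_ms() + delay_ms;
    t->cont = cont; t->cls = cls;
    return 0;
  }
  return -1;  /* task table full */
}

/* Single-threaded event loop: one select() per iteration, until idle. */
void demo_run(void) {
  for (;;) {
    struct demo_task due[DEMO_MAX_TASKS];
    int timed_out[DEMO_MAX_TASKS], ndue = 0, maxfd = -1, pending = 0;
    long long now = demo_now_ms(), wait_ms = -1;
    fd_set rs;
    struct timeval tv;
    FD_ZERO(&rs);
    for (int i = 0; i < DEMO_MAX_TASKS; i++) {
      struct demo_task *t = &demo_tasks[i];
      if (!t->in_use) continue;
      pending++;
      if (t->fd >= 0) { FD_SET(t->fd, &rs); if (t->fd > maxfd) maxfd = t->fd; }
      long long left = t->deadline_ms > now ? t->deadline_ms - now : 0;
      if (wait_ms < 0 || left < wait_ms) wait_ms = left;
    }
    if (pending == 0) return;
    tv.tv_sec = wait_ms / 1000;
    tv.tv_usec = (wait_ms % 1000) * 1000;
    if (select(maxfd + 1, &rs, NULL, NULL, &tv) < 0)
      FD_ZERO(&rs);  /* e.g. EINTR: treat as "nothing ready" */
    now = demo_now_ms();
    /* Collect due tasks first so continuations may safely reschedule. */
    for (int i = 0; i < DEMO_MAX_TASKS; i++) {
      struct demo_task *t = &demo_tasks[i];
      if (!t->in_use) continue;
      int ready = t->fd >= 0 && FD_ISSET(t->fd, &rs);
      if (!ready && now < t->deadline_ms) continue;
      due[ndue] = *t; timed_out[ndue++] = !ready;
      t->in_use = 0;
    }
    for (int i = 0; i < ndue; i++) due[i].cont(due[i].cls, timed_out[i]);
  }
}

/* Example continuation: consume one byte, or record the timeout. */
struct demo_counts { int fd; int ready; int timeouts; };
void demo_count(void *cls, int timed_out) {
  struct demo_counts *c = cls;
  char byte;
  if (timed_out) { c->timeouts++; return; }
  if (read(c->fd, &byte, 1) == 1) c->ready++;
}
```

Because nothing in this loop may block, a service that has to call a blocking function still needs its own process, as the rule above states.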
+PROBLEM GROUP 2 (UTIL-APIs causing bugs):
+* The existing logging functions were awkward to use and
+ their expressive power was never really used for much.
+* While we had some rules for naming functions, there
+ were still plenty of inconsistencies.
+* Specification of default values in configuration could
+ result in inconsistencies between defaults in
+ config.scm and defaults used by the program; also,
+ different defaults might have been specified for the
+ same option in different parts of the program.
+* The TIME API did not distinguish between absolute
+ and relative time, requiring users to know which
+ type of value some variable contained and to
+ manually convert properly. Combined with the
+ possibility of integer overflows this is a major
+ source of bugs.
+* The TIME API for seconds has a theoretical problem
+ with a 32-bit overflow on some platforms, which the
+ old code only partially works around.
+SOLUTION:
+* Logging was radically simplified.
+* Functions are now more consistently named.
+* Configuration has no more defaults; instead,
+ we load a global default configuration file
+ before the user-specific configuration (which
+ can be used to override defaults); the global
+ default configuration file will be generated
+ from config.scm.
+* Time now distinguishes between
+ struct GNUNET_TIME_Absolute and
+ struct GNUNET_TIME_Relative. We use structs
+ so that the compiler won't coerce for us
+ (forcing the use of specific conversion
+ functions which have checks for overflows, etc.).
+ Naturally the need to use these functions makes
+ the code a bit more verbose, but that's a good
+ thing given the potential for bugs.
+* There is no longer any TIME API function that operates
+ on 32-bit seconds.
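The absolute/relative distinction above can be illustrated with a small sketch. The struct and function names below (`demo_time_absolute`, `demo_relative_to_absolute`, and so on) and the millisecond unit are assumptions made for this example; they are not the actual GNUNET_TIME definitions. The point is that wrapping the integer in two distinct struct types makes the compiler reject accidental mixing, and that every conversion goes through a function that saturates instead of overflowing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two time types; field names and the
   millisecond unit are assumptions of this sketch. */
struct demo_time_absolute { uint64_t abs_ms; };
struct demo_time_relative { uint64_t rel_ms; };

#define DEMO_FOREVER UINT64_MAX

/* now + rel, saturating at "forever" instead of wrapping around. */
struct demo_time_absolute
demo_relative_to_absolute(struct demo_time_absolute now,
                          struct demo_time_relative rel) {
  struct demo_time_absolute r;
  r.abs_ms = (rel.rel_ms > DEMO_FOREVER - now.abs_ms)
                 ? DEMO_FOREVER
                 : now.abs_ms + rel.rel_ms;
  return r;
}

/* Time left until `deadline`; zero if it has already passed. */
struct demo_time_relative
demo_remaining(struct demo_time_absolute now,
               struct demo_time_absolute deadline) {
  struct demo_time_relative r;
  if (deadline.abs_ms == DEMO_FOREVER)
    r.rel_ms = DEMO_FOREVER;
  else
    r.rel_ms = deadline.abs_ms > now.abs_ms ? deadline.abs_ms - now.abs_ms : 0;
  return r;
}

/* a + b for two durations, again saturating. */
struct demo_time_relative
demo_relative_add(struct demo_time_relative a, struct demo_time_relative b) {
  struct demo_time_relative r;
  r.rel_ms = (b.rel_ms > DEMO_FOREVER - a.rel_ms) ? DEMO_FOREVER
                                                  : a.rel_ms + b.rel_ms;
  return r;
}

/* Usage: the types document intent, and misuse fails to compile. */
void demo_time_example(void) {
  struct demo_time_absolute now = { 1000 };
  struct demo_time_relative timeout = { 5000 };
  struct demo_time_absolute deadline = demo_relative_to_absolute(now, timeout);
  assert(deadline.abs_ms == 6000);
  assert(demo_remaining(now, deadline).rel_ms == 5000);
  /* demo_remaining(now, timeout) would be a compile-time type error. */
}
```

The extra function calls are the verbosity the text mentions; in exchange, an expression such as `deadline + timeout` no longer compiles silently with the wrong meaning.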
+PROBLEM GROUP 3 (statistics):
+* Databases and others needed to store capacity values
+ similar to what stats was already doing, but
+ across process lifetimes ("state"-API was a partial
+ solution for that, but using it was clunky)
+* Only gnunetd could use statistics, but other
+ processes in the GNUnet system might have had
+ good uses for it as well
+SOLUTION:
+* New statistics library and service that offer
+ an API to inspect and modify statistics
+* Statistics are distinguished by service name
+ in addition to the name of the value
+* Statistics can be marked as persistent, in
+ which case they are written to disk when
+ the statistics service shuts down.
+ => One solution for existing stats uses,
+ application stats, database stats and
+ versioning information!
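A rough sketch of what "distinguished by service name in addition to the name of the value" and "marked as persistent" could look like as a data structure follows. Everything here is hypothetical: the function names, the fixed-size in-memory table, and the tab-separated file format are assumptions of the example, not the statistics service's real API or storage format.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory table; the real service runs as its own process. */
struct demo_stat {
  char service[32];   /* which service owns the value */
  char name[64];      /* name of the value within that service */
  uint64_t value;
  int persistent;     /* non-zero: survives a restart of the service */
};

#define DEMO_MAX_STATS 64
static struct demo_stat demo_stats[DEMO_MAX_STATS];
static size_t demo_stats_count;

static struct demo_stat *demo_stats_find(const char *service, const char *name) {
  for (size_t i = 0; i < demo_stats_count; i++)
    if (strcmp(demo_stats[i].service, service) == 0 &&
        strcmp(demo_stats[i].name, name) == 0)
      return &demo_stats[i];
  return NULL;
}

/* Create or overwrite the value identified by (service, name). */
int demo_stats_set(const char *service, const char *name,
                   uint64_t value, int persistent) {
  assert(service != NULL && name != NULL);
  struct demo_stat *s = demo_stats_find(service, name);
  if (s == NULL) {
    if (demo_stats_count == DEMO_MAX_STATS ||
        strlen(service) >= sizeof s->service ||
        strlen(name) >= sizeof s->name)
      return -1;
    s = &demo_stats[demo_stats_count++];
    strcpy(s->service, service);
    strcpy(s->name, name);
  }
  s->value = value;
  s->persistent = persistent;
  return 0;
}

/* Look up a value; returns 0 on success, -1 if it does not exist. */
int demo_stats_get(const char *service, const char *name, uint64_t *value) {
  const struct demo_stat *s = demo_stats_find(service, name);
  if (s == NULL) return -1;
  *value = s->value;
  return 0;
}

/* On shutdown: write only the persistent values; returns how many. */
int demo_stats_save(FILE *out) {
  int written = 0;
  for (size_t i = 0; i < demo_stats_count; i++) {
    const struct demo_stat *s = &demo_stats[i];
    if (!s->persistent) continue;
    fprintf(out, "%s\t%s\t%" PRIu64 "\n", s->service, s->name, s->value);
    written++;
  }
  return written;
}
```

With the service name as part of the key, two services can use the same value name without colliding, and a database quota can be kept across restarts simply by marking it persistent.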
+PROBLEM GROUP 4 (Testing):
+* The existing structure of the code with modules
+ stored in places far away from the test code
+ resulted in tools like lcov not giving good results.
+* The codebase had evolved into a complex, deeply
+ nested hierarchy often with directories that
+ then only contained a single file. Some of these
+ files had the same name making it hard to find
+ the source corresponding to a crash based on
+ the reported filename/line information.
+* Non-trivial portions of the code lacked good testcases,
+ and it was not always obvious which parts of the code
+ were not well-tested.
+SOLUTION:
+* Code that should be tested together is now
+ in the same directory.
+* The hierarchy is now essentially flat, each
+ major service having one directory under src/;
+ naming conventions help to make sure that
+ files have globally-unique names.
+* All code added to the new repository must
+ come with testcases with reasonable coverage.
+PROBLEM GROUP 5 (core/transports):
+* The new DV service requires session key exchange
+ between DV-neighbours, but the existing
+ session key code can not be used to achieve this.
+* The core requires certain services
+ (such as identity, pingpong, fragmentation,
+ transport, traffic, session) which makes it
+ meaningless to have these as modules
+ (especially since there is really only one
+ way to implement these)
+* HELLO's are larger than necessary since we need
+ one for each transport (and hence often have
+ to pick a subset of our HELLOs to transmit)
+* Fragmentation is done at the core level but only
+ required for a few transports; future versions of
+ these transports might want to be aware of fragments
+ and do things like retransmission
+* Autoconfiguration is hard since we have no good
+ way to detect (and then use securely) our external IP address
+* It is currently not possible for multiple transports
+ between the same pair of peers to be used concurrently
+ in the same direction(s)
+* We're using lots of cron-based jobs to periodically
+ try (and fail) to build and transmit
+SOLUTION:
+* Rewrite core to integrate most of these services
+ into one "core" service.
+* Redesign HELLO to contain the addresses for
+ all enabled transports in one message (avoiding
+ having to transmit the public key and signature
+ many, many times)
+* With discovery being part of the transport service,
+ it is now also possible to "learn" our external
+ IP address from other peers (we just add plausible
+ addresses to the list; other peers will discard
+ those addresses that don't work for them!)
+* New DV will consist of a "transport" and a
+ high-level service (to handle encrypted DV
+ control- and data-messages).
+* Move expiration from one field per HELLO to one
+ per address
+* Require signature in PONG, not in HELLO (and confirm
+ one address at a time)
+* Move fragmentation into helper library linked
+ against by UDP (and others that might need it)
+* Link-to-link advertising of our HELLO is transport
+ responsibility; global advertising/bootstrap remains
+ responsibility of higher layers
+* Change APIs to be event-based (transports pull for
+ transmission data instead of core pushing and failing)
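To make the HELLO redesign more concrete, here is a sketch of a single HELLO that carries the addresses of all transports, with one expiration time per address rather than one per HELLO. The layout is an in-memory illustration only; the struct and function names, the field sizes, and the string form of addresses are all assumptions of this example and say nothing about the actual wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_ADDRS 8

/* One address of one transport, with its own expiration (hypothetical). */
struct demo_address {
  char transport[8];    /* e.g. "tcp", "udp" */
  char addr[48];        /* transport-specific address, kept as text here */
  uint64_t expires_ms;  /* absolute expiration of this address only */
};

/* One HELLO per peer: the public key would be stored once here,
   followed by the addresses of every enabled transport. */
struct demo_hello {
  size_t count;
  struct demo_address addrs[DEMO_MAX_ADDRS];
};

/* Add an address, or extend the expiration if it is already known. */
int demo_hello_add(struct demo_hello *h, const char *transport,
                   const char *addr, uint64_t expires_ms) {
  assert(h != NULL && transport != NULL && addr != NULL);
  if (strlen(transport) >= sizeof h->addrs[0].transport ||
      strlen(addr) >= sizeof h->addrs[0].addr)
    return -1;
  for (size_t i = 0; i < h->count; i++) {
    struct demo_address *a = &h->addrs[i];
    if (strcmp(a->transport, transport) == 0 && strcmp(a->addr, addr) == 0) {
      if (expires_ms > a->expires_ms) a->expires_ms = expires_ms;
      return 0;
    }
  }
  if (h->count == DEMO_MAX_ADDRS) return -1;
  struct demo_address *a = &h->addrs[h->count++];
  strcpy(a->transport, transport);
  strcpy(a->addr, addr);
  a->expires_ms = expires_ms;
  return 0;
}

/* Drop addresses that have expired; returns how many were removed. */
size_t demo_hello_expire(struct demo_hello *h, uint64_t now_ms) {
  size_t kept = 0;
  for (size_t i = 0; i < h->count; i++)
    if (h->addrs[i].expires_ms > now_ms)
      h->addrs[kept++] = h->addrs[i];
  size_t dropped = h->count - kept;
  h->count = kept;
  return dropped;
}
```

Per-address expiration lets a stale address (for instance a guessed external IP that nobody confirmed) age out on its own while the other addresses in the same HELLO stay valid.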
+PROBLEM GROUP 6 (FS-APIs):
+* As with gnunetd, the FS-APIs are heavily threaded,
+ resulting in hard-to-understand code (slightly
+ better than gnunetd, but not much).
+* GTK in particular does not like this, resulting
+ in complicated code to switch to the GTK event
+ thread when needed (which may still be causing
+ problems on Gnome, not sure).
+* If GUIs die (or are not properly shut down), state
+ of current transactions is lost (FSUI only
+ saves to disk on shutdown)
+SOLUTION (draft, not done yet, details missing...):
+* Eliminate threads from FS-APIs
+ => Open question: how to best write the APIs to
+ allow integration with diverse event loops
+ of GUI libraries?
+* Store FS-state always also on disk
+ => Open question: how to do this without
+ compromising state/scalability?