From 0a217a8df1657b4334b55b0e4a6c7837a8dbcfd9 Mon Sep 17 00:00:00 2001
From: Christian Grothoff <christian@grothoff.org>
Date: Fri, 29 May 2009 00:46:26 +0000
Subject: ng

---
 RATIONALE | 246 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 246 insertions(+)
 create mode 100644 RATIONALE

diff --git a/RATIONALE b/RATIONALE
new file mode 100644
index 000000000..e68dcb883
--- /dev/null
+++ b/RATIONALE
@@ -0,0 +1,246 @@
+This document is a summary of why we're moving to GNUnet NG and what
+this major redesign tries to address.
+
+First of all, the redesign does not (intentionally) change anything
+fundamental about the application-level protocols or how files are
+encoded and shared.  However, it is not protocol-compatible due to
+other changes that do not relate to the essence of the application
+protocols.
+
+
+The redesign tries to address the following major problem groups
+describing issues that apply more or less to all GNUnet versions
+prior to 0.9.x:
+
+
+PROBLEM GROUP 1 (scalability):
+* The code was modular, but bugs were not.  Memory corruption
+  in one plugin could cause crashes in others and it was not
+  always easy to identify the culprit.  This approach
+  fundamentally does not scale (in the sense of GNUnet being
+  a framework and a GNUnet server running hundreds of 
+  different application protocols -- and the result still
+  being debuggable, secure and stable).
+* The code was heavily multi-threaded resulting in complex
+  locking operations.  GNUnet 0.8.x had over 70 different
+  mutexes and almost 1000 lines of lock/unlock operations.
+  It is challenging for even good programmers to program or 
+  maintain good multi-threaded code with this complexity.
+  The excessive locking essentially prevents GNUnet from
+  actually doing much in parallel on multicores.
+* Despite efforts like Freeway, it was virtually
+  impossible to contribute code to GNUnet that was not
+  written in C/C++.
+* Changes to the configuration almost always required restarts
+  of gnunetd; the existence of change-notifications does not
+  really change that (how many users are even aware of SIGHUP,
+  and how few options worked with that -- and at what expense
+  in code complexity!).
+* Valgrinding could only be done for the entire gnunetd
+  process.  Given that gnunetd does quite a bit of 
+  CPU-intensive crypto, this could not be done for a system
+  under heavy (or even moderate) load.
+* Stack overflows with threads, while rare under Linux these
+  days, result in really nasty and hard-to-find crashes.
+* Structs of function pointers in service APIs were
+  needlessly adding complexity, especially since in
+  most cases there was no polymorphism.
+
+SOLUTION:
+* Use multiple, loosely-coupled processes and one big select
+  loop in each (supported by a powerful library to eliminate
+  code duplication for each process).  
+* Eliminate all threads, manage the processes with a 
+  master-process (gnunet-arm, for automatic restart manager) 
+  which also ensures that configuration changes trigger the 
+  necessary restarts.
+* Use continuations (with timeouts) as a way to unify
+  cron-jobs and other event-based code (such as waiting
+  on network IO).
+  => Using multiple processes ensures that memory corruption
+     stays localized.  
+  => Using multiple processes will make it easy to contribute
+     services written in other language(s). 
+  => Individual services can now be subjected to valgrind
+  => Process priorities can be used to schedule the CPU better
+  Note that we cannot just use one process with a big
+  select loop because we have blocking operations (and the
+  blocking is outside of our control, thanks MySQL,
+  sqlite, gethostbyaddr, etc.).  So in order to perform
+  reasonably well, we need some construct for parallel
+  execution.
+
+  RULE: If your service contains blocking functions, it
+        MUST be a process by itself.
+* Eliminate structs with function pointers for service APIs;
+  instead, provide a library (still ending in _service.h) API
+  that transmits the requests nicely to the respective
+  process (easier to use, no need to "request" service
+  in the first place; API can cause process to be started/stopped
+  via ARM if necessary).
+
+
+PROBLEM GROUP 2 (UTIL-APIs causing bugs):
+* The existing logging functions were awkward to use and
+  their expressive power was never really used for much.
+* While we had some rules for naming functions, there
+  were still plenty of inconsistencies.
+* Specification of default values in configuration could 
+  result in inconsistencies between defaults in
+  config.scm and defaults used by the program; also,
+  different defaults might have been specified for the
+  same option in different parts of the program.
+* The TIME API did not distinguish between absolute
+  and relative time, requiring users to know which
+  type of value some variable contained and to
+  manually convert properly.  Combined with the
+  possibility of integer overflows this is a major
+  source of bugs.
+* The TIME API for seconds has a theoretical problem
+  with a 32-bit overflow on some platforms which is
+  only partially fixed by the old code with some
+  hackery.
+
+SOLUTION:
+* Logging was radically simplified.
+* Functions are now more consistently named.
+* Configuration has no more defaults; instead,
+  we load a global default configuration file
+  before the user-specific configuration (which 
+  can be used to override defaults); the global
+  default configuration file will be generated 
+  from config.scm.
+* Time now distinguishes between
+  struct GNUNET_TIME_Absolute and
+  struct GNUNET_TIME_Relative.  We use structs
+  so that the compiler won't coerce for us 
+  (forcing the use of specific conversion
+  functions which have checks for overflows, etc.).
+  Naturally the need to use these functions makes
+  the code a bit more verbose, but that's a good
+  thing given the potential for bugs.
+* There is no more TIME API function to do anything
+  with 32-bit seconds.
+
+
+PROBLEM GROUP 3 (statistics):
+* Databases and others needed to store capacity values
+  similar to what stats was already doing, but
+  across process lifetimes ("state"-API was a partial
+  solution for that, but using it was clunky)
+* Only gnunetd could use statistics, but other
+  processes in the GNUnet system might have had
+  good uses for it as well
+
+SOLUTION:
+* New statistics library and service that offer
+  an API to inspect and modify statistics
+* Statistics are distinguished by service name
+  in addition to the name of the value
+* Statistics can be marked as persistent, in
+  which case they are written to disk when
+  the statistics service shuts down.
+  => One solution for existing stats uses,
+     application stats, database stats and
+     versioning information!
+
+
+PROBLEM GROUP 4 (Testing):
+* The existing structure of the code with modules
+  stored in places far away from the test code
+  resulted in tools like lcov not giving good results.
+* The codebase had evolved into a complex, deeply
+  nested hierarchy often with directories that
+  then only contained a single file.  Some of these
+  files had the same name making it hard to find
+  the source corresponding to a crash based on 
+  the reported filename/line information.
+* Non-trivial portions of the code lacked good testcases,
+  and it was not always obvious which parts of the code 
+  were not well-tested.
+
+SOLUTION:
+* Code that should be tested together is now
+  in the same directory.
+* The hierarchy is now essentially flat, each
+  major service having one directory under src/;
+  naming conventions help to make sure that
+  files have globally-unique names.
+* All code added to the new repository must
+  come with testcases with reasonable coverage.
+
+
+PROBLEM GROUP 5 (core/transports):
+* The new DV service requires session key exchange
+  between DV-neighbours, but the existing
+  session key code cannot be used to achieve this.
+* The core requires certain services
+  (such as identity, pingpong, fragmentation,
+   transport, traffic, session) which makes it 
+  meaningless to have these as modules
+  (especially since there is really only one
+  way to implement these)
+* HELLOs are larger than necessary since we need
+  one for each transport (and hence often have
+  to pick a subset of our HELLOs to transmit)
+* Fragmentation is done at the core level but only
+  required for a few transports; future versions of
+  these transports might want to be aware of fragments
+  and do things like retransmission
+* Autoconfiguration is hard since we have no good
+  way to detect (and then use securely) our external IP address
+* It is currently not possible for multiple transports
+  between the same pair of peers to be used concurrently
+  in the same direction(s)
+* We're using lots of cron-based jobs to periodically
+  try (and fail) to build and transmit
+
+SOLUTION:
+* Rewrite core to integrate most of these services
+  into one "core" service.
+* Redesign HELLO to contain the addresses for
+  all enabled transports in one message (avoiding
+  having to transmit the public key and signature
+  many, many times)
+* With discovery being part of the transport service,
+  it is now also possible to "learn" our external
+  IP address from other peers (we just add plausible
+  addresses to the list; other peers will discard 
+  those addresses that don't work for them!)
+* New DV will consist of a "transport" and a 
+  high-level service (to handle encrypted DV
+  control- and data-messages).
+* Move expiration from one field per HELLO to one
+  per address
+* Require signature in PONG, not in HELLO (and confirm
+  one address at a time)
+* Move fragmentation into helper library linked
+  against by UDP (and others that might need it)
+* Link-to-link advertising of our HELLO is transport
+  responsibility; global advertising/bootstrap remains
+  responsibility of higher layers
+* Change APIs to be event-based (transports pull for
+  transmission data instead of core pushing and failing)
+
+
+PROBLEM GROUP 6 (FS-APIs):
+* As with gnunetd, the FS-APIs are heavily threaded,
+  resulting in hard-to-understand code (slightly
+  better than gnunetd, but not much).
+* GTK in particular does not like this, resulting 
+  in complicated code to switch to the GTK event
+  thread when needed (which may still be causing
+  problems on Gnome, not sure).
+* If GUIs die (or are not properly shut down), state
+  of current transactions is lost (FSUI only
+  saves to disk on shutdown)
+
+SOLUTION (draft, not done yet, details missing...):
+* Eliminate threads from FS-APIs
+  => Open question: how to best write the APIs to
+     allow integration with diverse event loops
+     of GUI libraries?
+* Store FS-state always also on disk
+  => Open question: how to do this without 
+     compromising state/scalability?
+
-- 
cgit v1.2.3