ng

author: Christian Grothoff <christian@grothoff.org> 2009-05-29 00:46:26 +0000
committer: Christian Grothoff <christian@grothoff.org> 2009-05-29 00:46:26 +0000
commit: 0a217a8df1657b4334b55b0e4a6c7837a8dbcfd9 (patch)
tree: 6b552f40eb089db96409a312a98d9b12bd669102 /RATIONALE
download: gnunet-0a217a8df1657b4334b55b0e4a6c7837a8dbcfd9.tar.gz
gnunet-0a217a8df1657b4334b55b0e4a6c7837a8dbcfd9.zip
1 files changed, 246 insertions, 0 deletions
diff --git a/RATIONALE b/RATIONALE
new file mode 100644
index 000000000..e68dcb883
--- /dev/null
+++ b/RATIONALE
@@ -0,0 +1,246 @@
+This document is a summary of why we're moving to GNUnet NG and what
+this major redesign tries to address.
+First of all, the redesign does not (intentionally) change anything
+fundamental about the application-level protocols or how files are
+encoded and shared.  However, it is not protocol-compatible due to
+other changes that do not relate to the essence of the application
+protocols.
+The redesign tries to address the following major problem groups
+describing issues that apply more or less to all GNUnet versions
+prior to 0.9.x:
+PROBLEM GROUP 1 (scalability):
+* The code was modular, but bugs were not.  Memory corruption
+  in one plugin could cause crashes in others and it was not
+  always easy to identify the culprit.  This approach
+  fundamentally does not scale (in the sense of GNUnet being
+  a framework and a GNUnet server running hundreds of 
+  different application protocols -- and the result still
+  being debuggable, secure and stable).
+* The code was heavily multi-threaded resulting in complex
+  locking operations.  GNUnet 0.8.x had over 70 different
+  mutexes and almost 1000 lines of lock/unlock operations.
+  It is challenging for even good programmers to program or 
+  maintain good multi-threaded code with this complexity.
+  The excessive locking essentially prevents GNUnet from
+  actually doing much in parallel on multicores.
+* Despite efforts like Freeway, it was virtually 
+  impossible to contribute code to GNUnet that was not
+  written in C/C++.
+* Changes to the configuration almost always required restarts
+  of gnunetd; the existence of change-notifications does not
+  really change that (how many users are even aware of SIGHUP,
+  and how few options worked with that -- and at what expense
+  in code complexity!).
+* Valgrinding could only be done for the entire gnunetd
+  process.  Given that gnunetd does quite a bit of 
+  CPU-intensive crypto, this could not be done for a system
+  under heavy (or even moderate) load.
+* Stack overflows with threads, while rare under Linux these
+  days, result in really nasty and hard-to-find crashes.
+* Structs of function pointers in service APIs were
+  needlessly adding complexity, especially since in
+  most cases there was no polymorphism.
+SOLUTION:
+* Use multiple, loosely-coupled processes and one big select
+  loop in each (supported by a powerful library to eliminate
+  code duplication for each process).  
+* Eliminate all threads, manage the processes with a 
+  master-process (gnunet-arm, for automatic restart manager) 
+  which also ensures that configuration changes trigger the 
+  necessary restarts.
+* Use continuations (with timeouts) as a way to unify
+  cron-jobs and other event-based code (such as waiting
+  on network IO).
+  => Using multiple processes ensures that memory corruption
+     stays localized.  
+  => Using multiple processes will make it easy to contribute
+     services written in other language(s). 
+  => Individual services can now be subjected to valgrind
+  => Process priorities can be used to schedule the CPU better
+  Note that we cannot just use one process with a big
+  select loop because we have blocking operations (and the
+  blocking is outside of our control, thanks MySQL,
+  sqlite, gethostbyaddr, etc.).  So in order to perform
+  reasonably well, we need some construct for parallel
+  execution.
+  RULE: If your service contains blocking functions, it
+        MUST be a process by itself.
+* Eliminate structs with function pointers for service APIs;
+  instead, provide a library (still ending in _service.h) API
+  that transmits the requests nicely to the respective
+  process (easier to use, no need to "request" service
+  in the first place; API can cause process to be started/stopped
+  via ARM if necessary).
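The idea of continuations with timeouts driven by a single select loop can be sketched in C as follows. This is only a minimal illustration under stated assumptions: the names sched_add, sched_run and record are hypothetical and are not GNUnet's scheduler API, and the sketch handles timeouts only. A real loop would also pass file descriptor sets to select so that network IO and cron-style jobs share the same mechanism.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical names for illustration; not GNUnet's scheduler API. */
typedef void (*task_fn) (void *cls);

struct task
{
  long long deadline_ms; /* absolute time at which the task is due */
  task_fn fn;            /* continuation to call */
  void *cls;             /* closure passed to the continuation */
  int used;              /* non-zero if this slot holds a pending task */
};

#define MAX_TASKS 16
static struct task tasks[MAX_TASKS];

static long long
now_ms (void)
{
  struct timeval tv;

  gettimeofday (&tv, NULL);
  return (long long) tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

/* Schedule fn(cls) to run after delay_ms; returns 0 on success, -1 if full. */
static int
sched_add (long long delay_ms, task_fn fn, void *cls)
{
  for (int i = 0; i < MAX_TASKS; i++)
  {
    if (tasks[i].used)
      continue;
    tasks[i].deadline_ms = now_ms () + delay_ms;
    tasks[i].fn = fn;
    tasks[i].cls = cls;
    tasks[i].used = 1;
    return 0;
  }
  return -1;
}

/* Run until no task is pending: sleep in select() until the earliest
   deadline, then invoke that task's continuation. */
static void
sched_run (void)
{
  for (;;)
  {
    int next = -1;

    for (int i = 0; i < MAX_TASKS; i++)
      if (tasks[i].used
          && (next < 0 || tasks[i].deadline_ms < tasks[next].deadline_ms))
        next = i;
    if (next < 0)
      return;
    long long wait = tasks[next].deadline_ms - now_ms ();
    if (wait > 0)
    {
      struct timeval tv;

      tv.tv_sec = wait / 1000;
      tv.tv_usec = (wait % 1000) * 1000;
      select (0, NULL, NULL, NULL, &tv); /* real code would pass fd sets */
      continue;                          /* re-check; select may wake early */
    }
    task_fn fn = tasks[next].fn;
    void *cls = tasks[next].cls;

    tasks[next].used = 0; /* free the slot before running the continuation */
    fn (cls);
  }
}

/* Example continuation: records the order in which tasks ran. */
static int log_buf[8];
static int log_len;

static void
record (void *cls)
{
  if (log_len < 8)
    log_buf[log_len++] = *(int *) cls;
}
```

A caller would register continuations with sched_add and then hand control to sched_run; tasks run in deadline order regardless of the order in which they were added.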
+PROBLEM GROUP 2 (UTIL-APIs causing bugs):
+* The existing logging functions were awkward to use and
+  their expressive power was never really used for much.
+* While we had some rules for naming functions, there
+  were still plenty of inconsistencies.
+* Specification of default values in configuration could 
+  result in inconsistencies between defaults in
+  config.scm and defaults used by the program; also,
+  different defaults might have been specified for the
+  same option in different parts of the program.
+* The TIME API did not distinguish between absolute
+  and relative time, requiring users to know which
+  type of value some variable contained and to
+  manually convert properly.  Combined with the
+  possibility of integer overflows this is a major
+  source of bugs.
+* The TIME API for seconds has a theoretical problem
+  with a 32-bit overflow on some platforms which is
+  only partially fixed by the old code with some
+  hackery.
+SOLUTION:
+* Logging was radically simplified.
+* Functions are now more consistently named.
+* Configuration has no more defaults; instead,
+  we load a global default configuration file
+  before the user-specific configuration (which 
+  can be used to override defaults); the global
+  default configuration file will be generated 
+  from config.scm.
+* Time now distinguishes between
+  struct GNUNET_TIME_Absolute and
+  struct GNUNET_TIME_Relative.  We use structs
+  so that the compiler won't coerce for us 
+  (forcing the use of specific conversion
+  functions which have checks for overflows, etc.).
+  Naturally the need to use these functions makes
+  the code a bit more verbose, but that's a good
+  thing given the potential for bugs.
+* There is no more TIME API function to do anything
+  with 32-bit seconds
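The point about using distinct structs for absolute and relative time can be shown with a short C sketch. The type and function names below (time_abs, time_rel, time_abs_add, time_abs_diff) and the millisecond unit are assumptions made for illustration; the actual field names and helpers of struct GNUNET_TIME_Absolute and struct GNUNET_TIME_Relative may differ.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the two GNUnet time types. */
struct time_abs
{
  uint64_t ms; /* milliseconds since some fixed epoch */
};

struct time_rel
{
  uint64_t ms; /* duration in milliseconds */
};

/* Absolute + relative = absolute; saturates at UINT64_MAX ("forever")
   instead of wrapping around on overflow. */
static struct time_abs
time_abs_add (struct time_abs start, struct time_rel d)
{
  struct time_abs r;

  r.ms = (d.ms > UINT64_MAX - start.ms) ? UINT64_MAX : start.ms + d.ms;
  return r;
}

/* Time remaining from 'from' until 'until'; zero if 'until' has passed,
   so the unsigned subtraction can never underflow. */
static struct time_rel
time_abs_diff (struct time_abs from, struct time_abs until)
{
  struct time_rel r;

  r.ms = (until.ms > from.ms) ? until.ms - from.ms : 0;
  return r;
}
```

Because the two types are separate structs, passing a struct time_rel where a struct time_abs is expected is a compile-time error, which would not be the case with two integer typedefs.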
+PROBLEM GROUP 3 (statistics):
+* Databases and others needed to store capacity values
+  similar to what stats was already doing, but
+  across process lifetimes ("state"-API was a partial
+  solution for that, but using it was clunky)
+* Only gnunetd could use statistics, but other
+  processes in the GNUnet system might have had
+  good uses for it as well
+SOLUTION:
+* New statistics library and service that offer
+  an API to inspect and modify statistics
+* Statistics are distinguished by service name
+  in addition to the name of the value
+* Statistics can be marked as persistent, in
+  which case they are written to disk when
+  the statistics service shuts down.
+  => One solution for existing stats uses,
+     application stats, database stats and
+     versioning information!
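As a rough sketch of what "distinguished by service name in addition to the name of the value" and "marked as persistent" could mean in practice, the following C fragment keeps an in-memory table keyed by both names. Everything here (stat_entry, stats_set, stats_find, stats_count_persistent, the fixed table size) is hypothetical and is not the API of the statistics service; writing persistent entries to disk is left out.

```c
#include <stdint.h>
#include <string.h>

#define MAX_STATS 64

/* Hypothetical in-memory record; not the statistics service's format. */
struct stat_entry
{
  const char *service; /* e.g. "datastore" */
  const char *name;    /* name of the value within that service */
  uint64_t value;
  int persistent;      /* non-zero: should survive a service restart */
};

static struct stat_entry stats[MAX_STATS];
static size_t stats_len;

/* Look up a value by (service, name); returns NULL if unknown. */
static struct stat_entry *
stats_find (const char *service, const char *name)
{
  for (size_t i = 0; i < stats_len; i++)
    if (0 == strcmp (stats[i].service, service)
        && 0 == strcmp (stats[i].name, name))
      return &stats[i];
  return NULL;
}

/* Create or overwrite a value; returns 0 on success, -1 if the table is
   full.  The caller must keep the two strings alive. */
static int
stats_set (const char *service, const char *name,
           uint64_t value, int persistent)
{
  struct stat_entry *e = stats_find (service, name);

  if (NULL == e)
  {
    if (MAX_STATS == stats_len)
      return -1;
    e = &stats[stats_len++];
    e->service = service;
    e->name = name;
  }
  e->value = value;
  e->persistent = persistent;
  return 0;
}

/* Number of entries that would have to be written to disk on shutdown. */
static size_t
stats_count_persistent (void)
{
  size_t n = 0;

  for (size_t i = 0; i < stats_len; i++)
    if (stats[i].persistent)
      n++;
  return n;
}
```

Two services can use the same value name without clashing, since the lookup key is the pair of service name and value name.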
+PROBLEM GROUP 4 (Testing):
+* The existing structure of the code with modules
+  stored in places far away from the test code
+  resulted in tools like lcov not giving good results.
+* The codebase had evolved into a complex, deeply
+  nested hierarchy often with directories that
+  then only contained a single file.  Some of these
+  files had the same name making it hard to find
+  the source corresponding to a crash based on 
+  the reported filename/line information.
+* Non-trivial portions of the code lacked good testcases,
+  and it was not always obvious which parts of the code 
+  were not well-tested.
+SOLUTION:
+* Code that should be tested together is now
+  in the same directory.
+* The hierarchy is now essentially flat, each
+  major service having one directory under src/;
+  naming conventions help to make sure that
+  files have globally-unique names
+* All code added to the new repository must
+  come with testcases with reasonable coverage.
+PROBLEM GROUP 5 (core/transports):
+* The new DV service requires session key exchange
+  between DV-neighbours, but the existing
+  session key code can not be used to achieve this.
+* The core requires certain services
+  (such as identity, pingpong, fragmentation,
+   transport, traffic, session) which makes it 
+  meaningless to have these as modules
+  (especially since there is really only one
+  way to implement these)
+* HELLOs are larger than necessary since we need
+  one for each transport (and hence often have
+  to pick a subset of our HELLOs to transmit)
+* Fragmentation is done at the core level but only
+  required for a few transports; future versions of
+  these transports might want to be aware of fragments
+  and do things like retransmission
+* Autoconfiguration is hard since we have no good
+  way to detect (and then use securely) our external IP address
+* It is currently not possible for multiple transports
+  between the same pair of peers to be used concurrently
+  in the same direction(s)
+* We're using lots of cron-based jobs to periodically
+  try (and fail) to build and transmit
+SOLUTION:
+* Rewrite core to integrate most of these services
+  into one "core" service.
+* Redesign HELLO to contain the addresses for
+  all enabled transports in one message (avoiding
+  having to transmit the public key and signature
+  many, many times)
+* With discovery being part of the transport service,
+  it is now also possible to "learn" our external
+  IP address from other peers (we just add plausible
+  addresses to the list; other peers will discard 
+  those addresses that don't work for them!)
+* New DV will consist of a "transport" and a 
+  high-level service (to handle encrypted DV
+  control- and data-messages).
+* Move expiration from one field per HELLO to one
+  per address
+* Require signature in PONG, not in HELLO (and confirm
+  one address at a time)
+* Move fragmentation into helper library linked
+  against by UDP (and others that might need it)
+* Link-to-link advertising of our HELLO is transport
+  responsibility; global advertising/bootstrap remains
+  responsibility of higher layers
+* Change APIs to be event-based (transports pull for
+  transmission data instead of core pushing and failing)
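The redesigned HELLO, with addresses for all transports in one message and one expiration per address, might be modelled in memory roughly as below. This is a sketch under assumptions: the struct layout, the field names, the limit of eight addresses and the helper hello_prune are all hypothetical, and this is not the wire format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory model; not the actual HELLO wire format. */
struct hello_address
{
  const char *transport; /* name of the transport, e.g. "tcp" or "udp" */
  const char *address;   /* transport-specific address string */
  uint64_t expires_ms;   /* absolute expiration of this address only */
};

struct hello
{
  /* The peer's public key would be carried once here, not per transport. */
  struct hello_address addrs[8];
  size_t num_addrs;
};

/* Drop expired addresses in place, preserving the order of the rest;
   returns the number of addresses that remain. */
static size_t
hello_prune (struct hello *h, uint64_t now_ms)
{
  size_t kept = 0;

  for (size_t i = 0; i < h->num_addrs; i++)
    if (h->addrs[i].expires_ms > now_ms)
      h->addrs[kept++] = h->addrs[i];
  h->num_addrs = kept;
  return kept;
}
```

With a single expiration field per HELLO, one stale address would force the whole message to be refreshed; with per-address expiration, only the stale entry is dropped.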
+PROBLEM GROUP 6 (FS-APIs):
+* As with gnunetd, the FS-APIs are heavily threaded,
+  resulting in hard-to-understand code (slightly
+  better than gnunetd, but not much).
+* GTK in particular does not like this, resulting 
+  in complicated code to switch to the GTK event
+  thread when needed (which may still be causing
+  problems on Gnome, not sure).
+* If GUIs die (or are not properly shut down), state
+  of current transactions is lost (FSUI only
+  saves to disk on shutdown)
+SOLUTION (draft, not done yet, details missing...):
+* Eliminate threads from FS-APIs
+  => Open question: how to best write the APIs to
+     allow integration with diverse event loops
+     of GUI libraries?
+* Store FS-state always also on disk
+  => Open question: how to do this without 
+     compromising state/scalability?