DD 76: Paivana - Fighting AI Bots with GNU Taler
################################################

Summary
=======

This design document describes the architecture of an AI Web firewall using GNU
Taler, as well as the new features that are required for its implementation.

Motivation
==========

AI bots are causing enormous amounts of traffic by scraping sites like git
forges. They respect neither robots.txt nor 5xx HTTP responses. Solutions like
Anubis and IP-based blocking no longer work at this point.

Requirements
============

* Must withstand high traffic from bots: requests made before a payment has
  happened must be *very* cheap, both in terms of response generation and
  database interaction. This includes good support for caching.
* Should work not just for our paivana-httpd but also for Turnstile-style
  paywalls that need to work with purely static paywall pages without
  PHP sessions.


Proposed Solution
=================

Architecture
------------

* paivana-httpd is a reverse proxy that sits between ingress HTTP(S) traffic
  and the protected upstream service.
* paivana-httpd is configured with a particular merchant backend.
* A payment template must be set up in the merchant backend (called ``{template_id}``
  from here on).

Steps:

* The browser visits ``{website}``
  (for example, ``https://git.taler.net``), where
  ``{domain}`` is the domain name of ``{website}``.
* paivana-httpd works as a reverse proxy for
  ``{website}``. Whenever it is called for a non-whitelisted
  URL, it checks for the presence of a Paivana cookie that is valid for
  this client IP address and ``{website}`` at this time.
  The *Paivana cookie* is computed as:

  ``cur_time || '-' || crock32(SHA512(website || client_ip || paivana_server_secret || cur_time))``.

  Here, ``cur_time`` in the prefix is the expiration time of the
  cookie (and thus of the access to the article) in seconds
  (to keep it short), while in the hash it is usually a binary GNUnet
  timestamp in network byte order.
  ``crock32`` is GNUnet's Crockford-inspired base32 encoding.

  * If such a cookie is set and valid, the request is
    reverse-proxied to upstream. *Stop.*
  * Otherwise, an HTTP 302 Redirect to
    ``/.well-known/paivana/templates/$ID#$WEBSITE``
    is returned. Here, ``$ID`` is the template ID and
    ``$WEBSITE`` is the base64url encoding of the full URL of
    the website currently being visited. This way,
    the template page can be fully static and cached, and the
    JavaScript logic on that page can learn which website
    to pay for (and, after payment, redirect the browser there).

* When the browser requests ``/.well-known/paivana/templates/$ID``,
  a static **cacheable** paywall page is returned,
  including a machine-readable ``Paivana`` HTTP header with
  the ``taler://pay-template/`` URL minus the client-computed
  ``{paivana_id}`` and fulfillment URL (see below).

* The browser (rendering the paywall page) generates a random
  *paivana ID* via JS using the current time (``cur_time``) in seconds
  since the Epoch and the current URL (``{website}``) plus some
  freshly generated entropy (``{nonce}``):

  ``paivana_id := cur_time || '-' || b64url(SHA256(nonce || website || cur_time))``.

  Here, ``b64url`` is the RFC 7515 base64url encoder, used to keep
  the result short (the same reason SHA-256 is used).
  The same computation could also easily be done by a non-JS client
  that processes the ``Paivana`` HTTP header (or by a GNU Taler wallet
  running as a Web extension).

* Based on this paivana ID, a
  ``taler://pay-template/{merchant_backend}/{template_id}?session_id={paivana_id}&fulfillment_url={website}``
  URI is generated and rendered as a QR code and link, prompting
  the user to pay for access to ``{website}`` using GNU Taler.

* The JavaScript in the paywall page running in the browser
  (or the non-JS client) long-polls
  on a new ``https://{merchant_backend}/sessions/{paivana_id}``
  endpoint that returns when an order with the given session ID has been paid
  for (regardless of the order ID, which is not known to the browser).
* A wallet now needs to instantiate the pay template, passing the
  ``session_id`` and the ``fulfillment_url`` as additional inputs
  to the order creation (the session ID here will work just like
  the existing use of ``session_ids`` in session-bound payments).
  Similarly, ``{website}`` works as the fulfillment URL as usual.
* The wallet then must pay for the resulting order
  by talking to the merchant backend.
* When the long-poller returns and the payment has succeeded, the
  browser (still rendering the paywall page) also learns the order ID.
* The JavaScript of the paywall page (or the non-JS client
  processing the ``Paivana`` HTTP header) then POSTs the order ID,
  ``nonce``, ``cur_time``
  and ``website`` to ``{domain}/.well-known/paivana``.
* paivana-httpd computes the paivana ID and checks whether the given
  order ID was indeed paid recently for the computed paivana ID.
  If so, it generates an HTTP response which sets the Paivana cookie
  and redirects to the fulfillment URL (which is the original ``{website}``).
* The browser reloads the page with the correct
  Paivana cookie (see first step).


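The cookie check from the steps above can be illustrated with a short Python
sketch. It is not the paivana-httpd implementation. The function names, the
UTF-8 encoding of ``website`` and ``client_ip``, the plain concatenation
without separators, the 64-bit big-endian microsecond timestamp and the
``crock32`` helper (a Crockford-style base32 encoder standing in for GNUnet's)
are all assumptions made for illustration.

.. code-block:: python

   import hashlib
   import hmac
   import struct

   # Crockford base32 alphabet (no I, L, O, U); assumed to match GNUnet's.
   CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


   def crock32(data: bytes) -> str:
       """Encode bytes 5 bits at a time, zero-padding the final group."""
       bits = 0
       nbits = 0
       out = []
       for byte in data:
           bits = (bits << 8) | byte
           nbits += 8
           while nbits >= 5:
               nbits -= 5
               out.append(CROCKFORD[(bits >> nbits) & 31])
           bits &= (1 << nbits) - 1  # keep only the bits not yet emitted
       if nbits:
           out.append(CROCKFORD[(bits << (5 - nbits)) & 31])
       return "".join(out)


   def make_cookie(website: str, client_ip: str, secret: bytes, expiry_s: int) -> str:
       """Build ``expiry || '-' || crock32(SHA512(...))`` for one client."""
       # Assumption: the hashed timestamp is microseconds as a big-endian uint64.
       ts_nbo = struct.pack(">Q", expiry_s * 1_000_000)
       digest = hashlib.sha512(
           website.encode("utf-8") + client_ip.encode("utf-8") + secret + ts_nbo
       ).digest()
       return f"{expiry_s}-{crock32(digest)}"


   def check_cookie(cookie: str, website: str, client_ip: str,
                    secret: bytes, now_s: int) -> bool:
       """Accept the cookie only if it is unexpired and the hash matches."""
       prefix, sep, _ = cookie.partition("-")
       if not sep or not (prefix.isascii() and prefix.isdigit()) or len(prefix) > 12:
           return False
       expiry_s = int(prefix)
       if expiry_s < now_s:
           return False
       expected = make_cookie(website, client_ip, secret, expiry_s)
       # Constant-time comparison, no database lookup needed.
       return hmac.compare_digest(cookie.encode("utf-8"), expected.encode("utf-8"))
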
Problems:
---------

* A smart attacker might still create a lot of orders via the pay-template.

  * Solution A: Don't care, unlikely to happen in the first place.
  * Solution B: Rate-limit template instantiation on a per-IP basis.

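Solution B could, for example, be realized with a per-IP token bucket. The
following Python sketch is only one possible approach and is not taken from
the design. The class name, its parameters and the in-memory dictionary are
hypothetical.

.. code-block:: python

   import time


   class PerIpRateLimiter:
       """Token bucket per client IP: ``burst`` requests, refilled at ``rate_per_s``."""

       def __init__(self, rate_per_s: float, burst: int, clock=time.monotonic):
           self.rate = float(rate_per_s)
           self.burst = float(burst)
           self._clock = clock
           # ip -> (tokens left, time of last update); a real deployment
           # would also expire idle entries to bound memory use.
           self._buckets = {}

       def allow(self, ip: str) -> bool:
           """Return True if this IP may instantiate the template now."""
           now = self._clock()
           tokens, last = self._buckets.get(ip, (self.burst, now))
           # Refill proportionally to the elapsed time, capped at the burst size.
           tokens = min(self.burst, tokens + (now - last) * self.rate)
           if tokens >= 1.0:
               self._buckets[ip] = (tokens - 1.0, now)
               return True
           self._buckets[ip] = (tokens, now)
           return False
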
Implementation:
---------------

* The merchant backend needs a way to look up order IDs under a ``session_id``
  (DONE: e027e729..b476f8ae).
* The merchant backend needs a way to instantiate templates with
  a given ``session_id`` and ``fulfillment_url``. This also
  requires extending the allowed responses for templates in general.
* The Paivana component needs to be implemented.
* Wallet-core needs support for a ``session_id`` and
  ``fulfillment_url`` in pay templates.


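To make the ``session_id`` and ``fulfillment_url`` inputs more concrete, the
following Python sketch derives a paivana ID and the matching
``taler://pay-template/`` URI as described in the steps, and shows how a
server could recompute the ID from the POSTed values. The function names, the
16-byte nonce, hashing ``cur_time`` as a decimal ASCII string and the
percent-encoding of the query values are assumptions, not part of the design.

.. code-block:: python

   import base64
   import hashlib
   import hmac
   import secrets
   import time
   from urllib.parse import quote


   def b64url(data: bytes) -> str:
       """Unpadded base64url encoding as defined in RFC 7515."""
       return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


   def paivana_id(nonce: bytes, website: str, cur_time: int) -> str:
       """cur_time || '-' || b64url(SHA256(nonce || website || cur_time))."""
       # Assumption: website is hashed as UTF-8, cur_time as decimal ASCII.
       material = nonce + website.encode("utf-8") + str(cur_time).encode("ascii")
       return f"{cur_time}-{b64url(hashlib.sha256(material).digest())}"


   def pay_template_uri(merchant_backend: str, template_id: str,
                        pid: str, website: str) -> str:
       """Build the URI that is shown as a QR code and link."""
       # Assumption: query values are percent-encoded.
       return (
           f"taler://pay-template/{merchant_backend}/{template_id}"
           f"?session_id={quote(pid, safe='')}"
           f"&fulfillment_url={quote(website, safe='')}"
       )


   def matches_paid_session(nonce: bytes, website: str, cur_time: int,
                            paid_session_id: str) -> bool:
       """Server side: recompute the ID and compare it to the paid session ID."""
       expected = paivana_id(nonce, website, cur_time)
       return hmac.compare_digest(expected.encode("utf-8"),
                                  paid_session_id.encode("utf-8"))


   if __name__ == "__main__":
       site = "https://git.taler.net/"
       pid = paivana_id(secrets.token_bytes(16), site, int(time.time()))
       # Host and template names below are placeholders.
       print(pay_template_uri("backend.example.com", "paywall", pid, site))
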
Test Plan
=========

* Deploy it for git.taler.net

Definition of Done
==================

N/A

Alternatives
============

* Do not re-use the session ID mechanism but introduce some new concept.
  This has the drawback that we would need additional tables and indices,
  and the existing use of the session ID closely parallels this one.
* Instead of doing a 302 Redirect, cache control could have been achieved by
  specifying a "Vary: Cookie" HTTP header. We may combine these and use
  that to additionally enable caching of the 302 Redirect. The 302 solution
  has the advantage that there is only one page to cache per template, and
  the disadvantage of an additional redirect. Note that this is purely
  a frontend design choice: wallets and merchant backends work nicely with
  either approach.

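The "one page to cache per template" property of the 302 approach follows from
carrying the visited URL in the fragment, which browsers do not send to the
server. A minimal Python sketch of building and decoding such a redirect
target is shown here. The function names and the choice to strip base64
padding are assumptions.

.. code-block:: python

   import base64


   def redirect_location(template_id: str, full_url: str) -> str:
       """Build the 302 target; the visited URL travels only in the fragment."""
       encoded = base64.urlsafe_b64encode(full_url.encode("utf-8"))
       fragment = encoded.rstrip(b"=").decode("ascii")
       return f"/.well-known/paivana/templates/{template_id}#{fragment}"


   def website_from_fragment(fragment: str) -> str:
       """What the paywall page's script would do to recover the URL."""
       padded = fragment + "=" * (-len(fragment) % 4)  # restore stripped padding
       return base64.urlsafe_b64decode(padded).decode("utf-8")
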
Drawbacks
=========

* This exposes an order ID to anyone who knows the session ID. This is
  clearly not an issue in this context. For the existing uses of
  the session ID, it also seems clear that knowing the session ID
  requires access that would already give an attacker any order ID,
  so this seems harmless.


Discussion / Q&A
================