DD 76: Paivana - Fighting AI Bots with GNU Taler
################################################

Summary
=======

This design document describes the architecture of an AI Web firewall using GNU
Taler, as well as new features that are required for the implementation.

Motivation
==========

AI bots are causing enormous amounts of traffic by scraping sites like git
forges. They neither respect robots.txt nor 5xx HTTP responses. Solutions like
Anubis and IP-based blocking do not work anymore at this point.

Requirements
============

* Must withstand high traffic from bots: requests made before a payment has
  happened must be *very* cheap, both in terms of response generation and
  database interaction. This includes good support for caching.
* Should work not just for our paivana-httpd but also for Turnstile-style
  paywalls that need to work with purely static paywall pages without
  PHP sessions.


Proposed Solution
=================

Architecture
------------

* paivana-httpd is a reverse proxy that sits between ingress HTTP(S) traffic
  and the protected upstream service.
* paivana-httpd is configured with a particular merchant backend.
* A payment template must be set up in the merchant backend (called ``{template_id}``
  from here on).

Steps:

* Browser visits ``{website}``
  (for example, ``https://git.taler.net``) where
  ``{domain}`` is the domain name of ``{website}``.
* paivana-httpd works as a reverse proxy for
  ``{website}``. Whenever called for a non-whitelisted
  URL, it checks for the presence of a Paivana cookie valid for
  this client IP address and ``{website}`` at this time.
  The *Paivana Cookie* is computed as:

  ``cur_time || '-' || crock32(SHA512(website || client_ip || paivana_server_secret || cur_time))``.

  where ``cur_time`` in the prefix is the expiration time for the
  cookie (and thus the access to the article) in seconds
  (to keep it short), while in the hash it is usually a binary GNUnet
  timestamp in network byte order.
  ``crock32`` is GNUnet's Crockford-inspired base32 encoding.

  * If such a cookie is set and valid, the request is
    reverse-proxied to upstream. *Stop.*
  * Otherwise, an HTTP 302 Redirect to
    ``/.well-known/paivana/templates/$ID#$WEBSITE``
    is returned. Here, ``$ID`` is the template ID and
    ``$WEBSITE`` is the base64url encoding of the full URL of
    the website currently being visited. This way,
    the template page can be fully static and cached, and the
    JavaScript logic on that page can learn which website
    to pay for (and after payment redirect the browser there).

* When the browser requests ``/.well-known/paivana/templates/$ID``,
  a static **cacheable** paywall page is returned,
  including a machine-readable ``Paivana`` HTTP header with
  the ``taler://pay-template/`` URL minus the client-computed
  ``{paivana_id}`` and fulfillment URL (see below).

* The browser (rendering the paywall page) generates a random
  *paivana ID* via JS using the current time (``cur_time``) in seconds
  since the Epoch and the current URL (``{website}``) plus some
  freshly generated entropy (``{nonce}``):

  ``paivana_id := cur_time || '-' || b64url(SHA256(nonce || website || cur_time))``.

  Here ``b64url`` is the RFC 7515 base64 URL encoder, used to keep
  the result short (same reason for the use of SHA-256).
  The same computation could also easily be done by a non-JS client
  that processes the ``Paivana`` HTTP header (or a GNU Taler wallet
  running as a Web extension).

* Based on this paivana ID, a
  ``taler://pay-template/{merchant_backend}/{template_id}?session_id={paivana_id}&fulfillment_url={website}``
  URI is generated and rendered as a QR code and link, prompting
  the user to pay for access to the ``{website}`` using GNU Taler.

* The JavaScript in the paywall page running in the browser
  (or the non-JS client) long-polls
  on a new ``https://{merchant_backend}/sessions/{paivana_id}``
  endpoint that returns when an order with the given session ID has been paid
  for (regardless of the order ID, which is not known to the browser).
* A wallet now needs to instantiate the pay template, passing the
  ``session_id`` and the ``fulfillment_url`` as additional inputs
  to the order creation (the session ID here will work just like
  the existing use of ``session_ids`` in session-bound payments).
  Similarly, the ``{website}`` works as the fulfillment URL as usual.
* The wallet then must pay for the resulting order
  by talking to the merchant backend.
* When the long-poller returns and the payment has succeeded, the
  browser (still rendering the paywall page) also learns the order ID.
* The JavaScript of the paywall page (or the non-JS client
  processing the ``Paivana`` HTTP header) then POSTs the order ID,
  ``nonce``, ``cur_time``
  and ``website`` to ``{domain}/.well-known/paivana``.
* paivana-httpd computes the paivana ID and checks if the given
  order ID was indeed paid recently for the computed paivana ID.
  If so, it generates an HTTP response that sets the Paivana cookie
  and redirects to the fulfillment URL (which is the original ``{website}``).
* The browser reloads the page with the correct
  Paivana cookie (see first step).


Problems:
---------

* A smart attacker might still create a lot of orders via the pay-template.

  * Solution A: Don't care, unlikely to happen in the first place.
  * Solution B: Rate-limit template instantiation on a per-IP basis.

Implementation:
---------------

* Merchant backend needs a way to look up order IDs under a ``session_id``
  (DONE: e027e729..b476f8ae)
* Merchant backend needs a way to instantiate templates with
  a given ``session_id`` and ``fulfillment_url``. This also
  requires extending the allowed responses for templates in general.
* Paivana component needs to be implemented.
* Wallet-core needs support for a ``session_id`` and
  ``fulfillment_url`` in pay templates.


Test Plan
=========

* Deploy it for git.taler.net

Definition of Done
==================

N/A

Alternatives
============

* Do not re-use the session ID mechanism but introduce some new concept.
  This has the drawback of us needing additional tables and indices,
  and also the existing use of the session ID is very parallel to this one.
* Instead of doing a 302 Redirect, cache control could have been achieved by
  specifying a "Vary: Cookie" HTTP header. We may combine these and use
  that to additionally enable caching of the 302 Redirect. The 302 solution
  has the advantage that there is only one page to cache per template, and
  the disadvantage of an additional redirect. Note that this is purely
  a frontend design choice; wallets and merchant backends work nicely with
  either approach.

Drawbacks
=========

* This exposes an order ID to anyone who knows the session ID. This is
  clearly not an issue in this context, and for the existing uses of
  the session ID it also seems clear that knowledge of the session ID
  requires an attacker to have access that would easily also already
  give them any order ID, so this seems harmless.


Discussion / Q&A
================
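To make the two identifier formats from the Architecture section concrete, the following is a minimal Python sketch of the paivana ID and Paivana cookie computations. It is an illustration under stated assumptions, not the paivana-httpd implementation: the function names are hypothetical, strings are assumed to be hashed as UTF-8 without separators, ``cur_time`` is assumed to enter the SHA-256 as decimal ASCII, the timestamp inside the SHA-512 is assumed to be a 64-bit big-endian count of microseconds, and ``crock32`` here is a generic Crockford-style encoder that may not be byte-identical to GNUnet's.

```python
import base64
import hashlib
import hmac
import struct

# Crockford base32 alphabet (no I, L, O, U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def crock32(data: bytes) -> str:
    """Crockford-style base32: 5 bits per character, most significant bits
    first, final group padded with zero bits (assumed to match GNUnet)."""
    bits, nbits, out = 0, 0, []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 5:
            nbits -= 5
            out.append(CROCKFORD[(bits >> nbits) & 31])
        bits &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        out.append(CROCKFORD[(bits << (5 - nbits)) & 31])
    return "".join(out)


def b64url(data: bytes) -> str:
    """RFC 7515 base64url: URL-safe alphabet, '=' padding stripped."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_paivana_id(nonce: bytes, website: str, cur_time: int) -> str:
    """cur_time || '-' || b64url(SHA256(nonce || website || cur_time))."""
    # Assumption: cur_time is hashed as its decimal ASCII representation.
    material = nonce + website.encode("utf-8") + str(cur_time).encode("ascii")
    return f"{cur_time}-{b64url(hashlib.sha256(material).digest())}"


def make_cookie(website: str, client_ip: str, secret: bytes, expiry_s: int) -> str:
    """cur_time || '-' || crock32(SHA512(website || client_ip || secret || cur_time))."""
    # Assumption: inside the hash the time is 64-bit big-endian microseconds.
    t_nbo = struct.pack(">Q", expiry_s * 1_000_000)
    material = website.encode("utf-8") + client_ip.encode("utf-8") + secret + t_nbo
    return f"{expiry_s}-{crock32(hashlib.sha512(material).digest())}"


def cookie_is_valid(cookie: str, website: str, client_ip: str,
                    secret: bytes, now_s: int) -> bool:
    """Recompute the cookie for the claimed expiration time and compare."""
    prefix, sep, _ = cookie.partition("-")
    if not sep or not (prefix.isascii() and prefix.isdigit()) or len(prefix) > 12:
        return False
    expiry_s = int(prefix)
    if expiry_s < now_s:
        return False  # access has expired
    expected = make_cookie(website, client_ip, secret, expiry_s)
    # Constant-time comparison to avoid leaking how many characters matched.
    return hmac.compare_digest(cookie.encode("utf-8"), expected.encode("ascii"))
```

With these encodings, the hash part of a paivana ID is 43 characters long and the hash part of a cookie is 103 characters long. Validation needs no database access: the proxy only recomputes one SHA-512 from the request's IP address, the site URL, and its own secret, which fits the requirement that requests before payment stay cheap.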