{"id":6891,"date":"2026-04-15T21:49:59","date_gmt":"2026-04-16T04:49:59","guid":{"rendered":"https:\/\/blackcap.name\/blog\/new\/?p=6891"},"modified":"2026-04-15T21:49:59","modified_gmt":"2026-04-16T04:49:59","slug":"bypassing-cloudflare-to-scrape-a-web-site","status":"publish","type":"post","link":"https:\/\/blackcap.name\/blog\/new\/?p=6891","title":{"rendered":"Bypassing Cloudflare to Scrape a Web Site"},"content":{"rendered":"\n<p>It&#8217;s supposed to be really hard, and <a href=\"https:\/\/www.cloudflare.com\/\" data-type=\"link\" data-id=\"https:\/\/www.cloudflare.com\/\">Cloudflare<\/a> does indeed to a <em>very<\/em> good of detecting (and banning) web scrapers. But, it turns out that this is one of those things that, while indeed very difficult to do in the general case, is actually quite simple in a lot of specific cases.<\/p>\n\n\n\n<p>The main trick is to use <a href=\"https:\/\/playwright.dev\/\" data-type=\"link\" data-id=\"https:\/\/playwright.dev\/\">Playwright<\/a> (with the <a href=\"https:\/\/pypi.org\/project\/playwright-stealth\/\" data-type=\"link\" data-id=\"https:\/\/pypi.org\/project\/playwright-stealth\/\">playwright_stealth<\/a> addon) to control a browser, and to use that browser to scrape the web. You could <em>theoretically<\/em> use some other browser automating tool like <a href=\"https:\/\/www.selenium.dev\/\" data-type=\"link\" data-id=\"https:\/\/www.selenium.dev\/\">Selenium<\/a> to do this, but the problem with Selenium is that it basically advertises itself with every request (and there is no way to turn this off), and that triggers Cloudflare in short order.<\/p>\n\n\n\n<p>The second trick is to mimic a human user as faithfully as possible. Generally, that means going slow, i.e. no faster than a human would navigate a web page, and inserting randomness into the process so it looks like a human and not an automaton is controlling the browser. 
This is where things break down for abusive scrapers like AI companies; they <em>can&#8217;t<\/em> take it slow, because if they did, it would literally take millennia to gather as much data as they want. But I am not an AI company, and I only need to scrape a modest amount of data, so taking it slow is good enough (I just let it run and the results get collected eventually; not as fast as they might be otherwise, but it still beats manual cutting and pasting).<\/p>\n\n\n\n<p>The reason I mention this is that, if you search for how to do this, the results tend to be pretty useless. They are dominated by either a) techniques that no longer work and will trigger Cloudflare almost instantly, or b) commercial scraping services interested in taking your money. The latter <em>may<\/em> actually be a useful service if you want to scrape a lot, but I am not in that category.<\/p>\n\n\n\n<p>That all of this is even necessary to write about is yet another data point in the overall <a href=\"https:\/\/en.wikipedia.org\/wiki\/Enshittification\" data-type=\"link\" data-id=\"https:\/\/en.wikipedia.org\/wiki\/Enshittification\">enshittification<\/a> of the Internet, which by my reckoning peaked in its usefulness circa 2010 and has been heading downhill ever since. The services I now have to use devious techniques to scrape used to be easily scrapable; actually, in many cases, they had free APIs designed to work well with other programs and didn&#8217;t need to be scraped at all.<\/p>\n\n\n\n<p>And yes, the abusive scrapers are as much (or more) to blame here as the sites removing functionality. I have had to implement anti-scraping measures on a site I host, because AI companies were ignoring <a href=\"https:\/\/en.wikipedia.org\/wiki\/Robots.txt\" data-type=\"link\" data-id=\"https:\/\/en.wikipedia.org\/wiki\/Robots.txt\">robots.txt<\/a> and scraping the daylights out of it. 
It was so bad that I was getting server crashes from the abuse.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It&#8217;s supposed to be really hard, and Cloudflare does indeed do a very good job of detecting (and banning) web scrapers. But it turns out that this is one of those things that, while very difficult to do in the general case, is actually quite simple in a lot of specific cases. The main trick [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[],"class_list":["post-6891","post","type-post","status-publish","format-standard","hentry","category-computers"],"_links":{"self":[{"href":"https:\/\/blackcap.name\/blog\/new\/index.php?rest_route=\/wp\/v2\/posts\/6891","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blackcap.name\/blog\/new\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blackcap.name\/blog\/new\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blackcap.name\/blog\/new\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blackcap.name\/blog\/new\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6891"}],"version-history":[{"count":2,"href":"https:\/\/blackcap.name\/blog\/new\/index.php?rest_route=\/wp\/v2\/posts\/6891\/revisions"}],"predecessor-version":[{"id":6893,"href":"https:\/\/blackcap.name\/blog\/new\/index.php?rest_route=\/wp\/v2\/posts\/6891\/revisions\/6893"}],"wp:attachment":[{"href":"https:\/\/blackcap.name\/blog\/new\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6891"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blackcap.name\/blog\/new\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6891"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blackcap.name\/blog\/new\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=
6891"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}