Three ways to avoid getting blocked while web scraping

As collecting data from the internet has become a standard process in many industries, companies that don't want their data scraped keep looking for ways to stop automated collection. Tools like Akamai's Boomerang, Datadome, or Imperva's Advanced Bot Protection all try to achieve the same thing: decide whether a visitor is real and prevent bots from parsing the website's data. Fetching website data and defending against it has become a cat-and-mouse game over the years. And while there's no silver bullet, there are a few basic steps you can take to stay under the radar and get the precious data. So let's get right into it.

Preface

If you're new to web scraping, it's a good idea to clear up a few things first. Whenever you visit a website, the web server behind the URL receives a bunch of data about the device and browser you're using. With this information, the server can create a fingerprint that identifies the source of the request. If you're curious, you can check your browser's fingerprint here: https://bot.sannysoft.com.

Pro tip: you can also load https://bot.sannysoft.com from a headless browser and take a screenshot to see what fingerprint it produces.

When you're using simple scraping methods, like requests.get in Python, many of these data points will be missing. On the other hand, when you're using a headless browser, some of this information will tell the server that the request comes from a headless browser, not a real one. Based on this, the server can decide to block your access. The result can be a captcha challenge or a blank page instead of your desired content. And while there are solutions to automate captcha solving, it's best to avoid the challenge in the first place.
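
If you want to see this gap for yourself, here's a minimal sketch of what a bare HTTP client sends by default. It assumes Node 18 or newer for the built-in fetch and uses https://httpbin.org/headers, a public test endpoint that simply echoes back the request headers it received; neither is part of the setup used later in this article, it's just an illustration.

// httpbin.org/headers echoes back the headers of the incoming request,
// so this prints exactly what the server sees from a bare Node.js client.
fetch('https://httpbin.org/headers')
  .then(res => res.json())
  .then(body => console.log(body.headers))
// Expect only a handful of entries (Host, Accept, a generic User-Agent),
// far less than the rich header set a real browser sends.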

In this article, we're going to check out some of the techniques you can use when using Puppeteer and Chrome or Chromium for your scraping tasks.

1. Rotating user agents and stealth mode

It's widely known that using the wrong browser settings, or the same ones over and over, can make your requests suspicious. Therefore, the first thing you want to make sure of is that your scraping tool sends requests as if it were a real browser. Here are two npm packages you'll want to experiment with: puppeteer-extra-plugin-stealth and random-useragent. The stealth plugin applies various techniques to make detecting your headless browser harder, and using it is straightforward once installed. Setting real-world-like browser headers will help you access most sites.

// puppeteer-extra is a drop-in replacement for puppeteer that supports plugins
const puppeteer = require('puppeteer-extra')

// Register the stealth plugin before launching the browser
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())

puppeteer.launch({ headless: true }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  // Visit the fingerprinting test page and capture the results
  await page.goto('https://bot.sannysoft.com')
  await page.waitForTimeout(5000)
  await page.screenshot({ path: 'testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check the screenshot. ✨`)
})

Using a different user agent for every request you make can make it hard to associate a fingerprint with you. And when it comes to rotating user agents, the random-useragent package comes to the rescue.

// random-useragent returns a real-world user agent string on every call
const randomUseragent = require('random-useragent');

const userAgent = randomUseragent.getRandom();
await page.setUserAgent(userAgent);
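
To make the rotation concrete, here's a rough sketch of picking a fresh user agent before every page visit. The URL list is a placeholder, and the browser setup mirrors the stealth example above; treat it as one possible arrangement, not the only way to wire it up.

const puppeteer = require('puppeteer-extra')
const randomUseragent = require('random-useragent')
// The stealth plugin from the previous example can be registered here as well

// Placeholder URLs - replace with your own targets
const urls = ['https://example.com/page-1', 'https://example.com/page-2']

puppeteer.launch({ headless: true }).then(async browser => {
  const page = await browser.newPage()
  for (const url of urls) {
    // A different user agent for each request makes fingerprinting harder
    await page.setUserAgent(randomUseragent.getRandom())
    await page.goto(url)
    // ...process the page content here...
  }
  await browser.close()
})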

2. Use a proxy

Using proxies is one of the most recommended practices, whether you use a headless browser or not. Thankfully, there are many proxy providers out there, like Brightdata, Proxycrawl, or the proxy mode of Scrapingbee.

Depending on the service you choose, you have a few options for setting up a proxy with Puppeteer. If you decide to go with Brightdata, you can use page.authenticate before visiting the URL.

await page.authenticate({ username, password });
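
For other providers, or if you'd rather configure the proxy at launch time, here's a sketch of routing all of Chromium's traffic through an authenticated proxy. The host, port, and credentials are placeholders; check your provider's documentation for the exact endpoint format.

const puppeteer = require('puppeteer-extra')

// Placeholder endpoint and credentials - use the values from your provider
const proxyServer = 'http://proxy.example.com:8000'
const username = 'your-proxy-username'
const password = 'your-proxy-password'

puppeteer.launch({
  headless: true,
  // Tell Chromium to route every request through the proxy
  args: [`--proxy-server=${proxyServer}`]
}).then(async browser => {
  const page = await browser.newPage()
  // Answer the proxy's authentication challenge before navigating
  await page.authenticate({ username, password })
  await page.goto('https://example.com')
  await browser.close()
})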

Whatever you choose, always do the math before starting your scraping tasks. Depending on your target sites, some proxy networks can offer better value than others.

3. Don't rush, wait a bit between page loads

Flooding a server with requests won't make you look human. This is why it's recommended to wait a bit after you've finished processing a page. You can also randomize the delay between page visits instead of using a fixed value.

// Wait a random 10-110 seconds before loading the next page
let delay = Math.floor(Math.random() * 100000) + 10000
await page.waitForTimeout(delay)
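
Put into context, the snippet above would sit at the end of each iteration of your crawl loop. Here's a minimal sketch assuming the page object from the earlier examples and a placeholder URL list:

// Placeholder URLs - replace with your own targets
const urls = ['https://example.com/a', 'https://example.com/b']

for (const url of urls) {
  await page.goto(url)
  // ...process the page content here...

  // Pause for a random 10-110 seconds before the next page load
  const delay = Math.floor(Math.random() * 100000) + 10000
  await page.waitForTimeout(delay)
}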

Summary

The steps above are just the tip of the iceberg. There are plenty of techniques like these for avoiding bot detection systems, and there's definitely more to come as these systems evolve.