I started working on OBSRV.APP on the 25th of December and today I finished rewriting the backend part responsible for performing the scrape jobs. Almost from scratch.
"Why?", one might ask. At the beginning my idea was to have a lightweight script that could evolve into something bigger over time. Load page content, fetch data, profit. But after my first test runs it turned out that sites can behave very differently when you access them through a headless browser or some other scripted client. (Yes, I know the same can happen when you're using any boring non-headless browser :))
A non-exhaustive list of ways to fetch a site in Python:
- memory-hungry headless Chrome + Selenium
- good old requests.get
- or a fancy API like https://scrapingbee.com
The right fetch approach depends on the technologies used on your target site (e.g. static content vs. content populated dynamically by JS code). Of course, loading the site content is not enough: you also need something to find the data you're looking for on the page you've loaded. And did I mention that the data can be located in many different ways?
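For illustration, here's a minimal sketch of what two such fetchers might look like, using the stdlib for the static case and Selenium for the headless one. The function names and the try-cheap-first ordering are my own illustration, not OBSRV.APP's actual code:

```python
def fetch_static(url: str) -> str:
    # Plain HTTP GET via the stdlib; enough for static pages.
    from urllib.request import urlopen
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def fetch_headless(url: str) -> str:
    # JS-heavy sites need a real browser (requires `pip install selenium`
    # and a local Chrome install).
    from selenium import webdriver
    opts = webdriver.ChromeOptions()
    opts.add_argument("--headless=new")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

# Try the cheap option first, fall back to the memory-hungry one.
FETCHERS = [fetch_static, fetch_headless]
```

The third option from the list, a hosted API like ScrapingBee, would just be another callable in that list.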
So the reason the lightweight script couldn't survive is simple: this is a complex problem. And complex problems need complex solutions. Now I have a system where I can easily add new strategies for loading a site and for finding the desired data, and it can figure out which combination of these strategies works on a given site. This will do the job for now.
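The "find a working combination" idea can be sketched in a few lines: loaders and extractors are plain callables, and we try every pairing until one yields data. This is my own toy reconstruction with stubbed-out strategies, not the real OBSRV.APP implementation:

```python
import itertools
import re

# Stub loader strategies: in reality these would do an HTTP GET or
# drive a headless browser. Here they just return canned HTML.
def load_static(url):
    return "<html><span class='price'>42.00</span></html>"

def load_headless(url):
    return "<html><span class='price'>42.00</span></html>"

# Stub extractor strategies: a strategy signals failure by raising.
def extract_css(html):
    raise LookupError("selector not found")  # simulate a miss

def extract_regex(html):
    m = re.search(r"class='price'>([\d.]+)<", html)
    if not m:
        raise LookupError("pattern not found")
    return m.group(1)

def find_working_combo(url, loaders, extractors):
    # Try every (loader, extractor) pairing; first success wins.
    for load, extract in itertools.product(loaders, extractors):
        try:
            return (load.__name__, extract.__name__, extract(load(url)))
        except Exception:
            continue
    return None

combo = find_working_combo("https://example.com",
                           [load_static, load_headless],
                           [extract_css, extract_regex])
# combo records which pairing worked and the value it extracted
```

Once a combination is known to work for a site, it can be cached and reused for subsequent scrape jobs instead of searching again.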
I'm still concentrating on delivering the promise of the product. Optimizing and preparing everything for scaling is not on my list. Not yet, at least.
Ah yes, and hopefully my connectivity issues will go away by tomorrow. It's quite hard to make progress when your task is to collect data from the internet... without internet.