Scraping dynamic web pages with JS rendering
Created Sunday 17 March 2018
You have three options:
- Drive a real browser from a script
- Use a headless/phantom browser to fetch the rendered contents
- Use the in-browser JS console to do the scraping
Differences
- The first makes it easy to see the current page and debug
- The second makes it possible to run the scraper in the cloud or on a headless server
- The third does not need any third-party tools installed, just your browser
Option 1
With Ruby
Watir to drive a browser - download chromedriver and have it do what you want from a script or the command line
Watir::Browser.new :phantomjs (or :chrome to drive a real Chrome through chromedriver)
Nokogiri - an HTML parser and extractor.
browser.screenshot.png - grabs a screenshot of the current page as PNG data
With Python
Use Scrapy and Selenium to drive a browser
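A minimal Python sketch of the Selenium half, assuming chromedriver is installed and on the PATH; the URL and CSS selector are placeholders, and in a real Scrapy project you would typically call Selenium from a downloader middleware rather than standalone like this:

    # Drive a real Chrome instance and read the JS-rendered DOM
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.implicitly_wait(10)              # wait up to 10 s for elements to appear
        driver.get("https://example.com")       # placeholder URL
        for link in driver.find_elements(By.CSS_SELECTOR, "a"):
            print(link.get_attribute("href"), link.text)
        html = driver.page_source               # full rendered HTML, if you prefer a separate parser
    finally:
        driver.quit()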
Option 2
With JS
PhantomJS for browser emulation (jsdom is a more lightweight alternative), Cheerio for scraping (jQuery-like syntax)
With Python
Get spynner (PyQt and WebKit) for browser simulation and Beautiful Soup 4 for scraping
or the Splash renderer with a Scrapy scraper via the scrapy-splash bridge (see the sketch below)
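A minimal Python sketch, assuming a Splash instance is already running locally on port 8050; the URL and selector are placeholders. It fetches the JS-rendered HTML through Splash's render.html endpoint and parses it with Beautiful Soup 4; for a full crawl you would use scrapy-splash's SplashRequest inside a Scrapy spider instead:

    # Ask Splash to render the page, then parse the resulting HTML with BS4
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get(
        "http://localhost:8050/render.html",                  # Splash HTTP API endpoint
        params={"url": "https://example.com", "wait": 2},     # placeholder URL; let JS run for 2 s
    )
    soup = BeautifulSoup(resp.text, "html.parser")
    for link in soup.select("a"):                             # placeholder selector
        print(link.get("href"), link.get_text(strip=True))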
Option 3
In-browser JS: artoo.js - a client-side scraping library you load via a bookmarklet and drive from the browser console