Scraping dynamic web pages with JS rendering

Created Sunday 17 March 2018

There are three options:

  1. Drive a real browser from a script
  2. Use a headless/emulated browser to fetch the rendered contents
  3. Use the in-browser JS console to do the scraping

Differences

  1. The first makes it easy to see the current page and debug
  2. The second makes it possible to run the scraper in the cloud or on a headless server
  3. The third needs no third-party tools installed, just your browser

Option 1

With Ruby
Watir to drive a browser: download chromedriver and have the browser do what you want from a script or the command line
Watir::Browser.new :phantomjs (Watir can also drive PhantomJS headlessly)
Nokogiri: an HTML parser and extractor
browser.screenshot.png returns a screenshot as PNG data; browser.screenshot.save 'page.png' writes it to a file

With Python
Use Scrapy for crawling and Selenium to drive a real browser
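A minimal sketch of the Selenium route, assuming chromedriver is on your PATH; the parsing helper uses only the standard library, so the driver function and the extraction step are separated (the URL in the usage comment is a placeholder):

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def fetch_rendered_html(url):
    # Selenium import kept local so the parsing helpers above
    # still work where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without a visible window
    driver = webdriver.Chrome(options=options)  # assumes chromedriver on PATH
    try:
        driver.get(url)
        return driver.page_source  # HTML after JavaScript has run
    finally:
        driver.quit()


# Usage (requires Chrome + chromedriver):
#   html = fetch_rendered_html("https://example.com")
#   print(extract_links(html))
```

Keeping extraction separate from browser driving means the same parsing code works no matter which of the three options fetched the HTML.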

Option 2

With JS
PhantomJS for browser emulation (jsdom is a lighter-weight alternative), Cheerio for scraping (jQuery-like syntax)

With Python
Get spynner (PyQt and WebKit) for browser simulation and Beautiful Soup 4 for scraping,
or the Splash renderer with a Scrapy scraper via a bridge (scrapy-splash)

Option 3

In-browser JS: Artoo.js, a bookmarklet that injects scraping helpers into the current page