adrian holovaty


May 2, 2008, 1:36 AM ET

Request: Headless HTML rendering engine?

Warning: Seriously geeky request ahead!

I'm looking for a way to render arbitrary Web pages -- including CSS and JavaScript -- and access the resulting DOM tree programmatically, i.e., in an automated/headless fashion. I want to be able to ask the following questions of the resulting DOM tree:

The rendering must be state-of-the-art, handling the advanced CSS that Firefox, Safari and IE support. It should work on Linux. Bonus points if there's a Python API for this magical DOM tree.

This is all stuff that standard in-page JavaScript could accomplish, but the catch is that I need to be able to do it in a completely automated way, on arbitrary pages, on a headless server.

I know Gecko and WebKit provide this, but I'm not sure where to start with them. The docs and articles I've read seem to focus more on embedding the full browser window in a GUI application than on embedding the rendering engine itself and manipulating the resulting pages.

Help! If you have any clues, I'd be grateful if you left a comment or got in touch with me.



