I've just released templatemaker, which is something I've been hacking on, on and off (mostly off), for the past couple of months. It's a Python library for extracting data from similarly formatted text strings.
What the heck does that mean?
Well, say you want to get the raw data from a bunch of Web pages that use the same template -- like restaurant reviews on Yelp.com, for instance. You can give templatemaker an arbitrary number of HTML files, and it will create the "template" that was used to create those files. ("Template," in this case, means a string with a number of "holes" in it, where the holes represent the parts of the page that change.) Once you've got the template, you can then give it any HTML file that uses that same template, and it will give you the raw data: "The value for hole 1 is 'July 6, 2007', the value for hole 2 is 'blue'," etc.
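To make the "holes" idea concrete, here is a minimal sketch in plain Python. It is not templatemaker's code or API: the `HOLE` marker, the `template_to_regex` and `extract` helpers, and the sample HTML are all made up for illustration, and the template is written by hand rather than learned.

```python
import re

# Hypothetical marker for "a part of the page that changes".
HOLE = object()


def template_to_regex(parts):
    """Turn a list of literal strings and HOLEs into a compiled regex."""
    pattern = "".join("(.*?)" if p is HOLE else re.escape(p) for p in parts)
    return re.compile("^" + pattern + "$", re.DOTALL)


def extract(parts, text):
    """Return the text that fills each hole, in order."""
    match = template_to_regex(parts).match(text)
    if match is None:
        raise ValueError("text does not fit the template")
    return match.groups()


# A hand-written template with two holes (illustrative markup only).
template = ["<p>Reviewed on ", HOLE, ". Favorite color: ", HOLE, ".</p>"]
values = extract(template, "<p>Reviewed on July 6, 2007. Favorite color: blue.</p>")
print(values)
```

The interesting part, and the part templatemaker is meant to do for you, is producing that template automatically from example pages instead of writing it by hand.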
If this still doesn't make sense, have a look at the example usage and documentation.
I searched but couldn't find anything else that did this. I heard from a few folks that Google and Yahoo have internal tools that do this sort of thing, probably in a much more robust fashion, but I'm unaware of any open-source code that does this. (Disclaimer: I have not checked CPAN. CPAN probably has five implementations of this functionality.)
The library uses a longest-common-substring algorithm, which is implemented in C, via Python's C interface, for performance. (My original implementation was in Python, and it was noticeably slow in that area.) This means you need to compile it in order to use it, but it has no dependencies, so it should be quick and easy.
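For readers who haven't met the term, here is a rough pure-Python sketch of a longest-common-substring search, plus one plausible way to turn it into a template: keep the longest shared run, then repeat on the text to its left and right. This is an assumption about how such an algorithm could be applied, not the library's actual C code; the function names and the `min_len` cutoff are invented for the example.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest run shared by a and b."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


def make_template(a, b, min_len=2):
    """Return a list of shared literal strings, with None marking each hole.

    min_len is a hypothetical cutoff so that stray single characters
    shared by chance are not treated as part of the template.
    """
    i, j, n = longest_common_substring(a, b)
    if n < min_len:
        return [None] if (a or b) else []
    left = make_template(a[:i], b[:j], min_len)
    right = make_template(a[i + n:], b[j + n:], min_len)
    return left + [a[i:i + n]] + right


template = make_template("<b>this and that</b>", "<b>alex and sue</b>")
print(template)
```

The nested loops do work proportional to the product of the two string lengths, which gives a feel for why a pure-Python version can be slow on whole Web pages.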
You can get templatemaker via its Subversion repository on Google Code. I'm releasing it under the New BSD License. There's also a mailing list, for the fun of it.
I'm interested in seeing what uses people have for this, and I'm also planning on adding a ton more features. And I'd love it if a competent C programmer could take a look at the C bits to make sure everything is kosher -- it's been a while since I've written C code. :-)
Let me know if you have any use for this, and may the suggestions and patches flow freely!
Posted by Andy Vaughn on July 6, 2007, at 7:13 a.m.:
Sounds useful. I look forward to following the developments.
Thanks for the contribution!
Posted by Alberto on July 6, 2007, at 9:27 a.m.:
Really useful piece of code. You're my favourite hacker! Just a curiosity: what OS do you use for Python/Django development? OSX, Ubuntu, Windows? And what is your text editor of choice?
Posted by Rob on July 6, 2007, at 11:55 a.m.:
I think scRUBYt! might do something similar; you give it an example, and it generates the generic extraction code: Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1 (scRUBYt! tutorial)
Posted by Matt on July 6, 2007, at 3:01 p.m.:
This looks great, can't wait to play around with it. Sprog might be similar but since I'm not a 'nix user, I never really took the time to use it.
Posted by Adrian on July 6, 2007, at 3:27 p.m.:
Rob: Yeah, I looked at scRUBYt but got confused by its meandering documentation. It appears to require the user to specify XPath expressions, whereas I wanted templatemaker to work on arbitrary text (i.e., not just HTML) and need as little user intervention as possible. scRUBYt seems more akin to BeautifulSoup.
Posted by Arthur Debert on July 6, 2007, at 4:43 p.m.:
This is great.
This has saved a few hours. Now, when I receive from the designer the HTML "sample" pages for a project, I can abstract the common structure to make my django templates very quickly, without having to search through a lot of tags and nested divs. Sweet!
Posted by Doug Napoleone on July 6, 2007, at 4:56 p.m.:
The evildoer in me just had an evil thought. This as a Django middleware + the wget middleware and a few tweaks could be used as a powerful generalized phishing framework. *shiver*
Posted by Vince P. on July 6, 2007, at 5:13 p.m.:
What you're talking about is traditionally called 'report mining' and isn't usually applied to HTML. These concepts have been around since the early 90's at least, in products like Monarch (http://monarch.datawatch.com/datawatch-report-mining-server.asp). Well-formed *ML should make this task even easier to accomplish than it was in the bad old days of report ripping.
Posted by Adrian on July 6, 2007, at 5:27 p.m.:
Vince P.: Oooh, thanks for that info -- that's very helpful. I'll look into "report mining."
Posted by Andrew Gwozdziewycz on July 6, 2007, at 5:33 p.m.:
Sounds to me like you implemented http://www.dapper.net/ . Which is great! I've been wanting to implement something like this for ages, but never had the time, or killer use to actually attempt to implement it.
Posted by Adomas on July 6, 2007, at 6:17 p.m.:
Posted by danny on July 6, 2007, at 7:08 p.m.:
There's http://www.annocpan.org/~AUTRIJUS/Template-Generate-0.04/lib/Template/Generate.pm but I'm not sure how far it got.
Posted by Anton on July 6, 2007, at 7:36 p.m.:
Have a look at webstemmer that provides a very similar functionality and is written in Python :)
Posted by eas on July 6, 2007, at 7:37 p.m.:
If I'm understanding you right, what you've done is pretty sweet. BeautifulSoup is cool, but I'm lazy:).
Posted by Paul Smith on July 6, 2007, at 8:24 p.m.:
Hey Adrian, I think you're missing a line in your setup.py that copies templatemaker.py to site-packages on install. I filed a ticket.
Posted by Aaron Bentley on July 6, 2007, at 8:32 p.m.:
Very cool idea. I'm not sure the C module is really necessary, though. Python supplies its own sequence matching facility: difflib.SequenceMatcher. In my experience, its performance is pretty good.
You might also want to consider using a longest common subsequence matcher. That can often produce better results than a longest common substring matcher, because it isn't confused by spurious similarities as easily. Bazaar has a library to do longest common subsequence matching, written in Python.
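As a small illustration of the difflib suggestion, the standard library's SequenceMatcher can report both the single longest shared run and the full ordered list of matching blocks. This only shows what the stdlib calls return on two short sample strings (chosen for the example); it says nothing about how fast it would be on real pages, and it is not Bazaar's matcher.

```python
from difflib import SequenceMatcher

a = "<b>this and that</b>"
b = "<b>alex and sue</b>"

# autojunk=False turns off difflib's "popular element" heuristic.
matcher = SequenceMatcher(None, a, b, autojunk=False)

# The single longest run shared by both strings.
m = matcher.find_longest_match(0, len(a), 0, len(b))
longest = a[m.a:m.a + m.size]

# Every matching block, in order; the final zero-length sentinel is skipped.
blocks = [a[blk.a:blk.a + blk.size]
          for blk in matcher.get_matching_blocks() if blk.size]

print(repr(longest))
print(blocks)
```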
Posted by John Zeratsky on July 6, 2007, at 8:50 p.m.:
Fantastic, Adrian. Just the other day, I was thinking about how mining data from HTML is so manual and labor-intensive.
Posted by Adrian on July 6, 2007, at 9:35 p.m.:
Paul: Thanks -- it's fixed now.
Posted by Jeff Wheeler on July 7, 2007, at 12:54 a.m.:
This is absolutely wonderful! I've longed for this functionality, thank you very much!
Posted by Nate Murray on July 7, 2007, at 1:18 a.m.:
You mentioned that you'd never seen anything like this. What you're doing is called "wrapper induction," and there are about four or five big algorithms of varying usefulness:
http://citeseer.ist.psu.edu/kushmerick97wrapper.html - Probably the most referenced paper on WI.
A patented WI algorithm is Stalker. ( http://citeseer.ist.psu.edu/muslea98stalker.html )
Probably my favorite is RoadRunner. It's by some Italian guys and it actually has open source: ( http://citeseer.ist.psu.edu/crescenzi01roadrunner.html )
Posted by doug smith on July 7, 2007, at 1:23 a.m.:
a tool for EveryBlock methinks :) love django - thanks for sharing.
Posted by ndg on July 7, 2007, at 2:26 a.m.:
Any chance of including the Python version of the substring matcher as a fallback for those of us who can't compile the C? :(
Posted by Matt Baker (PDX) on July 7, 2007, at 3:51 a.m.:
How cool! How does it learn? I suppose I should just download the code and answer that for myself. I've been wanting to build a similar tool for data mining for some time now; what better way to break data out of the formatting and into a DB? I'm sure its usefulness in that department is no coincidence at all ;)
Posted by Adrian on July 7, 2007, at 4:07 a.m.:
Nate Murray: Thanks *very much* for passing those links along -- this is great stuff.
Posted by Clint Ecker on July 7, 2007, at 4:28 a.m.:
Lovely! I'm a huge BeautifulSoup user. While it's pretty powerful, I find the process of getting my code set up to be pretty tedious. This will help me bunches. Thanks!
Posted by Coty Rosenblath on July 7, 2007, at 4:59 a.m.:
Looks cool. There is a Ruby project that I ran across a while back that you might find interesting: Ariel. I think it is a closer fit than scRUBYt et al. It also has a learning phase and aspires to working on text in general rather than only markup. It requires a bit more setup to learn the structure of a document, but I believe it allows for the extraction of richer structures. Its theory page references a paper on wrapper induction, too, which based on Nate's comment suggests that it is in the same space.
Posted by Tucker on July 9, 2007, at 3:18 a.m.:
I just played with it for a bit and I think the answer is "no", but does it handle learning variable length lists? For example, if I have text like:
where there can be a variable number of items in the list, can template maker automatically learn a template and extract the strings?
Posted by Adrian on July 9, 2007, at 4:52 a.m.:
Tucker: No, it doesn't do that. That'd be a cool feature, but it's sort of out of scope of what templatemaker is intended to do.
Posted by Mason on July 9, 2007, at 6:14 a.m.:
Neat tool :)
You might want to check out some of the projects going on at the Simile research group at MIT. I'm working there part-time doing research these days, and there's some pretty exciting stuff that we're developing. Particularly, Solvent and Piggy Bank are really neat implementations of a variant of your idea.
Posted by Jeremy Dunck on July 9, 2007, at 7:44 a.m.:
Yeah, Simile is doing a lot of interesting stuff. I bet it's a fun place to work!
Posted by Murkt on July 9, 2007, at 10:09 a.m.:
Adrian, we're sending the ray of love to you from Ukraine. Thanks, great tool!
Posted by Marcus on July 11, 2007, at 8:01 a.m.:
I found another python version of a longest common subsequence matcher (from the maker of webstemmer mentioned by Anton above).
Posted by Tom Lynn on July 12, 2007, at 1:39 p.m.:
The following CPAN modules seem to be recommended:
* Template toolkit
(from the comments when James Tauber asked about this on his blog:
Posted by Dan Schultz on July 17, 2007, at 8:20 p.m.:
I figure it can't hurt to join the list of people saying that this is a very cool tool:)
Out of curiosity - do you know if there is a similar type of thing that exists for PHP? Sorry if I'm not allowed to use that acronym around these parts...
Posted by Peter Szinek on July 26, 2007, at 8:41 a.m.:
I guess I am a bit late to the party, but anyway... Here is a simple Yelp learning extractor in scRUBYt!:
http://pastie.caboo.se/82375 (as you can see, you need no XPaths or any other specific knowledge, just simple examples)
Which returns this result:
http://pastie.caboo.se/82376 (of course the <address> tag could be mined further)
To turn it into a production extractor, you export it, yielding:
Now this production extractor can be used to extract data from the other pages of the same site (tried with the pizza example you provided).
I hope this makes things a little bit more clear ;-) What you have seen is maybe 20% of scRUBYt!'s functionality, but it gives an idea about learning/production etc.
Posted by Steven James on July 27, 2007, at 1:03 p.m.:
I really could have used this about a month ago (wrote a custom parser instead); too bad I just found it today. I'm sure another similar task will come up, and this looks like great work.