The need for archives by citation

Written by Adrian Holovaty on January 2, 2003

Mark Pilgrim recently unveiled a "Posts by Citation" archive.

The archive lets you sort Mark's weblog entries by the sources he cites. For example, you can access every entry that cites The Onion, Jakob Nielsen, or The Washington Post. Plus, the archive page lists how many entries cite each particular source. (At the time of this writing, for example, he'd cited Dave Winer 26 times and Paul Ford 7 times.)

Such a system is simple to build because Mark's blog entries use the appropriate markup -- the HTML cite tag -- to identify source names. That makes it easy for a computer to discern which pieces of a blog entry are citations, which in turn makes it easy to group entries that have similar citations or to calculate how often a particular source is cited.
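To make that concrete, here is a minimal sketch in Python of how a program might pull the cite tags out of a set of entries, group the entries by source and count them. It is not Mark's actual code; the class and function names and the two sample entries are made up for illustration, and it assumes each entry is available as a string of HTML.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class CiteExtractor(HTMLParser):
    """Collects the text found inside <cite>...</cite> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside a <cite> element
        self.buffer = []    # text pieces of the citation being read
        self.sources = []   # completed source names, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "cite":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "cite" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and collapse runs of whitespace.
                name = " ".join("".join(self.buffer).split())
                if name:
                    self.sources.append(name)
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def cited_sources(html):
    """Return the source names cited in one entry's HTML."""
    parser = CiteExtractor()
    parser.feed(html)
    parser.close()
    return parser.sources


def build_archive(entries):
    """Map each cited source to the titles of the entries that cite it."""
    archive = defaultdict(list)
    for title, html in entries.items():
        # set() so an entry citing a source twice is listed only once.
        for source in sorted(set(cited_sources(html))):
            archive[source].append(title)
    return dict(archive)


# Hypothetical sample entries, not taken from any real weblog.
ENTRIES = {
    "Entry one": "<p><cite>Dave Winer</cite> replied to <cite>Paul Ford</cite>.</p>",
    "Entry two": "<p>As <cite>Dave Winer</cite> wrote, and <cite>The Onion</cite> joked.</p>",
}

archive = build_archive(ENTRIES)
counts = Counter({source: len(titles) for source, titles in archive.items()})

for source, n in counts.most_common():
    print(f"{source}: {n}")
```

The sketch matches source names by their exact text, so "Donald Rumsfeld" and "Defense Secretary Donald Rumsfeld" would be counted as two different sources. A real archive would need some agreed-upon naming convention on top of this.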

This new way of archiving has caused a stir in the weblogging community, but the idea isn't just a weblog novelty. It's a concept news Web sites should adopt and run with. Here's why that should happen:

It groups content in helpful ways

It's obvious that someone looking for quotes by U.S. Defense Secretary Donald Rumsfeld would benefit from an archive of all stories that cited him. Such an archiving scheme helps readers, researchers and journalists themselves.

It provides interesting meta-information about the site

I'd be fascinated to see a list of the most-quoted sources at the New York Times. Or, whom the Times rarely quotes. This type of information has an allure similar to that of "most-e-mailed articles" lists (such as Yahoo's).

It keeps the journalists in check

Most importantly, a citation archive would lay bare a news organization's biases by disclosing publicly which sources have been quoted more than others.

It's no secret that some "unbiased" news outlets quote certain groups -- say, members of certain political parties -- far more often than other groups. The American media watchdog group FAIR has published several reports revealing in detail some of these "official agendas." (One study concluded a PBS news program had "utterly failed" to be a fair, open forum because the show's guests tended to be of similar political leanings.)

If news Web sites made it easy for readers to see how often certain sources were quoted, journalists would have extra incentive to "get both sides of the story."


Posted by anonymous on January 2, 2003, at 8:11 p.m.:

Greetings from Kiev!

I appreciate your recommending the technology for very practical applications, such as keeping the news unbiased. Keep up the good work.


Posted by Julie on January 3, 2003, at 5:47 a.m.:

In theory I'm with you all the way. It's smart, useful, etc. BUT all that coding takes someone's time. It's one thing for Mark or you or any individual to insert applicable cite tags into a daily blog... it's quite another for a large news organization producing scores of stories with multiple sources.

If you have people doing the coding by hand (at any stage of the writing/editing/production process) it becomes an incredible time burden, and good luck achieving the kind of uniformity necessary to produce meaningful results. Let's say they can manage to always use the same notation for U.S. Defense Secretary Donald Rumsfeld so you don't end up with 251 for that and 83 for just Defense Sec. Donald Rumsfeld... who decides where to draw the line on the importance of the person? Do you use cite tags if the person quoted is Joe Yahoo, local farmer? What if it's Joe Yahoo, head of local Young Republicans? Do you discard those people? Count them all as "private citizen"? Private citizen obviously does not equal without bias. Good luck with any meaningful uniformity. There are too many levels of viewpoints, biases... I glanced at the FAIR study and did not see a detailed methodology provided (tsk tsk) but if it was conducted properly it would have taken them months to sort through and correctly code that kind of data. That is time that tightly-staffed news orgs simply don't have.

I suppose it is POSSIBLE to come up with an algorithm to replace the daily production burden on staffers, but that too would come at a cash price and I suspect it would not fare much better in deciding who "counts."

Posted by Brian Hamilton on January 5, 2003, at 8 a.m.:

I agree, Julie. It's a great idea, but causes a lot of problems on the way to being used and useful.

I responded to this article in my blog.

Posted by mini-d on January 11, 2003, at 5:57 p.m.:

Adrian, it would be a great idea if someone could point out which things should be in [cite]tags[/cite]. For example, I wrap every name in my weblog that refers to someone in a community in [cite]tags[/cite], but what about names like The Onion or Microsoft? I guess people should learn how to semantize content too. Thanks for all.

Posted by Adrian on January 12, 2003, at 1:54 a.m.:

mini-d: The official specification is vague, but my personal policy is to wrap a <cite> around any person or publication that is being used as a source. If I quote somebody, I'll surround that person's name with a <cite>. If I reference a New York Times article, I'll surround "New York Times" with <cite>.

Hope that's somewhat helpful.

Posted by Bill Humphries on January 14, 2003, at 10:33 a.m.:

I know that Mark Pilgrim's gone off on a tear about XHTML 2.0's dropping the cite element (but it's a draft, not the final product). However, cite needs to stay unless the working group has an idea for a successor.

Also, cite needs an attribute model. At the very least href, rev, and rel.
