A fundamental way newspaper sites need to change

Written by Adrian Holovaty on September 6, 2006

A blog entry titled 9 Ways for Newspapers to Improve Their Websites has been making the rounds lately. I don’t write about the online news industry on this site as much as I used to, but this article inspired me to collect my current thinking on what newspaper sites need to do. Here, I present my opinion of one fundamental change that needs to happen.

For background: I have a journalism degree, for what it’s worth, and I’ve worked for newspaper Web sites since 1998 (including the college paper and internships). The sites: themaneater.com (now a pale shadow of its former self) at the University of Missouri, SuburbanChicagoNews.com, ajc.com in Atlanta, LJWorld.com / Lawrence.com in Lawrence, Kansas, and, for the last year, washingtonpost.com.

Most of the points made in the “9 Ways” entry are OK, if a little overly specific (“make your content work on cell phones and PDAs”) and trendy (tagging!). There’s nothing wrong with that — the online newspaper industry needs all the advice it can get. But more fundamental shifts need to happen for newspaper companies to remain essential sources of information for their communities.

One of those important shifts is: Newspapers need to stop the story-centric worldview.

Conditioned by decades of an established style of journalism, newspaper journalists tend to see their primary role thusly:

Collect information
Write a newspaper story

The problem here is that, for many types of news and information, newspaper stories don’t cut it anymore.

So much of what local journalists collect day-to-day is structured information: the type of information that can be sliced-and-diced, in an automated fashion, by computers. Yet the information gets distilled into a big blob of text — a newspaper story — that has no chance of being repurposed.

“Repurposed”?

Let me clarify. I don’t mean “Display a newspaper story on a cell phone.” I don’t mean “Display a newspaper story in RSS.” I don’t mean “Display a newspaper story on my PDA.” Those are fine goals, but they’re examples of changing the format, not the information itself. Repurposing and aggregating information is a different story, and it requires the information to be stored atomically — and in machine-readable format.

For example, say a newspaper has written a story about a local fire. Being able to read that story on a cell phone is fine and dandy. Hooray, technology! But what I really want to be able to do is explore the raw facts of that story, one by one, with layers of attribution, and an infrastructure for comparing the details of the fire — date, time, place, victims, fire station number, distance from fire department, names and years experience of firemen on the scene, time it took for firemen to arrive — with the details of previous fires. And subsequent fires, whenever they happen.

That’s what I mean by structured data: information with attributes that are consistent across a domain. Every fire has those attributes, just as every reported crime has many attributes, just as every college basketball game has many attributes.
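As a rough sketch of what "attributes that are consistent across a domain" could look like in code, here is a hypothetical fire record in Python. The class name, field names and sample values are all invented for illustration; they are assumptions, not any newspaper's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class FireIncident:
    """One fire, stored as discrete fields rather than as story text."""
    reported_at: datetime      # when the fire was reported
    address: str               # where it happened
    station_number: int        # responding fire station
    distance_miles: float      # distance from the station to the fire
    response_minutes: float    # time it took firefighters to arrive
    victims: int = 0


def average_response_by_station(incidents):
    """Group incidents by station and average their response times."""
    by_station = {}
    for incident in incidents:
        by_station.setdefault(incident.station_number, []).append(
            incident.response_minutes
        )
    return {station: mean(times) for station, times in by_station.items()}


# Made-up sample data, for illustration only.
fires = [
    FireIncident(datetime(2006, 8, 1, 14, 5), "100 Main St", 3, 1.2, 4.0),
    FireIncident(datetime(2006, 8, 9, 2, 40), "22 Oak Ave", 3, 2.5, 6.0),
    FireIncident(datetime(2006, 8, 20, 19, 15), "7 Elm Ct", 5, 0.8, 3.0, victims=1),
]

averages = average_response_by_station(fires)
```

Because every fire carries the same fields, comparing a new fire with previous ones is a short function rather than a rereading of old stories.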

Those three examples are obvious candidates for structure, mostly due to ubiquity. People have been slicing and dicing sports stats for years. People have been analyzing crime for years.

But it doesn’t stop at those obvious examples. If you take some time to examine what sort of information newspaper journalists collect, the amount of structure will jump at you. If I may take the liberty of giving examples from Web sites I’ve worked for:

An obituary is about a person, involves dates and funeral homes.
A wedding announcement is about a couple, with a wedding date, engagement date, bride hometown, groom hometown and various other happy, flowery pieces of information.
A birth has parents, a child (or children) and a date.
A college graduate has a home state, a home town, a degree, a major and graduation year.
An Onion-style “On the Street” feature has respondents, answers and a publication date.
A drink special has a day of the week and is offered at a bar.
The schedule of the U.S. Congress has a day and multiple agenda items.
A political advertisement has a candidate, a state, a political party, multiple issues, characters, cues, music and more.
Every Senate, House and Governor race in the U.S. has location, analysis, demographic information, previous election results, campaign-finance information and more.
Every known detainee at Guantanamo Bay has an approximate age, birthplace, formal charges and more.

See the theme here? A lot of the information that newspaper organizations collect is relentlessly structured. It just takes somebody to realize the structure (the easy part), and it just takes somebody to start storing it in a structured format (the hard part).

Now, I can understand why newspapers are slow to accept this sort of thinking. Journalists aren’t the most tech-savvy bunch, they’re not the most innovative bunch, and they’re (just a tad) resistant to change.

One barrier to this thinking is a sort of journalistic arrogance: “How is this journalism? We’re journalists, and we’ve been trained to explain complex information to people in ways that they can understand. Displaying raw data does not help people; writing a news article does help people, because it’s plain English.” I’ve presented these concepts (“journalism via computer programming,” the importance of machine-readable data, etc.) at a number of journalism-industry events, and inevitably somebody asks these questions.

Well, I have a couple of answers.

First, the question of “How is this journalism?” is academic. Journalists should have less of a concern of what is and isn’t “journalism,” and more of a concern for important, focused information that is useful to people’s lives and helps them understand the world. A newspaper ought to be that: a fair look at current, important information for a readership.

Second, it’s important to note I’m not making an all-or-nothing proposition; I’m not saying newspapers should turn completely to vast collections of data, completely abandoning the format of a news article. News articles are great for telling stories, analyzing complex issues and all sorts of other things. An article — a “big blob of text” — is often the best way to explain concepts. The nuances of the English language do not map neatly to machine-manipulatable data sources. (This very entry, which you’re reading right now, is a prime example of something that could not be replaced with a database.) When I say “newspapers need to stop the story-centric worldview,” I don’t mean “newspapers need to abolish stories.” The two forms of information dissemination can coexist and complement each other.

But beyond the journalistic arrogance, another problem is that newspaper companies’ current software and organizational setup overwhelmingly discourages any sort of “information special-casing.” Just about every newspaper Web site content-management system I’ve ever seen is unabashedly story-centric. Want to post event calendar information into your news-site CMS? Post it as a “news article” object. Want to publish listings of recent crimes in your town? It goes in as a “news article.” There’s not much Joe Reporter, or even Jane Online Editor, can do about this, because Oh We’ve Invested So Much Into This CMS, and/or Our Newspaper Web Site Doesn’t Employ Any Computer Programmers. (The latter of which makes as much sense as a film director refusing to employ cameramen or video editors.)

When I worked at LJWorld.com, we wrote a CMS from the ground up to be able to handle all these distinct types of information. (And we created the Web framework Django to let us handle new types of information rapidly.) But, before that, we had an old CMS, and our night producers, whose job it was to copy-and-paste everything from the print newspaper into the Web system, would routinely publish all the newspaper’s little special features as “news articles.” The “photo of the day” newspaper feature would get posted as a story with no text — just a photo. The Onion-style “On the Street” feature would get posted as a “news article” object containing a question, four photos, and four responses. Each recurring newspaper feature was posted as a “news article,” regardless of whether it actually was a news article — simply because that’s all the content-management system knew how to do.

This is a subtle problem, and therein lies the rub. In my experience, when I’ve tried to explain the error of storing everything as a news article, journalists don’t immediately understand why it is bad. To them, a publishing system is just a means to an end: getting information out to the public. They want it to be as fast and streamlined as possible to take information batch X and put it on Web site Y. The goal isn’t to have clean data — it’s to publish data quickly, with bonus points for a nice user interface.

But the goal for me, a data person focused more on the long term, is to store information in the most valuable format possible. The problem is particularly frustrating to explain because it’s not necessarily obvious; if you store everything on your Web site as a news article, the Web site is not necessarily hard to use. Rather, it’s a problem of lost opportunity. If all of your information is stored in the same “news article” bucket, you can’t easily pull out just the crimes and plot them on a map of the city. You can’t easily grab the events to create an event calendar. You end up settling on the least common denominator: a Web site that knows how to display one type of content, a big blob of text. That Web site cannot do the cool things that readers are beginning to expect.
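To make the lost opportunity concrete, here is a minimal sketch in Python, assuming crimes and events are stored as their own record types instead of as generic articles. The `Crime` and `Event` classes, their fields and the sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Crime:
    offense: str
    occurred_on: date
    latitude: float
    longitude: float


@dataclass
class Event:
    title: str
    starts_on: date
    venue: str


def map_points(crimes):
    """Return (lat, lon, label) tuples that a mapping tool could plot."""
    return [(c.latitude, c.longitude, c.offense) for c in crimes]


def events_by_day(events):
    """Build a simple calendar: each date maps to that day's event titles."""
    days = defaultdict(list)
    for event in sorted(events, key=lambda e: e.starts_on):
        days[event.starts_on].append(event.title)
    return dict(days)


# Made-up sample data, for illustration only.
crimes = [
    Crime("burglary", date(2006, 9, 1), 38.97, -95.24),
    Crime("vandalism", date(2006, 9, 2), 38.96, -95.25),
]
events = [
    Event("Farmers market", date(2006, 9, 9), "Downtown"),
    Event("City council meeting", date(2006, 9, 5), "City Hall"),
]

points = map_points(crimes)
calendar = events_by_day(events)
```

Neither function is possible if the latitude, the offense or the event date exists only as words inside an article body.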

Then there’s the serendipity advantage. When I worked for LJWorld.com, we worked with the local weathermen to create a weather site that displayed the weathermen’s forecast for the next few days. I made them a Web interface that let them enter the predicted high temperature, low temperature and sky conditions — all in separate database fields. There really wasn’t any reason to use separate fields for these values other than the fact that the site’s design called for presenting the temperatures in a different color than the conditions, and we didn’t want the weathermen to have to remember to insert the HTML coloring codes in the right place. But it wasn’t until several months later that we reaped some real benefits of databasing the information, when we were putting together Game, an exhaustive database of local little-league teams and games. (Yes, you read that right.) We created a page for every little-league team and every little-league game, and when it came time to create the game pages, one of us said, “You know, these games tend to rain out a lot. It’d be really cool if we could somehow display the weather forecast for each game.” And, boom! One of us realized that we already had weather forecast data, in nice, sliceable-and-diceable format, thanks to our database populated by the weathermen. Ten minutes later, our little-league pages displayed weather forecasts. Serendipity.
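The little-league example boils down to joining two kinds of records on a shared date. The Python sketch below shows the general idea only; it is not the code we actually ran, and the class names, field names and sample values are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Forecast:
    day: date
    high_f: int    # predicted high temperature, in its own field
    low_f: int     # predicted low temperature, in its own field
    sky: str       # sky conditions, in their own field


@dataclass
class Game:
    home: str
    away: str
    day: date


def forecast_for_game(game, forecasts) -> Optional[Forecast]:
    """Look up the forecast for the day a game is played, if one exists."""
    by_day = {forecast.day: forecast for forecast in forecasts}
    return by_day.get(game.day)


# Made-up sample data, for illustration only.
forecasts = [
    Forecast(date(2006, 6, 10), 88, 67, "thunderstorms"),
    Forecast(date(2006, 6, 11), 84, 65, "sunny"),
]
game = Game("Tigers", "Cubs", date(2006, 6, 10))

match = forecast_for_game(game, forecasts)
```

The lookup works only because the forecast was entered as separate fields keyed by date. A forecast typed into a text blob would have needed parsing first.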

Finally, I’ll note that some newspaper Web sites are actively looking to improve their understanding of these issues, and a number of Web editors have contacted me to ask if I know of anybody — please, Adrian, anybody! — who has tech skills and would be interested in applying his/her knowledge to a newspaper Web site. If you have these skills, please contact me, and I’ll put you in touch. The industry needs you.

UPDATE, several years later: This essay inspired the creation of the fantastic PolitiFact, which won the Pulitzer Prize in 2009 and is a great example of treating news data with respect.

Comments

Posted by Trevor Turk on September 6, 2006, at 2:58 p.m.:

I totally agree with you, but there's one thing I'm curious about. In the case where all of the newspaper content was being dumped into the same big-blob type of format, isn't there a way to use computers to "scrape" structured information out?

I know that this is nowhere near as easy, efficient, or reasonable as getting the information in there correctly (repurpose-ably) in the first place, but I'd bet there's a large opportunity in figuring out a way to add structure to large masses of existing newspaper information out there - provided that the data could be pulled out in a meaningful way, the majority of the time.

I wonder if you've thought about this problem/opportunity, and if you had any ideas about how to salvage existing data stores like the ones in that old CMS you were using.

Posted by Chris Heisel on September 6, 2006, at 5:26 p.m.:

Adrian,

Great post, I've been trying to explain the same ideas to folks at work for weeks as we talk CMS, but you've said it better than I could!

Posted by Ryan on September 6, 2006, at 7:13 p.m.:

Amen! and hear, hear!

Posted by Brian Hamman on September 6, 2006, at 7:24 p.m.:

Awesome post Adrian, it sounds like there's some momentum building on this issue.

We're in the process of database-izing our production here at the University of Missouri/Columbia Missourian (thanks for Django) and we're running smack into the debate you're talking about. As a newspaper with a dual role of covering the community and training future journalists, we're especially concerned with what role journalists have to play in structuring and collecting all this information. We've only started to get our feet wet, but I'm constantly surprised and impressed by how many human decisions need to be made in order to cram the real world into a database -- both by the programmer and by those out there making the phone calls to gather the information.

I'm wondering whether journalists couldn't learn a thing or two from our friends in library sciences. Before coming back to journalism I worked a year or two in digital archives where we tackled very similar problems: How do you take "stuff" and put it into a database so that a) you can find it again and b) the structure of the data suggests meaningful connections you didn't know about beforehand. As much as I hated them at the time, I think that journalists could do well to adopt some of the metadata standards (such as Dublin Core or EAD) used in digital libraries. If we play by those rules then suddenly our news content (be it articles or databases of people, events, etc.) can be connected to incredible resources. (The Library of Congress digital collection, for starters).

When I start to throw around the digital library terms in the newsroom I often run into a question similar to the role of a journalist: what is the difference between news and information? Is it the role of newspapers to collect and organize what's going on in our world -- to build an information resource -- or to write stories that filter and explain that mass of information? I think you've long been saying, Adrian -- and I totally agree -- that the answer is, increasingly, both. But there's a lot of wiggle room in there. Do we stop at collecting data that's already structured or can be automatically processed, or should we invest resources into gathering the information and metadata that would tie our current and local news and information into a global and historical storehouse?

Posted by Mike Orren on September 6, 2006, at 8:22 p.m.:

Hey Adrian:

Very well put! I expounded a bit here, in the context of Pegasus News: http://www.texasgigs.com/blogs/notmusic/2006/sep/06/data/.

Not much else to add that you didn't already say perfectly well here. However, I'll throw in my definition of database that we're hanging in our newsroom:

"Database [dey-tuh-beys] -verb: to transform pieces of information that are useful for a moment into a network of information that is useful forever."

In my mind, that's the primary job of a news organization, and with the growth of user-generated and shared content, the priority gap between databasing and reporting within a news organization will only widen.

Posted by Andy Baio on September 6, 2006, at 9:31 p.m.:

When I was working in journalism, I found that papers were unwilling to pay market rates for programmers... I'm sure the biggest papers have changed in that respect, but I wonder if there isn't still reluctance to pay a lead programmer more than the managing editor. (According to the 2005 Primedia salary survey, the average managing editor is making $81k.)

Posted by Todd Zeigler on September 6, 2006, at 9:31 p.m.:

Great post. The original post I wrote was from the perspective of a user of newspaper sites and a web developer. Things I'd like to see. It's interesting to read your thoughts as someone fighting the fight, so to speak.

Posted by PJ on September 7, 2006, at 1:02 a.m.:

You're a genius. Why didn't I think of that?!

As Johnny #5 says: Need more input!

Posted by anonymous on September 7, 2006, at 1:42 a.m.:

Isn't this what the Semantic Web is supposed to solve?

Posted by Jacob Kaplan-Moss on September 7, 2006, at 3:08 a.m.:

Adrian: well done!

Andy: you've got a good point about relative salary disparity; I could probably double my salary working for the J-W if I went off to one of them thar "Web 2.0" companies. However, I think it's much more complex than papers being unwilling to pay market rate.

First, geeky companies are pretty tightly clustered around Silicon Valley (and, to a lesser extent, places like Chicago, Boston, and New York), but newspapers are everywhere. So part of the difference is simply a cost-of-living thing; my half-San Jose salary puts me very comfortably above average Lawrence wages. So while on paper I make less than the average programmer, the relative power of the dollar in Kansas makes the issue more complex. Sure, Lawrence is exceptionally cool -- enough to bring a Bay Area born-and-raised boy out to the prairie -- but there are dozens if not hundreds of similarly awesome small towns with cool local papers looking for programmers.

Put that way, it seems to me that part of the problem is that programmers aren't interested in looking outside the SV bubble. Until I came to the J-W, I never even considered looking for a job at a news company. Didn't even cross my mind.

That dovetails nicely into the next bit that makes this a more complex issue: if you find the right newspaper, working for a newsroom can be far better than working for any dot-com. My job is hands-down the best job I've ever had, in no small part because newspapers need us for their very survival.

Most news organizations, although slow to adapt and late to the party, are finally realizing just how compelling web-based journalism can be, and they're creating positions for us faster than we can fill 'em.

So I think what I'm saying is that money isn't everything. If you can catch a job with a newspaper that's coming around to this new vision of journalism that Adrian's been championing, you'll probably never look back. I don't know about most programmers, but I certainly don't really do this for the money; I'll take a good job at low pay over a crappy, well-paid one any day.

So my own twist to Adrian's call for programmers is this: if you're stuck putting cover sheets on your TPS reports, start looking at jobs with newspapers; there are some outstanding possibilities out there.

Posted by Karl on September 7, 2006, at 3:53 a.m.:

Hello Adrian,

I've been meaning to say hello to you for a number of different reasons over the past few years.

I'm an old Knight Ridder Digital developer. One of the folks that helped develop the Cofax CMS that was later replaced by KRD with... something else.

Cofax was a framework as well as a CMS, and in some very positive ways (well *I* think so :)), Django reminds me of it. Cofax was open sourced, but when KRD replaced it, well, work pretty much kept me from going back, refactoring, and taking it where it could still go. It's still in use in many places. Well enough of that...

I definitively agree with you that newspapers are terrific places to work if you are a software engineer. The pace is quick, the work challenging, and you get the rare opportunity to not only practice your profession, but do so building tools and services that connect, inform and empower people.

It's hard to beat.

anonymous - yes, I think Adrian is talking Semantic Web here. But like Adrian's call for newspaper organizations to take a hard look at how they manage information in their publishing systems, Tim Berners-Lee has made the same call to the web developer community. The hard sell has been that the Semantic Web likewise solves a series of problems of lost opportunity. It requires an investment in time and effort by the developer community to see its potential achieved. Adrian, please correct me if that's an incorrect understanding on my part.

Great piece.

Posted by Doug Karr on September 7, 2006, at 5:21 a.m.:

This is great information regarding how writing patterns need to change for the web. However, I think there's a lot more Newspapers could do to improve their sites. They should have Geographic interfaces to EVERYTHING... stories, classifieds, ads, etc. and they must exploit the fact that they are the local experts with all of the resources. Many newspapers I go to look like the 'corporate' site because they've merged their IT resources from multiple newspapers. These sites don't talk enough about the reporters, their regions and topics, etc. They look like a newspaper online, instead of a regional news portal. The internet doesn't have geographic boundaries, but newspapers do! They need to exploit it! Why would I read a story from the AP when I can get that on any site? I shouldn't!!!!

http://www.douglaskarr.com/2006/09/05/newspapers-need-to-think-small/

Warmest Regards,

Doug

Posted by Grant Barrett on September 7, 2006, at 1:02 p.m.:

This kind of transformation has already taken place in lexicography, which did indeed go from blob-type data to a high level of data tagging, generally using XML. An XML-based system means that any data can be tagged, so a story could be written as it always has been but every name in it could be tagged as [name] and every date in it as [date] and then you can have attributes on those tags, too, so [name] could also be marked as [company] or [person] and [date] as [birth], [death], etc., etc. A well-tagged XML document is amazingly rich with metadata that is invisible to the user.

This is, of course, what the semantic web proposes, but it's important to understand that this kind of data management has added new labor costs and has increased the total time that it takes to make dictionaries. Both lexicography and journalism tend to be consistently short on qualified labor and the balance of skills to pay is usually out of whack (I say this both as a lexicographer and as a former journalist). But there's no reason that taggers need to be on staff. This work can be jobbed out very cheaply to anywhere in the country or the world, and that, in fact, is how dictionary publishers do it so that they don't add costs that force the final sale price of a dictionary to be completely out of whack with the market.

Of course, the tagging is easier if the person doing the collecting (the journalist or the lexicographer) thinks about how the data will be structured from the beginning, but a lot of it can be done after the fact. Which leads me to another point: the greatest failure of social tagging (folksonomy) is that the elements being tagged are not discrete enough. People tend to be tagging entire web sites, photos, or articles, rather than headlines, bylines, words, names, or parts of pictures (though Flickr's system for tagging of parts of photos is so well done and searchable that I'm surprised that there seem to be no newspapers in the country who have adopted this). So imagine a newspaper that allowed mass tagging of every discrete element of its articles, of parts of pictures, of every number that appears in an article. There'd be the usual problems with bad actors, but the end result, if the body of public taggers was high enough, would be a mass of tags that could then be judged by a hired master tagger, rather than entered afresh.

Then, if you balanced that with a DTD for a completely different type of structured tagging, you have two types of metadata that can be searched against, collated, sorted, and used to generate new story ideas.

In any case, anyone can be taught to use XML. Once the DTD (which defines what tags are possible) is written, where the programmer is needed is not in tagging the data but in processing that data, in writing the scripts that will transform the data into something useful.
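A small Python sketch of the processing side, using the standard library's `xml.etree.ElementTree`. The `name` and `date` tags follow the description above, but the `type` and `value` attributes, the sample sentence and the `extract` function are assumptions made up for this example, not an established DTD.

```python
import xml.etree.ElementTree as ET

# A made-up, inline-tagged sentence. The reader sees ordinary prose;
# the tags and attributes carry the metadata.
story_xml = """
<story>
  <p><name type="person">Jane Doe</name>, founder of
  <name type="company">Acme Hardware</name>, was born on
  <date type="birth" value="1950-03-02">March 2, 1950</date>.</p>
</story>
"""


def extract(xml_text):
    """Pull (type, text) pairs for names and (type, value) pairs for dates."""
    root = ET.fromstring(xml_text.strip())
    names = [(el.get("type"), el.text) for el in root.iter("name")]
    dates = [(el.get("type"), el.get("value")) for el in root.iter("date")]
    return names, dates


names, dates = extract(story_xml)

# The untagged prose is still recoverable for display.
plain_text = " ".join("".join(ET.fromstring(story_xml.strip()).itertext()).split())
```

The same tagged document yields both a readable sentence and a list of people, companies and dates that a script can sort or collate.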

Posted by Karl on September 7, 2006, at 3:16 p.m.:

If I recall (it was more than a few years ago for me), the publishing systems at the Philadelphia Inquirer and Daily News enabled writers to tag content as well.

We had those systems export docs that we imported nightly (no copy and paste for us) with information denoting the metadata. We used categorization and association to distinguish datatypes in the database.

Knight Ridder was researching Autonomy and other tools to algorithmically process unstructured information, but me and others felt that ultimately, the best path was tools and procedures that empowered journalists/editors/staffers to do so.

Posted by Paul Ford on September 7, 2006, at 3:57 p.m.:

One of the great benefits of semantic tagging is that it results in a number of useful, focused web pages as output. Because these are narrowly focused by subject they often place high in Google's rankings, which drives traffic. You have more pages and more traffic, which means you can dramatically increase the number of ads you host, or offer advertisers more traffic. If you keep your taxonomy small and useful, hundreds or thousands of well-organized pages can be generated with minimal human intervention and will yield a significant ROI over time.

Posted by Adam Rice on September 7, 2006, at 4:22 p.m.:

This is basically what Paul Ford did for Harpers, right?

Posted by Adam Rice on September 7, 2006, at 4:23 p.m.:

Oh, that's eerie. Paul's post wasn't there when I started writing mine.

Posted by Dan Conover on September 7, 2006, at 6:30 p.m.:

Moving to standards-based, structured-information journalism is an absolutely essential step, not only for the profession, but perhaps for the culture. It's the beginning of a culture that scales to the scope of our global media.

Posted by Phil Wolff on September 7, 2006, at 6:54 p.m.:

Providing structured data doesn't address the revenue issue, at least not directly.

I'll agree that it would be a value added service, and that you would readily synthesize new products to sell. The act of publishing structured data won't solve the industry's problem of a business (news) and an occupation (journalism) with falling barriers to entry and a long line of new entrants.

As to the ability to create new pages for each jot and slice of data and news, and their respective ad inventory, that doesn't mean those pages are read. Readers have limited attention, read n stories, then move on. That strategy just swaps one page of ad inventory for another.

Enormous value added services will jump out at you as you start to organize and collect your raw data. For example, if you're collecting structured data from the fire department blotter, you can map fire incident data against fire insurance rates. Your analysis might produce news, but you can also become the community watchdog over another industry. What is it worth to a homeowner or renter to see if you're getting a fair shake from your insurance rep? Would they subscribe to a service or pay a small one-off fee? And if reporters and fire departments everywhere publish to the same specification, you can compare your local data with regional, national and international data.

Posted by Andrew on September 7, 2006, at 6:57 p.m.:

Chicago Public Radio's plan for the website for their new station calls for something not too far off of what you're talking about here. And it just so happens they're looking for developers to build it. So if working with audio is more interesting to you than working for a newspaper, you might check into it.

Posted by dave mcclure on September 7, 2006, at 8:35 p.m.:

damn adrian... that's a fine post.

i was getting ready to write a part II followup to my post on Publishing 2.0, but i think you just mentioned most of the points i was going to make... with probably a lot more thoughtfulness ;)

kudos,

- dave mcclure

http://500hats.typepad.com

Posted by Richard K Miller on September 7, 2006, at 10:07 p.m.:

Great post. I would love to see news organizations publish structured data in the manner envisioned here.

One summer I worked at a doctor's office. I set up a PHP script to scrape the local newspaper's obituaries and compare the names against the patient database. Each morning the assigned staff member gets an email with possible matches, helping the office avoid the embarrassment/impropriety of sending a bill to the deceased. It would be so much easier (and foolproof) if the newspaper offered an RSS or XML feed of the obits.
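For comparison, here is roughly what the matching step might look like if the obituaries arrived as structured records rather than scraped HTML. This is a Python sketch, not the PHP script described above; the record layout, field names and sample data are hypothetical.

```python
def normalize(name):
    """Lowercase a name and collapse whitespace so trivial differences match."""
    return " ".join(name.lower().split())


def possible_matches(obituaries, patients):
    """Return (patient_id, first, last) for each obituary whose name matches a patient."""
    index = {}
    for patient in patients:
        key = (normalize(patient["first"]), normalize(patient["last"]))
        index.setdefault(key, []).append(patient["id"])

    matches = []
    for obit in obituaries:
        key = (normalize(obit["first"]), normalize(obit["last"]))
        for patient_id in index.get(key, []):
            matches.append((patient_id, obit["first"], obit["last"]))
    return matches


# Made-up sample data, for illustration only.
obituaries = [
    {"first": "Jane", "last": "Doe", "died": "2006-09-01"},
    {"first": "John", "last": "Roe", "died": "2006-09-03"},
]
patients = [
    {"id": 101, "first": "jane ", "last": "DOE"},
    {"id": 102, "first": "Mary", "last": "Major"},
]

matches = possible_matches(obituaries, patients)
```

A name match alone is only a possible match, so a person would still review the list, as in the original workflow.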

Posted by Paul Mcgrath on September 7, 2006, at 10:36 p.m.:

Interesting discussion. I'm surprised the list didn't include the most obvious candidate for database information. The dateline. Databasing the dateline information would easily allow for a mashup of stories onto say Google Earth.

Imagine you could then open Google maps and see every fire in an area, or open Google Earth and look at all the news in your neighbourhood, or a country on the other side of the world.
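One way to picture this: if each story stored its dateline as coordinates, exporting them for a mapping tool would be a small script. The Python sketch below writes GeoJSON, a common format for map data that many mapping tools can read. The story fields and sample values are assumptions for illustration.

```python
import json


def stories_to_geojson(stories):
    """Convert stories with lat/lon datelines into a GeoJSON FeatureCollection."""
    features = []
    for story in stories:
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [story["lon"], story["lat"]],
            },
            "properties": {
                "headline": story["headline"],
                "dateline": story["dateline"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample data, for illustration only.
stories = [
    {"headline": "Garage fire on Main Street", "dateline": "LAWRENCE, Kan.",
     "lat": 38.97, "lon": -95.24},
]

collection = stories_to_geojson(stories)
geojson_text = json.dumps(collection, indent=2)
```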

Posted by MH on September 8, 2006, at 12:28 a.m.:

This is a really nice idea. Could you possibly provide an example of a short news article thusly converted and presented? Most of it makes sense, but I'm having a hard time imagining how *verbs* and the *relationships* between concepts get captured, and how they would be represented.

To take a very truncated example:

CAPE CANAVERAL, Fla. - Caught in a scheduling squeeze, NASA decided to try to launch space shuttle Atlantis on Friday without replacing a troublesome electrical component.

--------

story.title = "NASA to try Friday shuttle launch"
story.author.1.first = "Mike"
story.author.1.middle = ""
story.author.1.last = "Schneider"
story.author.1.title = "Writer"
story.author.1.organization = "Associated Press"
story.author.1.organization.subtype = "News Agency"
story.posted.date = "2006.09.06 6:05PM EST"
story.updated.date = "2006.09.06 6:23PM EST"
story.location.lat = 45.12
story.location.long = 67.09
story.location.site = "Cape Canaveral"
story.location.site.type = "Space installation"
story.location.city = "Cape Canaveral"
story.location.region = "Florida"
story.location.country = "US"
story.actor.1.name = "NASA"
story.actor.1.type = "Organization"
story.actor.1.description.1 = "Space Agency"
story.object.1.name = "Atlantis"
story.object.1.description.1 = "Space Shuttle"
story.object.1.description.2 = "spacecraft"
story.object.1.description.3 = "vehicle"
story.action.1.name = "...?"

--------

How would you complete (or change) this?

Posted by Joe Murphy on September 8, 2006, at 2:51 a.m.:

Adding zip code data to articles, photos, obits (and hometown fields) etc. would allow for some slick content aggregating in lots of ways.

Also, it would allow news sites to index content based on zip code, which opens up a world of relevant advertising possibilities.

Posted by Rex on September 8, 2006, at 8:38 p.m.:

I'm surprised not to see more people here saying something along the lines of "our CMS already does this." I'm familiar with most of the major proprietary systems, and all of them allow for some sort of media-specific data store, which was the point of Adrian's original post.

Perhaps this discrepancy can be explained by the fact that this post is directed specifically towards "newspapers," rather than the more generic "media industry," which oftentimes has to deal with more multimedia scenarios and is therefore built around data-specific storage.

The other things mentioned here in the comments -- inline semantic markup and such -- are certainly less prevalent, but they also exist to some degree, again probably at some of the bigger media entities.

Posted by F.Baube on September 8, 2006, at 9:04 p.m.:

Where's my FML (Fire Markup Language)?

Posted by Dan Stout on September 9, 2006, at 3:15 a.m.:

I posted my comments up on Manufactured Environments here.

Posted by Dan Crawford on September 10, 2006, at 7:02 a.m.:

Excellent post Adrian.

Here is my 'service' idea for a newspaper that employs structured information:

The names (and addresses) of all businesses and people mentioned in an article could be used to query a political donations database. The results would be compiled and summarized at the end of the article. This would add another dimension to the article and provide a means for people to 'read between the lines'.
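A minimal sketch of that lookup in Python, using the standard `sqlite3` module. The `donations` table, its columns, the sample rows, and the function name `donations_summary` are all hypothetical stand-ins for a real donations source, and the list of names is assumed to come from the story's structured fields rather than from parsing its prose.

```python
import sqlite3


def donations_summary(conn, names):
    """Return (donor, recipient, total) rows for the given donor names."""
    if not names:
        return []
    placeholders = ",".join("?" for _ in names)
    query = (
        "SELECT donor, recipient, SUM(amount) FROM donations "
        f"WHERE donor IN ({placeholders}) "
        "GROUP BY donor, recipient ORDER BY donor, recipient"
    )
    return conn.execute(query, list(names)).fetchall()


# Hypothetical in-memory database standing in for a real donations source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donations (donor TEXT, recipient TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO donations VALUES (?, ?, ?)",
    [
        ("Acme Corp", "Candidate A", 500.0),
        ("Acme Corp", "Candidate A", 250.0),
        ("Jane Doe", "Candidate B", 100.0),
        ("Someone Else", "Candidate A", 75.0),
    ],
)

# Names mentioned in the article, taken from its structured fields.
mentioned = ["Acme Corp", "Jane Doe"]
summary = donations_summary(conn, mentioned)
for donor, recipient, total in summary:
    print(f"{donor} gave {total:.2f} to {recipient}")
```

Exact name matching is the weak point of a sketch like this; matching on addresses as well, as suggested above, would be one way to reduce false hits.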

Posted by Rob Nelson on September 11, 2006, at 5:32 a.m.:

Great post Adrian!

I think this goes a long way toward explaining the underlying philosophy of the Django framework - as a conduit and organizer of information. It's like Django is providing the building blocks for creating a library, with the URL being a kind of dynamic 'call number'. I think this is the quality that differentiates it from other web frameworks.

And I think that's what makes it different from a CMS. The proposal of Django is "organize your thoughts conceptually, and I will present them" - as opposed to "put your documents in my system and I will display them" (which seems to me what the CMS approach is).

I don't know. I'm not a journalist so I'm just throwing in my opinion as a programmer. I think you are totally right about journalism though. We've all seen how the blogosphere has changed the landscape of journalism. And that feels analogous to the open-source software movement - the more eyes there are, the better the analysis, the better the final product - and ultimately - the better off humanity is.

Uggh. That sounds pretentious. Oh well. I just wanted to make my observation about Django and say great article!

Posted by Bart Anderson on September 11, 2006, at 10:46 a.m.:

I don't see it.

Why? What's the point?

This feels to me like building an infrastructure because it can be done, not because there is any compelling need for it. What you're describing is a data-gathering operation, for which there can be tremendous benefits. But what data? For what purpose? Is there a market for it? Does anybody want it?

It's not clear to me why general newspapers would embrace new projects such as you propose without any promise of revenue.

On the other hand, I can see specialized journals and information services eager for these schemes, for example business or engineering news, in which organized data is very valuable.

From my point of view, the news media are going in a different direction altogether - back to the journalism of the 18th and 19th centuries, when newspapers were financed by political parties or businessmen to promote their points of view. On the one hand, we have the success of Fox News; on the other hand, the vibrant websites of the bloggers and activists.

Also, community journalism may be making a comeback since the costs of entry for a web-based publication are so small.

Posted by Paul Watson on September 11, 2006, at 4:38 p.m.:

How are journalists capturing all the research data that makes up a story? If it is electronic or we can scan it in, then let's do that and have it as an addendum to the story. Present the story and then let people at the jumble of facts if they wish to delve deeper.

As for "publishing photos as stories" I'd say that is a software problem. The journalist is telling a story, just this time with a photo. The software should recognise this and change the type tag.

What you wrote about the weather man seems spot on. He isn't creating a database of weather facts, he is just trying to get his predictions out to the masses. That it was recorded as a database is fantastic but he doesn't need to know that.

Posted by Nick Fitzsimons on September 11, 2006, at 6:08 p.m.:

It seems to me that microformats could be of value here in allowing human-readable news stories to be enhanced with additional semantic elements allowing automated tools to extract structured information.

Posted by John Gardner on September 11, 2006, at 11:29 p.m.:

Great stuff!

I'd love to hear your thoughts on how these ideas could be used to generate revenue for a paper, as that is usually (always?) the first question asked when I propose something to my clients.

Posted by Danny Defreitas on September 12, 2006, at 6:10 a.m.:

I don't understand what you would do with this data. News sites need to focus on having visual impact, interactivity, and video. A "wow factor" that will keep people interested in the news. Words and data don't cut it.

Posted by Adrian on September 12, 2006, at 4:25 p.m.:

Danny: Of all the lessons Google has taught us, one of the biggest is that words and data do cut it.

Posted by Mike Linnane on September 12, 2006, at 4:46 p.m.:

Some added bonuses (forgive me if this has already been covered):

- having the story meta readily available (on-page, on-web, etc.) would minimize or eliminate the spin factor

- reduce barriers to access to raw information for the population at large

Posted by Danny Defreitas on September 12, 2006, at 6:14 p.m.:

Face it, everybody has the same words and data. News is a commodity game. What will set one news site apart from others is presentation, innovative interfaces and visual storytelling.

Posted by Simon Seitz on September 12, 2006, at 7:13 p.m.:

I'm an information design student from Austria. When I read your post it struck me that you were writing a "Wanted" letter for good information design. Comprehensible representation of large amounts of information - making it quickly understandable, making it fun to learn/play with - that is what information design is all about.

Information graphics, cool new ways for designing maps and charts, interactive environments, easily scannable and typographically beautiful text, clean, clear corporate design - there is so much in the world of information design that the big media haven't yet explored.

Posted by Derek Willis on September 12, 2006, at 10:59 p.m.:

Danny,

With respect, everyone in the newspaper industry does not have the same data. If this were the case, there would be no need for any newspaper except USA Today. Think of the financial data the Wall Street Journal gathers, or the local data that any number of smaller papers collect. The biggest papers don't collect small-market data; the small papers don't do national stuff.

Posted by Kevin Page on September 14, 2006, at 11:59 p.m.:

I think what you're describing isn't so much a problem with journalistic arrogance, but rather, as you point out, a content management or software problem that "overwhelmingly discourages any sort of information special-casing." Without any type of standardized industry-wide attributes for information we're forced to store information with attributes that make sense. Unfortunately, these are probably not the best for long-term content management. I like your concept, but we need to examine the real barriers to carrying it out or why we should.

Posted by Ryan Shaw on September 20, 2006, at 7:04 a.m.:

I'm curious what you think about the emerging IPTC standards like NewsML2.

Posted by Jack on September 20, 2006, at 11:46 a.m.:

I think this is a great idea and probably one that, as the amount of stored information continues to grow exponentially, will not only become the popular structure, but necessary.

Posted by Marko on September 20, 2006, at 7:03 p.m.:

I am wondering if it is possible. Of course, for structured events it is. But so much news is not structured or simply not all the facts are known.

On several occasions I have tried to put data in tables. We found out that the information is not available, or is only available days or weeks later. By that time my colleagues and I have moved on to another story.

Sometimes we just don't care about the story, because there are more important stories that day. Miss one story in a category, and your database is useless.

But even if we did cover all the stories about, say, fires, and managed to get all the information, down to the names of all the firefighters who participated, there would still be a whole load of questions to resolve. For example, when does a firefighter count as involved? Do we have to count the emergency operator too? And what if a specialist is called in for a second opinion on the chemicals to be used to put the fire out? Do we count him?

Let's take it to a higher level. What if a victim dies three months later because of the fire? Do we count him or her? Do we even hear about the death?

Even higher level: the place where something is happening. For fires it is very obvious. But what about research? Do you have to geotag the place where the research took place? And what if 20 universities were involved?

What about stories about legislation? Is everything geotagged in Washington, London or The Hague? Even when the legislation is actually about a nuclear reactor that isn't located in one of the capitals? How about analyst stories? Do you geotag the place where the analyst is sitting behind his computer?

Tagging is great and of course there are many types of information that can be tagged. But we journalists have to deal with many stories that are not complete, that are a category on their own or are partially unknown or secret.

In a text blob a story is always unique. Putting the info for common stories in a database is practically impossible.

Posted by Jeff Croft on September 21, 2006, at 2:38 a.m.:

Marko-

I think you may be over-thinking the matter a bit, at least for today. I love the ideas and enthusiasm behind something like what you describe with getting that level of explicit detail about a fire -- but you're right, that's probably impractical in today's environment (at most news organizations, anyway).

Let's start with just getting news organizations to do simple things like not posting birth announcements and obits as stories. Or start with creating a simple "Fire" object in the database that has even five or six basic fields like "location," "location type," "date," "fatalities," etc.
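To make the idea concrete, here is a minimal sketch of such a "Fire" object in Python. Only the four fields named above are included; the class layout, the sample records, and the two queries are assumptions for illustration, not a description of any existing system.

```python
import datetime
from collections import Counter
from dataclasses import dataclass


@dataclass
class Fire:
    location: str
    location_type: str
    date: datetime.date
    fatalities: int = 0


# Made-up records standing in for what reporters would enter.
fires = [
    Fire("100 Main St.", "residence", datetime.date(2006, 9, 1)),
    Fire("200 Oak Ave.", "business", datetime.date(2006, 9, 3), fatalities=1),
    Fire("300 Elm St.", "residence", datetime.date(2006, 9, 10)),
]

# Two questions a text blob cannot answer without rereading every story.
fatal_fires = [f for f in fires if f.fatalities > 0]
fires_by_type = Counter(f.location_type for f in fires)

print(len(fatal_fires), dict(fires_by_type))
```

Even with this few fields, the records can be filtered and counted, which is the point of starting small.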

You're trying to collect every bit of minutiae about every fire -- and that's just not feasible in your situation (and that of most other people). I'd say instead of "wondering if it's possible," you should be trying to formulate a plan to start small. Something is better than nothing, right?

Baby steps, my friend. Baby steps.

Posted by Anthony Moor on September 21, 2006, at 5:13 a.m.:

Allow me to add my props as well, Adrian. Well said and necessary to be shouted from the tops of the ivory towers.

Posted by Rogerr on September 21, 2006, at 1:43 p.m.:

Fascinating post. My wife is a reporter at a small daily. The pace is insane (8-10 stories per week), and salaries are shocking (editors make <$40k). They have ancient equipment, little tech support and no IT budget.

I totally agree with your premise. The structured information you refer to amounts basically to reporter's notes. One of the things that kills me about my wife's job is the amount of info that she gathers that never makes it into the story -- either the story develops a different way, or editors take it out. That work is just waste right now.

Maybe the way to really sell this to papers like my wife's is to make the case that by structuring their notes in this fashion the actual writing of the story can go faster -- and that all the info they collect is stored and potentially of use in the future.

Posted by Marko on September 21, 2006, at 3:39 p.m.:

Jeff, of course baby steps are required, but then again, just implementing such a system at a low level requires some very large steps, both in the minds of the reporters and on the technical level. And even if you just make a field for a location, which is always present during a fire, from then on you MUST cover all fires. Otherwise your database is not usable.

I have done so with several projects (like a table where you can see which countries are sending what to Lebanon in terms of military assistance) and it is HARD to keep up and very hard to get everything in the same measurements. Some countries say they send 1500 troops, others mention a battalion, some add non-military personnel or send 'observers' - whatever they may be. Consequently, you cannot calculate the number of military personnel deployed in Lebanon. And this is about the easiest thing you can do in terms of databases/tables.
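One possible way to live with mismatched measurements is to store each report exactly as given and only add up the figures that share a unit, listing the rest separately. The Python sketch below assumes this approach; the `Contribution` class, the `summarize` function, the country labels, and the numbers are hypothetical, and it does not attempt to convert a battalion into a headcount.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contribution:
    country: str
    amount: Optional[int]  # None when no number was reported
    unit: str              # as reported: "troops", "battalion", "observers", ...


def summarize(contributions, countable_unit="troops"):
    """Sum the reports given in one unit; return the rest unconverted."""
    total = 0
    unconverted = []
    for c in contributions:
        if c.unit == countable_unit and c.amount is not None:
            total += c.amount
        else:
            unconverted.append(c)
    return total, unconverted


# Made-up reports in the mixed units described above.
reports = [
    Contribution("Country A", 1500, "troops"),
    Contribution("Country B", 1, "battalion"),
    Contribution("Country C", 20, "observers"),
]

total, unconverted = summarize(reports)
print(f"At least {total} troops, plus {len(unconverted)} reports in other units")
```

The total is then a stated lower bound rather than a precise count, which keeps the table honest about what is actually known.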

If you really want this, you might hire statisticians and documentalists and give them a lot of time to research the facts. But then you are turning yourself into a census bureau. Which is fine, but is it what you want?

Posted by Christopher Rivard on September 21, 2006, at 4:30 p.m.:

Great read.

There is a concept in the information architecture world of structuring content with strict facets and loose semantics. If facets were established for the major 'news types' and loose semantics used for more story/prose to make the information more accessible/readable, this would certainly allow for flexibility in distribution and analysis.

I would recommend the book Ambient Findability by Peter Morville as a great insight into trends in information architecture and ambient information.

As far as CMSs go - we are in the midst of applying some of these concepts to Blend, our object-based content management system.

Posted by Shawn Medero on September 21, 2006, at 8:01 p.m.:

This topic reminds me of research funded by DARPA (Defense Advanced Research Projects Agency) that looks into turning existing news stories, weblogs, and other forms of text (broadcast news transcripts, for instance) into a data structure that connects entities (people, companies, governments, etc.) to places and the events that brought them together. While DARPA is more interested in automating systems to pull out this information - I think the work done on tagging this information might prove somewhat useful for academic types looking to research this. The project uses an XML data store and has tags for entities, events, relations, and time mentions. There are limits to what people were asked to tag, of course (because this data is used for training machine learning systems), but perhaps looking at how they went about it, the guidelines given to researchers, the results, and the anomalies might give some insight into the next generation of CMS discussed here. At least one of the projects that funded research on this topic was called ACE (Automatic Content Extraction). A Google search turns up links to the NIST (National Institute of Standards and Technology) evaluation web site and various other presentations, project materials, etc.

Posted by Derek Willis on September 21, 2006, at 9:40 p.m.:

Marko,

I don't think anybody is suggesting turning newspapers into the Census Bureau. I think implicit in both Adrian's post and Jeff's response to your initial comment is that newspapers can pick their spots carefully. For example, most papers don't write about every crime that occurs, but they do usually write about every murder. They may not write about every fire, but they may write about every fire that results in a certain amount of damage or injuries or deaths. One of the beauties of a newspaper is the role of editing - in other words, selection - in the process. Take what we've done at the Post, for example: the Congressional Votes Database. It is theoretically possible to have a "plain English" description of every vote, but it's not something we're able to do. So our editors and reporters pick "key votes" and write explainers for those. That doesn't diminish the rest of the database - it simply elevates some portions of it.

Posted by Charlie Madigan on September 25, 2006, at 6:10 p.m.:

I think the big problem here is that the people who run newspapers, particularly in the news departments, don't yet understand that the internet functions as a new kind of medium, not merely as a vehicle for repurposing content from the paper. Your points about CMS rigidity are excellent, as is the argument for much broader thinking about the obligations of journalists. But I think what the world is waiting for is the development of storytelling that respects the caliber of collected information, but presents it in a way that uses every ounce of muscle new media has to offer. As a geezer, I have to say it feels to me a lot like the early days of television news, when it was pretty clear no one knew what to do with it, so they just read stories out of the papers. There's nothing wrong with stories that are well told. What about stories that are well told and have interactive components, video and photo galleries, too? That should be entry-level stuff everywhere. I also believe that day of new storytelling is a long way in the distance. It seems to me that editorial departments across a spectrum of newspapers have basically let go of the handles, which means marketing and advertising departments are defining the terms and creating the technology. Competition with other news sources should be the force driving this development, but it isn't. As long as many websites satisfy themselves with aggregated content that makes them look, read and feel just like everyone else, we're going nowhere. It would not hurt to start tech training for reporters and editors, too. There's nothing like a little tech ability to advance the cause. Anyhow, your comments are grand and right on point.

Posted by Martin Hsu on September 29, 2006, at 12:04 p.m.:

Great ideas in this post.

I think Adrian is sharing a vision where journalism and the use of statistics are merged to provide more insight into some types of stories. He clearly writes that this doesn't replace traditional storytelling (or innovative content delivery or wow factors, should that be your religion), but goes hand in hand with it.

Having 'repurposable' data is an asset that is hard to value because you never completely know when and how you'll be able to use it. Innovative use of this data is the way to extract value from this resource.

Changing the 'shape' of the data can add value by providing another way to look at the same story or type of story by enhancing its context. Anyone familiar with the fable of the 3 blind men and the elephant? While the 3 blind men had different "understandings" of what an elephant was, they were all correct. Collectively, they knew quite a bit about an elephant - more than any single "understanding."

Like any asset, it comes at an expense. There is no denying or avoiding that. Someone with the skills needs to create data models. Fortunately, the number of data models grows as the number of types of stories grows. The more data you want to slice-and-dice, the more you need to do - no surprise here.

Posted by Rich on October 2, 2006, at 2:39 a.m.:

I'll keep it short.

Great post, Adrian. Keep it up. :)

Posted by Suzana Barbosa on October 6, 2006, at 12:38 a.m.:

Hi Adrian,

I'm a Brazilian journalist and I would like to say: it's very good to read your articles about the use of databases to improve journalistic web sites. I was overjoyed when I read your interview at OJR because I'm currently doing my PhD in digital journalism and I have been discussing a new format for news web sites: digital database journalism.

Congratulations!

Suzana Barbosa

(www.facom.ufba.br/jol)