Sunday, May 24, 2009

Thinking about tagging

Tagging is a way to label articles and other information objects (such as events) for retrieval and clustering. A good community news web site will support searching articles, for example, and will find all articles that contain a word (or phrase). At their simplest, tags allow an article to be found even if it doesn't contain the search term.

Many times, articles will talk around a subject--the elephant in the living room. When we talk about cutting services and increasing public sector revenue, it is understood that we are talking about "budget," "city finances," and "Bud Conway," the city finance manager, even though those words might not be in the article (let's pretend). The extra words become tags so that the article can be retrieved for a searcher who would likely care.
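To make that concrete, here is a minimal sketch of a search that looks at tags as well as article text. The `Article` class, the `search` function, and the sample articles are all hypothetical, invented for illustration; a real site would use its own data model and a proper search index.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    # Hypothetical article record: a title, body text, and a set of tags.
    title: str
    body: str
    tags: set = field(default_factory=set)


def search(articles, term):
    """Return articles whose body contains the term or that carry it as a tag."""
    needle = term.lower()
    return [
        a for a in articles
        if needle in a.body.lower() or needle in {t.lower() for t in a.tags}
    ]


articles = [
    Article(
        "Council weighs cuts",
        "The council discussed cutting services and raising public sector revenue.",
        {"budget", "city finances", "Bud Conway"},
    ),
    Article("Spring fair", "The spring fair opens Saturday at the park."),
]

# The word "budget" never appears in the first article's text,
# but the tag lets the search find it anyway.
titles = [a.title for a in search(articles, "budget")]
```

Here `titles` ends up holding only "Council weighs cuts", which a plain text match on "budget" would have missed.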

In searching, you have precision and recall. Precision is the share of search results that actually match the query; at perfect precision, there are no irrelevant articles. Recall is the share of matching articles, even those at the fuzzy edges, that make it into the search results; pushing recall higher usually lets in a percentage of irrelevant ones. For most applications when we are searching a database of community articles, we want high recall, and tags help with that.
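A small worked example may help here. This sketch uses the standard definitions: precision is the fraction of returned articles that are relevant, and recall is the fraction of relevant articles that are returned. The article IDs and the two result sets are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for a set of returned results."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four articles a searcher would actually care about.
relevant = {"a1", "a2", "a3", "a4"}

# Text matching alone finds two of them and nothing else.
text_only = {"a1", "a2"}

# Adding tags finds all four, plus one article that is off topic.
with_tags = {"a1", "a2", "a3", "a4", "a9"}

p_text, r_text = precision_recall(text_only, relevant)   # (1.0, 0.5)
p_tags, r_tags = precision_recall(with_tags, relevant)   # (0.8, 1.0)
```

In this made-up case, tags trade a little precision (1.0 down to 0.8) for full recall (0.5 up to 1.0), which is the trade a community news site would usually accept.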

Tags also cluster information. The author of an article can tag it with "News" and readers can browse the News category and find the article, along with other news stories. If the author tags the article with two tags--"News" and "Schools," say--then the article appears in both categories (high recall in the browsing scenario).
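One simple way to picture this kind of browsing is an index from each tag to the articles that carry it. The sketch below assumes articles are just titles mapped to a list of tags; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def build_tag_index(articles):
    """Map each tag to the list of article titles that carry it."""
    index = defaultdict(list)
    for title, tags in articles.items():
        for tag in tags:
            index[tag].append(title)
    return index


# Hypothetical articles and the tags their authors chose.
articles = {
    "School board approves budget": ["News", "Schools"],
    "Road closures this weekend": ["News"],
    "Science fair winners": ["Schools"],
}

index = build_tag_index(articles)
news = index["News"]
schools = index["Schools"]
```

The article tagged with both "News" and "Schools" shows up in both lists, so a reader browsing either category will come across it.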

Then, there is the notion of readers tagging the stories, the way Flickr lets viewers tag photos. It's a little different with photos, obviously, because the photos can't be searched without captions or tags. With articles, tagging not only potentially helps searching, but it's also a way of interacting with the article--a measure of popularity, or of interest in an article. Two articles may each have been viewed 1,000 times, but if one had a lot more tags proposed, it probably engaged readers more. But do we really want readers deciding that something is "News"?

Ed Chi and Todd Mytkowicz of Xerox PARC wrote about aspects of tags recently. They explored the basic mystery of why uncontrolled reader tagging generally works. Users can tag articles with any combination of letters and numbers. Anarchy. Chaos. But, by most measures, the tag system achieves its goals. "Social tagging...[is] attempting to solve a mapping problem," they write. Users are collectively creating a map that will enable them and others like them to navigate the territory efficiently in the future.

One reason I think the potential for anarchy isn't realized is that griefers--vandals--like their mischief to be visible. If I tag a gossip story as "News" and it now appears on the Front Page, it's like spray-painting a mustache on a billboard--everyone can see how clever I am. But if I tag an article as "asdf," it is easily ignored.

Col Needham, the creator of the Internet Movie Database (IMDb), has written that a few fairly basic ad hoc tweaks to the search interface greatly improved the searchability of the very large database. Finding a movie like "20,000 Leagues Under the Sea" would be difficult without synonyms like "Twenty Thousand," "20 Thousand," "20000," and so on.

Mindful of Needham's experience, I think a community news web site will need both an anarchic reader tag system and a more controlled tag vocabulary used by authors and editors. We'll want to suggest tags for use by authors that perhaps come pre-synonymized. One selection by an author from a drop-down list might add three tags. For the reader: "Propose a tag. Type it here. We'll let you know."
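Here is a rough sketch of how the two halves might fit together. Everything in it is an assumption made for illustration: the vocabulary entries, the function names, and the idea that reader proposals wait in a review queue until an editor looks at them.

```python
# Hypothetical controlled vocabulary: each drop-down choice an author
# sees expands into a pre-synonymized group of tags.
CONTROLLED_VOCABULARY = {
    "City budget": ["budget", "city finances", "Bud Conway"],
    "Schools": ["schools", "education", "school board"],
}


def expand_selection(choice):
    """Return the tags that a single drop-down choice adds to an article."""
    return list(CONTROLLED_VOCABULARY[choice])


def propose_tag(pending, raw):
    """Queue a reader's proposed tag for editorial review; return the cleaned tag."""
    # Collapse extra whitespace and lowercase so near-duplicates merge.
    tag = " ".join(raw.split()).lower()
    if tag and tag not in pending:
        pending.append(tag)
    return tag


# One selection by the author adds three tags.
article_tags = expand_selection("City budget")

# Two readers propose what is really the same tag; it is queued once.
pending = []
propose_tag(pending, "  Property   Taxes ")
propose_tag(pending, "property taxes")
```

Keeping reader proposals in a separate queue means the anarchic side can stay open without anything a reader types landing on the Front Page unreviewed.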

By the way, Chi and Mytkowicz found that entropy makes tag systems work less efficiently as they grow. Tags become less descriptive over time and "tags are becoming less meaningful in regards to providing salient navigability."

"Even with a tagging system, the navigability of the document set is becoming more challenging over time. One way for users to respond to this evolutionary pressure is to increase the number of tags they use to specify a document."

Chi and Mytkowicz find that, as the number of articles grows, users are increasing the number of tags they apply, and searchers are using more search terms. They report that Yahoo!'s average query length was 1.2 words in 1998, 2.5 words in 2004, and 3.3 words in May 2006.
