Everything is data

Dan Cohen offers an interesting perspective on Wikipedia: he notes that the current discussion about authority does not engage with Wikipedia's full potential significance. He describes how a large, openly available knowledge base like Wikipedia is a valuable resource for emerging data mining and search technologies.

Let me provide a brief example that I hope will show the value of having such a free resource when you are trying to scan, sort, and mine enormous corpora of text. Let's say you have a billion unstructured, untagged, unsorted documents related to the American presidency in the last twenty years. How would you differentiate between documents that were about George H. W. Bush (Sr.) and George W. Bush (Jr.)? This is a tough information retrieval problem because both presidents are often referred to as just "George Bush" or "Bush." Using data-mining algorithms such as Yahoo's remarkable Term Extraction service, you could pull out of the Wikipedia entries for the two Bushes the most common words and phrases that were likely to show up in documents about each (e.g., "Berlin Wall" and "Barbara" vs. "September 11" and "Laura"). You would still run into some disambiguation problems ("Saddam Hussein," "Iraq," "Dick Cheney" would show up a lot for both), but this method is actually quite a powerful start to document categorization. [Dan Cohen - Digital Humanities Blog - The Wikipedia Story That's Being Missed]
This is an interesting example of how what is processable will be processed where it can add value - for somebody.
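
As a rough illustration of the approach Cohen describes, here is a minimal Python sketch. It makes several assumptions: the two reference strings are toy stand-ins rather than real Wikipedia article text, a simple word-set difference stands in for a proper term extraction service, and the function names (tokens, distinctive_terms, classify) are hypothetical.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens; a crude stand-in for a real term extractor."""
    return re.findall(r"[a-z0-9]+", text.lower())

def distinctive_terms(reference_texts):
    """For each label, keep the words found in its reference text and in no other."""
    vocab = {label: set(tokens(text)) for label, text in reference_texts.items()}
    result = {}
    for label, words in vocab.items():
        others = set().union(*(w for l, w in vocab.items() if l != label))
        result[label] = words - others
    return result

def classify(document, terms):
    """Score a document by how often it uses each label's distinctive words."""
    counts = Counter(tokens(document))
    scores = {label: sum(counts[w] for w in words) for label, words in terms.items()}
    return max(scores, key=scores.get), scores

# Toy reference texts (assumption: placeholders, not actual Wikipedia entries).
reference_texts = {
    "George H. W. Bush": "Bush Barbara Berlin Wall Iraq Saddam Hussein Dick Cheney",
    "George W. Bush": "Bush Laura September 11 Iraq Saddam Hussein Dick Cheney",
}

terms = distinctive_terms(reference_texts)
label, scores = classify(
    "Bush spoke about Iraq after the fall of the Berlin Wall, with Barbara beside him.",
    terms,
)
print(label, scores)
```

Shared words such as "Iraq" and "Cheney" drop out of both term sets, which mirrors the disambiguation problem noted in the quote: only the words unique to one entry carry any weight.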

Via if:book.

Comments: 2

Jan 03, 2006

A similar idea was floated a few months ago, to use links to Wikipedia as a source of metadata about the pages that contain them. If we all link to the appropriate Wikipedia page when discussing one or the other of the presidents Bush, our discussions will be unambiguous even to a suitably-attentive robot. In this way Wikipedia could provide an interim universal ontology while we wait for the semantic web.
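
A "suitably-attentive robot" of this kind might look something like the following Python sketch, which reads a page's outgoing links and treats any Wikipedia article it points to as a topic label. This is only an assumption about how such a robot could work; the class and function names are hypothetical, and the sample page is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote

class WikipediaLinkCollector(HTMLParser):
    """Collects the article titles of all Wikipedia links in an HTML page."""

    def __init__(self):
        super().__init__()
        self.topics = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        # Only links of the form https://<lang>.wikipedia.org/wiki/<Title> count.
        if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
            title = unquote(parsed.path[len("/wiki/"):]).replace("_", " ")
            self.topics.append(title)

def wikipedia_topics(html):
    """Return the Wikipedia article titles a page links to, in document order."""
    collector = WikipediaLinkCollector()
    collector.feed(html)
    return collector.topics

# Made-up sample page for illustration.
page = (
    '<p>As <a href="https://en.wikipedia.org/wiki/George_W._Bush">Bush</a> said, '
    'see <a href="https://example.org/x">this</a>.</p>'
)
print(wikipedia_topics(page))
```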

Jan 04, 2006
Mark Carden

This topic (and your comments on A catalog with service and Making data work harder) reminds me of a paper at ECDL in 2004 called Automated Indexing with Restricted Random Walks on Large Document Sets, in which Markus Franke and Andreas Geyer-Schulz of the University of Karlsruhe found a way to start to automatically index the 15 million documents in the Südwestdeutsche, where about 30% of the documents are not indexed and only a few of them are available as digital full text.

As I understand it, they used the co-usage histories of the documents to cluster them (with restricted random walks, whatever they are) and then indexed them based on the cluster information.

In effect, they were using "people who borrowed this also borrowed ..." to build the bibliographic index.
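
To make the "also borrowed" idea concrete, here is a minimal Python sketch. It is not the restricted random walk method from the paper; it swaps in plain co-occurrence counting as a much simpler stand-in. The borrowing histories, document IDs, keywords and function names are all made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_usage(histories):
    """Count how often each pair of documents appears in the same borrowing history."""
    counts = defaultdict(Counter)
    for borrowed in histories:
        for a, b in combinations(sorted(set(borrowed)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def also_borrowed(doc, counts, top=3):
    """Documents most often borrowed together with doc."""
    return [other for other, _ in counts[doc].most_common(top)]

def inherit_keywords(doc, counts, index, top=3):
    """Suggest index terms for an unindexed document from its most co-used neighbours."""
    suggested = Counter()
    for other, weight in counts[doc].most_common(top):
        for keyword in index.get(other, []):
            suggested[keyword] += weight
    return [keyword for keyword, _ in suggested.most_common()]

# Made-up data: each inner list is one borrower's history.
histories = [["A", "B", "C"], ["A", "B"], ["B", "D"], ["A", "C"]]
# Existing index terms; document "C" has none yet.
index = {"A": ["presidents"], "B": ["presidents", "history"]}

counts = co_usage(histories)
print(also_borrowed("C", counts))
print(inherit_keywords("C", counts, index))
```

Here document C has no index terms of its own, so it picks up suggestions from A and B, weighted by how often each was borrowed alongside it.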