Everything is data

An interesting perspective on Wikipedia comes from Dan Cohen, who notes that the current discussion about authority does not engage with its full potential significance. He describes how a large, openly available knowledge base like Wikipedia is a valuable resource for emerging data-mining and search technologies.

Let me provide a brief example that I hope will show the value of having such a free resource when you are trying to scan, sort, and mine enormous corpora of text. Let’s say you have a billion unstructured, untagged, unsorted documents related to the American presidency in the last twenty years. How would you differentiate between documents that were about George H. W. Bush (Sr.) and George W. Bush (Jr.)? This is a tough information retrieval problem because both presidents are often referred to as just “George Bush” or “Bush.” Using data-mining algorithms such as Yahoo’s remarkable Term Extraction service, you could pull out of the Wikipedia entries for the two Bushes the most common words and phrases that were likely to show up in documents about each (e.g., “Berlin Wall” and “Barbara” vs. “September 11” and “Laura”). You would still run into some disambiguation problems (“Saddam Hussein,” “Iraq,” “Dick Cheney” would show up a lot for both), but this method is actually quite a powerful start to document categorization. [Dan Cohen – Digital Humanities Blog – The Wikipedia Story That’s Being Missed]

This is an interesting example of how what is processable will be processed where it can add value – for somebody.
Via if:book.
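
To make the mechanics a little more concrete, here is a minimal sketch of the kind of categorization Cohen describes. A simple term-frequency comparison stands in for Yahoo's Term Extraction service, and the filenames, sample sentence, and frequency threshold are placeholders rather than anything from Cohen's piece:

    from collections import Counter
    import re

    def term_counts(text):
        # Crude tokeniser and counter; a stand-in for a term extraction service.
        return Counter(re.findall(r"[a-z']+", text.lower()))

    def distinctive_terms(entry_a, entry_b, top_n=50):
        # Terms markedly more frequent in one Wikipedia entry than in the other.
        a, b = term_counts(entry_a), term_counts(entry_b)
        mostly_a = {w: c for w, c in a.items() if c > 2 * b.get(w, 0)}
        mostly_b = {w: c for w, c in b.items() if c > 2 * a.get(w, 0)}
        top = lambda d: set(sorted(d, key=d.get, reverse=True)[:top_n])
        return top(mostly_a), top(mostly_b)

    def which_bush(document, senior_terms, junior_terms):
        # Label a document by which president's distinctive vocabulary it overlaps most.
        words = set(re.findall(r"[a-z']+", document.lower()))
        senior, junior = len(words & senior_terms), len(words & junior_terms)
        if senior == junior:
            return "ambiguous"
        return "George H. W. Bush" if senior > junior else "George W. Bush"

    # Hypothetical usage: the two entries would be fetched from Wikipedia beforehand.
    senior_terms, junior_terms = distinctive_terms(
        open("george_h_w_bush.txt").read(), open("george_w_bush.txt").read())
    print(which_bush("Barbara joined him as the Berlin Wall fell ...",
                     senior_terms, junior_terms))

Documents that mention only the shared vocabulary (“Saddam Hussein,” “Iraq,” “Dick Cheney”) would still come back as ambiguous, which is the residual disambiguation problem Cohen points out.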

2 thoughts on “Everything is data”

  1. A similar idea was floated a few months ago: use links to Wikipedia as a source of metadata about the pages that contain them. If we all link to the appropriate Wikipedia page when discussing one or the other of the presidents Bush, our discussions will be unambiguous even to a suitably attentive robot. In this way Wikipedia could provide an interim universal ontology while we wait for the semantic web.

  2. This topic (and your comments on A catalog with service and Making data work harder) reminds me of a paper at ECDL in 2004 called Automated Indexing with Restricted Random Walks on Large Document Sets, in which Markus Franke and Andreas Geyer-Schulz of the University of Karlsruhe found a way to start automatically indexing the 15 million documents in the Südwestdeutscher Bibliotheksverbund, where about 30% of the documents are not indexed and only a few of them are available as digital full text.

    As I understand it, they used the co-usage histories of the documents to cluster them (with restricted random walks, whatever they are) and then indexed them based on the cluster information.

    In effect, they were using “people who borrowed this also borrowed …” to build the bibliographic index.
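
The first comment's suggestion is easy to picture in code: treat the Wikipedia URL a page links to as an unambiguous identifier for its subject. A minimal sketch, with an illustrative (not exhaustive) URL-to-entity table and a made-up snippet of HTML:

    import re

    # Illustrative URL-to-entity table; the Wikipedia URL acts as the identifier.
    ENTITIES = {
        "https://en.wikipedia.org/wiki/George_H._W._Bush": "George H. W. Bush",
        "https://en.wikipedia.org/wiki/George_W._Bush": "George W. Bush",
    }

    def entities_linked(html):
        # Which entities does a page mention, judged only by the Wikipedia pages it links to?
        links = re.findall(r'href="([^"]+)"', html)
        return {ENTITIES[url] for url in links if url in ENTITIES}

    post = '<p>As <a href="https://en.wikipedia.org/wiki/George_W._Bush">Bush</a> argued ...</p>'
    print(entities_linked(post))  # {'George W. Bush'}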
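
The second comment's co-usage idea can also be sketched, though only loosely: plain connected components of a “borrowed together” graph stand in here for Franke and Geyer-Schulz's restricted random walks, and the loan histories and subject headings are invented:

    from collections import defaultdict

    # Made-up co-usage data: for each borrower, the set of documents they used.
    loans = [{"doc1", "doc2", "doc3"}, {"doc2", "doc3"}, {"doc4", "doc5"}]

    # Subject headings for the minority of documents that are already indexed.
    indexed = {"doc1": {"economics"}, "doc4": {"chemistry"}}

    # Link documents that were borrowed together.
    graph = defaultdict(set)
    for basket in loans:
        for doc in basket:
            graph[doc] |= basket - {doc}

    def cluster(start):
        # Connected component of the co-usage graph (a stand-in for the
        # restricted-random-walk clustering used in the paper).
        seen, stack = set(), [start]
        while stack:
            doc = stack.pop()
            if doc not in seen:
                seen.add(doc)
                stack.extend(graph[doc] - seen)
        return seen

    # Propagate index terms from indexed documents to the rest of their cluster.
    suggested = {}
    for doc in list(graph):
        if doc not in indexed:
            suggested[doc] = set().union(*(indexed.get(d, set()) for d in cluster(doc)))

    print(suggested)  # doc2 and doc3 pick up 'economics', doc5 picks up 'chemistry'

Index terms flow from the minority of indexed documents to the unindexed members of the same cluster, which is the “people who borrowed this also borrowed …” effect the commenter describes.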
