Making data work harder

As more activities move into a network space, more areas of our lives are shedding data. This data is increasingly being mined for intelligence, which in turn drives services. And with data, quantity, as they say, has a quality all of its own.

A major attribute of both Google and Amazon is how they squeeze as much value as they can from the data they have, and the value of that activity increases with the volume of data: data about uses and users, as well as data about the things used. The more people use Amazon, the better its recommendations. The more of the web Google harvests, the better the associations it can make between words, which in turn improves its collocation of stories in Google News and its matching of ads to results. The more digital copies of books Amazon has, the better its forward and backward citation linking. The more articles Google Scholar indexes, the better it can rank by citation.

IBM has just acquired identity resolution company SRD, the better to relate names and identities across multiple data streams:

With this newly acquired technology, as users add more and more data sources, accuracy goes up, Wozniak said. "Once you have a database of resolved identities, it can find people across multiple layers of separation," he said. [IBM Buys Identity Company to Nail Down Who's Who]

The more bibliographic data OCLC has, the better it can associate the multiple manifestations of works, as it mines the relationships created by many catalogers. The better, also, it can provide useful intelligence about the 'flavor' of a collection, and how it compares with others. We have been doing more research work in this area recently, and are also preparing for new collection analysis services to appear later this year. We are also trying to make this data work better in the open web environment in Open WorldCat [ppt]: subjects and authors are now clickable, pulling in related results.

Historically, ISI has been notable in the way in which it has generated intelligence from data. And the work at the University of Southampton on eprints data is pioneering (see in particular CiteBase and OpCit).

However, for a community which invests so many intellectual, staff and financial resources in data creation and management, we do not get as much value from data as we should.

See, for example, Dorothea Salo's recent argument that although we have good data we don't use the structure in the data in our user interfaces. See also Roy Tennant's recent use of my phrase 'murky bucket' in his LJ column, where Roy asks how well our current bibliographic apparatus supports ongoing needs. My view, which Roy kindly notes, is that we need increasingly to think about how we want to use data programmatically -- to 'FRBRize', to do collection analysis, to generate interesting displays.
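
To make 'using data programmatically' a little more concrete, here is a minimal sketch of the kind of grouping that 'FRBRizing' implies: clustering records for different manifestations under a shared work. It assumes a deliberately crude matching key (normalized author plus main title). The record fields, sample data and function names are all hypothetical, and this is not a description of how OCLC or anyone else actually does it.

```python
import re
import unicodedata
from collections import Counter, defaultdict

# Hypothetical sample records; the field names are illustrative only,
# not a real bibliographic schema.
RECORDS = [
    {"id": "r1", "author": "Melville, Herman", "title": "Moby Dick", "format": "print"},
    {"id": "r2", "author": "Melville, Herman.", "title": "Moby-Dick; or, The Whale", "format": "ebook"},
    {"id": "r3", "author": "Melville, Herman", "title": "Moby Dick", "format": "audiobook"},
    {"id": "r4", "author": "Austen, Jane", "title": "Emma", "format": "print"},
]


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def work_key(record):
    """Assumed matching rule: normalized author plus main title (text before ; : or /)."""
    main_title = re.split(r"[;:/]", record["title"], maxsplit=1)[0]
    return (normalize(record["author"]), normalize(main_title))


def group_into_works(records):
    """Cluster manifestation-level records under a shared work key."""
    works = defaultdict(list)
    for record in records:
        works[work_key(record)].append(record)
    return dict(works)


def format_profile(records):
    """A very simple collection-analysis view: how many records per format."""
    return Counter(record["format"] for record in records)


if __name__ == "__main__":
    for key, members in group_into_works(RECORDS).items():
        formats = ", ".join(sorted(r["format"] for r in members))
        print(f"{key[1]} / {key[0]}: {len(members)} record(s) [{formats}]")
    print(dict(format_profile(RECORDS)))
```

Even a key this simple shows why volume matters: the more records there are, the more variant forms of author and title get seen and the more relationships there are to mine. A real service would need far richer matching, using uniform titles, authority data and so on.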

We do have rich data. It could be better. But more importantly, we need to make our data work harder to create value for our users.
