From time to time, we see a discussion about the relative merits of Google, or Amazon, and library catalogs as retrieval or search engines. There is one main difference that doesn't tend to get discussed much, and that has to do with the type of data that gets factored into the experience. Increasingly, we are seeing major web presences incorporate usage data and contributed data to extend the range of suggestions they can make to users.
In these pages I have suggested that we have four sources of metadata about things. Here is a short version:
- Professional: The curatorial professions have made major investments in knowledge organization, through the development and application of cataloging rules, controlled vocabularies, authorities, gazetteers, and so on.
- Contributed: A major phenomenon of recent years has been the emergence of many sites which invite, aggregate and mine data contributed by users, and mobilize that data to rank, recommend and relate resources. This data includes tags, reviews, ratings, further details, and so on.
- Programmatically promoted: We are handling more digital materials, where it is possible to programmatically identify and promote metadata from resources themselves or groups of resources. We will also do more to mine collections, including collections of metadata, to discern pattern and relations. We are increasingly familiar with clustering, entity identification, automatic classification and other approaches.
- Intentional: I have used this term to refer to the data that we are collecting about use and usage. Pagerank is based on aggregate linking choices. Amazon recommendations are based on aggregate purchase choices.
Now, typically library catalogs use traditional information retrieval techniques over professionally produced metadata. This is not a lot of data to play with! We have just begun to see interesting things being done with the other types of data as libraries explore the use of transactional data for recommendations and look to incorporate contributed data.
Google, Amazon and other sites license professionally produced metadata. But in different ways they also use the other types of data. So, for example, intentional data, in the form of linking patterns, is central to how Google ranks material. Google Books makes extensive use of programmatically promoted data: look, for example, at how it extracts place names and places them on a map. This is imperfect, but useful. Amazon makes extensive use of contributed metadata, in the form, for example, of ratings, tags or reviews. Amazon also makes interesting use of intentional data, as in its recommendation engine. It has an especially nice feature, where it shows you what people who looked at a page eventually bought. Amazon builds up a 'rich texture of suggestion' based on several of these types of data.
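To make the idea of intentional data a little more concrete, here is a minimal sketch of a "people who looked at this eventually bought" tally. It is not a description of how Amazon does it: the session format, the item names and the function `eventually_bought` are all invented for illustration, and a real service would work over far more data and far more signals.

```python
from collections import Counter, defaultdict

def eventually_bought(sessions):
    """Map each viewed item to a Counter of items bought in the same session.

    `sessions` is a hypothetical list of (viewed_items, purchased_items) pairs.
    """
    bought_after_view = defaultdict(Counter)
    for viewed, purchased in sessions:
        for item in set(viewed):
            # Count each purchase once per session for every item viewed in it.
            bought_after_view[item].update(set(purchased))
    return bought_after_view

# Invented browsing sessions: what was looked at, and what was bought.
sessions = [
    (["book_a", "book_b"], ["book_b"]),
    (["book_a"], ["book_c"]),
    (["book_a", "book_c"], ["book_c"]),
    (["book_b"], []),
]

suggestions = eventually_bought(sessions)
top_for_a = suggestions["book_a"].most_common(1)
```

The point is simply that aggregate behavior, with no cataloging at all, already yields a usable suggestion: here, most people who looked at `book_a` went on to buy `book_c`.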
Suggestion, or recommendation, is increasingly a part of our everyday web experience, and improving the quality of suggestion has become an important goal for many services. Clearly, there are commercial interests riding on this.
One interesting signal of this is the Netflix Prize, a million dollars offered by Netflix to the first entrant who can improve its recommendation engine by 10%.
A recent Wired article about the Prize talked to Jim Bennett, the VP for recommender systems. He talks about the impact of small changes in the RMSE (root mean squared error), a measure of the typical amount a prediction misses by:
How much difference could that possibly make? A lot, Bennett says. Netflix offers hundreds of millions of predictions a day, so a tiny reduction in the frequency of insultingly stupid movie suggestions means a lot fewer angry users. [This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize]
Over the last few years, the RMSE of Cinematch has steadily improved, as has Netflix's success at retaining customers from month to month. Bennett can't prove the two are related, but he's willing to bet on his belief that they are. He refuses to speculate on the dollar value of a 10 percent improvement to Cinematch, but he's certain it's substantially more than $1 million. [This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize]
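For readers who have not met RMSE before, a small worked example may help. The sketch below computes it for two sets of predictions and then the relative improvement between them. The ratings are made up, and I am assuming that the 10% target is read as a relative reduction in RMSE; none of this reflects Cinematch itself.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical star ratings (1 to 5) that five viewers actually gave.
actual = [4, 3, 5, 2, 1]
# Hypothetical predictions from a baseline engine and from an improved one.
baseline = [3.5, 3.5, 4.0, 3.0, 2.0]
improved = [3.8, 3.2, 4.5, 2.5, 1.5]

baseline_rmse = rmse(baseline, actual)
improved_rmse = rmse(improved, actual)

# Relative improvement: the fraction by which the typical miss has shrunk.
relative_gain = (baseline_rmse - improved_rmse) / baseline_rmse
```

Because the errors are squared before averaging, large misses count for more than small ones, which fits Bennett's concern with the occasional "insultingly stupid" suggestion.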
The article focuses on 'just a guy in a garage', Gavin Potter, and his work on taking human factors into account (for example, you might want to weight newer ratings more heavily than older ones).
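One simple way to picture weighting newer ratings more heavily is an average in which each rating's weight decays with its age. This is only an illustration of the general idea, not Potter's method: the exponential decay, the 180-day half-life and the function name `recency_weighted_mean` are all my assumptions.

```python
def recency_weighted_mean(ratings, half_life_days=180.0):
    """Average (rating, age_in_days) pairs, halving a rating's weight
    every `half_life_days` (an assumed, illustrative decay rule)."""
    if not ratings:
        raise ValueError("need at least one rating")
    weights = [0.5 ** (age / half_life_days) for _, age in ratings]
    total = sum(w * r for w, (r, _) in zip(weights, ratings))
    return total / sum(weights)

# Invented history: two old low ratings, two recent high ones.
history = [(2, 720), (2, 540), (5, 30), (4, 0)]

plain = sum(r for r, _ in history) / len(history)  # every rating counts equally
weighted = recency_weighted_mean(history)          # recent ratings dominate
```

With this history the weighted average sits well above the plain one, reflecting the viewer's more recent enthusiasm rather than their older verdicts.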
Potter echoes a theme of current discussions:
"The 20th century was about sorting out supply," Potter says. "The 21st is going to be about sorting out demand." The Internet makes everything available, but mere availability is meaningless if the products remain unknown to potential buyers. [This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize]
This is similar to 'consumption management' as discussed by Bjørn Olstad, for example.
He talks about a move from content management to consumption management. And taking the example of Yellow Pages talks about a move from a provider view (a few details in a listing, shallow understanding) to a consumer view (recommendations, maps, directions, comparisons: a broader experience leading to a deeper understanding). This is achieved through data mining and looking across a range of resources produced by divers hands. [Lorcan Dempsey's weblog: Search is more than search]
I was reminded of this discussion by an article by Michael Schrage in the current issue of Technology Review about the "low, seductive whisper of automated suggestion". He mentions the Netflix Prize and discusses the limitations and promise of recommender systems, focusing on Amazon and iTunes.
The focus of digital personalization has shifted from what I am interested in now to what I might be interested in next. All the choices I make in the moment are absorbed into a sphere of suggestion where, after they have been statistically weighted, they are reborn as offers and advice. [Technology Review: Recommendation Nation]
And he concludes with the following:
When I get good recommendations, I spend my time and money differently. Even better recommendations will dramatically increase the value of that time and money. That's a digital future I crave and expect. I hope Internet innovators take my recommendations as seriously as I take theirs. [Technology Review: Recommendation Nation]
I like this emphasis on the value of time. It calls to mind Ranganathan's injunction to save the time of the user. And although expressions like 'consumption management' or 'sorting out demand' may not be uppermost in our minds, it seems to me that they are very much in the spirit of the double injunction "every book its reader" and "every reader his or her book". As resources become more abundant and as time becomes scarcer we need better and better ways of making this match. And this includes finding better ways of aggregating and using intentional and contributed metadata alongside our professionally produced metadata.