Metasearch, Google and the rest

How quickly things can change! Last year there were discussions about the Google-busting potential of metasearch. How naive. This year there are discussions about the metasearch-busting potential of Google Scholar. Let us wait and see.

Clearly there are various issues with metasearch: the variety of data and interfaces that has to be managed means that it will always be a difficult process. It is also difficult to build out services on top of a federated resource. (I write briefly about 'portals' here, and about library search here.)

But to think about the question in terms of metasearch and Google obscures a potentially more interesting longer term question. This is a question about consolidation: at what level does it make most sense for resources to be aggregated for more effective use?

Think of two poles: the fractured resource available to a library user, and Google.

Libraries struggle because they manage a resource which is fragmented and 'off-web'. It is fragmented by user interface, by title, by subject division, by vocabulary. It is a resource very much organised by publisher interest, rather than by user need, and the user may struggle to know which databases are of potential value. By off-web, I mean that a resource hides its content behind its user interface and is not available to open web approaches. Increasingly, to be on-web is to be available in Google or other open web approaches.

These factors mean that library resources exercise a weak gravitational pull. They impose high transaction costs on a potential user. They also make it difficult to build services out on top of an integrated resource, to make it more interesting to users than a collection of databases.

A couple of recent examples emphasised for me the issues that fragmentation raises. First, see the following statement in the KB article I mention below:

It is recommended to index all metadata in a single index, and use as few different databases as possible for storage. There are hardly any databases or collections for which the use of a specific database package is justified. When there is a choice between indexing distributed databases in a central index or performing federated searching in distributed databases, it is best to choose the central indexing. There are several reasons for this, but it should be sufficient to compare Google as a central index with a theoretical Google that would distribute every user search to all websites all over the world. A combination with federated searching remains needed for databases that do not allow harvesting into a central index or for focussing a search into a specific area. [Renewing the Information Infrastructure of the Koninklijke Bibliotheek]
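The KB's comparison can be illustrated with a toy sketch. This is purely illustrative (the databases, records and function names are all hypothetical, and the real federated case involves a network round trip per remote source), but it shows the structural difference: a central index answers a query with one local lookup, while federated search must broadcast the query to every source and merge whatever comes back.

```python
# Toy illustration of central indexing vs federated search.
# Hypothetical in-memory "databases"; in reality each federated
# request is a remote call that may be slow, or fail, or time out.

DATABASES = {
    "db_a": ["metadata record 1", "metadata record 2"],
    "db_b": ["metadata record 3"],
}

# Central indexing: harvest everything once into a single index.
CENTRAL_INDEX = [rec for records in DATABASES.values() for rec in records]

def central_search(term):
    """One local lookup over the pre-built central index."""
    return [rec for rec in CENTRAL_INDEX if term in rec]

def federated_search(term):
    """Broadcast the query to every source and merge the results."""
    results = []
    for name, records in DATABASES.items():
        # Each iteration stands in for a separate remote request.
        results.extend(rec for rec in records if term in rec)
    return results
```

Both return the same records here, but the central search does no per-source work at query time; the harvesting cost was paid once, up front. That is essentially the KB's argument, with the caveat it notes: sources that cannot be harvested still need the federated path.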

Second, I recently visited the Research Library at Los Alamos National Laboratory, where they have a tradition of locally loading data where possible. [pdf - scroll down to page 6.] This is partly because of some of the particularities of their environment, but also because it is possible to build services out on top of this consolidated resource much more readily than on top of a federated resource. And the LANL Research Library has indeed created a very impressive set of recommender and other personalised services for their users, much richer in fact than most other libraries. They add significant value to the underlying collection of data, in large part because they have the data inhouse in a consolidated form.

The other pole is the centralized index of Google with an array of much discussed advantages, and a stated aim of consolidating all interesting data.

So, metasearch is one response to fragmentation, albeit one with limited effectiveness. Another approach is to consolidate data resources into larger reservoirs. This has the advantage of reducing the burden of integration, and enhancing the ability to create value-added services. But how and at what level could this be done? What are the sensible and possible consolidations in between the universal Google and the current debilitating fragmentation?

We have some existing consolidations: WorldCat for library materials, books especially; CrossRef for journal articles; ArtStor aspires to provide the benefits of consolidation for art images. I expect that over the next while we will see some more.


Mar 21, 2005
Roy Tennant

Sure, it's virtually a principle that one should centralize searching whenever possible. But that ignores the simultaneous need to segregate or "slice" when appropriate. For example, there are times when searching everything simply doesn't make sense, and will be a disservice to the user. To use WorldCat as an example, as a user I may wish to see only those items I have a chance of getting my hands on. It would do me no good to see things I can't get, and therefore I want to search a slice of WorldCat, based on my particular needs. As far as I can tell (and they aren't talking) Google has no plans to offer any criteria for slicing their huge pot of stuff. To me, that is one of the central arguments against Google as a metasearching tool -- finding good stuff is as much what you don't search as it is what you do.

Mar 21, 2005
Paul Miller

Roy has a good point. I do remember discussion some years ago, though, about the extent to which it was wise to only show users the slice of content (to use Roy's term) to which they are allowed access.

It was argued that you do the user a disservice, as you (might) mislead them as to the breadth of knowledge available. How will the user know to badger (does that term translate?) their library to subscribe to Journal X if they don't know that it exists, or that it has content they might want?

From the perspective of the publisher, how do they ever get libraries to take new journals if the libraries aren't being asked for them by their patrons?

As a user, I can envisage situations in which I only want that which I am entitled to see. I can also think of situations in which I want a feel for the information that I'm not allowed to see. Most of the time, I probably want the stuff I can get immediately, either through my library or from somewhere else. A document I'm entitled to, but which will come through the mail as an inter-library loan is, all too often, too late.

I guess, at the end of the day, I want systems with the flexibility to give me different views!

Mar 22, 2005
Boon Low

Agree with Paul. Different views of repositories along the spectrum between the two poles are the key here. There is a need to sample these views along with their determining factors. For example, nearer the centralised-index pole, users tend to prefer a quick search on a central index, preferably one that relates to their own domain, and then to slice the content further when the result set becomes large. The Edinburgh University Library has piloted a portal allowing users to search a central index and then disaggregate the results by resource type and year. The disaggregation is necessary because the results are likely to contain many more matching articles than, say, books and journal titles, which are buried in the sea of results.

While Google has marched on with central indexing, the library is well positioned to provide the slicing requirements because of its well-established cataloguing practice (hierarchical subject tagging) and user profiling. In my opinion, a portal such as the one mentioned above could incorporate recommendation features via item filtering similar to that used by Amazon ("If you like this, you will like that"), or find related items along the subject topic tree. And how about providing more general ("traverse up the topic tree") and more specialised ("traverse down the subject tree branches") results? The information to facilitate such slicing features already exists in most library catalogues.
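The subject-tree traversal described above can be sketched in a few lines. This is a minimal sketch with an invented toy tree and hypothetical names, not a real catalogue schema: broadening a search walks up to a parent subject, narrowing walks down to child subjects, and slicing just filters the central index by subject tag.

```python
# Illustrative sketch of slicing a central index by a subject hierarchy.
# The subject tree and records below are hypothetical toy data.

# A tiny subject tree, stored as child -> parent.
PARENT = {
    "Databases": "Computer Science",
    "Information Retrieval": "Computer Science",
    "Computer Science": "Science",
}

def broaden(subject):
    """Traverse up the topic tree: return the parent subject, if any."""
    return PARENT.get(subject)

def narrow(subject):
    """Traverse down the tree: return the direct child subjects."""
    return [child for child, parent in PARENT.items() if parent == subject]

# The central index is just a list of subject-tagged records.
INDEX = [
    {"title": "Readings in IR", "subject": "Information Retrieval"},
    {"title": "Transaction Processing", "subject": "Databases"},
    {"title": "The Art of Science", "subject": "Science"},
]

def slice_by_subject(index, subject):
    """Slice the central index: keep only records with the given tag."""
    return [rec for rec in index if rec["subject"] == subject]
```

For example, `narrow("Computer Science")` yields the more specialised subjects to offer as refinements, while `broaden("Databases")` yields the more general one. The point is that the data needed for this already sits in catalogue subject headings.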

In addition to the "spatial" attribute of the index, another factor may stem from temporal dynamicity. Would it be possible to create an index in a quasi-dynamic manner? For example, providing a grid-like infrastructure that would easily harvest/destroy a cached index based on group or individual (!) profiles and host it for a fixed duration (say, per course or project) in a data warehouse?

Mar 22, 2005
Brad Spry

The costs associated with "local load" would be very high.

Some costs off the top of my head:

1. Publisher -- They'll charge you an arm, a leg, and your firstborn at the suggestion of such a deal.

2. Technology -- LANL has 30 terabytes stored. Servers, storage, applications, networking, etc. Seven-plus figures...

3. People -- This isn't an off-the-shelf, turnkey system. You'll need innovative programmers, analysts, and database admins to even think about such a project.


I see great advantages of local load, but most individual libraries couldn't even scratch the surface of such a project.

Entire university systems, however, would be a way to approach it. Our system has 16 campuses... it's high time we combined our resources to solve shared problems.

Mar 22, 2005
Blake

To quote Roy: "finding good stuff is as much what you don't search as it is what you do."

Good point, and it's also important to remember we need to give our users that option, not force them into something we think is best. I don't think you're saying that, but it comes to mind as something very "librarian".

Good stuff to be exploring regardless, our users are no longer happy with 100 different database interfaces.

Mar 22, 2005
Larry Campbell

Roy says: "As far as I can tell (and they aren't talking) Google has no plans to offer any criteria for slicing their huge pot of stuff." My bet would be that Google will have such plans the minute it becomes apparent that users want or need such functionality -- it wouldn't be difficult for them to do, and they have an admirable track record of quick and innovative responses.


On the other hand, I wonder how such centralized, monolithic repositories will fare in an increasingly dense and complex information ecology? Wouldn't you expect to find (to mix metaphors) some specialized search boutiques along with the big boxes, as well as all manner of sizes and services in between? So then the trick, as always (for libraries and others), would be to find the right niche.

Mar 26, 2005
Robert Shaw

The need to "slice" the data is not an argument against centralization. Searches do this all the time. This is just a matter of how one can put additional restrictions on a search, which can be explicit or, in a specialized interface, implicit. Centralization allows one to do both wide and narrow searches (efficiently and effectively).


I'm afraid most of the comments seem like feeble excuses as to why we can't or shouldn't offer centralized search to users. Meanwhile, Google is off implementing more and more of it.

Mar 29, 2005
Debbie

To comment on both Lorcan's last paragraph ("We have some existing consolidations: WorldCat for library materials, books especially; CrossRef for journal articles; ArtStor aspires to provide the benefits of consolidation for art images. I expect that over the next while we will see some more.") and the last comment: I too support consolidations.

Interestingly, we have Australian equivalents for the centralised services Lorcan mentions: Kinetica (LibrariesAustralia) for library materials; PictureAustralia for art images and photographs, and the ARROW Discovery Service for journal articles amongst other research outputs (still being built).

So I would argue for centralisation, but with useful segregation on a national level. I guess Google does that too, with its URLs focusing on individual countries' stuff, but it is served up with content from everywhere.

While there seems to be a trend here to type/format divisions in services as well (partly due to historical library practice and partly due to the scale of each service, at least in Oz), and while we also look at cross-populating these services in various ways (including search engines) to encourage discovery, highlighting the national identity works too.

Apr 02, 2005
Ben Toth

In the UK National Health Service we have developed a federated bibliographic search at http://www.library.nhs.uk . We are also using Google Search Appliance (covering different material) at http://www.nhs.uk . One of the interesting features of GSA is its ability to create 'collections'. A specialist evidence based medicine search collection is being developed for use later this year - we want to use whatever works best for our users.
