One of the major questions for library systems is the role of metasearch or federation. I have written about this here (Metasearch: a boundary case) and here (Metasearch, Google and the rest).

The issue is that libraries have to manage a range of database resources whose legacy technical and business boundaries map poorly onto user preferences and behaviors. The approach has been to move away from presenting a fragmentary straggle of databases towards bundling them in various ways in a metasearch application, sometimes in one big search, sometimes in smaller course or subject bundles. The issues here are well known, not least that libraries typically have limited control over the performance of the target databases.

As an alternative, a few libraries have explored consolidating locally loaded data. This can work very well, as it becomes easier to build additional services over a consolidated resource. However, this is a rather too adventurous undertaking for most libraries. Another approach is for a third party to consolidate, and this is what we have seen with Google Scholar, Scopus, Worldcat, and others.

More recently, recognizing the advantages of local consolidation, we have seen the emergence of a new class of library system which pulls together metadata from locally managed stores (e.g. digital repository, ILS, institutional repository, ...) and offers an integrated search. This may still have to work closely with a metasearch engine to integrate access to external databases. ILS vendors are moving in this direction, and through Worldcat Local, OCLC is also addressing this type of integration.

This is a discussion worth returning to, but that is not my purpose here. Rather, I wanted to point to an interesting treatment of similar issues from a different domain. Mike Stonebraker, database guru and a writer on the group blog The Database Column, has a post in which he contrasts two models of data integration: ETL (extract, transform and load) and federation. The focus is on enterprise systems. The ETL model will typically involve a centralized data warehouse: "for each operational system, they will employ some sort of ETL process to transform data instances into the global schema and then load them into the centralized warehouse".

'Extract, transform and load' is a good characterization of what is involved in consolidation of library data, whether this is attempted locally or through third parties. One of the interesting questions is the sophistication of the 'transform'. Think of author names, for example, or subjects, or other controlled data, and what would be involved to effectively merge data created within different regimes. What is the impact, for search or for faceted display, of limited or no transformation of these elements?
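To make the 'transform' question concrete, here is a minimal sketch, in Python, of what even a crude merge of author names from two sources might involve. The record formats and the normalization rule are hypothetical illustrations, not any particular library system's approach; real authority control (authority files, dates, transliteration) is far more involved.

```python
def normalize_author(name: str) -> str:
    """Reduce an author name to a lowercase 'surname, given' key.

    Handles two hypothetical conventions: 'Surname, Given' and
    'Given Surname'. A real transform would consult authority data.
    """
    name = name.strip()
    if "," in name:
        surname, _, given = name.partition(",")
    else:
        # Assume 'Given Surname' order when no comma is present.
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return f"{surname.strip().lower()}, {given.strip().lower()}"


# Records as they might arrive from two different source databases,
# each created under a different cataloguing regime.
source_a = [{"author": "Dempsey, Lorcan", "title": "Metasearch notes"}]
source_b = [{"author": "Lorcan Dempsey", "title": "On federation"}]

# Transform + load: key the consolidated index on the normalized name,
# so a faceted display shows one author rather than two variants.
index = {}
for record in source_a + source_b:
    key = normalize_author(record["author"])
    index.setdefault(key, []).append(record["title"])

print(index)
# {'dempsey, lorcan': ['Metasearch notes', 'On federation']}
```

With no transform at all, the two spellings would surface as separate facet values, which is exactly the degraded search and display experience the question above points to.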

Here are the headings Stonebraker uses for his discussion.

  • Data element "heat": Hot data favors ETL
  • Indexing: Federation is harder to optimize
  • Resource management: Faster BI query responses for ETL shops
  • Complexity of the schema change: ETL approach performs less joins
  • Contention (concurrency control): Federation contention challenges
  • Timeliness: ETL approaches must deal with out-of-date data issues
  • Mapping: Federations can't handle some transformations

BI is short for 'business intelligence'; 'hot' data is data that is accessed often.

Now, while it is clear that our environment is similar in many ways to the one discussed here, it would be interesting to do a similar analysis with our domain in mind to see where the differences lie. Of course, one issue is that most of the data under discussion there seems to be within institutional control.

Here is his conclusion:

In summary, virtually all enterprises use the ETL approach for data integration. The data federation market is, in contrast, quite small. The place where I see federations as most viable is when there are many, many data sources (e.g., more than 5,000 sources) and BI users utilize only a small number of them at any given time. In this extreme case, the average data element is accessed zero times before it is updated or deleted. In this instance, one is better off leaving the data where it originates. On the other -- more common -- hand, when most data elements get used several times, the ETL approach will continue to be preferred. [To ETL or federate ... that is the question - The Database Column]

Comments: 3

Jan 14, 2008
Stan Kosecki

Lorcan: You write "As an alternative, a few libraries have explored consolidating locally loaded data. This can work very well, as it becomes easier to build additional services over a consolidated resource." Care to give a few examples of well-implemented installations and, more specifically, what additional services they deploy? In a way, this strategy reminds me of the best aspects of "the big iron" era. --Stan

Jan 18, 2008
Lorcan Dempsey

Stan, I had in mind Los Alamos National Laboratory and CISTI.

If you have data inhouse you can benefit from some of the performance advantages outlined in the ETL post above. You can also more readily do things like collaborative filtering, preprocessing of relationships in the data, and awareness services.

However, the two examples I give are not typical libraries. They have considerable development capacity.

I am aware of experiments at Bielefeld but I have not seen the actual outputs.

Jan 22, 2008
Jonathan Rochkind

Thought I left a comment last week, but apparently not!

Some other library examples of locally loading cross-db data can be found in my article in Library Journal last year. I'm not going to add the url, because maybe that's what got me caught by the spam filter last time, but google for 'rochkind library journal'.

While I agree that locally loading is probably the way to go where possible (as I argue in that article), it's not clear to me how we fit into Stonebraker's analysis for licensed content search applications: "(e.g., more than 5,000 sources)" -- no, we're probably not going to have more than that, but each source may include more items than he is used to. And the items for the most part _never_ change, but they very well may never be hit in between loads.