May 10, 2008
•
Categories:
Search
, Websites: design and role
Libraries have major challenges in developing their websites. Think just of the information resources they provide access to. There are locally managed resources: a catalog, a repository or two, informational pages, and so on. And there are many remote resources: licensed databases, links to web pages, and so on. And there are pages which try to pull these together: resources organized by subject or department, for example.
These resources may be different in scope (reference, discovery, full-text or other content, ...), in type of content, in terms and conditions, in specialization, and so on.
Abstracting all of this up to the single search box, or small number of search boxes, that is often presented as a goal is not straightforward. Indeed, it is still common to see several searches or entry points offered: the catalog, metasearch, a list of databases, a search of the library website, and so on.
In this context I was interested to see Suzanne Chapman's "search box round-up".
She does a nice job of commenting on several approaches, and has a companion Flickr set of search box pics.
Incidentally, over time I reckon that 'single search' alternatives to 'metasearch' for general article access will emerge. By this I mean that services will consolidate article level metadata to facilitate access. This is not to say that there will not be target markets where niche databases continue to exist, rather that alternative solutions for general article searching seem inevitable. And of course, we are also seeing integrated search solutions for local resources emerge, Primo for example. In this way, the multiple resource challenge may get simpler, but will continue to exist in some form.
May 08, 2008
•
Categories:
General - systems and technologies
, Marketing
, Search
I was very interested to read this brief piece about the 'new discipline' of 'computational advertising':
Web advertising is the primary driving force behind many Web activities, including Internet search as well as publishing of online content by third-party providers. A new discipline - Computational Advertising - has recently emerged, which studies the process of advertising on the Internet from a variety of angles. A successful advertising campaign should be relevant to the immediate user's information need as well as more generally to user's background and personalized interest profile, be economically worthwhile to the advertiser and the intermediaries (e.g., the search engine), as well as be aesthetically pleasant and not detrimental to user experience. [ACL-08: HLT - Tutorials]
This is from the notice about a tutorial session at ACL-08: HLT which is taking place in Columbus in June. The conference combines the Annual Meeting of the Association for Computational Linguistics (ACL) with the Human Language Technology Conference (HLT) of the North American Chapter of the ACL.
Given the nature of the conference, the tutorial has a particular focus:
In this tutorial, we focus on one important aspect of online advertising, namely, contextual relevance. It is essential to emphasize that in most cases the context of user actions is defined by a body of text, hence the ad matching problem lends itself to many NLP methods. At first approximation, the process of obtaining relevant ads can be reduced to conventional information retrieval, where one constructs a query that describes the user's context, and then executes this query against a large inverted index of ads. We show how to augment the standard information retrieval approach using query expansion and text classification techniques. We demonstrate how to employ a relevance feedback assumption and use Web search results retrieved by the query. This step allows one to use the Web as a repository of relevant query-specific knowledge. We also go beyond the conventional bag of words indexing, and construct additional features using a large external taxonomy and a lexicon of named entities obtained by analyzing the entire Web as a corpus. Computational advertising poses numerous challenges and open research problems in text summarization, natural language generation, named entity extraction, computer-human interaction, and others. The last part of the tutorial will be devoted to recent research results as well as open problems, such as automatically classifying cases when no ads should be shown, handling geographic names, context modeling for vertical portals, and using natural language generation to automatically create advertising campaigns. [ACL-08: HLT - Tutorials]
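As a rough illustration of the "first approximation" described above (build a query from the user's context, then run it against an inverted index of ads), here is a minimal sketch. The ads, the function names, and the scoring by count of shared terms are all assumptions made for illustration; the tutorial's own methods (query expansion, classification, taxonomy features) are not shown, and there is no stopword handling.

```python
from collections import Counter, defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(ads):
    """Build an inverted index: each term maps to the ids of ads containing it."""
    index = defaultdict(set)
    for ad_id, text in ads.items():
        for term in tokenize(text):
            index[term].add(ad_id)
    return index


def match_ads(page_text, index, top_n=3):
    """Rank ads by the number of distinct terms they share with the page."""
    scores = Counter()
    for term in set(tokenize(page_text)):
        for ad_id in index.get(term, ()):
            scores[ad_id] += 1
    return [ad_id for ad_id, _ in scores.most_common(top_n)]


# Hypothetical ads, invented for this example.
ADS = {
    "ad-kayak": "Kayak rentals and guided sea kayak tours",
    "ad-books": "Used and rare books bought and sold",
    "ad-hotel": "Island hotel rooms with sea views",
}

page = "We rented a sea kayak and joined one of the guided tours"
best = match_ads(page, build_index(ADS), top_n=1)
```

Here the page text plays the role of the query. A real system would weight terms rather than count them, which is where the information retrieval techniques mentioned in the tutorial notice would come in.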
Via Michael White.
March 24, 2008
•
Categories:
Books, movies and reading ...
, OCLC
, Search
Well, another very fine issue of the Code4Lib Journal has appeared.
Jody L DeRidder has an interesting piece describing how they used browsable link pages (by subject, name, ...) and sitemaps to improve the visibility of a particular resource to search engines. The discussion gets into some of the issues of trying to get crawled, indexed, and then ranked: decision criteria may be applied by the search engines at each of these steps. Tony Boston, then with the National Library of Australia, published an article a while ago on the library's experiences in exposing materials to search engines, which includes some pointers based on lessons learned. I hope that we see more of these types of articles, as SSEO (social/search engine optimization) is a topic of growing importance for libraries. Here is a sentence from Jody's conclusion:
Within three months of completion of this project, over 4 times as many hits and over 5 times as many users were recorded in a month as had ever been previously measured. [The Code4Lib Journal - Googlizing a Digital Library]
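For readers who have not seen one, a sitemap is a small XML file listing the URLs a site would like crawled. The following is a minimal sketch of generating one in the sitemaps.org format; the item URLs and the function name are hypothetical, and this is not a reconstruction of the scripts described in the article.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML as a string, with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical item URLs for a digital collection.
ITEM_URLS = [
    "https://example.org/collection/item/1",
    "https://example.org/collection/item/2",
]
sitemap_xml = build_sitemap(ITEM_URLS)
```

The browsable link pages serve a similar purpose from the other direction: they give a crawler a path of ordinary links to every item.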
Update: See also the following article in the current issue of D-Lib Magazine. Clearly articles on digital library search engine optimization are like buses. None comes for ages, and then several come together.
Site Design Impact on Robots: An Examination of Search Engine Crawler Behavior at Deep and Wide Websites Joan A. Smith and Michael L. Nelson, Old Dominion University doi:10.1045/march2008-smith [D-Lib Magazine (March/April 2008)]
And I should mention that my colleagues have an article on metadata crosswalking in this issue of Code4Lib also:
This paper discusses an approach and set of tools for translating bibliographic metadata from one format to another. A computational model is proposed to formalize the notion of a ‘crosswalk’. The translation process separates semantics from syntax, and specifies a crosswalk as machine executable translation files which are focused on assertions of element equivalence and are closely associated with the underlying intellectual analysis of metadata translation. A data model developed by the authors called Morfrom serves as an internal generic metadata format. Translation logic is written in an XML scripting language designed by the authors called the Semantic Equivalence Expression Language (Seel). These techniques have been built into an OCLC software toolkit to manage large and diverse collections of metadata records, called the Crosswalk Web Service. [The Code4Lib Journal]
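To make the idea of a crosswalk as "assertions of element equivalence" a little more concrete, here is a toy sketch. It is not Seel or Morfrom, whose syntax is not described here; the mapping table, the flattened record structure, and the function name are simplifying assumptions for illustration only.

```python
# Element equivalences: source field -> target element.
# A deliberately small, simplified table (no indicators or subfields).
MARC_TO_DC = {
    "100": "creator",
    "245": "title",
    "650": "subject",
}


def crosswalk(record, equivalences):
    """Translate a {field: [values]} record using a table of equivalences.

    Fields with no stated equivalence are dropped rather than guessed at.
    """
    translated = {}
    for field, values in record.items():
        target = equivalences.get(field)
        if target is None:
            continue
        translated.setdefault(target, []).extend(values)
    return translated


# Hypothetical, heavily simplified source record.
record = {
    "100": ["Doe, Jane"],
    "245": ["An example title"],
    "650": ["Architecture", "Public spaces"],
    "999": ["local note"],
}
dc = crosswalk(record, MARC_TO_DC)
```

Keeping the equivalences in a table, separate from the code that applies them, is one simple way of separating the intellectual analysis from the mechanics of translation.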
March 04, 2008
•
Categories:
Search
, User experience
, ebooks and other e-resources
Somebody I was talking to recently mentioned that they liked the way Microsoft implemented book search. In particular they mentioned the visual presentation of where in a book matched search terms occurred.
I had a look. Here is a screen capture of the first result in a search done this afternoon for 'Ireland and globalization'.
It is indeed quite nice. Another example of glanceability: a measure of how quickly and easily a visual design communicates useful information.
January 10, 2008
•
Categories:
General - systems and technologies
, Libraries - systems and technologies
, Libraries - distributed environments
, Search
One of the major questions for library systems is the role of metasearch or federation. I have written about this here (Metasearch: a boundary case) and here (Metasearch, Google and the rest).
The issue is that libraries have to manage a range of database resources whose legacy technical and business boundaries do not map very well onto user preferences or behaviors. The approach has been to try to move away from presenting a fragmentary straggle of databases to bundling them in various ways in a metasearch application, sometimes in one big search, sometimes in smaller course or subject bundles. The issues here are well-known, not least of which is that libraries typically have limited control over the performance of the target databases.
As an alternative, a few libraries have explored consolidating locally loaded data. This can work very well, as it becomes easier to build additional services over a consolidated resource. However, this is a rather too adventurous undertaking for most libraries. Another approach is for a third party to consolidate, and this is what we have seen with Google Scholar, Scopus, Worldcat, and others.
More recently, recognizing the advantages of local consolidation, we have seen the emergence of a new class of library system which pulls together metadata from locally managed stores (e.g. digital repository, ILS, institutional repository, ...) and offers an integrated search. This may still have to work closely with a metasearch engine to integrate access to external databases. ILS vendors are moving in this direction, and through Worldcat Local, OCLC is also addressing this type of integration.
This is a discussion worth returning to, but that is not my purpose here. Rather I wanted to point to an interesting treatment of similar issues from a different domain. Mike Stonebraker, database guru and writer in the group blog, The Database Column, has a post where he contrasts two models of data integration: ETL (extract, transform and load) and federation. The focus is on enterprise systems. The ETL model will typically involve a centralized data warehouse and "for each operational system, they will employ some sort of ETL process to transform data instances into the global schema and then load them into the centralized warehouse".
'Extract, transform and load' is a good characterization of what is involved in consolidation of library data, whether this is attempted locally or through third parties. One of the interesting questions is the sophistication of the 'transform'. Think of author names, for example, or subjects, or other controlled data, and what would be involved to effectively merge data created within different regimes. What is the impact, for search or for faceted display, of limited or no transformation of these elements?
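A small sketch may help show why the 'transform' matters for faceted display. The records, the two source names, and the deliberately naive name normalization below are all assumptions for illustration; real authority control is much harder than reordering a name.

```python
from collections import Counter


def normalize_author(name):
    """Transform step: rewrite 'First Last' as 'Last, First'.

    Names that are already inverted only have their spacing tidied.
    """
    name = " ".join(name.split())
    if "," in name:
        last, _, first = name.partition(",")
        return f"{last.strip()}, {first.strip()}"
    parts = name.split(" ")
    if len(parts) == 1:
        return name
    return f"{parts[-1]}, {' '.join(parts[:-1])}"


def consolidate(sources, transform=normalize_author):
    """Extract records from every source, transform authors, load one store."""
    store = []
    for source_name, records in sources.items():
        for rec in records:
            store.append({
                "source": source_name,
                "title": rec["title"],
                "author": transform(rec["author"]),
            })
    return store


def author_facet(store):
    """Count records per author heading, as a faceted display would."""
    return Counter(rec["author"] for rec in store)


# Hypothetical records created under two different naming regimes.
SOURCES = {
    "catalog": [{"title": "Report A", "author": "Doe, Jane"}],
    "repository": [{"title": "Report B", "author": "Jane  Doe"}],
}

merged = author_facet(consolidate(SOURCES))
unmerged = author_facet(consolidate(SOURCES, transform=lambda n: n))
```

With the transform, the facet shows one author with two records. Without it, the same person appears under two headings, which is the kind of impact on search and faceted display that limited or no transformation would have.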
Here are the headings Stonebraker uses for his discussion.
- Data element "heat": Hot data favors ETL
- Indexing: Federation is harder to optimize
- Resource management: Faster BI query responses for ETL shops
- Complexity of the schema change: ETL approach performs less joins
- Contention (concurrency control): Federation contention challenges
- Timeliness: ETL approaches must deal with out-of-date data issues
- Mapping: Federations can't handle some transformations
BI is short for 'business intelligence'. 'Hot' data is data that is accessed often.
Now, while it is clear that our environment is in many ways similar to the one discussed here, it would be interesting to do a similar analysis with our domain in mind to see where there are differences. Of course, one issue is that most of the data under discussion here seems to be within institutional control.
Here is his conclusion:
In summary, virtually all enterprises use the ETL approach for data integration. The data federation market is, in contrast, quite small. The place where I see federations as most viable is when there are many, many data sources (e.g., more than 5,000 sources) and BI users utilize only a small number of them at any given time. In this extreme case, the average data element is accessed zero times before it is updated or deleted. In this instance, one is better off leaving the data where it originates. On the other -- more common -- hand, when most data elements get used several times, the ETL approach will continue to be preferred. [To ETL or federate ... that is the question - The Database Column]
January 03, 2008
•
Categories:
Books, movies and reading ...
, Digital asset management
, General - systems and technologies
, Identity management, IPR and e-commerce
, Search
Google Book Search: Document Understanding on a Massive Scale [PDF] is a brief treatment of issues faced by Google as they grow their corpus of digitized books and work to make it useful in various ways.
Luc Vincent of Google discusses OCR (issues of many languages occurring unpredictably in variously formatted volumes, at scale), and then focuses on issues of document understanding.
In addition to OCR, making these books easily accessible and useful on http://books.google.com has required developing a number of additional state-of-the-art systems. These include systems for automatically deskewing, cropping and cleaning-up scanned book pages, which is critical as pre-processing prior to OCR, but also to generate clean and small images for efficient web serving. While this may be a well understood problem for high-quality documents, doing this well on scanned century-old book pages is no small feat. Most of the advanced systems developed for Google Book Search however involve some form of Document Understanding and as such, come after OCR in the book processing pipeline. Systems that have been developed, are being developed or are being considered as interesting research challenges include: [Google Book Search: document understanding on a massive scale PDF]
These challenges include: page ordering, language identification, chapter identification, content linking (relate table of contents to appropriate boundaries, index entries to pages, ...); summarization; metadata extraction and cross validation; topic identification; book clustering and linking (create relationships between volumes).
He also discusses ranking:
Specifically, how should books that match a particular query be ranked? The web is notorious for its rich graph of hyperlinks, famously exploited by Google’s PageRank algorithm [6]. This structure applies somewhat to technical publications, which typically contain numerous references to other technical publications. However the universe of books is different and most books (eg, novels) do not contain any references. Novel approaches therefore had to be developed, exploiting an array of new signals. Additionally, these techniques were recently extended to allow “blending” of book search results with web search results when appropriate. [Google Book Search: document understanding on a massive scale PDF]
The paper outlines presentation options based on copyright status and also discusses how Google supports the document understanding community through the release of software and data sets.
I was interested that there was no discussion of social features.
Via SEO by the Sea.
November 09, 2007
•
Categories:
General - distributed environments
, Libraries - distributed environments
, Search
, Standards
Under the auspices of OASIS appears a discussion document about the 'search web service'.
The Search web service is a means of opening a database to external enquiry in a standardized manner that facilitates discovery of query and response possibilities and makes it possible for heterogeneous databases to be queried simultaneously with the same or similar queries. Client software can be easily configured using a standardized XML explain document that is accessible from the base URL or via the explain operation. In contrast with protocols such as SQL and XQuery, detailed knowledge of a database’s structure is not necessary as the explain document contains parsable information on server defaults, searchable indexes and record schemas that are returned in the response. [OASIS Specification Template]
There is a cryptic note about its relationship to SRU:
This specification is based on the SRU (Search Retrieve via URL) specification which can be found at http://www.loc.gov/standards/sru/. It is expected that this standard, when published, will deviate from SRU. How much it will deviate cannot be predicted at this time. The fact that the SRU spec is used as a starting point for development should not be cause for concern that this might be an effort to fast track SRU. The committee hopes to preserve the useful features of SRU, but not to preserve those that are not considered useful. [OASIS Specification Template]
There is a wiki for the OASIS group working on this.
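Since the draft starts from SRU, it may help to see what an SRU request looks like. The sketch below builds searchRetrieve and explain URLs using SRU 1.1 parameter names; the endpoint is hypothetical, no request is sent, and the eventual OASIS specification may well differ, as the note above says.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, max_records=10, version="1.1"):
    """Build an SRU searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,
        "maximumRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"


def sru_explain_url(base_url, version="1.1"):
    """Build an explicit SRU explain request URL."""
    params = {"operation": "explain", "version": version}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, used only to show the shape of the URLs.
BASE = "https://sru.example.org/catalog"
search_url = sru_search_url(BASE, 'dc.title = "lidos"')
explain_url = sru_explain_url(BASE)
```

A client would typically fetch the explain document first to learn which indexes and record schemas the server supports, and then send searches framed in those terms.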
September 18, 2007
•
Categories:
Books, movies and reading ...
, General - distributed environments
, Search
The decision by the New York Times to open up for general reading the formerly for-fee TimesSelect parts of its website is being widely discussed. The rationale given is interesting.
Since we launched TimesSelect in 2005, the online landscape has altered significantly. Readers increasingly find news through search, as well as through social networks, blogs and other online sources. In light of this shift, we believe offering unfettered access to New York Times reporting and analysis best serves the interest of our readers, our brand and the long-term vitality of our journalism. We encourage everyone to read our news and opinion – as well as share it, link to it and comment on it. [A Letter to Readers About TimesSelect - New York Times]
This is another indication that discovery happens elsewhere. The material is currently not available to people who come to the website, but more importantly it is not available for crawling, linking, quoting, commenting. It is not open to the web. The website is not the focus of a user's attention: the web is, and for material to be discoverable it must be open to the ways in which web users discover and share materials .... elsewhere.
Incidentally, I was struck by the comment that the online landscape has changed significantly in two years. That's two years!
September 16, 2007
•
Categories:
General - distributed environments
, Libraries - distributed environments
, Marketing
, Search
, User experience
I have been using the phrase 'discovery happens elsewhere' in recent presentations. I think it captures quite nicely an increasingly important part of how we think about our services.
No single website is the sole focus of a user's attention. Increasingly people discover websites, or encounter content from them, in a variety of places. These may be network level services (Google, ...), or personal services (my RSS aggregator or 'webtop'), or services which allow me to traverse from personal to network (Delicious, LibraryThing, ...).
This means thinking about services in different ways. About how we disclose stuff to other discovery environments; about where our metadata is; about URL structures, RSS feeds, and so on.
I have suggested before that it would be an interesting experiment to think about our services as if they had no user interface. Here maybe it would be interesting to think about services as if they could only be reached from some other place. It makes you think about the variety of other places that discovery happens.
Credits. 'Discovery happens elsewhere' is influenced by Steve Rubel's use of the phrase 'traffic happens elsewhere' in his discussion of what he calls the 'cut and paste' web.
September 02, 2007
•
Categories:
Books, movies and reading ...
, Knowledge organization and representation
, Metadata
, OCLC
, Search
I was interested to read the following in Susan Gibbons' The academic library and the Net Gen student.
As gaming becomes a more mainstream pastime and an important element in popular culture, academic libraries should begin to develop collections of books and journals about gaming. To find some recent monographs, search OCLC's Worldcat using subject headings such as "Internet games-social aspects" and "Computer games-psychological aspects." [p. 38]
Click for some results:
Internet games -- social aspects
Computer games -- psychological aspects
Some things that occur ....
First, we do not see subject headings or class numbers often used in that way in text, despite their pervasiveness in our library catalogs. Second, it will become increasingly clear that they apply only to a part of the library collection as we more and more pull together access across the whole collection. And third, how much more should we do with them? Wouldn't it be nice to have a selection of tags, or authors, or publishers, which are related to these headings in various ways? We are not making them work very hard ....
August 28, 2007
•
Categories:
Research, learning and scholarly communication
, Search
, The cultural and scholarly record
Publish or Perish looks interesting:
Publish or Perish is a software program that retrieves and analyzes academic citations. It uses Google Scholar to obtain the raw citations, then analyzes these and presents the following statistics: .... [Publish or perish]
Among the statistics it generates are: Total number of papers; Total number of citations; Average number of citations per paper; Average number of citations per author; Average number of papers per author; Hirsch's h-index and related indexes.
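For anyone unfamiliar with the h-index, it is the largest number h such that h of an author's papers have at least h citations each. Here is a minimal sketch of that calculation and of two of the simpler statistics listed above. The citation counts are made up, and this is not Publish or Perish's own code.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def summary(citations):
    """Compute a few basic citation statistics from per-paper counts."""
    papers = len(citations)
    total = sum(citations)
    return {
        "papers": papers,
        "citations": total,
        "citations_per_paper": total / papers if papers else 0.0,
        "h_index": h_index(citations),
    }


# Hypothetical per-paper citation counts for one author.
counts = [10, 8, 5, 4, 3]
stats = summary(counts)
```

For these counts the h-index is 4: four papers have at least four citations each, but there are not five papers with at least five.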
One interesting feature is that a search on a journal title returns a list of papers ranked by citation.
Now, of course, this is only as good as the data in Google Scholar. I expect we will see some analysis comparing results here with comparable results from ISI.
Via Greg Mankiw.
August 16, 2007
•
Categories:
Metadata
, Search
, User experience
We have just spent a while on the San Juan Islands (off the North West coast of Washington State and East of Vancouver Island for unfamiliar readers - Wikipedia).
I bought a couple of books in the congenial Pyramidion Used and Rare Books in Eastsound on Orcas Island. I was surprised and pleased to find Here comes the sun: architecture and public space in Twentieth-Century European culture by Ken Worpole (whom I have mentioned before in these pages). This was something that I thought I might buy when it came out but I never got around to it. And there it was - its golden covers standing out on the shelf. Worpole writes about the role of places in our social lives, and has written interestingly about public libraries.
Serendipity is indeed important.
It seems to me that I hear evocations of the importance of serendipity in the stacks and racks more often as folks are trying to explain what is lost as we take the digital turn. My response is that yes, serendipity is important, and there is an obvious imperative here: we need to make our data work harder to support the much enhanced opportunities for serendipity our network services provide. One of the ironies of the current discussion of the future of cataloging is how un-stretched existing catalog data is in our systems, whatever about additional or different metadata.
Aside: One of the more interesting chapters in Here comes the sun, with some wonderful pictures, is about the lido, the open-air swimming pool. I mentioned the pictures of lidos in my earlier post. I was interested to see that the one LibraryThing collector of this volume had tagged it with 'lidos', maybe sharing my impression of the interestingness of this chapter within the whole (the other tag is 'Britain' although the work self-describes with 'European' and has a Northern European focus), or indicating its importance in the published record of the lido. Who knows? However, this allows it to be related - potentially serendipitously for some users - to Liquid assets: the lidos and open air swimming pools of Britain. Interestingly, this connection is not made by the subject headings on either book (Here comes the sun; Liquid assets). It may or may not be appropriate to have expected the application of subject headings alone to have made the link but I think it does point up the desire to develop better ways of relating resources in multiple ways, as shown, in, for example, the work of the Powerhouse Museum.
July 19, 2007
•
Categories:
Search
, User experience
Several things have meant that blogging has been a bit slow the last week or two. However, I notice from the stats that the daily average number of visits so far this month has been the highest since I started.
Clearly, for maximum traffic I should stop posting altogether ;-) It must be the comment spam ....
A part of it is certainly Google. I did a search for 'web scale' the other day and was interested to find this entry on top of the results (remember that results may vary depending on your location, etc, and I doubt it will stay there for future visitors ;-).
Several longer entries I have done sometimes come around again in the lists of most visited entries. One of these is 'Discover, locate ... horizontal and vertical integration'. It must resonate with a library audience I thought to myself.
However, a while ago I noticed from logs that it was being regularly found by Google searches for some combination of horizontal and vertical integration. Again it ranks highly.
This stuff is quite interesting, but I need to keep away from it ... Enough other things to do ;-)
June 28, 2007
•
Categories:
Search
Thom has a post talking about his ranking in Google searches for 'thom' and 'hickey'. And, given the importance of the web to the way that people search for information, he suggests that it is probably not helpful to organizations, and their employees, to be poorly findable on the web. He draws on a post from Jon Udell which, among other things, argues that it would be useful to have better explanatory context for why people rank well.
For example, it would be interesting - to me anyway ;-) - to know what is at play in the relative positions of the librarian Lorcan Dempsey (1, 4, 5, 7), the architect Lorcan O'Herlihy (2) and the actor Lorcan Cranitch (6,8) in a Google search on Lorcan.
Of course, Lorcan is not the most common name in the world. There are more Dempseys than Lorcans, and the librarian (5, 10) follows the boxer Jack, the actor Patrick and an Ohio law firm. This prompted me to think about a post by Nicholas Carr some time ago:
Here's a sign of the times. Expectant parents are beginning to google prospective baby names to ensure that their kids won't face too much competition in securing a high search rank. The Wall Street Journal reports on one example of a couple using search engine optimization in picking a name: ... [Rough Type: Nicholas Carr's Blog: Womb-based SEO]
And to make a more general point, the web is making us think more about 'handles', whether it is parents thinking about naming their children, folks thinking about being more consistent in their names, or lots of people thinking about identifiers.
June 24, 2007
•
Categories:
Search
, Social networking
Facebook opened itself up to non-college students a while ago. And weeks ago it opened itself up to other applications through the Facebook platform. It describes itself as a 'social utility'. Its CEO, Mark Zuckerberg, talks about the 'social graph', a vast social interconnectedness which propagates news and views. Indeed, folks are talking about it as if we have turned an Internet corner, as revealing the next phase of our network lives.
More of this later, I am sure, but for now I string together some quotes from some of the more suggestive pieces that have appeared in the last couple of weeks.
Here is David Berlind:
In unpacking those two statements, I’ll start with the social graph, a term that boldly surfaced when Facebook CEO Mark Zuckerberg introduced his company’s F8 Platform last month. He attributed the power of Facebook to the social graph, which he defined as network of connections and relationships between people on the service. At the launch of the Facebook platform, Zuckerberg said, “[The social graph] is changing the way the world works. As Facebook adds more and more people with more and more connections it continues growing and becomes more useful at a faster rate. We are going to use it spread information through the social graph.” [» Yahoo’s search for a social graph | Between the Lines | ZDNet.com]
This is in the general context of a discussion of how Yahoo needs a social network to hold together its various people-centred services. Writing on Techcrunch a while earlier, David Sacks argued similarly. He discussed the evolution from browse to search to share. These have emerged successively (think Yahoo, Google, Facebook) and will continue to exist together. However, he is very optimistic about the power of the social graph, and suggests that contextual sharing within your group of 'friends' will become much more important.
While the process of structuring new kinds of information for the social graph to distribute is still sorting itself out, it is easy to object to the frivolity of information on Facebook. For example, Facebook is great at telling me what my friends just had for lunch, but how about hard news? Well, for starters, I’m waiting for the Digg application to not only display articles I’ve digged on my profile, but also to aggregate all the articles dugg by my friends. This could lead to the kind of social news site that MySpace promised but failed to deliver. Not only Digg, but virtually all Web 2.0 applications which are based on the wisdom of crowds can be reconceived as Facebook apps based on the wisdom (or trust) of friends. To the extent that these services cater to publishers who seek a mass audience, such as YouTube or Flickr, the social graph will not threaten their business. But to the extent they publish content intended for friends, or if the value of their service increases with the participation of friends, these applications face only two choices: get each user to recreate his or her friendship network on their own site or migrate their service to the Facebook platform lest someone else does it first. [The New Portals: It’s the Bread, Not the Peanut Butter]
Marc Andreessen also talks about the power of the social graph (without using the term) as a phenomenal 'viral distribution engine'. He identifies this as one of the benefits of the Facebook platform in a reflective and laudatory piece about the significance of Facebook's platform strategy.
Metaphorically, Facebook is providing the ease and user attraction of MySpace-style embedding, coupled with the kind of integration you see with Firefox extensions, plus the added rocket fuel of automated viral distribution to a huge number of potential users, and the prospect of keeping 100% of any revenue your application can generate. [blog.pmarca.com: Analyzing the Facebook Platform, three weeks in]
However, he does draw attention to the fact that while the Facebook platform allows other functionality to be embedded in Facebook, it does not allow Facebook functionality to be embedded in other venues.
These factors are, however, very reflective of the fact that while the Facebook Platform gives developers a lot of capabilities that they never had before, and access to a huge base of enthusiastic users, as a Facebook developer you're very much living in Facebook's world -- you're not creating your own world. And you have to be serious enough about living in that world that you are willing to hit the fairly high barrier of being willing to run your own servers and infrastructure for any applications you build. [blog.pmarca.com: Analyzing the Facebook Platform, three weeks in]
A point also made here by Andy Powell.
Facebook has captured the imagination. And it will be interesting to see how it aligns itself with the other major gravitational hubs of the web in coming months. However, even Facebook is not comprehensive enough to be the sole focus of a user's attention, so it would seem inevitable that it will have to open itself up the other way also, so that, for example, I can post from Facebook to my blog as well as from my blog to Facebook.
Update: I think that Facebook is showing the potential of the social graph. Will it be the social graph? See Mark Evans and Richard Skrenta for comment.
June 12, 2007
•
Categories:
Libraries - distributed environments
, Libraries - organization and services
, Metadata
, Search
, User experience
Judith Pearce from the National Library of Australia left an interesting comment about the integration, or not, of full-text book indexes and library catalogs. Here is an excerpt:
Here at the National Library of Australia, just as we are starting to address the challenge of getting nice fully FRBRised, relevance-ranked and clustered search results from a centralised data corpus, we need to start thinking about searching the whole book. We already have full-text indexes to our own locally hosted content so it makes sense to extend this to externally hosted content. Our Library Labs prototype at http://ll01.nla.gov.au/ does search Google Books at the moment but the results are not at all well-integrated into the rest of the page. And we would need to target multiple external sources to get full coverage. [Judith Pearce comment on Lorcan Dempsey's weblog: On demand book search again ...]
The Library Labs prototype she points to is worth a look, acknowledging that it is a place for trying out things.
I was interested to follow the link from that page to a presentation by her colleagues Alison Dellit and Tony Boston which discusses this work in the context of the further development of Libraries Australia.
Our challenge – as a library community – is to make these resources as easy to find and get as the best “long tail” businesses resources are. Finding and getting a library item should be no more complicated than searching and ordering on Amazon, or Ebay. To do this, we need to make searching Libraries Australia as easy and intuitive as possible – including providing new ways for users to browse material; and we need to make getting resources as easy as possible. This paper reports on efforts to improve the searchability of Libraries Australia. Discussions on improving the getting of Libraries Australia material are outside the scope of this paper, however, we would like to note the recent establishment of the Rethinking Resource Sharing Reference Group, which is looking at this problem. [Relevance ranking of results from MARC-based catalogues: from guidelines to implementation exploiting structured metadata]
The paper discusses potential approaches to a range of issues around ranking, tagging, clustering, recommending, and also considers the benefits of consolidation. Worth a read.
Aside: I was reminded reading it of my suggestion that we want to 'rank, relate and recommend' better in our systems. I have changed the order from the original Rank, recommend and relate.
Aside: I need to update my list of entries about the catalog: Talking about the catalog.
June 06, 2007
•
Categories:
Books, movies and reading ...
, Digital asset management
, Featured
, Knowledge organization and representation
, Libraries - organization and services
, Metadata
, OCLC
, Search
, ebooks and other e-resources
Today Google and the CIC announced an agreement to digitize ten million volumes across the CIC libraries. Google has been adding new partners since the first announcement was made about the Google 5. Some folks have wondered what rationale has governed selection of partner opportunities. We do not know, but they sure are moving fast! Here are some early thoughts.
The CIC announcement is interesting for several reasons:
- It is a shared effort across a major group of libraries with significant collections. There appears to be strong CIC institutional commitment. Of course, CIC has a history of collaboratively sourced activities and this 'pooling' model makes increasing sense given the necessary policy and service challenges that need to be addressed, both in this case and across a range of other issues that libraries face as they support changing research and learning behaviors in a reconfigured network environment. For some things, scale matters.
- The libraries have a shared approach to managing the digital copies, based on shared infrastructure at the University of Michigan, and to serving them up to their user communities. An example of collaborative sourcing.
- Google recently advertised for somebody to work on collection development and we seem to be seeing a stronger focus in this area. Collecting areas of importance within each library [pdf] have been identified for attention. Presumably, these decisions have been influenced by the 'collective collection' of the full Google partnership also.
This initiative in turn prompts some more general thoughts about access:
- One of the most valuable features of the Google initiative is that it digitizes book content, allowing fine-grained discovery over topics, people, places and so on. Of course this presents interesting questions about indexing, retrieval, ranking, and presentation but the advantage of having this access seems clear. It drives use and sales, and it supports enquiry. Without it, the book literature is less accessible than the web literature.
- However, as we are beginning to see on Google Book Search, we are really going beyond 'retrieval as we have known it' in significant ways. Google is mining its assembled resources - in Scholar, in web pages, in books - to create relationships between items and to identify people and places. So we are seeing related editions pulled together, items associated with reviews, items associated with items to which they refer, and so on. As the mass of material grows and as approaches are refined this service will get better. And it will get better in ways that are very difficult for other parties to emulate.
- Currently this material is made available within the Google destination site. Google is an advertising engine and its approach depends on aggregating attention for adverts. This approach may be difficult to deploy within a more 'data services' approach where others - especially the partners - have remixable access to content and services. However, the 'utility' value of this resource will be diminished if it is not made available in this way so that others can mobilize these resources within their own environments. How and if this gets done remains to be seen. (See the related discussion about the search API.)
- This type of access seems especially important for the partner libraries. In the early days of this activity there was some discussion of the types of services which would be built on top of the digitized books by the libraries. However, it is difficult, and maybe not very sensible, for the libraries to individually invest in some types of service development. An important factor here is that they cannot benefit from the network effects tha
May 24, 2007
•
Categories:
Libraries - organization and services
, Marketing
, Search
, User experience
In recent presentations I have been referring to the University of Washington's initiative to systematically put links to its digital collections in relevant Wikipedia entries. I use it as an example of putting library resources 'in the flow' of their users' behavior. If Wikipedia is where many folks end up when they are looking for things, then it makes sense to have links there. Ann Lally and Carolyn Dunford describe the initiative in the current issue of D-Lib Magazine and discuss its impact.
Web 2.0 technologies offer librarians a great opportunity to enhance the authority of resources that students use on a daily basis, and to push their knowledge and expertise beyond the traditional boundaries of the library. We now consider Wikipedia an essential tool for getting our digital collections out to our users at the point of their information need. We view this as a very low cost way to enhance access to our collections, as well as an effective way to participate in the creation of resources that are used by millions around the world. We will continue to explore how we can take advantage of the opportunities that Web 2.0 technologies offer us when marketing our digital and physical collections. [Using Wikipedia to Extend Digital Collections]
They also point to an interesting report [pdf] from MIT Libraries which presents the findings of an investigation of information seeking behaviors of their students. Based on findings they note the following priorities for the Library:
- Make discovery easier and more effective
- Incorporate trusted networks in finding tools
- Continue to put links to the Libraries' services and resources where the users are
May 18, 2007
•
Categories:
General - systems and technologies
, Metadata
, Search
, User experience

There has been some discussion - less than I expected - about Google's steps to develop a unified search across its services (blogsearch, booksearch, YouTube, etc) so that blogs, video, books, maps, and so on are returned in results on the main Google site.
This latest refinement sounds simple, but it isn't. According to the Californian technology powerhouse, it is a result of two years' work by more than 100 engineers and involved a major revamp of the company's software platform. [Google takes search to next level | Guardian Unlimited Business]
This is a major step given the central importance of ranking to Google and the different ranking models that it employs across these individual services. The first signs of the integration are showing up and more stuff will be progressively introduced.
Google's vision for universal search is to ultimately search across all its content sources, compare and rank all the information in real time, and deliver a single, integrated set of search results that offers users precisely what they are looking for. Beginning today, the company will incorporate information from a variety of previously separate sources – including videos, images, news, maps, books, and websites – into a single set of results. At first, universal search results may be subtle. Over time users will recognize additional types of content integrated into their search results as the company advances toward delivering a truly comprehensive search experience. [Google Press Center: Press Release]
At the same time as these changes are being introduced they have released some experimental features in the Google Labs area. These include displays of results around a timeline and on a map. They also interestingly feature a couple of services which provide additional navigation options based on returned results. These include Left-hand search navigation and Right-hand contextual search navigation.
This is all intriguing. It will be fascinating to see how they handle a major transition which has a significant impact on the core of the Google user experience and how well they deliver on the promise of integration.
We sometimes talk about Google as if it is something fixed: we know what it does and how it works. This is especially so in library discussions where the library experience is compared to the Google experience, as if it were something that was going to continue in its current form. However, look at what they have been doing with Google Booksearch and now look at this big change. We do not know what Google will be like in three years' time - it will certainly not be the Google of today.
What I find most interesting about these directions is the gradual introduction of additional navigation options. We are used to hearing people talk about the 'simple search box' as a goal. But a simple search box has only been one part of the Google formula. PageRank has been very important in providing a good user experience, and effective ad placement is important for their revenue model. However, as we move to merged results over a mixed resource base, a single ranked list becomes less useful and other browse/navigation options become more important. We are seeing Google experiment with options to narrow by resource type, navigate by related terms, offer related searches, and so on. Basically, they are mining their data to offer a richer 'texture of suggestion' than they have in the past. Search may start with the simple search box, but then a variety of directions are opened up based on the results.
We can see this emerge as a pattern: a simple entry point into a richer navigation space. This is emerging in our library catalogs, which are moving to think about how to better exploit the structure of the data to create navigable relations (faceted browsing, FRBR, ...). In this way, the user follows data paths presented after an initial search rather than having to make complicated choices up front before seeing any results. And this highlights again the need to make our bibliographic data work harder in systems and services.
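To make the pattern concrete, here is a minimal sketch in Python of a simple search followed by facet counts that open up further paths. It is an illustration only, not how Google or any particular catalog does it: the records, the field names (`format`, `language`) and the helper functions `search`, `facet_counts` and `narrow` are all invented for the example.

```python
from collections import Counter

# Invented toy records; the field names are assumptions for illustration.
RECORDS = [
    {"title": "Rivers of Ireland", "format": "book", "language": "English"},
    {"title": "Rivers of Ireland", "format": "map", "language": "English"},
    {"title": "Irish rivers on film", "format": "video", "language": "Irish"},
    {"title": "Mountain walks", "format": "book", "language": "English"},
]


def search(records, query):
    """The 'simple search box': keep records whose title contains every query term."""
    terms = query.lower().split()
    return [r for r in records if all(t in r["title"].lower() for t in terms)]


def facet_counts(results, fields):
    """Count the values of each structured field across the result set."""
    return {field: Counter(r[field] for r in results) for field in fields}


def narrow(results, field, value):
    """Follow one of the offered paths: keep results matching a facet value."""
    return [r for r in results if r[field] == value]


# One simple query up front ...
results = search(RECORDS, "rivers")
# ... then the structure of the data suggests where to go next.
facets = facet_counts(results, ["format", "language"])
books = narrow(results, "format", "book")

print(facets)
print(books)
```

The user types one thing and makes no choices up front; the choices (by format, by language) are computed from the results and only then presented. The richer the structure in the records, the more paths can be offered.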
May 03, 2007
•
Categories:
Search
While checking up on referrals from the logs - as discussed a few moments ago - I also had a look at Yahoo!, which I had not done for a while.

I was interested to see results from Yahoo! Answers included in search results. Here is the first result from Answers in response to a query just now on Libraries Australia.
April 30, 2007
•
Categories:
Libraries - organization and services
, OCLC
, Search
I visited the University of Virginia last week where I spoke about the future of the catalog. This was more topical than I had realized when I agreed the subject with my hosts! When I arrived, the first thing that people wanted to talk about was Roy Tennant! The second was WorldCat Local.
I mentioned that WorldCat Local would go live in a beta version this week at the University of Washington, and indeed it just has. Check out the library home page, and the about page at UW Libraries.
This is a result of much work by colleagues in OCLC and at UW.
Of course, UVa is a focus of Solr developments and has mounted the Blacklight prototype with UVa data. Here is Bess Sadler:
If you were at code4libcon you’ve already seen Erik Hatcher’s initial foray into the world of indexing library catalog data with solr. I am pleased to announce that the unofficial UVa team we’ve assembled has come even further with that effort, and for the next couple of weeks we are demonstrating Blacklight, UVa’s solr based faceted OPAC. (Solr? UVA? Blacklight. Get it? Erik came up with it. Pretty clever, no?) [Solvitur ambulando » UVa Blacklight - faceted browsing and prospect]
I like those pie charts. Coming back from the trip I was also interested to see Ryan Eby's list of Solr implementations.
Things are moving along ...
April 28, 2007
•
Categories:
Books, movies and reading ...
, Miscellaneous
, Search
I just came across Google Authors, a series of videos of guest speakers at Google. An interesting variety and different formats (lecture, interview, ...). Many are interviewed by Eric Schmidt. Some household names (if you live in a BoBo household, to use the term coined by David Brooks, one of the interviewees [on youtube]) and some that were unfamiliar to me.
Indeed, such is the variety that I am not sure who to link to as examples ;-)
In addition, we've just added our most important location yet: an online home at google.com/talks/authors with a video archive of our events on YouTube. Just this year, we've hosted a great variety of authors, including Martin Amis, Strobe Talbott, Bob & Lee Woodruff, Jonathan Lethem, Don Tapscott, Senator Hillary Clinton, and Carly Fiorina. The subjects of their talks range from literary fiction to science fiction, sociology to technology, politics to business. [Official Google Blog: Authors@Google]