Google Book Search: Document Understanding on a Massive Scale [PDF] is a brief treatment of issues faced by Google as they grow their corpus of digitized books and work to make it useful in various ways.

Luc Vincent of Google discusses OCR (the problem of many languages occurring unpredictably in variously formatted volumes, at scale) and then focuses on document understanding.

In addition to OCR, making these books easily accessible and useful on http://books.google.com has required developing a number of additional state-of-the-art systems. These include systems for automatically deskewing, cropping and cleaning-up scanned book pages, which is critical as pre-processing prior to OCR, but also to generate clean and small images for efficient web serving. While this may be a well understood problem for high-quality documents, doing this well on scanned century-old book pages is no small feat. Most of the advanced systems developed for Google Book Search however involve some form of Document Understanding and as such, come after OCR in the book processing pipeline. Systems that have been developed, are being developed or are being considered as interesting research challenges include: [Google Book Search: document understanding on a massive scale PDF]

These challenges include page ordering; language identification; chapter identification; content linking (relating the table of contents to the appropriate boundaries, index entries to pages, ...); summarization; metadata extraction and cross-validation; topic identification; and book clustering and linking (creating relationships between volumes).

He also discusses ranking:

Specifically, how should books that match a particular query be ranked? The web is notorious for its rich graph of hyperlinks, famously exploited by Google’s PageRank algorithm [6]. This structure applies somewhat to technical publications, which typically contain numerous references to other technical publications. However the universe of books is different and most books (eg, novels) do not contain any references. Novel approaches therefore had to be developed, exploiting an array of new signals. Additionally, these techniques were recently extended to allow “blending” of book search results with web search results when appropriate. [Google Book Search: document understanding on a massive scale PDF]

The paper outlines presentation options based on copyright status and also discusses how Google supports the document understanding community through the release of software and data sets.

I was interested that there was no discussion of social features.

Via SEO by the Sea.

Comments: 3

Jan 07, 2008
bowerbird

i've waited a long time for this dialog to start.

the paper itself is "too little, too late,"
but let's see if the offer to collaborate
bears fruit...

google has a lot of lemon scan-sets, and i have
been itching to make some digital-text lemonade.

i'm off to send an e-mail to luc...

-bowerbird

Jan 17, 2008
bowerbird

no response from google. it figures...

-bowerbird

Jan 29, 2008
Henk

I am sure OCR and Google will make books more accessible to a lot of people. However, the most important thing is that those people will read these books.