Lorcan Dempsey's weblog On libraries, services and networks.   

Google Book Search and document understanding

 •  Categories: Books, movies and reading ... , Digital asset management , General - systems and technologies , Identity management, IPR and e-commerce , Search

Google Book Search: Document Understanding on a Massive Scale [PDF] is a brief treatment of issues faced by Google as they grow their corpus of digitized books and work to make it useful in various ways.

Luc Vincent of Google discusses OCR (issues of many languages occurring unpredictably in variously formatted volumes, at scale), and then focuses on issues of document understanding.

In addition to OCR, making these books easily accessible and useful on http://books.google.com has required developing a number of additional state-of-the-art systems. These include systems for automatically deskewing, cropping and cleaning-up scanned book pages, which is critical as pre-processing prior to OCR, but also to generate clean and small images for efficient web serving. While this may be a well understood problem for high-quality documents, doing this well on scanned century-old book pages is no small feat. Most of the advanced systems developed for Google Book Search however involve some form of Document Understanding and as such, come after OCR in the book processing pipeline. Systems that have been developed, are being developed or are being considered as interesting research challenges include: [Google Book Search: document understanding on a massive scale PDF]

These challenges include: page ordering; language identification; chapter identification; content linking (relating the table of contents to the appropriate boundaries, index entries to pages, ...); summarization; metadata extraction and cross-validation; topic identification; and book clustering and linking (creating relationships between volumes).

He also discusses ranking:

Specifically, how should books that match a particular query be ranked? The web is notorious for its rich graph of hyperlinks, famously exploited by Google’s PageRank algorithm [6]. This structure applies somewhat to technical publications, which typically contain numerous references to other technical publications. However the universe of books is different and most books (eg, novels) do not contain any references. Novel approaches therefore had to be developed, exploiting an array of new signals. Additionally, these techniques were recently extended to allow “blending” of book search results with web search results when appropriate. [Google Book Search: document understanding on a massive scale PDF]
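To make the point about link structure concrete, here is a minimal sketch of a PageRank-style power iteration over a toy citation graph. It is not Google's implementation, and the paper does not describe one; the function name `pagerank`, the damping value of 0.85, and the five titles are all assumptions chosen for illustration. The two "novels" cite nothing and are cited by nothing, so the link graph gives no way to tell them apart, which is presumably why other signals were needed.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    `links` maps each item to the list of items it references.
    Items with no outgoing references spread their rank evenly
    over all items, a common convention that keeps the total at 1.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every item starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No references: distribute this item's rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy corpus: three technical papers that cite each other,
# and two novels that contain no references at all.
citations = {
    "paper_a": ["paper_b", "paper_c"],
    "paper_b": ["paper_c"],
    "paper_c": [],
    "novel_x": [],
    "novel_y": [],
}

ranks = pagerank(citations)
for title in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{title}: {ranks[title]:.3f}")
```

In this toy graph the papers that are cited separate from each other, while the novels end up with identical scores: the reference graph alone says nothing about which novel should rank higher.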

The paper outlines presentation options based on copyright status and also discusses how Google supports the document understanding community through the release of software and data sets.

I was interested that there was no discussion of social features.

Via SEO by the Sea.


3 comments so far

Posted by bowerbird on January 7, 2008

i've waited a long time for this dialog to start.

the paper itself is "too little, too late,"
but let's see if the offer to collaborate
bears fruit...

google has a lot of lemon scan-sets, and i have
been itching to make some digital-text lemonade.

i'm off to send an e-mail to luc...

-bowerbird

Posted by bowerbird on January 17, 2008

no response from google. it figures...

-bowerbird

Posted by Henk on January 29, 2008

I am sure OCR and Google will make books more accessible to a lot of people. However, the most important thing is that those people will read these books.
