I have been careful in recent posts about numbers to note that the term 'book' has no precise referent. One person's book is another person's pamphlet, or ...
Brian Lavoie and I have an article in the current issue of D-Lib Magazine which uses WorldCat to estimate the number of print books published in the US that are potentially in copyright (i.e. with a publication date of 1923 or later). Of course, this question has been given currency in recent years by the mass digitization of library book collections, notably by Google.
This article characterizes the aggregate collection of US-published print books in WorldCat, with a special emphasis on materials published during or after 1923, and therefore either potentially or definitely in copyright. Findings from the analysis indicate that the collection of US-published print books in WorldCat is quite large, encompassing about 15.5 million print books. Nearly two-thirds of these - those published after 1963 - have a high likelihood of being in copyright; less than 15 percent - those published prior to 1923 - are almost certainly in the public domain, with the rest - those published between 1923 and 1963 - potentially in copyright if copyright was renewed. The post-1923 materials collectively account for more than 80 percent, or about 12.6 million, of the US-published print books in WorldCat. It is difficult to predict how many of these print books might be orphan works, but even a small fraction would, in terms of absolute numbers, be considerable, and require a substantial effort to investigate and clear copyright. [Beyond 1923]
To generate these counts, decisions have to be made about what to count, and those decisions have to be based on what the data we have can support. I was talking to Brian about some of these decisions a while ago, and he wrote up some of his comments, which I include here.
"As a non-librarian who works with library data on a regular basis, I was surprised to learn that the commonplace object 'book' is not well-defined in traditional cataloging practice. This is all the more surprising when one considers that historically, libraries were built around aggregations of books. The difficulty is that there are no explicit bibliographic criteria for identifying something most people would recognize as a 'book'. So for example, consider a simple question like 'How many books are in WorldCat?' In the bibliographic universe, there is nothing explicitly defined as a 'book': there are monographs, or more narrowly, language-based monographs, but the items falling into these categories are not necessarily books as we might commonly perceive them. Is a government document a book? A dissertation? A technical report? A pamphlet of only a dozen pages? These kinds of materials, and more, get included when we use a construct like 'language-based monographs' as a proxy for 'books'.
"Why is this important? The concept of "books" is appearing in a variety of current discussions, most notably in the context of digitization issues like the Google book settlement. So we are often asked questions like, 'how many print books in WorldCat have been published after 1923?' We can provide answers to these questions, but only with a degree of approximation built in: i.e., we can calculate a number that reflects something along the lines of 'all language-based monographs in WorldCat, excluding dissertations and government documents'; we can even throw in a minimum page requirement (at least 49 pages, according to the UNESCO definition of a book). But we can't say exactly how many books are in WorldCat, because from a cataloging standpoint, we don't know what a book is. Libraries are grappling with difficult new questions these days, as collections and services transition from print to digital, from local to the network. But an old question still remains: what is a book?" [Personal communication from Brian Lavoie]
[Note: Updated to include a later version of Brian's quote. This is repeated from an earlier entry, "On books again."]
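To make the approximation Brian describes a little more concrete, here is a minimal sketch in Python of the kind of filter he outlines: language-based monographs, excluding dissertations and government documents, with the 49-page minimum. The `Record` class, its field names, and the sample data are all hypothetical and chosen for illustration; they are not WorldCat fields, and this is not the method used in the article.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

MIN_PAGES = 49  # page threshold mentioned in the quote (UNESCO definition)


@dataclass
class Record:
    """Hypothetical, simplified bibliographic record (not a real schema)."""
    title: str
    is_language_monograph: bool
    is_dissertation: bool
    is_government_document: bool
    pages: Optional[int]
    pub_year: Optional[int]


def counts_as_book(rec: Record, min_pages: int = MIN_PAGES) -> bool:
    """Apply the proxy definition: a language-based monograph that is not a
    dissertation or government document and meets the page minimum."""
    if not rec.is_language_monograph:
        return False
    if rec.is_dissertation or rec.is_government_document:
        return False
    # Records with no page count cannot be shown to meet the minimum.
    return rec.pages is not None and rec.pages >= min_pages


def count_books(records: Iterable[Record], from_year: Optional[int] = None) -> int:
    """Count records passing the proxy, optionally only those published in
    or after from_year. Records with no year are skipped when filtering."""
    total = 0
    for rec in records:
        if not counts_as_book(rec):
            continue
        if from_year is not None and (rec.pub_year is None or rec.pub_year < from_year):
            continue
        total += 1
    return total


# Made-up sample records, purely for illustration.
sample = [
    Record("Novel", True, False, False, 320, 1950),
    Record("Pamphlet", True, False, False, 12, 1970),
    Record("Thesis", True, True, False, 200, 1985),
    Record("Census report", True, False, True, 400, 1930),
    Record("Old treatise", True, False, False, 250, 1890),
    Record("Map", False, False, False, None, 1960),
]

print(count_books(sample))                  # all records passing the proxy
print(count_books(sample, from_year=1923))  # only those from 1923 onward
```

Every branch in `counts_as_book` is a judgment call of the kind discussed above: change the page minimum or the excluded categories and the count changes with it, which is why any such number is an approximation and not a definitive count of books.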