Tuesday, August 24, 2010

Google's count of 130 million books is probably bunk; ArsTechnica.com, 8/9/10

Jon Stokes, ArsTechnica.com; Google's count of 130 million books is probably bunk:

""After we exclude serials, we can finally count all the books in the world," wrote Google's Leonid Taycher in a GBS blog post. "There are 129,864,880 of them. At least until Sunday."

It's a large, official-sounding number, and the explanation for how Google arrived at it involves a number of acronyms and terms that will be unfamiliar to most of those who read the post. It's also quite likely to be complete bunk...

But the problem with Google's count, as is clear from the GBS count post itself, is that GBS's metadata collection is riddled with errors of every sort. Or, as linguist and GBS critic Geoff Nunberg put it last year in a blog post, Google's metadata is a "train wreck: a mish-mash wrapped in a muddle wrapped in a mess."

Indeed, a simple Google search for "google books metadata" (sans quotes) will turn up mostly criticisms and caterwauling by dismayed linguists, librarians, and other scholars at the terrible state of Google's metadata. Erroneous dates are pervasive, to the point that you can find many GBS references to historical figures and technologies in books that Google dates to well before the people or technologies existed. The classifications are a mess, and Nunberg's presentation points out that the first 10 classifications for Walt Whitman's "Leaves of Grass" classify it as Juvenile Nonfiction, Poetry, Fiction, Literary Criticism, Biography & Autobiography, Counterfeits and Counterfeiting. Then there are authors that are missing or misattributed, and titles that bear no relation to the linked work."

http://arstechnica.com/science/news/2010/08/googles-count-of-130-million-books-is-probably-bunk.ars
