by John Stevens
September 13, 2010 3:51 PM
Have rumors of the demise of the hardbound book been greatly exaggerated? Some scholars, especially linguists, say so. Google Books is being criticized by a growing number of sources for providing inaccurate metadata -- the informational tags that tell searchers what they can hope to find in a given source. If you are making decisions about what to do with books saved in a self storage unit based on the possibility of finding texts online using Google Books, you may want to think again -- and hold onto stored books a little longer.
Last year at this time, linguist Geoffrey Nunberg wrote in The Chronicle of Higher Education that Google Books is full of metadata errors. Nunberg begins with errors in publication date, noting that Tom Wolfe’s Bonfire of the Vanities is dated 1888 in Google Books, for example. “To take Google’s word for it,” Nunberg continues, “1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler’s Killer in the Rain, The Portable Dorothy Parker, André Malraux’s La Condition Humaine, Stephen King’s Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams’s Culture and Society 1780-1950, and Robert Shelton’s biography of Bob Dylan, to name just a few.”
Nunberg tested Google Books by searching on modern terms such as Internet (which got 527 results published before 1950) and Woody Allen (325 hits before 1812).
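A test like Nunberg's can be sketched in a few lines of Python. This is only an illustration of the idea, not how Nunberg or Google actually ran the searches: the `Record` class, the `find_anachronisms` function, and the sample records are all made up for the example, and the 1950 cutoff is simply the one mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    listed_year: int  # publication year as given in the catalog metadata
    text: str         # searchable text of the book


def find_anachronisms(records, term, cutoff_year):
    """Return records that mention `term` but are dated before `cutoff_year`."""
    needle = term.lower()
    return [
        r for r in records
        if r.listed_year < cutoff_year and needle in r.text.lower()
    ]


# Invented sample data, for illustration only.
sample_records = [
    Record("Misdated Networking Primer", 1899, "How to connect to the Internet."),
    Record("Correctly Dated Networking Primer", 1996, "How to connect to the Internet."),
    Record("Old Sailing Manual", 1899, "How to rig a mainsail."),
]

suspects = find_anachronisms(sample_records, "Internet", 1950)
for record in suspects:
    print(record.title, record.listed_year)
```

Any record the check returns is a candidate for a wrong publication date, since the term should not appear in print that early.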
Another rich source of metadata errors in Google Books is classification, in which books are labeled as belonging to the wrong category. Melville’s Moby Dick was labeled a computer book by Google Books, Nunberg says, while The Cat Lover’s Book of Fascinating Facts was classified as technology and engineering. Susan Bordo’s 2003 book of feminist theory, Unbearable Weight: Feminism, Western Culture, and the Body, was labeled by Google Books as a health and fitness book.
Some of the errors in Google Books arise from errors in scanned copies that were reproduced when the works were copied into Google. But some come from Google’s attempts to use software to automatically retrieve a publication date from a scanned text. In some cases, Google’s programs draw a book’s publication date from a bookplate -- Nunberg comments on an Elizabethan bookplate that shows a coat of arms and the number 1574, which happens to appear in a book with a later publication date that is located in Harvard’s library. Google used 1574 as the book’s own publication date.
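To see how a bookplate can fool a program, consider a deliberately naive Python sketch. It is not Google's actual method, which is not described in the sources used here. It assumes a hypothetical rule of taking the first four-digit year found in the scan. The function names, the page labels, and the 1901 title-page date are invented for the example.

```python
import re

# Matches four-digit years from 1400 through 2099.
YEAR_PATTERN = re.compile(r"\b(1[4-9]\d\d|20\d\d)\b")


def first_year_in(text):
    """Return the first four-digit year in `text`, or None if there is none."""
    match = YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else None


def naive_publication_year(pages):
    """Hypothetical heuristic: take the first year found anywhere in the scan."""
    for _label, text in pages:
        year = first_year_in(text)
        if year is not None:
            return year
    return None


def title_page_publication_year(pages):
    """More cautious variant: only trust years printed on the title page."""
    for label, text in pages:
        if label == "title":
            return first_year_in(text)
    return None


# Invented scan: a bookplate pasted inside the cover, then the title page.
scanned_pages = [
    ("bookplate", "Ex libris, with coat of arms, 1574"),
    ("title", "Printed in London, 1901"),
]

naive_year = naive_publication_year(scanned_pages)
careful_year = title_page_publication_year(scanned_pages)
print(naive_year, careful_year)
```

The naive rule returns the bookplate's 1574, while the variant that reads only the title page returns the later date. The second approach depends on knowing which page is which, and that information is itself a kind of metadata.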
Many of the errors that Nunberg mentions have been corrected by Google Books, since his initial work on the subject attracted a lot of media attention. He also has an idea for making Google Books more accurate: combining its metadata with data from the catalogs at the Library of Congress. So Nunberg ended his article on an optimistic note. But has Google Books become more accurate since then?
According to a Salon.com article published last week, the answer is no, not much. Salon writer Laura Miller did her own research and found The Golden Bough by James Frazer classified by Google Books as life sciences, and she notes that the work’s 12 volumes are not grouped together and are not searchable as a whole.
While metadata errors in classification and publication date may not mean much to the casual user of Google Books, they may be a big problem for scholars who had hoped to use Google Books as a technological shortcut for their research. When Miller phoned Nunberg to find out why metadata was so important, Nunberg explained that some information can’t be found in a Google search. A scholar might want to know what edition of Marx was referred to by a particular thinker, for example. Or, as Nunberg said in the Salon article, “I might want to search for the first sentence of a Henry Fielding novel across different editions.... And I might want to search across collections: How often was a word used in a particular historical period?”
Furthermore, Nunberg explained, Google’s metadata is based on the subject categories that retail bookstores use to classify the books they sell. That system “has 20 subcategories for children's books about various animals -- books about bears, or about monkeys, for example -- but only one category for European poetry. In a retail bookstore, you're not going to have a section for 18th century Italian poetry or 17th century German poetry; all the European poetry is going to be shelved together. But that's a ridiculous way to classify the collection of the Harvard Library.”
Nunberg and other scholars are working to get Google talking to librarians at university libraries, and that may result in an overhaul of the system used to provide metadata for Google Books. But until that happens -- and maybe even after it does -- hard copies of books will continue to be an important reference for linguists. Right now it is primarily linguists and scholars interested in the history of ideas who like to trace the development of words and ideas through several editions of the same book, or by comparing books published in a given subject area year by year to see how ideas on a subject develop. But even scientific scholars may someday be interested to find out at what point a word like “metadata” became a working part of the average scholar’s -- or even the average person’s -- vocabulary.
Sources used:
“Books Known Issues.” Google Books.
Hellmen, Eric. “Ebook summit preview: should kids get ebooks in school?” Library Journal. Aug. 24, 2010.
Miller, Laura. “The trouble with Google Books.” Salon. Sept. 9, 2010.
Nunberg, Geoffrey. “Google’s book search: a disaster for scholars.” The Chronicle of Higher Education. Aug. 31, 2009.