Showing posts with label data mining.

Monday, February 1, 2010

Colloquium: How to read 15 million books in one sitting; Bill Schilit, Google Research, 2/3/10 4 PM, Carnegie Mellon University

Colloquium: How to read 15 million books in one sitting; Bill Schilit, Google Research; 2/3/10 4 PM, Carnegie Mellon University, Newell Simon Hall 1305 (Michael Mauldin Auditorium):

"Abstract

Scanning books, magazines, and newspapers is widespread because people believe a great deal of the world's information still resides off-line. In general after works are scanned they are OCR'ed, indexed for search and processed to add links. In this talk I will describe a new approach to automatically add links by mining repeated passages. This technique connects elements that are semantically rich, so strong relations are made. Moreover, link targets point within rather than to the entire work, facilitating navigation. Our system has been run on a digital library of many millions of books (Google Book Search), has been used by thousands of people, and has generated the world's largest collection of quotations. I will also present a follow-on project based on the theory that authors copy passages from book to book because these quotations capture an idea particularly well: Jefferson on liberty; Stanton on women's rights; and Gibson on cyberpunk. These projects suggest that mining quotations for links and ideas is an important mechanism for understanding the knowledge contained in books.

(This work is in collaboration with Okan Kolak, Google Research and Google Book Search.)"

http://www.hcii.cmu.edu/news/seminar/how-read-15-million-books-one-sitting-or-mining-hypertext-quotations-and-ideas-very-lar
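For readers curious how "mining repeated passages" might work in practice, here is a minimal sketch. It is not the method described in the talk, which the abstract does not spell out. It assumes a simple word n-gram ("shingle") index, and the function names `shingles` and `repeated_passages`, the window size `n=5`, and the sample texts are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def shingles(text, n=5):
    """Yield (word_position, n-gram) pairs from a text.

    Assumption: naive lowercase whitespace tokenization; a real system
    would also have to cope with OCR noise and punctuation.
    """
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield i, tuple(words[i:i + n])


def repeated_passages(docs, n=5):
    """Return the n-grams that occur in two or more documents.

    Each n-gram maps to a list of (doc_id, word_position) pairs, so a link
    can point inside a work rather than at the work as a whole.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, gram in shingles(text, n):
            index[gram].append((doc_id, pos))
    return {
        gram: locs
        for gram, locs in index.items()
        if len({doc_id for doc_id, _ in locs}) > 1
    }


# Hypothetical sample texts, not taken from any scanned book.
docs = {
    "a": "the tree of liberty must be refreshed from time to time",
    "b": "he wrote that the tree of liberty must be refreshed often",
}
shared = repeated_passages(docs)
```

Here `shared` holds every five-word window the two texts have in common, along with the word offset at which it appears in each. Overlapping windows could then be merged into longer passages, which this sketch leaves out.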

Friday, August 28, 2009

Librarians apply scrutiny to Google Books at Berkeley con; ZDNet Government, 8/27/09

Richard Koman via ZDNet Government; Librarians apply scrutiny to Google Books at Berkeley con:

"If you’re in the Bay Area and you want a full day of wonky debate, check out UC Berkeley’s Google Books Conference. It features panels on how the Google Books settlement affect data mining, privacy, information quality and public access.

The conference comes hard on the heels of the formation of the Open Book Alliance, an organization driven by the Internet Archive and including Amazon, Yahoo and Microsoft, as well as library and small publishing groups among its members. Most of the speakers are opposed to the deal but Google’s Tom [sic] Clancy will be there to make the company’s argument....

But if Google is the last library, as Berkeley linguist Geoff Nunberg says, it’s a pretty bad one. That means serious library science must be applied to the online collection before we should outsource the history of human (or at least Western) knowledge to Google:

Google Book Search is almost laughably unusable for serious research, UC Berkeley’s Nunberg said. For example, he pointed out that the Charles Dickens classic “A Tale of Two Cities” is listed in Google Book Search as having been published in 1800; Dickens was born in 1812."

http://government.zdnet.com/?p=5309