- In the Library with the Lead Pipe - http://www.inthelibrarywiththeleadpipe.org -
A Useful Amplification of Records That Are Unavoidably Needed Anyway
Posted by Brett Bonfield on November 19, 2008 @ 6:00 am
Depending on books can feel like relying on snail mail. “Now that I’ve shown you how to find some articles,” I say to people at the reference desk, “I’ll show you how to use our website to find some books you might want to check out. And after that, wouldn’t it make your grandmother’s day if you wrote her a letter?”
For anyone accustomed to the Internet, books can lack the immediacy of articles or websites. Books generally have slower developing narratives, and often have longer paragraphs, sentences, and words, which means they don’t lend themselves to skimming. Compared to digital material, relevant passages can be hard to find, and even finding the right book can be challenging.
Although library websites are improving, keyword searching doesn’t work well at most libraries and faceted browsing—the links down the left side of the page on Amazon—is still a rarity. More importantly, with one notable exception, there is a good chance that nothing on the shelf that is “printed on paper and constructed on the model of the codex” includes the exact information you have in mind.
This is where universal catalogs come into play. If there’s nothing on the shelf that meets your needs, the next step is to figure out if such a book exists. There are five websites that provide relatively complete and easily accessible lists of books: Amazon, Google, LibraryThing, WorldCat, and Open Library. In order to make the best use of these websites, it can be useful to learn how each of them started, what keeps them going, and how their business models and practices affect the data they collect and how they go about sharing it.
It’s tempting to think of Amazon as a technology company. That’s how Werner Vogels sees it, which is understandable: he’s their Chief Technology Officer, and he seems to have done a very good job of it, because Amazon’s technological initiatives have taken a leap forward since Amazon hired him away from Cornell in 2004. Over the last couple of years, Amazon has made its mark as a service supplier, rewriting the rules for online hosting with its Amazon Web Services; it has developed a successful consumer electronics product (the demand for its Kindle e-book reader consistently exceeds supply, and it seems to be extraordinarily popular with publishers as well: they have made almost 200,000 titles available); and it has also made use of its infrastructure with offerings as diverse as its Mechanical Turk and Fulfillment services.
But if you look at its revenue stream, it’s pretty clear that Amazon has very little in common with a traditional technology company, such as Microsoft, its Seattle-area neighbor. Instead, Amazon is probably most like a different neighbor: Costco.
Amazon’s founder, Jeffrey Bezos, seems to have a firm grasp of three important aspects of retailing:
Similarly, Costco’s founders, James Sinegal and Jeffrey Brotman, stock their retail outlets to the rafters, refuse to mark up items more than 15%, and, in their most recent report to shareholders, they note, “This past year we also enjoyed the highest membership renewal rate in our history at 87%, attesting, we believe, to the high level of satisfaction our members have in our products and services.” Think about the things you typically shop for at Amazon: are they more like what you buy from Microsoft or are they more like what you buy from Costco?
Because of Amazon’s size, breadth, and ubiquity, it can be easy to forget that its original business model was pretty basic: it resold books it bought from Ingram and Baker & Taylor. As Tim O’Reilly points out in an apologia on Web 2.0, Amazon purchased a database of book information from R.R. Bowker, put it on the still new World Wide Web, and encouraged its customers to share reviews and bibliographies, and even to correct any mistakes or omissions in its data. Two years later, when Amazon went public, it carried more than 2.5 million titles, “including most of the estimated 1.5 million English-language books believed to be in print, more than one million out-of-print titles believed likely to be in circulation and a smaller number of CDs, videotapes and audiotapes.” Out-of-print titles were generally available within two to six months.
Amazon’s original formula hasn’t changed all that drastically. In 2007, books and other media accounted for 62% of its net sales, down from 66% in 2006 and 70% in 2005. The trend may be downward, but media sales are actually improving—it’s just that other sales are improving even faster.
Despite investments in other areas, Amazon knows that it is still primarily a retailer of books and other media, and it continues to invest in complementary initiatives and businesses that fortify its ability to sell these items. Its recent acquisitions, including Audible, Shelfari, and AbeBooks (which brings with it a 40% stake in LibraryThing), join other Amazon businesses, including the Internet Movie Database (IMDB), Alexa, and BookSurge. It also developed its own search subsidiary, A9; it was an important participant in creating ONIX, “the international standard for representing and communicating book industry product information in electronic form”; and it published a hugely successful API (now a part of its Associates program) through which it makes book jackets and summaries available to affiliates (including libraries), and also shares a percentage of sales, inspiring creative programmers to develop websites like BigBookSearch and Zoomii.
Amazon does all this so it can sell more goods and, in general, it seems to be working. Consumers are getting deeper discounts on a broader range of books and other media than ever before, and they have an easy time finding the items they want thanks to Amazon’s faceted browsing interface, its active user community, and its search engine which, in many cases, makes it easy to search within the text of published items.
While Amazon does everything it can to provide you with as much information as possible about the items it has in stock, there’s no motivation for it to share information about items it can’t sell in volume, such as out-of-print material. If the information you’re seeking is likely to be included in new, commercially available books, then Amazon is an excellent resource. If not, you’re best served looking elsewhere.
Amazon is one of two major corporate alternatives to libraries; Google is the other.
Amazon followed one of the two traditional paths for forming a giant corporation: it was founded by an entrepreneur who had a good idea for a company and then hired talented people to build its technological infrastructure. Google followed the other path: its founders, Larry Page and Sergey Brin, created something the world wanted and then hired people to turn their idea into a profitable corporation.
While still graduate students at Stanford, Page and Brin took Eugene Garfield’s work on citation indexing and adapted it for the World Wide Web. Garfield, who marketed information products through his company, the Institute for Scientific Information (now a Thomson Reuters subsidiary), records how often scholarly papers are cited by subsequent scholarly papers, which is useful because citation frequency is a reasonable proxy for importance. Similarly, Google’s PageRank algorithm is primarily a scheme for measuring and weighting links between Web pages: the more links to a page or website, the more likely it is to be important, especially if those links come from other important sites. PageRank is intended to determine which Web pages are likely to be perceived by Google’s users as relevant.
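The link-weighting idea described above can be sketched in a few lines of Python. This is only a minimal power-iteration sketch of the general PageRank idea, not Google’s production algorithm; the toy link graph, the damping factor of 0.85, the iteration count, and the function name pagerank are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages in a link graph {page: [pages it links to]} by power iteration."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "a" and "b" both link to "c"; "c" and "d" link to each other.
toy_web = {"a": ["c"], "b": ["c"], "c": ["d"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(page, round(score, 3))
```

In this toy graph, "c" collects the most links and ends up with the highest score, and "d" outranks "a" and "b" because its single inbound link comes from the highest-scoring page.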
It was soon apparent that Google worked—users found what they were looking for—but no one saw any money in it. Page and Brin tried to sell their technology for $1 million to the big players in the Web market. After everyone turned them down, they decided to start their own company, focusing their attention on attracting as many users as possible.
Where Amazon is a retailer that can be thought of as a virtual Costco, Google is an entertainment company like News Corp or Viacom: it generates 99% of its revenue from advertisements. Just as Amazon is primarily a reseller of products others make, Google is primarily a portal into content others create. Its mission is to “organize the world’s information and make it universally accessible and useful.” Note the absence of the word “Web” in that mission statement: Google’s goal is to organize every bit of information. For instance, Google created its free telephone directory assistance project, GOOG-411, in order to develop speech recognition software. In turning spoken words into text, Google opens up the possibility of searching audio and video files through the same Google search box that is currently used to search websites.
Though the Web has become many people’s primary information source, a great deal of the world’s information is still found in books. In order to harvest that data, in December 2004, Google announced that five libraries (the University of Michigan, Harvard, Stanford, Oxford, and the New York Public Library) had agreed to let Google begin scanning their collections; several more have since joined the project. Multiple elements of this arrangement remained secret, including the terms of these agreements and the rate at which books were being scanned. It was also unclear how Google would deal with potential copyright issues, especially after the Association of American Publishers and the Authors Guild almost immediately filed a joint lawsuit.
This copyright lawsuit mirrors another: Viacom’s suit against Google acquisition YouTube for copyright infringement. There was some speculation that Google bought YouTube specifically to make sure YouTube didn’t lose its lawsuit, establishing a precedent that Google would have to overcome if it were ever sued for hosting video files. When Google reached a settlement in its book scanning lawsuit this past October, Viacom saw a potential concession in its own suit.
The book-scanning settlement has raised concerns about preservation and access for Google-scanned materials. Harvard has expressed its reservations publicly, and Peter Brantley has been doing an extraordinarily good job of identifying and summarizing the issues involved. How all this will affect people who want to read books online has yet to be determined.
What does seem settled, at least for now, is that Google has archived an unparalleled number of books (and also scholarly articles) whose entire text could be as easy to search as the Web. With the success of GOOG-411, it seems likely that Google will soon be able to offer text-based searching within audio and video files as well.
What’s not clear is whether advertising will make these ventures profitable or if Google can successfully transition to alternative business models for subsets of its data. Right now, it resells access to scholarly articles and newspaper stories for several publishers, and it appears that it will soon be selling access to the books it has digitally archived. It’s also not clear if Google sees any point in developing an active user community around books. While Google allows users to add reviews at its book website, user-contributed content is not a focus in the same way it is at Amazon or at LibraryThing.
Founder Tim Spalding’s LibraryThing is a new kind of Internet-enabled organization, the small company that operates on a large scale. This method for doing business has been best documented by programmer, essayist, and venture capitalist Paul Graham, one of Spalding’s inspirations, though LibraryThing probably resembles Craigslist more than it resembles any of the Y Combinator companies Graham has helped to shepherd into existence.
Like Craigslist, LibraryThing has an evangelical faith in its users, maintains a simple and easy to understand interface, is satisfied with steady and modest profitability, and competes for attention in a field with significantly larger entities (Craigslist is often cited as a cause of the newspaper industry’s financial difficulties, even though it employs fewer than 30 people).
LibraryThing gets its data from Amazon, from libraries that make their catalogs available through the Z39.50 protocol, and from its users, who supplement the data by providing reviews, cataloging information, adding tags, and disambiguating records. These last two seem to be particularly successful even though they vary from standard library practice.
The tagging concept, popularized by Joshua Schachter’s group bookmarking website, del.icio.us, allows users to catalog items using whatever keywords they wish. This enables works like Bridget Jones’s Diary to be tagged “chicklit” or Neuromancer to be tagged “cyberpunk,” subject terms that differ greatly from Library of Congress designations for these works by Fielding and Gibson.
Disambiguation allows users to clarify records by taking actions such as combining entries for works that are identical but released under different titles, or aggregating work under a single author heading even though that person has released work under multiple names. These tasks can be difficult when a small group of staff members attempts to take them on manually, and it has proved tricky to teach computers to disambiguate records programmatically. For instance, author Cyril Northcote Parkinson’s name is subject to multiple permutations (C.N., Cyril N., C. Northcote, etc.), and his most famous work, Parkinson’s Law (which expands on his belief that “work expands so as to fill the time available for its completion”), has been released with multiple title variations and in numerous editions. Amazon struggles to make it clear which edition of Parkinson’s Law a potential customer might wish to purchase, and Google offers a few different options that are not readily distinguishable from one another. LibraryThing, while representing more options than either of the other two, also makes it clear which title its users believe should be considered definitive.
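To see why programmatic disambiguation is tricky, consider a naive sketch in Python that groups author names by surname and first initial. The helper names author_key and group_by_author, the assumption that names arrive in “Forenames Surname” order, and the “Charles Parkinson” entry are all hypothetical; this is not how LibraryThing, Amazon, or Google handle the problem.

```python
import re
from collections import defaultdict


def author_key(name):
    """Reduce a 'Forenames Surname' string to a (surname, first initial) key."""
    # Split on whitespace and periods so that "C.N." yields ["C", "N"].
    parts = [part for part in re.split(r"[\s.]+", name.strip()) if part]
    surname = parts[-1].lower()
    first_initial = parts[0][0].lower()
    return (surname, first_initial)


def group_by_author(names):
    """Group name strings that share the same naive key."""
    groups = defaultdict(list)
    for name in names:
        groups[author_key(name)].append(name)
    return dict(groups)


variants = [
    "C.N. Parkinson",
    "Cyril N. Parkinson",
    "C. Northcote Parkinson",
    "Cyril Northcote Parkinson",
    "Charles Parkinson",  # hypothetical different author, added to show a false match
]
groups = group_by_author(variants)
print(groups)
```

The heuristic correctly gathers the four Parkinson variants under one key, but it also sweeps in the hypothetical Charles Parkinson. Tightening the rule would start splitting the genuine variants apart again, which is the kind of judgment call that a devoted user community can make case by case.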
It’s worth noting that LibraryThing is not operating on a different scale from Amazon and Google when it comes to the number of books it is cataloging. LibraryThing, which launched on August 29, 2005, has catalog entries for over 32 million books. While open cataloging has its limitations, LibraryThing’s website regularly demonstrates the power of crowdsourcing big tasks to a large, devoted community.
That community is the key to LibraryThing’s success. Just as del.icio.us users socialize around shared bookmarks and tags, LibraryThing users socialize around the books in their collections. Users can add 200 books for free, but to add more they have to pay either $10 per year or spend $25 for a lifetime membership.
That’s one way LibraryThing makes money. The other is LibraryThing for Libraries, a service that allows libraries to integrate LibraryThing’s tag database and, as of September 2008, its user reviews, into participating libraries’ websites. This service is offered on a sliding scale, with the smallest libraries paying $1,000 per year.
While Amazon’s business model does not target libraries in any discernible way (either as customers or competitors), and Google appears to be interested only in the largest libraries as partners, LibraryThing seems to be actively interested in selling its services to pretty much every kind of library—dozens have already signed up for LibraryThing for Libraries—and in digesting Z39.50 feeds (or getting records in other formats) from any library willing to share. In a pinch, it appears that LibraryThing will even take care of your cataloging.
OCLC is a nonprofit consortium that includes almost 70,000 libraries as members. It was founded in 1967 as the Ohio College Library Center. In 1977, it began allowing libraries outside Ohio to become members, and in 1981 it changed its name to the Online Computer Library Center. It has made multiple acquisitions as it has grown, including the Dewey Decimal Classification System and its only competitor, the Research Libraries Group, which operated from 1974 until 2006. This sort of activity, and OCLC’s business model, led to its nonprofit status being investigated, though the status was ultimately recognized. Understandably, OCLC uses its tax status to its advantage, just as some nonprofit hospitals take advantage of their status and IKEA makes use of its unusual structure.
OCLC’s most widely visible product is an amazingly good website, WorldCat.org, which provides free access to over 110 million library catalog records, most of which are for books: member libraries provide access to their entire collection, which includes articles, audio, and video. Right now, WorldCat.org is the best free website that lets visitors use keywords to conduct serious research across all media types, a feature which all on its own would make it valuable. On top of that, OCLC has integrated its work on FRBR and xISBN—projects that make it easier to find what you’re looking for—helping to turn WorldCat.org into an invaluable resource.
One of the two major problems with WorldCat.org is what it doesn’t include: the long tail of library records. With 70,000 libraries contributing records, it’s tempting to assume that just about every book is included in the WorldCat.org database, but that’s probably far from true. OCLC’s Karen Calhoun has written about its efforts to position its pricing and services so smaller libraries can participate, and OCLC is making inroads, but it still serves far fewer than half of the smaller libraries in the United States. This won’t affect most of the popular material—big libraries have just about every major work held by a smaller library, so the small libraries’ records are redundant in these instances—but it does mean that more obscure works collected by smaller libraries, representing local authors and regional historical resources, may not be included.
This sort of limitation affects everyone from amateur genealogists to academic researchers. For instance, I have a friend who is writing her doctoral thesis on the history of illness in the counties surrounding Philadelphia. Almost none of the libraries, archives, and historical societies she is relying on have shared their catalogs with OCLC. This means she must make use of each of these collections individually, usually in person, and spend time learning how each collection is organized. This is the research equivalent of using a manual typewriter instead of a MacBook Pro to type her dissertation, and represents a failure to make the best possible use of available technology. These collections’ records should be included in WorldCat.org.
This kind of wasted opportunity to assist researchers is one major disadvantage of WorldCat.org’s omission of smaller libraries’ holdings. The other major problem arises when researchers try to make use of one of WorldCat.org’s signature features. When users search for an item in WorldCat.org, they can select a tab labeled “Libraries,” which takes them to a list of local libraries that have that item in their collection. However, only libraries that share their records with OCLC are listed. For example, search for Daemon: a novel by Leinad Zeraus and select the Libraries tab. WorldCat.org displays ten libraries where you can find this book, ordered from nearest to farthest. It would be natural for WorldCat.org visitors to infer that these are the ten closest libraries that have this book. Unfortunately, that’s probably not the case. Instead, WorldCat.org is displaying the ten closest libraries that share their records with WorldCat. Users who believe that WorldCat.org is helping them search their nearby libraries may be led to believe that their local libraries don’t have any books at all or, at least, none of the books they’re hoping to find.
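The gap between what the Libraries tab shows and what a visitor assumes it shows can be illustrated with a short Python sketch. The library names, distances, and function names here are invented for illustration, and the filtering logic is only an assumption about the behavior described above, not WorldCat.org’s actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Library:
    name: str
    miles_away: float
    holds_item: bool
    shares_records: bool  # whether the catalog is contributed to the shared database


def libraries_tab(libraries, limit=10):
    """Nearest holders among libraries whose records are in the shared database."""
    visible = [lib for lib in libraries if lib.holds_item and lib.shares_records]
    return sorted(visible, key=lambda lib: lib.miles_away)[:limit]


def truly_nearest(libraries, limit=10):
    """Nearest holders regardless of whether they contribute records."""
    holders = [lib for lib in libraries if lib.holds_item]
    return sorted(holders, key=lambda lib: lib.miles_away)[:limit]


# Hypothetical libraries around a hypothetical searcher.
nearby = [
    Library("Town Library", 1.0, holds_item=True, shares_records=False),
    Library("University Library", 12.0, holds_item=True, shares_records=True),
    Library("County Library", 30.0, holds_item=True, shares_records=True),
]

shown = [lib.name for lib in libraries_tab(nearby)]
actual = [lib.name for lib in truly_nearest(nearby)]
print("Shown:", shown)
print("Actual:", actual)
```

In this made-up example the searcher is sent twelve miles away for a book that is sitting on a shelf one mile from home, simply because the town library’s records were never contributed.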
Of course, it’s possible that some libraries may not want their records included in WorldCat.org. I’m not sure why they would feel that way, aside from the recent hullabaloo over licensing which appears to be getting increasingly heated. However, the library where I work very much wants its records in WorldCat.org so that our neighbors in town can use it as an alternative way of looking for the books that are available in their local library.
OCLC markets WorldCat and other services through a network of regional service providers. The provider for our area is PALINET, so if we want to get our records into WorldCat, we have to go through PALINET. Unfortunately, between OCLC and PALINET, a sort of “if you have to ask, you can’t afford it” pricing structure seems to have emerged for getting records included in WorldCat.org.
I don’t think this is anyone’s fault. Everyone I’ve met at OCLC and PALINET is smart, dedicated, and helpful. My guess is that it’s more like Kate Sheehan’s post office story in which her attempt to pick up a package left her feeling “broken or inept.” That’s certainly how I felt after spending a month exchanging emails with PALINET. At the end I was so confused that it just didn’t seem worth bothering to get an accurate price to take to my board, because the one thing about which I was relatively certain was that we didn’t have enough money to share our records on the WorldCat.org website.
The folks at OCLC seem to be working hard to remedy this situation. I have faith that they’ll get there. But until they do, there will probably be a lot of libraries that would like to share their records in WorldCat.org and either can’t afford it or can’t figure out if they can. That means researchers are going to have to keep working harder than necessary, WorldCat.org users will keep being misled by its Libraries tab, and frustrated libraries may find themselves looking for more accommodating partners.
Along with OCLC’s WorldCat.org, Open Library is one of two major nonprofit initiatives centered on creating a universal book catalog: its goal is to create a page for every book ever published and to enable those pages to be updated by users, just as LibraryThing or Wikipedia pages are edited by site visitors. Since its founding in July 2007, it has added over 30 million records to its book database.
For now, Open Library may be best known for its founder, Brewster Kahle, and its technical lead, Aaron Swartz. Both are Internet celebrities and serial entrepreneurs, though both specialize in nonprofit startups. Kahle has sold companies to AOL and Amazon, but he is best known for his work on the Internet Archive, home of the Wayback Machine, which attempts to archive the entire Web. Swartz was a founder of Reddit, which was sold to Condé Nast, and a developer of RSS, which enables websites, most notably blogs, to deliver content directly to readers. Open Library is currently funded by the Internet Archive and the California State Library and is committed to remaining entirely free, right down to the code that runs the site, which it makes available through an open source license.
Unlike our experience with OCLC, sharing our records in Open Library was dead simple: I emailed Aaron Swartz and he replied that receiving our records “was cause for much rejoicing.” (I also emailed Tim Spalding at LibraryThing to see if he might be interested in our records, and I found out he was as well.) Open Library is actively soliciting these contributions from libraries. However, it could, potentially, get these records directly from library websites. The technology involved is pretty simple and fairly well understood.
For example, the library where I work recently introduced a new website that’s powered by Casey Bisson’s fantastic Scriblio project. To import the Collingswood Library’s old records into our new website, we had Scriblio visit the web page for each record in the old catalog and import its data into the Scriblio database, turning blah into beautiful. We also use scrib_availability to show website visitors if the book is on the shelf.
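The general technique of reading a catalog record off a web page can be sketched with Python’s standard library. This is not Scriblio’s code, and the sample markup, the class names “title” and “author,” and the names RecordParser and parse_record are all hypothetical; a real catalog page would need its own field mapping.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for one record page in an old web catalog.
SAMPLE_RECORD_PAGE = """
<html><body>
  <h1 class="title">Parkinson's Law</h1>
  <span class="author">C. Northcote Parkinson</span>
</body></html>
"""


class RecordParser(HTMLParser):
    """Collect the text of elements whose class names one of the wanted fields."""

    FIELDS = {"title", "author"}

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current_field = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.FIELDS:
            self._current_field = css_class

    def handle_data(self, data):
        # Store non-empty text found inside a wanted element.
        if self._current_field and data.strip():
            self.record[self._current_field] = data.strip()

    def handle_endtag(self, tag):
        # Stop collecting once any element closes.
        self._current_field = None


def parse_record(html):
    parser = RecordParser()
    parser.feed(html)
    return parser.record


record = parse_record(SAMPLE_RECORD_PAGE)
print(record)
```

A real import would fetch each record page over the network and loop over every record in the old catalog; the sketch parses an inline string so that it stays self-contained.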
Open Library clearly has the technical knowledge to do something like this and, because just about every library has a web-based catalog, it could easily include every book from pretty much every library in its database, enabling site visitors to learn if their local library has the book they want. For now, Open Library’s book pages, LibraryThing’s book records, and Google’s About this book pages link to WorldCat.org. (Edit: I originally wrote that Google’s About this book pages did not link to WorldCat.org. In the future, I’ll try to remember to disable my Firefox extensions before making such claims.)
The issue isn’t technical; it’s legal and ethical. On behalf of the library where I work, I uploaded our records to archive.org, making it possible for Open Library to use them, and on behalf of my library I uploaded them into our Scriblio-based website. It seems unlikely that libraries will have their records aggregated without their permission, at least in the near future. However, it wouldn’t be surprising if Kahle or Swartz, instead of asking for our records, began asking for our permission: what if they came to us and asked if they could automatically index our catalogs, creating for free a service that costs libraries thousands of dollars through OCLC? Even non-OCLC libraries are used to sharing their records. Why wouldn’t they accept Open Library’s offer to create a universal catalog? For most libraries, there’s no downside, but there’s an enormous upside: a single website where the world could see their records, and a free hub they could use for sharing records with each other.
In his 1992 Redesigning Library Services: A Manifesto, Michael Buckland writes that, “(f)rom an operational perspective the library catalog can be seen as a useful amplification of records that are unavoidably needed anyway. The information in a catalog can be useful in a variety of ways to library staff and library users. The difference between modern library catalogs and those before the late nineteenth century is essentially that the modern catalogs have a much larger bibliographical superstructure added to the locational information than had previously been the case.” In a nutshell, Buckland is saying, libraries decided that, since they had to keep a list of what they owned, they might as well describe each item and make sure they knew exactly where copies of it could be found. “With materials on paper, having copies stored locally is a necessary (though not a sufficient) condition for convenient access. With electronic materials, local storage may be desirable but is no longer necessary…. The answer is to shift from catalogs to union catalogs or linked catalogs…. Arguably the present day catalog… is more a product of the limitations of nineteenth century library technology than of present day opportunities.”
Between Amazon, Google, LibraryThing, WorldCat, and Open Library, we’re getting ever closer to setting aside nineteenth century models and to more fully taking advantage of present day opportunities. There is no technological reason preventing us from building a universal catalog that contains information on every book in existence and locates that book in every library that has a copy available for use.
We’re also closing in on having a digital scan of every book, making full-text searching possible, as well as concurrent, remote use of scarce resources (by which I mean, I can look at the text of a book on my screen while you’re looking at it on yours, a feature not available in a paper-based book, which is limited to being used in a single location and, generally, by a single user). It’s an exciting time to be a booklover, and it gives one hope that, with better resources available, books will begin to seem as accessible and vital as born-digital resources.
I like the alternatives that Amazon, Google, LibraryThing, WorldCat, and Open Library make available. I think each has made the other better, and I like having alternatives in researching books just as I like having FedEx, UPS, DHL, and the United States Postal Service available when I’m trying to send a package. I don’t think researchers are generally lazy, and I don’t think they want fewer options. What they want are a few really good choices, and they have them. It’s exciting for all of us that these good choices seem intent on becoming great ones.
Thanks to Tim Spalding and Aaron Swartz for reading an early draft of this article, and to my ItLwtLP colleague, Hilary Davis, for helping me with its final version.
Article printed from In the Library with the Lead Pipe: http://www.inthelibrarywiththeleadpipe.org
URL to article: http://www.inthelibrarywiththeleadpipe.org/2008/a-useful-amplification-of-records-that-are-unavoidably-needed-anyway/
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.