Libraries data mining

By | 04.01.2018


Explore practical data mining and parsing with PHP

Dig into XML and HTML data to find useful information with PHP

Eli White
Published on July 26, 2011

Data mining and its importance

Frequently used acronyms
  • API: Application programming interface
  • CDATA: Character data
  • DOM: Document Object Model
  • FTP: File Transfer Protocol
  • HTML: HyperText Markup Language
  • HTTP: Hypertext Transfer Protocol
  • REST: Representational State Transfer
  • URL: Uniform Resource Locator
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language

Wikipedia defines data mining as "the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management." This is a very deep definition and probably goes beyond the typical use case for most people. Few people work with artificial intelligence; most commonly, data mining simply entails ingesting large data sets and searching through them to find information that is useful.

Given how the Internet has grown, with so much information available, it is important to be able to aggregate large amounts of data and make some sense of it. To take datasets much larger than a single person can read and boil them down to useful data is a primary goal. This type of data mining is the focus of this article, specifically how to collect and parse this data.

Practical uses of data mining

Data mining has many practical uses. You might want to scour a website for the information that it contains (such as records of movies or concerts). You might have more serious information, such as voter records, to retrieve and make sense of. Or you might look at social network data and attempt to parse trends from it, such as how often your company is mentioned and whether it's mentioned in a positive or negative light.

Precautions before mining a website

Before you continue, I should mention that I assume you will pull this data from another website. If you already have the data at your disposal, that's a different situation. When you pull data from a website, you need to make sure that you are following its terms of service, regardless of whether you are web scraping (more on this later) or using an API. If you are scraping, you also need to heed the site's robots.txt file, which describes what parts of the website scripts may access. Finally, make sure that you are respectful of the site's bandwidth. You should not write scripts that access the site's data as fast as your script can run. Not only might you cause hosting problems, but you run the risk of being banned or blocked from the site for being too aggressive.
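As a sketch of that last point (the one-second delay and the URL list are illustrative assumptions, not a rule from any standard), you can simply pause between requests:

```php
<?php
// Hypothetical list of pages to fetch from one site.
$urls = array(
    'http://www.example.com/page1',
    'http://www.example.com/page2',
);

foreach ($urls as $url) {
    $html = file_get_contents($url); // fetch one page as a string
    // ... parse $html here ...
    sleep(1); // pause so we don't hammer the server
}
```

A fixed delay is crude; a politer crawler would also honor any Crawl-delay value the site's robots.txt declares.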

Understanding XML data structure

Regardless of the way that you pull data in, chances are that you will receive it in XML (or HTML) format. XML has become the standard language of the Internet when it comes to sharing data. It's worth taking a moment to consider XML structure and how to handle it in PHP before you look at methods to retrieve it.

The basic structure of an XML document is very straightforward, especially if you have previously worked with HTML. All data in an XML document is stored in one of two ways. The primary way is inside nested tags. For the simplest example, suppose that you have an address, which can be stored in a document such as this:

<address>1234 Main Street, Baltimore, MD</address>

You can nest these XML data points to create a list of multiple addresses. You can put all of these addresses inside another tag, in this case called <locations> (see Listing 1).

Listing 1. Multiple addresses in XML
<locations>
  <address>1234 Main Street, Baltimore, MD</address>
  <address>567 1st Street, San Jose, CA</address>
  <address>901 Washington Ave, Chicago, IL</address>
</locations>

To expand this approach further, you might want to break the addresses into their constituent parts of street, city, and state, which makes processing of the data easier. In that case, you have a more typical XML file, as in Listing 2.

Listing 2. Fully broken-down addresses in XML
<locations>
  <address>
    <street>1234 Main Street</street>
    <city>Baltimore</city>
    <state>MD</state>
  </address>
  <address>
    <street>567 1st Street</street>
    <city>San Jose</city>
    <state>CA</state>
  </address>
  <address>
    <street>901 Washington Ave</street>
    <city>Chicago</city>
    <state>IL</state>
  </address>
</locations>

As mentioned, you can store XML data in two main ways. You've now seen one of them. The other method is through attributes. Each tag can have a number of attributes assigned to it. While less common, this approach can be a very useful tool. Sometimes an attribute gives additional information, such as a unique ID or an event date. Quite often, it adds metadata; in the address example, a type attribute indicates whether the address is a home or work address, as in Listing 3.

Listing 3. Attributes added to XML
<locations>
  <address type="home">
    <street>1234 Main Street</street>
    <city>Baltimore</city>
    <state>MD</state>
  </address>
  <address type="work">
    <street>567 1st Street</street>
    <city>San Jose</city>
    <state>CA</state>
  </address>
  <address type="work">
    <street>901 Washington Ave</street>
    <city>Chicago</city>
    <state>IL</state>
  </address>
</locations>

Note that XML documents always have a parent root tag/node that all other tags/nodes are children of. XML can also include declarations and definitions at the beginning of the document and a few other complications (such as CDATA blocks). I highly recommend that you read more about XML in Related topics.
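As an illustrative sketch (the declaration and the CDATA content below are invented for this example, not taken from the article's listings), a fuller document might begin like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<locations>
  <address type="home">
    <street>1234 Main Street</street>
    <city>Baltimore</city>
    <state>MD</state>
    <!-- A CDATA block embeds text that would otherwise need escaping -->
    <note><![CDATA[Ring the bell marked "White & Co."]]></note>
  </address>
</locations>
```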

Parsing XML data in PHP

Now that you understand what XML looks like and how it's structured, you need to know how to parse and programmatically access that data inside PHP. A number of libraries created for PHP allow XML parsing, and each library has its own benefits and drawbacks. There are DOM, XMLReader/Writer, XML Parser, SimpleXML, and others. For the purposes of this article, I focus on SimpleXML as it is one of the most commonly used libraries and one of my favorites.

SimpleXML, as its name suggests, was created to provide a very simple interface for accessing XML. It takes an XML document and transforms it into an internal PHP object format. Accessing data points becomes as easy as accessing object variables. Parsing an XML document with SimpleXML is as easy as using the simplexml_load_file() function (see Listing 4).

Listing 4. Parsing a document with SimpleXML
<?php $xml = simplexml_load_file('listing4.xml'); ?>

That's really all that is required. Note that thanks to PHP's filestream integration, you can insert a filename or a URL here and the filestream integration automatically fetches it. You can also use simplexml_load_string() if you already have the XML loaded into memory. If you run this code on the XML in Listing 3 and use print_r() to see the rough structure of the data, you get the output in Listing 5.

Listing 5. Output of parsed XML
SimpleXMLElement Object
(
    [address] => Array
        (
            [0] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [type] => home
                        )
                    [street] => 1234 Main Street
                    [city] => Baltimore
                    [state] => MD
                )
            [1] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [type] => work
                        )
                    [street] => 567 1st Street
                    [city] => San Jose
                    [state] => CA
                )
            [2] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [type] => work
                        )
                    [street] => 901 Washington Ave
                    [city] => Chicago
                    [state] => IL
                )
        )
)

You can then access the data using standard PHP object access and methods. For example, to echo out every state that someone lived in, you can iterate over the addresses to do just that (see Listing 6).

Listing 6. Iterating over addresses
<?php
$xml = simplexml_load_file('listing4.xml');
foreach ($xml->address as $address) {
    echo $address->state, "<br />\n";
}
?>

Accessing the attributes is a little different. Rather than reference them as you do an object property, you access them like array values. You can change that last code sample to show the type attribute by using the code in Listing 7.

Listing 7. Adding attributes
<?php
$xml = simplexml_load_file('listing4.xml');
foreach ($xml->address as $address) {
    echo $address->state, ': ', $address['type'], "<br />\n";
}
?>

While all the current examples involved iteration, you can reach directly into the data and use a specific piece of information, such as grabbing the street of the second address with the code $xml->address[1]->street.

You should now have the basic tools to start playing with XML data. I do recommend that you read the SimpleXML documentation and other links listed in Related topics to learn more.
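To tie these pieces together, here is a short sketch using simplexml_load_string(), mentioned earlier, on an in-memory copy of Listing 3's data (the heredoc is simply that listing pasted into a string):

```php
<?php
// XML held in a string rather than a file (adapted from Listing 3)
$data = <<<XML
<locations>
  <address type="home">
    <street>1234 Main Street</street>
    <city>Baltimore</city>
    <state>MD</state>
  </address>
  <address type="work">
    <street>567 1st Street</street>
    <city>San Jose</city>
    <state>CA</state>
  </address>
</locations>
XML;

$xml = simplexml_load_string($data);

// Direct access, as described in the text: street of the second address
echo $xml->address[1]->street, "\n";   // 567 1st Street

// Attribute access uses array syntax
echo $xml->address[0]['type'], "\n";   // home
```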

Data mining in PHP: Possible ways

As mentioned, you can access data in multiple ways. The two primary methods are web scraping and API use.

Web scraping

Web scraping is the act of downloading entire web pages programmatically and extracting data from them. There are entire books written on this subject (see Related topics). Here I briefly list the tools needed to do this. First of all, PHP makes it very easy to read a web page in as a string. There are many ways to do this, including using file_get_contents() with a URL, but in this case you want to be able to parse the HTML in a meaningful manner.

Given that HTML is at its heart a language based on XML, it is useful to convert HTML into a SimpleXML structure. You can't just load an HTML page using simplexml_load_file(), however, as even valid HTML isn't XML. A good workaround is to use the DOM extension to load the HTML page as a DOM document and then convert it to SimpleXML, as in Listing 8.

Listing 8. Using DOM methods to get a SimpleXML version of a web page
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile('http://www.example.com/');
$xml = simplexml_import_dom($dom);
?>

After you've done this, you can traverse the HTML page just as you might any other XML document. Therefore you can access the title of the page using $xml->head->title or go deep into the page with references such as $xml->body->div->div->div->h4.

As you might expect from that last example, though, it can get very unwieldy at times to try to find data in the midst of an HTML page, which often isn't nearly as organized as an XML file is. The above line looks for the first h4 that exists inside of three nested divs; in each case, it looks for the first div inside each parent.

Luckily, if you want to find only the first h4 on the page, or other such "direct data," XPath is a much easier way to do so. XPath is essentially a way to search through XML documents using a query language, and SimpleXML exposes it. XPath is a very powerful tool and could be the subject of an entire series of articles, including some listed in Related topics. In basic terms, you use slashes (/) to describe hierarchical relationships; therefore, you can rewrite the preceding reference as the following XPath search (see Listing 9).

Listing 9. Using XPath directly
<?php $h4 = $xml->xpath('/html/body/div/div/div/h4'); ?>

Or you could use the double slash (//) with XPath, which causes it to search all of the document for the tags you are looking for. Therefore, you could find all the h4's as an array, then access the first one, using XPath: $h4s = $xml->xpath('//h4'); echo $h4s[0];

Walking an HTML hierarchy

The main reason to talk about these conversions and XPath is that one of the common required tasks when you do web scraping is to automatically find other links on the web page and follow them, allowing you to "walk" the website, finding out as much information as possible.

This task is made fairly trivial using XPath. Listing 10 gives you an array of all the <a> links with "href" attributes, allowing you to handle them.

Listing 10. Combining techniques to find all links on a page
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile('http://www.example.com/');
$xml = simplexml_import_dom($dom);
$links = $xml->xpath('//a[@href]');
foreach ($links as $l) {
    echo $l['href'], "<br />\n";
}
?>

Now that code finds all the links, but you might quickly start crawling the entire web if you followed every possible link you found. Therefore, it's best to enhance your code to ensure that you access only links that are valid HTML links (not FTP or JavaScript) and that go back only to the same website (either through full domain links or through relative ones).

An easier way is to iterate over the links using PHP's built-in parse_url() function, which handles a lot of the sanity checks for you, as in Listing 11.

Listing 11. A more robust site walker
<?php
$dom = new DOMDocument();
$host = 'www.example.com';
$dom->loadHTMLFile("http://{$host}/");
$xml = simplexml_import_dom($dom);
$links = $xml->xpath('//a[@href]');
foreach ($links as $l) {
    $p = parse_url($l['href']);
    if (empty($p['scheme']) || in_array($p['scheme'], array('http', 'https'))) {
        if (empty($p['host']) || ($host == $p['host'])) {
            echo $l['href'], "<br />\n";
            // Handle URL iteration here
        }
    }
}
?>

As a final note on HTML parsing, you reviewed how to use the DOM extension solely for the purpose of converting back into SimpleXML, for a unified interface to all XML-like languages. Note that the DOM library itself is a very robust one and can be used directly. If you are well versed in JavaScript and traversing a DOM document tree using tools such as getElementById() or getElementsByTagName(), then you might be comfortable staying within the DOM library and not using SimpleXML.

You should now have the tools that you need to start scraping data from web pages. Once you are familiar with the techniques detailed previously in this article, you can read any information from the web page, not just the links that you can follow. We hope that you don't need to do this task because an API or other data source exists instead.

Using XML APIs and data

At this point, you have the basic skills to access and use the majority of the XML data APIs on the Internet. They are often REST-based and therefore require only simple HTTP access to retrieve the data and parse it using the preceding techniques.

Every API is different in the end. You certainly can't cover how to access every single one, so let's walk through some basic examples of XML APIs. One of the most common sources of data, already in XML format, is the RSS feed. RSS stands for Really Simple Syndication and is a mostly standardized format for sharing frequently updated data, such as blog posts, news headlines, or podcasts. To learn more about the RSS format, see Related topics. Note that RSS is an XML file, with a parent tag called <channel> that can have any number of <item> tags in it, each providing a bevy of data points.
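A minimal sketch of that structure (the titles and URLs here are invented for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Headlines</title>
    <link>http://www.example.com/</link>
    <item>
      <title>First headline</title>
      <link>http://www.example.com/story1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://www.example.com/story2</link>
    </item>
  </channel>
</rss>
```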

As an example, use SimpleXML to read in the RSS feed of the headlines of The New York Times (see Related topics for a link to the RSS feed) and format a list of headlines with links to the stories (see Listing 12).

Listing 12. Reading The New York Times RSS feed
<ul>
<?php
$xml = simplexml_load_file('');
foreach ($xml->channel->item as $item) {
    echo "<li><a href=\"{$item->link}\">{$item->title}</a></li>";
}
?>
</ul>

Figure 1 shows the output from The New York Times feed.

Figure 1. Output from The New York Times feed


Now, let's explore an example of a more fully featured REST-based API. A good one to start with is the Flickr API because it offers lots of data without the need to authenticate. Many APIs require you to authenticate, using OAuth or other mechanisms, to act on behalf of a web user. This step might apply to the entire API or just part of it. Check the documentation of each API for how to do this.

To demonstrate using the Flickr API for a non-authenticated request, you can use its search API. As an example, search Flickr for all public photos of crossbows. While you don't need to authenticate, as you might with many APIs, you do need to generate an API key to use when accessing the data. Learn to do that task directly from Flickr's API documentation. After you have an API key, you can explore using the search feature as in Listing 13.

Listing 13. Searching for "crossbow" using the Flickr API
<?php
// Store some basic information that you need to reference
$apiurl = 'https://api.flickr.com/services/rest/?';
$key = '9f275087e222ee395c92662437bf84a2'; // Replace with your own key

// Build an array of parameters that you want to request:
$params = array(
    'method' => 'flickr.photos.search',
    'api_key' => $key,
    'text' => 'crossbow', // Our search term
    'media' => 'photos',
    'per_page' => 20 // We only want 20 results
);

// Now make the request to Flickr:
$xml = simplexml_load_file($apiurl . http_build_query($params));

// From this, iterate over the list of photos & request more info:
foreach ($xml->photos->photo as $photo) {
    // Build a new request with this photo's ID
    $params = array(
        'method' => 'flickr.photos.getInfo',
        'api_key' => $key,
        'photo_id' => (string)$photo['id']
    );
    $info = simplexml_load_file($apiurl . http_build_query($params));
    // Now $info holds a vast amount of data about the image including
    // owner, GPS, dates, description, tags, etc., all to be used.

    // Let's also request "sizes" to get all of the image URLs:
    $params = array(
        'method' => 'flickr.photos.getSizes',
        'api_key' => $key,
        'photo_id' => (string)$photo['id']
    );
    $sizes = simplexml_load_file($apiurl . http_build_query($params));
    $small = $sizes->xpath("//size[@label='Small']");

    // For now, just going to create a simple display of the image,
    // linked back to Flickr, with title, GPS info, and more shown:
    echo <<<EOHTML
<div>
    <a href="{$info->photo->urls->url[0]}">
        <img src="{$small[0]['source']}" width="{$small[0]['width']}"
            height="{$small[0]['height']}" />
    </a>
    <ul>
        <li>Title: {$info->photo->title}</li>
        <li>User: {$info->photo->owner['realname']}</li>
        <li>Date Taken: {$info->photo->dates['taken']}</li>
        <li>Location: {$info->photo->location->locality},
            {$info->photo->location->county},
            {$info->photo->location->region},
            {$info->photo->location->country}
        </li>
    </ul>
</div>
EOHTML;
}
?>

Figure 2 shows the output of the Flickr program. The results of your search for crossbows include photos plus information about each photo (title, user, location, date the photo was taken).

Figure 2. Example output of the Flickr program from Listing 13


You can see how powerful APIs like this are and how you can combine various calls in the same API to get the data that you need. With these basic techniques, you can mine the data of any website or information source.

Simply discover how you can get programmatic access to the data through an API or web scraping. Then use the methods shown to access and iterate over all the target data.

Storing and reporting on extracted data

The final point, storing and reporting on the data, is in many ways the easiest part—and perhaps the most fun. The sky is the limit here as you decide how to handle this aspect for your own situation.

Typically, you take all of the information that you gathered and store it in a database. Then structure the data in a way that matches how you plan to access it later. When doing this, don't be shy about storing more information than you think you might need. While you can always delete data, retrieving additional information can be a painful process once you have lots of it. It's better to overestimate in the beginning. After all, you never know what piece of data might turn out to be interesting.

Then, after the data is stored in a database or similar data store, you can create reports. Reporting might be as simple as running some basic SQL queries against a table to see the number of times that a piece of data exists, or it might involve very complicated web user interfaces designed to let someone dive in and find their own correlations.
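As a minimal sketch of both steps (the SQLite table and column names are invented for this example, and the records come from Listing 3's data), you might store the mined addresses and run a first report:

```php
<?php
// Minimal sketch: persist mined records (the Listing 3 addresses)
// into SQLite via PDO, then run a simple aggregate report.
$data = <<<XML
<locations>
  <address type="home">
    <street>1234 Main Street</street><city>Baltimore</city><state>MD</state>
  </address>
  <address type="work">
    <street>567 1st Street</street><city>San Jose</city><state>CA</state>
  </address>
  <address type="work">
    <street>901 Washington Ave</street><city>Chicago</city><state>IL</state>
  </address>
</locations>
XML;

$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE addresses (street TEXT, city TEXT, state TEXT, type TEXT)');

$stmt = $db->prepare('INSERT INTO addresses VALUES (?, ?, ?, ?)');
foreach (simplexml_load_string($data)->address as $a) {
    $stmt->execute(array(
        (string)$a->street, (string)$a->city,
        (string)$a->state, (string)$a['type']
    ));
}

// A first "report": how many stored addresses carry each type?
$sql = 'SELECT type, COUNT(*) AS n FROM addresses GROUP BY type ORDER BY type';
foreach ($db->query($sql) as $row) {
    echo "{$row['type']}: {$row['n']}\n"; // prints "home: 1" then "work: 2"
}
```

An in-memory database keeps the sketch self-contained; a real mining job would point PDO at a file or server so the data survives between runs.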

After you do the hard work of cataloging all the data, you can imagine creative ways to display it.


Through the course of this article, you looked at the basic structure of XML documents and an easy method to parse them in PHP using SimpleXML. You also added the ability to handle HTML in a similar manner and touched on the basics of walking a website to scrape data that isn't available in an XML format. Using these tools, and following some of the examples that have been given, you now have a good base level of knowledge so you can begin to work on data mining a website. There is much more to learn than a single article can convey. For additional ways to increase your knowledge about data mining, check the Related topics.

Downloadable resources

Related topics

  • XML as described on Wikipedia: Read a description of the XML specification.
  • Extensible Markup Language (XML) 1.0 (Fifth Edition) (W3C Recommendation, November 2008): Visit this source for specific details about XML features.
  • Introduction to XML (Doug Tidwell, developerWorks, August 2002): Look at what XML is, why it was developed, and how it shapes electronic commerce. Review a variety of important XML programming interfaces and standards, and two case studies of how companies solve business problems with XML.
  • XML Tutorial (W3Schools): Read a lesson about XML and how it can transport and store data.
  • SimpleXML documentation: Browse and learn about a tool set to convert XML to an object that you can process with normal PHP property selectors and array iterators.
  • php|architect's Guide to Web Scraping with PHP (Matthew Turland): Get more information on web scraping with a variety of technologies and frameworks.
  • XML Path Language (XPath) Version 1.0 (W3C Recommendation, November 1999): Familiarize yourself with the specification for a common syntax and semantics for functionality shared between XSLT and XPointer.
  • RSS specification: Explore the details of the RSS web content syndication format.
  • Flickr Services: Look into the Flickr API, an online photo management and sharing application.
  • PHP.net: Visit and explore the central resource for PHP developers.
  • Recommended PHP reading list (Daniel Krook and Carlos Hoyos, developerWorks, March 2006): Learn about PHP (Hypertext Preprocessor) with this reading list compiled for programmers and administrators by IBM web application developers.
  • PHP and more: Browse all the PHP content on developerWorks.
  • Zend Core for IBM: Using a database with PHP? Check out a seamless, out-of-the-box, easy-to-install PHP development and production environment that supports IBM DB2 V9.
  • RSS feed of the headlines of The New York Times: Experiment with an RSS feed of headlines from The New York Times.
  • PHP: Get this general-purpose scripting language for web development.
  • XML area on developerWorks: Find the resources you need to advance your skills in the XML arena. See the XML technical library for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks
  • Expand your PHP skills by checking out the IBM developerWorks PHP project resources.
  • IBM certification: Find out how you can become an IBM-Certified Developer.
  • IBM product evaluation versions: Get your hands on application development tools and middleware products.


Information in a library is of two kinds — there is the content, the collection, all that stuff that resides in books and journals and special collections; and there is the information about that content, the metadata: information about where things are located, how they relate to other things, how often they circulate (but, rarely, for privacy reasons, about who actually accesses and reads the content). It’s that latter kind of information, the metadata, I am interested in, as it may provide value to certain organizations, value that libraries may seek to tap.

I have been thinking about library circulation for some time, but my interest grew when I began to study PDA (patron-driven acquisition) last year, as eliminating books that don’t circulate is one of the reasons librarians are interested in PDA in the first place. PDA programs implicitly ask two questions: Why would libraries want to acquire books that don’t circulate? and, Why would publishers want to publish books that nobody reads? There is a risk with questions like this in that they can make scholarly publishing into a popularity contest. After all, if the measure of a successful scholarly book is how many copies are sold and how often that title circulates in a library, then the big trade publishers would become the model for all publishing, drowning out the specialized, intellectually serious work that is the business of a research university. But surely there is a middle ground between bestsellerdom and the totally obscure. Information about how books circulate in libraries would help publishers evaluate their lists and provide guidance for future editorial acquisitions.

Publishers do, of course, have some limited information coming back to them from the marketplace. No publisher fails to study the Amazon rankings of a title, for example (some authors do nothing else), and there are services that provide a modicum of information about the sales of books in retail outlets.  Scholarly publishers have a special problem, however, in that their titles are sold disproportionately to libraries, so the absence of circulation data affects them more seriously than it does trade houses.

You would expect to be able to go online and look all this stuff up, even if some of it resides behind a paywall.  But you can’t:  there is no place to go to get aggregate data on library circulation. So, for example, Stanford University Press published a book called “Monopolizing the Master: Henry James and the Politics of Modern Literary Scholarship,” by Michael Anesko, a book I am choosing at random (though it sounds interesting). Is it not reasonable to ask what libraries purchased a copy and how often it has circulated? You can get an answer to the first question by looking at WorldCat, but the second question is unanswerable at this time.

Individual libraries study their circulation records carefully. I have previously cited on the Kitchen a rigorous study done at Cornell, and I imagine many libraries have something of this kind in hand; librarians being librarians, one assumes that such studies get passed around informally. But there is a place for a full service, one that aggregates the circulation data, properly anonymized, of all library collections, and that can generate management reports for interested parties.

So let’s imagine a new library service called BloodFlow, which sets out to aggregate the circulation records of all the world’s libraries. The libraries themselves would have to be tagged by type (e.g., their Carnegie classifications or by using a different taxonomy) so that one could distinguish between the major ARLs, liberal arts college libraries, the libraries of community colleges — and, of course, school, public, and corporate libraries.  Circulation data from all these libraries would be uploaded to BloodFlow, which would aggregate the data in a form that allowed it to be packaged according to the needs of any particular user. For example, a librarian at the University of Michigan may contemplate whether to purchase a revised edition of a book first published by Rutgers University Press 10 years ago. What is the demand among research universities for this title?  If the circulation in the aggregate is strong, Michigan may decide to purchase the book. Or a librarian at a public library may look at the circulation records for a book that is already in print from Palgrave Macmillan. But if the records show that virtually all of the book’s circulations were at the top ARLs, the librarian may pass on that title as not a good fit with a public library’s collection.

Publishers would make different uses of this data. Should I bring a book back into print?  Let’s check the circulation records. Or, we have a submission here on Byzantine studies; how can we assess the market opportunity? Publishers would also be interested in trends: Are books in Women’s Studies circulating more or less strongly over the past decade, and how do these circulation records compare to that of collections as a whole? Or how about economics, or physics? Once you begin to study data like this, the number of new questions that arise can be mind-boggling. Mix a curious mind with a large data set and the tools to manipulate it and suddenly you find that you have given birth to a new Edison or Tesla.

One way to get this service to work would be to set up a membership organization — the BloodFlow Partnership.  Any library could join, with the following conditions: there is a membership fee, scaled by size and type of library, and the library must make all its circulation records available to the partnership.  A member would then have unrestricted access to the data, including the report-generation feature. (An interesting question is whether information about the reports requested — the meta-metadata — would be part of the service as well.) Non-members would have to pay a fee, which would once again be scaled by type and size. For whatever reason, Colby College decides not to participate, but it subscribes to the service; the price for Colby, however, is far less than that paid by Oxford University Press and Simon & Schuster.  Thus the business model is a combination of membership and toll-access publishing.  Ideally, the circulation records would be available in real-time (How many copies of “Administrative Law:  The Informal Process,” by Peter Woll and published by the University of California Press are in circulation right now?), but this may be hard to achieve technically. The more granular the data, the better, but even annual circulation figures from libraries without the technical means to publish an API to their circulation records would have some value.

There is a corollary to this argument, and that is that with more and more libraries getting into the publishing business in some way, usually with various kinds of open access services, there is an unanswered, even unasked editorial question: What is the right kind of content for a library to publish?  In my view, the best new publishing enterprises focus on new and growing content areas. A library that seeks to publish material in European history must contend with the program at OUP; a library interested in American history will have strong competition from Harvard University Press; and, most obviously, a library interested in STM journals will find such organizations as Elsevier, Springer, and Wiley Blackwell fiercely defending their turf.  But aggregate library metadata is another matter. This information is proprietary to libraries; only they have access to it, only they can publish it.  It’s a great competitive position to be in. The beautiful irony is that the paying customers for such services will in part be traditional publishers.

Joseph Esposito


Joe Esposito is a management consultant for the publishing and digital services industries. Joe focuses on organizational strategy and new business development. He is active in both the for-profit and not-for-profit areas.

View All Posts by Joseph Esposito
