[I spoke at LITA Top Technology Trends in Dallas. I had a trend in reserve - big data - but did not use it. Here is something along the lines of what I might have said ...]

Big Data is a big trend, but as with expressions for other newly forming areas, it may evoke different things for different people.

A few years ago, academic libraries might have thought of scientific or biomedical data when they heard the expression 'big data'. In particular, the publication of The Fourth Paradigm: data-intensive scientific discovery helped crystallise awareness of developments in scientific practice.

More recently, however, big data has become a much more general term, used across various domains. Indeed, it is now common to read about big data in the general business press. One comes across it in government, in medicine, and in education. For example, a recent article in Inside Higher Ed talks about 'big data' and 'predictive analytics' in relation to course data and student retention. There are two interesting aspects of this: one, the data, and two, the management environment ...

The rise of webscale services which handle large numbers of users and transactions, and large amounts of data, has made the management of big data a more visible issue. At the same time, as more material is digital, as more business processes are automated, and as more activities shed usage data, organizations are having to cope with a greater volume and variety of relatively unstructured data. Analytics, the extraction of intelligence from usage data, has become a major activity. Here is a helpful characterization by Edd Dumbill on O'Reilly Radar.

As a catch-all term, "big data" can be pretty nebulous, in the same way that the term "cloud" covers diverse technologies. Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data, the list goes on. Are these all really the same thing? [What is big data?]

In a brief discussion on Facebook of big data as a possible trend, Leslie Johnston provided an interesting perspective on the issues from the Library of Congress.

Our collections are not just discovered by people and looked at, they are discovered by processes and analyzed using increasingly sophisticated tools in the hands of individual researchers, using just laptops. And we not only have TB/PB of digital collections, we will have billions of items, so fully manual processing/cataloging is rapidly becoming a thing of the past.

Leslie expanded on some of the actual data ...

  • 5 million newspaper pages, images with OCR, available via API, used in NSF digging into data project for data mining, combined with other collections used in new visualizations, and in an image analysis project.
  • 5 billion files of all types in a single institutional web archive - researchers do not search for and view individual archived sites, they analyze sites over time, and characterize entire corpuses, such as campaign web sites over 10 years.
  • Extreme example: over 50 billion tweets: many research requests received to do linguistic analysis, graph analysis, track geographic spread of news stories, etc.
  • Collection of 100s of thousands of electronic journal articles, which require article-level discovery: they don't all come with metadata and no one can afford to create it manually.

The remark about manual creation of metadata is one example where current processing methods do not scale. Leslie also notes:

And we cannot do manual individual file QA for mass digitization or catalog web archives or tweets without automated extraction. And when we start talking about video and audio, it all requires automated extraction or processing. I know of one request that we process a video to produce an audio-only track so that a transcript could then be automatically generated. LC has 20 PB of video and audio. Can you imagine what it would take to provide that level of service? Researchers started asking a few years ago to get files so they could do it themselves.

The Library of Congress may be a special case, but other organizations are facing similar issues. We are familiar with discussions about research data curation in university settings. Referring to the university challenge, Leslie then points to another interesting example.

I hear this from research libraries, but also from archives, especially state archives that are mandated to take in all state records, physical and electronic. Email archives are already Big Data for a lot of state archives.

Indeed, national or state institutions with responsibility for public records are reconfiguring organizations and systems to manage large volumes of e-records. My colleague Jackie Dooley pointed me at the recent Presidential Mandate on Managing Government Records which has implications for agencies and NARA.

In this context, it is not surprising that we are seeing a growing interest in data mining across domains (Leslie mentions the 'digging into data' challenge). The term 'data scientist' is cropping up in job ads and position titles. A couple of years ago, Hal Varian's comments on the importance of data and the skills required to analyse it were widely noticed.

The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that's going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it. [Hal Varian on how the web challenges managers - reg required]

It is clear from this discussion that existing systems are not well suited to manage and analyse these types of data, and this introduces the second topic, the management environment. Indeed, for Dumbill, this is the defining characteristic of big data:

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

And alternative ways have been emerging, assisted by the webscale companies who had to face these challenges early on. Google provided MapReduce, described by Edd Dumbill as follows:

The important innovation of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing the computation solves the issue of data too large to fit onto a single machine. Combine this technique with commodity Linux servers and you have a cost-effective alternative to massive computing arrays. [What is Apache Hadoop]
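
To make the divide-and-run-in-parallel idea in that description a little more concrete, here is a minimal sketch in Python. It is not Google's MapReduce or Hadoop code: it is a toy word count in which threads on one machine stand in for nodes, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key, as a framework would do between the phases."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all the values seen for one key."""
    return key, sum(values)


def word_count(lines, workers=3):
    # Divide the dataset into one chunk per simulated node.
    chunks = [lines[i::workers] for i in range(workers)]
    # Run the map step over the chunks in parallel; threads stand in for nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    # Bring the partial results together and reduce them key by key.
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = [
        "big data is a big trend",
        "data about data",
        "a trend in data",
    ]
    print(word_count(sample))
```

The point of the sketch is only the shape of the computation: each chunk is mapped independently, so the map work can be spread over as many machines as there are chunks, and only the grouped intermediate results need to be brought together for the reduce step.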

MapReduce is a central part of Hadoop, whose development was supported by Yahoo, and whose further development is now supported within the Apache Software Foundation.

Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. By large, we mean from 10-100 gigabytes and above. How is this different from what went before?
Existing enterprise data warehouses and relational databases excel at processing structured data and can store massive amounts of data, though at a cost: This requirement for structure restricts the kinds of data that can be processed, and it imposes an inertia that makes data warehouses unsuited for agile exploration of massive heterogenous data. The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference. [What is Apache Hadoop]

The availability of the Hadoop family of technologies (again, nicely described by Dumbill) and cheap commodity hardware has made processing of large amounts of data more accessible. Cloud options are also emerging, from Amazon, Microsoft and others. Uptake has been rapid.

So, while Hadoop and related technologies have emerged in the context of the Big Data requirements of webscale companies, they are becoming more widely deployed. Their scalability, coupled with lower cost, has made them an attractive option across a range of data processing tasks. They may be used with 'big data' and with not so big data.

In this way, my big data trend may more realistically be two trends. We are indeed having to process a greater volume and variety of data. The description of data management at the Library of Congress provides some nice examples. Several technologies, notably the Hadoop framework, have emerged as a result of such challenges. However, these are now also finding broader adoption as they reduce costs and provide greater flexibility.

Coda: In OCLC Research we have been using MapReduce for several years and more recently have been using Hadoop. We have also been working with colleagues elsewhere in OCLC as we look at where and how Hadoop might provide benefits.
