Tuesday, December 31, 2013

Big Data Content Organization, Discovery, and Management

Margie Hlava, President, Access Innovations
Military Libraries Workshop, Von Braun Center, Huntsville, Alabama
December 11, 2013

Big Data
  • Data is the new oil – we have to learn how to mine it! Qatar – European Commission Report
  • $7 trillion economic value in 7 US sectors alone
  • $90 B annually in sensitive devices
  • Land, Labor, Capital, + Data

Data Deluge – the End of Science, Wired, 16.07

            Too much data to analyze and process!

Google, eBay, LinkedIn, and Facebook are all Big Data harvesters; they were expecting Big Data from the beginning.

They don’t need to reconcile or integrate Big Data with their IT infrastructure because they were built to deal with it.

Traditional sources of data and the analytics performed upon them aren’t going away.  Big Data is the new member of the family that must be integrated.  Data scientists have to learn to work with the data and be able to analyze it.

Big Data is too much stuff to deal with in a reasonable amount of time!
Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big Data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set. – Wikipedia, May 2011 

There is a new paradigm – one of data-intensive scientific discovery

There are new special collections – more about methods than data.

  • Location aware data
  • Life streaming
  • Insurance claims
  • Hubble telescope
  • CERN Collections
  • Flight data
Unstructured data
  • Means untagged or unformatted
  • PDF
  • Word files
  • File shares
  • News feeds
  • News Data feeds
  • Images

This isn’t entirely accurate.  PDF and Word files have properties we can use: we can add a lot of metadata and give the files structure.  Most people just don’t do this.

Structured data is like XML – the tagging describes the data.
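
To illustrate how tagging describes the data, here is a minimal Python sketch using the standard library's XML parser. The record and its element names (title, creator, date, subject) are hypothetical examples, not something shown in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the element names are illustrative assumptions.
record_xml = """
<record>
  <title>Flight data summary</title>
  <creator>Example Agency</creator>
  <date>2013-12-11</date>
  <subject>Flight data</subject>
  <subject>Big Data</subject>
</record>
""".strip()


def describe(xml_text):
    """Return a dict mapping each tag name to the list of values found under it."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # The tag itself tells us what the value means.
        fields.setdefault(child.tag, []).append((child.text or "").strip())
    return fields


fields = describe(record_xml)
print(fields["subject"])
```

Because every value arrives with its tag, a program can pull out all the subjects or dates without guessing where they sit in the text.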

What are the problems?
  • Data infrastructure challenges
  • “taking diverse and heterogeneous data sets and making them more homogeneous and usable”
  • Is this a problem or an opportunity?
  • All that data – what can it tell us?
  • Privacy
  • Copyright
  • Neurological impact
  • Data collection methods

Government Initiative

The Big Data Senior Steering Group (BDSSG) was formed to identify current Big Data research and development activities across the Federal government, offer opportunities for coordination, and identify what the goal of a national initiative in this area would look like.

There is a fast-growing volume of digital data.  Do we need new technology?

Techniques for dealing with Big Data

Content organization – doesn’t matter where the data lives (machine, cloud, etc.)

Undifferentiated, unstructured – needs organization.

Type of database structure:  where are we going to put it?   Do we use a relational database or an object-oriented system?

An object-oriented system using Java or XML pulls all the descriptors into one place – the object.  Take a bottle of water as an example: the descriptors (water, bottle, plastic, origin, etc.) would all live with the object.
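
The bottle-of-water example might look like the following rough sketch. It is written in Python rather than Java to keep it short; the class name, field names, and sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class BottleOfWater:
    """Hypothetical object whose descriptors all live with the object."""
    contents: str = "water"
    container: str = "bottle"
    material: str = "plastic"
    origin: str = "unknown"
    subjects: list = field(default_factory=list)


# "Example Springs" is a made-up origin used only for this sketch.
bottle = BottleOfWater(origin="Example Springs", subjects=["beverages"])

# Everything known about the object can be read from one place.
record = asdict(bottle)
print(record)
```

In a relational design the same descriptors would typically be spread across several tables and joined at query time; here they travel with the object.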

What are Librarians doing?
  • We are using meta-search tools to integrate all these data sets.
  • We give structure to the unstructured data
  • We create the meta-data

Where do we store the meta-data?
  • With the records - in the HTML header
  • Store the meta-data in a separate file and link to it – database or SharePoint
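
The two storage options can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample page, the meta tag names, the record identifier "record-001", and the in-memory dictionary standing in for a database or SharePoint list are all hypothetical.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            self.metadata[name] = content


# Option 1: the meta-data lives with the record, in the HTML header.
page = """<html><head>
<title>Big Data notes</title>
<meta name="keywords" content="big data, metadata">
<meta name="author" content="Example Author">
</head><body><p>Record text.</p></body></html>"""

parser = MetaTagParser()
parser.feed(page)
embedded = parser.metadata

# Option 2: the meta-data lives in a separate store, linked by a record ID.
# A plain dict stands in here for a database table or a SharePoint list.
sidecar_store = {
    "record-001": {"keywords": "big data, metadata", "author": "Example Author"},
}
linked = sidecar_store["record-001"]

print(embedded)
print(linked)
```

Embedding keeps the description attached to the record wherever it goes, while a separate store makes it easier to update many records at once.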


