With the advent of big data, organizations are beginning to recognize the impact that big data and analytics can have on their ability to compete in their respective industries. In a recent study by MIT and the SAS Institute, 67% of leading organizations said they firmly believe that analytics give them a competitive advantage. This recognition has made clear that success is not only about the volume, velocity, and variety of the data at hand, but also about having the right culture, skill sets, and technologies in place, all while respecting the privacy of consumers. This post is the first in a four-part series aimed at demystifying the term ‘big data’ and touching on the opportunities, implications, and challenges it presents.
Big data is a term used to describe data sets that push the limits of the traditional technologies used to collect, manage, and process data within an acceptable amount of time. Big data does not only imply very large data sets; it also refers to the speed at which the data is generated and processed, and to the variety of sources and structures that make it complex.
One defining characteristic of big data is its sheer size and growth rate. A study by IDC estimates that by 2020 the digital universe will consist of 35 zettabytes (ZB), a 27-fold increase from 2010. This exponential growth can be attributed to several converging trends: organizations generating and collecting ever-increasing amounts of transactional and log data; the rise of the ‘Internet of Things’, which includes location-aware devices (e.g. smartphones) and sensors (e.g. RFID tags and smart meters); the growing volume of data generated in the sciences (e.g. the CERN labs generate 40 terabytes per second); and the popularity of social media sites such as Facebook and Twitter, which generate roughly 25 TB of log data and 12 TB of data per day, respectively. Add to this the high-bandwidth content (i.e. video and photos) produced by services like Instagram and YouTube, and the more than 2 billion people connected to the web, and one can describe this as a perfect storm of sorts.
Another characteristic is the rate, or velocity, at which data is generated and the corresponding need to analyze it in real time. As users navigate a website, as stock exchanges stream trading activity, or as a news event triggers millions of tweets on Twitter, data is generated quickly and often demands immediate analysis. This data could simply be stored, but the value of analyzing it drops as time passes. An IBM commercial captured this well: you wouldn’t cross a road if all you had was a five-minute-old snapshot of traffic. That is the essence of big data velocity: making sense of what you learn as fast as you learn it.
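To make the velocity point concrete, here is a minimal sketch (not any particular product's API) of analyzing a stream as it arrives: a sliding time window that counts only recent events, so the answer reflects what is happening now rather than a stale snapshot. The stream, timestamps, and five-minute window are hypothetical.

```python
from collections import deque

def rolling_count(events, window_seconds=300):
    """Count events in a sliding time window (e.g. the last five
    minutes). `events` is an iterable of (timestamp, payload) pairs
    assumed to arrive in time order."""
    window = deque()
    for ts, _payload in events:
        window.append(ts)
        # Evict events older than the window; their analytic value has decayed.
        while ts - window[0] > window_seconds:
            window.popleft()
        yield ts, len(window)

# Hypothetical stream: timestamps in seconds since the stream started.
stream = [(0, "a"), (100, "b"), (200, "c"), (400, "d")]
for ts, count in rolling_count(stream):
    print(ts, count)  # counts: 1, 2, 3, then 3 (the event at t=0 has aged out)
```

The same idea, scaled up, underlies stream-processing systems: keep a bounded window of fresh data in memory and answer queries against it continuously, instead of batch-loading everything and analyzing it later.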
Big data is generated by an enormous array of transaction logs, devices, and social media sites, all of which produce structured, semi-structured, and unstructured data. Structured data is the traditional form: it typically lives in a relational database management system (RDBMS) with pre-defined, well-understood schemas. The real complexity of big data, however, lies in the fact that the majority of it resides in distributed, external sources ranging from web logs and emails to social media exchanges and posts, videos, and audio. Most of this data is semi-structured (e.g. a website’s transaction log) or unstructured (e.g. a blog post containing free text or video), and it cannot easily be inserted into an RDBMS nor easily queried using structured query language (SQL).
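A small sketch illustrates the distinction: a web server access log is semi-structured; each line follows a pattern, but nothing enforces a schema up front the way an RDBMS table does. Extracting structured fields from it takes parsing work before the data can be queried. The log format and field names below are assumptions for illustration, loosely modeled on a common Apache-style layout.

```python
import json
import re

# Pattern for a hypothetical Apache-style access-log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_line(line):
    """Turn one semi-structured log line into a structured record
    (a dict of named fields) that could then be loaded into a
    database table, or return None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

line = '203.0.113.7 - - [10/Oct/2013:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326'
print(json.dumps(parse_line(line), indent=2))
```

Multiply this by billions of lines in hundreds of formats, plus truly unstructured content like free text and video, and the limits of a traditional schema-first approach become clear.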
Big data is all around us. The real magic lies in the ability to harvest the valuable insight it contains through big data analytics. In the next post, I will focus on the vast opportunities offered by big data that can contribute significantly to an organization’s competitiveness.