Introduction to Big Data

“Data science,” “data mining,” and “big data,” much like the word sabermetrics, are recent buzzwords that are becoming more widely used. Their growing use mainly reflects the information age we live in: the amount of data people generate these days is staggering. A few tidbits about big data today:

  • In August 2012, Facebook engineers reported that the site processes 500+ terabytes (1 TB ≈ 1,000 gigabytes) of new data per day and takes in 300 million photo uploads per day (TechCrunch article, accessed April 2014)
  • In October 2012, the Internet Archive reached 10 petabytes (1 PB ≈ 1,000 TB) of data stored in a massive collection of music, newspapers, books, internet sites, etc. (accessed April 2014)
  • In February 2014, IBM said that 2.3 exabytes (1 EB ≈ 1,000 PB) of data are created every day (climate sensor data, social media, digital photos/sites, cell phones, etc.), and that 90% of the data in the world today has been created in the last two years (Webopedia entry, accessed April 2014)
  • In June 2011, the International Data Corporation published a white paper (PDF) stating “in 2011, the amount of information created and replicated will surpass 1.8 zettabytes (1.8 trillion gigabytes) – growing by a factor of 9 in just five years.” (accessed April 2014)
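
The unit ladder behind these figures (each step up is roughly a factor of 1,000) can be checked with a few lines of Python. This is only a minimal sketch and is not taken from any of the cited sources: the names `UNITS`, `to_bytes`, and `convert` are made up for illustration, the units are assumed to be decimal (exactly 1,000 per step), and the IDC "factor of 9 in five years" is assumed to compound evenly from year to year.

```python
# Decimal (SI) storage units; each step up is assumed to be exactly x1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def to_bytes(value, unit):
    """Return the number of bytes in `value` of the given unit."""
    return value * 1000 ** UNITS.index(unit)


def convert(value, from_unit, to_unit):
    """Convert `value` from one unit to another."""
    return to_bytes(value, from_unit) / to_bytes(1, to_unit)


# IDC figure: 1.8 ZB expressed in gigabytes (about 1.8 trillion).
zb_in_gb = convert(1.8, "ZB", "GB")

# How many days at Facebook's reported 500 TB/day would add up to the
# Internet Archive's 10 PB (an illustrative comparison only).
days_to_archive = convert(10, "PB", "TB") / 500

# Average yearly growth factor implied by 9x growth over five years,
# assuming the growth compounds evenly across the years.
annual_growth = 9 ** (1 / 5)

print(f"1.8 ZB = {zb_in_gb:.3g} GB")
print(f"10 PB at 500 TB/day = {days_to_archive:.0f} days")
print(f"Implied yearly growth factor = {annual_growth:.2f}")
```

Under these assumptions, growth by a factor of 9 over five years works out to a factor of about 1.55 per year, or roughly 55% annual growth.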

That is a lot of data, and these are only a small sample of the examples we could discuss. Big data is here, and it is here to stay. Data science is an emerging field of study that is attracting more and more students. As we did earlier with the word sabermetrics, we should define our terms:

Data analytics is the use of data from traditional and digital sources for discovery and analysis (hat tip to Lisa Arthur of Forbes.com and her article “What is Big Data?”). This broad definition encompasses all of analytics, so what is the defining difference with “big data”? The McKinsey Global Institute, the research arm of the global consulting firm McKinsey & Company, starts with the following:

“‘Big data’ refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” (“Big Data: The Next Frontier,” accessed April 2014)

And the online information technology dictionary, Webopedia.com, defines “Big Data” this way:

Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too big or it moves too fast or it exceeds current processing capacity. While the term may seem to reference the volume of data, that isn't always the case. (Accessed April 2014)

There is good news for students who wish to focus on the problems of big data: in this informative infographic, IBM claims that 1.9 million jobs will be created in the United States to support big data (4.4 million globally).