Data Science Research Centre

"Data-Information-Knowledge-Application"

What is Big Data?

Big Data has been around forever, though the definition of "Big" has changed as we have become more advanced at collecting, storing, analyzing, and visualizing data. If we cannot analyze the data easily, then the data is Big Data.

A key notion is actionable data: data that is useful in supporting decisions, determining actions, and adding value to an endeavour.

What makes data difficult to analyse?

Big Data articles often refer to the 3 V's (Volume, Variety and Velocity), the 4 V's (Volume, Variety, Velocity and Veracity), and the 5 V's (Volume, Variety, Velocity, Veracity and Value).

Volume: the amount of data to be analysed keeps growing. Today petabytes of data need to be stored, searched, and analysed.

Variety: the different types of data to be integrated and analysed now include structured relational tables, unstructured text, linked data on the web, images, videos, voice recordings, sensor data, social media conversations, and more.

Velocity: the rate at which data is generated is extremely high in areas such as share trading, credit card transactions, social media commentary, and sensor networks for the Internet of Things.

Veracity: the lack of trustworthiness, or the level of noise, in data makes it difficult to determine the quality or accuracy of the data, and hence the quality and accuracy of the analysis results.

Value: how useful is the data to an organization in carrying out its business? Can the data be leveraged to add enough value that there is a net benefit after the cost and effort of the Big Data initiative?

The Future

As these challenges and opportunities affect people's day-to-day lives, companies are seeking ways to turn Big Data into services of value to the consumer. The ease of access to open linked data makes so many things possible, both for companies and for the average person in the street.

Some History of Big Data

Hollerith Cards 1890

One of the first uses of "computers" was in handling the US population census data of 1890, when a card-processing machine invented by Herman Hollerith took only a single year to tabulate the census data; the 1880 census had required 8 years of manual tabulation. The 1890 census had 30 questions, and the US population at the time was 63 million. In 1896 Hollerith founded the Tabulating Machine Company, which, through mergers, became IBM in 1924.

Economic Data 1952

Following the Great Depression, governments began to monitor economic data, such as the Gross Domestic Product (GDP), and the United Nations established a standard for National Accounts in 1952. Statistics Canada, which is responsible for collecting and reporting the national accounts in Canada, tracks 725 commodities from 300 industries and the final demand for commodities by the four sectors of the economy: Households and Non-profit Institutions serving Households, Corporations, Governments, and Non-residents.

Computers 1959 - The First Digital Data Tsunami

The digital Big Data tsunami began with the advent of the modern computer in the 1950's. Governments adopted computers for processing census and economic data, as well as for navigation, artillery, and numerical scientific computing. In 1959 IBM released the IBM 1401, an all-transistor computer with magnetic tapes for business data processing. The modern relational database system for business data processing became widespread in the 1980's. The summarization of business operational data into a data warehouse to support business decision-making, so-called business intelligence, was well established by the late 1990's. Wal-Mart's success in supply chain management is often attributed to its pioneering use of a data warehouse, which, in 1992, was the first data warehouse to reach 1 TB (1 terabyte = 1000 gigabytes) in size. Remember that in 1992 the largest hard drive capacity was 2 GB. By 2008, Wal-Mart had 2.5 PB of data in its data warehouse (1 petabyte (PB) = 1000 TB).
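
To make the idea of summarizing operational data into a warehouse concrete, here is a minimal sketch in Python using the pandas library (a tooling choice assumed here; the article names no tools), rolling invented sales transactions up into the kind of aggregate a business-intelligence report would draw on.

```python
# A minimal sketch of data-warehouse-style summarization: individual sales
# transactions (invented for illustration) rolled up into totals that support
# business decision-making.
import pandas as pd

transactions = pd.DataFrame([
    {"store": "Store A", "product": "widgets", "week": 1, "revenue": 120.0},
    {"store": "Store A", "product": "gadgets", "week": 1, "revenue":  80.0},
    {"store": "Store B", "product": "widgets", "week": 1, "revenue": 200.0},
    {"store": "Store A", "product": "widgets", "week": 2, "revenue": 150.0},
])

# Summarize: total revenue per store per week.
summary = (transactions
           .groupby(["store", "week"], as_index=False)["revenue"]
           .sum())
print(summary)
```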

World Wide Web 1990's - The Second Digital Data Tsunami

The second wave of the digital Big Data tsunami began with the introduction of the world wide web in the early 1990's, which provided convenient access to a broad range of data in HTML. In the early 2000's the introduction of the semantic web made the web machine-friendly through the use of RDF triples, the Web Ontology Language (OWL), and the SPARQL query language. The world of web services had arrived, together with its terminology of SOAP, RESTful, WSDL, and JSON. Google became synonymous with Big Data, search, and the application of both to generating large corporate profits.
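
As an illustration of the semantic-web ideas above, here is a minimal sketch using the rdflib Python library (a tooling choice assumed here, not named in the article): it stores a few RDF triples and answers a question about them with a SPARQL query.

```python
# A small RDF graph of (subject, predicate, object) triples, queried with SPARQL.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, FOAF

EX = Namespace("http://example.org/")  # hypothetical namespace for this example

g = Graph()
g.add((EX.alice, RDF.type, FOAF.Person))        # "alice is a Person"
g.add((EX.alice, FOAF.name, Literal("Alice")))  # "alice's name is Alice"
g.add((EX.alice, FOAF.knows, EX.bob))           # "alice knows bob"

# SPARQL query: find the names of all resources typed as foaf:Person.
query = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name
    WHERE {
        ?person a foaf:Person ;
                foaf:name ?name .
    }
"""
for row in g.query(query):
    print(row.name)   # prints: Alice
```

On the web of linked data, the same kind of query can be sent to a public SPARQL endpoint instead of an in-memory graph.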

Social Media 1985 - The Third Digital Data Tsunami

Social networking is producing the third wave of the Big Data tsunami. Modern wireless networks and mobile devices have accelerated the growth of users, messages, and data since the early days of AOL (1985), Geocities (1994), MySpace (2003), Facebook (2004), and Twitter (2006). Today Twitter has over 270 million users and can process 140,000 tweets per second. To cope, data analysis turned to virtual machines in the Cloud such as Amazon's EC2 (Elastic Compute Cloud) and to parallel processing using Hadoop and Giraph from Apache.
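
As a sketch of the MapReduce style of parallel processing that Hadoop popularized, the following Python example counts words across a handful of invented tweets; on a real cluster the map and reduce steps would run as separate tasks over data spread across many machines.

```python
# A minimal, single-machine simulation of MapReduce word counting over tweets.
# The tweet text is invented for illustration.
from collections import defaultdict
from itertools import chain

tweets = [
    "big data is big",
    "data drives decisions",
    "decisions need data",
]

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    """Reduce: sum all the counts emitted for a single word."""
    return (word, sum(counts))

# Shuffle: group the mapped pairs by key, as the framework would between phases.
grouped = defaultdict(list)
for word, count in chain.from_iterable(map_phase(t) for t in tweets):
    grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
for word, total in sorted(results, key=lambda pair: -pair[1]):
    print(word, total)   # e.g. data 3, big 2, decisions 2, ...
```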

Internet of Things 2000 - The Fourth Digital Data Tsunami

Now the web is evolving to include devices that can monitor, sense, decide, and act, as well as communicate through the web. This fourth wave of the Big Data tsunami is called the Internet of Things. GPS in mobile devices, which has led to location-aware apps and services, is the most obvious example to date. But we also see the impact of the Internet of Things in high-frequency financial securities trading, in cars that can park themselves and warn of impending collisions, and in household appliances that we can monitor and control from our cellphones.
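
A minimal, entirely hypothetical sketch of the sense-decide-act loop such a device runs is shown below; the thermostat reading and setpoint are invented, and reporting over the web is simulated with a print statement.

```python
# Hypothetical thermostat: sense a temperature, decide on an action, act/report.
import random
import time

TARGET_TEMP_C = 21.0   # assumed setpoint for this illustration

def sense():
    """Sense: read the current temperature (simulated here)."""
    return random.uniform(17.0, 25.0)

def decide(temp):
    """Decide: choose an action based on the reading."""
    if temp < TARGET_TEMP_C - 0.5:
        return "heat_on"
    if temp > TARGET_TEMP_C + 0.5:
        return "heat_off"
    return "hold"

def act_and_report(action, temp):
    """Act locally and report the event over the web (simulated with print)."""
    print(f"temp={temp:.1f}C action={action}")

for _ in range(5):          # a real device would loop indefinitely
    reading = sense()
    act_and_report(decide(reading), reading)
    time.sleep(1)
```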

Big Science - 1960's onwards

Data from devices can be generated in very high volumes and at great rates. Science has faced these problems since the early days of space exploration, particle physics, seismology, and weather prediction. Science, defence, and security drove the development of supercomputers from the 1960's until today. CERN acquired a CDC 6600 in 1965 to put its Big Data to good use. Today the Large Hadron Collider at CERN produces 25 PB of data per year and uses over 170 computing facilities worldwide to analyze the data.

Deep Knowledge - 2011 onwards

Biotechnology's advanced sequencers, microscopes, and MRIs bring Big Data to the life sciences, with potential benefits to healthcare, and wearable monitoring devices will also contribute healthcare data. When Big Data meets Deep Knowledge, as it does in IBM's Watson computer, the computer can not only become the champion of Jeopardy!, but also assist in the discovery of new drugs. The Watson computer consists of 90 servers with a total of 2880 processor cores and 16 TB of memory. It can process 500 GB of data per second; that is the equivalent of one million books per second (roughly 500 kilobytes of text per book).

What has become clear to business, governments, and science is the need to turn data into actionable data so that our decisions and actions are driven by the data at hand.

The entrepreneurs in Big Data seek to create value from the actionable data for their companies.

Those in the non-profit sector seek to apply actionable data for the social good, such as advancement of knowledge, improved quality of life, and more efficient governance.