Big Data


In the 80s, a bunch of smart guys at IBM worked on transactional mainframe systems that needed to process millions of transactions per day. Think telephone, banking and airline companies. They wondered what to do with the ‘stale data’ (e.g. flight GGGG, from origin X to destination Y, carrying passenger Z, in class M, on date DD/DD/DD, at hour HH:HH, etc.) that was no longer in use. Most people discard old electricity bills once they are paid: it is transactional information, and it can be thrown away. The same principle applies to large organizations, where the transactions run into millions per day.

So the smart guys stored this stale data in a separate system, and once it was stored, they analyzed it, looking for trends, patterns and evolutions over time. This brought insights: for instance, which flights were the most profitable, or how and when passengers bought tickets (which made it possible to raise prices closer to the departure date). The term ‘data warehousing’ was born. It came from storing data in a warehouse and then analyzing it to derive business value. A fun, often-told example comes from retail where, based on statistical analysis, retailers decided to place diapers next to beer: new fathers came to buy their first diapers and bought beer to celebrate.

Fast forward to the 90s. Remember the Millennium bug? Even if the world did not stop turning, most of the home-grown transactional systems were unable to handle “00” as a date for 2000. These systems needed upgrading, or replacing with a new Millennium bug-proof system, called an ERP (enterprise resource planning) system. Those were the golden days of SAP, Oracle, Siebel and PeopleSoft, all vendors of ERP systems, and in Oracle's case, also of the underlying database. But the ERP growth is only partially explained by the Millennium bug panic. Companies realized their home-grown systems for tracking sales, HR, manufacturing or finance transactions were black boxes, and the people who designed them were disappearing due to attrition or retirement. Companies thus mitigated the risk by massively replacing black box systems with standardized ERP systems, for which knowledge was abundantly available.

But ERP systems only offered primitive reporting capabilities, and when a business analyst queried the data across many variables (product, geography, sales rep, revenue, cost of sales, date), the ERP system took many hours or even days to complete the query. Database administrators also grew more nervous as these analytical queries became more complex, since they ran on the same system that handled the daily transactions. So the engineers created a separate database structure, called a ‘star schema’, to handle complex analytical queries. These star schemas took data out of ERP systems via ETL (extract, transform, load) and provided a fast, reliable system that could handle increasingly complex queries which derive intelligence from data. Hence the name ‘business intelligence’ was born, driven by the need to analyze ‘structured data’.
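To make the star schema idea a bit more concrete, here is a minimal sketch in Python, using the built-in sqlite3 module as a stand-in for a real warehouse. Everything in it is an assumption made for illustration: the table names (fact_sales, dim_product, dim_region), the sample rows and the helper functions are hypothetical and do not come from any particular ERP or BI product. It shows a central fact table surrounded by small dimension tables, a tiny ETL step that fills them, and one analytical query.

```python
import sqlite3

# Hypothetical rows as they might be extracted from an ERP sales table:
# (product, region, sale_date, revenue, cost_of_sales)
ERP_ROWS = [
    ("Widget", "Europe", "2001-03-01", 120.0, 70.0),
    ("Widget", "Asia", "2001-03-02", 80.0, 50.0),
    ("Gadget", "Europe", "2001-03-02", 200.0, 90.0),
    ("Gadget", "Europe", "2001-03-05", 100.0, 60.0),
]


def create_star_schema(conn):
    """Create one fact table in the middle and two dimension tables around it."""
    conn.executescript(
        """
        CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE dim_region  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE fact_sales (
            product_id    INTEGER REFERENCES dim_product(id),
            region_id     INTEGER REFERENCES dim_region(id),
            sale_date     TEXT,
            revenue       REAL,
            cost_of_sales REAL
        );
        """
    )


def dimension_key(conn, table, name):
    """Transform step: map a name to its dimension key, adding it if new."""
    # The table name is interpolated only from the fixed names used below.
    row = conn.execute(f"SELECT id FROM {table} WHERE name = ?", (name,)).fetchone()
    if row:
        return row[0]
    return conn.execute(f"INSERT INTO {table} (name) VALUES (?)", (name,)).lastrowid


def load(conn, rows):
    """Load step: write one fact row per extracted ERP row."""
    for product, region, sale_date, revenue, cost in rows:
        conn.execute(
            "INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
            (
                dimension_key(conn, "dim_product", product),
                dimension_key(conn, "dim_region", region),
                sale_date,
                revenue,
                cost,
            ),
        )
    conn.commit()


def margin_by_product_and_region(conn):
    """Analytical query: join the fact table to its dimensions and aggregate."""
    return conn.execute(
        """
        SELECT p.name, r.name, SUM(f.revenue - f.cost_of_sales)
        FROM fact_sales f
        JOIN dim_product p ON p.id = f.product_id
        JOIN dim_region  r ON r.id = f.region_id
        GROUP BY p.name, r.name
        ORDER BY p.name, r.name
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
create_star_schema(conn)
load(conn, ERP_ROWS)
result = margin_by_product_and_region(conn)
print(result)
```

The point of the design is that the analytical query never touches the transactional system: it runs against a copy of the data that has been reshaped for slicing by product, region or date.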

About ten years later, the internet matured, social media took the world by storm and ‘unstructured data’ like videos, pictures, games, CAD designs (buildings, cities), CGI data (Avatar), geospatial data, MRI scans and tweets exploded. Here it was companies like Yahoo, Google and Facebook who faced challenges in storing and retrieving this unstructured data. Existing tools and relational database structures did not suffice, so Yahoo, Google and Facebook created new tools like HDFS, Hadoop and MapReduce to manage this flood of unstructured data. And soon another bright business analyst started querying this unstructured data, hoping to derive insights and value from trends and evolutions. To do this, more tools, like Pig, Hive and ZooKeeper, were created, and big data analytics became a reality. Today there are many solutions on the market that can combine structured and unstructured data into analytical engines that, when used by the right people, can bring business value to organizations.
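For readers who have never seen MapReduce, the core idea can be sketched in a few lines of plain Python. This is only a single-machine illustration, not how Hadoop is actually used: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample tweets are hypothetical, and a real cluster would spread each phase over many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


tweets = ["big data is big", "data is everywhere"]
counts = word_count(tweets)
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can in principle be run in parallel, which is what made the approach attractive for very large volumes of data.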

In summary, data warehousing was the first wave of deriving value from massive volumes of stored transactional data. The next wave, business intelligence, piggybacked on the ERP trend, analyzing structured data. And finally big data, the third wave, which gains insights mainly from unstructured data (or a combination of both), exploded into existence through the wildly expanding universe of data. I am curious to see what the fourth wave will be because, like concentric circles, data and the need to derive intelligence from it just keep increasing exponentially by the decade.