Last month I attended Gartner’s Data & Analytics Summit in Sydney. I would like to share some of the learnings and my own thoughts from the two days of wide-ranging presentations on all things data-related.
1. The “Zone of Confusion”: Lakes, Warehouses and Hubs
A common theme that came out from the discussions was the amount of confusion surrounding Data Warehouses, Data Lakes and Integration Hubs. What is the difference? When should each technology be applied? Does one replace the other?
There was some satisfaction in hearing that experiences in Australia and the broader Asia Pacific are similar to those in North America.
From collecting the experience of their research and consulting customers, the Gartner analysts gave some clear guidance:
a. A Data Lake does not replace a Data Warehouse – at least not yet
I took a note from one of the presentations that around 85% of all databases in use are still relational in nature. Of course, not all of these are data warehouses, but from a knowledge, skills and experience base, the relational database is still immensely powerful, and is not only being supported but also being enhanced by vendors (the Oracle Autonomous Data Warehouse was one example cited).
The enforced structuring of data in a relational Data Warehouse better supports certain use cases that the typically less-structured Data Lake does not. The key message here is that it is the use cases that should drive the architecture, not the availability of the technology (how many times has that been said before?).
One of the presenting analysts stated that Data Lakes (or technologies such as Hadoop that support unstructured data) would not replace relational data warehouses in his lifetime (and he looked fit and reasonably middle-aged, so you would have to assume that covers the next 40 or 50 years!). However, some of the discussions I had after one of the presentations seemed to indicate that views differ, and that we will see a convergence of the technologies. Where we can all agree is that we are not at that point yet.
b. A Data Warehouse is useful where the data and the questions are both “known”
Gartner have a 2×2 ‘non-quadrant’ (since the word Quadrant has special meaning at Gartner) that can be used to show pictorially where a structured data warehouse is most useful. The data is “known” when it is well structured, and perhaps some work has been done to conform it into a common “shape” so that different products or service offerings can be compared on the same basis. The questions are “known” when they are not open-ended, such as “what should I do next?”, but descriptive, such as “how many widgets did I sell last month?”. Where both are known, the structured data warehouse is best suited to the task.
c. A Data Lake is useful where the data and the questions are both “unknown”
Where data is “unknown”, meaning it has not been structured or conformed, and the questions being asked are not descriptive but are more exploratory in nature, such as “I wonder if there is a connection between the news articles on the broader economy and the volume/type of website traffic we experience”, then data lakes are best suited to the task. I think we are just beginning to understand the types of workloads that social media companies are putting on their Data Lake environments, and they are not the typical workloads found in corporations today.
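As a rough illustration of points b and c, the two corners of the 2×2 described above can be written as a small lookup. This is only a sketch of my notes, not Gartner's model: the function name `suggested_store` is made up for the example, and the two "mixed" corners return `None` because they are not covered in these notes.

```python
from typing import Optional


def suggested_store(data_known: bool, questions_known: bool) -> Optional[str]:
    """Return the architecture these notes associate with a corner of the 2x2.

    Only the two corners described in the text are filled in; the mixed
    corners return None rather than guessing.
    """
    if data_known and questions_known:
        # Structured, conformed data and descriptive questions.
        return "data warehouse"
    if not data_known and not questions_known:
        # Unstructured data and exploratory questions.
        return "data lake"
    return None  # mixed corners: not covered in these notes


# "How many widgets did I sell last month?" against conformed sales data.
print(suggested_store(data_known=True, questions_known=True))
# "Is there a link between news articles and our website traffic?"
print(suggested_store(data_known=False, questions_known=False))
```

The point of the sketch is that the inputs are properties of the use case, not of the technology that happens to be available.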
d. The Data Hub reflects a shift from Data Collection to Data Connection… but is not about Analytics
We have been discussing Data Virtualisation for some time. The concept is to stop taking a copy of data generated in systems of record or sensory systems and storing or shaping it for later use, and instead to connect to those systems of record at the time a report or analysis is required. The approach was more focussed on data sharing than on data analysis, and removed the need for cumbersome overnight batch processing. However, I think it would be fair to say the approach has had limited success, perhaps partly driven by a long-standing principle in IT that says “thou shalt never run a query on a live production system”.
Similarly, the concept of a Data Hub is about connectivity to the data, or making the process of sharing data seamless. Based upon experience across “thousands” of clients, one analyst gave this advice: a Data Hub is not an analytics vehicle. It does not have to be part of an analytics architecture. It is useful in removing point-to-point interfaces, and allows a centralised place to govern data or assess quality.
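To put a number on the point-to-point argument, here is a back-of-the-envelope sketch. It assumes the worst case where every pair of systems exchanges data directly, and counts one interface per pair; the function names are invented for this example and real estates will sit somewhere in between.

```python
def point_to_point_interfaces(n_systems: int) -> int:
    """Worst case: every pair of systems has its own direct interface."""
    return n_systems * (n_systems - 1) // 2


def hub_interfaces(n_systems: int) -> int:
    """With a hub, each system needs a single connection, to the hub."""
    return n_systems


for n in (5, 10, 20):
    print(f"{n} systems: {point_to_point_interfaces(n)} point-to-point "
          f"vs {hub_interfaces(n)} via a hub")
```

With 20 systems the worst case is 190 pairwise interfaces against 20 hub connections, which is also 20 places rather than 190 at which to govern data or assess quality.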
Another piece of insight: there does not have to be only one Hub. Hubs can be use case specific; for example, there may be a Hub for Master Data Management and another for Reference Data Management.
In some clients the Hub is seen as a replacement for a component of the data warehouse, essentially playing the role that the staging area and operational data store (ODS) typically play in receiving data from operational systems. The advantage of the Hub approach is that other systems can access this ODS rather than it being purely ‘locked away’ in the broader data warehouse.
2. Speaking a different language: The Need to Improve Data Literacy
The concept of ‘data literacy’ and the need to improve it within organisations came up in many different discussions.
There appears to be a common challenge: people with deep experience of working with data ‘get it’, but the ability to communicate the benefits to decision makers is lacking, often resulting in a lack of investment or investment in the wrong areas. This may be partly driven by some of the over-promises that were made about classical Data Warehouses or, more recently, Data Lakes, resulting in scepticism or mistrust from the people making investment decisions.
Furthermore, the language that is often used in data-circles may not come across as immediately relevant to business functions. One suggestion was to stop ‘Data Governance Programs’ and rename them ‘Business Insights Initiatives’ – the former implies that the program is going to constrain innovation or creativity, while the latter suggests a positive outcome.
Of course, this is simply toying with semantics, and there is a real need for the business to understand how building a foundation of robust data management and governance will result in longer-term commercial benefits or mitigation of risk. One common and effective driver for improved data and analytics is regulatory compliance. There were multiple references to GDPR, and how compliance has driven complete re-architecting of data solutions in some organisations.
The strong theme that came through is that it is not enough for the CDO or data governance functions to focus on policies and frameworks. There is also a need for education, often of people that don’t want to invest the time and effort, but do want the insights that the data can provide. There is a role for a ‘data envoy’ to improve data literacy so that we can spend more time focussed on outcomes, and less time on trying to explain the words being used.
3. We need more roles: Scientists vs Engineers vs Citizens
Another big theme that came out across multiple presentations was the need for Data Engineers to translate the workings of ‘intellectual’ or academically focussed data scientists into metrics that can be operationalised or insights that make sense in a commercial context.
Furthermore, there is acknowledgement that not only are there professional Data Scientists in an organisation, who typically reside in an analytics function, but also a scattered team of lay-person Citizen Scientists who, if co-ordinated effectively, can contribute to a data-driven culture embedded directly within business functions. These citizens are to be collaborated with, not feared and, as in the broader world, the more educated the citizens become, the more useful their contributions are.
4. AI & ML: Yes. Big Data: No (or perhaps the wrong conference)
It was telling that I did not hear the term ‘Big Data’ mentioned once. It would appear that the industry has moved on from classifying data as ‘big’ and instead is now more focussed on the different architectures (lakes, hubs, warehouses etc.) that support different styles of data. Where once much of the discussion was about how to leverage social media data, the shift now seems to be back towards better leverage of the data that already exists within an organisation to drive clear commercial outcomes.
There was plenty of discussion about Machine Learning and Artificial Intelligence, with heavy use of the concept of Augmented Analytics, where humans are still involved in the decision making process, but their analysis is supplemented, rather than replaced, by the results of the algorithms. Whether you believe all data analysis will be replaced by machines or not, it is clear that the level of sophistication of the analysis has increased significantly in recent years, and the ability to automate operational processes (such as the measurement and assessment of data quality issues relating to interfaces) is already being embedded into leading organisations.
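To make the data quality example a little more concrete, below is a minimal sketch of the 'augmented' pattern: a script measures something about an interface feed and a person decides what to do about it. Everything here is an assumption for illustration (the field names, the 95% threshold and the choice of completeness as the metric); none of it was presented at the Summit.

```python
from typing import Iterable, Mapping, Sequence

# Hypothetical required fields for records arriving over an interface.
REQUIRED_FIELDS = ("customer_id", "amount")


def completeness(records: Iterable[Mapping[str, object]],
                 required: Sequence[str] = REQUIRED_FIELDS) -> float:
    """Share of records in which every required field is populated."""
    rows = list(records)
    if not rows:
        return 0.0
    complete = sum(
        1 for row in rows
        if all(row.get(field) not in (None, "") for field in required)
    )
    return complete / len(rows)


def needs_review(score: float, threshold: float = 0.95) -> bool:
    """Flag the batch for a human to look at; the person makes the call."""
    return score < threshold


batch = [
    {"customer_id": "C1", "amount": 10.0},
    {"customer_id": "C2", "amount": None},  # missing amount
    {"customer_id": "", "amount": 5.0},     # blank customer id
    {"customer_id": "C4", "amount": 7.5},
]
score = completeness(batch)
print(f"completeness={score:.2f}, needs_review={needs_review(score)}")
```

The measurement is automated, but the algorithm only raises the flag: the decision about the feed stays with a person, which is the distinction between augmenting and replacing the analyst.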
In summary, the Summit was thought-provoking and, in some ways, reassuring: similar organisations across the globe are facing the same challenges we face in this region.
Please contact us if you would like further discussion on any of these topics.