What is a Smart Data Catalog?
And why it isn't only about machine learning
Guillaume Bodet - CEO - Zeenea
"The idea of a Smart Data Catalog has been around for a few years in metadata management literature, although it has no official definition. The general consensus is that a modern data catalog must have machine learning and AI to unlock its potential.
In this piece, we will attempt to define how Zeenea handles the idea of the Smart Data Catalog which, for us, cannot be limited to machine learning capabilities."
Data Quality usually refers to a company’s ability to ensure the longevity of its data. At Zeenea (a data catalog provider), we believe Data Quality is ensured through the following 9 dimensions - all essential to extracting value for your company:
We will detail these dimensions with the help of a simple example in part one. We will then elaborate on how Data Quality management is an important challenge for organizations seeking to extract maximum value from their data.
We will also draw parallels between these Data Quality dimensions and the successive phases of risk management - identification, analysis, evaluation, and processing. This will enable you to hone your risk management reflexes by tying Data Quality improvement efforts to a company objective (and evaluating the ROI on each quality dimension).
Once we have established the main features of an enterprise Data Quality management tool, we will detail how a Data Catalog - though not a Data Quality tool - can contribute towards Data Quality improvement (through the clarity, availability, and traceability dimensions mentioned above).
Regardless of its size, an information system typically contains several dozen systems and applications that store data in a wide variety of sources (relational and non-relational databases, distributed file systems, APIs, cloud solutions, etc.), each according to its own protocols, formats, and rules. Each system manages hundreds or thousands of datasets - usually tables or files - themselves made up of dozens of fields (or columns). Each dataset and each field in turn feeds into a metamodel (in other words, a set of structured metadata), which makes data exploration possible.
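To make the hierarchy described above more concrete, here is a minimal Python sketch of systems, datasets, and fields, each carrying its own metadata. All class, attribute, and function names (SourceSystem, Dataset, Field, count_catalog_items) are hypothetical and chosen for illustration only; they do not describe Zeenea's actual metamodel.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """A single column or attribute of a dataset (hypothetical model)."""
    name: str
    data_type: str
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class Dataset:
    """A table or file, made up of fields (hypothetical model)."""
    name: str
    fields: list[Field] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class SourceSystem:
    """A database, file system, API, or cloud store holding datasets."""
    name: str
    kind: str
    datasets: list[Dataset] = field(default_factory=list)


def count_catalog_items(systems: list[SourceSystem]) -> int:
    """Count every dataset and every field the catalog has to document."""
    return sum(
        len(system.datasets) + sum(len(ds.fields) for ds in system.datasets)
        for system in systems
    )


# Illustrative data only: one small relational source with two tables.
crm = SourceSystem(
    name="crm_db",
    kind="relational",
    datasets=[
        Dataset(
            name="customers",
            fields=[
                Field("id", "integer"),
                Field("email", "string", {"sensitivity": "personal"}),
            ],
        ),
        Dataset(
            name="orders",
            fields=[
                Field("id", "integer"),
                Field("customer_id", "integer"),
                Field("total", "decimal"),
            ],
        ),
    ],
)

print(count_catalog_items([crm]))  # 2 datasets + 5 fields = 7
```

Even this toy source yields 7 items to document. Multiplying dozens of systems by hundreds or thousands of datasets and dozens of fields each gives a sense of how quickly the number of entries grows.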
Ultimately, a data catalog will have to harness enormous amounts of very diverse information - and its volume will grow exponentially, just as the volume of usable data will. This volume of information will raise 2 major problems:
💡 How to feed and maintain the volume of information without tripling (or more) the cost of metadata management?
💡 How to find the most relevant datasets for any specific use case?
For us, a Smart Data Catalog should have a much wider scope than the integration of AI algorithms and should include a range of smart technological and conceptual features that provide answers to the 2 questions above.
We have identified 5 areas in which a data catalog can be "Smart" - most of which do not involve machine learning:
🔸 The data inventory
🔸 Metadata management
🔸 The search engine
🔸 User experience