Google Cloud Data Catalog
Google Cloud Data Catalog – What is it, and how does it work?
Google Cloud Data Catalog is a highly scalable indexing service dedicated to data discovery and metadata management on the Google Cloud Platform (GCP). It is a structured inventory that provides a unified view of an organization’s data assets. It comprises a user-friendly search interface where users can quickly discover their data, and an API for programmatic access to the catalog and for building custom applications. It is well integrated into GCP’s ecosystem and works with popular offerings such as BigQuery, Cloud Storage, and Pub/Sub. Data Catalog is fully managed by Google Cloud, with its infrastructure based on Cloud Spanner, so setup is easy.
How does this service work?
Data Catalog allows users to create catalogs for their organizations’ data assets. Its batch synchronization can bring in metadata from different sources. Once a catalog for a given data asset is created, say BigQuery, Data Catalog automatically syncs technical metadata from this asset, so users do not have to add metadata from that asset manually. Open-source connectors permit metadata ingestion from non-GCP sources such as Hive and Oracle. Its API enables users to add these sources to the catalog and to build applications that manage access to this metadata.
The use of structured (schematized) tags sets GCP’s Data Catalog apart from traditional catalogs. While conventional catalogs do permit tags, these are unstructured text strings, which makes capturing complex metadata difficult. Data Catalog’s structured tags come in five types: Double, Boolean, String, Enumerated, and DateTime. These structured tags make data searchable by status (zero errors or job completed), classification (public or private data), or life cycle stage (production or development). Tags can also record data quality metrics such as median, minimum, or maximum. Data Catalog allows users to customize the available structured tag templates or create their own.
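To make the difference between free-text and structured tags concrete, here is a minimal sketch in plain Python. It does not call the Data Catalog API; the names FieldType, TemplateField, validate_tag, and governance_template are hypothetical and exist only for illustration. It models a template built from the five field types listed above and checks whether a tag fits it.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class FieldType(Enum):
    # The five structured tag types named in the article.
    DOUBLE = "double"
    BOOLEAN = "boolean"
    STRING = "string"
    ENUMERATED = "enumerated"
    DATETIME = "datetime"


@dataclass(frozen=True)
class TemplateField:
    # Hypothetical model of one template field, not the real API type.
    name: str
    field_type: FieldType
    allowed_values: tuple = ()  # only used for ENUMERATED fields


def _matches(field, value):
    """Check one value against the type declared by its template field."""
    if field.field_type is FieldType.BOOLEAN:
        return isinstance(value, bool)
    if field.field_type is FieldType.DOUBLE:
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field.field_type is FieldType.STRING:
        return isinstance(value, str)
    if field.field_type is FieldType.ENUMERATED:
        return value in field.allowed_values
    if field.field_type is FieldType.DATETIME:
        return isinstance(value, datetime)
    return False


def validate_tag(template, tag):
    """Return error messages; an empty list means the tag fits the template."""
    errors = []
    fields = {f.name: f for f in template}
    for key, value in tag.items():
        field = fields.get(key)
        if field is None:
            errors.append(f"{key}: not defined in template")
        elif not _matches(field, value):
            errors.append(f"{key}: expected {field.field_type.value}")
    return errors


# Illustrative template covering status, classification, and life cycle.
governance_template = [
    TemplateField("classification", FieldType.ENUMERATED, ("public", "private")),
    TemplateField("lifecycle", FieldType.ENUMERATED, ("development", "production")),
    TemplateField("job_completed", FieldType.BOOLEAN),
    TemplateField("error_count", FieldType.DOUBLE),
    TemplateField("last_checked", FieldType.DATETIME),
]

good_tag = {
    "classification": "private",
    "lifecycle": "production",
    "job_completed": True,
    "error_count": 0,
    "last_checked": datetime(2024, 1, 1),
}
bad_tag = {"classification": "secret", "job_completed": "yes"}

print(validate_tag(governance_template, good_tag))
print(validate_tag(governance_template, bad_tag))
```

A free-text tag could hold "secret" or "yes" without complaint; the typed template rejects both, which is what makes the values reliable enough to search and filter on.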
The use of structured tags, coupled with the technology that powers the search engines of Gmail and Google Drive, makes metadata searchable. This technology ensures the scalability of Data Catalog, since it also powers Gmail, which has more than a billion users. Users can perform simple keyword searches across all the data assets available or narrow their searches with facets.
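The idea of a keyword search narrowed by facets can be sketched with a small in-memory model. This is only an illustration of the concept, not Data Catalog’s search engine or query syntax; Asset, search, and the sample catalog are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    # Hypothetical catalog entry with two facets: system and asset_type.
    name: str
    system: str        # e.g. "bigquery", "cloud_storage"
    asset_type: str    # e.g. "table", "fileset"
    description: str = ""


def search(assets, keyword, **facets):
    """Case-insensitive keyword match on name and description,
    narrowed by exact-match facets such as system="bigquery"."""
    needle = keyword.lower()
    hits = []
    for asset in assets:
        text = f"{asset.name} {asset.description}".lower()
        if needle not in text:
            continue
        if all(getattr(asset, facet) == wanted for facet, wanted in facets.items()):
            hits.append(asset)
    return hits


catalog = [
    Asset("orders", "bigquery", "table", "Customer orders by day"),
    Asset("orders_export", "cloud_storage", "fileset", "CSV export of orders"),
    Asset("customers", "bigquery", "table", "Customer master data"),
]

# A plain keyword search looks across every asset.
everything = search(catalog, "orders")
# The same keyword, limited to one facet value.
only_bigquery = search(catalog, "orders", system="bigquery")

print([a.name for a in everything])
print([a.name for a in only_bigquery])
```

The first call returns both assets that mention orders; adding the system facet leaves only the BigQuery table.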
Security is built into GCP’s Data Catalog. It integrates with GCP’s Identity and Access Management (IAM), which gives the organization access control over its data and data assets. With IAM, an admin can grant granular access, using Data Catalog’s API, to personnel at different levels of the organization. For example, managers may have access to more projects than the employees ranking below them. The admin can create different access roles, and a user must hold a suitable role before a request is authorized.
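The role-based model described here can be illustrated with a short sketch. The role names, permission names, projects, and members below are all invented for this example and are not the real IAM roles or permissions for Data Catalog; the point is only how roles bundle permissions and how bindings attach members to roles per project.

```python
# Hypothetical roles and permissions, invented for this sketch.
ROLE_PERMISSIONS = {
    "catalog.viewer": {"entries.search", "entries.get"},
    "catalog.tagEditor": {"entries.search", "entries.get", "tags.create"},
    "catalog.admin": {"entries.search", "entries.get", "tags.create", "policy.set"},
}

# Policy bindings: project -> role -> members granted that role there.
bindings = {
    "sales-project": {
        "catalog.admin": {"manager@example.com"},
        "catalog.viewer": {"analyst@example.com"},
    },
    "finance-project": {
        "catalog.viewer": {"manager@example.com"},
    },
}


def is_allowed(member, project, permission):
    """True if any role bound to the member on this project grants the permission."""
    for role, members in bindings.get(project, {}).items():
        if member in members and permission in ROLE_PERMISSIONS.get(role, set()):
            return True
    return False


# The manager can tag entries in the sales project; the analyst can only read.
print(is_allowed("manager@example.com", "sales-project", "tags.create"))
print(is_allowed("analyst@example.com", "sales-project", "tags.create"))
# The analyst has no binding at all on the finance project.
print(is_allowed("analyst@example.com", "finance-project", "entries.get"))
```

In this sketch the manager is bound to roles in two projects and the analyst in one, mirroring the example of managers seeing more projects than the employees below them.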
Data Catalog ensures that personally identifiable information (PII), such as social security numbers, is protected. It integrates with Cloud Data Loss Prevention (DLP), which scans for PII and marks where it is located using structured PII tags. Once PII is found, the user can choose whether to mask, redact, tokenize, replace, or bucket this information. PII can also be protected using reversible data protection techniques, with access roles created for departments that need this data to function, such as billing.
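The five options named above can be sketched in a few lines of plain Python. This is a simplified illustration of what each transformation does to a value, not Cloud DLP’s implementation; the function names and the TokenVault class are hypothetical, and the lookup-table vault is only one assumed way to make tokenization reversible.

```python
import secrets


def mask(value, visible=4, char="*"):
    """Hide letters and digits except the last `visible` characters; keep separators."""
    cutoff = len(value) - visible
    return "".join(
        char if i < cutoff and c.isalnum() else c
        for i, c in enumerate(value)
    )


def redact(value):
    """Remove the sensitive value entirely."""
    return ""


def replace(value, placeholder="[PII]"):
    """Swap the sensitive value for a fixed placeholder."""
    return placeholder


def bucket(number, width=10):
    """Generalize a whole number into a range, e.g. 37 becomes "30-39"."""
    low = (number // width) * width
    return f"{low}-{low + width - 1}"


class TokenVault:
    """Reversible tokenization via a lookup table (an assumption of this sketch)."""

    def __init__(self):
        self._by_value = {}
        self._by_token = {}

    def tokenize(self, value):
        # Reuse the existing token so equal values stay joinable.
        if value not in self._by_value:
            token = "tok_" + secrets.token_hex(8)
            self._by_value[value] = token
            self._by_token[token] = value
        return self._by_value[value]

    def detokenize(self, token):
        # Only a caller holding the vault (e.g. billing) can recover the value.
        return self._by_token[token]


ssn = "123-45-6789"
vault = TokenVault()
token = vault.tokenize(ssn)

print(mask(ssn))
print(replace(ssn, "[SSN]"))
print(bucket(37))
print(vault.detokenize(token) == ssn)
```

Masking, redaction, replacement, and bucketing discard information for good, while the vault keeps the original recoverable, which is why access to it would be limited to the roles that genuinely need the raw value.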