Data Lineage


What is data lineage, and why is it important for data engineering and analytics on GCP?

Data lineage traces data from its source, through the transformations it undergoes, to its target. The record begins when data enters a data system, covers the manipulations (such as extractions or queries) applied to it, and ends with the final output. It links one or more sources to target entities in a data warehouse, which enables traceability. For this reason, it is vital to data governance.

Data lineage can be captured by a system that is either active or passive. Active data lineage systems are meant for non-language data operations, such as running an Apache Beam pipeline or an Apache Spark job on Hadoop. Passive systems are better suited to language-based data operations, since they support SQL.
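To illustrate the passive approach, here is a minimal sketch in Python that reads one SQL statement and reports which table it writes and which tables it reads. It relies on simple regular expressions, which is an assumption made for brevity. A production lineage system would use a full SQL parser and would handle subqueries, CTEs, and views. The table names and the `extract_lineage` helper are hypothetical.

```python
import re

# Matches the table written by INSERT INTO or CREATE [OR REPLACE] TABLE.
TARGET_RE = re.compile(
    r"\b(?:INSERT\s+INTO|CREATE\s+(?:OR\s+REPLACE\s+)?TABLE)\s+`?([\w.\-]+)`?",
    re.IGNORECASE,
)
# Matches tables read after FROM or JOIN (subqueries are ignored in this toy version).
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+`?([\w.\-]+)`?", re.IGNORECASE)


def extract_lineage(sql):
    """Return (target, sorted sources) for a single simple SQL statement."""
    target_match = TARGET_RE.search(sql)
    target = target_match.group(1) if target_match else None
    # Exclude the target in case the statement also reads from it.
    sources = sorted(set(SOURCE_RE.findall(sql)) - {target})
    return target, sources


# Hypothetical query with made-up dataset and table names.
query = """
CREATE OR REPLACE TABLE analytics.customer_discounts AS
SELECT c.customer_id, b.plan
FROM raw.customers AS c
JOIN raw.billing AS b ON c.customer_id = b.customer_id
"""

print(extract_lineage(query))
# ('analytics.customer_discounts', ['raw.billing', 'raw.customers'])
```

Each extracted pair is one lineage record: the target table on one side and the sources it was derived from on the other.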

Passive systems are ideal for data warehouses such as BigQuery. Data lineage systems serve as repositories that host lineage records and expose them through a visual interface or an API. Such systems can also extract data lineage from the warehouse automatically. GCP handles vast volumes of data, and managing that data requires data engineering and analytics, of which data lineage is an integral component. So, why is it essential for these two processes?

Data lineage makes it easy to detect the root cause of data inconsistencies within a system or between multiple systems during analytics on GCP. Suppose BigQuery data shows that a given customer is eligible for a committed use discount, whereas Cloud Storage data shows that the same customer is still on a free trial of a Compute Engine service. Data lineage gives a clear view of the data pipeline from each output back to its source, so you can perform a root-cause analysis and determine the origin of the inconsistency. In this way, data lineage enables traceability.
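The root-cause walk described above can be sketched in Python as a traversal of lineage records. This is a minimal illustration, assuming lineage is stored as a mapping from each dataset to the datasets it was derived from. All dataset names and the helper functions are hypothetical.

```python
from collections import deque

# Hypothetical lineage records: dataset -> datasets it was derived from.
LINEAGE = {
    "bigquery.customer_discounts": ["bigquery.billing_clean"],
    "bigquery.billing_clean": ["pubsub.billing_events"],
    "gcs.trial_status_export": ["bigquery.trial_snapshot"],
    "bigquery.trial_snapshot": ["pubsub.billing_events"],
}


def upstream(dataset, lineage):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def compare_origins(a, b, lineage):
    """Split the upstream datasets of two outputs into shared and divergent sets."""
    up_a, up_b = upstream(a, lineage), upstream(b, lineage)
    return up_a & up_b, up_a ^ up_b


shared, divergent = compare_origins(
    "bigquery.customer_discounts", "gcs.trial_status_export", LINEAGE
)
print(sorted(shared))     # ['pubsub.billing_events']
print(sorted(divergent))  # ['bigquery.billing_clean', 'bigquery.trial_snapshot']
```

If both outputs share an origin, as here, the inconsistency was most likely introduced in one of the divergent intermediate steps, which narrows where to look.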

By enabling traceability, data lineage also makes it possible to enforce data access policies in the warehouse. For example, since a passive data lineage system supports SQL, ZetaSQL can be used to lay out the policies that determine which parties can access a given piece of data in the data warehouse (BigQuery). This helps prevent unauthorized data access.
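One way lineage can support access policies is by letting a derived table inherit restrictions from its sources. The Python sketch below assumes a deliberately simple rule that is not taken from any GCP product: a derived table may be read only by roles allowed on every one of its sources. The roles, table names, and the `allowed_readers` helper are hypothetical.

```python
# Hypothetical policy: roles allowed to read each source table.
SOURCE_READERS = {
    "raw.customers": {"analyst", "billing", "support"},
    "raw.payment_cards": {"billing"},
}

# Hypothetical lineage records: derived table -> tables it was built from.
# Assumed to be acyclic, as lineage graphs normally are.
LINEAGE = {
    "analytics.customer_payments": ["raw.customers", "raw.payment_cards"],
}


def allowed_readers(table, lineage, source_readers):
    """Return the roles allowed to read `table` under the intersection rule."""
    if table in source_readers:
        return set(source_readers[table])
    parents = lineage.get(table, [])
    if not parents:
        return set()  # unknown provenance: deny by default
    readers = allowed_readers(parents[0], lineage, source_readers)
    for parent in parents[1:]:
        readers &= allowed_readers(parent, lineage, source_readers)
    return readers


print(allowed_readers("analytics.customer_payments", LINEAGE, SOURCE_READERS))
# {'billing'}
```

Because the derived table includes payment card data, only the role cleared for that source keeps access, even though the customer table alone is more widely readable.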

Data lineage provides the blueprint for developing serverless data pipelines on GCP. Data engineers build data pipelines that define the data path from ingestion via Cloud Pub/Sub to storage, output, and subsequently analytics in BigQuery. Data platforms must adhere to data governance regulations such as the General Data Protection Regulation (GDPR). These regulations apply at different stages of the data pipeline, which data lineage lays out. Consequently, data lineage supports the transparency and safety of customer data ingested into GCP and helps the platform comply with these regulations.

In summary, data lineage helps keep data engineering and analytics on GCP within the rules that regulate data management and governance. These are highly regulated processes, especially for platforms that handle highly sensitive customer data (such as credit card or phone numbers). Data lineage also simplifies correcting errors that may arise from data engineering and analytics, despite the large volumes of data handled on GCP.
