What is data engineering?

Aniruddh Yadav
4 min read · Aug 19, 2023

Data engineering is the process of discovering, designing, and building the data infrastructure that helps data owners and data users access and analyze raw data from multiple sources and formats. This allows businesses to use that data to make critical business decisions. Without data engineering, it would be nearly impossible to make sense of the huge volumes of data available today.

One of the key aspects of data engineering is the Extract, Transform, Load (ETL) process. ETL involves extracting data from various sources, transforming it into a format that can be analyzed, and loading it into a data warehouse or other storage system. This process is essential for ensuring that data is clean, consistent, and ready for analysis.
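To make the idea concrete, here is a minimal ETL sketch in Python, assuming a hypothetical CSV source file, table name, and connection string; it is an illustration of the pattern, not a production pipeline.

```python
# A minimal ETL sketch (illustrative only): extract order records from a CSV,
# clean them with pandas, and load them into a SQL table.
# The file name, column names, and connection string are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

# Extract: read raw data from a source file
raw = pd.read_csv("orders.csv")

# Transform: drop duplicates, normalise column names, parse dates
clean = (
    raw.drop_duplicates()
       .rename(columns=str.lower)
       .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
)

# Load: write the cleaned data into a warehouse/database table
engine = create_engine("sqlite:///warehouse.db")
clean.to_sql("orders", engine, if_exists="replace", index=False)
```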

There are also other variations of ETL such as ELT (Extract, Load, Transform) and EL (Extract, Load). In ELT, the data is first loaded into the target system and then transformed within the system. This approach can be more efficient for large datasets or when using cloud-based storage systems. In EL, the data is simply extracted from the source and loaded into the target system without any transformation. This approach can be useful when dealing with raw or unstructured data.
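As a contrast with the ETL sketch above, here is an ELT sketch on GCP: raw files are loaded into BigQuery as-is and then transformed inside the warehouse with SQL. The project, dataset, bucket, and column names are hypothetical.

```python
# An ELT sketch (illustrative): load raw CSV files from Cloud Storage straight
# into a BigQuery staging table, then transform inside the warehouse with SQL.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Extract + Load: ingest raw files as-is into a staging table
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/orders/*.csv",
    "my-project.staging.orders_raw",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        skip_leading_rows=1,
    ),
)
load_job.result()

# Transform: clean and reshape the data inside BigQuery itself
client.query("""
    CREATE OR REPLACE TABLE `my-project.analytics.orders` AS
    SELECT DISTINCT order_id, customer_id, DATE(order_ts) AS order_date, amount
    FROM `my-project.staging.orders_raw`
""").result()
```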

ETL system design on Google Cloud Platform

Note: I have explained the ETL system design using GCP, but the same design can be achieved on any cloud or on-prem platform.
On AWS, the data lake can be S3, data transformation can be done with Glue/EMR, and the data warehouse can be Redshift.
Apache Spark is commonly used for data transformation on AWS, GCP, and Azure.
Refer: Let’s understand Apache Spark | by Aniruddh Yadav
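Since Spark is the common transformation engine across clouds, here is a small PySpark transformation sketch. The data lake paths and column names are hypothetical; an S3 or ADLS path would work the same way.

```python
# A small PySpark transformation sketch (illustrative): read raw events from a
# data lake, clean them, and write the result back in a query-friendly format.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-transform").getOrCreate()

# Read raw JSON files from the data lake (GCS here; S3/ADLS paths look similar)
events = spark.read.json("gs://my-data-lake/raw/events/")

# Clean and enrich: drop duplicates, filter bad rows, derive a date column
curated = (
    events.dropDuplicates(["event_id"])
          .filter(F.col("user_id").isNotNull())
          .withColumn("event_date", F.to_date("event_ts"))
)

# Write curated data back as partitioned Parquet for downstream analytics
curated.write.mode("overwrite").partitionBy("event_date").parquet(
    "gs://my-data-lake/curated/events/"
)
```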

Data sources are the systems or repositories from which data is extracted during the ETL process. These can include databases, files, APIs, web services, and other sources of structured or unstructured data.

Data ingestion is the process of importing data from various sources into a system for storage or analysis. This can involve moving large amounts of data in batch processes or streaming data in real-time. Data ingestion is a crucial component of data engineering, as it enables organizations to collect and integrate data from multiple sources.
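A brief sketch of both ingestion styles on GCP, assuming hypothetical bucket, topic, and file names: a batch upload of a daily export into the data lake, and a single streaming event published to Pub/Sub.

```python
# Ingestion sketch (illustrative): batch upload to Cloud Storage, plus one
# streaming event published to Pub/Sub for real-time processing.
import json
from google.cloud import storage, pubsub_v1

# Batch ingestion: copy a daily export file into the data lake
bucket = storage.Client().bucket("my-data-lake")
bucket.blob("raw/sales/2023-08-19.csv").upload_from_filename("sales_export.csv")

# Streaming ingestion: publish an event to a Pub/Sub topic
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "sales-events")
message = json.dumps({"order_id": 123, "amount": 42.0}).encode("utf-8")
publisher.publish(topic_path, message).result()
```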

A data lake is a centralized repository for storing large amounts of raw data in its native format. Data lakes are often used in conjunction with data warehouses to provide a flexible and scalable storage solution for big data analytics.

Data curation / data transformation involves organizing, cleaning, and enriching data to improve its quality and usability. This can include tasks such as removing duplicate records, filling in missing values, standardizing data formats, and structuring the data into a usable form that can be analyzed to support decision-making and drive the growth of an organization.
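A small curation sketch with pandas, showing the kinds of clean-up steps just described; the file and column names are hypothetical.

```python
# Data-curation sketch (illustrative): deduplicate, standardise formats, and
# fill in missing values before publishing a curated dataset.
import pandas as pd

customers = pd.read_csv("customers_raw.csv")

curated = (
    customers.drop_duplicates(subset="customer_id")   # remove duplicate records
             .assign(
                 country=lambda df: df["country"].str.upper(),           # standardise formats
                 signup_date=lambda df: pd.to_datetime(df["signup_date"]),
             )
             .fillna({"segment": "unknown"})           # fill in missing values
)
curated.to_parquet("customers_curated.parquet", index=False)
```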

Analytics involves using statistical methods and tools to analyze data and extract insights. Analytics is a key component of data-driven decision-making and can help organizations identify trends, patterns, and relationships in their data.

Machine learning is a subset of artificial intelligence that involves using algorithms to automatically learn from data and make predictions or decisions. Machine learning can be used in conjunction with analytics to build predictive models and uncover hidden patterns in large datasets.
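To show how curated data feeds a predictive model, here is a tiny scikit-learn sketch; the curated file, feature columns, and label are hypothetical.

```python
# Predictive-model sketch (illustrative): train a simple classifier on curated
# data to predict customer churn and report hold-out accuracy.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = pd.read_parquet("customers_curated.parquet")
X = data[["tenure_months", "monthly_spend"]]
y = data["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```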

How Is Data Engineering Different from Data Science?
The data landscape is always changing. Because of the sheer amount of data being produced, data gathering and data management are complex, yet organizations want fast insights from this data. While the required skill sets for a data engineer and a data scientist may sound alike, the roles are distinct:

Data engineer vs. data scientist: What’s the difference?

Data engineers develop, test, and maintain data pipelines and architectures. This includes data ingestion, data processing, and preparing ready-to-use data for analytics, data science, and artificial intelligence.

Data scientists use that data to predict trends and answer questions that are important to the organization.

Let’s have a look at how we can create end-to-end data pipelines on GCP for batch data and streaming data:

Image created by Aniruddh Yadav
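As a complement to the diagram above, here is a minimal Apache Beam sketch of the streaming path: read events from Pub/Sub, parse them, and write to BigQuery. It can run on Dataflow with the appropriate pipeline options; the topic and table names are hypothetical, and the target table is assumed to already exist.

```python
# Streaming pipeline sketch (illustrative): Pub/Sub -> parse JSON -> BigQuery.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/sales-events")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.sales_events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```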

I would love to hear from you if you want to discuss anything: https://www.linkedin.com/in/aniruddhyadav

Written by Aniruddh Yadav

Data Engineer with experience in solutioning, building data-intensive applications, and tackling challenging architectural and scalability problems.