Even in the context of the tech industry, data engineering is a relatively new set of responsibilities. Companies recognize the value of promptly collected and well-managed data, so data engineering skills are in high demand. The number of available vacancies is snowballing, and the offered compensation is higher than ever. This demand, among other reasons, makes the role attractive to software engineers looking for a change of focus or just for new challenges.
I was once in this position myself; now I am leading a team of data engineers. In the past 12 months, I have been hiring engineers for my team and other teams within the Data Engineering department. I have reviewed hundreds of CVs and LinkedIn profiles, had dozens of interviews with applicants and was lucky enough to find several excellent developers who are now my colleagues.
In this post I will outline a list of skills that will get you a data engineering position and help you succeed in this role. We will also discuss what defines data engineering and makes it different from broader software development and related fields of data analytics and data science.
The Elder Scripts blog covers the opportunities and challenges that software engineers and data engineers encounter. I am sharing insights gained from years of work in the industry as an engineer and as a manager of engineering teams. Subscribe to get notified of weekly posts if you haven't already.
Differences between Data Engineering and other similar and related roles
Data Engineering vs Software Development
Data engineering is a specialized subset of software development. Data engineers use the same skills: designing systems, writing code, testing, and deploying software. The distinction is that most data engineering projects focus on data management: retrieving data, transforming it into formats required by the business, storing it, and exposing it to interested parties. A lot of data engineers used to be software developers. I am one example!
Data Engineering vs Data Analytics
The difference between these two roles was more pronounced in the past. Data analysts used to query the data prepared by engineers and build dashboards using UI-based tools. Recently the situation has changed. Modern data analysts know how to write complicated SQL queries, and they also use Python and R to set up advanced reports and dashboards. One thing is still valid, though. Data analysts are your best friends if you work as a data engineer. Out of all teams in the company, you will probably talk to them the most.
Data Engineering vs Data Science and Machine Learning Engineering
These domains have received much attention in recent years, both from businesses and from people interested in building relevant skillsets. Data scientists apply statistical methods of varying complexity to create prediction and classification models, and do a considerable number of other things that might seem magical to people who don't know that much about machine learning (like me). Data scientists need access to datasets specially prepared for the task, and usually a team of data engineers helps them by setting up these datasets. In some companies, the engineers in this supporting role are called Machine Learning engineers.
Data Engineering skillset
Data warehousing is the original data skill, the one where it all began. A data warehouse is a specialized database configured to store large amounts of data and run analytical queries on subsets of this data. Data engineers are responsible for loading the data into the warehouse, setting up the correct schema, and, usually, helping users write efficient queries.
Typical tools in this domain are:
Concepts to learn:
- OLAP databases and OLTP databases
- Types of schemas used for data warehousing (e.g. star schema or snowflake schema)
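To make the star schema idea more concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for a real warehouse, and every table and column name (`fact_sales`, `dim_product`, `dim_date` and so on) is a hypothetical one made up for illustration. The point is the shape: one central fact table that holds measurements, surrounded by small dimension tables that describe them, and an analytical query that joins and aggregates.

```python
import sqlite3

# In-memory SQLite stands in for a real data warehouse.
# All table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the "who/what/when" of each fact.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT NOT NULL
    );
    CREATE TABLE dim_date (
        date_id INTEGER PRIMARY KEY,
        year    INTEGER NOT NULL
    );
    -- The fact table sits in the centre of the "star" and holds measurements.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        amount     REAL NOT NULL
    );
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "books"), (2, "games")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(10, 2022), (11, 2023)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0), (2, 11, 25.0)])


def revenue_by_category_and_year(connection):
    """Run a typical analytical query: join facts to dimensions, then aggregate."""
    query = """
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """
    return connection.execute(query).fetchall()


report = revenue_by_category_and_year(conn)
for category, year, revenue in report:
    print(category, year, revenue)
```

SQLite is an OLTP-style database, so this only illustrates the schema and the query pattern, not the performance characteristics of an OLAP system.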
Object Storage and Data Lake
Object storage allows engineers to store large volumes of data in various file formats. This way of storing data is usually not optimized for frequent random queries; its primary goal is to store the data reliably. Object storage is frequently used for backups and as a staging area where data resides before being loaded into a data warehouse. When object storage performs the latter role, it is called a data lake. A data lake is a repository where data arrives from multiple sources and where it is stored in its unprocessed format. Recently several companies have introduced specialized data lake solutions. Delta Lake is one example.
Common tools are:
There are several file formats used for storing and sharing data. They range from relatively common and easy to read, like JSON or CSV, to specialized ones like Parquet. It is essential to know when to pick each format. The most common formats data engineers use are:
Distributed Data Processing
Once your datasets grow beyond the capabilities of a single computer, it is no longer feasible to process this data on a single server. This is why tools that can distribute the processing across a cluster of several (or several thousand) servers are the bread and butter of data engineering. An engineer can convert data between formats, filter, transform, join, and slice the datasets using these tools.
Standard tools to learn:
- Apache Spark. Spark is by far the most popular tool for batch processing of data.
- Apache Beam
- Google Dataflow
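The core idea behind these tools can be sketched without any of them: split the dataset into partitions, process each partition independently, then merge the partial results. The snippet below is only an illustration of that pattern, not how Spark or Beam are implemented. A thread pool stands in for the cluster workers, and the function names and the `country` field are assumptions invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_by_country(partition):
    """Map step: aggregate one partition without looking at the others."""
    return Counter(record["country"] for record in partition)


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def distributed_count(records, num_partitions=3):
    partitions = split_into_partitions(records, num_partitions)
    # The thread pool is a stand-in for workers on separate machines.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_by_country, partitions))
    return reduce(merge_counts, partial_counts, Counter())


# Hypothetical input: one record per page view.
events = [{"country": c} for c in ["NL", "DE", "NL", "FR", "DE", "NL", "NL"]]
totals = distributed_count(events)
print(dict(totals))
```

Because each partition is processed in isolation and the merge step is associative, the same logic keeps working when the partitions live on different servers, which is the property the real frameworks rely on.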
Streaming Data Processing
In modern tech companies engineers have to deal with data that arrives as a continuous flow. Web visitors are browsing search pages, customers are performing transactions, vehicles are sending telemetry, and all of these data points must be collected, processed and stored. Businesses benefit from the reduced reaction time that streaming processing allows. Decision-makers don’t have to wait hours or days for the new batch of data to arrive. They can follow and monitor the data using real-time reports or dashboards.
Common tools to learn:
- Apache Kafka
- Apache Flink
- Apache Spark
- Apache Beam (that’s right, Spark and Beam support both batch and stream processing)
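A building block that most streaming tools share is windowing: grouping an unbounded flow of events into fixed time buckets so that results can be emitted continuously. Below is a rough sketch of a tumbling window in plain Python. It assumes events arrive ordered by timestamp, which real engines do not require, and the event shape (`(timestamp_in_seconds, event_type)`) is a made-up one for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, counts) each time a window closes.

    Assumes events arrive ordered by timestamp. Real streaming engines
    also handle late and out-of-order data, which this sketch ignores.
    """
    current_window = None
    counts = defaultdict(int)
    for timestamp, event_type in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_window is not None and window_start != current_window:
            # The previous window is complete: emit it and start a new one.
            yield current_window, dict(counts)
            counts = defaultdict(int)
        current_window = window_start
        counts[event_type] += 1
    if current_window is not None:
        # Flush the last window when the (finite, simulated) stream ends.
        yield current_window, dict(counts)


# A finite iterator simulates the stream; timestamps are in seconds.
stream = iter([
    (0, "search"), (12, "purchase"), (59, "search"),
    (61, "search"), (130, "purchase"),
])
results = list(tumbling_window_counts(stream, window_seconds=60))
for window_start, window_counts in results:
    print(window_start, window_counts)
```

The generator emits each result as soon as its window closes, so a dashboard reading from it would not have to wait for the whole dataset to arrive.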
Most of the tools and systems mentioned above can run either on the company’s own hardware or in the cloud. However, most companies these days choose to run these workloads in the cloud, in order to cut operational costs and improve the reliability of their systems.
Common tools to learn:
- AWS (S3, Redshift, Athena, EMR, Kinesis and other services)
- GCP (Cloud Storage, BigQuery, Dataflow etc.)
- Azure (Storage, SQL Database, analytics tools etc.)
The most commonly used programming languages in data engineering are:
A data engineer should know the following data management concepts and apply them in their daily work. These concepts don’t necessarily map well to existing tools but rather inform architectural and operational decisions.
- Privacy and security of business and personal data
- Data governance
- Data lineage
- Data quality and observability
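Of these concepts, data quality is probably the easiest to illustrate in code. The sketch below shows one possible shape of a quality check that a pipeline could run before loading a batch: it looks for missing required values and for duplicate keys. The function name, the field names and the sample rows are all hypothetical, and real setups usually rely on a dedicated tool instead of hand-written checks.

```python
def check_quality(rows, required_fields, unique_field):
    """Return a list of human-readable problems found in rows."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        # Completeness: every required field must hold a non-empty value.
        for field in required_fields:
            if row.get(field) in (None, ""):
                problems.append(f"row {index}: missing {field}")
        # Uniqueness: the key field must not repeat within the batch.
        value = row.get(unique_field)
        if value is not None:
            if value in seen:
                problems.append(f"row {index}: duplicate {unique_field}={value}")
            seen.add(value)
    return problems


# Hypothetical batch of order records with two deliberate defects.
orders = [
    {"order_id": 1, "customer": "ann", "amount": 10.0},
    {"order_id": 2, "customer": "", "amount": 5.0},
    {"order_id": 1, "customer": "bob", "amount": None},
]
issues = check_quality(orders, ["customer", "amount"], "order_id")
for issue in issues:
    print(issue)
```

Reporting such problems to a log or a dashboard, instead of silently dropping the rows, is a small first step towards observability.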
This concludes the list of skills that form the foundation of data engineering. In smaller companies engineers are expected to cover most, if not all, items on this list. In larger companies engineers usually specialize, and it is not uncommon for a whole team or department to manage just one category of responsibilities.
You don't have to know all of the tools I've listed above to land a job in data engineering, especially an entry-level job. But it helps to be familiar with the common concepts from all these areas and to try out one or two tools from each.
If you have read this far, thank you, and I hope this will prove helpful to you. If you have questions or would like to ask for advice concerning a career in data engineering, please hit me up on Twitter @TheElderScripts or email me at email@example.com.