Every developer, engineer, analyst, scientist, and doctor starts somewhere, and few of us begin with a complete picture of the field we are stepping into. That was certainly my case. I graduated in Software Engineering from COMSATS University Islamabad, and after graduation I joined my first company as a Backend Engineer, working on ASP.NET MVC along with some JavaScript & jQuery.
However, I always felt that I was good with data. Even back in my university days, I was quite comfortable manipulating data using languages like Java, Python, and especially SQL. At the time, I didn’t even know that what I was doing is commonly called Data Wrangling.
So, after this epiphany, I decided to give the field of data a shot. The epiphany wasn’t the only factor, though: buzzwords like Data Science, Big Data, and Data Engineering were popping up all over the internet, and the field was gaining popularity (deservedly so).
A lot of people started moving into this field, especially computer science and software engineering grads keen to learn data engineering and data science.
When we start out in this field, we generally expect to need familiarity with popular tools like Apache Hadoop, Apache Spark, and Apache Kafka, some basic cloud concepts, an idea of how a data pipeline is built, and programming skills. In other words, we expect to need exactly what we are taught in universities and online tutorials. However, when we enter the industry, it turns out these tools aren’t the only things that matter. This is what I faced when I transitioned into the data field.
As beginners, we tend to skip some important concepts that are used heavily in the industry but are not given enough emphasis in software engineering schools.
5 Things Data Engineers Should Know
We learn to code in our favorite language, learn our favorite APIs, learn generic concepts, and then enter the industry. After years of working in this field, however, I have identified some important concepts and tools that every data engineer (and software engineer) should know in order to perform well.
Let’s look at the top 5 concepts/tools that I believe every data engineer, data scientist, and software engineer should be familiar with. These are extremely common in the industry, and companies expect you to understand them.
1 Linux Shell or Bash Commands/Scripting
Let me tell you a short story. When I started out in this field, I was expected to put together a Kubernetes cluster to host a Kafka Streams application. For this, I needed to learn how to install a Kubernetes cluster. And there I was, sitting with some general knowledge of SQL, some Data Warehousing concepts, and a bit of Hadoop and Spark. Although I knew some basic HDFS commands, I wasn’t familiar with the general ecosystem of Linux servers: what the folder structure looks like (where a “folder”, it turned out, is called a “directory” in Linux lingo), how to run processes, how to automate tasks with bash scripts, and how common Apache tools are executed.
This mini project turned out to be a blessing for me: not only did I learn Linux, but I also learned how applications are installed and executed on Linux, how several Linux nodes are put together to form a cluster, and how to use the Linux terminal in general.
Don’t worry, you won’t be asked to put together a whole cluster at the start. My case was different, as I was expected to learn things on the job (or do R&D, as they say in corporate lingo). Still, learning shell commands and some general bash scripting truly gives you an edge.
Knowing shell commands is extremely important for data engineers. Wherever you go and whichever tool you use, you’ll find those tools hosted on Linux servers, whether you touch them directly or indirectly through serverless cloud services.
At the basic level, you should know how to navigate a Linux terminal and use basic commands like cp, mv, cd, ls, mkdir, cat, and an editor such as vim.
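A minimal terminal session using those commands might look like this (all paths here are made up for illustration):

```bash
cd /opt/data                  # move into a directory
ls -lh                        # list its contents with human-readable sizes
mkdir -p staging/2024         # create nested directories in one go
cp raw/events.csv staging/    # copy a file
mv staging/events.csv staging/events_raw.csv   # rename (move) a file
cat staging/events_raw.csv    # print the file's contents
vim staging/events_raw.csv    # open the file in an editor
```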
Let me list down some of the benefits of knowing the Linux shell:
- You can automate data pipelines.
- You can transfer data between different environments. For example, DEV to QA.
- You get more creative: you can embed shell commands in your programming language to accomplish tasks that would otherwise be cumbersome.
- You can interact with various cloud services by using their shell-based tools. As it turns out, this is pretty common.
- In most modern tools like Databricks or Snowflake, you still need some level of shell scripting. For example, in a Databricks notebook I can quickly list the directories in the Databricks filesystem by using the %sh magic command and executing ls in a cell. A small automation sketch follows this list.
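To make the automation point concrete, here is a minimal bash sketch that ships a day’s files from a DEV landing area to a QA server. Every path, host, and user below is made up for illustration, and it assumes key-based SSH access is already configured:

```bash
#!/usr/bin/env bash
set -euo pipefail                  # stop on errors and unset variables
shopt -s nullglob                  # skip the loop if nothing matches

SRC_DIR="/data/dev/landing"        # hypothetical DEV landing directory
QA_HOST="qa-server.example.com"    # hypothetical QA host
QA_DIR="/data/qa/incoming"

# Push today's CSV files to the QA environment over SSH
for f in "$SRC_DIR"/*_"$(date +%Y%m%d)".csv; do
    scp "$f" "etl_user@${QA_HOST}:${QA_DIR}/"
    echo "transferred: $f"
done
```

Drop a script like this into cron and you have the skeleton of a scheduled data transfer.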
2 Basics of Networking
I talked about working in Linux shells/terminals above. These Linux machines usually live on remote servers, which raises new questions: how do you connect to those servers? How does a TCP connection work? How do you SSH into a server? What is the SFTP protocol, and how do you connect to an SFTP server using an SFTP client?
That’s where knowledge of networking basics comes into play. Here are some of the reasons why networking is helpful:
- It makes it easier to understand the DataOps side of things. You can easily debug deployment issues and monitor log messages for errors.
- You can use tools like PuTTY and MobaXterm to SSH into remote servers. This is very common; in most places, you’ll find yourself doing it regularly.
- Sometimes, as a data engineer, you have to deal with SFTP-based storage systems. Knowing how to connect to SFTP storage is extremely important (see the short sketch after this list).
- If you are working in DataOps, you might have to monitor different servers to identify potential bottlenecks which can make data pipelines slower. Knowledge of networking really comes into play in this regard.
- When provisioning distributed clusters for data engineering tools like Apache Spark or Hadoop, familiarity with network topologies and configuration options will help you set up the servers your tools are deployed on.
- Familiarity with common networking protocols like TCP/IP helps you understand how the components of a distributed system interact. For example, networks fail and messages get retried, which is why data pipelines often need to be idempotent; knowing how TCP/IP handles acknowledgements and retransmission makes it much clearer why duplicates and retries happen in the first place.
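As mentioned in the list above, here is a minimal sketch of connecting to a remote server over SSH and pulling a file from an SFTP server; the hosts, users, and paths are hypothetical:

```bash
# Open an interactive shell on a remote server (assumes SSH keys are set up)
ssh etl_user@data-server.example.com

# Or run a single command remotely without an interactive session
ssh etl_user@data-server.example.com 'df -h /data'

# Fetch a file from an SFTP server in batch mode
sftp etl_user@sftp.example.com <<'EOF'
cd /exports/daily
get sales_extract.csv
bye
EOF
```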
Cloud technology is gaining popularity fast, and it has become paramount for data engineers to know how to connect to the cloud, as most companies now host their tools there. Basic networking knowledge makes learning cloud technologies considerably easier.
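For example, once you are comfortable with the shell, talking to cloud storage from the command line feels familiar. Here is a tiny sketch using the AWS CLI; the bucket name is hypothetical, and the CLI is assumed to be installed and configured with credentials:

```bash
# List objects under a prefix in a bucket
aws s3 ls s3://my-data-lake/raw/

# Upload a local file to the bucket
aws s3 cp ./events.csv s3://my-data-lake/raw/events.csv
```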
3 Version Control – Git
Okay, I can’t stress this enough. I have observed that when fresh engineers enter the industry, they really lack knowledge of Git and version control in general (including services like GitHub and Bitbucket).
I have seen junior engineers struggle with basic Git tasks like creating a branch, checking out a branch, and pushing their code to the feature branch made for it instead of accidentally pushing to the master branch *wink wink*. I have also seen people who, needing to push a single change, either re-clone the whole repository or create and merge a new branch for every single change. I can’t stress enough how valuable it is to learn Git & GitHub properly.
Contrary to popular belief, learning Git is quite easy. There are only a handful of concepts you need most of the time (a short sketch after this list shows them in action). Some of them include:
- Pulling code from a remote repository (GitHub, GitLab, Bitbucket, etc.)
- Creating branches
- Pushing code to branches
- Merging branches
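Here is a minimal sketch of that day-to-day workflow; the repository URL and branch names are hypothetical:

```bash
git clone https://github.com/example/data-pipelines.git   # get the repo once
cd data-pipelines

git pull origin main                  # bring your local copy up to date
git checkout -b feature/add-loader    # create and switch to a feature branch

# ...edit files...
git add src/loader.py
git commit -m "Add incremental loader"
git push -u origin feature/add-loader # push your branch, not master/main

# Locally, a merge looks like this (most teams do it via a pull request):
git checkout main
git merge feature/add-loader
```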
For production workloads, these steps are usually automated using CI/CD practices. More on that in the next section.
4 Understanding of CI/CD Pipelines
When we initially learn different tools, we usually only learn to build projects locally and don’t consider what it takes to deploy them for production use. That’s where an understanding of CI/CD is crucial.
CI/CD stands for Continuous Integration/Continuous Delivery (or Continuous Deployment). The term comes from the DevOps/DataOps world, where DevOps engineers build the infrastructure that automates the deployment of code, machine learning models, and data pipelines.
When you develop something, I recommend knowing at least the following tools:
- Git and GitHub – Also emphasized in point number 3 above.
- Jenkins – You don’t need to learn Jenkins to the full extent. However, you should have an understanding of building basic Jenkins pipelines and how they generally work.
- Integration of Jenkins and Git – Mostly a conceptual understanding is enough, as the integration will usually already be set up by the DevOps or DataOps folks. When you push your code to a Git-based repository, a Jenkins pipeline can detect the change and automatically deploy the code for production use. I won’t go into the details of how Jenkins works here.
- Docker – You should be able to work in Dockerized environments. You’ll often need to create your own Docker images to run on a server, and learning Docker is a great investment (see the sketch after this list).
- Bonus: Kubernetes – Though it sits at the DataOps end, understanding how it works is valuable. Many companies use Kubernetes for orchestration and load balancing, since it can prove cost-effective.
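To illustrate the Docker point above, here is a small sketch of building, running, and publishing an image from the shell; the image name, registry, and paths are hypothetical, and a Dockerfile is assumed to exist in the current directory:

```bash
# Build an image from the Dockerfile in the current directory
docker build -t my-pipeline:latest .

# Run the pipeline container, mounting a local data directory
docker run --rm -v "$(pwd)/data:/app/data" my-pipeline:latest

# In a CI job (e.g., Jenkins), the same build typically ends with a
# push to a registry so the image can be deployed:
docker tag my-pipeline:latest registry.example.com/my-pipeline:latest
docker push registry.example.com/my-pipeline:latest
```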
The tools above are just a few that I deem important; the list is by no means exhaustive. Remember, tech stacks vary from company to company and industry to industry, so you need to stay flexible and tool-agnostic.
5 Coding Standards & Best Practices
I have seen this trend particularly among data engineers, who focus mostly on creating data pipelines and solving data-related problems. In doing so, they often write spaghetti code and don’t worry about standardization or code management.
This leads to problems later, when new engineers join the team or the codebase grows. As the codebase grows, complexity grows, and with it technical debt is born, which becomes very costly down the road.
It is important to have both the knowledge of software engineering best practices and the discipline to apply them while coding.
Following are some of the benefits of adopting good coding practices:
- More maintainable and scalable code
- Time and resources saved in the long run
- Improved accuracy and reliability of data pipelines and systems
- Code that is easy to understand, test, and modify
- Robust and less prone to errors
- Less potential technical debt
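Since so much data-engineering glue code lives in bash, here is a small, hypothetical sketch of the conventions that keep such scripts maintainable: fail fast, quote variables, use named functions, and log what you do:

```bash
#!/usr/bin/env bash
# ingest_daily.sh -- stage one day's files for loading (illustrative only).
set -euo pipefail

readonly LANDING_DIR="/data/landing"   # hypothetical paths
readonly STAGING_DIR="/data/staging"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
}

stage_file() {
    local src="$1"
    local dest="${STAGING_DIR}/$(basename "$src")"
    cp "$src" "$dest"
    log "staged ${src} -> ${dest}"
}

main() {
    shopt -s nullglob                  # no matches means no loop iterations
    for f in "${LANDING_DIR}"/*.csv; do
        stage_file "$f"
    done
    log "done"
}

main "$@"
```

The same habits (small functions, clear names, explicit failure handling) carry over directly to Python, Scala, or whatever language your pipelines are written in.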
Conclusion
Although the tools and concepts I have mentioned are extremely important, you should not limit yourself to them. Remember, learning never stops! With the passage of time, tools change, techniques change, approaches change, and strategies change, but what remains constant are the basics. Keep your fundamentals clear and you are good to go.