
5 Things I Wish I Knew Before Becoming a Data Engineer

Here are 5 things Data Engineers should know.

Hamza Nasir by Hamza Nasir
January 25, 2023
in Engineering & Tech
Reading Time: 8 mins read

Table of Contents

  • 1. Linux Shell or Bash Commands/Scripting
  • 2. Basics of Networking
  • 3. Version Control – Git
  • 4. Understanding of CI/CD Pipelines
  • 5. Coding Standards & Best Practices

Every developer, engineer, analyst, scientist, and doctor starts somewhere, and few of us have a full picture of the field we are stepping into. That was certainly my case. I graduated in Software Engineering from COMSATS University Islamabad and joined my first company as a Backend Engineer, working on ASP.NET MVC along with some JavaScript and jQuery.

However, I always felt I was good with data. Even back in my university days, I was adept at manipulating data using languages like Java, Python, and mainly SQL. At the time, I didn't even know that what I was doing is commonly called data wrangling.

So after having this epiphany, I decided to give the field of data a shot. The epiphany wasn't the only factor, though: buzzwords like Data Science, Big Data, and Data Engineering had started circulating on the internet, and the field was gaining popularity (as deserved).

A lot of people started moving into the field, especially computer science and software engineering grads eager to learn data engineering and data science.

When we start out in this field, we generally expect to need familiarity with popular tools like Apache Hadoop, Apache Spark, and Apache Kafka, some basic cloud concepts, how a data pipeline is built, and programming. In other words, we expect to need what we are taught in universities or online tutorials. However, when we enter the industry, it turns out these tools aren't the only things that matter. This is exactly what I faced when I transitioned into the data field.

Now, being beginners, we tend to skip some important concepts that are used heavily in the industry but are not given enough emphasis in software engineering schools.

5 Things Data Engineers Should Know

We learn to code in our favorite language, learn our favorite APIs and generic concepts, and then enter the industry. After years of working in this field, however, I have identified some important concepts and tools that every data engineer (and software engineer) should know in order to perform well.

Let's look at the top 5 concepts and tools that I believe every data engineer, data scientist, or software engineer should be familiar with. These are extremely common in the industry, and companies expect you to understand them.

1. Linux Shell or Bash Commands/Scripting


Let me tell you a short story. When I started out in this field, I was expected to put together a Kubernetes cluster to host a Kafka Streams application. For that, I needed to learn how to install a Kubernetes cluster. And there I was, sitting with some general knowledge of SQL, some data warehousing concepts, and some Hadoop and Spark. Although I knew some basic HDFS commands, I wasn't familiar with the general ecosystem of Linux servers: what the folder structure in Linux is (where the term "folder" turned out to be "directory" in Linux lingo), how to run processes, how to automate tasks with bash scripts, and how common Apache tools are executed.

This mini project turned out to be a blessing for me: not only did I learn Linux, but I also learned how applications are generally installed and executed on Linux, how several Linux nodes are put together to form a cluster, and how to use the Linux terminal.

Don't worry, you won't be asked to put together a whole cluster at the start. My case was different, as I was expected to learn on the job (or do "R&D," as they say in corporate lingo). Still, learning shell commands and some general bash scripting truly gives you an edge.

Knowing shell commands is extremely important for data engineers. Wherever you go and whichever tool you use, you'll find those tools hosted on Linux servers, whether directly or indirectly through serverless cloud services.

At a basic level, you should know how to navigate a Linux terminal and use basic commands like cp, mv, cd, ls, mkdir, cat, and vim.

Let me list down some of the benefits of knowing the Linux shell:

  • You can automate data pipelines.
  • You can transfer data between different environments. For example, DEV to QA.
  • You get creative. You can embed shell commands in your programming language to do things more creatively.
  • You can interact with various cloud services by using their shell-based tools. As it turns out, this is pretty common.
  • In most modern tools like Databricks or Snowflake, you have to use some level of shell scripting to get things done. For example, in a Databricks notebook I can quickly list the directories in the Databricks filesystem by putting the %sh magic command in a cell and running ls.
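To make the automation point concrete, here is a minimal sketch of a bash step that stamps incoming files with a date and moves them along a pipeline. The directory layout and the sample file are hypothetical:

```shell
#!/usr/bin/env bash
# Minimal sketch of a shell-automated pipeline step; the directory
# layout and the sample file below are hypothetical.
set -euo pipefail

LANDING_DIR="./landing"      # where raw files arrive
PROCESSED_DIR="./processed"  # where stamped files go

mkdir -p "$LANDING_DIR" "$PROCESSED_DIR"
echo "id,value" > "$LANDING_DIR/sample.csv"   # stand-in for an incoming file

# Stamp each file with today's date and move it along the pipeline.
for f in "$LANDING_DIR"/*.csv; do
  mv "$f" "$PROCESSED_DIR/$(date +%F)_$(basename "$f")"
done

ls "$PROCESSED_DIR"
```

A script like this, scheduled with cron, is often all it takes to automate a small data movement task.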

2. Basics of Networking


I have talked about working in Linux shells/terminals above. These Linux machines are hosted on servers, which raises a few questions: how do you connect to those servers? How does a TCP connection work? How do you SSH into a server? What is the SFTP protocol, and how do you connect to an SFTP server using an SFTP client?

That’s where the knowledge of networking basics comes into play. Following are some of the reasons why networking is helpful:

  • It makes it easier to understand the DataOps side of things. You can easily debug deployment issues and monitor log messages for errors.
  • You can use tools like PuTTY and MobaXterm to SSH into remote servers. This is very common; in most places, you'll find yourself doing this.
  • Sometimes, as data engineers, you have to deal with SFTP-based storage systems. Knowing how to connect to an SFTP server is extremely important.
  • If you are working in DataOps, you might have to monitor different servers to identify potential bottlenecks which can make data pipelines slower. Knowledge of networking really comes into play in this regard.
  • When provisioning distributed clusters of different data engineering tools like Apache Spark or Hadoop, you are required to have knowledge of networking. Familiarity with network topologies & configuration options will help you in provisioning servers for the deployment of your tools.
  • When you are familiar with common networking protocols like TCP/IP, it becomes easier to understand how the components of a distributed system interact. For example, your data pipelines may need to be idempotent, and some knowledge of how networks retry and redeliver messages helps you understand why.

In the modern tech world, cloud technology is gaining popularity fast. It has become paramount for data engineers to know how to connect to the cloud, as most companies now host their tools there. Basic networking knowledge makes learning cloud technologies much easier.
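As a small illustration, much of the day-to-day SSH/SFTP workflow can be captured in an `~/.ssh/config` entry. The host alias, address, username, and key path below are all hypothetical:

```
# Sketch of an ~/.ssh/config entry (host, address, and key are made up).
Host etl-server
    HostName 10.0.0.12
    User dataeng
    IdentityFile ~/.ssh/id_ed25519
```

With this in place, `ssh etl-server` opens a shell on that box and `sftp etl-server` starts a file-transfer session against the same host, with no flags to remember.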

3. Version Control – Git


Okay, I can't stress this enough: I have observed that fresh engineers entering the industry often really lack knowledge of Git and version control in general (including services like GitHub and Bitbucket).

I have seen junior engineers struggle with basic git tasks like creating a branch, checking out a branch, and pushing their code to the branch made for it instead of accidentally pushing to the master branch *wink wink*. I have also seen people who, when they need to push a single change, either clone the whole repo again or create and merge a new branch for every single change. I can't stress enough how valuable it is to learn Git and GitHub properly.

Contrary to popular belief, learning git is quite easy. There are only a handful of concepts you need most of the time, including:

  • Pulling code from the remote branch (from GitHub, GitLab, BitBucket, etc.)
  • Creating branches
  • Pushing code to branches
  • Merging branches
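These operations can be seen end-to-end in a throwaway local repository. Branch and file names below are illustrative, and pulling/pushing would additionally need a remote such as GitHub:

```shell
# The everyday git loop, demonstrated in a throwaway local repository.
set -euo pipefail
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"   # identity needed for commits
git config user.name  "Your Name"
git commit -q --allow-empty -m "Initial commit"
default=$(git symbolic-ref --short HEAD)  # master or main, depending on setup

git checkout -q -b feature/add-readme     # create a branch and switch to it
echo "# demo" > README.md
git add README.md
git commit -q -m "Add README"             # here you would `git push` the branch

git checkout -q "$default"                # back to the default branch
git merge -q feature/add-readme           # merge the feature branch in
git log --oneline
```

In a real project, the merge usually happens through a pull request on GitHub or Bitbucket rather than locally, but the underlying git operations are the same.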

For production workloads, these steps can usually be automated using CI/CD practices. More on that later.

Refer to the following practical and easy-to-follow tutorials on YouTube:

  • Git and GitHub for Beginners Tutorial | YouTube
  • Git Tutorial for Beginners – Git & GitHub Fundamentals In Depth | YouTube

4. Understanding of CI/CD Pipelines


When we first learn different tools, we usually only build projects locally and don't consider what it takes to deploy them for production use. That's where an understanding of CI/CD is crucial.

CI/CD stands for Continuous Integration/Continuous Delivery (or Deployment). The term comes from the DevOps/DataOps world. DevOps engineers are usually responsible for building the infrastructure that automates the deployment of code, machine learning models, and data pipelines.

When you develop something, I recommend knowing at least the following tools:

  • Git and GitHub – Also emphasized in point 3 above.
  • Jenkins – You don't need to learn Jenkins to its full extent, but you should understand how to build basic Jenkins pipelines and how they generally work.
  • Integration of Jenkins and Git – Mostly only an understanding is required, as you'll usually get an already-integrated environment set up by the DevOps or DataOps folks. When you push your code to a git-based repository, a Jenkins pipeline can detect the change and automatically deploy the code for production use.
  • Docker – You should be able to work in Dockerized environments. You'll often need to create your own Docker images to run on a server, and learning Docker is a great investment.
  • Bonus: Kubernetes – Though it is used at the DataOps end, understanding how it works is valuable. Many companies use Kubernetes for orchestration and load balancing, since it can prove cost-effective.
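For the Docker point above, the image for a pipeline job often boils down to a few lines. The base image choice and the `run_pipeline.py` entrypoint below are assumptions for illustration:

```dockerfile
# Sketch of a Dockerfile for a Python-based pipeline job (names are hypothetical).
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt   # install dependencies first for layer caching
COPY . .
CMD ["python", "run_pipeline.py"]
```

Building with `docker build -t my-pipeline .` and running with `docker run my-pipeline` gives you the same environment locally and on the server.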

The tools above are just a few that I deem important, but the list is not limited to them. Remember, tech stacks vary from company to company and industry to industry, so you need to be flexible and tool-agnostic.
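As a sketch of how these pieces fit together, here is a minimal declarative Jenkinsfile. The stage commands, test runner, and deploy script are all hypothetical placeholders:

```groovy
// Sketch of a declarative Jenkinsfile; commands and scripts are illustrative.
pipeline {
    agent any
    stages {
        stage('Checkout') {
            steps { checkout scm }                 // pull the code that was pushed
        }
        stage('Test') {
            steps { sh 'pytest tests/' }           // assumes a Python project
        }
        stage('Build image') {
            steps { sh 'docker build -t my-pipeline:latest .' }
        }
        stage('Deploy') {
            steps { sh './deploy.sh' }             // hypothetical deploy script
        }
    }
}
```

A webhook from the git repository triggers this pipeline on each push, which is the "integration of Jenkins and Git" mentioned above.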

5. Coding Standards & Best Practices


I have seen this trend particularly in data engineers, who mostly focus on creating data pipelines and solving data-related problems. In doing so, they often write spaghetti code and don't worry about standardization and code management.

This causes problems later, when you onboard new engineers or the codebase grows. As the codebase grows, complexity grows, and with complexity comes technical debt, which becomes very costly down the road.

It is important to have both the knowledge and the discipline to apply good software engineering practices while coding.

Following are some of the benefits of adopting good coding practices:

  • More maintainable and scalable code
  • Time and resources saved in the long run
  • Improved accuracy and reliability of data pipelines and systems
  • Code that is easy to understand, test, and modify
  • Robust and less prone to errors
  • Less accumulated technical debt
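Even shell scripts benefit from these habits. Here is a minimal sketch, with illustrative names, of a pipeline script that fails fast, avoids magic strings, and keeps its logic in small functions:

```shell
#!/usr/bin/env bash
# Illustrative names throughout; the point is the habits, not the task.
set -euo pipefail                  # fail fast on errors, unset vars, broken pipes

readonly INPUT_DIR="${1:-./data}"  # configurable, no magic strings mid-script

# Small, single-purpose functions are easier to read, test, and modify.
count_csv_files() {
  local dir="$1"
  find "$dir" -maxdepth 1 -name '*.csv' | wc -l | tr -d ' '
}

mkdir -p "$INPUT_DIR"
touch "$INPUT_DIR/a.csv" "$INPUT_DIR/b.csv"   # stand-ins for real input files
echo "csv files in $INPUT_DIR: $(count_csv_files "$INPUT_DIR")"
```

The same ideas carry over to Python or Scala pipelines: fail loudly, name things clearly, and keep functions small.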

Conclusion

Although the tools and concepts I have mentioned are extremely important, you should not limit yourself to them. Remember, learning never stops! With time, tools change, techniques change, approaches change, and strategies change, but the basics remain constant. Keep your fundamentals clear and you are good to go.

 

Hamza Nasir

Hamza, a.k.a. The Big Data Lad, is a business-savvy Data Engineer and consultant, currently working as a Big Data and Cloud Consultant for Systems Limited. He is highly passionate about tech and data analytics.
