Nowadays, almost every company is striving to become a data company. Executives want to put their data assets to work, analyzing them for better business outcomes and an edge over the competition.
To get there, companies hire data engineers and architects to design and build data warehouses and data lakes, and the engineers then build data pipelines to ingest data into them. Along the way, they follow a rigorous ETL process to ensure the data is transformed and reaches its destination without compromising its integrity.
When it comes to building big data pipelines, many organizations run into pitfalls that cause costly delays and errors. Imagine this scenario: an organization puts together a big data pipeline to collect data from multiple sources and quickly discovers that the data quality is poor and the performance is lacking. After months of trial and error, they realize they have fallen into traps that are well known and entirely avoidable.
This article covers common pitfalls to avoid when building big data pipelines, including the importance of clear data governance, the need to focus on data quality, and the value of selecting the right tools.
Common Pitfalls to Avoid When Building Big Data Pipelines
Let’s say you are a data engineering consultant for organization XYZ. You have built several data pipelines catering to different business problems, extracting data from multiple sources.
Then, one night, you get a call from your team lead: an important, highly business-critical pipeline is failing in production. What do you do? Of course, you rush to your computer and log in.
You immediately begin debugging. You review the code to identify where the issue might be coming from and check the related log files for errors. You also take a look at the data sources to ensure they are properly connected and configured. Finally, you review any recent changes to the pipeline to see whether one of them caused the failure, and you fix it.
But as soon as you fix it, another pipeline that depends on this one starts to fail. You wonder what went wrong and where. It turns out the data quality has been affected, among other issues that have not yet been identified.
If you have faced issues like this, you are not alone. Let’s discuss some of the common pitfalls to avoid while building big data pipelines.
1 Poor Data Quality from Data Producers
It is not uncommon to have successfully built a data pipeline that seems to be working fine: all the tests pass and the downstream systems perform as expected. But suddenly, things start to break. It turns out the source has started producing poor-quality data, and since no data quality checks were in place and no alerting was set up, the pipeline fails. The debugging time and the lack of value delivered to the business result in ever-rising costs.
I am sure most companies have faced a similar scenario at one point. There are two cases: either you are the data producer, or you are consuming data from other producers.
In both cases, there needs to be an explicit agreement between producers and consumers.
The following are some of the things that ought to be considered when planning a new pipeline.
- An agreed-upon schema for the data should be in place, and both parties should sign off on it.
- File types and naming conventions should be agreed upon.
- The cadence and expected frequency of file arrival should be decided.
- Data quality (DQ) checks should be integrated into the pipeline.
- A standardized source-to-target mapping document should be created for each pipeline.
Data quality is the degree to which data meets the requirements of its intended use. Poor-quality data undermines accuracy and reliability, which leads to incorrect insights and decisions. To ensure good data quality, organizations should ultimately have a comprehensive data quality assessment program in place; how far to take it depends on the size of the organization, but having the program is paramount. It should include data quality checks, data cleansing, and data monitoring.
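To make this concrete, here is a minimal sketch of in-pipeline data quality checks in PySpark. The orders dataset, its column names, and the S3 path are all assumptions for illustration; dedicated tools such as Great Expectations or Deequ implement the same idea with far more features.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

# Hypothetical input: an orders table landed by the ingestion job.
orders = spark.read.parquet("s3://my-bucket/raw/orders/")

checks = {
    # The primary key must never be null.
    "null_order_id": orders.filter(F.col("order_id").isNull()).count(),
    # Amounts must be non-negative.
    "negative_amount": orders.filter(F.col("amount") < 0).count(),
    # The primary key must be unique.
    "duplicate_order_id": orders.count() - orders.dropDuplicates(["order_id"]).count(),
}

failed = {name: n for name, n in checks.items() if n > 0}
if failed:
    # Stop the pipeline (and trigger an alert) instead of loading bad data downstream.
    raise ValueError(f"Data quality checks failed: {failed}")
```

The point is not the specific rules but that the checks run inside the pipeline and fail loudly, rather than letting bad data flow quietly into downstream systems.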
Data Contracts
I talked about having a certain agreement between both parties before building data pipelines. Now, there is a new buzzword in town – Data Contracts.
Having data contracts in place can be a game changer. At a high level, a data contract is an agreement between a data producer and its consumers about the shape and semantics of the data the consumer can expect to receive and work with.
Data contracts typically include information such as the data types, field names, constraints, and validation rules that apply to the data being exchanged. They may also define the formats and protocols used for transmitting the data, such as JSON, XML, or SOAP.
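As a rough sketch, a contract can be expressed as a JSON Schema and enforced at the consumer boundary. The field names and rules below are purely illustrative, and real contracts usually live in a shared registry rather than in application code.

```python
from jsonschema import validate, ValidationError

# Illustrative contract: required fields, types, and constraints for one record type.
ORDER_CONTRACT = {
    "type": "object",
    "required": ["order_id", "order_date", "amount", "currency"],
    "properties": {
        "order_id": {"type": "string"},
        "order_date": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"},
        "amount": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "additionalProperties": False,
}

def honours_contract(record: dict) -> bool:
    """Return True if the record satisfies the contract, False otherwise."""
    try:
        validate(instance=record, schema=ORDER_CONTRACT)
        return True
    except ValidationError:
        return False

# A record with the wrong date format violates the contract.
print(honours_contract({"order_id": "A1", "order_date": "01/05/2023",
                        "amount": 10.5, "currency": "USD"}))  # False
```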
There is plenty of introductory material on data contracts available online if you want to dig deeper.
2 Date and Timestamp Issues
Consider this: you have written business logic to parse a certain date column and separate out information like year, month, and day. While writing the pipeline, you assumed that the date format would remain the same. However, no contract was signed regarding the schema of the data.
Then someone on the source side, working on the data, decided for some reason that the date format should change, without considering how this would affect the downstream systems.
As soon as the date format changed, your pipeline started to break, all because of that small piece of parsing logic. Even a minuscule detail like this can have dire consequences if it is not taken into consideration. Ideally, this is solved with stringent constraints at the source level, based on the agreement signed between producers and consumers, and with the pipeline itself parsing dates defensively, as sketched below.
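Here is a minimal PySpark sketch of that defensive parsing. The column names, paths, and quarantine location are assumptions; the idea is simply to detect a format change at the boundary instead of letting it corrupt downstream tables.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("defensive-date-parsing").getOrCreate()

AGREED_FORMAT = "yyyy-MM-dd"  # the format fixed in the producer/consumer agreement

events = spark.read.parquet("s3://my-bucket/raw/events/")  # hypothetical source

# With ANSI mode off (the default), to_date() returns NULL for strings that
# do not match the given format.
parsed = events.withColumn("event_date", F.to_date("raw_date", AGREED_FORMAT))

bad_rows = parsed.filter(F.col("event_date").isNull() & F.col("raw_date").isNotNull())
if bad_rows.limit(1).count() > 0:
    # Quarantine the offending rows and fail loudly, rather than letting the
    # year/month/day extraction silently produce NULLs downstream.
    bad_rows.write.mode("append").parquet("s3://my-bucket/quarantine/events/")
    raise ValueError(f"raw_date no longer matches the agreed format {AGREED_FORMAT}")

clean = (parsed
         .withColumn("year", F.year("event_date"))
         .withColumn("month", F.month("event_date"))
         .withColumn("day", F.dayofmonth("event_date")))
```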
Another issue: consumers sometimes ask for the date column as a DateType instead of a StringType. The thing is, a DateType value does not carry a display format; only strings do. A common misconception is that the date value you see while querying the data is what the DateType object "looks like". In reality, it is the querying tool or IDE that converts the DateType object into a string using a pre-configured format.
Now, back to the consumer’s demand: the stakeholders want the date value in a specific format, but the column is a DateType. When they query it, they get whatever format is configured in the IDE or editor they are using, and they ask the data engineers to "fix" it. The only two ways to fix it are for the data engineer to convert the column to a StringType in the required format, or for the stakeholders to configure the desired display format in their tool.
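A small PySpark sketch of the first option; the dd/MM/yyyy target format and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("date-display-format").getOrCreate()

# A tiny DataFrame with a proper DateType column.
df = (spark.createDataFrame([("2023-05-01",)], ["raw_date"])
      .withColumn("order_date", F.to_date("raw_date")))  # DateType, no display format

# Option 1: materialise a StringType column in the format the stakeholders asked for.
df = df.withColumn("order_date_display", F.date_format("order_date", "dd/MM/yyyy"))
df.show()

# Option 2 needs no pipeline code at all: keep the column as DateType and set the
# desired display format in the BI tool or IDE, which is where rendering happens.
```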
Some of the other issues which may occur are:
- Timezone Conversion: Different data sources may be configured with different timezones, and standardizing on a single timezone in the sink systems can become a challenge (see the sketch after this list).
- Missing or Incomplete Data: Suppose you need to calculate a time interval and create a timestamp column, but the columns needed to compute that interval are missing or contain nulls. This will cause issues.
- Time Drift: Clocks on physical servers can drift, which used to cause data synchronization issues. Cloud providers largely take care of this now, so it is a rare problem, but it is worth knowing about.
- Timestamp Precision: Some systems or data sources may record timestamps with different levels of precision, which can make it difficult to compare or merge data.
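For the timezone point above, here is a minimal sketch of normalizing timestamps from differently configured sources into a single UTC column. The input rows and column names are invented for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tz-normalisation").getOrCreate()

# Hypothetical: each source row carries a local timestamp plus the timezone
# that source system is configured with.
rows = [("2023-05-01 09:30:00", "America/New_York"),
        ("2023-05-01 09:30:00", "Asia/Karachi")]
df = (spark.createDataFrame(rows, ["local_ts", "source_tz"])
      .withColumn("local_ts", F.to_timestamp("local_ts")))

# Interpret each local timestamp in its source timezone and convert it to UTC,
# so the sink works with one standard timezone.
utc = df.withColumn("event_ts_utc", F.to_utc_timestamp("local_ts", F.col("source_tz")))
utc.show(truncate=False)
```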
3 Updates in Source APIs – Unintentional Consequences
Sometimes we use open-source or public APIs to ingest data into our data lakes or data warehouses. The people behind these APIs may not version their changes and, as a result, may modify REST endpoints in ways that break the downstream pipelines.
Not only that, but if the API undergoes big changes, the data that is fetched is also affected. There can be schema changes, or new options may be introduced to include or exclude columns that were previously always available.
Source APIs can also impact the performance of the data pipeline, positively or negatively. This can happen when the API limits the number of requests that can be made, its response times change, its underlying infrastructure changes, or new rate limits are introduced.
Now, to mitigate these issues, consider the following (a short sketch of a defensive API call follows the list):
- Instead of relying on free but untrustworthy APIs, consider subscribing to or purchasing premium ones.
- Make sure you monitor any potential changes and think ahead.
- Ideally, have a configurable, dynamic ETL pipeline that can absorb such changes without code modifications.
- Add DQ checks to ensure the data coming from the API is on par with the requirements.
- Have alert notification systems in place in case the DQ checks fail.
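Here is a minimal sketch of calling a source API defensively: backing off when rate-limited and verifying the response still carries the fields the pipeline expects. The URL and field names are hypothetical.

```python
import time
import requests

API_URL = "https://api.example.com/v1/records"   # hypothetical endpoint
EXPECTED_FIELDS = {"id", "name", "created_at"}   # fields the pipeline relies on

def fetch_records(max_retries: int = 5) -> list[dict]:
    for attempt in range(max_retries):
        response = requests.get(API_URL, timeout=30)
        if response.status_code == 429:           # rate limited: back off and retry
            time.sleep(2 ** attempt)
            continue
        response.raise_for_status()
        records = response.json()
        # Lightweight schema check before the data enters the pipeline.
        for record in records:
            missing = EXPECTED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"API response is missing expected fields: {missing}")
        return records
    raise RuntimeError("Gave up after repeated rate-limit responses")
```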
4 Infrastructure Failures
One of the most common categories of issues in data pipelines relates to infrastructure failures. Infrastructure has a significant effect on data pipelines, in both positive and negative ways. Let’s discuss some of these issues.
Before selecting your infrastructure, you need to ask questions. Questions like:
- What are your storage, compute, and memory requirements?
- Which cloud provider is best suited to your needs?
- How will you handle data replication to keep the data highly available?
- How will you handle network latency?
Some of the causes of infrastructure failures are as follows:
- Increased Latency & Reduced Throughput: Network issues can reduce throughput and cause extreme delays in data ingestion.
- Constantly Increasing Jobs: If new jobs are constantly being developed, you need a scalable infrastructure. We faced this issue on our EMR cluster, where multiple developers submitted jobs for different business units and the cluster would often get stuck.
- Cost Increase: When it comes to cloud infrastructure, you need to be very careful. It is common for engineers building a POC for a new pipeline to spin up a new EC2 instance for development and then forget to shut it down, so the cost keeps climbing (see the sketch after this list). On top of that, as data volume grows, compute requirements grow as well, which ultimately means a significant increase in cost and multiple emails from top management.
- Security Issues & Vulnerabilities: This is by far the most important issue on this list. If you are not following security best practices, you are in for a lot of trouble tomorrow, if not today. Make sure that whatever environment you build your infrastructure in, you follow its security best practices and hire the right expertise for it.
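On the forgotten-instance problem, here is a minimal sketch using boto3 that flags long-running development EC2 instances for review. The "Environment=dev" tag and the seven-day threshold are assumptions; many teams automate this with budgets, alarms, or scheduled clean-up jobs instead.

```python
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

# Page through all running instances tagged as development machines.
paginator = ec2.get_paginator("describe_instances")
pages = paginator.paginate(Filters=[
    {"Name": "instance-state-name", "Values": ["running"]},
    {"Name": "tag:Environment", "Values": ["dev"]},
])

for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            if instance["LaunchTime"] < cutoff:
                print(f"Review {instance['InstanceId']}: running since {instance['LaunchTime']}")
```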
5 As Volume Increases, Complexity Increases
If your data pipeline is not built to handle increases in volume and velocity, it can ultimately become slower, less reliable, and more costly than it needs to be.
An increasing volume of data can hurt performance, which matters most when the pipeline is highly business-critical. A slower pipeline means slower response times for queries in your BI tool, which can translate into real business losses.
Oh! And did I say increased costs? How dare I not? An increase in volume means an increase in compute and storage costs. This means they HAVE to be managed smartly.
So what do you need to do? Be smarter about how you build data pipelines, especially when data arrives incrementally: process only what is new instead of reprocessing everything (see the sketch below). Plan ahead and make your pipeline scalable, and hire a team that keeps the growing cost in mind.
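As a sketch of what incremental processing can look like, here is a minimal high-watermark pattern in PySpark. The paths, the updated_at column, and the parquet-based state store are all assumptions; in a real system the watermark usually lives in your orchestrator's state or a table format such as Delta or Iceberg.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

SOURCE_PATH = "s3://my-bucket/raw/orders/"            # hypothetical paths
TARGET_PATH = "s3://my-bucket/curated/orders/"
WATERMARK_PATH = "s3://my-bucket/state/orders_watermark/"

# Read the last processed timestamp; fall back to the epoch on the first run.
try:
    last_ts = spark.read.parquet(WATERMARK_PATH).first()["last_ts"]
except Exception:
    last_ts = "1970-01-01 00:00:00"

# Only pick up rows that arrived since the previous run.
new_rows = (spark.read.parquet(SOURCE_PATH)
            .filter(F.col("updated_at") > F.lit(last_ts)))

if new_rows.limit(1).count() > 0:
    new_rows.write.mode("append").parquet(TARGET_PATH)
    # Persist the new high-watermark (as a string) for the next run.
    (new_rows
     .agg(F.max(F.col("updated_at").cast("string")).alias("last_ts"))
     .write.mode("overwrite").parquet(WATERMARK_PATH))
```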
6 Conclusion
To conclude, the list above is based on my own experience, and the pitfalls are not limited to the ones mentioned here. But it is paramount to plan ahead and mitigate as many of them as possible to avoid unnecessary costs and degraded performance.
One of the most effective mitigations is a solid data governance strategy. Data governance refers to the processes, policies, and people responsible for managing the accessibility, usability, integrity, and security of data. Without clear data governance, an organization is likely to experience problems with data quality, accuracy, and consistency; how formal the governance needs to be will, of course, depend on the size of the organization.
Without it, big data pipelines can quickly become unruly: think large-scale data duplication, mislabeled data, and data being shared across departments without proper authorization. To set a big data pipeline up for success, organizations should establish clear data governance policies and procedures, including roles and responsibilities for data stewardship, data access control, and data auditing. They should also invest in data governance tools to help them manage and monitor their pipelines in a secure and compliant manner.