Certainty in streaming real-time ETL
https://www.indiumsoftware.com/blog/certainty-in-streaming-real-time-etl/ · Wed, 15 Feb 2023

Introduction

A continuous streaming ETL solution ensures that real-time data from your mission-critical operational systems, whether on-premises or in the cloud, is loaded into your cloud-based analytical systems on time. Because the data flows continuously, the data used for crucial operational decisions must be reliable. Striim guarantees the dependability of a streaming ETL solution by providing efficient, end-to-end data integration between the source and target systems. Data from the source can be transformed in the real-time pipeline before it is delivered to the targets, and Striim applications can be built for a wide variety of use cases.

About the customer

Glitre Energi is a power company that manages the power grid, retails electricity, and offers broadband services. It supplies electricity to about 90,000 people and oversees power lines that pass through heavily populated areas.

Problems with the current design

  • Metering data must be loaded into SQL Server databases from event-based sources.
  • Metering events with the same filename must be assigned the same ID, regardless of any other parameters in the source events.
  • Normalizing real-time metering events is difficult in relational database systems: unless all previous events have already been delivered to the target, comparing data in real time and assigning IDs becomes hard.

Solution Provided

  • Source applications send power-supply meter values, referred to as meter events, as JSON files to Azure Event Hubs.
  • Because of reporting lags, each file contains an arbitrary number of events with varying timestamps.
  • Each event must keep a link to the file in which it was received, so that it remains traceable back to the source.
  • The events are written to two SQL Server tables: one holding metering data and the other holding metering-file data.
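
To make the shape of this data concrete, the sketch below shows how a single JSON meter file could fan out into rows for the two target tables. The field names and structure are illustrative only, not Glitre Energi's actual schema:

```python
import json

def split_meter_file(raw_json: str, file_id: int):
    """Split one meter file (as received from Azure Event Hubs) into rows
    for the two SQL Server targets: one file record, plus one metering
    record per event that keeps a link (file_id) back to its source file."""
    doc = json.loads(raw_json)
    file_row = {"id": file_id, "filename": doc["filename"]}
    event_rows = [
        {"file_id": file_id, "meter": e["meter"],
         "value": e["value"], "timestamp": e["timestamp"]}
        for e in doc["events"]  # any number of events, timestamps may lag
    ]
    return file_row, event_rows

raw = json.dumps({
    "filename": "meters_2023-02-15.json",
    "events": [
        {"meter": "M1", "value": 12.5, "timestamp": "2023-02-15T10:00:00Z"},
        {"meter": "M1", "value": 13.1, "timestamp": "2023-02-15T11:00:00Z"},
    ],
})
file_row, event_rows = split_meter_file(raw, file_id=42)
```

Every metering row carries the file's ID, which is what preserves traceability back to the source file.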

Also Read: Use Cases for a Unified Data Integration and Streaming Platform like Striim

Usage Of Constituents

Cache

A memory-based cache of non-real-time historical or reference data, obtained from an external source, makes it possible to retrieve the latest unique identifier assigned in the target table.

External Cache

When data is needed, Striim queries an external database. As real-time data arrives, Striim joins it against the same queried data to determine whether the incoming record is already present in the target table.
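
The query pattern behind such a lookup can be sketched as follows. Here sqlite3 stands in for the actual SQL Server target, and the table and column names are hypothetical:

```python
import sqlite3

# In-memory sqlite3 stands in for the SQL Server target table that an
# external cache would query on demand.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE meter_files (id INTEGER, filename TEXT)")
con.execute("INSERT INTO meter_files VALUES (1, 'a.json'), (2, 'b.json')")

def query_file_id(filename: str):
    """Return the ID already assigned to this filename, or None if the
    filename has not been seen in the target table yet."""
    row = con.execute(
        "SELECT id FROM meter_files WHERE filename = ?", (filename,)
    ).fetchone()
    return row[0] if row else None

def max_file_id() -> int:
    """Highest ID handed out so far, used to seed the next assignment."""
    return con.execute(
        "SELECT COALESCE(MAX(id), 0) FROM meter_files"
    ).fetchone()[0]
```

The same two questions, "does this filename already have an ID?" and "what is the highest ID so far?", are all the downstream pipeline needs to answer per event.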

Windows

A window bounds the data set by a specific number of events, a period of time, or both, so that the bounded data can be aggregated, joined, or used in calculations. This brings the target-database data and the real-time data together in one place, where the downstream pipeline can carry out its transformations.
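
A count-bounded window can be simulated in a few lines. This is a simplified stand-in for the concept, not Striim's actual window component (which can also bound by time, or by both count and time):

```python
from collections import deque

class CountWindow:
    """Toy count-bounded window: keeps only the last `size` events, the
    way a streaming window bounds data before a query runs over it."""
    def __init__(self, size: int):
        self.events = deque(maxlen=size)  # old events fall off automatically

    def add(self, event: dict) -> None:
        self.events.append(event)

    def aggregate(self, key: str):
        """Aggregate (here: sum) over only the events inside the window."""
        return sum(e[key] for e in self.events)

w = CountWindow(size=3)
for v in [10, 20, 30, 40]:
    w.add({"value": v})
# the window now holds only the last 3 events: 20, 30, 40
```

Bounding the stream this way is what lets a continuous query compare "recent" real-time events against reference data without scanning the full history.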

Continuous Query

A continuous query specifies the logic of an application; it can filter, aggregate, join, enrich, and transform events. Queries carry the logic of the data pipeline and combine data from various sources.

Read About Our Success: How we assisted one of the biggest manufacturing companies by setting up a PySpark ETL process that moved sales data, originally collected from several ERP systems, from an on-premises MySQL database to Redshift on the AWS cloud.


The use case’s high-level representation is shown in the image below:

Flow Insights

  • The Striim application must detect events that arrive with new files, extract the filename, assign a unique integer ID, and store these values in a separate table in the SQL Server database.
  • For each event processed, the application queries an external cache to check whether the filename already exists in the target table.
    • If it exists, the CQ retrieves the ID for that filename, attaches it to the incoming event data, and sends the event to the target table.
    • If it does not, the CQ increments the ID, assigns it to the new filename, and sends the data to both target tables.
  • The Striim cache can be loaded with the most recently received filenames and IDs, so that the ID can be incremented from the right starting point.
  • The cache must be refreshed regularly, at a rate that depends on how frequently events are delivered to the target tables; in effect, it needs to be mutable.
  • Striim windows bound both the real-time event data and the file data, so the continuous queries can use them to make these decisions.
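
The decision flow above can be sketched as a single per-event function. This is a plain-Python simulation of the logic, not Striim code; the dictionary `file_ids` plays the role of the cache over the files table:

```python
def process_event(event: dict, file_ids: dict, next_id: int):
    """One pass of the flow: look up the event's filename, reuse or assign
    an ID, and emit rows for the two target tables. Returns the metering
    row, a new file row (or None if the file was already known), and the
    updated ID counter."""
    filename = event["filename"]
    if filename in file_ids:                      # filename seen before
        file_id, new_file_row = file_ids[filename], None
    else:                                         # new file: increment the ID
        next_id += 1
        file_ids[filename] = next_id
        file_id = next_id
        new_file_row = {"id": next_id, "filename": filename}
    meter_row = {"file_id": file_id,
                 "meter": event["meter"], "value": event["value"]}
    return meter_row, new_file_row, next_id

cache, next_id, out = {}, 0, []
for ev in [{"filename": "a.json", "meter": "M1", "value": 1.0},
           {"filename": "a.json", "meter": "M2", "value": 2.0},
           {"filename": "b.json", "meter": "M1", "value": 3.0}]:
    meter_row, file_row, next_id = process_event(ev, cache, next_id)
    out.append((meter_row, file_row))
```

Both events from "a.json" receive the same ID, and only the first arrival of a filename produces a row for the files table, which is exactly the invariant the solution requires.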

Conclusion

Continuous query components made it simple to compare event data during the real-time load and reach decisions. With the aid of window and cache components, the data that must be retrieved from external sources was planned out efficiently. Striim's strength is that data can be joined wherever it is needed to obtain the desired output, which helped Glitre Energi normalize its metering events in its relational systems.

Please get in touch with us if you still need help or if your requirements are still unclear; our team of professionals is always available to assist you.

Building Reliable Data Pipelines Using DataBricks’ Delta Live Tables
https://www.indiumsoftware.com/blog/building-reliable-data-pipelines-using-databricks-delta-live-tables/ · Fri, 16 Dec 2022

The enterprise data landscape has become more data-driven and continues to evolve as businesses adopt digital transformation technologies such as IoT and mobile data. In this scenario, the traditional extract, transform, and load (ETL) process used for preparing data, generating reports, and running analytics can be hard to maintain, because it relies on manual processes for testing, error handling, recovery, and reprocessing. Data pipeline development and management can also become complex in the traditional ETL approach, and data quality issues can degrade the quality of insights. The high velocity of data generation can make implementing batch or continuous streaming pipelines difficult, and data engineers should be able to change the latency flexibly without rewriting the pipeline. Scaling up as data volumes grow can likewise be difficult with manual coding, leading to more time and cost spent on development, error handling, data cleanup, and reprocessing.

To know more about Indium and our Databricks and DLT capabilities

Contact us now

Automating Intelligent ETL with Delta Live Tables

Given the fast-paced changes in the market environment and the need to retain competitive advantage, businesses must address the challenges, improve efficiencies, and deliver high-quality data reliably and on time. This is possible only by automating ETL processes.

The Databricks Lakehouse Platform offers Delta Live Tables (DLT), a new cloud-native managed service that facilitates the development, testing, and operationalization of data pipelines at scale, using a reliable ETL framework. DLT simplifies the development and management of ETL with:

  • Declarative pipeline development
  • Automatic data testing
  • Monitoring and recovery with deep visibility

With Delta Live Tables, end-to-end data pipelines can be defined easily by specifying the source of the data, the logic used for transformation, and the target state of the data. This eliminates the manual integration of siloed data processing tasks. Data engineers can also ensure data dependencies are maintained across the pipeline automatically and apply data management to reuse ETL pipelines. Whether each table is computed incrementally or completely during a batch or streaming run can be specified as needed.
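
As a sketch of what such a declarative definition looks like, the fragment below uses DLT's Python decorators. It runs only inside a Databricks DLT pipeline (where `spark` is provided by the runtime), and the storage path and column names are placeholders, not values from this case study:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested incrementally from cloud storage")
def raw_events():
    # Auto Loader ("cloudFiles") picks up new files as they arrive
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/events/"))  # placeholder path

@dlt.table(comment="Cleaned events; reads from raw_events, so DLT infers "
                   "the dependency and runs the tables in the right order")
def clean_events():
    return dlt.read_stream("raw_events").where(col("value").isNotNull())
```

Note that the pipeline author only declares sources, transformations, and targets; the ordering, cluster management, and retries are handled by DLT.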

Benefits of DLT

The DLT framework helps build data processing pipelines that are reliable, testable, and maintainable. Once data engineers provide the transformation logic, DLT orchestrates the tasks, manages clusters, monitors the process and data quality, and handles errors. The benefits of DLT include:

Assured Data Quality

Delta Live Tables can prevent bad data from reaching the tables by validating and checking the integrity of the data. Using predefined policies on errors such as fail, alert, drop, or quarantining data, Delta Live Tables can ensure the quality of the data to improve the outcomes of BI, machine learning, and data science. It can also provide visibility into data quality trends to understand how the data is evolving and what changes are necessary.
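
The effect of these policies can be illustrated with a small stand-in. This is a plain-Python simulation of the concept, not the DLT API (in DLT, such rules are declared as expectations on a table):

```python
def apply_expectation(rows, predicate, policy="drop"):
    """Toy data-quality expectation: `policy` decides what happens to rows
    that violate `predicate` (mirroring DLT's warn / drop / fail choices).
    Returns (rows that reach the table, rows flagged as violations)."""
    good = [r for r in rows if predicate(r)]
    bad = [r for r in rows if not predicate(r)]
    if policy == "fail" and bad:
        raise ValueError(f"{len(bad)} rows violated the expectation")
    if policy == "drop":
        return good, bad      # bad rows never reach the table
    return rows, bad          # warn: keep everything, but record violations

rows = [{"id": 1, "value": 10}, {"id": 2, "value": None}]
kept, quarantined = apply_expectation(
    rows, lambda r: r["value"] is not None, policy="drop")
```

Under the drop policy the invalid row is diverted before it reaches the table; under fail the whole update stops, and under warn the row is kept but counted, which is what feeds the data quality trend visibility described above.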

Improved Pipeline Visibility

DLT can monitor pipeline operations by providing tools that enable visual tracking of operational stats and data lineage. Automatic error handling and easy replay can reduce downtime and accelerate maintenance with deployment and upgrades at the click of a button.

Improve Regulatory Compliance

The event log can automatically capture information related to the table for analysis and auditing. DLT can provide visibility into the flow of data in the organization and improve regulatory compliance.

Simplify Deployment and Testing of Data Pipeline

DLT can enable data to be updated and lineage information to be captured for different copies of data using a single code base. It can also enable the same set of query definitions to be run through the development, staging, and production stages.

Simplify Operations with Unified Batch and Streaming

Building and running batch and streaming pipelines can be centralized, and operational complexity can be effectively minimized with controllable, automated refresh settings.

Concepts Associated with Delta Live Tables

The concepts used in DLT include:

Pipeline: A Directed Acyclic Graph that can link data sources with destination datasets

Pipeline Setting: Pipeline settings define configurations such as:

  • Notebook
  • Target DB
  • Running mode
  • Cluster config
  • Configurations (Key-Value Pairs).
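
Taken together, these settings form the pipeline's JSON definition. The fragment below shows the general shape; all values here are placeholders, not settings from a real deployment:

```json
{
  "name": "example-dlt-pipeline",
  "libraries": [
    { "notebook": { "path": "/Repos/etl/dlt_pipeline_notebook" } }
  ],
  "target": "analytics_db",
  "continuous": false,
  "clusters": [
    { "label": "default", "num_workers": 2 }
  ],
  "configuration": {
    "source.path": "/mnt/landing/events/"
  }
}
```

The notebook holds the dataset definitions, `target` names the database the tables are published to, `continuous` selects the running mode, and `configuration` carries the key-value pairs the notebook can read at runtime.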

Dataset: DLT supports two types of datasets, views and tables, which in turn come in two forms: live and streaming.

Pipeline Modes: Delta Live provides two modes for development:

Development Mode: The cluster is reused to avoid restarts, and pipeline retries are disabled so that errors can be detected and fixed immediately.

Production Mode: The cluster restarts on recoverable errors, such as stale credentials or a memory leak, and execution is retried for specific errors.

Editions: DLT comes in various editions to suit the different needs of the customers such as:

  • Core, for streaming ingest workloads
  • Pro, which adds change data capture (CDC) and table updates driven by changes in the source data
  • Advanced, which adds data quality constraints on top of the Core and Pro features

Delta Live Event Monitoring: The Delta Live Tables pipeline event log is stored under the pipeline’s storage location, in /system/events.

Indium for Building Reliable Data Pipelines Using DLT

Indium is a recognized data engineering company with an established practice in Databricks. We offer ibriX, an Indium Databricks AI Platform, that helps businesses become agile, improve performance, and obtain business insights efficiently and effectively.

Our team of Databricks experts works closely with customers across domains to understand their business objectives and deploy the best practices to accelerate growth and achieve the goals. With DLT, Indium can help businesses leverage data at scale to gain deeper and meaningful insights to improve decision-making.

FAQs

How does Delta Live Tables make the maintenance of tables easier?

Delta Live Tables performs maintenance tasks on tables every 24 hours, which improves query performance. It also removes older versions of tables, improving cost-effectiveness.

Can multiple queries be written in a pipeline for the same target table?

No, this is not possible. Each table should be defined only once; a UNION can be used to combine multiple inputs into a single table.
