Data Monetization: How Snowflake Data Share and CDC can help monetise your data?

Data monetization

It is the practice of generating revenue or extracting value from data assets by utilizing owned or accessed data to gain insights, make informed decisions, and establish fresh revenue streams. It has become increasingly crucial in the digital era, where businesses, organizations, and individuals accumulate and generate vast quantities of data.

How can you monetise data, and why is it prominent in the data world?

In the realm of data, one prominent method of monetization is through targeted advertising. In this process, companies gather extensive data on user behavior, preferences, and demographics, enabling them to gain insights into individual interests and requirements. Subsequently, this valuable data is harnessed to deliver highly personalized advertisements to specific individuals or segmented groups within the population.

Targeted advertising occupies a prominent position in the data world due to multiple compelling reasons:

1. Enhanced effectiveness: By leveraging data insights, advertisers can customize their messaging to specific audiences who are more likely to show interest in their products or services. This results in improved conversion rates and a more optimized utilization of advertising budgets.

2. Elevated user experience: Relevant and personalized advertisements deliver value to users by showcasing offerings that align with their interests. This elevates the overall user experience and minimizes the perception of intrusive or irrelevant advertising.

3. Increased revenue potential: Targeted advertising has the potential to generate higher revenues for both advertisers and publishers. Advertisers are willing to invest premium amounts to reach their ideal audience, while publishers can command higher rates for ad space when they can demonstrate the effectiveness of their targeted advertising capabilities.

4. Data-driven decision making: Monetizing data through targeted advertising necessitates sophisticated data analytics and insights. This drives the advancement of cutting-edge data analytics tools, machine learning algorithms, and data science techniques. Consequently, the data world continues to progress and innovate, enabling improved decision making and business strategies rooted in data-driven insights.

Snowflake

Data warehousing is the process of bringing data from various sources into one place to derive business insights, helping business users understand their data and make decisions on demand. Snowflake plays a crucial role here as a unified, fully managed cloud data warehouse for storing and computing huge amounts of data. Snowflake's decoupled storage-and-compute model makes it easy for organizations to run a cost-effective warehouse that scales with demand. One of its notable features, Snowflake Share, enables data sharing between organizational accounts, so data production and consumption can be segregated, each with its own compute. Let us see how this share works and how it enables change tracking from the consumer account.

How can data be monetized in Snowflake?

  • Snowflake provides a platform for data monetization, enabling businesses to leverage data effectively.
  • It facilitates data collection from diverse source systems.
  • The platform enables the transformation of large datasets into valuable business insights through analytics.
  • Snowflake ensures the secure sharing of raw or processed data with third parties.

Snowflake sharing

Snowflake Sharing is a feature that allows users to share data securely and efficiently with customers, partners, and suppliers without compromising security or control. Users define access policies and rules, including roles and permissions, so that only authorized users can see the data. In general, a Snowflake database object can be shared from a producer account with one or more read-only consumer accounts, both within a region and across regions (through replication). Let us see how data is shared from a producer account and consumed from a consumer account.

Different forms of sharing

➔   Snowflake Secure Data Sharing between the Same Regions

Snowflake provides a secure and efficient way to share data between Snowflake accounts within the same region.

To share data between Snowflake accounts in the same region, you need to set up the required roles and privileges on the data. The ACCOUNTADMIN role is required to set up secure data sharing; it can create and manage the required database objects and grant privileges to other roles. In addition, you will need to grant the appropriate privileges to each role, and the specific privileges depend on your data sharing use case.

The following commands need to be executed by the producer.

CREATE OR REPLACE SHARE SHARE1;
GRANT USAGE ON DATABASE PRIMARY_DB TO SHARE SHARE1;
GRANT USAGE ON SCHEMA PRIMARY_DB.SCHEMA1 TO SHARE SHARE1;
GRANT SELECT ON TABLE PRIMARY_DB.SCHEMA1.EMPLOYEE TO SHARE SHARE1;

Include the account in the SHARE1 share

ALTER SHARE SHARE1 ADD ACCOUNTS=org_name.consumer_name;

Consumers are required to execute the following commands.                                     

CREATE DATABASE SECONDARY_DB FROM SHARE org_name.producer_name.SHARE1;

➔   Snowflake Secure Data Sharing between regions

Snowflake provides a solution for securely sharing data between regions. By leveraging Snowflake’s cloud-based architecture and advanced security features, users can share sensitive data with other region accounts without compromising on security or performance.

Run the following commands on ACCOUNT PRODUCER 1.

USE ROLE ORGADMIN;
SELECT SYSTEM$GLOBAL_ACCOUNT_SET_PARAMETER('org_name.AP_SOUTH_EAST_ACCOUNT','ENABLE_ACCOUNT_DATABASE_REPLICATION', 'true');

Creating primary database

CREATE DATABASE PRODUCER_DB_1;
USE PRODUCER_DB_1;
CREATE SCHEMA PRODUCER_SCHEMA;
CREATE TABLE PRODUCER_TABLE (ID INT, NAME VARCHAR(255), BRANCH_CODE INT, LOCATION VARCHAR(255));
ALTER TABLE PRODUCER_TABLE SET CHANGE_TRACKING = TRUE;

Creating AWS_AP_SOUTH_1 account

USE ROLE ORGADMIN;
CREATE ACCOUNT AP_SOUTH_PRODUCER_ACCOUNT
admin_name=ADMIN_NAME
admin_password='PASSWORD'
first_name=AKHIL
last_name=TUMMAPUDI
email='****@gmail.com'
edition=ENTERPRISE
region=AWS_AP_SOUTH_1;

SELECT SYSTEM$GLOBAL_ACCOUNT_SET_PARAMETER('org_name.AP_SOUTH_PRODUCER_ACCOUNT','ENABLE_ACCOUNT_DATABASE_REPLICATION','TRUE');

You can replicate to the AWS_AP_SOUTH_1 account and promote an existing database in your local account as the primary one.

use role accountadmin;
alter database PRODUCER_DB_1 enable replication to accounts org_name.AP_SOUTH_PRODUCER_ACCOUNT;

The following commands need to be run on ACCOUNT PRODUCER 2.

CREATE WAREHOUSE MY_WH;

Replicate the existing database to a secondary database in the other region

create database PRODUCER_DB_12 as replica of org_name.AP_SOUTH_EAST_ACCOUNT.PRODUCER_DB_1;

Create a database for stored procedures

create database PRODUCER_DB_SP_12;
use database PRODUCER_DB_SP_12;

Schedule refresh of the secondary database

create or replace task refresh_PRODUCER_DB_12_task
warehouse = MY_WH
schedule = '1 MINUTE'
as
alter database PRODUCER_DB_12 refresh;

alter task refresh_PRODUCER_DB_12_task resume;

Refresh the secondary database now

alter database PRODUCER_DB_12 refresh;

Create a share.

create OR REPLACE share share1;

Add objects to the share.

grant usage on database PRODUCER_DB_12 to share share1;
grant usage on schema PRODUCER_DB_12.PRODUCER_SCHEMA to share share1;
grant select on TABLE PRODUCER_DB_12.PRODUCER_SCHEMA.PRODUCER_TABLE to share share1;

Add consumer accounts to the share

alter share share1 add accounts=org_name.AP_SOUTH_ACCOUNT;

The following commands need to be run on ACCOUNT CONSUMER.

use role ORGADMIN;
select system$global_account_set_parameter('org_name.AP_SOUTH_ACCOUNT','ENABLE_ACCOUNT_DATABASE_REPLICATION','TRUE');

use role accountadmin;
CREATE DATABASE CONSUMER_DB_12 FROM SHARE org_name.AP_SOUTH_PRODUCER_ACCOUNT.SHARE1;

Start monetizing your data and unlocking its value today! Book a Call Now for more details.

Types of data sharing in Snowflake:

Direct Data Share in Snowflake

Why?

Direct data sharing in Snowflake enables the secure sharing of real-time data sets among different Snowflake accounts, eliminating the need for data duplication or movement. This feature facilitates seamless real-time collaboration and analysis across various entities, including partners, subsidiaries, and customers.

Pros:

1. Seamless collaboration: By enabling immediate data sharing, it fosters seamless collaboration and swift decision-making among multiple entities in real time.

2. Cost-effective: It eliminates the necessity for data replication or ETL processes, thereby minimizing storage and processing expenses related to data movement.

3. Robust security and governance: Snowflake incorporates robust security features that guarantee data privacy and control, empowering organizations to share data with the utmost confidence.

4. Streamlined data sharing: Data providers can effortlessly share targeted data sets with chosen recipients, granting precise control over data access in a simplified manner.

Cons:

1. Reliance on data providers: The accessibility and accuracy of data for data recipients depend on the data providers. Any challenges or delays faced by the providers can have an impact on the recipient’s ability to access the shared data.

2. Restricted data transformation capabilities: Direct data sharing primarily revolves around the sharing of raw or minimally transformed data, which imposes limitations on the recipient’s capacity to execute intricate data transformations within Snowflake.

Change Data Capture (CDC) Data Share in Snowflake

Why?

CDC data sharing in Snowflake enables organisations to share real-time data changes extracted from source databases with other Snowflake accounts. It facilitates nearly instantaneous data replication and synchronisation between systems.
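As a minimal sketch of how a consumer might track changes on shared data (assuming change tracking was enabled on the producer table, as shown earlier, that the account edition supports streams on shared tables, and reusing the shared database from the example above; the stream name is illustrative):

-- Run in the consumer account. CONSUMER_DB_12 is the database created from the share;
-- PRODUCER_CHANGES_STREAM is an illustrative name, not part of the setup above.
CREATE OR REPLACE STREAM PRODUCER_CHANGES_STREAM
  ON TABLE CONSUMER_DB_12.PRODUCER_SCHEMA.PRODUCER_TABLE;

-- Each read returns only the rows changed since the stream was last consumed,
-- with METADATA$ACTION and METADATA$ISUPDATE describing the type of change.
SELECT * FROM PRODUCER_CHANGES_STREAM;

Because the data is consumed through a share, no copy of the producer's data is stored in the consumer account; the stream only tracks change offsets.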

Pros:

1. Instantaneous data synchronisation: CDC data sharing ensures swift replication of changes made in the source databases, making the data promptly available to the receiving Snowflake accounts. This enables real-time analytics and reporting.

2. Minimised latency: CDC captures and delivers only the modified data, significantly reducing data replication time and minimising latency compared to traditional batch-based data sharing methods.

3. Optimised resource utilisation: With CDC data sharing, only the changed data is captured and replicated, leading to efficient resource utilisation. This helps reduce network bandwidth usage and storage requirements.

4. Uninterrupted data availability: The near-real-time nature of CDC data sharing guarantees that the receiving Snowflake accounts have access to the most up-to-date data continuously.

Cons:

1. Reliance on source database compatibility: CDC data sharing relies on the support of change data capture capabilities in the source databases. Incompatibility with certain databases may restrict its usability and functionality.

2. Heightened complexity: The implementation and management of CDC data sharing entail configuring and monitoring data capture processes, introducing additional complexity compared to traditional data sharing methods.

How has Indium helped customers monetise their data?

  • One of our customer use cases was to replicate data from Snowflake tables and views into other target systems in real time.
  • The customer has a primary Snowflake account into which data is collected from various sources, and they wanted the changes replicated immediately to other targets through Striim.
  • Striim is a platform that provides real-time change data capture from various data sources such as databases, file systems, Snowflake, and others.
  • Here, we used Snowflake's share feature to share data from the primary account to the secondary accounts.
  • As explained above, Striim picked up the changes from the shared data in the secondary accounts in real time.

Learn how Snowflake Data Share and CDC can transform your business. Get started now and unleash the full potential of your data.

Conclusion

In the digital era, the significance of data monetization has grown, enabling organisations to derive value from their data assets. A prominent approach to monetizing data is through targeted advertising, leveraging comprehensive data insights. While data sharing in Snowflake brings advantages like real-time collaboration and reduced latency, it also entails challenges such as dependency on source database compatibility and increased complexity in implementation and management. Overall, Snowflake empowers organisations to effectively monetize their data while offering robust data warehousing capabilities. Striim, as a real-time replication platform, plays a major role in consuming changes from Snowflake tables and views from the secondary accounts.


Data Masking: Need, Techniques and Best Practices

Introduction

More than ever, the human race is discovering and evolving. The revolution in the Artificial Intelligence domain has brought the whole human species to a new dawn of personalized services. With more people adopting the Internet, demand for services across different phases of life keeps increasing. Consider the Covid pandemic, which we are still at war with. During lockdowns, to stay motivated, we used audiobook applications and video streaming applications, attended online exercise and yoga sessions, and even consulted doctors through an app. While the physical streets were closed, there was more traffic online.

All these applications and websites have a simple goal: better service for the user. To achieve it, they collect personal information directly or indirectly, intentionally or in the name of improvement. Machines of every size, from laptops to smartwatches and even voice assistants, listen to us and watch every move we make and every word we utter. Although their purpose is noble, there is no guarantee of leakage-proof, intruder-proof, spammer-proof data handling. According to a study by Forbes, on average 2.5 quintillion bytes of data are generated per day, and this volume is increasing exponentially year by year. Data mining, data ingestion, and migration are the phases most vulnerable to potential data leakage. The alarming news is that cyber-attacks happen at a rate of about 18 attacks per minute, and more than 16 lakh cybercrimes were reported in India alone over the last three years.



Need of Data Masking

Besides online scams and fraud, cyber-attacks and data breaches are major risks for every organization that mines personal data. In a data breach, an attacker gains access to the personal information of millions or even billions of people, such as bank details, mobile numbers, social security numbers, and so on. According to the Identity Theft Resource Center (ITRC), 83% of the 1,862 data breaches in 2021 involved sensitive data. These incidents are now considered instruments of modern warfare.

Data Security Standards

Depending on the country and the regulatory authority, different rules are imposed to protect sensitive information. The European Union enforces the General Data Protection Regulation (GDPR) to protect personal and racial information along with digital information, health records, and the biometric and genetic data of individuals. The United States Department of Health and Human Services (HHS) passed the Health Insurance Portability and Accountability Act (HIPAA), which establishes security standards for the privacy of individually identifiable health information. The International Organization for Standardization and the International Electrotechnical Commission's (ISO/IEC) 27001 and 27018 security standards promote confidentiality, integrity, and availability norms for big data organizations. For Extract, Transform and Load (ETL) services, data pipeline services, and data analytics services, adhering to these security norms is crucial.

Different Security Standards

Read this insightful blog post on Maximizing AI and ML Performance: A Guide to Effective Data Collection, Storage, and Analysis

Techniques to Protect Sensitive Data

All the security protocols and standards can be summarized into three broad techniques: data de-identification, data encoding, and data masking. Data de-identification protects sensitive data by removing or obscuring identifiable information. It covers anonymization, which completely removes the sensitive records from the database; pseudonymization, which replaces sensitive information with aliases; and aggregation, where data is grouped and summarized before being presented or shared rather than exposing the original elements.

In de-identification, the original data format or structure may not be retained. Data encoding refers to encoding the data into ciphers that can later be decoded by authorized users. Common encoding techniques are encryption (key-based encryption of data) and hashing (converting the original data to hash values using Message Digest (MD5), Secure Hash Algorithm (SHA-1), BLAKE, etc.). Data masking, on the other hand, replaces the original data with fictitious or obfuscated data while retaining the format and structure of the original. These techniques do not fall into a strict hierarchy; they are used alone or in combination, depending on the use case and the criticality of the data.
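As a simple illustration of hashing-based encoding in SQL (Snowflake-style functions; the CUSTOMERS table and its columns are hypothetical, not part of this post):

-- MD5 and SHA2 produce fixed-length digests of the original values.
-- The same input always yields the same digest, so hashed keys can still be joined on.
SELECT
    CUSTOMER_ID,
    MD5(EMAIL)              AS EMAIL_MD5,      -- 128-bit digest
    SHA2(PHONE_NUMBER, 256) AS PHONE_SHA256    -- 256-bit digest, preferred for sensitive fields
FROM CUSTOMERS;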

Comparative abstraction of major techniques

Data masking comes in two types: Static Data Masking (SDM) and Dynamic Data Masking (DDM). Static data masking replaces sensitive data with realistic but fictitious data that keeps the structure and format of the original. Its common methods are substitution (replacing sensitive data with fake data), shuffling (shuffling the data within a column to break the link between original values and their references), nulling (replacing sensitive data with NULL values), encryption (encrypting the sensitive information), and redaction (partially masking the data so only part of it is visible). Dynamic data masking, by contrast, covers full masking, partial masking (masking a portion of the value), random masking (masking at random), conditional masking (masking only when a specific condition is met), encoding, and tokenization (converting data to a non-sensitive token that preserves the format and length of the original).
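As a hedged sketch of dynamic (query-time) masking using Snowflake-style masking policies (the CUSTOMERS table, EMAIL column, and ANALYST role are hypothetical):

-- Analysts see the real value; every other role sees a partially redacted email.
CREATE OR REPLACE MASKING POLICY EMAIL_MASK AS (VAL STRING) RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('ANALYST') THEN VAL
        ELSE REGEXP_REPLACE(VAL, '.+@', '*****@')   -- conditional + partial masking
    END;

ALTER TABLE CUSTOMERS MODIFY COLUMN EMAIL SET MASKING POLICY EMAIL_MASK;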

SDM masks data at rest by creating a copy of an existing data set; the copied, masked data is what gets shared with analysis and production teams. Updates to the original data are not reflected in the masked copy until a new copy is made. DDM, by contrast, masks data at query time, so updated data is returned already masked and the data stays live without creating data silos. SDM is often the primary choice of data practitioners because it is reliable and completely isolates the original data, whereas DDM depends on query-time masking, which carries a chance of failure under adverse conditions.

SDM vs DDM

Data Masking Best Practices

Masking of sensitive data depends on the use case of the resultant masked data. It is always recommended to mask the data in the non-production environment. However, there are some practices that need to be considered for secure and fault-tolerant data masking.

1. Governance: The organization must follow common security practices based on the country it’s operating in and the international data security standards as well.

2. Referential Integrity: Masked tables should preserve references so that joins remain possible during analysis without revealing sensitive information.

3. Performance and Cost: Tokenization and hashing often convert data to a standard size that may be larger than the original, and masked data should not noticeably increase query processing time.

4. Scalability: For big data, the masking technique should be able to handle large datasets as well as streaming data.

5. Fault-tolerance: The technique should tolerate minor data irregularities such as extra spaces, commas, and special characters. Regularly scrutinizing the masking process and the resulting data helps avoid common pitfalls.

Protect your sensitive data with proper data masking techniques. Contact us today to get in touch.

Conclusion

In conclusion, the advancements in technology, particularly in the domain of Artificial Intelligence, have brought about a significant change in the way humans interact with services and each other. The COVID-19 pandemic has further accelerated the adoption of digital technologies as people were forced to stay indoors and seek personalized services online. The increased demand for online services during the pandemic has shown that technology can be leveraged to improve our lives and bring us closer to one another even in times of crisis. As we continue to navigate the post-pandemic world, the revolution in technology will play a significant role in shaping our future and enabling us to live a better life.


Building Reliable Data Pipelines Using DataBricks’ Delta Live Tables

The enterprise data landscape has become more data-driven and has continued to evolve as businesses adopt digital transformation technologies like IoT and mobile data. In such a scenario, the traditional extract, transform, and load (ETL) processes used for preparing data, generating reports, and running analytics can be challenging to maintain because they rely on manual processes for testing, error handling, recovery, and reprocessing. Data pipeline development and management can also become complex in the traditional ETL approach, and data quality issues degrade the quality of insights. The high velocity of data generation can make implementing batch or continuous streaming data pipelines difficult; ideally, data engineers should be able to change the latency flexibly without rewriting the data pipeline. Scaling up as the data volume grows is also hard with manual coding, leading to more time and cost spent on development, error handling, data cleanup, and resuming processing.

To know more about Indium and our Databricks and DLT capabilities, contact us now.

Automating Intelligent ETL with Delta Live Tables

Given the fast-paced changes in the market environment and the need to retain competitive advantage, businesses must address the challenges, improve efficiencies, and deliver high-quality data reliably and on time. This is possible only by automating ETL processes.

The Databricks Lakehouse Platform offers Delta Live Tables (DLT), a new cloud-native managed service that facilitates the development, testing, and operationalization of data pipelines at scale, using a reliable ETL framework. DLT simplifies the development and management of ETL with:

  • Declarative pipeline development
  • Automatic data testing
  • Monitoring and recovery with deep visibility

With Delta Live Tables, end-to-end data pipelines can be defined easily by specifying the source of the data, the logic used for transformation, and the target state of the data. It can eliminate the manual integration of siloed data processing tasks. Data engineers can also ensure data dependencies are maintained across the pipeline automatically and apply data management for reusing ETL pipelines. Incremental or complete computation for each table during batch or streaming run can be specified based on need.
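As a minimal sketch of such a declarative pipeline in DLT's SQL syntax (the storage path and table names are illustrative, not from an actual project):

-- Incrementally ingest raw files with Auto Loader into a streaming live table.
CREATE OR REFRESH STREAMING LIVE TABLE raw_orders
COMMENT "Raw order files ingested incrementally"
AS SELECT * FROM cloud_files("/mnt/landing/orders", "json");

-- DLT infers the dependency from the LIVE keyword and orchestrates the two steps.
CREATE OR REFRESH LIVE TABLE daily_order_totals
COMMENT "Orders aggregated by date"
AS SELECT order_date, SUM(amount) AS total_amount
   FROM LIVE.raw_orders
   GROUP BY order_date;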

Benefits of DLT

The DLT framework can help build data processing pipelines that are reliable, testable, and maintainable. Once the data engineers provide the transformation logic, DLT can orchestrate the task, manage clusters, monitor the process and data quality, and handle errors. The benefits of DLT include:

Assured Data Quality

Delta Live Tables can prevent bad data from reaching the tables by validating and checking the integrity of the data. Using predefined policies on errors such as fail, alert, drop, or quarantining data, Delta Live Tables can ensure the quality of the data to improve the outcomes of BI, machine learning, and data science. It can also provide visibility into data quality trends to understand how the data is evolving and what changes are necessary.
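In DLT these policies are expressed as expectations on a table definition. A hedged example (table and constraint names are illustrative, building on the sketch above):

-- Rows failing valid_order_id are dropped; a negative amount fails the whole update.
-- Constraints without an ON VIOLATION clause only record the violation for monitoring.
CREATE OR REFRESH LIVE TABLE clean_orders (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_amount   EXPECT (amount >= 0) ON VIOLATION FAIL UPDATE
)
COMMENT "Orders that pass basic quality checks"
AS SELECT * FROM LIVE.raw_orders;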

Improved Pipeline Visibility

DLT can monitor pipeline operations by providing tools that enable visual tracking of operational stats and data lineage. Automatic error handling and easy replay can reduce downtime and accelerate maintenance with deployment and upgrades at the click of a button.

Improve Regulatory Compliance

The event log can automatically capture information related to the table for analysis and auditing. DLT can provide visibility into the flow of data in the organization and improve regulatory compliance.

Simplify Deployment and Testing of Data Pipeline

DLT can enable data to be updated and lineage information to be captured for different copies of data using a single code base. It can also enable the same set of query definitions to be run through the development, staging, and production stages.

Simplify Operations with Unified Batch and Streaming

Building and running batch and streaming pipelines can be centralized, and operational complexity can be effectively minimized with controllable and automated refresh settings.

Concepts Associated with Delta Live Tables

The concepts used in DLT include:

Pipeline: A Directed Acyclic Graph that can link data sources with destination datasets

Pipeline Setting: Pipeline settings can define configurations such as:

  • Notebook
  • Target DB
  • Running mode
  • Cluster config
  • Configurations (Key-Value Pairs).

Dataset: DLT supports two types of datasets, views and tables, which in turn come in two flavors: live and streaming.

Pipeline Modes: Delta Live provides two modes for development:

Development Mode: The cluster is reused to avoid restarts, and pipeline retries are disabled so errors can be detected and fixed quickly.

Production Mode: The cluster is restarted for recoverable errors such as stale credentials or a memory leak, and execution is retried for specific errors.

Editions: DLT comes in various editions to suit the different needs of the customers such as:

  • Core for streaming ingest workloads
  • Pro for Core features plus CDC: streaming ingest and table updates based on changes to the source data
  • Advanced, which adds data quality constraints on top of the Core and Pro features

Delta Live Event Monitoring: Delta Live Table Pipeline event log is stored under the storage location in /system/events.
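Since the event log is itself stored as a Delta table, it can be queried directly. A hedged sketch (replace <storage-location> with the pipeline's configured storage path; column names assume the standard event log schema):

-- Fields such as timestamp, event_type, level, and message describe each pipeline event.
SELECT timestamp, event_type, level, message
FROM delta.`<storage-location>/system/events`
ORDER BY timestamp DESC;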

Indium for Building Reliable Data Pipelines Using DLT

Indium is a recognized data engineering company with an established practice in Databricks. We offer ibriX, an Indium Databricks AI Platform, that helps businesses become agile, improve performance, and obtain business insights efficiently and effectively.

Our team of Databricks experts works closely with customers across domains to understand their business objectives and deploy the best practices to accelerate growth and achieve the goals. With DLT, Indium can help businesses leverage data at scale to gain deeper and meaningful insights to improve decision-making.

FAQs

How does Delta Live Tables make the maintenance of tables easier?

Maintenance tasks are performed on tables every 24 hours by Delta Live Tables, which improves query outcomes. It also removes older versions of tables and improves cost-effectiveness.

Can multiple queries be written in a pipeline for the same target table?

No, this is not possible. Each table should be defined once. UNION can be used to combine various inputs to create a table.
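A hedged sketch of that pattern in DLT SQL (table names are illustrative):

-- One definition for the target table, combining two inputs with UNION ALL.
CREATE OR REFRESH LIVE TABLE all_orders
AS SELECT order_id, amount, 'online' AS channel FROM LIVE.online_orders
   UNION ALL
   SELECT order_id, amount, 'retail' AS channel FROM LIVE.retail_orders;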
