ibrix-page Archives - Indium
https://www.indiumsoftware.com/blog/tag/ibrix-page/

Distributed Data Processing Using Databricks
https://www.indiumsoftware.com/blog/distributed-data-processing-using-databricks/ | Mon, 21 Nov 2022

Distributed systems are used in organizations for collecting, accessing, and manipulating large volumes of data. They have become an integral component of many organizations as data volumes grow exponentially across industries.

With the advent of big data technologies, many challenges in dealing with large datasets have been addressed. But in a typical data processing scenario, when a dataset is too large for a single machine to process, or when a single machine does not hold all the data needed to answer user queries, the processing power of multiple machines is required. These scenarios are becoming increasingly complex as applications, devices, and social platforms generate and consume data across the organization, and this is where distributed data processing methods are best applied.


Understanding Distributed Data Processing 

Distributed data processing starts with large volumes of data flowing into the system from a variety of sources. Several layers in the system manage this data ingestion process.

First, the data collection and preparation layer collects data from different sources, which is then processed further by the system. Data gathered from external sources is mostly raw data such as text, images, audio, and forms, so the preparation layer is responsible for converting it into a usable, standard format for analytical purposes.

Meanwhile, the data storage layer primarily handles real-time data streaming for analytics, using in-memory distributed caches to store and manage data. If the data needs to be processed in the conventional way, batch processing is performed across distributed databases, which handle big data effectively.

Next is the data processing layer, which can be considered the logical layer that processes the data. This layer applies machine learning solutions and models to perform predictive and descriptive analytics and derive meaningful business insights. Finally, there is the data visualization layer, consisting of dashboards that visualize the data and reports from the different analytics using graphs and charts for better interpretation of the results.

In the quest to find new approaches to distribute processing power, application programs, and data, distributed data engineering solutions are adopted to spread applications and data across multiple interconnected sites, meeting the organization's growing need for information. An organization may opt for a centralized or a decentralized data processing system, depending on its requirements.

Benefits of Distributed Data Processing 

The critical benefit of processing data within a distributed environment is the ease with which tasks can be completed in significantly less time, as data is accessible from multiple machines that execute the tasks in parallel instead of a single machine running requests in a queue.
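To make this concrete, here is a minimal PySpark sketch (assuming an existing Spark cluster; the file path and column names are illustrative, not from the original post). A simple aggregation is automatically split into tasks that run in parallel across the cluster's executors.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DistributedAggregation").getOrCreate()

# Each partition of the input is read and processed by a separate executor task in parallel.
orders = spark.read.option("header", "true").csv("/data/orders.csv", inferSchema=True)

# Partial sums are computed per partition and then merged, instead of one machine scanning everything.
daily_revenue = orders.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))

daily_revenue.show()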

As the data is processed faster, it is a cost-effective approach for businesses, and running workloads in a distributed environment meets crucial aspects of scalability and availability in today’s fast-paced environment. In addition, since data is replicated across the clusters, there is less likelihood of data loss.

Challenges of Distributed Data Processing 

The entire process of setting up and working with a distributed system is complex.  

For large enterprises, compromised data security, coordination problems, occasional performance bottlenecks caused by non-performing terminals in the system, and even high maintenance costs are seen as major issues.

How is Databricks Platform Used for Distributed Data Processing? 

The Databricks Lakehouse cloud data platform helps perform analytical queries, and Databricks SQL is provided for business intelligence and analytical tasks on top of the data lake. Analysts can query datasets using standard SQL, and the platform integrates well with business intelligence tools like Tableau. At the same time, the Databricks platform supports different workloads encompassing machine learning, data storage, data processing, and real-time streaming analytics.

The immediate benefit of a Databricks architecture is enabling seamless connections to applications and effective cluster management. Additionally, Databricks simplifies the setup and maintenance of clusters, which makes it easy for developers to create ETL pipelines. These ETL pipelines ensure data availability in real time across the organization, leading to better collaboration among cross-functional teams.

With the Databricks Lakehouse platform, it is now easy to ingest and transform batch and streaming data, leading to reliable production workflows. Moreover, Databricks ensures that clusters scale and terminate automatically based on usage. Since the data ingestion process is simplified, all analytical solutions, AI, and other streaming applications can be operated from a single place.
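As a rough sketch of such ingestion in a Databricks notebook (where spark is predefined), assuming the Auto Loader cloudFiles source is available in the workspace and using placeholder storage and checkpoint paths, newly arriving files can be streamed incrementally into a Delta table:

# Incrementally pick up new JSON files as they land and append them to a Delta table (paths are placeholders).
raw_stream = (
    spark.readStream
    .format("cloudFiles")                       # Databricks Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/events/")
    .load("/mnt/raw/events/")
)

(
    raw_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events/")
    .start("/mnt/bronze/events/")
)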

Likewise, automated ETL processing ensures raw data is immediately transformed to be readily available for analytics and AI applications. Beyond data transformation, automating ETL allows for efficient task orchestration, error handling, recovery, and performance optimization. Orchestration enables developers to work with diverse workloads, and Databricks Workflows can be accessed through a dashboard with a host of features, improving tracking and monitoring of jobs and performance in the pipeline. This approach continuously monitors performance, data quality, and reliability metrics from various perspectives.
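One way to picture this orchestration (a hedged sketch against the public Databricks Jobs API 2.1; the workspace URL, token, notebook paths, and cluster settings below are placeholders) is a two-task job where the transform step runs only after ingestion succeeds:

import requests

# Placeholders: replace with your workspace URL and a personal access token.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "daily-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
            "new_cluster": {"spark_version": "13.3.x-scala2.12", "node_type_id": "i3.xlarge", "num_workers": 2},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],  # runs only after the ingest task succeeds
            "notebook_task": {"notebook_path": "/Repos/etl/transform"},
            "new_cluster": {"spark_version": "13.3.x-scala2.12", "node_type_id": "i3.xlarge", "num_workers": 2},
        },
    ],
}

resp = requests.post(f"{HOST}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=job_spec)
print(resp.json())  # returns the job_id on success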

In addition, Databricks offers a data processing engine compatible with Apache Spark APIs that speeds up work by automatically scaling across multiple nodes. Another critical aspect of the Databricks platform is enabling governance of all data and AI-based applications with a single model for discovering, accessing, and securing data sharing across cloud platforms.

Similarly, there is support for Databricks SQL within the Databricks Lakehouse, a serverless data warehouse capable of running any SQL and business intelligence applications at scale.

Databricks Services From Indium: 

With deep expertise in Databricks Lakehouse, Advanced Analytics & Data Products, Indium Software provides a wide range of services to meet our clients' business needs. Indium's proprietary solution accelerator iBriX is a packaged combination of AI/ML use cases, custom scripts, reusable libraries, processes, policies, optimization techniques, and performance management with various levels of automation, including standard operational procedures and best practices.

To know more about iBriX and the services we offer, write to info@www.indiumsoftware.com.  

Deploying Databricks on AWS: Key Best Practices to Follow
https://www.indiumsoftware.com/blog/deploying-databricks-on-aws-key-best-practices-to-follow/ | Fri, 11 Nov 2022

Databricks is a unified, open platform for all organizational data and is built along the architecture of a data lake. It ensures speed, scalability, and reliability by combining the best of data warehouses and data lakes. At the core is the Databricks workspace that stores all objects, assets, and computational resources, including clusters and jobs.

Over the years, the need to simplify Databricks deployment on AWS had become a persistent demand due to the complexity involved. When deploying Databricks on AWS, customers had to constantly switch between consoles, following very detailed documentation. To deploy the workspace, customers had to:

  • Configure a virtual private cloud (VPC)
  • Set up security groups
  • Create a cross-account AWS Identity and Access Management (IAM) role
  • Add all AWS services used in the workspace

This could take more than an hour and needed a Databricks solutions architect familiar with AWS to guide the process.

To make matters simple and enable self-service, the company offers Quick Start in collaboration with Amazon Web Services (AWS). This is an automated reference deployment tool that integrates AWS best practices, leveraging AWS CloudFormation templates to deploy key technologies on AWS.
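To illustrate what such a Quick Start does under the hood (a hedged boto3 sketch; the template URL, stack name, and parameter keys are placeholders rather than the actual Quick Start values), a CloudFormation stack can be launched programmatically like this:

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Launch a stack from a template; the parameter names here are illustrative placeholders.
response = cfn.create_stack(
    StackName="databricks-workspace",
    TemplateURL="https://example-bucket.s3.amazonaws.com/databricks-quickstart.yaml",
    Parameters=[
        {"ParameterKey": "AccountId", "ParameterValue": "<databricks-account-id>"},
        {"ParameterKey": "WorkspaceName", "ParameterValue": "analytics-workspace"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM roles on your behalf
)
print(response["StackId"])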

Incorporating AWS Best Practices

Best Practice #1 – Ready, Steady, Go

Make it easy even for non-technical customers to get Databricks up and running in minutes. With Quick Start, customers sign in to the AWS Management Console, select the CloudFormation template and Region, fill in the required parameter values, and deploy. Quick Start applies to several environments, and the architecture is designed so that customers in any environment can leverage it.

Best Practice #2 – Automating Installation

Earlier, deploying Databricks involved manually installing and configuring several components. This was a slow process, prone to errors and rework. Customers had to follow a document to get it right, and this was proving to be difficult. By automating the process, AWS cloud deployments can be sped up effectively and efficiently.

Best Practice #3 – Security from the Word Go

One of the AWS best practices is the focus on security and availability. When deploying Databricks, this focus should be integrated right from the beginning. For effective security and availability, align the deployment with AWS user management so that a one-time IAM setup provides access to the environment with the appropriate controls. This should be supplemented with AWS Security Token Service (AWS STS) to authenticate user requests with temporary, limited-privilege credentials.
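For illustration (a hedged boto3 sketch; the role ARN and session name are placeholders), a request for temporary, limited-privilege credentials through AWS STS looks roughly like this:

import boto3

sts = boto3.client("sts")

# Assume a limited-privilege role and receive short-lived credentials.
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/databricks-cross-account-role",
    RoleSessionName="databricks-deployment",
    DurationSeconds=3600,  # credentials expire after one hour
)

creds = assumed["Credentials"]

# Use the temporary credentials instead of long-lived keys, e.g. for S3 access.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)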

Best Practice #4 – High Availability

Because the environment spans two Availability Zones, the architecture is highly available. Add a Databricks- or customer-managed virtual private cloud (VPC) to the customer's AWS account and configure it with private subnets and a public subnet. This gives customers access to their own virtual network on AWS. In the private subnets, Databricks clusters of Amazon Elastic Compute Cloud (Amazon EC2) instances can be added along with additional security groups to ensure secure cluster connectivity. In the public subnet, outbound internet access can be provided with a network address translation (NAT) gateway. Use an Amazon Simple Storage Service (Amazon S3) bucket for storing objects such as notebook revisions, cluster logs, and job results.

The benefit of using these best practices is that creating and configuring the AWS resources required to deploy and configure the Databricks workspace can be easily automated. Solutions architects do not need extensive training in the configurations, and the process remains intuitive. This also helps them stay current with the latest product enhancements, security upgrades, and user experience improvements without difficulty.

Since the launch of Quick Starts in September 2020, Databricks deployment on AWS has become much simpler, resulting in:

  • Deployment takes only about 5 minutes, compared to more than an hour earlier
  • 95% lower deployment errors

As it incorporates the best practices of AWS and is co-developed by AWS and Databricks, the solution answers the need of its customers to quickly and effectively deploy Databricks on AWS.

Indium – Combining Technology with Experience

Indium Software is an AWS and Databricks solution provider with a battalion of data experts who can help you deploy Databricks on AWS and set you off on your cloud journey. We work closely with our customers to understand their business goals and smooth their digital transformation by designing solutions that cater to those goals and objectives.

While Quick Starts is a handy tool that accelerates the deployment of Databricks on AWS, we help design the data lake architecture to optimize cost and resources and maximize benefits. Our expertise in DevSecOps ensures a secure and scalable solution that is highly available with permission-based access to enable self-service with compliance.

Some of the key benefits of working with Indium on Databricks deployments include:

  • More than 120 person-years of Spark expertise
  • Dedicated Lab and COE for Databricks
  • ibriX – Homegrown Databricks Accelerator for faster Time-to-market
  • Cost Optimization Framework – Greenfield and Brownfield engagements
  • E2E Data Expertise – Lakehouse, Data Products, Advanced Analytics, and ML Ops
  • Wide Industry Experience – Healthcare, Financial Services, Manlog, Retail and Realty

FAQs

How do you create a Databricks workspace on AWS?

You can sign up for the free trial by clicking the Try Databricks button at the top of the Databricks page or through AWS Marketplace.

How can one store and access data on Databricks and AWS?

All data can be stored and managed on a simple, open lakehouse platform. Databricks on AWS allows the unification of all analytics and AI workloads by combining the best of data warehouses and data lakes.
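As a small sketch of that pattern in a Databricks notebook (where spark is predefined; the bucket and paths are placeholders and the cluster is assumed to have access to the bucket), data can be written to S3 in Delta format and then queried with DataFrames or SQL:

# Read raw data, write it to S3 as a Delta table, and query it (paths are placeholders).
df = spark.read.option("header", "true").csv("s3://my-bucket/raw/customers.csv", inferSchema=True)

df.write.format("delta").mode("overwrite").save("s3://my-bucket/lakehouse/customers")

customers = spark.read.format("delta").load("s3://my-bucket/lakehouse/customers")
customers.createOrReplaceTempView("customers")
spark.sql("SELECT country, COUNT(*) AS cnt FROM customers GROUP BY country").show()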

How can Databricks connect to AWS?

Databricks can integrate with AWS Glue, allowing Databricks table metadata to be shared from a centralized catalog across various Databricks workspaces, AWS services, AWS accounts, and applications for easy access.

EDA, XGBoost and Hyperparameter Tuning using pySpark on Databricks (Part III)
https://www.indiumsoftware.com/blog/end-to-end-ml-pipeline-using-pyspark-and-databricks-part-3 | Tue, 31 May 2022

This post is a continuation of the previous post, which covered model building using Spark on Databricks. In this post, we cover EDA and hyperparameter optimization using PySpark.

In case you missed part-1, here you go: https://www.indiumsoftware.com/blog/end-to-end-ml-pipeline-using-pyspark-and-databricks-part-1/

Load the data using PySpark

from pyspark.sql import SparkSession, SQLContext

# Create (or reuse) a Spark session
spark = SparkSession \
    .builder \
    .appName("Life Expectancy using Spark") \
    .getOrCreate()

sc = spark.sparkContext
sqlCtx = SQLContext(sc)

# Read the CSV file into a Spark DataFrame
data = sqlCtx.read.format("com.databricks.spark.csv")\
    .option("header", "true")\
    .option("inferschema", "true")\
    .load("/FileStore/tables/Life_Expectancy_Data.csv")

Replacing spaces in column names with ‘_’

from pyspark.sql import functions as F

data = data.select([F.col(col).alias(col.replace(' ', '_')) for col in data.columns])

With Spark SQL, you can register any DataFrame as a table or view (a temporary table) and query it using pure SQL.
There is no performance difference between writing SQL queries and writing DataFrame code; both "compile" to the same underlying plan.

data.createOrReplaceTempView('lifeExp')
spark.sql("SELECT Status, Alcohol FROM lifeExp WHERE Status IN ('Developing', 'Developed') LIMIT 10").show()


Performance Comparison: Spark DataFrame vs Spark SQL

dataframeWay = data.groupBy('Status').count()
dataframeWay.explain()

sqlWay = spark.sql("SELECT Status, count(*) FROM lifeExp GROUP BY Status")
sqlWay.explain()

Usage of Filter function.

from pyspark.sql.functions import col

data.filter(col('Year') < 2014).groupby('Year').count().show(truncate=False)

data.filter(data.Status.isin(['Developing', 'Developed'])).groupby('Status').count().show(truncate=False)

Descriptive Analysis.

display(data.select(data.columns).describe())

We will look at outliers in the data, which introduce bias.

Convert data into pandas dataframe

data1 = data.toPandas()

# Interpolate null values in the data
data1 = data1.interpolate(method='linear', limit_direction='forward')

Boxplot using matplotlib

import matplotlib.pyplot as plt

# 'columns' is assumed here to be a dict mapping each numeric column name to its subplot index,
# e.g. {'Life_expectancy_': 1, 'Adult_Mortality': 2, ...}
plt.figure(figsize=(20, 30))
for var, i in columns.items():
    plt.subplot(5, 4, i)
    plt.boxplot(data1[var])
    plt.title(var)
plt.show()

Boxplots are a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).

The distribution of the data is as follows.

We can see most outliers in HIV/AIDS, GDP, Population, etc.

We need to treat the outliers; for this, we will apply a cube root transformation.

# Cube root transformation for the target variable
plt.hist(data1['Life_expectancy_'])
plt.title('before transformation')
plt.show()

data1['Life_expectancy_'] = data1['Life_expectancy_'] ** (1 / 3)

plt.hist(data1['Life_expectancy_'])
plt.title('after transformation')
plt.show()

# The same transformation for Adult_Mortality
plt.hist(data1['Adult_Mortality'])
plt.title('before transf')
plt.show()

data1['Adult_Mortality'] = data1.Adult_Mortality ** (1 / 3)

plt.hist(data1['Adult_Mortality'])
plt.title('after transf')
plt.show()

Similarly, we apply the cube root transformation to all other skewed features and plot the boxplots again to check the effect of the outlier treatment, as sketched below.
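A compact way to do this (a sketch; the list of columns to transform is an assumption based on the outliers observed above, so adjust it to your data) is to loop over the affected columns:

# Columns assumed to need the cube root treatment, based on the boxplots above (illustrative list).
skewed_cols = ['_HIV/AIDS', 'GDP', 'Population', 'Measles_', 'percentage_expenditure']

for colname in skewed_cols:
    data1[colname] = data1[colname] ** (1 / 3)

# Re-plot the boxplots to confirm the outliers have been reduced.
plt.figure(figsize=(20, 30))
for i, colname in enumerate(skewed_cols, start=1):
    plt.subplot(5, 4, i)
    plt.boxplot(data1[colname].dropna())
    plt.title(colname)
plt.show()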

Outliers are significantly reduced from the above observations.

Converting Status values to binary values,

data1.Status = data1.Status.map({'Developing': 0, 'Developed': 1})

Feature Importance.

import six

corrs = []
columns = []

# Compute the correlation of every numeric column with the target column
def feature_importance(col, data):
    for i in data.columns:
        if not isinstance(data.select(i).take(1)[0][0], six.string_types):
            print("Correlation to Life_expectancy_ for ", i, data.stat.corr(col, i))
            corrs.append(data.stat.corr(col, i))
            columns.append(i)

# Convert the treated pandas DataFrame back to a Spark DataFrame
sparkDF = spark.createDataFrame(data1)
# sparkDF.printSchema()

feature_importance('Life_expectancy_', sparkDF)

import pandas as pd

corr_map = pd.DataFrame()
corr_map['column'] = columns
corr_map['corrs'] = corrs
corr_map.sort_values('corrs', ascending=False)

Learn how Indium helped implement Databricks services for a global supply chain enterprise: https://www.indiumsoftware.com/success_stories/enterprise-data-mesh-for-a-supply-chain-giant.pdf

We consider features with a positive correlation for model building.

VectorAssembler and VectorIndexer: 

vectorAssembler combines all feature columns into a single feature vector column, "rawFeatures". vectorIndexer identifies categorical features and indexes them, creating a new column, "features".

from pyspark.ml.feature import VectorAssembler, VectorIndexer

# The target column is excluded from the input feature set.
featuresCols = ['Schooling', 'Income_composition_of_resources', '_BMI_', 'GDP', 'Status', 'percentage_expenditure', 'Diphtheria_',
                'Alcohol', 'Polio', 'Hepatitis_B', 'Year', 'Total_expenditure']

vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="rawFeatures")

vectorIndexer = VectorIndexer(inputCol="rawFeatures", outputCol="features", maxCategories=4)

The next step is to define the model training stage of the pipeline.
The following command defines an XgboostRegressor model that takes the input column "features" by default and learns to predict the labels in the "Life_expectancy_" column.
If you are running Databricks Runtime 9.0 ML or above, you can set the `num_workers` parameter to leverage the cluster for distributed training.

from sparkdl.xgboost import XgboostRegressor

xgb_regressor = XgboostRegressor(num_workers=3, labelCol="Life_expectancy_", missing=0.0)

Define a grid of hyperparameters to test:
- max_depth: the maximum depth of each decision tree
- n_estimators: the total number of trees

from pyspark.ml.tuning import ParamGridBuilder

paramGrid = ParamGridBuilder()\
    .addGrid(xgb_regressor.max_depth, [2, 5])\
    .addGrid(xgb_regressor.n_estimators, [10, 100])\
    .build()

Define an evaluation metric. The CrossValidator compares the true labels with predicted values for each combination of parameters, and calculates this value to determine the best model.

from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol=xgb_regressor.getLabelCol(),
                                predictionCol=xgb_regressor.getPredictionCol())

Declare the CrossValidator, which performs the model tuning.

from pyspark.ml.tuning import CrossValidator

cv = CrossValidator(estimator=xgb_regressor, evaluator=evaluator, estimatorParamMaps=paramGrid)

Defining Pipeline.

from pyspark.ml import Pipeline

# 'train' is the training split prepared earlier in this series
pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, cv])
pipelineModel = pipeline.fit(train)

Predictions.

# 'test' is the hold-out split prepared earlier in this series
predictions = pipelineModel.transform(test)
display(predictions.select('Life_expectancy_', 'prediction', *featuresCols))

rmse = evaluator.evaluate(predictions)
print("RMSE on our test set: %g" % rmse)

Output: RMSE on our test set: 0.100884

evaluatorr2 = RegressionEvaluator(metricName="r2",
                                  labelCol=xgb_regressor.getLabelCol(),
                                  predictionCol=xgb_regressor.getPredictionCol())

r2 = evaluatorr2.evaluate(predictions)
print("R2 on our test set: %g" % r2)

Output: R2 on our test set: 0.736901

From the RMSE and R-squared observations, we can see that 73% of the variance in Life_expectancy_ is explained by the independent features. We can further improve the R-squared value by including all the features except 'Country'.

featuresCols = ['Year', 'Status', 'Adult_Mortality', 'infant_deaths', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B', 'Measles_', '_BMI_', 'under-five_deaths_', 'Polio', 'Total_expenditure', 'Diphtheria_', '_HIV/AIDS', 'GDP', 'Population', '_thinness__1-19_years', '_thinness_5-9_years', 'Income_composition_of_resources', 'Schooling']

vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="rawFeatures")

vectorIndexer = VectorIndexer(inputCol="rawFeatures", outputCol="features", maxCategories=4)

xgb_regressor = XgboostRegressor(num_workers=3, labelCol="Life_expectancy_", missing=0.0)

paramGrid = ParamGridBuilder()\
    .addGrid(xgb_regressor.max_depth, [2, 5])\
    .addGrid(xgb_regressor.n_estimators, [10, 100])\
    .build()

evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol=xgb_regressor.getLabelCol(),
                                predictionCol=xgb_regressor.getPredictionCol())

cv = CrossValidator(estimator=xgb_regressor, evaluator=evaluator, estimatorParamMaps=paramGrid)

pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, cv])
pipelineModel = pipeline.fit(train)
predictions = pipelineModel.transform(test)

 New values of R2 and RMSE.

rmse = evaluator.evaluate(predictions)
print("RMSE on our test set: %g" % rmse)

evaluatorr2 = RegressionEvaluator(metricName="r2",
                                  labelCol=xgb_regressor.getLabelCol(),
                                  predictionCol=xgb_regressor.getPredictionCol())

r2 = evaluatorr2.evaluate(predictions)
print("R2 on our test set: %g" % r2)

Output: RMSE on our test set: 0.0523261, R2 on our test set: 0.92922

We see a significant improvement in RMSE and R2.

We can monitor the hyperparameters max_depth and n_estimators from the artifacts stored in JSON format (estimator_info.json, metric_info.json), or read them directly from the tuned model, as shown in the sketch below.
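One way to inspect what the tuner selected (a sketch assuming the CrossValidator is the last stage of the fitted pipeline defined above) is to pull the averaged metrics and the winning parameter combination out of the CrossValidatorModel:

import numpy as np

# The CrossValidatorModel is the last stage of the fitted pipeline defined above.
cvModel = pipelineModel.stages[-1]

# Average RMSE for each hyperparameter combination in the grid.
for params, metric in zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics):
    settings = {p.name: v for p, v in params.items()}
    print(settings, "-> avg RMSE:", metric)

# For RMSE, lower is better, so the combination with the smallest average metric was selected.
best_idx = int(np.argmin(cvModel.avgMetrics))
best_params = {p.name: v for p, v in cvModel.getEstimatorParamMaps()[best_idx].items()}
print("Best hyperparameters:", best_params)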

 

Conclusion

This post has covered exploratory data analysis and XGBoost hyperparameter tuning. Future posts will cover deployment of the model using Databricks.

Please see Part 1: The End-To-End ML Pipeline using Pyspark and Databricks (Part 1)
