Collaboration of Synthetics: ML’s Evolutionary Edge
https://www.indiumsoftware.com/blog/collaboration-of-synthetics-mls-evolutionary-edge/

The desire for data is like an endless hole in the world of data and analytics today. The big data analytics business is predicted to reach $103 billion this year, and 181 zettabytes of data will be produced by 2025.

Despite the massive volumes of data being generated, access and availability remain a problem. Although public databases partially address this issue, certain risks are still involved. One of them is bias caused by improper use of data sets. Another is the need for diverse data so that algorithms can be trained to properly satisfy real-world requirements. Data accuracy also affects the quality of the algorithm. Moreover, real data is regulated to preserve privacy and can be expensive to obtain.

These problems can be resolved by using synthetic data, which enables businesses to quickly produce the data sets required to satisfy the demands of their clients. Gartner predicts that by 2030, synthetic data will likely surpass real data in AI models, even though real data is still regarded as superior.

Decoding the Exceptional Synthetic Data

So, what do we mean by synthetic data? At the forefront of modern data-driven research, institutions like the Massachusetts Institute of Technology (MIT) are pioneering the utilization of synthetic data. Synthetic data refers to artificially generated datasets that mimic real-world data distributions, maintaining statistical properties while safeguarding privacy. This innovative approach ensures that sensitive information remains confidential, as exemplified by MIT’s creation of synthetic healthcare records that retain essential patterns for analysis without compromising patient privacy. This technique’s relevance extends to various domains, from machine learning advancements to societal insights, offering a powerful tool to unlock valuable knowledge while upholding data security and ethical considerations.

Synthetic data allows new systems to be tested when live data is unavailable or biased. It can supplement small datasets and improve the accuracy of learning models. Synthetic data can also be used when real data cannot be used, shared, or moved. It can be used to create prototypes, run product demos, capture market trends, and prevent fraud. It can even be used to generate novel, futuristic conditions.

Most importantly, it can help businesses comply with privacy laws, especially those covering health-related and personal data. It can reduce bias in data sets by providing diverse data that better reflects the real world.

Use Cases of Synthetic Data

Synthetic data can be used in different industries for different use cases. For instance, computer graphics and image processing algorithms can generate synthetic images, audio, and video that can be used for training purposes.

Synthetic text data can be used for sentiment analysis or for building chatbots and machine translation algorithms. Synthetically generated tabular data sets are used in data analytics and training models. Unstructured data, including images, audio, and video, are being leveraged for speech recognition, computer vision, and autonomous vehicle technology. Financial institutions can use synthetic data to detect fraud, manage risks, and assess credit risk. In the manufacturing industry, it can be used for quality control testing and predictive maintenance.

Also read: The Transformative Impact Of Generative AI On The Future Of Work.

Generating Synthetic Data
How synthetic data is generated will depend on the tools and algorithms used and the use case for which it is created. Three of the popular techniques used include:

Technique #1 Random Selection of Numbers: One standard method is to randomly sample numbers from a distribution fitted to the real data. Although this may not capture every insight that real-world data provides, it produces a data distribution that closely matches it.
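A minimal sketch of this technique (not from the original article): fit a simple parametric distribution to a numeric column and sample synthetic values from it. The "real" column below is simulated so the snippet is self-contained.

import numpy as np

rng = np.random.default_rng(seed=42)
real_values = rng.normal(loc=50, scale=10, size=1_000)     # stand-in for a real numeric column

mu, sigma = real_values.mean(), real_values.std()          # estimate distribution parameters
synthetic_values = rng.normal(loc=mu, scale=sigma, size=10_000)

print(round(synthetic_values.mean(), 2), round(synthetic_values.std(), 2))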

Technique #2 Generating Agent-based Models: Unique agents are created using simulation techniques to enable them to communicate with each other. This is especially useful in complex systems where multiple agents, such as mobile phones, apps, and people, are required to interact with each other. Pre-built core components and Python packages such as Mesa are used to develop the models quickly, and a browser-based interface is used to view them.
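The snippet below is a hedged sketch of this idea using the classic (pre-3.0) Mesa API with Agent, Model, and RandomActivation; the agent and model names are illustrative, and newer Mesa releases have changed this interface.

from mesa import Agent, Model
from mesa.time import RandomActivation

class PhoneAgent(Agent):
    def __init__(self, unique_id, model):
        super().__init__(unique_id, model)
        self.messages_sent = 0

    def step(self):
        # each simulated tick, the agent interacts and emits a record
        self.messages_sent += 1

class TelecomModel(Model):
    def __init__(self, n_agents):
        super().__init__()
        self.schedule = RandomActivation(self)
        for i in range(n_agents):
            self.schedule.add(PhoneAgent(i, self))

    def step(self):
        self.schedule.step()

model = TelecomModel(n_agents=100)
for _ in range(10):
    model.step()

# the agents' state can now be exported as a synthetic interaction dataset
print(sum(a.messages_sent for a in model.schedule.agents))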

Technique #3 Generative Models: Synthetic data replicating the statistical properties or features of real-world data is generated using algorithms. The model learns the statistical patterns and relationships in the training data and generates new synthetic data similar to the original. Generative adversarial networks and variational autoencoders are examples of generative models.

The quality of the model must be reliable to ensure the quality of the synthetic data. Additional verification is required, comparing the model’s output with real-world data that has been annotated manually. Users must be sure that the synthetic data is not misleading, is reliable, and is safe from a privacy standpoint.
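As a hedged illustration of one such check (not from the original article), a two-sample Kolmogorov-Smirnov test can flag when a synthetic column’s distribution drifts from the real one; both columns are simulated here for self-containment.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real = rng.normal(50, 10, size=5_000)        # placeholder for the annotated real data
synthetic = rng.normal(50, 10, size=5_000)   # placeholder for the generated data

statistic, p_value = ks_2samp(real, synthetic)
if p_value < 0.05:
    print("Distributions differ significantly - review the generator")
else:
    print("No significant distributional difference detected")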

Synthetic Data with Databricks

Databricks offers dbldatagen, a Python library for generating synthetic data for testing, POCs, and other uses such as Delta Live Tables pipelines in Databricks environments. It helps to (a hedged example follows the list):

● Create unique values for a column.
● Generate templated text based on specifications.
● Generate data from a specific set of values.
● Generate weighted data when values repeat.
● Write the generated data frame to storage in any format.
● Generate billions of rows of data quickly.
● Generate data based on the values of other fields, using a random seed.
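A hedged sketch of how such a data generation spec might look inside a Databricks notebook (where spark is available) is shown below; the generator name, columns, values, and output path are illustrative, and option names should be checked against the dbldatagen documentation.

import dbldatagen as dg

# declarative spec: one million rows across 8 Spark partitions (illustrative sizes)
spec = (
    dg.DataGenerator(spark, name="synthetic_orders", rows=1_000_000, partitions=8)
    .withIdOutput()                                                     # unique id column
    .withColumn("amount", "double", minValue=1.0, maxValue=500.0, random=True)
    .withColumn("region", "string", values=["NA", "EU", "APAC"], weights=[5, 3, 2])
    .withColumn("email", "string", template=r"\w.\w@\w.com")           # templated text
)

df = spec.build()  # returns a Spark DataFrame
df.write.format("delta").mode("overwrite").save("/tmp/synthetic_orders")  # write in any supported format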



What Cloud Engineers Need to Know about Databricks Architecture and Workflows
https://www.indiumsoftware.com/blog/what-cloud-engineers-need-to-know-about-databricks-architecture-and-workflows/

Databricks Lakehouse Platform creates a unified approach to the modern data stack by combining the best of data lakes and data warehouses with greater reliability, governance, and improved performance of data warehouses. It is also open and flexible.

Often, the data team needs different solutions to process unstructured data, enable business intelligence, and build machine learning models. The Databricks Lakehouse Platform brings all of these together. It also simplifies data processing, analysis, storage, governance, and serving, enabling data engineers, analysts, and data scientists to collaborate effectively.

For the cloud engineer, this is good news. Managing permissions, networking, and security becomes easier, as there is only one platform on which to manage and monitor security groups and identity and access management (IAM) permissions.

Challenges Faced by Cloud Engineers

Data access, reliability, and quality are key for businesses to leverage their data and make instant, informed decisions. Often, though, businesses face challenges such as:

  • No ACID transactions: Updates, appends, and reads cannot be mixed.
  • No schema enforcement: Leads to data inconsistency and low quality.
  • No integration with a data catalog: Results in the absence of a single source of truth and dark data.

Since data lakes use object storage, data is stored in immutable files, which can lead to:

  • Poor Partitioning: Ineffective partitioning leads to long development hours for improving read/write performance and the possibility of human errors.
  • Challenges to Appending Data: As transactions are not supported, new data can be appended only by adding small files, which can lead to poor query performance.


Databricks Advantages

Databricks helps overcome these problems with Delta Lake and Photon.

Delta Lake: A file-based, open-source storage format that runs on top of existing data lakes, it is compatible with Apache Spark and other processing engines and facilitates ACID transactions and handling of scalable metadata, unifying streaming and batch processing.

Delta Tables, based on Apache Parquet, are used by many organizations and are therefore interchangeable with other Parquet tables. Delta Tables can also process semi-structured and unstructured data, making data management easy by supporting versioning, reliability, time travel, and metadata management.

It ensures:

  • ACID
  • Handling of scalable data and metadata
  • Audit history and time travel
  • Enforcement and evolution of schema
  • Supporting deletes, updates, and merges
  • Unification of streaming and batch
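As a minimal sketch of these capabilities on Databricks (where Delta is the default table format), the path and column names below are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])

# ACID append to a Delta table: Parquet data files plus a transaction log
events.write.format("delta").mode("append").save("/tmp/delta/events")

# schema enforcement: a write with a mismatched schema fails instead of corrupting the table
current = spark.read.format("delta").load("/tmp/delta/events")

# time travel: read an earlier version of the same table
previous = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")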

Photon: The lakehouse paradigm is becoming the de facto standard, but it creates a challenge: the underlying query execution engine must be able to access and process both structured and unstructured data. What is needed is an execution engine with the performance of a data warehouse and the scalability of a data lake.

Photon, the next-generation query engine on the Databricks Lakehouse Platform, fills this need. As it is compatible with Spark APIs, it provides a generic execution framework enabling efficient data processing. It lowers infrastructure costs while accelerating all use cases, including data ingestion, ETL, streaming, data science, and interactive queries. It requires no code changes and involves no lock-in; just turn it on to get started.
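As a hedged illustration (not from the original post), Photon can be requested when creating a cluster through the Databricks Clusters REST API; the workspace host, token, node type, and runtime label below are placeholders, and the runtime_engine field should be verified against your workspace’s API version.

import requests

host = "https://<your-workspace>.cloud.databricks.com"   # placeholder
token = "<personal-access-token>"                        # placeholder

cluster_spec = {
    "cluster_name": "photon-demo",
    "spark_version": "11.3.x-photon-scala2.12",  # a Photon-enabled runtime label (illustrative)
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "runtime_engine": "PHOTON",
}

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
    timeout=30,
)
print(resp.json())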

Read more on how Indium can help you: Building Reliable Data Pipelines Using DataBricks’ Delta Live Tables

Databricks Architecture

The Databricks architecture enables cross-functional teams to collaborate securely by offering two main components: the control plane and the data plane. As a result, data teams can run their processes on the data plane without worrying about the backend services, which are managed by the control plane component.

The control plane consists of backend services such as notebook commands and workspace-related configurations. These are encrypted at rest. The compute resources for notebooks, jobs, and classic SQL data warehouses reside on the data plane and are activated within the cloud environment.

For the cloud engineer, this architecture provides the following benefits:

Eliminate Data Silos

A unified approach eliminates data silos and simplifies the modern data stack for a variety of uses. Being built on open source and open standards, it is flexible. A unified approach to data management, security, and governance improves efficiency and enables faster innovation.

Easy Adoption for A Variety of Use Cases

The only limit to using the Databricks architecture for the team’s different requirements is whether the cluster in the private subnet has permission to access the destination. One way to enable this is VPC peering between the VPCs, or potentially a transit gateway between the accounts.

Flexible Deployment

Databricks workspace deployment typically comes with two parts:

– The mandatory AWS resources

– The API that enables registering those resources in the control plane of Databricks

This empowers the cloud engineering team to deploy the AWS resources in a manner best suited to the business goals of the organization. The APIs facilitate access to the resources as needed.

Cloud Monitoring

The Databricks architecture also enables extensive monitoring of cloud resources. This helps cloud engineers track spending and network traffic from EC2 instances, flag erroneous API calls, monitor cloud performance, and maintain the integrity of the cloud environment. It also allows the use of popular tools such as Datadog and Amazon CloudWatch for data monitoring.

Best Practices for Improved Databricks Management

Cloud engineers must plan the workspace layout well to optimize the use of the Lakehouse and enable scalability and manageability. Some of the best practices to improve performance include:

  • Minimize the number of top-level accounts and create workspaces only as needed for compliance, isolation, or geographical constraints.
  • Keep the isolation strategy flexible without making it complex.
  • Automate the cloud processes.
  • Improve governance by creating a COE team.

Indium Software, a leading software solutions provider, can facilitate the implementation and management of Databricks Architecture in your organization based on your unique business needs. Our team has experience and expertise in Databricks technology as well as industry experience to customize solutions based on industry best practices.


FAQ

Which cloud hosting platform is Databricks available on?

Databricks is available on three cloud platforms: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud.

Will my data have to be transferred into Databricks’ AWS account?

Not needed. Databricks can access data from your current data sources.

Building Reliable Data Pipelines Using DataBricks’ Delta Live Tables
https://www.indiumsoftware.com/blog/building-reliable-data-pipelines-using-databricks-delta-live-tables/

The enterprise data landscape has become more data-driven. It has continued to evolve as businesses adopt digital transformation technologies like IoT and mobile data. In such a scenario, the traditional extract, transform, and load (ETL) process used for preparing data, generating reports, and running analytics can be challenging to maintain because it relies on manual processes for testing, error handling, recovery, and reprocessing. Data pipeline development and management can also become complex in the traditional ETL approach. Data quality can be an issue, impacting the quality of insights. The high velocity of data generation can make implementing batch or continuous streaming data pipelines difficult. Should the need arise, data engineers should be able to change the latency flexibly without re-writing the data pipeline. Scaling up as the data volume grows can also become difficult due to manual coding, leading to more time and cost spent on developing, addressing errors, cleaning up data, and resuming processing.


Automating Intelligent ETL with Delta Live Tables

Given the fast-paced changes in the market environment and the need to retain competitive advantage, businesses must address the challenges, improve efficiencies, and deliver high-quality data reliably and on time. This is possible only by automating ETL processes.

The Databricks Lakehouse Platform offers Delta Live Tables (DLT), a new cloud-native managed service that facilitates the development, testing, and operationalization of data pipelines at scale, using a reliable ETL framework. DLT simplifies the development and management of ETL with:

  • Declarative pipeline development
  • Automatic data testing
  • Monitoring and recovery with deep visibility

With Delta Live Tables, end-to-end data pipelines can be defined easily by specifying the source of the data, the logic used for transformation, and the target state of the data. It can eliminate the manual integration of siloed data processing tasks. Data engineers can also ensure data dependencies are maintained across the pipeline automatically and apply data management for reusing ETL pipelines. Incremental or complete computation for each table during batch or streaming run can be specified based on need.
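A minimal sketch of what this declarative style looks like in a DLT Python notebook (where spark is provided) is shown below; the table names and cloud storage path are illustrative.

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested from cloud storage")
def raw_orders():
    # Auto Loader incrementally picks up new files from the landing path
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/orders"))      # illustrative path

@dlt.table(comment="Cleaned orders ready for analytics")
def clean_orders():
    # DLT infers the dependency on raw_orders and orchestrates the pipeline
    return dlt.read_stream("raw_orders").where(col("amount") > 0)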

Benefits of DLT

The DLT framework can help build data processing pipelines that are reliable, testable, and maintainable. Once the data engineers provide the transformation logic, DLT can orchestrate the tasks, manage clusters, monitor the process and data quality, and handle errors. The benefits of DLT include:

Assured Data Quality

Delta Live Tables can prevent bad data from reaching the tables by validating and checking the integrity of the data. Using predefined policies on errors, such as fail, alert, drop, or quarantine, Delta Live Tables can ensure the quality of the data to improve the outcomes of BI, machine learning, and data science. It can also provide visibility into data quality trends to understand how the data is evolving and what changes are necessary.
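A hedged sketch of such expectations in a DLT Python notebook is shown below; the rule names and constraints are illustrative.

import dlt

@dlt.table
@dlt.expect_or_drop("valid_amount", "amount IS NOT NULL AND amount >= 0")   # drop violating rows
@dlt.expect_or_fail("valid_order_id", "order_id IS NOT NULL")               # fail the update on violation
def validated_orders():
    return dlt.read("clean_orders")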

Improved Pipeline Visibility

DLT can monitor pipeline operations by providing tools that enable visual tracking of operational stats and data lineage. Automatic error handling and easy replay can reduce downtime and accelerate maintenance with deployment and upgrades at the click of a button.

Improve Regulatory Compliance

The event log can automatically capture information related to the table for analysis and auditing. DLT can provide visibility into the flow of data in the organization and improve regulatory compliance.

Simplify Deployment and Testing of Data Pipeline

DLT can enable data to be updated and lineage information to be captured for different copies of data using a single code base. It can also enable the same set of query definitions to be run through the development, staging, and production stages.

Simplify Operations with Unified Batch and Streaming

Building and running batch and streaming pipelines can be centralized, and operational complexity can be effectively minimized with controllable and automated refresh settings.

Concepts Associated with Delta Live Tables

The concepts used in DLT include:

Pipeline: A Directed Acyclic Graph that can link data sources with destination datasets

Pipeline Setting: Pipeline settings can define configurations such as:

  • Notebook
  • Target DB
  • Running mode
  • Cluster config
  • Configurations (Key-Value Pairs).

Dataset: DLT supports two types of datasets, views and tables, which in turn come in two flavors: live and streaming.

Pipeline Modes: Delta Live provides two modes for development:

Development Mode: The cluster is reused to avoid restarts, and pipeline retries are disabled so that errors can be detected and fixed quickly.

Production Mode: The cluster is restarted for recoverable errors such as stale credentials or a memory leak, and execution is retried for specific errors.

Editions: DLT comes in various editions to suit the different needs of the customers such as:

  • Core for streaming ingest workload
  • Pro for core features plus CDC, streaming ingest, and table updates based on changes to the source data
  • Advanced where in addition to core and pro features, data quality constraints are also available

Delta Live Event Monitoring: The Delta Live Tables pipeline event log is stored under the pipeline's storage location in /system/events.
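A hedged sketch of querying that event log from a notebook is shown below; the storage path is a placeholder for the pipeline's configured storage location.

event_log = spark.read.format("delta").load("/pipelines/my-pipeline/system/events")

(event_log
 .select("timestamp", "event_type", "message")
 .where("event_type = 'flow_progress'")
 .show(truncate=False))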

Indium for Building Reliable Data Pipelines Using DLT

Indium is a recognized data engineering company with an established practice in Databricks. We offer ibriX, an Indium Databricks AI Platform, that helps businesses become agile, improve performance, and obtain business insights efficiently and effectively.

Our team of Databricks experts works closely with customers across domains to understand their business objectives and deploy the best practices to accelerate growth and achieve the goals. With DLT, Indium can help businesses leverage data at scale to gain deeper and meaningful insights to improve decision-making.

FAQs

How does Delta Live Tables make the maintenance of tables easier?

Delta Live Tables performs maintenance tasks on tables every 24 hours, which improves query performance. It also removes older versions of tables, improving cost-effectiveness.

Can multiple queries be written in a pipeline for the same target table?

No, this is not possible. Each table should be defined once. UNION can be used to combine various inputs to create a table.

Distributed Data Processing Using Databricks
https://www.indiumsoftware.com/blog/distributed-data-processing-using-databricks/

Distributed systems are used in organizations for collecting, accessing, and manipulating large volumes of data. Recently, distributed systems have become an integral component of various organizations as an exponential increase in data is witnessed across industries.  

With the advent of big data technologies, many challenges in dealing with large datasets have been addressed. But in a typical data processing scenario, when a data set is too large to be processed by a single machine, or when a single machine does not contain the data needed to answer user queries, the processing power of multiple machines is required. These scenarios are becoming increasingly complex as data flows into an organization from many applications, devices, and social platforms, and this is where distributed data processing methods are best implemented.


Understanding Distributed Data Processing 

In distributed data processing, large volumes of data flow into the system from various sources. Several layers in the system manage this data ingestion process.

First, the data collection and preparation layer collects data from different sources for further processing. Data gathered from external sources is mostly raw data such as text, images, audio, and forms, so the preparation layer is responsible for converting it into a usable, standard format for analytics.

The data storage layer primarily handles real-time data streaming for analytics, using in-memory distributed caches to store and manage data. If data needs to be processed in the conventional way, batch processing is performed across distributed databases, effectively handling big data.

Next is the data processing layer, the logical layer where the data is processed. It applies machine learning solutions and models to perform predictive and descriptive analytics and derive meaningful business insights. Finally, the data visualization layer consists of dashboards that present the data and analytical results as graphs and charts for easier interpretation.

In the quest for new approaches to distributing processing power, application programs, and data, distributed data engineering solutions are adopted to spread applications and data across interconnected sites and meet organizations’ growing need for information. An organization may opt for a centralized or a decentralized data processing system, depending on its requirements.

Benefits of Distributed Data Processing 

The critical benefit of processing data in a distributed environment is that tasks are completed in significantly less time, because data is accessible from multiple machines that execute tasks in parallel instead of a single machine running requests in a queue.

As the data is processed faster, it is a cost-effective approach for businesses, and running workloads in a distributed environment meets crucial aspects of scalability and availability in today’s fast-paced environment. In addition, since data is replicated across the clusters, there is less likelihood of data loss.

Challenges of Distributed Data Processing

The entire process of setting up and working with a distributed system is complex.  

For large enterprises, compromised data security, coordination problems, occasional performance bottlenecks due to non-performing terminals in the system, and even high maintenance costs are seen as major issues.

How is Databricks Platform Used for Distributed Data Processing? 

The Databricks Lakehouse cloud data platform helps perform analytical queries, and Databricks SQL is provided for business intelligence and analytical tasks on top of the data lake. Analysts can query data sets using standard SQL and integrate business intelligence tools like Tableau. At the same time, the Databricks platform supports different workloads encompassing machine learning, data storage, data processing, and real-time streaming analytics.

The immediate benefit of a Databricks architecture is enabling seamless connections to applications and effective cluster management. Additionally, Databricks provides simplified setup and maintenance of clusters, which makes it easy for developers to create ETL pipelines. These ETL pipelines ensure real-time data availability across the organization, leading to better collaboration among cross-functional teams.

With the Databricks Lakehouse platform, it is now easy to ingest and transform batch and streaming data, leading to reliable production workflows. Moreover, Databricks ensures clusters scale and terminate automatically based on usage. Since the data ingestion process is simplified, all analytical solutions, AI, and other streaming applications can be operated from a single place.

Likewise, automated ETL processing ensures raw data is immediately transformed and readily available for analytics and AI applications. Beyond data transformation, automating ETL allows for efficient task orchestration, error handling, recovery, and performance optimization. Orchestration enables developers to work with diverse workloads, and Databricks workflows can be accessed with a host of dashboard features, improving tracking and monitoring of performance and jobs in the pipeline. This approach continuously monitors performance, data quality, and reliability metrics from various perspectives.

In addition, Databricks offers a data processing engine compatible with Apache Spark APIs that speeds up work by automatically scaling across multiple nodes. Another critical aspect of the Databricks platform is enabling governance of all data and AI-based applications with a single model for discovering, accessing, and securing data sharing across cloud platforms.

Similarly, the Databricks Lakehouse supports Databricks SQL, a serverless data warehouse capable of running SQL and business intelligence applications at scale.

Databricks Services From Indium: 

With deep expertise in Databricks Lakehouse, advanced analytics, and data products, Indium Software provides a wide range of services to meet our clients’ business needs. Indium’s proprietary solution accelerator iBriX is a packaged combination of AI/ML use cases, custom scripts, reusable libraries, processes, policies, optimization techniques, and performance management with various levels of automation, including standard operating procedures and best practices.

To know more about iBriX and the services we offer, write to info@www.indiumsoftware.com.  

Deploying Databricks on AWS: Key Best Practices to Follow
https://www.indiumsoftware.com/blog/deploying-databricks-on-aws-key-best-practices-to-follow/

Databricks is a unified, open platform for all organizational data and is built along the architecture of a data lake. It ensures speed, scalability, and reliability by combining the best of data warehouses and data lakes. At the core is the Databricks workspace that stores all objects, assets, and computational resources, including clusters and jobs.

Over the years, simplifying Databricks deployment on AWS became a persistent demand due to the complexity involved. When deploying Databricks on AWS, customers had to constantly switch between consoles, following very detailed documentation. To deploy the workspace, customers had to:

  • Configure a virtual private cloud (VPC)
  • Set up security groups
  • Create a cross-account AWS Identity and Access Management (IAM) role
  • Add all AWS services used in the workspace

This could take more than an hour and needed a Databricks solutions architect familiar with AWS to guide the process.

To simplify matters and enable self-service, Databricks offers Quick Start in collaboration with Amazon Web Services (AWS). This is an automated reference deployment tool that integrates AWS best practices and leverages AWS CloudFormation templates to deploy key technologies on AWS.

Incorporating AWS Best Practices

Best Practice #1 – Ready, Steady, Go

Make it easy even for non-technical customers to get Databricks up and running in minutes. Quick Starts allows customers to sign in to the AWS Management Console, select the CloudFormation template and Region, fill in the required parameter values, and deploy Databricks within minutes. Quick Starts applies to several environments, and the architecture is designed so that customers using any environment can leverage it.
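For teams that prefer scripting over the console, the same kind of CloudFormation deployment can be driven programmatically. The sketch below uses boto3 with a placeholder template URL and parameter names, since the actual Databricks Quick Start template defines its own parameter set.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

response = cfn.create_stack(
    StackName="databricks-workspace",
    TemplateURL="https://example-bucket.s3.amazonaws.com/databricks-quickstart.yaml",  # placeholder
    Parameters=[
        {"ParameterKey": "AccountId", "ParameterValue": "<databricks-account-id>"},    # placeholder
        {"ParameterKey": "WorkspaceName", "ParameterValue": "analytics-workspace"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM roles
)
print(response["StackId"])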

Best Practice #2 – Automating Installation

Earlier, deploying Databricks involved installing and configuring several components manually. This was a slow process, prone to errors and rework. Customers had to refer to documentation to get it right, which proved difficult. By automating the process, AWS cloud deployments can be sped up effectively and efficiently.

Best Practice #3 – Security from the Word Go

One of the AWS best practices is a focus on security and availability, and this focus should be integrated right from the beginning of a Databricks deployment. For effective security and availability, align with AWS user management so that one-time IAM setup provides access to the environment with appropriate controls. Supplement this with AWS Security Token Service (AWS STS) to authenticate user requests with temporary, limited-privilege credentials.
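As a hedged sketch of the STS piece, temporary credentials can be requested as follows; the role ARN is a placeholder.

import boto3

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/databricks-cross-account-role",  # placeholder
    RoleSessionName="databricks-deployment",
    DurationSeconds=3600,
)["Credentials"]

# the temporary keys expire automatically
print(creds["AccessKeyId"], creds["Expiration"])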

Best Practice #4 High Availability

As the environment spans two Availability Zones, it ensures a highly available architecture. Add a Databricks- or customer-managed virtual private cloud (VPC) to the customer’s AWS account and configure it with private subnets and a public subnet. This will provide customers with access to their own virtual network on AWS. In the private subnets, Databricks clusters of Amazon Elastic Compute Cloud (Amazon EC2) instances can be added along with additional security groups to ensure secure cluster connectivity. In the public subnet, outbound internet access can be provided with a network address translation (NAT) gateway. Use Amazon Simple Storage Service (Amazon S3) bucket for storing objects such as notebook revisions, cluster logs, and job results.

The benefit of using these best practices is that creating and configuring the AWS resources required to deploy the Databricks workspace can be automated easily. Solutions architects do not need extensive training on the configurations, and the process is intuitive. This helps them stay current with the latest product enhancements, security upgrades, and user experience improvements without difficulty.

Since the launch of Quick Starts in September 2020, Databricks deployment on AWS has become much simpler, resulting in:

  • Deployment takes only 5 minutes, compared to about an hour earlier
  • 95% lower deployment errors

As it incorporates the best practices of AWS and is co-developed by AWS and Databricks, the solution answers the need of its customers to quickly and effectively deploy Databricks on AWS.

Indium – Combining Technology with Experience

Indium Software is an AWS and Databricks solution provider with a battalion of data experts who can help you with deploying Databricks on AWS to set you off on your cloud journey. We work with our customers closely to understand their business goals and smooth digital transformation by designing solutions that cater to their goals and objectives.

While Quick Starts is a handy tool that accelerates the deployment of Databricks on AWS, we help design the data lake architecture to optimize cost and resources and maximize benefits. Our expertise in DevSecOps ensures a secure and scalable solution that is highly available with permission-based access to enable self-service with compliance.

Some of the key benefits of working with Indium on Databricks deployments include:

  • More than 120 person-years of Spark expertise
  • Dedicated Lab and COE for Databricks
  • ibriX – Homegrown Databricks Accelerator for faster Time-to-market
  • Cost Optimization Framework – Greenfield and Brownfield engagements
  • E2E Data Expertise – Lakehouse, Data Products, Advanced Analytics, and ML Ops
  • Wide Industry Experience – Healthcare, Financial Services, Manlog, Retail and Realty

FAQs

How do you create a Databricks workspace on AWS?

You can sign up for a free trial by clicking the Try Databricks button at the top of the page or on AWS Marketplace.

How can one store and access data on Databricks and AWS?

All data can be stored and managed on a simple, open lakehouse platform. Databricks on AWS allows the unification of all analytics and AI workloads by combining the best of data warehouses and data lakes.

How can Databricks connect to AWS?

AWS Glue allows Databricks to be integrated and Databricks table metadata to be shared from a centralized catalog across various Databricks workspaces, AWS services, AWS accounts, and applications for easy access.

Building a Databricks Lakehouse on AWS to Manage AI and Analytics Workloads Better
https://www.indiumsoftware.com/blog/building-a-databricks-lakehouse-on-aws-to-manage-ai-and-analytics-workloads-better/

Businesses need cost-efficiency, flexibility, and scalability with an open data management architecture to meet their growing AI and analytics needs. Data lakehouse provides businesses with capabilities for data management and ACID transactions using an open system design that allows the implementation of data structures and management features similar to those of a data warehouse. It accelerates the access to complete and current data from multiple sources by merging them into a single system for projects related to data science, business analytics, and machine learning.

Some of the key technologies that enable the data lakehouse to provide these benefits include:

  • Layers of metadata
  • Improved SQL execution enabled by new query engine designs
  • Optimized access for data science and machine learning tools


Data Lakes for Improved Performance

Metadata layers track the files that can be part of different table versions to enable ACID-compliant transactions. They support streaming I/O without the need for message buses (such as Kafka), facilitating access to older versions of the table, schema enforcement and evolution, and data validation.

But among these features, what makes the data lake popular is its performance with the introduction of new query engine designs for SQL analysis. In addition, some optimizations include:

  • Hot data caching in RAM/SSDs
  • Cluster co-accessed data layout optimization
  • Statistics, indexes, and other such auxiliary data structures
  • Vectorized execution on modern CPUs

This makes data lakehouse performance on large datasets comparable to that of popular data warehouses on TPC-DS benchmarks. Because the lakehouse is built on open data formats such as Parquet, data scientists and machine learning engineers can access the data easily.

Indium’s capabilities with Databricks services: UX Enhancement & Cross Platform Optimization of Healthcare Application

Easy Steps to Building Databricks Data Lakehouse on AWS

As businesses increase their adoption of AI and analytics and scale up, they can leverage Databricks consulting services to benefit from their data while keeping it simple and accessible. Databricks provides a cost-effective, pay-as-you-go offering on AWS that allows the use of existing AWS accounts and infrastructure.

Databricks on AWS is a collaborative workspace for machine learning, data science, and analytics, using the Lakehouse architecture to process large volumes of data and accelerate innovation. The Databricks Lakehouse Platform, forming the core of the AWS ecosystem, integrates easily and seamlessly with popular Data and AI services such as S3 buckets, Kinesis streams, Athena, Redshift, Glue, and QuickSight, among others.

Building a Databricks Lakehouse on AWS is very easy and involves:

Quick Setup: For customers with AWS partner privileges, setting up Databricks is as simple as subscribing to the service directly from their AWS account without creating a new account. The Databricks Marketplace listing is available in the AWS Marketplace and can be accessed through a simple search. A self-service Quickstart video is available to help businesses create their first workspace.

Smooth Onboarding: The Databricks pay-as-you-go service can be set up using AWS credentials. Databricks allows the account settings and roles in AWS to be preserved, accelerating the setting up and the kick-off of the Lakehouse building.

Pay Per Use: Databricks on AWS is a cost-effective solution, as customers pay based on resource use. The billing is linked to their existing Enterprise Discount Program, enabling them to build a flexible and scalable lakehouse on AWS based on their needs.

Try Before Signing Up: AWS customers can opt for a free 14-day trial of Databricks before signing up for the subscription. The billing and payment can be consolidated under their already-present AWS management account.

Benefits of Databricks Lakehouse on AWS

Apart from a cost-effective, flexible and scalable solution for improved management of AI and analytics workloads, some of the other benefits include:

  • Supporting AWS Graviton2-based Amazon Elastic Compute Cloud (Amazon EC2) instances for 3x improvement in performance
  • Exceptional price-performance ensured by Graviton processors for workloads running in EC2
  • Improved performance by using Photon, the new query engine from Databricks, which the Databricks engineering team has benchmarked on Graviton2-based instances

It might be interesting to read on End-To-End ML Pipeline using Pyspark and Databricks (Part I)

Indium–A Databricks Expert for Your AI/Analytics Needs

Indium Software is a leading provider of data engineering, machine learning, and data analytics solutions. An AWS partner, we have an experienced team of Databricks experts who can build Databricks Lakehouse on AWS quickly to help you manage your AI and analytics workloads better.

Our range of services includes:

Data Engineering Solutions: Our quality engineering practices optimize data fluidity from origin to destination.

BI & Data Modernization Solutions: Improve decision making through deeper insights and customized, dynamic visualization

Data Analytics Solutions: Leverage powerful algorithms and techniques to augment decision-making with machines for exploratory scenarios

AI/ML Solutions: Draw deep insights using intelligent machine learning services

We use our cross-domain experience to design innovative solutions for your business, meeting your objectives and the need for accelerating growth, improving efficiency, and moving from strength to strength. Our team of capable data scientists and solution architects leverage modern technologies cost-effectively to optimize resources and meet strategic imperatives.

Inquire Now to know more about our Databricks on AWS capabilities.

EDA, XGBoost and Hyperparameter Tuning using pySpark on Databricks (Part III)
https://www.indiumsoftware.com/blog/end-to-end-ml-pipeline-using-pyspark-and-databricks-part-3

This post is a continuation of the series covering model building using Spark on Databricks. In this post, we are going to cover EDA and hyperparameter optimization using PySpark.

In case you missed part-1, here you go: https://www.indiumsoftware.com/blog/end-to-end-ml-pipeline-using-pyspark-and-databricks-part-1/

Load the data using pyspark

spark = SparkSession \
    .builder \
    .appName("Life Expectancy using Spark") \
    .getOrCreate()

sc = spark.sparkContext
sqlCtx = SQLContext(sc)

data = sqlCtx.read.format("com.databricks.spark.csv")\
    .option("header", "true")\
    .option("inferschema", "true")\
    .load("/FileStore/tables/Life_Expectancy_Data.csv")

Replacing spaces in column names with ‘_’

data = data.select([F.col(col).alias(col.replace(' ', '_')) for col in data.columns])

With Spark SQL, you can register any DataFrame as a table or view (a temporary table) and query it using pure SQL.
There is no performance difference between writing SQL queries and writing DataFrame code; they both "compile" to the same underlying plan that we specify in DataFrame code.

data.createOrReplaceTempView('lifeExp')
spark.sql("SELECT Status, Alcohol FROM lifeExp WHERE Status IN ('Developing', 'Developed') LIMIT 10").show()


Performance Comparison Spark DataFrame vs Spark SQL

dataframeWay = data.groupBy('Status').count()
dataframeWay.explain()

sqlWay = spark.sql("SELECT Status, count(*) FROM lifeExp GROUP BY Status")
sqlWay.explain()

Usage of Filter function.

data.filter(col('Year') < 2014).groupby('Year').count().show(truncate=False)

data.filter(data.Status.isin(['Developing', 'Developed'])).groupby('Status').count().show(truncate=False)

Descriptive Analysis.

display(data.select(data.columns).describe())

We will look at outliers in the data, which introduce bias.

Convert data into pandas dataframe

data1 = data.toPandas()

# interpolate null values in data
data1 = data1.interpolate(method='linear', limit_direction='forward')

Boxplot using matplotlib

# `columns` is assumed to be a dict mapping column names to subplot positions
plt.figure(figsize=(20, 30))
for var, i in columns.items():
    plt.subplot(5, 4, i)
    plt.boxplot(data1[var])
    plt.title(var)
plt.show()

Boxplots are a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).

The resulting boxplots show the distribution of the data.

We can see most outliers in HIV/AIDS, GDP, Population, etc.

We need to treat the outliers; for this, we will apply a cube root transformation.

# Cube root transformation
plt.hist(data1['Life_expectancy_'])
plt.title('before transformation')
plt.show()
data1['Life_expectancy_'] = data1['Life_expectancy_']**(1/3)
plt.hist(data1['Life_expectancy_'])
plt.title('after transformation')
plt.show()

# for Adult_Mortality
plt.hist(data1['Adult_Mortality'])
plt.title('before transf')
plt.show()
data1['Adult_Mortality'] = data1.Adult_Mortality**(1/3)
plt.hist(data1['Adult_Mortality'])
plt.title('after transf')
plt.show()

Similarly, we apply the cube root transformation to all other features and plot the boxplots again to check the outlier treatment.

Outliers are significantly reduced from the above observations.

Converting Status values to binary values,

data1.Status = data1.Status.map({'Developing': 0, 'Developed': 1})

Feature Importance.

corrs = []
columns = []
def feature_importance(col, data):
    for i in data.columns:
        if not isinstance(data.select(i).take(1)[0][0], six.string_types):
            print("Correlation to Life_expectancy_ for ", i, data.stat.corr(col, i))
            corrs.append(data.stat.corr(col, i))
            columns.append(i)

sparkDF = spark.createDataFrame(data1)
# sparkDF.printSchema()

feature_importance('Life_expectancy_', sparkDF)

corr_map = pd.DataFrame()
corr_map['column'] = columns
corr_map['corrs'] = corrs
corr_map.sort_values('corrs', ascending=False)

Learn how Indium helped implement Databricks services for a global supply chain enterprise: https://www.indiumsoftware.com/success_stories/enterprise-data-mesh-for-a-supply-chain-giant.pdf

We consider features with positive correlation for model building.

VectorAssembler and VectorIndexer: 

vectorAssembler combines all feature columns into a single feature vector column, "rawFeatures". vectorIndexer identifies categorical features and indexes them, creating a new column, "features".

# Remove the target column from the input feature set.
featuresCols = ['Schooling', 'Income_composition_of_resources', '_BMI_', 'GDP', 'Status', 'percentage_expenditure', 'Diphtheria_',
                'Alcohol', 'Polio', 'Hepatitis_B', 'Year', 'Total_expenditure']

vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="rawFeatures")

vectorIndexer = VectorIndexer(inputCol="rawFeatures", outputCol="features", maxCategories=4)

The next step is to define the model training stage of the pipeline. 
The following command defines an XgboostRegressor model that takes an input column "features" by default and learns to predict the labels in the "Life_expectancy_" column.
If you are running Databricks Runtime for Machine Learning 9.0 ML or above, you can set the `num_workers` parameter to leverage the cluster for distributed training.

from sparkdl.xgboost import XgboostRegressor
xgb_regressor = XgboostRegressor(num_workers=3, labelCol="Life_expectancy_", missing=0.0)

Define a grid of hyperparameters to test:
 - max_depth: maximum depth of each decision tree
 - n_estimators: the total number of trees

paramGrid = ParamGridBuilder()\
  .addGrid(xgb_regressor.max_depth, [2, 5])\
  .addGrid(xgb_regressor.n_estimators, [10, 100])\
  .build()

Define an evaluation metric. The CrossValidator compares the true labels with predicted values for each combination of parameters, and calculates this value to determine the best model.

evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol=xgb_regressor.getLabelCol(),
                                predictionCol=xgb_regressor.getPredictionCol())

Declare the CrossValidator, which performs the model tuning.

cv = CrossValidator(estimator=xgb_regressor, evaluator=evaluator, estimatorParamMaps=paramGrid)

Defining Pipeline.

from pyspark.ml import Pipeline
# `train` and `test` are assumed to come from an earlier random split of the prepared DataFrame
pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, cv])
pipelineModel = pipeline.fit(train)

Predictions.

predictions = pipelineModel.transform(test)
display(predictions.select('Life_expectancy_', 'prediction', *featuresCols))

rmse = evaluator.evaluate(predictions)
print("RMSE on our test set: %g" % rmse)

Output: RMSE on our test set: 0.100884

evaluatorr2 = RegressionEvaluator(metricName="r2",
                                  labelCol=xgb_regressor.getLabelCol(),
                                  predictionCol=xgb_regressor.getPredictionCol())

r2 = evaluatorr2.evaluate(predictions)
print("R2 on our test set: %g" % r2)

Output: R2 on our test set: 0.736901

From the RMSE and R-squared observations, we can see that 73% of the variance of Life_expectancy_ is explained by the independent features. We can further improve the R-squared value by including all the features except 'Country'.

featuresCols = ['Year', 'Status', 'Adult_Mortality', 'infant_deaths', 'Alcohol', 'percentage_expenditure', 'Hepatitis_B', 'Measles_', '_BMI_', 'under-five_deaths_', 'Polio', 'Total_expenditure', 'Diphtheria_', '_HIV/AIDS', 'GDP', 'Population', '_thinness__1-19_years', '_thinness_5-9_years', 'Income_composition_of_resources', 'Schooling']

vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="rawFeatures")

vectorIndexer = VectorIndexer(inputCol="rawFeatures", outputCol="features", maxCategories=4)

xgb_regressor = XgboostRegressor(num_workers=3, labelCol="Life_expectancy_", missing=0.0)

paramGrid = ParamGridBuilder()\
  .addGrid(xgb_regressor.max_depth, [2, 5])\
  .addGrid(xgb_regressor.n_estimators, [10, 100])\
  .build()

evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol=xgb_regressor.getLabelCol(),
                                predictionCol=xgb_regressor.getPredictionCol())

cv = CrossValidator(estimator=xgb_regressor, evaluator=evaluator, estimatorParamMaps=paramGrid)

pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, cv])
pipelineModel = pipeline.fit(train)
predictions = pipelineModel.transform(test)

 New values of R2 and RMSE.

rmse = evaluator.evaluate(predictions)
print("RMSE on our test set: %g" % rmse)

evaluatorr2 = RegressionEvaluator(metricName="r2",
                                  labelCol=xgb_regressor.getLabelCol(),
                                  predictionCol=xgb_regressor.getPredictionCol())

r2 = evaluatorr2.evaluate(predictions)
print("R2 on our test set: %g" % r2)

Output: RMSE on our test set: 0.0523261, R2 on our test set: 0.92922

We see a significant improvement in RMSE and R2.

We can monitor the hyperparameters max_depth and n_estimators from the artifacts stored in JSON format (estimator_info.json, metric_info.json).
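A hedged sketch (not part of the original notebook) of logging the tuning results explicitly to MLflow, assuming the CrossValidator is the final pipeline stage and that rmse and r2 were computed as above:

import mlflow

cv_model = pipelineModel.stages[-1]   # fitted CrossValidatorModel (assumed to be the last stage)

with mlflow.start_run(run_name="xgboost-life-expectancy-tuning"):
    # average cross-validation RMSE for each hyperparameter combination tried
    for params, avg_rmse in zip(paramGrid, cv_model.avgMetrics):
        print({p.name: v for p, v in params.items()}, avg_rmse)

    # final hold-out metrics computed earlier in the post
    mlflow.log_metric("test_rmse", rmse)
    mlflow.log_metric("test_r2", r2)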

 

Conclusion

This post has covered exploratory data analysis and XGBoost hyperparameter tuning. Further posts will cover model deployment using Databricks.

Please see part 1: End-To-End ML Pipeline using Pyspark and Databricks (Part I)

End-To-End ML Pipeline using Pyspark and Databricks (Part II)
https://www.indiumsoftware.com/blog/end-to-end-ml-pipeline-using-pyspark-and-databricks-part-2


In the previous post, we covered a brief introduction to Databricks. In this post, we will build a model to predict life expectancy using various attributes. Our data is from Kaggle: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

The life expectancy data aims to find the relationship between various factors and life expectancy: which factors are negatively related and which are positively related.

Let’s import the libraries needed, 

import pandas as pd 
from os.path import expanduser, join, abspath 
import pyspark 
from pyspark.sql import SparkSession 
from pyspark.sql import Row 
from pyspark.sql import SQLContext 
from pyspark.sql.types import * 
from pyspark.sql import Row 
from pyspark.sql.functions import * 
from pyspark.sql import functions as F 
from pyspark.ml.feature import StringIndexer, OneHotEncoder 
import mlflow 
import mlflow.pyfunc 
import mlflow.sklearn 
import numpy as np 
from sklearn.ensemble import RandomForestClassifier 
from pyspark.ml.regression import LinearRegression 
import matplotlib.pyplot as plt 
import six 
from pyspark.ml.feature import VectorAssembler, VectorIndexer 
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder 
from pyspark.ml.evaluation import RegressionEvaluator 

As mentioned in the earlier tutorial, the uploaded data can be found at the file path /FileStore/tables/<file_name>. Using dbutils, we can view the files uploaded into the workspace.

display(dbutils.fs.ls("/FileStore/tables/Life_Expectancy_Data.csv"))

Loading the data using spark 

spark = SparkSession \
    .builder \
    .appName("Life Expectancy using Spark") \
    .getOrCreate()

sc = spark.sparkContext
sqlCtx = SQLContext(sc)

data = sqlCtx.read.format("com.databricks.spark.csv")\
    .option("header", "true")\
    .option("inferschema", "true")\
    .load("/FileStore/tables/Life_Expectancy_Data.csv")

Replacing spaces in column names with '_' to reduce ambiguity when saving the results as a table in Databricks.

data = data.select([F.col(col).alias(col.replace(' ', '_')) for col in data.columns])


Preprocessing Data: 

Replacing NAs using pandas interpolate. To use pandas interpolate, we need to convert the Spark DataFrame to a pandas DataFrame.

data1 = data.toPandas()
data1 = data1.interpolate(method='linear', limit_direction='forward')
data1.isna().sum()

## Output (not fully displayed)
Out[17]:
Country 0
Year 0
Status 0
Life_expectancy_ 0
Adult_Mortality 0
infant_deaths 0
Alcohol 0
percentage_expenditure 0
Hepatitis_B 0
Measles_ 0
_BMI_ 0

The Status attribute has two categories, 'Developed' and 'Developing', which need to be converted into binary values.

data1.Status = data1.Status.map({'Developing': 0, 'Developed': 1})

Converting the pandas dataframe back to a Spark dataframe:

sparkDF = spark.createDataFrame(data1)
sparkDF.printSchema()

Out:
|-- Country: string (nullable = true)
|-- Year: integer (nullable = true)
|-- Status: string (nullable = true)
|-- Life_expectancy_: double (nullable = true)
|-- Adult_Mortality: double (nullable = true)
|-- infant_deaths: integer (nullable = true)
|-- Alcohol: double (nullable = true)
|-- percentage_expenditure: double (nullable = true)
|-- Hepatitis_B: double (nullable = true)
|-- Measles_: integer (nullable = true)
|-- _BMI_: double (nullable = true)
|-- under-five_deaths_: integer (nullable = true)
|-- Polio: double (nullable = true)
|-- Total_expenditure: double (nullable = true)
|-- Diphtheria_: double (nullable = true)
|-- _HIV/AIDS: double (nullable = true)
|-- GDP: double (nullable = true)
|-- Population: double (nullable = true)
|-- _thinness__1-19_years: double (nullable = true)
|-- _thinness_5-9_years: double (nullable = true)
|-- Income_composition_of_resources: double (nullable = true)
|-- Schooling: double (nullable = true)

Checking for null values in each column. 

# null values in each column 
data_agg = sparkDF.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in sparkDF.columns]) 
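To inspect the result, the aggregated counts can be shown in the notebook (a small sketch; data_agg is a single-row dataframe holding the null count per column):

# Sketch: show the per-column null counts as a single-row table
display(data_agg)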

Descriptive Analysis of data 

sparkDF = sparkDF.drop(*['Country'])
sparkDF.describe().toPandas().transpose()


Feature importance 

Some columns are positively correlated with the target and others negatively correlated, but initially we consider all features for our prediction.

We will do further analysis and EDA on the data in upcoming posts.

corrs = []
columns = []

def feature_importance(col, data):
    for i in data.columns:
        if not isinstance(data.select(i).take(1)[0][0], six.string_types):
            print("Correlation to Life_expectancy_ for", i, data.stat.corr(col, i))
            corrs.append(data.stat.corr(col, i))
            columns.append(i)

feature_importance('Life_expectancy_', sparkDF)

corr_map = pd.DataFrame()
corr_map['column'] = columns
corr_map['corrs'] = corrs
corr_map.sort_values('corrs', ascending=False)

Modelling 

We are using MLflow, which is integrated into Databricks, to track Spark jobs and model training runs and to store the results.

with mlflow.start_run(run_name='StringIndexing and OneHotEncoding'):
    # create object of StringIndexer class and specify input and output column
    SI_status = StringIndexer(inputCol='Status', outputCol='Status_Index')
    sparkDF = SI_status.fit(sparkDF).transform(sparkDF)

    # view the transformed data
    sparkDF.select('Status', 'Status_Index').show(10)

    # create object and specify input and output column
    OHE = OneHotEncoder(inputCol='Status_Index', outputCol='Status_OHE')

    # transform the data
    sparkDF = OHE.fit(sparkDF).transform(sparkDF)

    # view the transformed data
    sparkDF.select('Status', 'Status_OHE').show(10)

Linear Regression 

from pyspark.ml.feature import VectorAssembler

# specify the input and output columns of the vector assembler
assembler = VectorAssembler(inputCols=['Status_OHE',
    'Adult_Mortality',
    'infant_deaths',
    'Alcohol',
    'percentage_expenditure',
    'Hepatitis_B',
    'Measles_',
    '_BMI_',
    'under-five_deaths_',
    'Polio',
    'Total_expenditure',
    'Diphtheria_',
    '_HIV/AIDS',
    'GDP',
    'Population',
    '_thinness__1-19_years',
    '_thinness_5-9_years',
    'Income_composition_of_resources',
    'Schooling'], outputCol='Independent Features')

# transform the data
final_data = assembler.transform(sparkDF)

# view the transformed vector
final_data.select('Independent Features').show()

Split the data into train and test 

finalized_data = final_data.select("Independent Features", "Life_expectancy_")
train_data, test_data = finalized_data.randomSplit([0.75, 0.25])

In the training step we use MLflow to track the training runs; it stores the metrics, so we can compare multiple runs across different models.

with mlflow.start_run(run_name='Linear Reg'):
    lr_regressor = LinearRegression(featuresCol='Independent Features', labelCol='Life_expectancy_')
    lr_model = lr_regressor.fit(train_data)

    print("Coefficients: " + str(lr_model.coefficients))
    print("Intercept: " + str(lr_model.intercept))

    trainingSummary = lr_model.summary
    print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
    print("r2: %f" % trainingSummary.r2)

Output:
Coefficients: [-1.6300402968838337,-0.020228619021424112,0.09238300657380137,0.039041432483016204,0.0002192298797413578,-0.006499261193027583,-3.108789317615403e-05,0.04391511799217852,-0.0695270904865562,0.027046130391594696,0.019637176340948595,0.032575320937794784,-0.47919373797042153,1.687696006431926e-05,2.1667807398855064e-09,-0.03641714616291695,-0.024863209379369915,6.056648065702587,0.6757536166078358]
Intercept: 56.585308082800374

RMSE: 4.034376 

r2: 0.821113 

RMSE measures the difference between the values predicted by the model and the actual values. However, RMSE alone is not meaningful until we compare it against the actual "Life_expectancy_" values, such as their mean, min, and max. After such a comparison, our RMSE looks reasonably good.

train_data.describe().show()

An R squared of 0.82 indicates that approximately 82% of the variability in "Life_expectancy_" can be explained by the model, which is in line with the training result and is not bad. However, performance on the training set may not be a good approximation of performance on the test data.
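As a quick sanity check (a sketch reusing the training RMSE of about 4.03 reported above), the RMSE can be put next to the spread of the label:

# Sketch: compare the training RMSE against the spread of the label
summary = train_data.select("Life_expectancy_").describe().toPandas().set_index("summary")
label_mean = float(summary.loc["mean", "Life_expectancy_"])
label_std = float(summary.loc["stddev", "Life_expectancy_"])
print("RMSE as a fraction of the mean  :", 4.034376 / label_mean)
print("RMSE as a fraction of the stddev:", 4.034376 / label_std)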

# Predictions: RMSE and R2 on the test data
with mlflow.start_run(run_name='Linear Reg'):
    lr_predictions = lr_model.transform(test_data)
    lr_predictions.select("prediction", "Life_expectancy_", "Independent Features").show(20)

    lr_evaluator = RegressionEvaluator(
        labelCol="Life_expectancy_", predictionCol="prediction", metricName="rmse")
    rmse_lr = lr_evaluator.evaluate(lr_predictions)
    print("Root Mean Squared Error (RMSE) on test data = %g" % rmse_lr)

    lr_evaluator_r2 = RegressionEvaluator(
        labelCol="Life_expectancy_", predictionCol="prediction", metricName="r2")
    r2_lr = lr_evaluator_r2.evaluate(lr_predictions)
    print("R Squared (R2) on test data = %g" % r2_lr)

    # Logging metrics using mlflow
    mlflow.log_metric("rmse", rmse_lr)
    mlflow.log_metric("r2", r2_lr)

RMSE and R2 on the test data are approximately the same as on the training data.

MLflow logging of RMSE and R2 for the Linear Regression model:
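The logged metrics can also be read back programmatically (a minimal sketch; the run id is hypothetical and would be copied from the MLflow UI or the run object):

from mlflow.tracking import MlflowClient

client = MlflowClient()
run = client.get_run("<run_id>")  # hypothetical run id of the 'Linear Reg' run
print(run.data.metrics)           # e.g. {'rmse': ..., 'r2': ...}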

Decision Tree 

from pyspark.ml.regression import DecisionTreeRegressor 
with mlflow.start_run(run_name=’Decision Tree’): 
    dt = DecisionTreeRegressor(featuresCol =’Independent Features’, labelCol = ‘Life_expectancy_’) 
    dt_model = dt.fit(train_data) 
    dt_predictions = dt_model.transform(test_data) 
    dt_evaluator = RegressionEvaluator( 
        labelCol=”Life_expectancy_”, predictionCol=”prediction”, metricName=”rmse”) 
    rmse_dt = dt_evaluator.evaluate(dt_predictions) 
    print(“Root Mean Squared Error (RMSE) on test data = %g” % rmse_dt) 

     
    dt_evaluator = RegressionEvaluator( 
        labelCol=”Life_expectancy_”, predictionCol=”prediction”, metricName=”r2″) 
    r2_dt = dt_evaluator.evaluate(dt_predictions) 
    print(“Root Mean Squared Error (R2) on test data = %g” % r2_dt) 
    mlflow.log_metric(“rmse”, rmse_dt) 
    mlflow.log_metric(“r2”, r2_dt) 

Output: Root Mean Squared Error (RMSE) on test data = 3.03737, R Squared (R2) on test data = 0.886899

Both RMSE and R2 are better for the Decision Tree than for Linear Regression.

Metrics logged using MLflow:
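To compare the two runs side by side, the experiment can be queried (a sketch; it assumes both runs were logged to this notebook's default experiment):

# Sketch: list all runs of the current experiment with their logged metrics
runs = mlflow.search_runs()  # returns a pandas DataFrame, one row per run
print(runs[["tags.mlflow.runName", "metrics.rmse", "metrics.r2"]])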

Conclusion 

In this post we have covered how to start a Spark session and build a model using a Spark dataframe.

In further posts, we will explore EDA, feature selection, and hyperparameter optimization for different models.

Please see Part 3: The End-To-End ML Pipeline using Pyspark and Databricks (Part III)

The post End-To-End ML Pipeline using Pyspark and Databricks (Part II) appeared first on Indium.

]]>