Overview of Big Query’s Unique feature, BQML with a regression model example
https://www.indiumsoftware.com/blog/overview-of-big-querys-unique-feature-bqml-with-a-regression-model-example/
Thu, 02 Feb 2023

In this blog you will see what Big Query is, its standout feature Big Query ML (BQML), the areas where BQML is used, and a clear example that shows how easy it is to build a machine learning model with simple SQL code.

The blog will go through the following topics:

  • What is Big Query?
  • Best features of Big Query
  • Why BQML, and where can it be used?
  • A regression model to show the efficiency of BQML

Let’s dive into the article.

What is Big Query?

With built-in capabilities such as machine learning, business intelligence and geospatial analysis, Big Query is a fully managed data warehouse that enables you to manage and analyse your data. Its serverless architecture lets you tackle the most critical issues facing your company with SQL queries, with no infrastructure to administer. Thanks to Big Query’s robust, distributed analytical engine, you can query terabytes of data in seconds and petabytes in minutes.

Best features of Big Query

Big Query’s standout features include built-in ML integration (BQ ML), multi-cloud functionality (BQ Omni), geospatial analysis (BQ GIS), a foundation for BI (BQ BI Engine), free access (BQ Sandbox), and automated data transfer (BQ Data Transfer Service). In this blog we will discuss the most amazing of these features: Big Query ML.

Big Query ML lets you develop and run machine learning models in Big Query using standard SQL queries. Machine learning on huge datasets normally requires extensive programming and ML framework skills, which restricts solution development to a small group of people within each organization and excludes data analysts who understand the data but lack machine learning and programming skills. This is where Big Query ML comes in handy: it allows data analysts to employ machine learning with their existing SQL tools and skills, and to create and evaluate machine learning models in Big Query over large volumes of data.

For more information on Big Query Machine Learning services and solutions

Contact us today

Why BQML?

The major advantages I’ve identified while using BQML:

  • There is no need to export your data or read it into local memory: like other ML frameworks, BQML can subsample your dataset, but it can also train your model directly inside your database.
  • Working in SQL makes collaboration easier if you’re working in a team and most of your teammates don’t know Python, R, or your favourite modelling language.
  • Because the model lives in the same place as your data, you can serve it immediately after it has been trained and make predictions directly from it.

Areas we can use BQML

  • Retail Industry (Demand forecasting, Customer segmentation, Propensity to purchase or propensity to click on item, Product recommendations by emails and ads).
  • Logistics Industry (Time estimation of package delivery, Predictive maintenance).
  • Finance Industry (Product recommendations by emails and ads).
  • Gaming Industry (Content recommendation, Predicting churn customers).

 Another blog worth reading: Databricks Overview, Why Databricks, and More

Regression model to show efficiency of BQML

  • We will build a linear regression model to predict house prices in the USA, since linear regression is a good fit for predicting the value of one variable from others. I am also using a regression model in this article because it is simpler to communicate how the model itself works and to interpret its results.
  • With the USA housing dataset, we will see how efficiently and easily Big Query ML builds a machine learning linear regression model with SQL code.
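Before creating the model, it helps to eyeball the training data. A quick sanity check, assuming the `regression.usa_housing_train` table used in this post:

```sql
-- Preview a few rows and confirm the columns the model will use
SELECT avg_house_age, avg_rooms, avg_bedrooms, avg_income, population, price
FROM `regression.usa_housing_train`
LIMIT 5;
```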

Step 1: Creating the Model

CREATE OR REPLACE MODEL `testproject-351804.regression.house_prices2`
OPTIONS(
  model_type = 'linear_reg',
  input_label_cols = ['price'],
  l2_reg = 1,
  early_stop = false,
  max_iterations = 12,
  optimize_strategy = 'batch_gradient_descent'
) AS
SELECT avg_house_age, avg_rooms, avg_bedrooms, avg_income, population, price/100000 AS price
FROM `regression.usa_housing_train`

Model creation

  • The above code will create and train the model.
  • The simple CREATE MODEL statement is all it takes to create the ML model. In OPTIONS, only model_type and input_label_cols (the variable to predict) are strictly required; the reason I used the other OPTIONS will become clear in the evaluation section.
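After the CREATE MODEL statement finishes, you can also inspect the training run itself. BigQuery ML’s ML.TRAINING_INFO function returns one row per iteration, including the training and evaluation loss, which is how you can spot the gap discussed in the evaluation section:

```sql
-- One row per training iteration, with training loss and evaluation loss
SELECT iteration, loss, eval_loss, learning_rate, duration_ms
FROM ML.TRAINING_INFO(MODEL `testproject-351804.regression.house_prices2`)
ORDER BY iteration;
```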

Step 2: Evaluating the Model

SELECT * FROM ML.EVALUATE(MODEL `regression.house_prices2`,
  TABLE `testproject-351804._8b41b9f5a2e85d72c62e834e3e9dd60a58ba542d.anoncb5de70d_1e3d_4213_8c5d_bb10d6b9385b_imported_data_split_eval_data`)

Model Evaluation

  • We check how well the model is performing with the ML.EVALUATE function. This is also where the extra OPTIONS used when creating the model come in.
  • I first created the model with only model_type = 'linear_reg' and input_label_cols = ['price'], but on evaluation the R-squared was only 0.3, which I felt was too low. The large gap between the training loss and the evaluation loss showed that the model was overfitting.
  • As a solution, I added options to the CREATE MODEL statement: L2 regularization to counter the overfitting and help the model generalize to the data points. After adjusting the values three times, the R-squared rose to 0.92, i.e. above 90% accuracy.

Note: the metric to watch here is R-squared, the coefficient of determination. Higher is better.
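For a linear regression model, ML.EVALUATE returns its metrics in a single row; the r2_score column is the R-squared value discussed above. A sketch that pulls just the headline metrics (model name as used in this post); when the TABLE argument is omitted, BigQuery ML evaluates against the data split held out during training:

```sql
-- r2_score is the coefficient of determination; closer to 1 is better
SELECT r2_score, mean_absolute_error, mean_squared_error
FROM ML.EVALUATE(MODEL `regression.house_prices2`);
```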

Step 3: Predicting with the Model


Making predictions with the model is as simple as calling ML.PREDICT:

SELECT * FROM ML.PREDICT(MODEL `regression.house_prices2`,
  TABLE `regression.usa_housing_predict`)

Model Prediction

See how efficient Big Query ML is: it predicted the house prices based on the trained features avg_house_age, avg_rooms, avg_bedrooms, avg_income, and population.
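One detail worth remembering: because the label was scaled to price/100000 during training, ML.PREDICT returns predictions on that same scale, in a column named predicted_price (ML.PREDICT adds a predicted_<label> column). A sketch that converts the output back to dollars, using the tables from this post:

```sql
-- Undo the /100000 scaling applied to the label at training time
SELECT
  predicted_price * 100000 AS predicted_price_usd,
  avg_house_age, avg_rooms, avg_bedrooms, avg_income, population
FROM ML.PREDICT(MODEL `regression.house_prices2`,
                TABLE `regression.usa_housing_predict`);
```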

Summary

Now you know how to create linear regression models in BigQuery ML. We have discussed how to build a model, assess it, apply it to make predictions, and analyse model coefficients.

In upcoming blogs you will see other unique features of Big Query, such as Geospatial Analytics and Arrays/Structs.

Happy Reading

Hope you find this useful.

The post Overview of Big Query’s Unique feature, BQML with a regression model example appeared first on Indium.

Data Modernization with Google Cloud
https://www.indiumsoftware.com/blog/data-modernization-with-google-cloud/
Thu, 12 Jan 2023

L.L. Bean was established in 1912. It is a Freeport, Maine-based retailer known for its mail-order catalog of boots. The retailer runs 51 stores, kiosks, and outlets in the United States. It generates US $1.6 billion in annual revenues, of which US $1 billion comes from its e-commerce engine. This means delivery of a great omnichannel customer experience is a must and an essential part of its business strategy. But the retailer faced a significant challenge in sustaining its seamless omnichannel experience. It was relying on on-premises mainframes and distributed servers, which made upgrading clusters and nodes very cumbersome. It wanted to modernize its capabilities by migrating to the cloud. Through cloud adoption, it wanted to improve its online performance, accelerate time to market, upgrade effortlessly, and enhance customer experience.

L.L. Bean turned to Google Cloud to fulfill its cloud requirements. By modernizing its data on Google Cloud, it experienced faster page loads and was able to access transaction histories more easily. It also focused on value addition instead of infrastructure management, reduced release cycles, and rapidly delivered cross-channel services. These collectively improved its overall delivery of an agile, cutting-edge customer experience.

Data Modernization with Google Cloud for Success

Many businesses that rely on siloed data find it challenging to make fully informed business decisions and, in turn, to accelerate growth. They need a unified view of data to be able to draw actionable, meaningful insights that can help them make fact-based decisions that improve operational efficiency, deliver improved services, and identify growth opportunities. In fact, businesses don’t just need unified data. They need quality data that can be stored, managed, scaled and accessed easily.

Google Cloud Platform empowers businesses with flexible and scalable data storage solutions. Some of its tools and features that enable this include:

BigQuery

This is a cost-effective, serverless, and highly scalable multi-cloud data warehouse that provides businesses with agility.

Vertex AI

This enables businesses to build, deploy, and scale ML models on a unified AI platform using pre-trained and custom tooling.

Why should businesses modernize with Google Cloud?

It provides faster time to value with serverless analytics, it lowers TCO (Total Cost of Ownership) by up to 52%, and it ensures data is secure and compliant.

Read this informative post on Cloud Cost Optimization for Better ROI.

Google Cloud Features

Improved Data Management

BigQuery, the serverless data warehouse from Google Cloud Platform (GCP), makes managing, provisioning, and dimensioning infrastructure easier. This frees up resources to focus on the quality of decision-making, operations, products, and services.

Improved Scalability

Storage and computing are decoupled in BigQuery, which improves availability and scalability, and makes it cost-efficient.

Analytics and BI

GCP also improves website analytics by integrating with other GCP and Google products. This helps businesses get a better understanding of the customer’s behavior and journey. The BI Engine packaged with BigQuery provides users with several data visualization tools, speeds up responses to queries, simplifies architecture, and enables smart tuning.

Data Lakes and Data Marts

GCP enables ingestion of batch and stream/real-time data, change data capture, and landing zones for raw data, meeting businesses’ other data needs.

Data Pipelines

GCP tools such as Dataflow, Dataform, BigQuery Engine, Dataproc, DataFusion, and Dataprep help create and manage even complex data pipelines.

Discover how Indium assisted a manufacturing company with data migration and ERP data pipeline automation using Pyspark.

Data Orchestration

For data orchestration too, GCP’s managed or serverless tools minimize infrastructure, configuration, and operational overheads. Workflows is a popular tool for simple workloads while Cloud Composer can be used for more complex workloads.

Data Governance

Google enables data governance, security, and compliance with tools such as Data Catalog, which facilitates data discoverability, metadata management, and data class-level controls. This helps separate sensitive and other data within containers. Data Loss Prevention and Identity Access Management are some of the other trusted tools.

Data Visualization

Google Cloud Platform provides two fully managed tools for data visualization, Data Studio and Looker. Data Studio is free and transforms data into easy-to-read and share, informative, and customizable dashboards and reports. Looker is flexible and scalable and can handle large data and query volumes.

ML/AI

Google Cloud Platform leverages Google’s expertise in ML/AI and provides Managed APIs, BigQuery ML, and Vertex AI. Managed APIs enable solving common ML problems without having to train a new model or even having technical skills. Using BigQuery, models can be built and deployed based on SQL language. Vertex AI, as already seen, enables the management of the ML product lifecycle.
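As noted above, BigQuery ML models are built and deployed with plain SQL. A minimal sketch of what that looks like (dataset, table, and column names here are hypothetical, invented for illustration):

```sql
-- Train a churn classifier entirely in SQL, next to the data
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS(model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `mydataset.customer_history`;
```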

Indium to Modernize Your Data Platform With GCP

Indium Software is a recognized data and cloud solution provider with cross domain expertise and experience. Our range of services includes data and app modernization, data analytics, and digital transformation across the various cloud platforms such as Amazon Web Server, Azure, Google Cloud. We work closely with our customers to understand their modernization needs and align them with business goals to improve the outcomes for faster growth, better insights, and enhanced operational efficiency.

To learn more about Indium’s data modernization and Google Cloud capabilities, visit our website.

FAQs

What Cloud storage tools and libraries are available in Google Cloud?

Along with the JSON API and the XML API, Google also enables operations on buckets and objects through the Google Cloud CLI, whose storage commands provide a command-line interface to Cloud Storage. Programmatic support is also provided through client libraries for languages such as Java, Python, and Ruby.

The post Data Modernization with Google Cloud appeared first on Indium.

Serverless Data Warehouse: For Better Data Management at Lower Cost of Ownership
https://www.indiumsoftware.com/blog/serverless-data-warehouse-migration/
Thu, 03 Sep 2020

A leading global manufacturer of pumps and other fluid management tools was expanding its business across the globe. The manufacturer needed to modernize its data management system and leverage the data collected over the years with a sophisticated data storage system that could support advanced analytics on non-traditional data and enable acquiring 360-degree business insights.

Indium Software, a cutting-edge solution provider with cross-domain expertise, proposed transforming the manufacturer into a data-driven organization by migrating the data from on-prem databases to a cloud-based, serverless data warehouse.

The cloud-based data warehouse has become the need of the hour to keep the total cost of ownership (TCO) low while leveraging the services provided by the public cloud providers such as Google BigQuery, Amazon Redshift or Azure Synapse Analytics (Formerly SQL DW). In the case of the pump manufacturer, Indium migrated the client’s data to Microsoft Azure and reduced the TCO by over 50 percent.

learn more about our data visualization services

Learn More

This is the direction in which the world is moving today. According to a MarketsandMarkets report, the global serverless architecture market size will touch USD 21.1 billion by 2025 from USD 7.6 billion in 2020, growing at a Compound Annual Growth Rate (CAGR) of 22.7 per cent. The three key factors spurring this growth are:

  • The need to shift from CAPEX to OPEX
  • Remove the need to manage servers
  • Reduce the infrastructure cost

Easy Data Access and Management

Today, data generated from multiple sources can be made available to businesses to improve their decision-making and to devise business strategies across functions. However, traditional systems cannot handle the multiple formats the data arrives in, and manual intervention is required to reconcile it all into one format. This can be time-consuming and prone to errors.

A cloud-based serverless data warehouse can automate the process of data management, making data easily available and accessible for advanced analytics and to gain meaningful insights for improving business processes and efficiencies. Some of the key benefits of opting for a serverless data warehouse would be:

  • Being cloud-based, it can be accessed from anywhere, thereby allowing even the executives on the move to access data and reports that can speed up their decision-making process.
  • Being fully managed by the providers, it reduces the burden on the internal IT team and lets them focus on innovation and improving their core business
  • A solution like Azure also enables easily scalable computational storage at lower costs. Databases can be paused and resumed quickly which can save costs. Cloud providers have a cost management feature to keep a check.
  • The level of optimization it offers cannot be matched by the traditional on-premise setup
  • It provides columnar storage and parallel processing facilitating faster aggregate queries
  • High availability and scalability ensure automatic data distribution and replication across data regions (zones) on the cloud infrastructure
  • Data latency is in milliseconds despite a highly distributed data setup
  • Data security is assured through authentication and authorization managed within the cloud setup and data encrypted to comply with privacy regulations
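The columnar-storage advantage in the list above shows up most clearly in aggregate queries that touch only a few columns of a wide table: the engine reads just those columns and parallelizes the aggregation. A hypothetical example (table and column names invented for illustration):

```sql
-- Only the two referenced columns are read from columnar storage;
-- the GROUP BY aggregation is distributed across compute nodes
SELECT region, SUM(order_total) AS revenue
FROM `sales.orders`
GROUP BY region;
```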

Challenges in Migration

Yes, opting for a serverless data warehouse is not a walk in the park. Some of the factors you must keep in mind include:

  • Selecting the right building blocks is important as not all of them are fully managed. For instance, Amazon Redshift requires you to choose the node type that is compute-optimized or storage-optimized. You will need to choose the number of compute nodes for the cluster and also manually size them.
  • In some instances, you might need to integrate different serverless building blocks and also connect the entire solution using non-serverless blocks.
  • You may opt for integrating individual building blocks instead of having one single solution. While improving configurability it will make the solution complex.
  • Depending on the data model you opt for, costs can be a combination of upfront and variable.

Partnering with the Right Data Experts

Navigating these hidden complexities requires a deep understanding of data, data warehouses as well as the service providers. An experienced solution provider such as Indium can work closely with you to understand your needs and tailor the approach to suit your requirements.

We provide a simple, secure, cost-effective and scalable solution. We have expertise in Data Modelling, the most crucial stage in architecting the data warehouse. We derive the Technology Architecture by analyzing the process architecture, business rules, metadata management, tools, specific needs and security considerations. At this stage, the data integration tools, data processing tools, network protocols, middleware, database management and related technologies are also factored in.

Leverage Your Biggest Asset: Data

Inquire Now

For the serverless data warehouse architecture, the data pipeline and its transformations from one form to another are laid out step by step. As a result, the entire cycle of storing, retrieving and processing data within the data warehouse is mapped. The architecture is designed to ensure that the workload is processed on time, performance is optimized and running costs are kept low.

Indium has a team with more than two decades of experience in the latest cutting-edge technologies, as well as domain expertise across industries such as retail, e-commerce, manufacturing, banking, services and finance, among others. If you would like to leverage a serverless data warehouse for improved analytics and a lower cost of ownership, do reach out to us.

The post Serverless Data Warehouse: For Better Data Management at Lower Cost of Ownership appeared first on Indium.
