AI and Machine learning Archives - Indium

Real-time Insights-Driven Businesses and the Impact of Cloud on the Digital Native Ecosystem

Many digital-native businesses start as tech startups, which forces them to refine their core value propositions to attract and sustain venture capital investments. This demanding process has driven digital natives to meticulously articulate their unique value propositions to consumers, whether it’s the convenience of ultra-fast grocery delivery, the effortless access to rental cars or shared rides, or the immersive experience of a peer-to-peer content platform. IT teams within these digital-native companies strive to optimize their budgets and streamline time-to-market to deliver distinct functionalities that resonate with and benefit their user base.

The cloud has emerged as a pivotal factor in the growth of digital-native enterprises, furnishing them with the flexibility, scalability, and agility needed to fulfill their customer experience commitments and maintain a competitive edge. Presently, cloud services span a diverse array of offerings, including support for software development and testing, bolstered security measures, streamlined governance, automation of compliance processes, AI and ML platforms, as well as tools that enable value-adding capabilities like augmented reality/virtual reality (AR/VR) and robotics.

Key Trends for Digital Natives:

Digital natives, born in the cloud era and characterized as data-centric tech companies, heavily rely on SaaS (Software as a Service) solutions built upon cloud-native infrastructure. This robust foundation empowers them with agile, adaptable operations that can effortlessly scale to meet their evolving demands. Furthermore, they leverage AI (Artificial Intelligence) and Machine Learning to optimize their business processes, seamlessly integrating data across their backend systems.

In “The Data-Driven Enterprise in 2023,” McKinsey & Company outlines seven pivotal characteristics shaping the data-driven enterprise landscape:

1. Data Integration: Data seamlessly integrates into every facet of decision-making, interactions, and business processes, serving as the bedrock of operations.

2. Real-Time Processing: Swift, real-time data processing enables rapid decision-making and responsive actions.

3. Flexible Data Stores: Enterprises employ versatile data storage solutions to integrate easily accessible data for diverse purposes.

4. Data as a Product: A data-centric operating model recognizes data’s inherent value, emphasizing its potential to generate substantial value.

5. Chief Data Officer’s Role: The Chief Data Officer’s role expands to focus on extracting value from data, acknowledging its pivotal role in organizational success.

6. Data Ecosystems: Collaboration and data-sharing within industry-specific data ecosystems become standard practices as enterprises realize the advantages of collective participation.

7. Data Management: Prioritized and automated data management ensures privacy, security, and resilience in an increasingly data-driven landscape.

McKinsey & Company’s characterization underscores the importance of data streaming, which enables precise data usage in real-time contexts. Below, we showcase successful data-driven approaches.

In the digital landscape, essential components include real-time visibility, feature-rich mobile apps, and seamless integration with cutting-edge technologies like managed cloud services, 5G networks, and augmented reality. Data streaming enhances these capabilities by facilitating real-time data integration and correlation, with Striim as a crucial enabler.

Digital native enterprises, or Digital Native Businesses (DNBs), are defined by IDC as companies leveraging cloud-native tech, data, and AI across all operations. They rely on digital technology for core processes, fully utilizing data streaming for real-time messaging, storage, integration, and correlation.

 

Case in Point!

Etsy, much like many other digital-native startups, has relied heavily on data analytics since its inception in 2005. In its early days, the company struggled to truly understand its customers, which resulted in subpar digital experiences for sellers and a failure to accurately capture customer preferences. To address this, Etsy undertook a significant transformation, establishing a dedicated research department that merged quantitative and qualitative insights. These insights were integrated into every company department, resulting in elevated user satisfaction levels and more informed product decisions. Etsy has witnessed an astounding 400% growth since 2012, a testament to this shift.

What Etsy accomplished was a transition from being merely “data-aware” or data-driven to becoming an “insights-driven” business. While data-aware firms prioritize data collection and mining for insights, insights-driven businesses excel at data analytics, applying quantitative insights to address issues and embedding these insights into their business models, operations, and organizational culture.

Another notable example is Tesla, where vehicles are essentially insights-driven. Tesla continuously streams real-time performance data from each car to its data scientists, who develop models to diagnose driving-related issues and remotely provide software or firmware updates. The result is a seamless enhancement of the driving experience and an insightful system that enables testing, learning, and iterative improvement over time.

Exploring the Practical Applications of AI and Machine Learning Beyond the Buzz!

Indeed, Gartner’s perspective that “ChatGPT, while cool, is just the beginning; enterprise uses for generative AI are far more sophisticated” rings true. It’s essential to recognize that the potential of AI, particularly machine learning, goes beyond the buzz and is already being effectively applied in numerous enterprises.

Amidst the current hype around Generative AI (GenAI), it’s valuable to focus on tangible real-world success stories where analytic models have been utilized for many years. These models have been instrumental in tasks such as fraud detection, upselling to customers, and predicting machine failures. GenAI represents another advanced model that seamlessly integrates into an organization’s IT infrastructure and business processes.

In today’s fast-paced digital landscape, providing and correlating information correctly in the right context is crucial for enterprises seeking to stay competitive. Real-time data streaming, where information is processed in milliseconds, seconds, or minutes, is often superior to delayed data processing, ensuring that insights are harnessed swiftly and effectively.

 

Data streaming + AI/machine learning = Real-time intelligence

For example, Duolingo, an AI-powered language-learning platform, utilizes the PyTorch framework on AWS to deliver customized algorithms that offer tailored lessons in 32 languages. These algorithms rely on extensive data points, ranging from 100,000 to 30 million, to make 300 million daily predictions, such as the likelihood of a user recalling a word and answering a question correctly.

Duolingo’s system employs deep learning, a subset of AI and ML, to analyze user interactions with words, including correct responses, response modes, and practice intervals. Based on these predictions, the platform presents words in contexts that users need to master them, enhancing the learning experience.

While Duolingo initially used traditional cognitive science algorithms when it started in 2009, these algorithms couldn’t process real-time data to create personalized learning experiences. The adoption of deep learning tools improved prediction accuracy and increased user engagement, with a 12% increase in users returning to the service on the second day after implementing these tools. Duolingo’s success story, with 300 million users, underscores the pivotal role of the AWS cloud in enhancing platform speed, scalability, and predictive capabilities.

As demonstrated by Duolingo, the cloud now offers a wide range of capabilities, delivering three key advantages:

1. Operational Excellence: Empowering companies to prioritize differentiated work over maintenance or commodity tasks, resulting in cost reduction, heightened security, and increased reliability.

2. New Levers and Capabilities: Facilitating organizations in accelerating the development of new products, features, and market expansion.

3. Accelerated Innovation: Combining operational excellence and new capabilities to drive faster, more agile, maintainable, and scalable development processes.

Coinbase, a prominent digital currency wallet and platform provider with 30 million customers, has leveraged AWS Step Functions to automate and enhance the deployment of new software features and updates. This approach has not only resulted in successful deployments 97% of the time but has also significantly accelerated the process of adding new accounts, reducing it from days to mere seconds. Furthermore, Coinbase has significantly reduced the number of customer support tickets, thus enhancing user satisfaction and operational efficiency, while bolstering cybersecurity measures to protect users from cyberattacks.

Personalization driven by AI and ML can indeed yield powerful results. Notable examples include Intuit, a financial software company, which employed the Amazon Personalize service to rapidly create and deploy a recommendation engine for its Mint consumer budget tracking and planning app. Similarly, Keen, an outdoor footwear manufacturer, harnessed the same Amazon service to monitor customers’ browsing and purchase histories, enabling the provision of tailored shopping recommendations. Keen’s implementation of the recommendation feature via test emails resulted in a substantial revenue increase of nearly 13%.

Additionally, Ably, a South Korean startup in the apparel e-commerce sector, has successfully integrated AI to provide personalized recommendations on its app’s front page. Leveraging individual customer browsing and purchasing histories, Ably’s recommendation engine has empowered the company to develop sophisticated AI capabilities, even without prior experience in ML technology. These instances underscore how AI-driven personalization can significantly enhance user experiences and boost business outcomes across various industries.

Natural language processing (NLP) with data streaming for real-time Generative AI (GenAI)

Natural Language Processing (NLP) has proven to be a valuable tool in numerous real-world projects, enhancing service desk automation, enabling customer interactions with chatbots, moderating social network content, and serving many other use cases. Generative AI (GenAI) represents the latest evolution of these analytical models, adding even more capabilities to the mix. Many enterprises have successfully integrated NLP with data streaming for years to power real-time business processes.

Striim has emerged as a central orchestration layer within machine learning platforms, facilitating the integration of diverse data sources, scalable processing, and real-time model inference. Below is an architecture that illustrates how teams can seamlessly incorporate Generative AI and other machine learning models, such as large language models (LLMs), into their existing data streaming framework:

[Architecture diagram: Generative AI and LLM models integrated into the data streaming framework]

This architecture showcases the integration of Generative AI and LLM into the data streaming architecture, allowing organizations to harness the power of these advanced models to further enhance their real-time data-driven processes.
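
To make this pattern concrete, here is a minimal, generic sketch. It is not drawn from the architecture above (which is built around Striim); it uses the open-source kafka-python client to consume events from a stream, enrich each event with a placeholder model call, and publish the results to a downstream topic. The topic names and the enrich_with_llm() function are hypothetical stand-ins for a real model or LLM endpoint.

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders",                                   # hypothetical source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)

def enrich_with_llm(event: dict) -> dict:
    # Placeholder for a real-time model call (e.g., an LLM or fraud-scoring model).
    event["risk_summary"] = "stub: replace with a call to your model endpoint"
    return event

for message in consumer:
    enriched = enrich_with_llm(message.value)
    producer.send("orders-enriched", enriched)  # hypothetical sink topic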

Time to market is undeniably critical in today’s fast-paced business landscape. The beauty of incorporating AI is that it often doesn’t necessitate a complete overhaul of an enterprise’s architecture. A well-designed, truly decoupled system enables organizations to seamlessly introduce new applications and technologies and integrate them into existing business processes. This approach ensures agility and adaptability, allowing businesses to swiftly capitalize on emerging opportunities and stay competitive without undergoing extensive infrastructure changes.

A notable example is our project with an airline that used Striim to enhance operational efficiency by modernizing its legacy data store. (Read more)

Kubeflow Pipeline on Vertex AI for Custom ML Models

What is Kubeflow?

“Kubeflow is an open-source project created to simplify the deployment of ML pipelines. It uses components, written as Python functions, for each step of the pipeline. Each component runs in an isolated container with all the required libraries, and the components run in series, one after another.”

In this article, we are going to train a custom machine learning model on Vertex AI using a Kubeflow pipeline.

About Dataset

The Credit Card Customers dataset from Kaggle will be used. The 10,000 customer records in this dataset include columns for age, salary, marital status, credit card limit, credit card category, and other information. To predict the customers who are most likely to leave, we must analyze the data to determine the causes of customer churn.
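
Before building the pipeline, a quick look at the data helps. The sketch below is illustrative only: it assumes the two CSV files have been downloaded locally (the file names are placeholders) and merges them on the CLIENTNUM key that the pipeline also uses later.

import pandas as pd

# Placeholder file names; adjust to match the files downloaded from GitHub.
part1 = pd.read_csv("churner_p1.csv")
part2 = pd.read_csv("churner_p2.csv")

# The two files share the CLIENTNUM key, which the pipeline also merges on later.
data = pd.merge(part1, part2, on="CLIENTNUM", how="outer")

print(data.shape)                              # roughly 10,000 rows
print(data["Attrition_Flag"].value_counts())   # existing vs. churned customers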


Let’s Start

Custom Model Training

Step 1: Getting Data

We will download the dataset from GitHub. The download contains two CSV files, churner_p1 and churner_p2. I have created a BigQuery dataset named credit_card_churn with the tables churner_p1 and churner_p2 loaded from these CSV files. I have also created a Cloud Storage bucket named credit-card-churn, which will be used to store the artifacts of the pipeline.
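
If you prefer to script this setup rather than use the console, a minimal sketch could look like the following. It assumes the two CSVs are available locally and application-default credentials are configured; the project ID is a placeholder.

from google.cloud import bigquery, storage

PROJECT_ID = "credit-card-churn"  # placeholder: replace with your project ID

# Create the BigQuery dataset and load the two tables from the local CSV files.
bq = bigquery.Client(project=PROJECT_ID)
bq.create_dataset("credit_card_churn", exists_ok=True)

load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
for name in ("churner_p1", "churner_p2"):
    with open(f"{name}.csv", "rb") as source_file:
        bq.load_table_from_file(
            source_file,
            f"{PROJECT_ID}.credit_card_churn.{name}",
            job_config=load_config,
        ).result()

# Create the Cloud Storage bucket that will hold the pipeline artifacts.
storage.Client(project=PROJECT_ID).create_bucket("credit-card-churn", location="us-central1")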

Step 2: Employing Workbench

Enable the Notebooks API by going to Vertex AI and then to the Workbench section. Then create a notebook by clicking New Notebook and selecting Python 3. Make sure to choose the us-central1 region.

It will take a few minutes to create the notebook instance. Once the notebook is created, click Open JupyterLab to launch JupyterLab.

We will also have to enable the following APIs from the APIs & Services section of the Google Cloud console.

  1. Artifact Registry API
  2. Container Registry API
  3. AI Platform API
  4. ML API
  5. Cloud Functions API
  6. Cloud Build API

Now click Python 3 in the JupyterLab Launcher’s Notebook section to open a Jupyter notebook, and run the code cells below.

USER_FLAG = "--user"

!pip3 install {USER_FLAG} google-cloud-aiplatform==1.7.0
!pip3 install {USER_FLAG} kfp==1.8.9

This will install the Google Cloud AI Platform and Kubeflow Pipelines packages. Make sure to restart the kernel after the packages are installed.

import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

This creates the variable PROJECT_ID with the name of the project.

BUCKET_NAME = "gs://" + PROJECT_ID
BUCKET_NAME

Create the variable BUCKET_NAME; this returns the same bucket name we created earlier.

import matplotlib.pyplot as plt
import pandas as pd
from kfp.v2 import compiler, dsl
from kfp.v2.dsl import pipeline, component, Artifact, Dataset, Input, Metrics, Model, Output, InputPath, OutputPath
from google.cloud import aiplatform

# We'll use this namespace for metadata querying
from google.cloud import aiplatform_v1

PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

REGION = "us-central1"

PIPELINE_ROOT = f"{BUCKET_NAME}/pipeline_root/"
PIPELINE_ROOT

This will import the required packages and define the pipeline root path inside the credit-card-churn bucket.

# First component in the pipeline to fetch data from BigQuery.
# Table 1 data is fetched.
@component(
    packages_to_install=["google-cloud-bigquery==2.34.2", "pandas", "pyarrow"],
    base_image="python:3.9",
    output_component_file="dataset_creating_1.yaml"
)
def get_data_1(
    bq_table: str,
    output_data_path: OutputPath("Dataset")
):
    from google.cloud import bigquery
    import pandas as pd

    bqclient = bigquery.Client()
    table = bigquery.TableReference.from_string(bq_table)
    rows = bqclient.list_rows(table)
    dataframe = rows.to_dataframe(create_bqstorage_client=True)
    dataframe.to_csv(output_data_path)

The first component of the pipeline fetches the data from the churner_p1 table in BigQuery and passes the CSV file as output to the next component. The structure is the same for every component: we use the @component decorator to install the required packages and specify the base image and output component file, and then define the get_data_1 function to get the data from BigQuery.

# Second component in the pipeline to fetch data from BigQuery.
# Table 2 data is fetched.
# The first and second components do not need inputs from any other components.
@component(
    packages_to_install=["google-cloud-bigquery==2.34.2", "pandas", "pyarrow"],
    base_image="python:3.9",
    output_component_file="dataset_creating_2.yaml"
)
def get_data_2(
    bq_table: str,
    output_data_path: OutputPath("Dataset")
):
    from google.cloud import bigquery
    import pandas as pd

    bqclient = bigquery.Client()
    table = bigquery.TableReference.from_string(bq_table)
    rows = bqclient.list_rows(table)
    dataframe = rows.to_dataframe(create_bqstorage_client=True)
    dataframe.to_csv(output_data_path)

The second component of the pipeline fetches the data from the churner_p2 table in BigQuery and passes the CSV file as output to the next component. The first and second components do not need inputs from any other components.

# Third component in the pipeline to combine data from the two sources and apply some data transformations.
@component(
    packages_to_install=["sklearn", "pandas", "joblib"],
    base_image="python:3.9",
    output_component_file="data_transformation.yaml",
)
def data_transformation(
    dataset1: Input[Dataset],
    dataset2: Input[Dataset],
    output_data_path: OutputPath("Dataset"),
):
    from sklearn.metrics import roc_curve
    from sklearn.model_selection import train_test_split
    from joblib import dump
    from sklearn.metrics import confusion_matrix
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd

    data1 = pd.read_csv(dataset1.path)
    data2 = pd.read_csv(dataset2.path)
    data = pd.merge(data1, data2, on='CLIENTNUM', how='outer')
    data.drop(["CLIENTNUM"], axis=1, inplace=True)
    data = data.dropna()
    cols_categorical = ['Gender', 'Dependent_count', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']
    data['Attrition_Flag'] = [1 if cust == "Existing Customer" else 0 for cust in data['Attrition_Flag']]
    data_encoded = pd.get_dummies(data, columns=cols_categorical)
    data_encoded.to_csv(output_data_path)

The third component combines the data from the first and second components and performs the data transformation: dropping the “CLIENTNUM” column, dropping the null values, and converting the categorical columns into numerical ones. We pass this transformed data as a CSV to the next component.

# Fourth component in the pipeline to train the classification model using Decision Trees or Random Forest.
@component(
    packages_to_install=["sklearn", "pandas", "joblib"],
    base_image="python:3.9",
    output_component_file="model_training.yaml",
)
def training_classmod(
    data1: Input[Dataset],
    metrics: Output[Metrics],
    model: Output[Model]
):
    from sklearn.metrics import roc_curve
    from sklearn.model_selection import train_test_split
    from joblib import dump
    from sklearn.metrics import confusion_matrix
    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd

    data_encoded = pd.read_csv(data1.path)
    X = data_encoded.drop(columns=['Attrition_Flag'])
    y = data_encoded['Attrition_Flag']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100, stratify=y)
    model_classifier = RandomForestClassifier()
    model_classifier.fit(X_train, y_train)
    y_pred = model_classifier.predict(X_test)
    score = model_classifier.score(X_test, y_test)
    print('accuracy is:', score)
    metrics.log_metric("accuracy", (score * 100.0))
    metrics.log_metric("model", "RandomForest")
    dump(model_classifier, model.path + ".joblib")

In the fourth component, we train the model with a Random Forest classifier and use accuracy as the evaluation metric.

@component(
    packages_to_install=["google-cloud-aiplatform"],
    base_image="python:3.9",
    output_component_file="model_deployment.yaml",
)
def model_deployment(
    model: Input[Model],
    project: str,
    region: str,
    vertex_endpoint: Output[Artifact],
    vertex_model: Output[Model]
):
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=region)
    deployed_model = aiplatform.Model.upload(
        display_name="custom-model-pipeline",
        artifact_uri=model.uri.replace("model", ""),
        serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest"
    )
    endpoint = deployed_model.deploy(machine_type="n1-standard-4")

    # Save data to the output params
    vertex_endpoint.uri = endpoint.resource_name
    vertex_model.uri = deployed_model.resource_name

The fifth and final component creates the endpoint on Vertex AI and deploys the model. We have used a prebuilt scikit-learn serving container image and deployed the model on an n1-standard-4 machine.

@pipeline(
    # Default pipeline root. You can override it when submitting the pipeline.
    pipeline_root=PIPELINE_ROOT,
    # A name for the pipeline.
    name="custom-pipeline",
)
def pipeline(
    bq_table_1: str = "",
    bq_table_2: str = "",
    output_data_path: str = "data.csv",
    project: str = PROJECT_ID,
    region: str = REGION
):
    dataset_task_1 = get_data_1(bq_table_1)
    dataset_task_2 = get_data_2(bq_table_2)
    data_transform = data_transformation(dataset_task_1.output, dataset_task_2.output)
    model_task = training_classmod(data_transform.output)
    deploy_task = model_deployment(model=model_task.outputs["model"], project=project, region=region)

Finally, we have the pipeline function, which calls all the components sequentially: dataset_task_1 and dataset_task_2 fetch the data from BigQuery, data_transform transforms the data, model_task trains the Random Forest model, and deploy_task deploys the model on Vertex AI.

compiler.Compiler().compile(pipeline_func=pipeline, package_path="custom-pipeline-classifier.json")

Compiling the pipeline.

run1 = aiplatform.PipelineJob(
    display_name="custom-training-vertex-ai-pipeline",
    template_path="custom-pipeline-classifier.json",
    job_id="custom-pipeline-rf8",
    parameter_values={"bq_table_1": "credit-card-churn.credit_card_churn.churner_p1", "bq_table_2": "credit-card-churn.credit_card_churn.churner_p2"},
    enable_caching=False,
)

Creating the pipeline job.

run1.submit()

Running the pipeline job.

With this, we have finished creating the Kubeflow pipeline, and we can see it in the Pipelines section of Vertex AI.

[Screenshot: the completed pipeline run in the Vertex AI Pipelines console]

Our pipeline has run successfully, and we have managed to get 100% accuracy for the classification.

We can use this model to get online predictions using the REST API or Python. We can also create different pipelines and compare their metrics on Vertex AI.
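
For example, a minimal sketch of an online prediction with the Vertex AI Python SDK might look like the following. The endpoint ID is a placeholder taken from the Vertex AI console, and the instance is assumed to be one already-encoded feature row (e.g., from data_encoded in the transformation step) in the same column order used during training.

from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

# Placeholder: use the numeric endpoint ID shown in the Vertex AI console.
endpoint = aiplatform.Endpoint(endpoint_name="ENDPOINT_ID")

# Assumes a locally available, already-encoded dataframe with the label column removed.
instance = data_encoded.drop(columns=["Attrition_Flag"]).iloc[0].tolist()

response = endpoint.predict(instances=[instance])
print(response.predictions)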

With this, we have completed the project and learned how to create a pipeline on Vertex AI for custom-trained models.

I hope you will find it useful.

To learn more about our AI & ML Solutions and Capabilities

Contact Us

See you again.
