Data Wrangling 101 – A Practical Guide to Data Wrangling

Data wrangling plays a critical role in machine learning. It refers to the process of cleaning, transforming, and preparing raw data for analysis, with the goal of ensuring that the data used in a machine learning model is accurate, consistent, and error-free.

Data wrangling can be a time-consuming and labour-intensive process, but it is necessary for achieving reliable and accurate results. In this blog post, we’ll explore various techniques and tools that are commonly used in data wrangling to prepare data for machine learning models.

  1. Data integration: Data integration involves combining data from multiple sources to create a unified dataset. This may involve merging data from different databases, cleaning and transforming data from different sources, and removing irrelevant data. The goal of data integration is to create a comprehensive dataset that can be used to train machine learning models.
  2. Data visualization: Data visualization is the process of creating visual representations of the data, such as scatter plots, histograms, and heat maps. The goal of data visualization is to provide insights into the data and identify patterns that can be used to improve machine learning models.
  3. Data cleaning: Data cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies in the data. This step includes removing duplicate rows, filling in missing values, and correcting spelling errors. The objective of data cleaning is to ensure that the data is accurate, complete, and consistent.
  4. Data reduction: Data reduction is the process of reducing the amount of data used in a machine learning model. This may involve removing redundant or irrelevant data and sampling the data. The goal of data reduction is to reduce the computational requirements of the model and improve its accuracy.
  5. Data transformation: Data transformation involves converting the data into a format that is more suitable for analysis. This may include converting categorical data into numerical data, normalizing the data, and scaling the data. The goal of data transformation is to make the data more accessible to machine learning algorithms and to improve the accuracy of the models.

Also check out this blog on Explainable Artificial Intelligence for a more ethical AI process.

Let’s look into some code:

Here we are taking a student performance dataset with the following features:

  1. gender
  2. parental level of education
  3. math score
  4. reading score
  5. writing score

For data visualization, you can use various tools such as Seaborn, Matplotlib, Grafana, and Google Charts.

Let us demonstrate a simple histogram for a series of data using the NumPy library.
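The original snippet is shown only as an image in the post; the following is a minimal sketch of what it could look like, with the CSV file name being an assumption.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the student performance data (file name assumed for illustration)
df = pd.read_csv("StudentsPerformance.csv")

# Compute bin counts and bin edges for the math score with NumPy
counts, bin_edges = np.histogram(df["math score"], bins=10)
print(counts)
print(bin_edges)

# Render the same histogram with Matplotlib
plt.hist(df["math score"], bins=10, edgecolor="black")
plt.xlabel("math score")
plt.ylabel("frequency")
plt.show()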

Pandas is a widely-used library for data analysis in Python, and it provides several built-in methods to perform exploratory data analysis on data frames. These methods can be used to gain insights about the data in the data frame. Some of the commonly used methods are:

df.describe(), df.info(), df.mean(), df.quantile(), df.count()

(where df is a pandas DataFrame)

Let's see df.describe(). This method generates a statistical summary of the numerical columns in the data frame. It provides information such as count, mean, standard deviation, minimum, maximum, and percentile values.
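The summary output itself appears as an image in the original post; a sketch of the call on the same df loaded above:

# Statistical summary of the numerical score columns:
# count, mean, std, min, 25%, 50%, 75%, max
print(df.describe())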

 

For data cleaning, we can use the fillna() method from Pandas to fill in missing values in a data frame. This method replaces all NaN (Not a Number) values in the data frame with a specified value. We can choose the value to replace the NaN values with, either a single value or a value computed based on the data. 
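As a sketch on the same df, every missing value can be replaced with a single fixed value:

# Replace every NaN in the data frame with 0
df_filled = df.fillna(0)
print(df_filled.isna().sum())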

For data reduction, we can use sampling, filtering, aggregation, and data compression.

In the example below, we remove duplicate rows using the pandas drop_duplicates() method.
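The original example appears as an image; a minimal sketch of the call:

# Drop exact duplicate rows, keeping the first occurrence of each
df_deduped = df.drop_duplicates()
print(len(df), len(df_deduped))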

For data transformation, we will look at data normalisation and aggregation. Here we scale the data so that it has a consistent scale across all variables; typical normalisation methods include z-score scaling and min-max scaling.

Here, we're using a StandardScaler to scale the data.
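A sketch of z-score scaling with scikit-learn's StandardScaler, applied to the score columns listed earlier:

from sklearn.preprocessing import StandardScaler

score_cols = ["math score", "reading score", "writing score"]
scaler = StandardScaler()

# fit_transform centres each column on mean 0 and scales it to unit variance
df[score_cols] = scaler.fit_transform(df[score_cols])
print(df[score_cols].describe())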

Use the fillna() method in the pandas library to fill missing or NaN (Not a Number) values in a DataFrame or a Series with the mean value of the column.
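For example (a sketch on the same dataset):

# Fill missing math scores with the mean of the column
df["math score"] = df["math score"].fillna(df["math score"].mean())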

Transform the categorical data in the 'gender' column into numerical data using one-hot encoding. We will use get_dummies(), a method in the pandas library used to convert categorical variables into dummy or indicator variables.
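A sketch of the call:

# One-hot encode the 'gender' column into indicator columns
df_encoded = pd.get_dummies(df, columns=["gender"])
print(df_encoded.head())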


 

In conclusion, data wrangling is an essential step in the machine learning process. It involves cleaning, transforming, and preparing raw data for analysis to ensure that the data used in a machine learning model is accurate, consistent, and error-free. By utilising the techniques and tools discussed in this blog post, data scientists can prepare high-quality data sets that can be used to train accurate and reliable machine learning models.

 

Training Custom Machine Learning Model on Vertex AI with TensorFlow

"Vertex AI is Google's platform which provides many machine learning services, such as training models using AutoML or custom training."

AutoML vs Custom Training

To quickly compare AutoML and custom training functionality, and expertise required, check out the following table given by Google.

Choose a training method | Vertex AI | Google Cloud

In this article, we are going to train a custom machine learning model on Vertex AI with TensorFlow.

To know about Vertex AI’s AutoML feature read my previous blog : Machine Learning using Google’s Vertex AI.

About Dataset

We will be using the Crab Age Prediction dataset from Kaggle. The dataset is used to estimate the age of a crab based on its physical attributes.


There are 9 columns in the Dataset as follows.

  1. Sex: Crab gender (Male, Female and Indeterminate)
  2. Length: Crab length (in Feet; 1 foot = 30.48 cms)
  3. Diameter: Crab Diameter (in Feet; 1 foot = 30.48 cms)
  4. Height: Crab Height (in Feet; 1 foot = 30.48 cms)
  5. Weight: Crab Weight (in ounces; 1 Pound = 16 ounces)
  6. Shucked Weight: Without Shell Weight (in ounces; 1 Pound = 16 ounces)
  7. Viscera Weight: Viscera Weight
  8. Shell Weight: Shell Weight (in ounces; 1 Pound = 16 ounces)
  9. Age: Crab Age (in months)

We must predict the Age column with the help of the rest of the columns.

Let’s Start

Custom Model Training

Step 1: Getting Data

We will download the dataset from Kaggle. There is only one CSV file in the downloaded dataset, called CrabAgePrediction.csv; I have uploaded this CSV to a bucket called vertex-ai-custom-ml on Google Cloud Storage.

Step 2: Working on Workbench

Go to Vertex AI, then to the Workbench section, and enable the Notebook API. Then click on New Notebook and select TensorFlow Enterprise; we are using TensorFlow Enterprise 2.6 without GPU for this project. Make sure to select the us-central1 (Iowa) region.

It will take a few minutes to create the Notebook instance. Once the notebook is created click on the Open JupyterLab to launch the JupyterLab.

In JupyterLab, open the Terminal and run the following commands one by one.

mkdir crab_folder     # This will create crab_folder                       

cd crab_folder        # To enter the folder

mkdir trainer         # This will create trainer folder

touch Dockerfile      # This will create a Dockerfile

We can see all the files and folders on the left side of JupyterLab. From there, open the Dockerfile and start editing it with the following lines of code.

FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-6
WORKDIR /
COPY trainer /trainer
ENTRYPOINT ["python", "-m", "trainer.train"]

Now save the Dockerfile; with this, we have defined the entrypoint for the Docker image.

To save the model’s output, we’ll make a bucket called crab-age-pred-bucket.
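The post creates this bucket through the console; as an alternative sketch, the same bucket could also be created with the Cloud Storage Python client (the project ID is taken from the later build step and may differ in your setup):

from google.cloud import storage

# Create the output bucket in the same region used for training
client = storage.Client(project="crab-age-pred")
client.create_bucket("crab-age-pred-bucket", location="us-central1")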

For the model training file, I have already uploaded the python file into the GitHub Repository. To clone this Repository, click on the Git from the top of JupyterLab and select Clone a Repository and paste the repository link and hit clone.

In the Lab, we can see the crab-age-pred folder; copy the train.py file from this folder to crab_folder/trainer/.

Let’s look at the train.py file before we create the Docker IMAGE.

#Importing the required packages..
import numpy as np
import pandas as pd
import pathlib
import tensorflow as tf

#Importing tensorflow 2.6
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

#Reading data from the GCS bucket
dataset = pd.read_csv(r"gs://vertex-ai-custom-ml/CrabAgePrediction.csv")
dataset.tail()

#Bucket where the trained model will be stored
BUCKET = 'gs://crab-age-pred-bucket'

dataset.isna().sum()
dataset = dataset.dropna()

#Data transformation..
dataset = pd.get_dummies(dataset, prefix='', prefix_sep='')
dataset.tail()

#Dataset splitting..
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

train_stats = train_dataset.describe()
#Removing age column, since it is the target column
train_stats.pop("Age")
train_stats = train_stats.transpose()
train_stats

#Removing age column from train and test data
train_labels = train_dataset.pop('Age')
test_labels = test_dataset.pop('Age')

def norma_data(x):
    #To normalise the numerical values
    return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norma_data(train_dataset)
normed_test_data = norma_data(test_dataset)

def build_model():
    #Model building function
    model = keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=[len(train_dataset.keys())]),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae', 'mse'])
    return model

model = build_model()

EPOCHS = 10
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
early_history = model.fit(normed_train_data, train_labels,
                          epochs=EPOCHS, validation_split=0.2,
                          callbacks=[early_stop])

model.save(BUCKET + '/model')

Summary of train.py

Once all of the necessary packages are imported, TensorFlow 2.6 is used for modelling. pandas is used to read the CSV file stored in the vertex-ai-custom-ml bucket, and the BUCKET variable specifies the bucket where we will store the trained model.

We perform some transformations, such as creating dummy variables for the categorical column. Next, we split the data into training and testing sets and normalize the data.

We wrote a function called build_model that defines a simple two-layer TensorFlow model. The model is trained for ten epochs, and the trained model is saved to the crab-age-pred-bucket/model path on Cloud Storage.

Now, in the JupyterLab Terminal, execute the following cmd one by one to create a Docker IMAGE.

PROJECT_ID=crab-age-pred

IMAGE_URI="gcr.io/$PROJECT_ID/crab:v1"

docker build ./ -t $IMAGE_URI

Before running the build command make sure to enable the Artifact Registry API and Google Container Registry API by going to the APIs and services in Vertex AI.

After running these commands, our Docker image is built successfully. Now we will push the Docker image with the following command.

docker push $IMAGE_URI

Once pushed we can see our Docker IMAGE in the Container registry. To find the Container registry you can search it on Vertex AI.

Best Read: Our success story about how we assisted an oil and gas company, as well as Nested Tables and Machine Drawing Text Extraction

Step 3: Model Training

Go to Vertex AI, then to Training section and click Create. Make sure the region is us-central1.

In Datasets select no managed dataset and click continue.

In Model details, I have given the model's name as "pred-age-crab"; under the advanced options, select the available service account and keep the defaults for the rest. Make sure that the service account has Cloud Storage permissions; if not, grant the permissions from the IAM and Admin section.

Select the custom container for the Container image in the Training container. Navigate to and select the newly created Docker image. Next, navigate to and select the crab-age-pred-bucket in the Model output directory. Now press the continue button.

Ignore any selections for Hyperparameters and click Continue.

In Compute and pricing, Select the machine type n1-standard-32, 32 vCPUs, 120 GiB memory and hit continue.

For Prediction Container select Pre-Built container with TensorFlow Framework 2.6 and start the model training.

You can see the model in training in the Training section.

In about 8 minutes, our custom model training is finished.

Step 4: Model Deployment

Go to Vertex AI, then to the Endpoints section and click Create Endpoint. The region should be us-central1.

Give crab_age_pred as the name of Endpoint and click Continue.

In the Model Settings, select pred_age_crab as the Model Name, Version 1 as the Version, and 2 as the number of compute nodes; choose n1-standard-8 (8 vCPUs, 30 GiB memory) as the Machine Type and select the service account. Click Done and Create.

In Model monitoring ignore this selection and click create to implement the version.

It may take 11 minutes to deploy the model.

With the above step our model is deployed.

Step 5: Testing Model

Once the model is deployed, we can make predictions. For this project we are going to use Python to make predictions. We will need to give the Vertex AI Admin and Cloud Storage Admin permissions to the service account; we can do that in the IAM and administration section of Google Cloud. Once the permissions are given, we download the key of the service account in JSON format, which will be used for authentication from the code.

Following is the code used for the prediction.

pip install google-cloud-aiplatform

from typing import Dict
from google.cloud import aiplatform
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value
import os

def predict_tabular_sample(
    project: str,
    endpoint_id: str,
    instance_dict: Dict,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com"):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
    # For more info on the instance schema, please use get_model_sample.py
    # and look at the yaml found in instance_schema_uri
    instance = json_format.ParseDict(instance_dict, Value())
    instances = [instance]
    parameters_dict = {}
    parameters = json_format.ParseDict(parameters_dict, Value())
    endpoint = client.endpoint_path(
        project=project, location=location, endpoint=endpoint_id
    )
    response = client.predict(
        endpoint=endpoint, instances=instances, parameters=parameters
    )
    predictions = response.predictions
    print(predictions)

#Authentication using the service account.
#We are giving the path to the JSON key
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/content/crab-age-pred-7c1b7d9be185.json"

#Normalized input values
inputs = [0, 0, 1, 1.4375, 1.175, 0.4125, 0.63571550, 0.3220325, 1.5848515, 0.747181]

project_id = "crab-age-pred"                #Project ID from Vertex AI
endpoint_id = "7762332189773004800"         #Endpoint ID from the Endpoints section

predict_tabular_sample(project_id, endpoint_id, inputs)

Output

[[8.01214314]]

This is how we can make predictions. For the inputs, make sure to apply the same transformation and normalization that we applied to the training data.
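As a rough sketch, assuming the encoded test_dataset and the norma_data function from train.py are available in the notebook, one row could be prepared and sent like this; the column order must match the order used during training:

# Normalize one already-encoded test row with the training statistics
sample = norma_data(test_dataset.iloc[[0]])
inputs = sample.iloc[0].tolist()

predict_tabular_sample(project_id, endpoint_id, inputs)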

With this we have completed the project and learned how to train, deploy and to get predictions of the custom trained ML model.

I hope you will find it useful.

See you again.

Kubeflow Pipeline on Vertex AI for Custom ML Models

What is Kubeflow?

"Kubeflow is an open-source project created to simplify the deployment of ML pipelines. It uses components, written as Python functions, for each step of the pipeline. Each component runs in an isolated container with all the required libraries, and the components are run in series, one by one."

In this article we are going to train a custom machine learning model on Vertex AI using Kubeflow Pipeline.

About Dataset

Credit Card Customers dataset from Kaggle will be used. The 10,000 customer records in this dataset include columns for age, salary, marital status, credit card limit, credit card category, and other information. In order to predict the customers who are most likely to leave, we must analyse the data to determine the causes of customer churn.

Interesting Read: In the world of hacking, we’ve reached the point where we’re wondering who is a better hacker: humans or machines.

Let’s Start

Custom Model Training

Step 1: Getting Data

We will download the dataset from GitHub. There are two CSV files in the downloaded dataset, called churner_p1 and churner_p2. I have created a BigQuery dataset credit_card_churn with the tables churner_p1 and churner_p2 from these CSV files. I have also created a bucket called credit-card-churn on Cloud Storage; this bucket will be used to store the artifacts of the pipeline.
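The tables were created through the console; as a sketch, the same load could be done with the BigQuery Python client (the GCS file paths are assumptions):

from google.cloud import bigquery

client = bigquery.Client(project="credit-card-churn")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # infer the schema from the CSV files
)

for table in ["churner_p1", "churner_p2"]:
    uri = f"gs://credit-card-churn/{table}.csv"
    load_job = client.load_table_from_uri(
        uri, f"credit-card-churn.credit_card_churn.{table}", job_config=job_config
    )
    load_job.result()  # wait for the load job to finish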

Step 2: Employing Workbench

Enable the Notebook API by going to Vertex AI and then to the Workbench section. Then select Python 3 by clicking on New Notebook. Make sure to choose the us-central1 region.

It will take a few minutes to create the Notebook instance. Once the notebook is created click on the Open JupyterLab to launch the JupyterLab.

We will also have to enable the following APIs from API and services section of Vertex AI.

  1. Artifact Registry API
  2. Container Registry API
  3. AI Platform API
  4. ML API
  5. Cloud Functions API
  6. Cloud Build API

Now click on the Python 3 to open a jupyter notebook in the JupyterLab Notebook section and run the below code cells.

USER_FLAG = "--user"

!pip3 install {USER_FLAG} google-cloud-aiplatform==1.7.0
!pip3 install {USER_FLAG} kfp==1.8.9

This will install google cloud AI platform and Kubeflow packages. Make sure to restart the kernel after the packages are installed.

import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Create the variable PROJECT_ID with the name of project.

BUCKET_NAME = "gs://" + PROJECT_ID
BUCKET_NAME

Create the variable BUCKET_NAME; this will resolve to the same bucket name we created earlier.

import matplotlib.pyplot as plt
import pandas as pd
from kfp.v2 import compiler, dsl
from kfp.v2.dsl import pipeline, component, Artifact, Dataset, Input, Metrics, Model, Output, InputPath, OutputPath
from google.cloud import aiplatform

# We'll use this namespace for metadata querying
from google.cloud import aiplatform_v1

PATH = %env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

REGION = "us-central1"

PIPELINE_ROOT = f"{BUCKET_NAME}/pipeline_root/"
PIPELINE_ROOT

This will import the required packages and set the pipeline root folder in the credit-card-churn bucket.

#First component in the pipeline to fetch data from BigQuery.
#Table1 data is fetched
@component(
    packages_to_install=["google-cloud-bigquery==2.34.2", "pandas", "pyarrow"],
    base_image="python:3.9",
    output_component_file="dataset_creating_1.yaml"
)
def get_data_1(
    bq_table: str,
    output_data_path: OutputPath("Dataset")
):
    from google.cloud import bigquery
    import pandas as pd

    bqclient = bigquery.Client()
    table = bigquery.TableReference.from_string(bq_table)
    rows = bqclient.list_rows(table)
    dataframe = rows.to_dataframe(create_bqstorage_client=True)
    dataframe.to_csv(output_data_path)

The first component of the pipeline will fetch the data from the table churner_p1 in BigQuery and pass the CSV file as the output for the next component. The structure is the same for every component: we use the @component decorator to install the required packages and specify the base image and output file, then we create the get_data_1 function to get the data from BigQuery.

#Second component in the pipeline to fetch data from BigQuery.
#Table2 data is fetched
#The first and second components don't need inputs from any other components
@component(
    packages_to_install=["google-cloud-bigquery==2.34.2", "pandas", "pyarrow"],
    base_image="python:3.9",
    output_component_file="dataset_creating_2.yaml"
)
def get_data_2(
    bq_table: str,
    output_data_path: OutputPath("Dataset")
):
    from google.cloud import bigquery
    import pandas as pd

    bqclient = bigquery.Client()
    table = bigquery.TableReference.from_string(bq_table)
    rows = bqclient.list_rows(table)
    dataframe = rows.to_dataframe(create_bqstorage_client=True)
    dataframe.to_csv(output_data_path)

The second component of the pipeline will fetch the data from the table churner_p2 in BigQuery and pass the CSV file as the output for the next component. The first and second components do not need inputs from any other components.

#Third component in the pipeline to combine data from the 2 sources and do some data transformation
@component(
    packages_to_install=["sklearn", "pandas", "joblib"],
    base_image="python:3.9",
    output_component_file="model_training.yaml",
)
def data_transformation(
    dataset1: Input[Dataset],
    dataset2: Input[Dataset],
    output_data_path: OutputPath("Dataset"),
):
    from sklearn.metrics import roc_curve
    from sklearn.model_selection import train_test_split
    from joblib import dump
    from sklearn.metrics import confusion_matrix
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd

    data1 = pd.read_csv(dataset1.path)
    data2 = pd.read_csv(dataset2.path)
    data = pd.merge(data1, data2, on='CLIENTNUM', how='outer')
    data.drop(["CLIENTNUM"], axis=1, inplace=True)
    data = data.dropna()
    cols_categorical = ['Gender', 'Dependent_count', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']
    data['Attrition_Flag'] = [1 if cust == "Existing Customer" else 0 for cust in data['Attrition_Flag']]
    data_encoded = pd.get_dummies(data, columns=cols_categorical)
    data_encoded.to_csv(output_data_path)

The third component is where we combine the data from the first and second components and perform data transformations such as dropping the "CLIENTNUM" column, dropping the null values, and converting the categorical columns into numerical ones. We pass this transformed data as a CSV to the next component.

#Fourth component in the pipeline to train the classification model using decision trees or random forest
@component(
    packages_to_install=["sklearn", "pandas", "joblib"],
    base_image="python:3.9",
    output_component_file="model_training.yaml",
)
def training_classmod(
    data1: Input[Dataset],
    metrics: Output[Metrics],
    model: Output[Model]
):
    from sklearn.metrics import roc_curve
    from sklearn.model_selection import train_test_split
    from joblib import dump
    from sklearn.metrics import confusion_matrix
    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd

    data_encoded = pd.read_csv(data1.path)
    X = data_encoded.drop(columns=['Attrition_Flag'])
    y = data_encoded['Attrition_Flag']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100, stratify=y)
    model_classifier = RandomForestClassifier()
    model_classifier.fit(X_train, y_train)
    y_pred = model_classifier.predict(X_test)
    score = model_classifier.score(X_test, y_test)
    print('accuracy is:', score)
    metrics.log_metric("accuracy", (score * 100.0))
    metrics.log_metric("model", "RandomForest")
    dump(model_classifier, model.path + ".joblib")

In the fourth component, we train the model with a Random Forest classifier and use "accuracy" as the evaluation metric.

@component(
    packages_to_install=["google-cloud-aiplatform"],
    base_image="python:3.9",
    output_component_file="model_deployment.yaml",
)
def model_deployment(
    model: Input[Model],
    project: str,
    region: str,
    vertex_endpoint: Output[Artifact],
    vertex_model: Output[Model]
):
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=region)
    deployed_model = aiplatform.Model.upload(
        display_name="custom-model-pipeline",
        artifact_uri=model.uri.replace("model", ""),
        serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest"
    )
    endpoint = deployed_model.deploy(machine_type="n1-standard-4")
    # Save data to the output params
    vertex_endpoint.uri = endpoint.resource_name
    vertex_model.uri = deployed_model.resource_name

The fifth component is the last one; in it, we create the endpoint on Vertex AI and deploy the model. We have used python:3.9 as the base image, a pre-built scikit-learn container for serving predictions, and an "n1-standard-4" machine for the deployment.

@pipeline(
    # Default pipeline root. You can override it when submitting the pipeline.
    pipeline_root=PIPELINE_ROOT,
    # A name for the pipeline.
    name="custom-pipeline",
)
def pipeline(
    bq_table_1: str = "",
    bq_table_2: str = "",
    output_data_path: str = "data.csv",
    project: str = PROJECT_ID,
    region: str = REGION
):
    dataset_task_1 = get_data_1(bq_table_1)
    dataset_task_2 = get_data_2(bq_table_2)
    data_transform = data_transformation(dataset_task_1.output, dataset_task_2.output)
    model_task = training_classmod(data_transform.output)
    deploy_task = model_deployment(model=model_task.outputs["model"], project=project, region=region)

At the end we have the pipeline function, which calls all the components in sequence: dataset_task_1 and dataset_task_2 get the data from BigQuery, data_transform transforms the data, model_task trains the Random Forest classifier model, and deploy_task deploys the model on Vertex AI.

compiler.Compiler().compile(pipeline_func=pipeline, package_path="custom-pipeline-classifier.json")

Compiling the pipeline.

run1 = aiplatform.PipelineJob(
    display_name="custom-training-vertex-ai-pipeline",
    template_path="custom-pipeline-classifier.json",
    job_id="custom-pipeline-rf8",
    parameter_values={
        "bq_table_1": "credit-card-churn.credit_card_churn.churner_p1",
        "bq_table_2": "credit-card-churn.credit_card_churn.churner_p2",
    },
    enable_caching=False,
)

Creating the pipeline job.

run1.submit()

Running the pipeline job.

With this we have completed creating the Kubeflow pipeline and we can see it on the Pipelines section of Vertex AI.

 

Our Pipeline has run successfully and we have managed to get 100% accuracy for the classification.

We can use this model to get online predictions using the REST API or Python. We can also create different pipelines and compare their metrics on Vertex AI.
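As a rough sketch (the endpoint ID and feature values below are placeholders, and the instance must follow the column order produced by the data transformation component), an online prediction from Python could look like this:

from google.cloud import aiplatform

aiplatform.init(project="credit-card-churn", location="us-central1")

# The endpoint ID is a placeholder; copy the real one from the Endpoints section
endpoint = aiplatform.Endpoint("1234567890123456789")

# One encoded feature row; illustrative only, supply every encoded feature in training order
instance = [45.0, 3.0, 12691.0, 777.0]
prediction = endpoint.predict(instances=[instance])
print(prediction.predictions)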

With this we have completed the project and learned how to create a pipeline on Vertex AI for custom-trained models.

I hope you will find it useful.


See you again.

Machine Learning using Google's Vertex AI

Image by Google

What is Vertex AI?

“Vertex AI is Google’s platform which provides many Machine learning services such as training models using AutoML or Custom Training.”

Image by Google

Features of Vertex AI

We use Vertex AI to perform the following tasks in the ML workflow

  • Creation of dataset and Uploading data
  • Training ML model
  • Evaluate model accuracy
  • Hyperparameters tuning (custom training only)
  • Storing model in Vertex AI.
  • Deploying trained model to endpoint for predictions.
  • Send prediction requests to endpoint.
  • Managing models and endpoints.

To understand the workflow of Vertex AI, we will train a "Dogs vs Cats" classification model using Vertex AI's AutoML feature.

Step 1: Creating Dataset

We will download the dataset from Kaggle. In the downloaded zip file there are two zip files train.zip and test.zip. Train.zip contains the labelled images for training.

There are about 25,000 images in the train.zip file and 12,500 in the test.zip file. For this project we will only use 200 cat and 200 dog images to train. We will use the test set to evaluate the performance of our model.

After extracting the data, I uploaded the images to the google cloud storage bucket called dogs_cats_bucket1 which I have created at us-central1 region. Images are stored in two folders train and test in the bucket.

Best Read: Top 10 AI Challenges

Now we need to create a CSV file with each image's address and label; for that, I have written the following lines of code.

from google.cloud import storage
import pandas as pd
import os

#Authentication using the service account.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/content/dogs-vs-cats-354105-19b7b157b2b8.json"

BUCKET = 'dogs_cats_bucket1'
DELIMITER = '/'
TRAIN_PREFIX = 'train/'
TRAIN_BASE_PATH = f'gs://{BUCKET}/{TRAIN_PREFIX}'

print("Starting the import file generation process")
print("Process Details")
print(f"BUCKET : {BUCKET}")

storage_client = storage.Client()
data = []

print("Fetching list of Train objects")
train_blobs = storage_client.list_blobs(BUCKET, prefix=TRAIN_PREFIX, delimiter=DELIMITER)

for blob in train_blobs:
    label = "cat" if "cat" in blob.name else "dog"
    full_path = f"gs://{BUCKET}/{blob.name}"
    data.append({
        'GCS_FILE_PATH': full_path,
        'LABEL': label
    })

df = pd.DataFrame(data)
df.to_csv('train.csv', index=False, header=False)

After running the script in the Jupyter notebook, we have the required CSV file; we will upload this file to the same storage bucket as well.

Now in the Vertex AI section go to Datasets and enable the Vertex AI API.

Click Create Dataset and name it. I have named it cat_dog_classification. We will select Image Classification (Single-label). Make sure the region is us-central1. Hit Create.

In the next section mark Select import files from Cloud Storage and select the train.csv from Browse. Hit Continue

 

Vertex AI took 16 minutes to import the data. Now we can see the data in the Browse and Analyse tabs.

 

Now we can train the model.

Step 2: Model Training

Go to Vertex AI, then to Training section and click Create. Make sure the region is us-central1.

In the Dataset select cat_dog_classification and keep default for everything else with Model Training Method as AutoML.

Click continue for the Model Details and Explainability with the default settings.

For Compute and Pricing give 8 maximum node hours.

Hit Start Training.

 

The model training is completed after 29 mins.

Step 3: Model Evaluation

Clicking on the trained model takes us to the model stats page, where we have stats like the precision-recall curve, precision-recall by threshold, and the confusion matrix.

With the above stats the model looks good.

Step 4: Model Deployment

Go to Vertex AI, then to the Endpoints section and click Create Endpoint. Make sure the region is us-central1.

Give dogs_cats as the name of Endpoint and click Continue.

In the Model Settings, select cat_dog_classification as the Model Name, Version 1 as the Version, and 2 as the number of compute nodes.

Click Done and Create.

It takes about 10 minutes to deploy the model.

With this our model is deployed.

Step 5: Testing Model

Once the model is deployed, we can test the model by uploading the test image or creating Batch Prediction.

To Test the Model, we go to the Deploy and Test section on the Model page.

Click on Upload Image to upload a test image.

With this we can see our model is working well on test images.

We can also connect to the Endpoint using Python and get the results.
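As a rough sketch (the endpoint ID and image path are placeholders; the project ID is inferred from the service account key used earlier), an online prediction against the deployed image endpoint could look like this:

import base64
from google.cloud import aiplatform

aiplatform.init(project="dogs-vs-cats-354105", location="us-central1")

# The endpoint ID is a placeholder; copy the real one from the Endpoints section
endpoint = aiplatform.Endpoint("1234567890123456789")

# AutoML image models expect the image bytes as a base64-encoded string
with open("test/1.jpg", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

prediction = endpoint.predict(instances=[{"content": content}])
print(prediction.predictions)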


This is the end of my blog. We have learned how to train an image classification model on Google’s Vertex AI using Auto ML feature. I have enjoyed every minute while working on it.

For the next article we will see how to train custom model on Vertex AI with TensorFlow.

Stay Tuned.

Big data: What Seemed Like Big Data a Couple of Years Back is Now Small Data!

Gartner, Inc. predicts that organizations’ attention will shift from big data to small and wide data by 2025 as 70% are likely to find the latter more useful for context-based analytics and artificial intelligence (AI).

To know more about Indium’s data engineering services

Visit

Small data consumes less data but is just as insightful because it leverages techniques such as:

  • Time-series analysis techniques
  • Few-shot learning
  • Synthetic data
  • Self-supervised learning

Wide refers to the use of unstructured and structured data sources to draw insights. Together, small and wide data can be used across industries for predicting consumer behavior, improving customer service, and extracting behavioral and emotional intelligence in real-time. This facilitates hyper-personalization and provides customers with an improved customer experience. It can also be used to improve security, detect fraud, and develop adaptive autonomous systems such as robots that use machine learning algorithms to continuously improve performance.

Why is big data not relevant anymore?

The first reason is the sheer volume of data being produced every day by nearly 4.9 billion people browsing the internet for an average of seven hours a day. Further, embedded sensors are also continuously generating streaming data throughout the day, making big data even bigger.

Secondly, big data processing tools are unable to keep pace and pull data on demand. Big data can be complex and difficult to manage due to the various intricacies involved, right from ingesting the raw data to making it ready for analytics. Despite storing millions or even billions of records, it may still not be big data unless it is usable and of good quality. Moreover, for data to be truly meaningful in providing a holistic view, it will have to be aggregated from different sources, and be in structured and unstructured formats. Proper organization of data is essential to keep it stable and access it when needed. This can be difficult in the case of big data.

Thirdly, there is a dearth of skilled big data technology experts. Analyzing big data requires data scientists to clean and organize the data stored in data lakes and warehouses before integrating and running analytics pipelines. The quality of insights is determined by the size of the IT infrastructure, which, in turn, is restricted by the investment capabilities of the enterprises.

What is small data?

Small data can be understood as structured or unstructured data collected over a period of time in key functional areas. Small data is less than a terabyte in size. It includes:

  • Sales information
  • Operational performance data
  • Purchasing data

It is decentralized and can fit into secure data packets with interoperable wrappers. It can facilitate the development of effective AI models, provide meaningful insights, and help capture trends. Prior to adding larger volumes of semi-structured or unstructured data, the integrity, accessibility, and usefulness of the core data should be ascertained.

Benefits of Small Data

Having a separate small data initiative can prove beneficial for the enterprise in many ways. It can address core strategic problems about the business and improve the application of big data and advanced analytics. Business leaders can gain insights even in the absence of substantial big data. Managing small data efficiently can improve overall data management.

Some of the advantages of small data are:

  • It is present everywhere: Anybody with a smartphone or a computer can generate small data every time they use social media or an app. Social media is a mine of information on buyer preferences and decisions.
  • Gain quick insights:  Small data is easy to understand and can provide quick actionable insights for making strategic decisions to remain competitive and innovative.
  • It is end-user focused: When choosing the cheapest ticket or the best deals, customers are actually using small data. So, small data can help businesses understand what their customers are looking for and customize their solutions accordingly.
  • Enable self-service: Small data can be used by business users and other stakeholders without needing expert interpretation. This can accelerate the speed of decision making for timely response to events in real-time.

For small data to be useful, it has to be verifiable and have integrity. It must be self-describing and interoperable.

Indium can help small data work for you

Indium Software, a cutting-edge software development firm, has a team of dedicated data scientists who can help with data management, both small and big. Recognized by ISG as a strong contender for data science, data engineering, and data lifecycle management services, the company works closely with customers to identify their business needs and organize data for optimum results.

Indium can design the data architecture to meet customers’ small and large data needs. They also work with a variety of tools and technologies based on the cost and needs of customers. Their vast experience and deep expertise in open source and commercial tools enable them to help customers meet their unique data engineering and analytics goals.

FAQs

 

What is the difference between small and big data?

Small data typically refers to small datasets that can influence current decisions. Big data is a larger volume of structured and unstructured data for long-term decisions. It is more complex and difficult to manage.

What kind of processing is needed for small data?

Small data processing involves batch-oriented processing while for big data, stream processing pipelines are used.

What values does small data add to a business?

Small data can be used for reporting, business Intelligence, and analysis.

Top 5 Applications of Computer Vision (CV)

Imagine you are driving a car. You see a person move into the path of your car, making you take an appropriate action.

You would either apply brake and/or reduce the speed of the car. Thus, in a fraction of a second, the human vision has completed a complex task: of identifying the object, processing data, and making a timely decision.

That bit of detail helps understand the computer vision technology.

It is a field of computer science that enables computers to see, identify and process images in much the same way as the human vision before generating the necessary output.

The objective of computer vision is to enable computers to accomplish the same types of tasks as humans… with the same level of efficiency.

According to a report by Grand View Research, the global computer vision market size is forecast to grow at a compound annual growth rate of 7.6 percent between 2020 and 2027.

Advancements in artificial intelligence (AI), deep learning and neural networks have contributed to the growth of computer vision in recent years, so much so they are outdoing humans in tasks such as identifying and labelling objects.


The high volume of data being generated—an estimate is that 3.2 billion images are shared every day, to go with 720,000 hours of video—is another contributing factor which helps train and improve computer vision.

How computer vision works

Pattern recognition is the most important aspect of computer vision.

Therefore, one way to train machines to understand visual data is to feed labelled images and apply software methodologies or algorithms to help them identify patterns in those labelled or pre-identified images.

For example, if a computer is fed with tens of thousands of images of an object, it will use the algorithm to analyze the features and shapes to recognize the labelled profile of the object.

This is part of training a computer which, thereafter, will use its experience to identify unlabelled images of the object it was previously fed with.

Rates of accuracy for object identification and classification have increased from 50 percent to 99 percent in less than a decade, with modern systems proving more accurate than humans at detecting and responding to visual inputs.

Applications

Use cases of computer vision are not only limited to tech companies but the technology is integrated into key, everyday products for higher efficiency.

Self-driving cars

Computer vision helps self-driving cars understand their surroundings and thereby drive the passengers safely to their destination, avoiding potential collisions and accidents.

Cameras fitted around the car capture video from various angles and the data is fed into the computer vision software, which processes the input in real-time to understand the road condition, read traffic signals and identify objects and pedestrians en route.

The technology also enables self-driving vehicles to make critical on-road decisions such as giving way to ambulances and fire engines.

With millions killed in car accidents each year, safe transportation powered by computer vision is paramount.

Facial recognition

Computer vision algorithms identify facial features in images and correlate them with the database of face profiles.

The high volume of images available online for analysis has contributed to machines learning and identifying individuals from photos and videos.

Securing of smartphones is the most common example of computer vision in facial recognition.

Computer vision systems are adept at identifying distinguishing patterns in retinas and irises, while they also help improve the security of valuable assets and locations.

According to a NIST report, the leading facial recognition algorithm as of 2020 has an error rate of 0.08 percent, a remarkable improvement on the 4.1 percent error rate in 2014.

Medical diagnosis

Engineers at the University of Central Florida’s Computer Vision Research Center taught a computer to find specks of lung cancer in CT scans, which is often difficult to identify for radiologists.

According to the team, the AI system has an accuracy rate of about 95 percent, an improvement on the 65 percent by human eyes.

It essentially proves that computer vision is adept at identifying patterns that even the human visual system may miss.

Such applications help patients receive timely treatment for cancer.

Manufacturing

Computer vision helps enhance production lines and digitize processes and workers in the manufacturing industry.

On the production line, the key use cases are the inspection of parts and products for defects, flagging of events and discrepancies, and controlling processes and equipment.

Thus, the technology eliminates the need for human intervention on the production line.

Law

Computer vision enables the prevention of crimes by helping security officials scan live footage from a public place to detect objects such as guns or identify suspect behavioral patterns that may precede illegal and dangerous action by individuals.

The technology also aids authorities with the scanning of crowds of people to identify any wanted individuals.


What’s the future of computer vision?

Considering the modern capabilities of computer vision, it's a surprise that many applications and advantages of the technology remain unexplored.

In the future, computer vision technologies will be easier to train and they will also capture more information from images than they do now.

It is being said that computer vision will play a key role in the development of artificial general intelligence and artificial superintelligence by enabling them to process information on par with or better than the human visual system.

Statistical Distributions

Introduction

Statistics is a solid tool for practising the art of data science. Broadly speaking, statistics is a mathematical body of knowledge that pertains to collecting, analyzing, interpreting, and drawing conclusions from information.

With statistics, we can work with data in a more informative and targeted way than we could with basic visualization charts alone.

The math involved in statistics helps us draw firm conclusions about our data rather than just estimating.

Statistics, in short, is the study of data, through which we can gain deeper knowledge, find more insights, and understand how the data is structured.

Statistics mainly involves descriptive statistics (summary statistics that quantitatively describe or summarize features of the collected information) and inferential statistics (techniques used to make probability-based decisions and accurate predictions).

Statistical Quantities

The five basic statistical quantities are the mean, median, mode, variance, and standard deviation.

Mean, median, and mode are also called measures of central tendency. A central tendency (or measure of central tendency) is a central or typical value for a probability distribution.

It may also be called a center or location of the distribution.

A central tendency can be calculated for either a finite set of values or for a theoretical distribution, such as the normal distribution.

Mean: The arithmetic mean (or simply, mean) is the sum of all measurements divided by the number of observations in the data set.

Median: The middle value that separates the higher half from the lower half of the data set; it is also known as the 50th percentile. The median and the mode are the only measures of central tendency that can be used for ordinal data, in which values are ranked relative to each other but are not measured absolutely.

Mode: The mode of a set of data values is the value that appears most often. It is the value x at which the probability mass function takes its maximum value.

Variance: It measures how far a set of numbers is spread out from its mean. It is calculated by taking the differences between each number in the set and the mean, squaring the differences, and dividing the sum of the squares by the number of values in the set.

Standard Deviation: It tells you how much the data deviates from the mean. It is the square root of the variance.

A low standard deviation indicates that the data points tend to be close to the mean, while a high standard deviation indicates that the data points are spread out over a wider range of values.

A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the data.
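As a small sketch, all five quantities can be computed with Python's standard statistics module (the sample values are arbitrary):

import statistics

values = [4, 8, 6, 5, 3, 2, 8, 9, 2, 8]

print(statistics.mean(values))       # mean
print(statistics.median(values))     # median
print(statistics.mode(values))       # mode (8 appears most often)
print(statistics.pvariance(values))  # population variance
print(statistics.pstdev(values))     # population standard deviation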

Discrete Vs Continuous

Discrete variables are countable in a finite amount of time. For example, the total number of students in a class or the amount you deposited in your bank account are discrete, since they are countable.

Continuous data can technically take an infinite number of values. For example, a person's height could be any value (within the range of human heights), not just certain fixed heights.

Distributions

Fitting the Right Distribution

When deciding which distribution should characterize the data, it is always best to start with the raw data and try to fit the right distribution to it.

A few basic questions help with this characterization. The first is whether the data is discrete or continuous.

The second looks at symmetry: whether there is any asymmetry, in other words whether positive and negative outliers are equally likely or one is more likely than the other.

The third relates to the upper and lower limits of the data and the likelihood of observing extreme values in the distribution; in some data, extreme values occur very infrequently, whereas in others, they occur more often.

The Poisson Distribution

History: The Poisson distribution was named after the French mathematician Siméon Denis Poisson.

Description: The Poisson distribution is used to calculate the number of events that might occur in a continuous time interval.

The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

For instance, how many emails might arrive in a given time window; the key parameter that is required is the average number of events in the given interval.

The resulting distribution looks like the binomial, with the skewness being positive but decreasing with λ.

Probability of observing k events in an interval for a Poisson distribution:

P(k; λ) = (λ^k * e^(−λ)) / k!

Where,

λ (lambda) is the average number of events per interval

e is the number 2.71828…

k takes the values 0, 1, 2, …

k! = k × (k − 1) × (k − 2) × … × 2 × 1 is the factorial of k.

This equation is the probability mass function (PMF) for a Poisson distribution

Poisson Distribution Example

The average number of homes sold by the Acme Realty company is 2 homes per day. What will be the probability that exactly 3 homes will be sold tomorrow?

This is a Poisson experiment in which we know the following:

 μ = 2; since 2 homes are sold per day, on average.

x = 3; since we want to find the likelihood that 3 homes will be sold tomorrow.

e = 2.71828; since e is a constant equal to approximately 2.71828.

Formula:

P(x; μ) = (e^(−μ))(μ^x) / x!

P(3; 2) = (2.71828^(−2))(2^3) / 3!

P(3; 2) = (0.13534)(8) / 6

P(3; 2) = 0.180

Thus, the probability of selling 3 homes tomorrow is 0.180.
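This can be checked in a couple of lines with SciPy (a sketch):

from scipy.stats import poisson

# P(X = 3) for a Poisson distribution with mean 2
print(poisson.pmf(3, mu=2))  # ≈ 0.180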

Binomial Distribution

History: Swiss mathematician Jakob Bernoulli, determined the probability of k such outcomes.

Description: The binomial distribution measures the probabilities of the number of successes over a given number of trials with a specified probability of success in each try. A binomial experiment is a statistical experiment that consists of n repeated trials.

Each trial can result in just two possible outcomes: one is known as a success and the other a failure.

The probability of success, denoted by P, is the same on every trial. A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution.

The binomial distribution is the basis for the popular binomial test of statistical significance.

Binomial Formula

Suppose a binomial experiment consists of n trials and results in x successes. If the probability of success on an individual trial is P, then the binomial probability is:

b(x; n, P) = nCx * P^x * (1 − P)^(n − x)

or

b(x; n, P) = { n! / [ x! (n − x)! ] } * P^x * (1 − P)^(n − x)

where,

x is the number of successes that result from the binomial experiment

n is the number of trials in the binomial experiment

P is the probability of success on an individual trial

Q is the probability of failure on an individual trial, equal to 1 − P

n! is the factorial of n

b(x; n, P) is the binomial probability

nCx is the number of combinations of n things taken x at a time

Binomial Distribution Example:

Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?

This is a binomial experiment in which the number of trials is equal to 5, the number of successes is equal to 2, and the probability of success on a single trial is 1/6 or about 0.167. Therefore, the binomial probability is:

b(2; 5, 0.167) = 5C2 * (0.167)^2 * (0.833)^3

b(2; 5, 0.167) = 0.161
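This result can be verified in a couple of lines of Python; a minimal sketch assuming scipy is available:

from scipy.stats import binom

n, k, p = 5, 2, 1/6   # 5 tosses, 2 fours wanted, P(rolling a four) = 1/6

# Binomial PMF: C(n, k) * p^k * (1 - p)^(n - k)
prob = binom.pmf(k, n, p)
print(round(prob, 3))  # about 0.161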

Negative Binomial Distribution: Assume that the number of successes is fixed at a given number, and estimate the number of trials required before reaching that specified number of successes.

The resulting distribution is called the negative binomial and it very closely resembles the Poisson.

In fact, the negative binomial distribution converges on the Poisson distribution, but will be more skewed to the right (positive values) than the Poisson distribution with similar parameters.

Uniform Distribution

Description:  A uniform distribution, sometimes also known as a rectangular distribution, is a distribution that has constant probability.

On rolling a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are equally likely and that is the basis of a uniform distribution.

Unlike the Bernoulli distribution, all n possible outcomes of a uniform distribution are equally likely.

f(x) = 1 / (b – a), for a ≤ x ≤ b

Where,

f(x)     density function of the uniformly distributed random variable X

a, b     parameters (the lower and upper limits of the interval)

Types

This distribution has two types.

The most common type you’ll find in elementary statistics is the continuous uniform distribution. The second is the discrete uniform distribution; its plot still resembles a rectangle, but instead of a continuous line, a series of dots represents the finite number of outcomes.

Example: Rolling a single die is an example of a discrete uniform distribution. It produces six possible outcomes: 1, 2, 3, 4, 5, or 6, and there is a 1/6 probability for each number being rolled.

Uniform Distribution Example:

Let metro trains on a certain line run every half hour between midnight and six in the morning.

What is the probability that a man entering the station at a random time during this period will have to wait at least twenty minutes?

Here, let X denote the waiting time (in minutes) for the next train, under the assumption that the man arrives at the station at a random time.

X is distributed uniformly on (0, 30) with probability density function

f(x) = 1/30 for 0 < x < 30, and f(x) = 0 otherwise.

The probability that he has to wait at least 20 minutes is

P(X ≥ 20) = ∫ from 20 to 30 of (1/30) dx

= (1/30)(30 – 20)

= 1/3
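The same answer can be sketched in Python with scipy.stats.uniform, assuming scipy is installed (loc and scale simply mirror the interval in the example):

from scipy.stats import uniform

# Waiting time is uniform on (0, 30): loc = lower limit, scale = width of the interval
wait = uniform(loc=0, scale=30)

# P(X >= 20) is the survival function evaluated at 20
print(round(wait.sf(20), 4))  # about 0.3333, i.e. 1/3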

Normal Distribution

History :  de Moivre developed the normal distribution as an approximation to the binomial distribution, and it was subsequently used by Laplace in 1783 to study measurement errors and by Gauss in 1809 in the analysis of astronomical data

Description: The normal distribution is one of the most common continuous probability distributions and is highly important in statistics. It is often used in the natural and social sciences to model real-valued random variables whose distributions are not known. The normal distribution, also known as the Gaussian distribution, is symmetric about the mean, which means that data near the mean are more frequent in occurrence than data far from the mean.

The Normal Equation.

Y = { 1/[ σ * sqrt(2π) ] } * e^( −(x – μ)² / (2σ²) )

Where ,

X      normal random variable

μ       mean,

σ        standard deviation

π        approximately 3.14159

e        approximately 2.71828.

Normal Distribution Example.

Suppose the data is normally distributed and 95% of the values lie between 1.2m and 1.8m; the mean and standard deviation can then be estimated. The mean is halfway between 1.2m and 1.8m:

Mean = (1.2m + 1.8m) / 2 = 1.5m

95% of the values span about 4 standard deviations (2 standard deviations on either side of the mean), so

one standard deviation = (1.8m – 1.2m) / 4

                       = 0.6m / 4

                       = 0.15m

The mean and standard deviation for this normally distributed data are therefore 1.5m and 0.15m (one standard deviation).
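As a quick check in Python, plugging these estimates into scipy.stats.norm should place roughly 95% of the probability mass between 1.2m and 1.8m (a small sketch using only the numbers from the example above):

from scipy.stats import norm

heights = norm(loc=1.5, scale=0.15)  # mean 1.5 m, standard deviation 0.15 m

# Probability mass between 1.2 m and 1.8 m; should be close to 0.95
print(round(heights.cdf(1.8) - heights.cdf(1.2), 4))  # about 0.9545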


Conclusion:

Statistical distributions are widely used in many sectors, such as computer science, finance, insurance, engineering, medicine, the stock market, and day-to-day life.

The key to good data analytics is fitting the right distribution to the data and obtaining the best estimates.

The distributions described above appear in day-to-day life and can be related to, compared with, and analyzed alongside other distributions.

The post Statistical Distributions appeared first on Indium.

Introduction to Hypothesis Testing https://www.indiumsoftware.com/blog/hypothesis-testing/ Fri, 16 Nov 2018 09:24:00 +0000 https://www.indiumsoftware.com/blog/?p=463
Introduction

Consider an operations man at a manufacturing plant; he produces thousands of units of a part, say a screw.

It is nearly impossible to check the deviation in the measurements of every one of those thousands of screws.

What is his way out to find that the screw’s measurement doesn’t shoot up above a threshold? Enter our dear friend from statistics called “Hypothesis testing”.

We’ll see in the latter half how he can use simple measurements like the mean and standard deviation of a “sample” to check if things are under control.

What is hypothesis testing?

Hypothesis testing can sound scary to non-statisticians, but it has had its applications in even the most longstanding areas, like the judicial system.

What happens when a person is presented before the court for a crime? There is basically a hypothesis or a statement or a stand taken by the court.

This is called the Null hypothesis, it always defines a default or a natural case.

H0: The person is innocent

The opposite of the Null hypothesis is the Alternate hypothesis which inverts null.


Ha: The person is guilty.

The judge decides whether the person is innocent or guilty by hearing arguments. Here is where it gets interesting: you will never hear the judge say “the person is guilty” if he is a statistics man, because in statistics there are only 2 possibilities.

  • Reject the null hypothesis – This means we have enough evidence to suggest that the null hypothesis can be rejected, i.e., that it is probably wrong to say the person is innocent.
    If the judge takes this possibility, you can say that the person is toast. He goes to prison.
  • Do not reject the null hypothesis – This means we don’t have enough evidence to suggest that the null hypothesis is wrong; in other words, we cannot say the person is not innocent.
    What the judge means by that complicated sentence is that the person might be innocent, and the person walks free.

Now, once probability is in action, there are bound to be errors. In the above case, there are two possible errors.

Imagine the Type I error: in reality the person is innocent, but the judge rejects the null hypothesis and sends him behind bars.

That is sending an innocent man to prison for a crime he did not commit.

Now, imagine the reverse, the Type II error: the person is not innocent, but the judge cannot reject the null hypothesis and sends him away scot-free.

Which is more devastating? Of course, the first one if you stand for justice for the individual.

This is the pillar on which the judicial system is built: an innocent individual should not be punished.

Minimise the Type I error at all costs, even if it means it is not possible to declare Dawood a blasts mastermind, Zawahiri a terrorist, or even Pablo a drug peddler.

But first, we need to capture them and produce them before the court, which looks like a more remote possibility than the sky turning neon green!

Hence the prosecution has to toil very hard to reject the null hypothesis, while the defence (the lawyers for the person in the box) just has to break that possibility by sowing doubts about the prosecution’s evidence.

Quite a job to get paid for: sow enough doubts to confuse the judge so that he doesn’t have enough evidence to reject the null hypothesis.

Example of Hypothesis test

Test of a single mean

Consider the operations manager wanting to test whether the mean measurement of the screws exceeds 170 mm. That could be worded in hypotheses as

H0: Mean, mu <= 170

Ha: Mean, mu > 170

Remember, the null hypothesis always contains the equality; it cannot contain only strict inequality symbols like “<” or “>”.

Once we set up the hypotheses, we need to take a sample and then find the test statistic. Only with the test statistic can we draw conclusions about the hypothesis.

The test statistic has a sampling distribution which tells us whether the sample mean is far away from the real mean or close to it.

If it is far away, then we have sufficient evidence to reject the null; otherwise, we fail to reject the null, as the sample mean is close to the real one.

We talk about the sample; why sample it? Because the population is so big that it is impossible to test each case.

It is to be noted that this hypothesis test is based on the assumption that the sample follows a normal distribution.

Computing Test statistic

Then we can calculate a test statistic called z.
z = (xbar – mu)/ (sigma/ sqrt(n))

This is just standardizing the value of the sample mean.

Suppose a sample of n = 400 screws has a sample mean xbar = 178, and the population standard deviation is sigma = 65. Then:

z = (178 – 170) / (65 / sqrt(400)) = 2.46
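The same standardization is easy to reproduce in Python; a minimal sketch using only the figures from this example:

import math

mu0 = 170    # hypothesized mean under the null hypothesis
xbar = 178   # sample mean
sigma = 65   # population standard deviation
n = 400      # sample size

z = (xbar - mu0) / (sigma / math.sqrt(n))
print(round(z, 2))  # 2.46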

What can we do with this test statistic, z?

We can compare it with a rejection-region cut-off called z_alpha. Remember, there should be enough evidence to reject the null hypothesis.

What counts as “enough” is very subjective, but in statistics 5% is the probability commonly used. If you want to enforce stricter rules to reduce the Type I error, use 1% or decrease it further.

Hence z_alpha can be read from the normal distribution table for a probability value of 5%, which gives 1.645.

This is a value worth memorising for easy calculations; better to memorise the 1% value from the table as well.

z vs z_alpha: 2.46 vs 1.645

z > z_alpha: what does it signify?

Let us put it in a graph and see.

Since 2.46 lies to the right of 1.645 and the null hypothesis is mean <= 170, we can reject the null hypothesis.

Is it so simple? No, that is where you encounter p-value. Z is just a test statistic.

What is a p-value?

The p-value of a test is the probability to observe a test statistic as extreme as the computed one given that the null hypothesis is true.

Just memorise this axiom for the sake of god, you don’t need anything else to conclude.

Close your eyes and conclude like a true statistical practitioner. Your words will never be rejected even if you reject the null hypothesis.

When p-value < 0.05, reject null hypothesis

p-value > 0.05, do not reject null hypothesis.

For our example the p-value would be

= P(xbar > 178) = P(z > 2.46) = 1 – P(z < 2.46) = 1 – 0.9931 = 0.0069

How to interpret this?

That is the probability of observing a sample mean as extreme as 178 is 0.0069. This is very small.

The probability to observe a test statistic as extreme as the computed one given that the null hypothesis is true is .0069. Hence reject the null hypothesis.

Is .0069 small or significant enough? That is where statistics has defined different ranges for describing the strength of evidence and levels of significance.

Variations of the test

What we saw above is a 1-tailed test, where we declared the null hypothesis with mean to be less than or equal to 170 and the alternate hypothesis having mean to be greater than 170.

That means the rejection region is on the right side.

When we need to test strict conditions like when we declare the null with mean equal to 170 and the alternate with mean to be not equal to 170.

It could be greater or lesser than 170. In that case, the rejection region is on either side of 170. This is called a 2-tailed test.

1-tailed test

H0: mu <= 170

Ha: mu > 170

2-tailed test

H0: mu = 170

Ha: mu!= 170

So what happens to the p-value in a 2-tailed test? The alpha, or level of significance, is split between the two tails, converting 0.05 into 0.025 per tail.

p-value = P(xbar > 178) + P(xbar < 162) = P(z > 2.46) + P(z < -2.46) = 0.0069 + 0.0069 = 0.0138

Still less than 0.05 and hence reject null hypothesis.
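Both p-values can be sketched in Python with scipy.stats.norm, assuming scipy is available (the numbers are again the ones from this example):

import math
from scipy.stats import norm

mu0, xbar, sigma, n = 170, 178, 65, 400
z = (xbar - mu0) / (sigma / math.sqrt(n))   # about 2.46

p_one_tailed = norm.sf(z)            # P(Z > z), about 0.0069
p_two_tailed = 2 * norm.sf(abs(z))   # both tails, about 0.0138

print(round(p_one_tailed, 4), round(p_two_tailed, 4))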

What are the types?

There are different types of hypothesis tests; what we looked at above is testing against a population mean via a single-mean z-test.

We can test for:

  1. Differences of means between 2 different populations via Independent t-test
  2. Difference between paired samples, typical before and after variations of a single population via Paired t-test.
  3. Differences in means for 3 or more independent populations via 1-way ANOVA (Analysis of Variance)
  4. The strength of the relationship between 2 categorical variables via chi-square test

Every test calculates the Holy Grail metric called p-value, and remember the axiom always!
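For reference, each of the tests listed above has a counterpart in scipy.stats; the sketch below uses small made-up arrays purely for illustration:

import numpy as np
from scipy import stats

a = np.array([178, 172, 169, 175, 171])   # illustrative sample 1
b = np.array([165, 170, 168, 173, 166])   # illustrative sample 2
c = np.array([160, 162, 159, 161, 163])   # illustrative sample 3
table = np.array([[30, 10], [20, 40]])    # illustrative contingency table

print(stats.ttest_ind(a, b))         # independent t-test: two different populations
print(stats.ttest_rel(a, b))         # paired t-test: before/after on one population
print(stats.f_oneway(a, b, c))       # 1-way ANOVA: 3 or more independent populations
print(stats.chi2_contingency(table)) # chi-square test: two categorical variables

Each of these returns a test statistic and the corresponding p-value, which is then compared against the chosen level of significance.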

Applications in Business

Hypothesis testing is used in many domains such as healthcare, insurance and manufacturing, predominantly across functions like operations and marketing.

Whenever there is a big population and we cannot check case by case, samples are taken and hypothesis testing is performed.


Some examples: in the case of healthcare, performing a blood test and inferring the presence of a microorganism;

  • In the case of product development, Measuring web traffic to conduct A/B testing to decide on the best features;
  • In the case of manufacturing, Using the computed mean measurements of shipped products and concluding if the sample is defective;
  • In the case of retail marketing, Comparing the sample campaign performance and concluding if the performance of the campaign has enhanced.

The post Introduction to Hypothesis Testing appeared first on Indium.
