machine learning Archives - Indium
https://www.indiumsoftware.com/blog/tag/machine-learning/

Generative AI: Scope, Risks, and Future Potential
https://www.indiumsoftware.com/blog/generative-ai-scope-risks-and-future-potential/ (Fri, 05 Apr 2024)

From planning travel itineraries to writing poetry, and even getting a research thesis generated, ChatGPT and its ‘brethren’ generative AI tools such as Sydney and Bard have been much in the news. Even generating new images and audio has become possible using this form of AI. McKinsey seems excited about this technology and believes it can provide businesses with a competitive advantage by enabling the design and development of new products and business process optimizations.

ChatGPT and similar tools are powered by generative artificial intelligence (AI), which facilitates the virtual creation of new content in any format – images, textual content, audio, video, code, and simulations. While the adoption of AI has been on the rise, Generative AI is expected to bring in another level of transformation, changing how we approach many business processes.

ChatGPT (Generative Pre-trained Transformer), for instance, was launched only in November 2022 by OpenAI. But, from then to now, it has become very popular because it generates decent responses to almost any question. In fact, in just 5 days, more than a million users signed up. Its effectiveness in creating content is, of course, raising questions about the future of content creators!

Some of the most popular examples of Generative AI are images and chatbots that have helped the market grow by leaps and bounds. The generative AI market was estimated at USD 10.3 billion in 2022 and is expected to grow at a CAGR of 32.2% to touch USD 53.9 billion by 2028.

Despite the hype and excitement around it, there are several unknown factors that pose a risk when using generative AI. For example, governance and ethics are some of the areas that need to be worked on due to the potential misuse of technology.

Check out this informative blog on deep fakes: Your voice or face can be changed or altered.

Decoding the secrets of Generative AI: Unveiling the learning process 

Generative AI leverages a powerful technique called deep learning to unveil the intricate patterns hidden within vast data troves. This enables it to synthesize novel data that emulates human-crafted creations. The core of this process lies in artificial neural networks (ANNs) – complex algorithms inspired by the human brain’s structure and learning capabilities. 

Imagine training a generative AI model on a massive dataset of musical compositions. Through deep learning, the ANN within the model meticulously analyzes the data, identifying recurring patterns in melody, rhythm, and harmony. Armed with this knowledge, the model can then extrapolate and generate entirely new musical pieces that adhere to the learned patterns, mimicking the style and characteristics of the training data. This iterative process of learning and generating refines the model’s abilities over time, leading to increasingly sophisticated and human-like outputs. 

In essence, generative AI models are not simply copying existing data but learning the underlying rules and principles governing the data. This empowers them to combine and manipulate these elements creatively, resulting in novel and innovative creations. As these models accumulate data and experience through the generation process, their outputs become increasingly realistic and nuanced, blurring the lines between human and machine-generated content.

Evolution of Machine Learning & Artificial Intelligence

From the classical statistical techniques of the 18th century for small data sets, to developing predictive models, machine learning has come a long way. Today, machine learning tools are used to classify large volumes of complex data and to identify patterns. These data patterns are then used to develop models to create artificial intelligence solutions.

Initially, the learning models are trained by humans. This process is called supervised learning. Soon after, they evolve towards self-supervised learning, wherein they learn by themselves using predictive models. In other words, they become capable of imitating human intelligence, thus contributing to process automation and performing repetitive tasks.

Generative AI is one step ahead in this process, wherein machine learning algorithms can generate the image or textual description of anything based on the key terms. This is done by training the algorithms using massive volumes of calibrated combinations of data. For example, 45 terabytes of text data were used to train GPT-3, to make the AI tool seem ‘creative’ when generating responses.

The models also use random elements, thereby producing different outputs from the same input request, making it even more realistic. Bing Chat, Microsoft’s AI chatbot, for instance, became philosophical when a journalist fed it a series of questions and expressed a desire to have thoughts and feelings like a human!

Microsoft later clarified that when asked 15 or more questions, Bing could become unpredictable and inaccurate.

Here’s a glimpse into some of the leading generative AI tools available today: 

ChatGPT: This OpenAI marvel is an AI language model capable of answering your questions and generating human-like responses based on text prompts. 

DALL-E 3: Another OpenAI creation, DALL-E 3, possesses the remarkable ability to craft images and artwork from textual descriptions. 

Google Gemini: Formerly known as Bard, this AI chatbot from Google is a direct competitor to ChatGPT. It leverages the PaLM large language model to answer questions and generate text based on your prompts. 

Claude 2.1: Developed by Anthropic, Claude boasts a 200,000 token context window, allowing it to process and handle more data compared to its counterparts, as claimed by its creators. 

Midjourney: This AI model, created by Midjourney Inc., interprets text prompts and transforms them into captivating images and artwork, similar to DALL-E’s capabilities. 

Sora: This model creates realistic and imaginative scenes from text instructions. It can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt. 

GitHub Copilot: This AI-powered tool assists programmers by suggesting code completions within various development environments, streamlining the coding process. 

Llama 2: Meta’s open-source large language model, Llama 2, empowers developers to create sophisticated conversational AI models for chatbots and virtual assistants, rivalling the capabilities of GPT-4. 

Grok: Developed by xAI, the company Elon Musk founded after his departure from OpenAI, Grok is a new entrant in the generative AI space. The first Grok model, known for its irreverent nature, was released in November 2023. 

These are just a few examples of the diverse and rapidly evolving landscape of generative AI. As the technology progresses, we can expect even more innovative and powerful tools to emerge, further blurring the lines between human and machine creativity. 

Underlying Technology

There are three techniques used in generative AI.

Generative Adversarial Networks (GANs)

GANs are powerful algorithms that have enabled AI to be creative by pitting two neural networks against each other: a generator that produces candidate samples and a discriminator that tries to distinguish them from real data. Training continues until the two reach an equilibrium in which the generated samples are hard to tell apart from real ones.
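To make the adversarial idea concrete, here is a minimal, illustrative training-step sketch in TensorFlow/Keras (chosen to match the code used later in this blog series). The layer sizes and the flattened 28x28 image shape are assumptions for the example, not a production setup.

import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 64

# Generator: turns random noise into a candidate sample (here, a flattened image)
generator = tf.keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(784, activation="sigmoid"),
])

# Discriminator: outputs a real-vs-fake logit for each sample
discriminator = tf.keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),
    layers.Dense(1),
])

g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(real_batch):
    noise = tf.random.normal([tf.shape(real_batch)[0], latent_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_batch = generator(noise, training=True)
        real_logits = discriminator(real_batch, training=True)
        fake_logits = discriminator(fake_batch, training=True)
        # Discriminator tries to label real samples 1 and generated samples 0
        d_loss = bce(tf.ones_like(real_logits), real_logits) + bce(tf.zeros_like(fake_logits), fake_logits)
        # Generator tries to make the discriminator label its samples as real
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables), discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables), generator.trainable_variables))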

Variational Auto-Encoders (VAE)

To enable the generation of new data, a variational autoencoder regularizes the distribution of encodings during training to ensure good properties of the latent space. The term “variational” is derived from the close relationship between this regularization and variational inference methods in statistics.
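As a rough illustration of that regularization, the VAE training objective combines a reconstruction term with a KL-divergence term that pulls the encoded distribution toward a standard normal. A minimal sketch of the loss (tensor shapes are assumed for the example) looks like this:

import tensorflow as tf

def vae_loss(x, x_reconstructed, z_mean, z_log_var):
    # Reconstruction term: how well the decoder reproduces the input
    reconstruction = tf.reduce_mean(
        tf.reduce_sum(tf.square(x - x_reconstructed), axis=-1))
    # KL term: regularizes the latent space so that sampling from it yields sensible outputs
    kl = -0.5 * tf.reduce_mean(
        tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
    return reconstruction + kl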

Transformers

Transformers are deep learning models that use a self-attention mechanism to weigh the importance of each part of the input data differentially; they are widely used in natural language processing (NLP) and computer vision (CV).
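The self-attention mechanism itself is compact. A minimal sketch of scaled dot-product attention, assuming inputs of shape [batch, sequence, depth], is:

import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)  # similarity of each query to each key
    weights = tf.nn.softmax(scores, axis=-1)                   # attention weights sum to 1 per query
    return tf.matmul(weights, v)                               # weighted sum of the value vectors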

Prior to ChatGPT, the world had already seen OpenAI’s GPT-3 and Google’s BERT, though they were not as much of a sensation as ChatGPT has been. Training models of this scale needs deep pockets.

Generative AI Use Cases

Content writing has been one of the primary areas where ChatGPT has seen much use. It can write on any topic within minutes by pulling in inputs from a variety of online sources. Based on feedback, it can finetune the content. It is useful for technical writing, writing marketing content, and the like.

Generating images such as high-resolution medical images is another area where it can be used. Artwork can be created using AI for unique works, which are becoming popular. By extension, designing can also benefit from AI inputs.

Generative AI can also be used for creating training videos that can be generated without the need for permission from real people. This can accelerate content creation and lower the cost of production. This idea can also be extended to creating advertisements or other audio, video, or textual content.

Code generation is another area where generative AI tools have proved to be faster and more effective. Gamification for improving responsiveness and adaptive experiences is another potential area of use.

Governance and Ethics

The other side of the Generative AI coin is deep fake technology. If used maliciously, it can create quite a few legal and identity-related challenges. It can be used to implicate somebody wrongly or frame someone unless there are checks and balances that can help prevent such malicious misuse.

It is also not free of errors, as the media website CNET discovered. The financial articles written using generative AI had many factual mistakes.

OpenAI has already announced GPT-4, but tech leaders such as Elon Musk and Steve Wozniak have asked for a pause in developing AI technology at such a fast pace without proper checks and balances. Security also needs to catch up, with appropriate safety controls to prevent phishing, social engineering, and the generation of malicious code.

There is a counter-argument to this too, which suggests that rather than pausing development, the focus should be on building a consensus on the parameters governing AI development. Identifying risk controls and mitigations will be more meaningful.

Indeed, risk mitigation strategies will play a critical role in ensuring the safe and effective use of generative AI for genuine needs. Selecting the right kind of input data to train the models, free of toxicity and bias, will be important. Instead of providing off-the-shelf generative AI models, businesses can use an API approach to deliver containerized and specialized models. Customizing the data for specific purposes will also help improve control over the output. The involvement of human checks will continue to play an important role in ensuring the ethical use of generative AI models.

This is a promising technology that can simplify and improve several processes when used responsibly and with enough controls for risk management. It will be an interesting space to watch as new developments and use cases emerge.

To learn how we can help you employ cutting-edge tactics and create procedures that are powered by data and AI

Contact us

FAQs

1. How can we determine the intellectual property (IP) ownership and attribution of creative works generated by large language models (LLMs)? 

Determining ownership of AI-generated content is a complex issue and ongoing legal debate. Here are some technical considerations: 
(i). LLM architecture and licensing: The specific model’s architecture and licensing terms can influence ownership rights. Was the model trained on open-source data with permissive licenses, or is it proprietary? 
(ii). Human contribution: If human intervention exists in the generation process (e.g., prompting, editing, curation), then authorship and ownership become more nuanced. 

2. How can we implement technical safeguards to prevent the malicious use of generative AI for tasks like creating deepfakes or synthetic media for harmful purposes?

Several approaches can be implemented: 
(i). Watermarking or fingerprinting techniques: Embedding traceable elements in generated content to identify the source and detect manipulations. 
(ii). Deepfake detection models: Developing AI models specifically trained to identify and flag deepfake content with high accuracy. 
(iii). Regulation and ethical frameworks: Implementing clear guidelines and regulations governing the development and use of generative AI, particularly for sensitive applications.

3. What is the role of neural networks in generative AI?

Neural networks are made up of interconnected nodes or neurons, organized in layers like the human brain. They form the backbone of Generative AI. They facilitate machine learning of complex structures, patterns, and dependencies in the input data to enable the creation of new content based on the input data.

4. Does Generative AI use unsupervised learning?

Yes. In generative AI, machine learning happens without explicit labels or targets. The models capture the essential features and patterns in the input data to represent them in a lower-dimensional space.

Collaboration of Synthetics: ML’s Evolutionary Edge
https://www.indiumsoftware.com/blog/collaboration-of-synthetics-mls-evolutionary-edge/ (Thu, 15 Feb 2024)

The desire for data is like a bottomless pit in the world of data and analytics today. The big data analytics business is predicted to reach $103 billion this year, and 181 zettabytes of data will be produced by 2025.

Despite the massive volumes of data being generated, access and availability remain a problem. Although public databases partially address this issue, certain risks are still involved. One of them is bias caused by improper use of data sets. The second difficulty is the need for diverse data to train algorithms so that they properly satisfy real-world requirements. Data accuracy also affects the quality of the algorithm. In addition, data is regulated to preserve privacy and can be expensive to obtain.

These problems can be resolved by using synthetic data, which enables businesses to quickly produce the data sets required to satisfy the demands of their clients. Gartner predicts that by 2030, synthetic data will likely surpass actual data in AI models, even though accurate data is still regarded as superior.

Decoding the Exceptional Synthetic Data

So, what do you get when we say synthetic data? At the forefront of modern data-driven research, institutions like the Massachusetts Institute of Technology (MIT) are pioneering the utilization of synthetic data. Synthetic data refers to artificially generated datasets that mimic real-world data distributions, maintaining statistical properties while safeguarding privacy. This innovative approach ensures that sensitive information remains confidential, as exemplified by MIT’s creation of synthetic healthcare records that retain essential patterns for analysis without compromising patient privacy. This technique’s relevance extends to various domains, from machine learning advancements to societal insights, offering a powerful tool to unlock valuable knowledge while upholding data security and ethical considerations.

Using synthetic data, new systems can be tested without live data, or when the available data is biased. Small datasets can be supplemented, and the accuracy of learning models can be improved. Synthetic data can also be used when real data cannot be used, shared, or moved. It can be used to create prototypes, conduct product demos, capture market trends, and prevent fraud. It can even be used to generate novel, futuristic conditions.

Most importantly, it can help businesses comply with privacy laws, mainly health-related and personal data. It can reduce the bias in data sets by providing diverse data that reflects the real world better.

Use Cases of Synthetic Data

Synthetic data can be used in different industries for different use cases. For instance, computer graphics and image processing algorithms can generate synthetic images, audio, and video that can be used for training purposes.

Synthetic text data can be used for sentiment analysis or for building chatbots and machine translation algorithms. Synthetically generated tabular data sets are used in data analytics and training models. Unstructured data, including images, audio, and video, is being leveraged for speech recognition, computer vision, and autonomous vehicle technology. Financial institutions can use synthetic data to detect fraud, manage risks, and assess credit risk. In the manufacturing industry, it can be used for quality control testing and predictive maintenance.

Also read: The Transformative Impact Of Generative AI On The Future Of Work.

Generating Synthetic Data
How synthetic data is generated will depend on the tools and algorithms used and the use case for which it is created. Three of the popular techniques used include:

Technique #1 Random Selection of Numbers: One standard method is randomly selecting numbers from a distribution. Though this may not provide insights like real-world data, the data distribution matches it closely.
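For example, a small sketch of this technique with NumPy, where the mean and standard deviation stand in for statistics that would be estimated from a real dataset:

import numpy as np

rng = np.random.default_rng(42)
real_mean, real_std = 120.0, 15.0                   # assumed: estimated from real transaction amounts
synthetic_amounts = rng.normal(loc=real_mean, scale=real_std, size=10_000)
print(synthetic_amounts.mean(), synthetic_amounts.std())   # distribution closely matches the source statistics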

Technique #2 Generating Agent-based Models: Unique agents are created using simulation techniques to enable them to communicate with each other. This is especially useful in complex systems where multiple agents, such as mobile phones, apps, and people, are required to interact with each other. Pre-built core components and Python packages such as Mesa are used to develop the models quickly, and a browser-based interface is used to view them.

Technique #3 Generative Models: Synthetic data replicating the statistical properties or features of real-world data is generated using algorithms. The model learns the statistical patterns and relationships in the training data and generates new synthetic data similar to the original. Generative adversarial networks and variational autoencoders are examples of generative models.

The model should be reliable to ensure the quality of the synthetic data. Additional verification is required, which involves comparing the model’s results with real-world data that has been annotated manually. Users must be sure that the synthetic data is not misleading, that it is reliable, and that it is 100% fail-safe with respect to privacy.

Synthetic Data with Databricks

Databricks offers dbldatagen, a Python library, to generate synthetic data for testing, creating POCs, and other uses such as Delta Live Tables pipelines in Databricks environments. It helps to:

● Create unique values for a column.
● Allow templated text generation based on specifications.
● Generate data from a specific set of values.
● Generate weighted data in case of repeating values.
● Write the generated data frame to storage in any format.
● Generate billions of rows of data quickly.
● Use a random seed and generate data based on the value of other fields.
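A minimal usage sketch covering a few of the capabilities listed above is shown below. It assumes a Databricks notebook where spark is already defined, and the column options used here (minValue, maxValue, values, weights, random) should be checked against the dbldatagen documentation for your version.

import dbldatagen as dg

# Assumes a Databricks environment where `spark` is predefined
spec = (dg.DataGenerator(spark, name="synthetic_customers", rows=1_000_000, partitions=8)
        .withColumn("customer_id", "long", minValue=1, maxValue=1_000_000)   # unique-looking id range
        .withColumn("plan", "string", values=["basic", "standard", "premium"],
                    weights=[6, 3, 1], random=True))                         # weighted, repeating values

df = spec.build()                                                # returns a Spark DataFrame
df.write.mode("overwrite").parquet("/tmp/synthetic_customers")   # write to storage in any supported format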


To learn more about Indium Software, please visit

Click Here

BFSI’s Tech Ride with NLP and Sentiment Analysis! Chatting with Erica, EVA, Amy, and Aida.
https://www.indiumsoftware.com/blog/bfsi-tech-nlp-sentiment-analysis/ (Tue, 17 Oct 2023)

Have you crossed paths with Erica from Bank of America, EVA from HDFC, Amy from HSBC, or Aida from SEB in Sweden?

If you’ve been dealing with banks and financial organizations, chances are you’ve chatted with these super-smart virtual assistants and chatbots. The use of Natural Language Processing (NLP) in the financial sector has been on the rise worldwide. More and more financial institutions are embracing advanced tech innovations, taking NLP beyond banking, insurance, and hedge funds (especially for sentiment analysis).

Artificial Intelligence and Machine Learning, alongside NLP, are making their mark in various areas of the financial sector, such as operations, risk assessment, sales, research and development, customer support, and many other fields. This expansion boosts efficiency, productivity, cost-effectiveness, and time and resource management.

Take, for instance, the convenience it brings: Instead of the hassle of logging into individual accounts to check your balance, users can now effortlessly access their account information through chatbots and voice assistants. These digital companions are everywhere, from chatbots to voice assistants like Amazon Alexa, Google Assistant, and Siri.

Sentiment Analysis, often hailed as the next game-changer in the finance sector, plays a central role in chatbots, voice assistants, text analysis, and NLP technology. It’s a key component of natural language processing used to decipher the sentiments behind data. Companies frequently employ sentiment analysis on various text sources such as customer reviews, social media conversations, support tickets, and more to uncover genuine customer sentiments and evaluate brand perception.

Sentiment analysis aids in recognizing the polarity of information (positive or negative), emotional cues (like anger, happiness, or sadness), and intent (e.g., interest or disinterest). It is crucial in brand reputation management by providing insights into overall customer attitudes, challenges, and needs. This allows for data categorization by different sentiments, resulting in more accurate predictions and informed strategic decisions.
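As a small, tool-agnostic illustration (the Hugging Face transformers pipeline here is an assumed choice, not necessarily what the banks mentioned above use), classifying the polarity of customer feedback can be as simple as:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a default sentiment model on first use
reviews = [
    "The new mobile banking app makes transfers effortless.",
    "I waited 40 minutes and my card issue is still not resolved.",
]
for review, result in zip(reviews, classifier(reviews)):
    print(result["label"], round(result["score"], 3), "-", review)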

So, how can BFSI make the most of sentiment analysis? This emerging field has firmly rooted itself in the financial industry. Banks and financial institutions can employ AI-driven sentiment analysis systems to understand customer opinions regarding their financial products and the overall brand perception.

Of course, this approach may necessitate a certain level of data proficiency that financial companies must acquire before launching full-fledged sentiment analysis projects. Sentiment analysis stands as a highly promising domain within NLP and is undoubtedly poised to play a substantial role in the future of financial services.

Here, we’ll delve into the seven most prominent applications of sentiment analysis in financial services.

  1. Portfolio Management and Optimization: NLP can help financial professionals analyze vast amounts of textual data from financial news and market trends to assess the sentiment surrounding specific investments. This sentiment analysis can aid in making informed decisions about portfolio management, identifying potential risks, and optimizing investment strategies.
  2. Financial Data Analytics: Sentiment analysis enables financial firms to gauge the market’s sentiment toward specific assets or companies by analyzing news articles, social media, and reports. This information can be used to assess the volatility of investments and make data-driven decisions.
  3. Predictive Analysis: NLP can be used to analyze historical data and predict the future performance of investment funds. This involves assessing sentiment and other textual data to identify high-risk investments and optimize growth potential, even in uncertain market conditions.
  4. Customer Services and Analysis: Financial institutions employ NLP-driven chatbots and virtual assistants to enhance customer service. These AI-driven tools use NLP to process and understand customer queries, improving customer experience and satisfaction.
  5. Gathering Customer Insights: By applying sentiment analysis and intelligent document search, financial firms can gain insights into customer preferences, challenges, and overall sentiments. This information is valuable for personalizing offers, measuring customer response, and refining products and services.
  6. Researching Customer Emotional Responses: AI-powered tools process vast amounts of customer data, such as social media posts, chatbot interactions, reviews, and survey responses, to determine customer sentiments. This allows companies to better understand customer attitudes toward their products, services, and brands and analyze responses to competitors’ campaigns.
  7. Credit Market Monitoring: Sentiment analysis tracks credit sentiments in the media. Financial institutions can use NLP to process information from news articles and press releases to monitor the sentiment related to specific bonds or organizations. This data can reveal correlations between media updates and credit securities’ market performance, streamlining financial research efforts.

Future of NLP – Sentiment Analysis: Where does it stand today and tomorrow?

NLP has made significant strides in the banking and financial sector, supporting various services. It enables real-time insights from call transcripts, data analysis with grammatical parsing, and contextual analysis at the paragraph level. NLP solutions extract and interpret data to provide in-depth insights into profitability, trends, and future business performance in the market.

Soon, we can anticipate NLP, alongside NLU and NLG, being extensively applied to sentiment analysis and coreference resolution, further enhancing its role in this domain.

Training computers to comprehend and process text and speech inputs is pivotal in elevating business intelligence. Driven by escalating demand, Natural Language Processing (NLP) has emerged as one of AI’s most rapidly advancing subsectors. Experts anticipate the global NLP market reaching a value of $239.9 billion by 2032 at a robust Compound Annual Growth Rate (CAGR) of 31.3%, per Allied Market Research.

NLP-based sentiment analysis is an innovative technique that enables financial companies to effectively process and structure extensive volumes of customer data, yielding maximum benefits for both banks and customers. This technology is positioned to empower traditional financial institutions and neo-banks alike, as it enhances current customer experiences, diminishes friction in financial services, and facilitates the creation of superior financial products.

In the finance and banking sectors, NLP is harnessed to streamline repetitive tasks, reduce errors, analyze sentiments, and forecast future performance by drawing insights from historical data. Such applications enable firms to realize time and cost savings, enhance productivity and efficiency, and uphold the delivery of quality services.

 

How the SDOH machine learning model improves patients’ health and your bottom line
https://www.indiumsoftware.com/blog/how-the-sdoh-machine-learning-model-improves-patients-health/ (Thu, 24 Aug 2023)

Preventive care management—Transcending traditional ways

The healthcare paradigm is shifting from a reactive approach to a proactive and holistic model. Preventive care is important for staying healthy and identifying problems early before they lead to other complications or become more difficult to treat. While early intervention has proven instrumental in advancing diagnostics and treatments, a critical element has been missing until now: the incorporation of social determinants of health (SDOH). Recognizing that health outcomes are intricately woven into the fabric of our lives, the integration of SDOH into preventive care emerges as a transformative solution.

Beyond genetics and clinical data, social determinants encompass factors like socioeconomic status, living conditions, education, and access to nutritious food. By embedding these key influencers into preventive care, healthcare providers gain an unprecedented understanding of their patients’ lives, empowering them to offer personalized and proactive interventions.

Discover the transformative potential of our Social Determinants of Health (SDOH) model and its ability to revolutionize patient care while driving significant cost savings for payers and providers.

Download White Paper

Social Determinants of Health: Impact on healthcare outcomes

The non-medical elements that affect health outcomes are referred to as social determinants of health (SDOH). Socioeconomic position, education, physical environment and neighborhood, job, and social support systems are a few of these variables. SDOH has a major effect on health and can impact healthcare outcomes in a number of ways.

For example, a patient with a lower socioeconomic status is more likely to have chronic diseases, such as diabetes and heart disease. By understanding this patient’s social determinants, a healthcare provider can recommend preventive care measures that are tailored to their needs, such as financial assistance for medication or enrolment in wellness programs.

Patient 360: A holistic view of patient data

Patient 360 is a comprehensive view of a patient’s health information, including their medical history, social determinants, and other relevant data. By integrating SDOH into patient 360, healthcare providers can gain a better understanding of the factors that are affecting their patients’ health and make more informed decisions about preventive care.

Here are some of the benefits of leveraging SDOH parameters in the patient 360 framework:

Better patient care: Integrating SDOH elements into the patient 360 approach helps improve treatment efficiency by empowering physicians to address the factors that influence healthcare outcomes. This can save time and resources, which can be used to provide better care for patients.

Enhanced patient engagement: Addressing SDOH factors helps enhance patient engagement by giving patients more awareness of their health data. This can lead to patients being more involved in their care management and being more likely to follow treatment plans.

Clinical notes to actionable insights: Physician notes record important patient medical histories, symptoms, demographics, and clinical data. These observations provide a holistic picture of the patient’s health. SDOH factors are important predictors of preventive care needs, which is why it is important to include them in patient records.

The integration of SDOH into patient 360 is a promising way to improve preventive care and achieve better health outcomes for all patients.

Manual SDOH data extraction: Typical challenges in the current system

Manually extracting social determinants of health (SDOH) elements poses numerous challenges that can hinder the efficiency and accuracy of the process. SDOH data is often embedded in unstructured sources such as physician notes, medical records, or social service assessments, making it laborious and time-consuming for healthcare professionals to extract relevant information. Here are some of the difficulties associated with manual data extraction for SDOH:

Unstructured data: SDOH elements are often scattered throughout free-text narratives that lack a standardized format.

Human error: Human analysts are susceptible to making errors during data extraction, leading to inaccuracies in the collected information.

Incomplete data capture: Due to the sheer volume of information, manually extracting SDOH elements from various sources may result in incomplete data capture.

Limited scalability: As healthcare organizations grow and data volumes increase, manual data extraction becomes less scalable and impractical.

Cracking the code of health: Indium’s SDOH machine learning model 

Indium’s expertise in developing the SDOH ML model is based on two pillars: NLP technology and a deep understanding of the healthcare landscape. With a team of experts in data science, engineering, and healthcare, Indium is at the forefront of using AI to transform preventive care.

Indium’s journey began with a recognition of the importance of social factors in determining health outcomes. The company’s ML model is designed to identify and address these factors, which can help improve the health of individuals and communities. Recognizing that manually extracting these factors from unstructured physician notes is labor-intensive and prone to errors, Indium sought to create an efficient and accurate solution. Leveraging Natural Language Processing (NLP) techniques, the team precisely crafted a robust ML model that swiftly identifies key social determinants hidden within vast amounts of textual data.

The success of Indium’s SDOH ML model lies in its ability to provide healthcare providers and payers with invaluable insights. By seamlessly integrating social determinants into preventive care, the model empowers stakeholders to offer personalized preventive interventions, optimize patient care, and drive cost savings within the healthcare ecosystem.

Uncover the unique insights and benefits our SDOH model offers, and witness how it can be seamlessly integrated into existing healthcare systems to optimize care delivery.

Download White Paper

SDOH ML model

ML techniques can be used to identify and extract SDOH from physician notes. These techniques can identify patterns in text, such as the presence of certain words or phrases that are associated with SDOH. For example, the phrase “food insecurity” might be mapped to the food-insecurity determinant. By using the SDOH ML model, healthcare providers can make the right interventions to help improve healthcare outcomes and reduce costs.
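As a simplified, hypothetical stand-in for such a model (Indium’s actual NLP pipeline is not public), a rule-based phrase matcher over a clinical note illustrates the idea; the SDOH terms and the note text below are illustrative only:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")                           # tokenizer-only pipeline is enough for phrase matching
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching

sdoh_terms = {
    "FOOD_INSECURITY": ["food insecurity", "skips meals"],
    "HOUSING": ["homeless", "unstable housing"],
    "TRANSPORT": ["no transportation"],
}
for label, phrases in sdoh_terms.items():
    matcher.add(label, [nlp.make_doc(p) for p in phrases])

note = "Patient reports food insecurity and unstable housing; lives alone."
doc = nlp(note)
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)   # extracted SDOH mentions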

Once SDOH have been identified and extracted from physician notes, they can be integrated into preventive care management. This information can be used to provide a more comprehensive understanding of the patient’s overall well-being and to develop a more personalized treatment plan.

The power of precision: Partner with Indium

As a leading healthcare service provider and a leader in the digital engineering space, Indium has developed the SDOH machine learning model. Understanding the profound influence that social factors have on health outcomes, and recognizing that this information is crucial to transformative advancements in patient care, Indium trained the SDOH model to accurately extract social factors from patient records. Beyond improving patient care, the integration of social determinants also serves as a strategic tool in reducing healthcare costs by proactively addressing health issues. Unlike the traditional method, our model is 90% accurate and can identify SDOH attributes from thousands of patient records in a matter of seconds.

Want to learn in detail about how our SDOH model empowers payers and providers to transform patient care and drive significant cost savings?

Download White Paper

Maximizing AI and ML Performance: A Guide to Effective Data Collection, Storage, and Analysis
https://www.indiumsoftware.com/blog/maximizing-ai-and-ml-performance-a-guide-to-effective-data-collection-storage-and-analysis/ (Fri, 12 May 2023)

Data is often referred to as the new oil of the 21st century because it is a valuable resource that powers the digital economy in much the same way that oil fueled the industrial economy of the 20th century. Like oil, data is a raw material that must be collected, refined, and analyzed to extract its value. Companies are collecting vast amounts of data from various sources, such as social media, internet searches, and connected devices. This data can then be used to gain insights into customer behavior, market trends, and operational efficiencies.

In addition, data is increasingly being used to power artificial intelligence (AI) and machine learning (ML) systems, which are driving innovation and transforming businesses across various industries. AI and ML systems require large amounts of high-quality data to train models, make predictions, and automate processes. As such, companies are investing heavily in data infrastructure and analytics capabilities to harness the power of data.

Data is also a highly valuable resource because it is not finite, meaning that it can be generated, shared, and reused without diminishing its value. This creates a virtuous cycle where the more data that is generated and analyzed, the more insights can be gained, leading to better decision-making, increased innovation, and new opportunities for growth. Thus, data has become a critical asset for businesses and governments alike, driving economic growth and shaping the digital landscape of the 21st century.

There are various data storage methods in data science, each with its own strengths and weaknesses. Some of the most common data storage methods include:

  • Relational databases: Relational databases are the most common method of storing structured data. They are based on the relational model, which organizes data into tables with rows and columns. Relational databases use SQL (Structured Query Language) for data retrieval and manipulation and are widely used in businesses and organizations of all sizes.
  • NoSQL databases: NoSQL databases are a family of databases that do not use the traditional relational model. Instead, they use other data models such as document, key-value, or graph-based models. NoSQL databases are ideal for storing unstructured or semi-structured data and are used in big data applications where scalability and flexibility are key.
  • Data warehouses: Data warehouses are specialized databases that are designed to support business intelligence and analytics applications. They are optimized for querying and analyzing large volumes of data and typically store data from multiple sources in a structured format.
  • Data lakes: Data lakes are a newer type of data storage method that is designed to store large volumes of raw, unstructured data. Data lakes can store a wide range of data types, from structured data to unstructured data such as text, images, and videos. They are often used in big data and machine learning applications.
  • Cloud-based storage: Cloud-based storage solutions, such as Amazon S3, Microsoft Azure, or Google Cloud Storage, offer scalable, secure, and cost-effective options for storing data. They are especially useful for businesses that need to store and access large volumes of data or have distributed teams that need access to the data.

To learn more about: How AI and ML models are assisting the retail sector in reimagining the consumer experience.

Data collection is an essential component of data science and there are various techniques used to collect data. Some of the most common data collection techniques include:

  • Surveys: Surveys involve collecting information from a sample of individuals through questionnaires or interviews. Surveys are useful for collecting large amounts of data quickly and can provide valuable insights into customer preferences, behavior, and opinions.
  • Experiments: Experiments involve manipulating one or more variables to measure the impact on the outcome. Experiments are useful for testing hypotheses and determining causality.
  • Observations: Observations involve collecting data by watching and recording behaviors, actions, or events. Observations can be useful for studying natural behavior in real-world settings.
  • Interviews: Interviews involve collecting data through one-on-one conversations with individuals. Interviews can provide in-depth insights into attitudes, beliefs, and motivations.
  • Focus groups: Focus groups involve collecting data from a group of individuals who participate in a discussion led by a moderator. Focus groups can provide valuable insights into customer preferences and opinions.
  • Social media monitoring: Social media monitoring involves collecting data from social media platforms such as Twitter, Facebook, or LinkedIn. Social media monitoring can provide insights into customer sentiment and preferences.
  • Web scraping: Web scraping involves collecting data from websites by extracting information from HTML pages. Web scraping can be useful for collecting large amounts of data quickly.
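For instance, a small scraping sketch with requests and BeautifulSoup; the URL and the tags collected are placeholders, and real scraping should respect robots.txt and the site’s terms of use:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products", timeout=10)   # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]       # collect heading text as an example
print(titles[:10])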

Data analysis is an essential part of data science and there are various techniques used to analyze data. Some of the top data analysis techniques in data science include:

  • Descriptive statistics: Descriptive statistics involve summarizing and describing data using measures such as mean, median, mode, variance, and standard deviation. Descriptive statistics provide a basic understanding of the data and can help identify patterns or trends.
  • Inferential statistics: Inferential statistics involve making inferences about a population based on a sample of data. Inferential statistics can be used to test hypotheses, estimate parameters, and make predictions.
  • Data visualization: Making charts, graphs, and other visual representations of data to better understand patterns and relationships is known as data visualization. Data visualization is helpful for expressing complex information and spotting trends or patterns that might not be immediately apparent from the data.
  • Machine learning: Machine learning involves using algorithms to learn patterns in data and make predictions or decisions based on those patterns. Machine learning is useful for applications such as image recognition, natural language processing, and recommendation systems.
  • Text analytics: Text analytics involves analyzing unstructured data such as text to identify patterns, sentiment, and topics. Text analytics is useful for applications such as customer feedback analysis, social media monitoring, and content analysis.
  • Time series analysis: Time series analysis involves analyzing data over time to identify trends, seasonality, and cycles. Time series analysis is useful for applications such as forecasting, trend analysis, and anomaly detection.
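A quick pandas illustration of descriptive statistics and a simple time-series view; the monthly sales figures are made up for the example:

import pandas as pd

sales = pd.Series(
    [200, 220, 215, 250, 260, 255, 300, 310, 305, 330, 340, 360],
    index=pd.date_range("2024-01-01", periods=12, freq="MS"),
)
print(sales.describe())                 # mean, std, quartiles, etc.
print(sales.rolling(window=3).mean())   # 3-month moving average to expose the trend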

Use Cases

To illustrate the importance of data in AI and ML, let’s consider a few use cases:

  • Predictive Maintenance: In manufacturing, AI and ML can be used to predict when machines are likely to fail, enabling organizations to perform maintenance before a breakdown occurs. To achieve this, the algorithms require vast amounts of data from sensors and other sources to learn patterns that indicate when maintenance is necessary.
  • Fraud Detection: AI and ML can also be used to detect fraud in financial transactions. This requires large amounts of data on past transactions to train algorithms to identify patterns that indicate fraudulent behavior.
  • Personalization: In e-commerce, AI and ML can be used to personalize recommendations and marketing messages to individual customers. This requires data on past purchases, browsing history, and other customer behaviors to train algorithms to make accurate predictions.

Real-Time Analysis

To achieve optimal results in AI and ML applications, data must be analyzed in real-time. This means that organizations must have the infrastructure and tools necessary to process large volumes of data quickly and accurately. Real-time analysis also requires the ability to detect and respond to anomalies or unexpected events, which can impact the accuracy of the algorithms.

Wrapping Up

In conclusion, data is an essential component of artificial intelligence (AI) and machine learning (ML) applications. Collecting, storing, and analyzing data effectively is crucial to maximizing the performance of AI and ML systems and obtaining optimal results. Data visualization, machine learning, time series analysis, and other data analysis techniques can be used to gain valuable insights from data and make data-driven decisions.

No matter where you are in your transformation journey, contact us and our specialists will help you make technology work for your organization.

Click here

 

Training Custom Machine Learning Model on Vertex AI with TensorFlow
https://www.indiumsoftware.com/blog/training-custom-machine-learning-model-on-vertex-ai-with-tensorflow/ (Fri, 03 Feb 2023)

“Vertex AI is Google’s platform that provides many machine learning services, such as training models using AutoML or custom training.”

AutoML vs Custom Training

To quickly compare AutoML and custom training functionality, and expertise required, check out the following table given by Google.

Choose a training method | Vertex AI | Google Cloud

In this article we are going to train the Custom Machine Learning Model on Vertex AI with TensorFlow.

To know about Vertex AI’s AutoML feature, read my previous blog: Machine Learning using Google’s Vertex AI.

About Dataset

We will be using the Crab Age Prediction dataset from Kaggle. The dataset is used to estimate the age of a crab based on its physical attributes.

To learn more about how our AI and machine learning capabilities can assist you.

Click here

There are 9 columns in the Dataset as follows.

  1. Sex: Crab gender (Male, Female and Indeterminate)
  2. Length: Crab length (in Feet; 1 foot = 30.48 cms)
  3. Diameter: Crab Diameter (in Feet; 1 foot = 30.48 cms)
  4. Height: Crab Height (in Feet; 1 foot = 30.48 cms)
  5. Weight: Crab Weight (in ounces; 1 Pound = 16 ounces)
  6. Shucked Weight: Without Shell Weight (in ounces; 1 Pound = 16 ounces)
  7. Viscera Weight: Internal organ weight (in ounces; 1 Pound = 16 ounces)
  8. Shell Weight: Shell Weight (in ounces; 1 Pound = 16 ounces)
  9. Age: Crab Age (in months)

We must predict the Age column with the help of the rest of the columns.

Let’s Start

Custom Model Training

Step 1: Getting Data

We will download the dataset from Kaggle. There is only one csv file in the downloaded dataset, called CrabAgePrediction.csv; I have uploaded this csv to a bucket called vertex-ai-custom-ml on Google Cloud Storage.

Step 2: Working on Workbench

Go to Vertex AI, then to the Workbench section, and enable the Notebook API. Then click on New Notebook and select TensorFlow Enterprise; we are using TensorFlow Enterprise 2.6 without GPU for this project. Make sure to select the us-central1 (Iowa) region.

It will take a few minutes to create the Notebook instance. Once the notebook is created click on the Open JupyterLab to launch the JupyterLab.

In JupyterLab, open the Terminal and run the following commands one by one.

mkdir crab_folder     # This will create crab_folder                       

cd crab_folder        # To enter the folder

mkdir trainer         # This will create trainer folder

touch Dockerfile      # This will create a Dockerfile

We can see all the files and folders on the left side of JupyterLab; from there, open the Dockerfile and start editing it with the following lines.

FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-6
WORKDIR /
COPY trainer /trainer
ENTRYPOINT ["python", "-m", "trainer.train"]

Now save the Dockerfile; with this, we have defined the entrypoint for the container.

To save the model’s output, we’ll make a bucket called crab-age-pred-bucket.

For the model training file, I have already uploaded the Python file to a GitHub repository. To clone this repository, click on Git at the top of JupyterLab, select Clone a Repository, paste the repository link, and hit Clone.

In the Lab, we can see the crab-age-pred folder; copy the train.py file from this folder to crab_folder/trainer/.

Let’s look at the train.py file before we create the Docker IMAGE.

# Importing the required packages
import numpy as np
import pandas as pd
import pathlib
import tensorflow as tf          # TensorFlow 2.6
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

# Reading data from the GCS bucket
dataset = pd.read_csv(r"gs://vertex-ai-custom-ml/CrabAgePrediction.csv")
dataset.tail()

# Bucket where the trained model will be stored
BUCKET = 'gs://crab-age-pred-bucket'

dataset.isna().sum()
dataset = dataset.dropna()

# Data transformation: one-hot encode the categorical column
dataset = pd.get_dummies(dataset, prefix='', prefix_sep='')
dataset.tail()

# Dataset splitting
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

train_stats = train_dataset.describe()
# Removing the Age column, since it is the target column
train_stats.pop("Age")
train_stats = train_stats.transpose()
train_stats

# Removing the Age column from train and test data
train_labels = train_dataset.pop('Age')
test_labels = test_dataset.pop('Age')

def norma_data(x):
    # To normalise the numerical values
    return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norma_data(train_dataset)
normed_test_data = norma_data(test_dataset)

def build_model():
    # Model building function
    model = keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=[len(train_dataset.keys())]),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae', 'mse'])
    return model

model = build_model()
# model.summary()

EPOCHS = 10
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
early_history = model.fit(normed_train_data, train_labels,
                          epochs=EPOCHS, validation_split=0.2,
                          callbacks=[early_stop])

model.save(BUCKET + '/model')

Summary of train.py

Once all the necessary packages have been imported, TensorFlow 2.6 is used for modelling. Pandas reads the stored csv file from the vertex-ai-custom-ml bucket, and the BUCKET variable specifies the bucket where we will store the trained model.

We are doing some transformations, such as creating dummy variables for the categorical column. Next, we split the data into training and testing sets and normalize it.

We wrote a function called build_model that defines a simple two-layer TensorFlow model. The model is trained for ten epochs, and once training completes it is saved to the crab-age-pred-bucket/model path in Cloud Storage.

Now, in the JupyterLab Terminal, execute the following cmd one by one to create a Docker IMAGE.

PROJECT_ID=crab-age-pred
IMAGE_URI="gcr.io/$PROJECT_ID/crab:v1"
docker build ./ -t $IMAGE_URI

Before running the build command, make sure to enable the Artifact Registry API and the Google Container Registry API by going to APIs & Services in the Google Cloud console.

After running the command, our Docker image is built successfully. Now we will push the Docker image with the following command.

docker push $IMAGE_URI

Once pushed, we can see our Docker image in the Container Registry. To find the Container Registry, you can search for it from the console.

Best Read: Our success story about how we assisted an oil and gas company, as well as Nested Tables and Machine Drawing Text Extraction

Step 3: Model Training

Go to Vertex AI, then to Training section and click Create. Make sure the region is us-central1.

In Datasets select no managed dataset and click continue.

In Model details, I have given the model’s name as “pred-age-crab”, and under the advanced options, select the available service account. For the rest, keep the defaults. Make sure that the service account has Cloud Storage permissions; if not, grant them from the IAM and Admin section.

Select the custom container for the Container image in the Training container. Navigate to and select the newly created Docker image. Next, navigate to and select the crab-age-pred-bucket in the Model output directory. Now press the continue button.

Ignore any selections for Hyperparameters and click Continue.

In Compute and pricing, Select the machine type n1-standard-32, 32 vCPUs, 120 GiB memory and hit continue.

For Prediction Container select Pre-Built container with TensorFlow Framework 2.6 and start the model training.

You can see the model in training in the Training section.

In about 8 minutes, our custom model training is finished.

Step 4: Model Deployment

Go to Vertex AI, then to the Endpoints section and click Create Endpoint. The region should be us-central1.

Give crab_age_pred as the name of Endpoint and click Continue.

In the Model Settings, select pred-age-crab as the Model Name, Version 1 as the Version, and 2 as the number of compute nodes; choose n1-standard-8, 8 vCPUs, 30 GiB memory as the Machine Type, and select the service account. Click Done and Create.

In Model monitoring, ignore this selection and click Create to deploy the version.

It may take 11 minutes to deploy the model.

With the above step our model is deployed.

Step 5: Testing Model

Once the model is deployed, we can make predictions. For this project, we are going to use Python to make predictions. We will need to give the Vertex AI Admin and Cloud Storage Admin permissions to the service account. We can do that in the IAM and Admin section of Google Cloud. Once the permissions are given, we will download the service account key in JSON format; it will be used to authenticate from our code.

Following is the code used for the prediction.

# Install the client library first:
# pip install google-cloud-aiplatform

from typing import Dict
from google.cloud import aiplatform
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value
import os

def predict_tabular_sample(
    project: str,
    endpoint_id: str,
    instance_dict: Dict,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com"):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize the client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
    # For more info on the instance schema, please use get_model_sample.py
    # and look at the yaml found in instance_schema_uri
    instance = json_format.ParseDict(instance_dict, Value())
    instances = [instance]
    parameters_dict = {}
    parameters = json_format.ParseDict(parameters_dict, Value())
    endpoint = client.endpoint_path(
        project=project, location=location, endpoint=endpoint_id
    )
    response = client.predict(
        endpoint=endpoint, instances=instances, parameters=parameters
    )
    predictions = response.predictions
    print(predictions)

# Authentication using the service account.
# We are giving the path to the JSON key.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/content/crab-age-pred-7c1b7d9be185.json"

# Normalized input values
inputs = [0, 0, 1, 1.4375, 1.175, 0.4125, 0.63571550, 0.3220325, 1.5848515, 0.747181]

project_id = "crab-age-pred"            # Project ID from Vertex AI
endpoint_id = "7762332189773004800"     # Endpoint ID from the Endpoints section

predict_tabular_sample(project_id, endpoint_id, inputs)

Output

[[8.01214314]]

This is how we can make predictions. For the inputs, make sure to apply the same transformations and normalization that were applied to the training data.
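If, for example, the training script normalized features with a scikit-learn StandardScaler (an assumption made purely for illustration; the actual preprocessing lives in the training code), the fitted scaler should be saved alongside the model and re-applied to raw inputs before calling the prediction function:

# Illustrative sketch only: assumes the training code fitted a StandardScaler
# and persisted it with joblib. The real preprocessing in this project may differ.
from joblib import load

scaler = load("scaler.joblib")          # hypothetical artifact saved during training
raw_features = [[1, 1.4375, 1.175, 0.4125, 24.6, 12.3, 5.6, 6.7]]   # example raw row (made-up values)
normalized = scaler.transform(raw_features)[0].tolist()              # same scaling as the training data
# predict_tabular_sample(project_id, endpoint_id, normalized)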

With this we have completed the project and learned how to train, deploy, and get predictions from a custom-trained ML model.

I hope you will find it useful.

See you again.

The post Training Custom Machine Learning Model on Vertex AI with TensorFlow appeared first on Indium.

]]>
Kubeflow Pipeline on Vertex AI for Custom ML Models https://www.indiumsoftware.com/blog/kubeflow-pipeline-on-vertex-ai-for-custom-ml-models/ Thu, 02 Feb 2023 11:56:32 +0000 https://www.indiumsoftware.com/?p=14381 What is Kubeflow? “Kubeflow is an open-source project created to help deployment of ML pipelines. It uses components as python functions for each step of pipeline. Each component runs on the isolated container with all the required libraries. It runs all the components in the series one by one.” In this article we are going

The post Kubeflow Pipeline on Vertex AI for Custom ML Models appeared first on Indium.

]]>
What is Kubeflow?

“Kubeflow is an open-source project created to help with the deployment of ML pipelines. It uses components, written as Python functions, for each step of the pipeline. Each component runs in an isolated container with all the required libraries. The components are run in series, one by one.”

In this article we are going to train a custom machine learning model on Vertex AI using Kubeflow Pipeline.

About Dataset

Credit Card Customers dataset from Kaggle will be used. The 10,000 customer records in this dataset include columns for age, salary, marital status, credit card limit, credit card category, and other information. In order to predict the customers who are most likely to leave, we must analyse the data to determine the causes of customer churn.

Interesting Read: In the world of hacking, we’ve reached the point where we’re wondering who is a better hacker: humans or machines.

Let’s Start

Custom Model Training

Step 1: Getting Data

We will download the dataset from GitHub. The downloaded dataset contains two CSV files, churner_p1 and churner_p2. I have created a BigQuery dataset called credit_card_churn with the tables churner_p1 and churner_p2 loaded from these CSV files. I have also created a bucket called credit-card-churn on Cloud Storage; this bucket will be used to store the artifacts of the pipeline.
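If you would rather create the two BigQuery tables programmatically instead of through the console, a minimal sketch with the BigQuery Python client could look like the following. The object paths in the bucket and the assumption that the CSVs carry a header row are mine, not taken from the original setup.

# Sketch: load the two churner CSV files from Cloud Storage into BigQuery tables.
# Assumes the CSVs were first uploaded to the credit-card-churn bucket and the
# credit_card_churn dataset already exists.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # infer the schema from the CSV
)

for name in ["churner_p1", "churner_p2"]:
    uri = f"gs://credit-card-churn/{name}.csv"                  # assumed object path
    table_id = f"{client.project}.credit_card_churn.{name}"
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for the load to complete
    print(f"Loaded {table_id}")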

Step 2: Employing Workbench

Enable the Notebook API by going to Vertex AI and then to the Workbench section. Then select Python 3 by clicking on New Notebook. Make sure to choose the us-central1 region.

It will take a few minutes to create the notebook instance. Once the notebook is created, click Open JupyterLab to launch JupyterLab.

We will also have to enable the following APIs from API and services section of Vertex AI.

  1. Artifact Registry API
  2. Container Registry API
  3. AI Platform API
  4. ML API
  5. Cloud Functions API
  6. Cloud Build API

Now click Python 3 in the JupyterLab Notebook section to open a Jupyter notebook, and run the code cells below.

USER_FLAG = "--user"

!pip3 install {USER_FLAG} google-cloud-aiplatform==1.7.0
!pip3 install {USER_FLAG} kfp==1.8.9

This installs the Google Cloud AI Platform and Kubeflow Pipelines packages. Make sure to restart the kernel after the packages are installed.

import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

This creates the variable PROJECT_ID with the name of the project.

BUCKET_NAME = "gs://" + PROJECT_ID
BUCKET_NAME

This creates the variable BUCKET_NAME; it resolves to the same bucket name we created earlier.

import matplotlib.pyplot as plt
import pandas as pd
from kfp.v2 import compiler, dsl
from kfp.v2.dsl import pipeline, component, Artifact, Dataset, Input, Metrics, Model, Output, InputPath, OutputPath
from google.cloud import aiplatform

# We'll use this namespace for metadata querying
from google.cloud import aiplatform_v1

PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

REGION = "us-central1"

PIPELINE_ROOT = f"{BUCKET_NAME}/pipeline_root/"
PIPELINE_ROOT

This imports the required packages and sets the pipeline root folder inside the credit-card-churn bucket.

# First component in the pipeline to fetch data from BigQuery.
# Table1 data is fetched
@component(
    packages_to_install=["google-cloud-bigquery==2.34.2", "pandas", "pyarrow"],
    base_image="python:3.9",
    output_component_file="dataset_creating_1.yaml"
)
def get_data_1(
    bq_table: str,
    output_data_path: OutputPath("Dataset")
):
    from google.cloud import bigquery
    import pandas as pd

    bqclient = bigquery.Client()
    table = bigquery.TableReference.from_string(
        bq_table
    )
    rows = bqclient.list_rows(
        table
    )
    dataframe = rows.to_dataframe(
        create_bqstorage_client=True,
    )
    dataframe.to_csv(output_data_path)

The first component of the pipeline fetches the data from the churner_p1 table in BigQuery and passes the CSV file as the output for the next component. The structure is the same for every component: we use the @component decorator to install the required packages and specify the base image and output component file, and then define the get_data_1 function that gets the data from BigQuery.

# Second component in the pipeline to fetch data from BigQuery.
# Table2 data is fetched
# First component and second component don't need inputs from any components
@component(
    packages_to_install=["google-cloud-bigquery==2.34.2", "pandas", "pyarrow"],
    base_image="python:3.9",
    output_component_file="dataset_creating_2.yaml"
)
def get_data_2(
    bq_table: str,
    output_data_path: OutputPath("Dataset")
):
    from google.cloud import bigquery
    import pandas as pd

    bqclient = bigquery.Client()
    table = bigquery.TableReference.from_string(
        bq_table
    )
    rows = bqclient.list_rows(
        table
    )
    dataframe = rows.to_dataframe(
        create_bqstorage_client=True,
    )
    dataframe.to_csv(output_data_path)

The second component of the pipeline fetches the data from the churner_p2 table in BigQuery and passes the CSV file as the output for the next component. The first and second components do not need inputs from any other components.

# Third component in the pipeline to combine data from 2 sources and for some data transformation
@component(
    packages_to_install=["sklearn", "pandas", "joblib"],
    base_image="python:3.9",
    output_component_file="model_training.yaml",
)
def data_transformation(
    dataset1: Input[Dataset],
    dataset2: Input[Dataset],
    output_data_path: OutputPath("Dataset"),
):
    from sklearn.metrics import roc_curve
    from sklearn.model_selection import train_test_split
    from joblib import dump
    from sklearn.metrics import confusion_matrix
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd

    data1 = pd.read_csv(dataset1.path)
    data2 = pd.read_csv(dataset2.path)
    data = pd.merge(data1, data2, on='CLIENTNUM', how='outer')
    data.drop(["CLIENTNUM"], axis=1, inplace=True)
    data = data.dropna()
    cols_categorical = ['Gender', 'Dependent_count', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']
    data['Attrition_Flag'] = [1 if cust == "Existing Customer" else 0 for cust in data['Attrition_Flag']]
    data_encoded = pd.get_dummies(data, columns=cols_categorical)
    data_encoded.to_csv(output_data_path)

The third component is where we combine the data from the first and second components and perform data transformations such as dropping the "CLIENTNUM" column, dropping null values, and converting the categorical columns into numerical ones. We pass this transformed data as a CSV to the next component.

# Fourth component in the pipeline to train the classification model using Decision Trees or Random Forest
@component(
    packages_to_install=["sklearn", "pandas", "joblib"],
    base_image="python:3.9",
    output_component_file="model_training.yaml",
)
def training_classmod(
    data1: Input[Dataset],
    metrics: Output[Metrics],
    model: Output[Model]
):
    from sklearn.metrics import roc_curve
    from sklearn.model_selection import train_test_split
    from joblib import dump
    from sklearn.metrics import confusion_matrix
    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd

    data_encoded = pd.read_csv(data1.path)
    X = data_encoded.drop(columns=['Attrition_Flag'])
    y = data_encoded['Attrition_Flag']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100, stratify=y)
    model_classifier = RandomForestClassifier()
    model_classifier.fit(X_train, y_train)
    y_pred = model_classifier.predict(X_test)
    score = model_classifier.score(X_test, y_test)
    print('accuracy is:', score)
    metrics.log_metric("accuracy", (score * 100.0))
    metrics.log_metric("model", "RandomForest")
    dump(model_classifier, model.path + ".joblib")

In the fourth component we train the model with a Random Forest classifier and use "accuracy" as the evaluation metric.

@component(
    packages_to_install=["google-cloud-aiplatform"],
    base_image="python:3.9",
    output_component_file="model_deployment.yaml",
)
def model_deployment(
    model: Input[Model],
    project: str,
    region: str,
    vertex_endpoint: Output[Artifact],
    vertex_model: Output[Model]
):
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=region)
    deployed_model = aiplatform.Model.upload(
        display_name="custom-model-pipeline",
        artifact_uri=model.uri.replace("model", ""),
        serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest"
    )
    endpoint = deployed_model.deploy(machine_type="n1-standard-4")
    # Save data to the output params
    vertex_endpoint.uri = endpoint.resource_name
    vertex_model.uri = deployed_model.resource_name

The fifth component is the last one; in it we create the endpoint on Vertex AI and deploy the model. The component runs on a python:3.9 base image, and the model is served from a pre-built scikit-learn prediction container on an "n1-standard-4" machine.

@pipeline(
    # Default pipeline root. You can override it when submitting the pipeline.
    pipeline_root=PIPELINE_ROOT,
    # A name for the pipeline.
    name="custom-pipeline",
)
def pipeline(
    bq_table_1: str = "",
    bq_table_2: str = "",
    output_data_path: str = "data.csv",
    project: str = PROJECT_ID,
    region: str = REGION
):
    dataset_task_1 = get_data_1(bq_table_1)
    dataset_task_2 = get_data_2(bq_table_2)
    data_transform = data_transformation(dataset_task_1.output, dataset_task_2.output)
    model_task = training_classmod(data_transform.output)
    deploy_task = model_deployment(model=model_task.outputs["model"], project=project, region=region)

Finally, we have the pipeline function, which calls all the components in sequence: dataset_task_1 and dataset_task_2 get the data from BigQuery, data_transform transforms the data, model_task trains the Random Forest model, and deploy_task deploys the model on Vertex AI.

compiler.Compiler().compile(pipeline_func=pipeline, package_path="custom-pipeline-classifier.json")

Compiling the pipeline.

run1 = aiplatform.PipelineJob(
    display_name="custom-training-vertex-ai-pipeline",
    template_path="custom-pipeline-classifier.json",
    job_id="custom-pipeline-rf8",
    parameter_values={"bq_table_1": "credit-card-churn.credit_card_churn.churner_p1", "bq_table_2": "credit-card-churn.credit_card_churn.churner_p2"},
    enable_caching=False,
)

Creating the pipeline job.

run1.submit()

Running the pipeline job.

With this we have completed creating the Kubeflow pipeline and we can see it on the Pipelines section of Vertex AI.

 

Our Pipeline has run successfully and we have managed to get 100% accuracy for the classification.

We can use this model to get online predictions using the REST API or Python. We can also create different pipelines and compare their metrics on Vertex AI.
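As a rough sketch of an online prediction call with the high-level Vertex AI SDK: the endpoint ID is a placeholder, and reading a local copy of the encoded data is an assumption made only to keep the feature order identical to the one used during training.

# Sketch: online prediction against the endpoint created by the deployment component.
# The endpoint ID below is a placeholder; copy the real one from the Endpoints section.
from google.cloud import aiplatform
import pandas as pd

aiplatform.init(project=PROJECT_ID, location=REGION)
endpoint = aiplatform.Endpoint(endpoint_name="YOUR_ENDPOINT_ID")

# Re-use one encoded row from the transformed dataset so the feature order
# matches what the model was trained on (assumed local copy of the transformed data).
data_encoded = pd.read_csv("data_encoded.csv")
instance = data_encoded.drop(columns=["Attrition_Flag"]).iloc[0].tolist()

prediction = endpoint.predict(instances=[instance])
print(prediction.predictions)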

With this we have completed the project and learned how to create a pipeline on Vertex AI for custom-trained models.

I hope you will find it useful.

To learn more about our AI & ML Solutions and Capabilities

Contact Us

See you again.

The post Kubeflow Pipeline on Vertex AI for Custom ML Models appeared first on Indium.

]]>
Machine Learning using Google’s Vertex AI https://www.indiumsoftware.com/blog/machine-learning-using-googles-vertex-ai/ Thu, 02 Feb 2023 10:38:31 +0000 https://www.indiumsoftware.com/?p=14347 Image by Google What is Vertex AI? “Vertex AI is Google’s platform which provides many Machine learning services such as training models using AutoML or Custom Training.” Image by Google Features of Vertex AI We use Vertex AI to perform the following tasks in the ML workflow To know the workflow of Vertex AI we

The post Machine Learning using Google’s Vertex AI appeared first on Indium.

]]>
Image by Google

What is Vertex AI?

“Vertex AI is Google’s platform which provides many Machine learning services such as training models using AutoML or Custom Training.”

Image by Google

Features of Vertex AI

We use Vertex AI to perform the following tasks in the ML workflow

  • Creation of dataset and Uploading data
  • Training ML model
  • Evaluate model accuracy
  • Hyperparameters tuning (custom training only)
  • Storing model in Vertex AI.
  • Deploying trained model to endpoint for predictions.
  • Send prediction requests to endpoint.
  • Managing models and endpoints.

To understand the Vertex AI workflow, we will train a "Dogs vs Cats" classification model using Vertex AI's AutoML feature.

Step 1: Creating Dataset

We will download the dataset from Kaggle. In the downloaded zip file there are two zip files train.zip and test.zip. Train.zip contains the labelled images for training.

There are about 25,000 images in the train.zip file and 12,500 in the test.zip file. For this project we will only use 200 cat and 200 dog images to train. We will use the test set to evaluate the performance of our model.

After extracting the data, I uploaded the images to a Google Cloud Storage bucket called dogs_cats_bucket1, which I created in the us-central1 region. The images are stored in two folders in the bucket, train and test.

Best Read: Top 10 AI Challenges

Now we need to create a CSV file with the image paths and labels; for that, I have written the following lines of code.

from google.cloud import storage
import pandas as pd
import os

# Authentication using service account.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/content/dogs-vs-cats-354105-19b7b157b2b8.json"

BUCKET = 'dogs_cats_bucket1'
DELIMITER = '/'
TRAIN_PREFIX = 'train/'
TRAIN_BASE_PATH = f'gs://{BUCKET}/{TRAIN_PREFIX}'

print("Starting the import file generation process")
print("Process Details")
print(f"BUCKET : {BUCKET}")

storage_client = storage.Client()
data = []

print("Fetching list of Train objects")
train_blobs = storage_client.list_blobs(BUCKET, prefix=TRAIN_PREFIX, delimiter=DELIMITER)

for blob in train_blobs:
    label = "cat" if "cat" in blob.name else "dog"
    full_path = f"gs://{BUCKET}/{blob.name}"
    data.append({
        'GCS_FILE_PATH': full_path,
        'LABEL': label
    })

df = pd.DataFrame(data)
df.to_csv('train.csv', index=False, header=False)

After running the script in a Jupyter notebook, we have the required CSV file; we will upload it to the same storage bucket as well.

Now in the Vertex AI section go to Datasets and enable the Vertex AI API.

Click Create Dataset and name it. I have named it cat_dog_classification. We will select Image Classification (Single-label). Make sure the region is us-central1. Hit Create.

In the next section, mark Select import files from Cloud Storage, select train.csv via Browse, and hit Continue.

 

Vertex AI took 16 minutes to import the data. Now we can see the data in the Browse and Analyse tabs.

 

Now we can train the model.

Step 2: Model Training

Go to Vertex AI, then to Training section and click Create. Make sure the region is us-central1.

In Dataset, select cat_dog_classification and keep the defaults for everything else, with the Model Training Method as AutoML.

Click continue for the Model Details and Explainability with the default settings.

For Compute and Pricing give 8 maximum node hours.

Hit Start Training.

 

The model training is completed after 29 mins.

Step 3: Model Evaluation

Clicking on the trained model takes us to the model stats page, where we have stats like the precision-recall curve, precision-recall by threshold, and the confusion matrix.

With the above stats the model looks good.

Step 4: Model Deployment

Go to Vertex AI, then to the Endpoints section and click Create Endpoint. Make sure the region is us-central1.

Give dogs_cats as the name of Endpoint and click Continue.

In the Model Settings, select cat_dog_classification as Model Name, Version 1 as Version, and 2 as the number of compute nodes.

Click Done and Create.

It takes about 10 minutes to deploy the model.

With this our model is deployed.

Step 5: Testing Model

Once the model is deployed, we can test the model by uploading the test image or creating Batch Prediction.

To Test the Model, we go to the Deploy and Test section on the Model page.

Click on Upload Image to upload a test image.

With this we can see our model is working well on test images.

We can also connect to the Endpoint using Python and get the results.
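As a hedged sketch (not the exact code used for this post), an online prediction against the AutoML image classification endpoint can be sent from Python roughly like this; the project ID, endpoint ID, and image path are placeholders to replace with your own values:

# Sketch: send one local test image to the deployed dogs_cats endpoint.
# Replace the project ID, endpoint ID, and image path with your own values.
import base64
from google.cloud import aiplatform

aiplatform.init(project="YOUR_PROJECT_ID", location="us-central1")
endpoint = aiplatform.Endpoint(endpoint_name="YOUR_ENDPOINT_ID")

with open("test/12.jpg", "rb") as f:                       # any image from the test folder
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

prediction = endpoint.predict(
    instances=[{"content": encoded_image}],                # instance format used by AutoML image models
    parameters={"confidenceThreshold": 0.5, "maxPredictions": 2},
)
print(prediction.predictions)                              # class display names with confidence scores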

For more details on our AI and ML services

Visit this link

This is the end of my blog. We have learned how to train an image classification model on Google's Vertex AI using the AutoML feature. I have enjoyed every minute while working on it.

In the next article, we will see how to train a custom model on Vertex AI with TensorFlow.

Stay Tuned.

The post Machine Learning using Google’s Vertex AI appeared first on Indium.

]]>
Things to Keep in Mind while Testing Machine Learning Applications https://www.indiumsoftware.com/blog/things-to-keep-in-mind-while-testing-ml-applications/ Fri, 10 Jun 2022 12:40:27 +0000 https://www.indiumsoftware.com/?p=10069 Testing Applications with Machine Learning Software development lifecycles go through a set of common steps that range from the process of ideation to the final deployment of the product or service. The processes that are undertaken in the software development lifecycle help systems automatically regulate and enhance user experiences without manually programming them. It is

The post Things to Keep in Mind while Testing Machine Learning Applications appeared first on Indium.

]]>
Testing Applications with Machine Learning

Software development lifecycles go through a set of common steps, ranging from ideation to the final deployment of the product or service. In machine learning projects, these lifecycle processes help systems automatically adapt to and enhance user experiences without being manually programmed for each case. It is important that quality assurance (QA) processes are carried out so that the program behaves as expected and the product's features work as intended.

Software testing helps pinpoint defects and flaws throughout the development process. In machine learning application testing, the programmer usually supplies the data, and the logic is then computed by the machine.

It might be interesting to read about 4 Common Machine Learning Mistakes And How To Fix Them!

Let’s see what the different kinds of machine learning are, and how to further polish machine learning testing processes:

Types of Machine Learning

There are three main categories that machine learning falls under:

Supervised Learning:

The first step of supervised learning is to determine the type of training dataset. The training data is then collected and split into training, test, and validation datasets. The next step is to determine the input features for the training dataset; these need to carry enough information for the model to correctly predict the output. A suitable algorithm is then picked and trained on the dataset, sometimes with validation sets held back as a control. The very last step is evaluating the model on the test set: if the model predicts the right outputs, it is accurate.

There are two families of algorithms used in supervised learning: regression algorithms and classification algorithms. The former deal with the prediction of continuous variables such as weather forecasts or market/stock trends. The latter are used when the output variable is categorical in nature, for example in yes-no, male-female, and true-false scenarios.

Supervised learning can therefore help the model predict the output on the basis of previous experiences. The developer has an exact idea about the classes of objects that are available to work with. It helps with solving various real-world issues such as spam filtering and fraud detection.
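To make the supervised workflow above concrete, here is a minimal, generic scikit-learn sketch on a built-in dataset (not tied to any project discussed in this post):

# Minimal supervised-learning sketch: split, train, then validate on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)                      # labelled data: features X, known outputs y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=5000)                       # a simple classification algorithm
model.fit(X_train, y_train)                                     # learn from labelled examples

y_pred = model.predict(X_test)                                  # match the model against the test set
print("test accuracy:", accuracy_score(y_test, y_pred))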

Unsupervised Learning:

In this kind of machine learning, the model is not supervised using training datasets; instead, it automatically finds hidden insights and patterns in the ingested data. This happens by clustering similar data points together and understanding their similarities as groups. It is helpful for finding insights in any given data and is similar to how the human brain operates, because in the real world we do not always have input data that corresponds to a known output.

There are two methods of unsupervised machine learning that are worth noting: clustering and association. Clustering is a process wherein similar objects are sorted into clusters. Cluster analysis helps in determining the similarities between objects of data and labels them according to how common they are in the datasets.

Some use cases where unsupervised machine learning can be put to use include data exploration, targeted marketing campaigns, data visualization, etc.
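As a small illustrative sketch of clustering, the unsupervised method described above, on synthetic data:

# Minimal unsupervised-learning sketch: group unlabelled points into clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)    # unlabelled data (true labels discarded)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)                             # each point is assigned to a cluster

print("cluster sizes:", [int((cluster_ids == k).sum()) for k in range(3)])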

Reinforcement Learning:

In this class of machine learning, the machine learns to make a succession of decisions in uncertain and complex environments. The algorithm relies on trial and error: with a reward system attached, the AI receives rewards or penalties according to how it performs.

One of the main challenges with reinforcement learning is setting up the right simulation environment. Scaling or tweaking the neural network that controls the whole AI also presents itself as a challenge.

The model figures out how to perform tasks and maximize the reward potential, starting with completely randomized trials, leading up to more advanced strategies and sophisticated skills.

A few examples where reinforcement learning can come into play include autonomous vehicle technologies and advanced robotics. Here, the AI is required to learn how to recognize walking patterns and imitate the same.
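A full reinforcement-learning environment is beyond a blog snippet, but the trial-and-error-with-rewards idea can be illustrated with a tiny epsilon-greedy bandit loop (purely illustrative; real RL systems are far more involved):

# Tiny illustration of trial and error with rewards: an epsilon-greedy bandit.
# The agent starts with random tries and gradually favours the action
# that has earned the highest average reward.
import random

true_reward_probs = [0.2, 0.5, 0.8]          # hidden reward probability of each action
estimates = [0.0] * 3                        # the agent's running estimate per action
counts = [0] * 3
epsilon = 0.1                                # exploration rate

for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(3)                              # explore: random trial
    else:
        action = max(range(3), key=lambda a: estimates[a])        # exploit the best estimate so far
    reward = 1.0 if random.random() < true_reward_probs[action] else 0.0   # reward or penalty
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]     # update running average

print("estimated rewards:", [round(e, 2) for e in estimates])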

To learn more about how Indium can leverage AI/ML to drive business impact

Inquire Now!

Best Practices while Testing Machine Learning Applications

Artificial intelligence-based solutions use data as code. To have a well-oiled system, the input data must be accurate and well suited to the problem being solved. When creating an AI-based solution, there are a few things to keep in mind:

Clocking Test Findings: Machine learning algorithms are validated against range-based precision rather than a single expected outcome. For this reason, test results need to be recorded and reported in statistical terms. For each new development, a fresh set of confidence criteria is needed to make sure that inputs within the given range are analysed correctly; a sketch of this idea follows below.
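Here is a hedged sketch of such a range-based acceptance check: instead of asserting an exact score, the test records the metric in statistical terms and asserts it stays within an agreed confidence band. The 0.85 floor and the two-standard-deviation band are example thresholds, not values from any real project.

# Illustrative sketch: assert a range-based acceptance criterion instead of an exact output.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

mean_acc, std_acc = scores.mean(), scores.std()
print(f"accuracy = {mean_acc:.3f} +/- {std_acc:.3f}")            # report the result in statistical terms

# Example confidence criterion: the lower edge of a two-sigma band must clear the agreed floor.
assert mean_acc - 2 * std_acc >= 0.85, "model fails the agreed confidence criterion"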

Constructing Data Sets: Test data sets must be carefully curated to cover all relevant combinations and permutations in order to evaluate the effectiveness of the model. As the number of inputs and iterations scales up and the richness of the data increases, the model must also be constantly refined throughout the training process.

Building System Validation Test Suites: These test suites are made with the help of algorithms and the test data sets that are used. An example can be that of a test scenario set to predict the outcomes of patients based on diagnostic data and pathology. There needs to be certain mitigative profiling done for the patient that includes the disease under question, the patient’s demography, and other relevant patient information from the past.

Creating Comprehensive and Semi-Automated Training Data Sets: These kinds of training data sets include a combination of both input data and the required output. Static data dependency analysis is required to be taken forward in order to interpret both data sources and data features. This is an integral part of the migration and deletion of data sets.

Take a look at our Machine Learning and Deep Learning Services now!

Get in touch

Smarter than AI

A number of critical aspects come into play while testing AI systems, including but not restricted to data curation and validation, algorithm testing, performance and security testing, and smart interaction testing. AI systems are usually created to work in conjunction with other systems to tackle very specific challenges. Many factors must be thoroughly considered when testing AI/ML models.

Businesses that use AI on an everyday basis are building systems and applications that lead to newer testing methods and approaches. The procedures are only set to evolve in the coming years, as traditional testing methods are slowly taking a backseat.

The post Things to Keep in Mind while Testing Machine Learning Applications appeared first on Indium.

]]>
Data Annotation: 5 Questions You Must Address Before You Start Any Project https://www.indiumsoftware.com/blog/data-annotation-challenges Fri, 11 Jun 2021 03:48:30 +0000 https://www.indiumsoftware.com/blog/?p=3955 It might come as a surprise to many, but the idea of robot doctors, self-driving cars and other similar advancements is still very much a fantasy. In other words, the full capability of artificial intelligence (AI) is far from being realized. The reason? To propel many of the AI-based initiatives, large volumes of data is

The post Data Annotation: 5 Questions You Must Address Before You Start Any Project appeared first on Indium.

]]>
It might come as a surprise to many, but the idea of robot doctors, self-driving cars and other similar advancements is still very much a fantasy. In other words, the full capability of artificial intelligence (AI) is far from being realized. The reason? To propel many AI-based initiatives, large volumes of data are essential to accelerate progress and turn ideas into reality.

AI needs large volumes of data to continuously study and detect patterns. It cannot, however, be trained with just any raw data. Artificial intelligence, it is said, can only be as intelligent as the data it is fed.

Check out our Advanced Analytics Services

Read More

Smart data is raw data enriched with key information. It gives structure to data that would otherwise be nothing more than noise to a supervised learning algorithm.

Data annotation is the process that helps add essential nuggets of information to transform raw data into smart data.

Data Annotation

Also known as data labelling, data annotation plays a key role in ensuring machine learning and artificial intelligence projects are trained with the right, essential data. Labeling and annotation are the first step in providing machine learning models with what they need to identify and differentiate between inputs and produce accurate outputs.

By frequently feeding annotated and tagged datasets to the algorithms, it is possible to refine a model so that it gets smarter with time. Models become more intelligent as more annotated data is fed to train them.

Challenges In Data Annotation

Generating the required annotation from a given asset can be challenging, which is largely because of the complexity associated with annotation. Also, getting highly accurate labels requires expertise and time.

To ensure machines learn to classify and identify information, humans must annotate and verify the data. Without labels tagged and verified by humans, machine learning algorithms would have difficulty computing the essential attributes. When it comes to annotation, machines cannot function properly without human assistance, and for data labeling and AI quality control, the human-in-the-loop concept is not going away any time soon.

Let us take the example of legal documents, which are largely made of unstructured data. To understand any of the legal information and the context in which it is delivered, the expertise of legal experts is paramount. It might be necessary to tag any essential clauses and refer to cases that are pertinent to the judgment. The extraction and tagging process provides machine learning algorithms with information that they do not obtain on their own.

It is impossible to achieve success with AI if the right, essential information is not accessible. Feeding AI with the right data, with learnable signals frequently provided at scale, will enable it to improve over time. Therein lies the significance of data annotation.

But, before anyone gets started with a data annotation project, they must consider at least five key questions.

1. What Needs To Be Annotated?

Various forms of annotations exist based on the format of the data. It can vary from video to image annotation, semantic annotation, content categorization, text categorization and so on.

It is important to identify the most important one to help achieve specific business goals. It is also important to ask which format of data may help speed up a project’s progress more than its alternative.

Ultimately, it is about what the project needs in order to succeed.

2. How much of data is required for an AI/ML project?

The answer to the question would be: as much as possible.

However, in certain cases, benchmarks may be established depending on a particular requirement. The data requirement should be handled by a domain/subject matter expert who oversees annotations and frequently helps measure accuracy, in order to create 'ground truth' data that will be used to train the algorithm.

3. Is it necessary that annotators must be subject matter experts?

Based on the complexity of data that needs to be annotated, it is essential to have the best set of hands handling annotations.

While it is common for companies to entrust the crowd when it comes to basic annotation tasks, it is necessary to have annotators with specialized skill sets to annotate complex data.

Similar to having the requisite subject matter experts to decode the information provided in legal documents, it is essential to acquire the service of experts in annotation. People with an in-depth understanding of complex data will help ensure the data and the training sets do not carry even the minute errors that can throw a spanner in the works when it comes to creating predictive models.

4. Should data annotation be outsourced or performed in-house?

As per a report, organizations spend 5x more on their internal data labelling efforts than they spend on third-party data labeling. This way of working is not only expensive but also time-consuming for teams that could otherwise be focusing on other tasks.

Also, designing the requisite annotation tools typically requires way more work compared to certain machine learning projects. Not to mention that for a lot of companies, security can be an issue, which leads to hesitation in releasing data. However, this is unlikely to be of concern to companies that have the necessary security and privacy protocols already in place.

Cutting edge Big Data Engineering Services at your Finger Tips

Read More

5. Is the annotation accurately representing a specific industry?

Before someone starts with data labeling, it is essential for them to understand the format and category of the data and the domain vocabulary they plan to use. This is known as ontology, which is an integral part of machine learning. Financial services, healthcare and legal industries have unique rules and regulations for data.

Ontologies lend meaning and help AI to communicate through a common language. It is also necessary to understand the problem statement and identify how AI would interpret the data to semantically address a use case.

The post Data Annotation: 5 Questions You Must Address Before You Start Any Project appeared first on Indium.

]]>