Exploring the Artificial Intelligence Virtual Experience Program: A Journey of Learning and Skill Development

Vedant Dhote
10 min read · May 28, 2023


“The more we explore the vast realm of data science and machine learning, the more we realize the limitless possibilities they hold. It is through continuous learning and practical experience that we unravel the true power of AI, transforming data into actionable insights that shape our world.”

Photo by Lukas Blazek on Unsplash

As a passionate data science and machine learning enthusiast, my thirst for knowledge and continuous growth in the field knows no bounds. I am always on the lookout for resources and certifications that can enrich my understanding and expertise. Last week, I stumbled upon an exceptional opportunity — the Cognizant Artificial Intelligence Virtual Experience Program on Forage. Little did I know that this virtual experience would propel me into a world of immersive learning and hands-on practice.

Throughout the program, I delved into various aspects of data science and machine learning, acquiring valuable skills and knowledge that would shape my career. From data analysis and Python programming to data visualization, modeling, and beyond, each module offered a unique perspective and allowed me to hone my abilities. Moreover, the program also emphasized essential elements such as problem statement formulation, model interpretation, effective communication, machine learning engineering, development, quality assurance, and evaluation techniques.

In this blog series, I aim to take you on a journey through my experience in the Cognizant Artificial Intelligence Virtual Experience Program. Together, we will explore the concepts I learned, the challenges I faced, and the insights I gained throughout the program. Join me as I unravel the intricacies of data science, machine learning, and the practical applications that go hand in hand with them.

Task One: Exploratory Data Analysis

Problem Statement:

Gala Groceries approached Cognizant to help them with a supply chain issue. Groceries are highly perishable items. If you overstock, you are wasting money on excessive storage and waste, but if you understock, then you risk losing customers. They want to know how to better stock the items that they sell.

This is a high-level business problem that will require you to dive into the data in order to formulate some questions and recommendations for the client about what else we need to answer it.

Once you’re done with your analysis, we need you to summarize your findings and provide some suggestions as to what else we need in order to fulfill their business problem. Please draft an email containing this information to the Data Science team leader to review before we send it to the client.

Dataset — sample_sale_data.csv

EDA walkthrough —
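
Since the walkthrough notebook itself is not embedded here, below is a minimal sketch of the kind of exploratory analysis I ran on the sample. The column names (timestamp, category, customer_type, payment_type) are assumptions about the sample schema and may differ slightly from the actual file.

import pandas as pd

# Load the sample and take a quick look at its shape, types and missing values
df = pd.read_csv("sample_sale_data.csv")
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Which product categories are bought most frequently?
print(df["category"].value_counts().head(10))

# Who shops more often: members or non-members?
print(df["customer_type"].value_counts())

# Which payment method is used most?
print(df["payment_type"].value_counts())

# Which hour of the day sees the most transactions?
df["timestamp"] = pd.to_datetime(df["timestamp"])
print(df["timestamp"].dt.hour.value_counts().sort_index())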

The challenging part here was communication. The client does not understand the code you write or the output it produces, because they are not a data scientist. It was my job to summarize the findings in a concise, business-friendly manner within an email to the Data Science team leader.

Here is the email draft of the task:

Dear Data Science Team Leader,

I received the sample dataset from the Data Engineering team and I’ve been analyzing the sample on behalf of the Data Science team.

I found the following insights as part of the analysis:
1. Fruit & vegetables are the 2 most frequently bought product categories
2. Non-members are the most frequent buyers within the store
3. Cash is the most frequently used payment method
4. 11am is the busiest hour with regards to number of transactions

As a reminder, the client indicated that they wanted to know the following: “How to better stock the items that they sell.”
With respect to this business question, my recommendations are as follows:
1. This is a very broad statement and in order to tackle this with better accuracy, we need to identify a specific problem statement that the business would like to solve. For example, can we predict the demand of products on an hourly basis in order to procure products more intelligently?
2. We need more data. The current sample only covers 7 days and 1 store.
3. Based on the problem statement that we move forward with, we will need more datasets to help describe the outcome that we’re trying to model. For example, if we’re modeling demand for products, we may want to include information about stock levels or weather conditions.

Best regards,
Vedant Dhote

Task Two: Data Modeling

Problem Statement:

“Can we accurately predict the stock levels of products based on sales data and sensor data on an hourly basis in order to more intelligently procure products from our suppliers?”

The client has agreed to share more data in the form of sensor data. They use sensors to measure the temperature of the storage facilities where products are stored in the warehouse, and they also track stock levels within the refrigerators and freezers in store.

It is your task to look at the data model diagram that has been provided by the Data Engineering team and to decide on what data you’re going to use from the data available. In addition, we need you to create a strategic plan as to how you’ll use this data to complete the work to answer the problem statement.

This was a simpler task compared to Task One. Here we derive the strategic plan for the project:

Strategic plan / Plan of work

Task Three: Model Building and Interpretation

Problem Statement:

The client has provided 3 datasets; it is now your job to combine, transform, and model these datasets in a suitable way to answer the problem statement that the business has posed.

Most importantly, once the modeling process is complete, we need you to communicate your work and analysis in the form of a single PowerPoint slide, so that we can present the results back to the business. The key here is to use business-friendly language and to explain your results in a way that the business will understand.

Additional Datasets: sales.csv, sensor_stock_levels.csv, sensor_storage_temperature.csv

In the model-building process, I started by gathering and preprocessing the relevant datasets, ensuring their quality and integrity. Next, I carefully selected appropriate features and engineered new ones where necessary to capture the underlying patterns and relationships. Then, I split the data into training, validation, and testing sets for evaluation purposes.
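
To make that concrete, here is a rough sketch (not a copy of my actual notebook) of how the three files could be aligned at an hourly level before feature engineering. Column names such as timestamp, product_id, quantity, estimated_stock_pct and temperature are assumptions based on the dataset descriptions.

import pandas as pd

sales = pd.read_csv("sales.csv")
stock = pd.read_csv("sensor_stock_levels.csv")
temps = pd.read_csv("sensor_storage_temperature.csv")

# Round every timestamp down to the hour so the three sources line up
for frame in (sales, stock, temps):
    frame["timestamp"] = pd.to_datetime(frame["timestamp"]).dt.floor("H")

# Aggregate to one row per product per hour
sales_agg = sales.groupby(["timestamp", "product_id"], as_index=False)["quantity"].sum()
stock_agg = stock.groupby(["timestamp", "product_id"], as_index=False)["estimated_stock_pct"].mean()
temp_agg = temps.groupby("timestamp", as_index=False)["temperature"].mean()

# Merge into a single modeling table and add hour of day as a feature
merged = stock_agg.merge(sales_agg, on=["timestamp", "product_id"], how="left")
merged = merged.merge(temp_agg, on="timestamp", how="left")
merged["quantity"] = merged["quantity"].fillna(0)
merged["hour"] = merged["timestamp"].dt.hour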

After training and tuning the algorithm, evaluating its performance, and checking its interpretability, I came up with these insights:

  • The product categories were not that important
  • The unit price and temperature were important in predicting stock
  • The hour of day was also important for predicting stock
Interpretation of Model
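
For reference, a chart like the one above can be produced from the trained model's feature importances. This is only a sketch; model and X here stand for the fitted RandomForestRegressor and the predictor DataFrame from the modeling step.

import pandas as pd
import matplotlib.pyplot as plt

# 'model' is the fitted RandomForestRegressor, 'X' the predictor DataFrame
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values().plot(kind="barh", figsize=(8, 10))
plt.title("Feature importance for predicting stock levels")
plt.tight_layout()
plt.show()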

Task Four: Machine Learning Production

Problem Statement:

To build the foundation for this machine learning use case, they want to implement a first version of the algorithm in production. In its current state, as a Python notebook, the work is not suitable for productionizing a machine learning model.

Therefore, as the Data Scientist that created this algorithm, it is your job to prepare a Python module that contains code to train a model and output the performance metrics when the file is run.

Step 1: Plan

Good quality code should be planned and should follow a uniform and clear structure.

Step 2: Write

After planning the module, I created the Python file and included plenty of comments and documentation, because the ML engineering team is not the team that wrote this code.

The Python module code:

# ------- BEFORE STARTING - SOME BASIC TIPS
# You can add a comment within a Python file by using a hashtag '#'
# Anything that comes after the hashtag on the same line, will be considered
# a comment and won't be executed as code by the Python interpreter.

# --- 1) IMPORTING PACKAGES
# The first thing you should always do in a Python file is to import any
# packages that you will need within the file. This should always go at the top
# of the file
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# --- 2) DEFINE GLOBAL CONSTANTS
# Constants are variables that should remain the same throughout the entire running
# of the module. You should define these after the imports at the top of the file.
# You should give global constants a name and ensure that they are in all upper
# case, such as: UPPER_CASE

# K is used to define the number of folds that will be used for cross-validation
K = 10

# Split defines the % of data that will be used in the training sample
# 1 - SPLIT = the % used for testing
SPLIT = 0.75

# --- 3) ALGORITHM CODE
# Next, we should write our code that will be executed when a model needs to be
# trained. There are many ways to structure this code and it is your choice
# how you wish to do this. The code in the 'module_helper.py' file will break
# the code down into independent functions, which is 1 option.
# Include your algorithm code in this section below:

# Load data
def load_data(path: str = "/path/to/csv/Forage - Cognizant AI Program/Task 4/Resources/"):
    """
    This function takes a path string to a CSV file and loads it into
    a Pandas DataFrame.

    :param path (optional): str, relative path of the CSV file

    :return df: pd.DataFrame
    """

    df = pd.read_csv(f"{path}")
    df.drop(columns=["Unnamed: 0"], inplace=True, errors='ignore')
    return df

# Create target variable and predictor variables
def create_target_and_predictors(
    data: pd.DataFrame = None,
    target: str = "estimated_stock_pct"
):
    """
    This function takes in a Pandas DataFrame and splits the columns
    into a target column and a set of predictor variables, i.e. X & y.
    These two splits of the data will be used to train a supervised
    machine learning model.

    :param data: pd.DataFrame, dataframe containing data for the
        model
    :param target: str (optional), target variable that you want to predict

    :return X: pd.DataFrame
            y: pd.Series
    """

    # Check to see if the target variable is present in the data
    if target not in data.columns:
        raise Exception(f"Target: {target} is not present in the data")

    X = data.drop(columns=[target])
    y = data[target]
    return X, y

# Train algorithm
def train_algorithm_with_cross_validation(
    X: pd.DataFrame = None,
    y: pd.Series = None
):
    """
    This function takes the predictor and target variables and
    trains a Random Forest Regressor model across K folds. Using
    cross-validation, performance metrics will be output for each
    fold during training.

    :param X: pd.DataFrame, predictor variables
    :param y: pd.Series, target variable

    :return
    """

    # Create a list that will store the accuracies of each fold
    accuracy = []

    # Enter a loop to run K folds of cross-validation
    for fold in range(0, K):

        # Instantiate algorithm and scaler
        model = RandomForestRegressor()
        scaler = StandardScaler()

        # Create training and test samples; vary the random state per fold
        # so that each fold is evaluated on a different random split
        X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=SPLIT, random_state=fold)

        # Scale X data, we scale the data because it helps the algorithm to converge
        # and helps the algorithm to not be greedy with large values
        scaler.fit(X_train)
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)

        # Train model
        trained_model = model.fit(X_train, y_train)

        # Generate predictions on test sample
        y_pred = trained_model.predict(X_test)

        # Compute accuracy, using mean absolute error
        mae = mean_absolute_error(y_true=y_test, y_pred=y_pred)
        accuracy.append(mae)
        print(f"Fold {fold + 1}: MAE = {mae:.3f}")

    # Finish by computing the average MAE across all folds
    print(f"Average MAE: {(sum(accuracy) / len(accuracy)):.2f}")

# --- 4) MAIN FUNCTION
# Your algorithm code should contain modular code that can be run independently.
# You may want to include a final function that ties everything together, to allow
# the entire pipeline of loading the data and training the algorithm to be run all
# at once

# Execute training pipeline
def run():
    """
    This function executes the training pipeline of loading the prepared
    dataset from a CSV file and training the machine learning model

    :param

    :return
    """

    # Load the data first
    df = load_data()

    # Now split the data into predictors and target variables
    X, y = create_target_and_predictors(data=df)

    # Finally, train the machine learning model
    train_algorithm_with_cross_validation(X=X, y=y)
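
One small addition beyond the original template (my own suggestion, not part of the task brief) is an entry point guard, so that running the file directly executes the whole pipeline and prints the performance metrics:

# Run the full pipeline only when this file is executed directly,
# not when it is imported as a module by another script
if __name__ == "__main__":
    run()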

Task Five: Quality Assurance

Evaluating the production machine learning model to ensure quality results.

After this, the ML engineering team took my Python module and, together with DevOps, deployed the algorithm into production, which is great!
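
One improvement I would flag in review is that the module approximates cross-validation with repeated random splits rather than true K-fold splitting, so some rows may never appear in a test set. A rough sketch of an alternative using scikit-learn's KFold, assuming the same X, y and K as in the module above:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

def train_with_kfold(X, y, k: int = 10):
    """Train and evaluate a Random Forest with genuine K-fold cross-validation."""
    maes = []
    splitter = KFold(n_splits=k, shuffle=True, random_state=42)
    for fold, (train_idx, test_idx) in enumerate(splitter.split(X)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        # Scale features on the training fold only, then apply to the test fold
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        model = RandomForestRegressor()
        model.fit(X_train, y_train)
        mae = mean_absolute_error(y_test, model.predict(X_test))
        maes.append(mae)
        print(f"Fold {fold + 1}: MAE = {mae:.3f}")

    print(f"Average MAE: {sum(maes) / len(maes):.2f}")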

This was my experience, and from this virtual program I learned:

  • How Python can be used to conduct exploratory data analysis
  • The importance of communication within your role to explain what you have found
  • How to plan what data is required to answer business questions using a data model
  • How to communicate your strategic plan to your Data Science team leader
  • How to apply machine learning to combine, transform and model data sets to answer the client’s question
  • How to communicate your key findings to the client
  • How Python is used in machine learning to provide greater business value to Gala Groceries
  • How to plan and write a Python module
  • How to review your Python module and identify how you can improve your algorithm

If you have an interest in data science and machine learning, try the Cognizant Artificial Intelligence Virtual Experience Program.
