Intro to Machine Learning Tutorial
AI Development and Frameworks
Building and implementing AI solutions often involves coding and using specialized libraries. Python is the most popular programming language for AI development due to its simplicity and the rich ecosystem of AI libraries. Here are some key Python-based frameworks and tools relevant to our work:
Deep Learning with PyTorch: PyTorch is a widely-used open-source deep learning library for Python, developed by Facebook’s AI Research lab. It provides a flexible way to build neural networks and is beloved in the research community for its dynamic computational graph (which makes debugging and development more intuitive). PyTorch makes it easier to create and train complex models like CNNs and RNNs. For example, if we want to develop an AI model to analyze chest X-ray images for pneumonia, we could use PyTorch to define a CNN architecture and train it on a dataset of labeled X-rays. PyTorch handles the heavy math in the background (like automatic differentiation for gradient descent) and can run the training on GPUs to speed it up. Many pre-trained models (for image recognition, language models, etc.) are available in PyTorch, which is helpful for transfer learning – we can take a model trained on one task and fine-tune it on our healthcare-specific data. Overall, PyTorch is our go-to tool for deep learning experimentation and has been used in projects like developing diagnostic models or prototyping AI-driven decision support.
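To give a flavor of what this looks like in code, below is a minimal sketch of a small CNN for a hypothetical binary chest X-ray task. The architecture, image size, and two-class output are illustrative assumptions, not a recommended model:

```python
import torch
import torch.nn as nn

# A deliberately small CNN sketch for illustration: two convolutional blocks
# followed by a linear classifier. Input is assumed to be a single-channel
# (grayscale) 224x224 X-ray; "pneumonia vs. normal" is a hypothetical label.
class SimpleXrayCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 1 input channel -> 16 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 112 -> 56
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)  # flatten everything except the batch dimension
        return self.classifier(x)

model = SimpleXrayCNN()
dummy_batch = torch.randn(4, 1, 224, 224)  # a fake batch of 4 grayscale images
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([4, 2])
```

In a real project this skeleton would be trained on labeled X-rays (or replaced by a fine-tuned pre-trained model), but it shows how little code is needed to define a network in PyTorch.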
Traditional Machine Learning with scikit-learn: Not every problem requires deep learning. For many structured data tasks (like predicting an outcome from a set of numeric indicators), we rely on scikit-learn. Scikit-learn is a powerful Python library that offers simple and efficient tools for predictive data analysis. It includes implementations of basically all the classic machine learning algorithms – linear regression, logistic regression, decision trees, random forests, support vector machines, clustering algorithms, and more – all accessible through a consistent interface. If we have a spreadsheet of health data (say, patients with various attributes and we want to predict who will develop complications), scikit-learn can quickly help us train and evaluate a model. It also has utilities for splitting data into training and testing sets, doing cross-validation, scaling features, and evaluating with metrics. The beauty of scikit-learn is how straightforward it is: with just a few lines of code, you can train a model. For example, in our ARV treatment interruption prediction pilot, we might start with a scikit-learn Random Forest classifier to get a baseline model. Scikit-learn is great for rapid prototyping and for problems where deep learning might be overkill. It’s also very efficient for moderate-sized data and interpretable models. Because it’s “traditional” ML, the models from scikit-learn (like decision trees or logistic regression) can often be easier to interpret and explain to stakeholders compared to a neural network.
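As a hedged illustration of that workflow, the sketch below trains a Random Forest baseline on a synthetic patient table; the column names (missed_visits, interrupted, etc.) are hypothetical stand-ins for whatever our real ARV program data would contain:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical, synthetic patient table standing in for real program data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=500),
    "missed_visits": rng.integers(0, 6, size=500),
    "months_on_treatment": rng.integers(1, 60, size=500),
    "interrupted": rng.integers(0, 2, size=500),  # 1 = interrupted treatment (the label)
})

X = df.drop(columns="interrupted")
y = df["interrupted"]

# Hold out 20% of patients for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A Random Forest is a reasonable first baseline for tabular data like this.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```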
Data Manipulation with Pandas: Before we can build any model, we need to wrangle our data into shape. This is where Pandas comes in. Pandas is the premier Python library for data manipulation and analysis, providing high-performance data structures (like the DataFrame) to easily clean, transform, and analyze data. In global health, data often comes in messy spreadsheets, CSV files, or databases – Pandas helps us merge datasets (for example, combining lab results with patient records), handle missing values, compute summary statistics, and reshape data as needed. Using Pandas, we can do things like: filter out patients who don’t meet certain criteria, compute new indicators (e.g., adherence rate over past 3 months), pivot data by clinic or by month, etc., all in just a few lines of code. Pandas is essentially our data pre-processing workhorse. An example: if we have a dataset of commodity stock levels per month for each health facility, Pandas can help group and aggregate this data to feed into an AI model that forecasts stock-outs. Or when preparing training data for an AI model, we might use Pandas to label outcomes or to normalize values. Its syntax is very intuitive for anyone used to Excel or SQL, but it brings the full power of programming to data handling. Mastering Pandas is crucial because good AI results require good data, and Pandas is how we get the data ready.
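For instance, a minimal sketch of the stock-level aggregation mentioned above might look like the following; the column names and values are made up for illustration:

```python
import pandas as pd

# Hypothetical monthly stock records; in practice this would come from a CSV or database.
stock = pd.DataFrame({
    "facility": ["Clinic A", "Clinic A", "Clinic B", "Clinic B"],
    "month": ["2024-01", "2024-02", "2024-01", "2024-02"],
    "stock_on_hand": [120, 30, 80, 0],
})

# Flag stock-outs and summarise per facility.
stock["stock_out"] = stock["stock_on_hand"] == 0
summary = (
    stock.groupby("facility")
         .agg(avg_stock=("stock_on_hand", "mean"),
              months_stocked_out=("stock_out", "sum"))
         .reset_index()
)
print(summary)
```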
In summary, our AI development typically looks like this: use Pandas to get and clean the data, maybe use scikit-learn to explore baseline models or do quick analyses, and if the problem calls for it, move to deep learning with PyTorch for more complex modeling. All of these live in Python’s ecosystem, which means they interoperate nicely (you can have a Pandas DataFrame that you convert to a NumPy array for scikit-learn, or feed data into PyTorch tensors, etc.). Python also has many other useful libraries (like TensorFlow/Keras for deep learning, though we often prefer PyTorch; or NumPy and SciPy for lower-level math; and specialized libraries like pytorch-lightning for training workflows or huggingface for NLP models), but the three above – Pandas, scikit-learn, and PyTorch – are core pillars of our AI toolkit.
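As a small illustration of that interoperability, the snippet below (with made-up values) moves the same data from a Pandas DataFrame to a NumPy array and then to a PyTorch tensor:

```python
import numpy as np
import pandas as pd
import torch

# A toy DataFrame with two numeric features (hypothetical values).
df = pd.DataFrame({"age": [34, 51, 29], "missed_visits": [0, 3, 1]})

X_np = df.to_numpy(dtype=np.float32)   # NumPy array, ready for scikit-learn
X_torch = torch.from_numpy(X_np)       # PyTorch tensor, ready for a neural network

print(X_np.shape, X_torch.dtype)       # (3, 2) torch.float32
```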
6. How AI Works
Now that we’ve covered what AI is and what it can do, let’s demystify how an AI project comes together. There are a few fundamental steps: data collection, model training, and model testing & evaluation. We will briefly walk through each, as they are vital to understand both the capabilities and limitations of AI.
Data Collection
Data is the fuel for AI. The quality and quantity of data largely determine how well an AI model will perform. In the context of global health, data can come in many forms:
Structured data: Such as numerical fields in a database (e.g., a patient’s age, blood pressure readings, or number of clinic visits), categorical data (like gender, or yes/no fields), and dates/times. This kind of data is often found in electronic health record systems or monitoring databases.
Unstructured data: This includes free-text (doctor’s notes, reports), images (X-rays, microscopy images, photos of skin lesions), audio (recordings of patient interviews or call center recordings), and possibly sensor data (like wearables output). Unstructured data often requires extra preprocessing (like NLP for text, or image processing).
Big data streams: Sometimes AI might ingest continuous data streams – for example, real-time surveillance data from social media or continuous vital sign monitoring from devices.
Collecting data involves gathering it from sources (databases, surveys, devices), and then cleaning and organizing it. Data cleaning is a crucial step: real-world data is messy. There will be typos, missing values, outliers (like someone recorded as 500 years old due to a typo), etc. We use tools like Pandas to handle these issues – dropping or imputing missing values, correcting errors, and ensuring the data makes sense.
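A minimal sketch of this kind of cleaning with Pandas might look like the following; the columns and thresholds are illustrative, not a fixed recipe:

```python
import numpy as np
import pandas as pd

# Hypothetical raw extract with typical real-world problems.
raw = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [34, 500, np.nan, 29],          # 500 is an obvious typo; one value is missing
    "blood_pressure": [120, 135, np.nan, 110],
})

clean = raw.copy()
clean.loc[clean["age"] > 120, "age"] = np.nan              # treat impossible ages as missing
clean["age"] = clean["age"].fillna(clean["age"].median())  # impute missing ages with the median
clean = clean.dropna(subset=["blood_pressure"])            # drop rows missing a key measurement

print(clean)
```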
In global health, we also must pay attention to data quality and bias. If the data used to train an AI model isn’t representative (say we only have data from urban hospitals, and none from rural clinics), the model may not generalize well. Also, if data is outdated or collected differently in different places, the model could learn spurious patterns. It’s often said: “garbage in, garbage out.” Thus, a big part of AI work is making sure we have reliable, relevant data. Sometimes, to get labeled data for supervised learning, we might need manual effort (like experts labeling ultrasound images to indicate which show pneumonia) or to use proxy labels (like using treatment outcomes from the past as labels for training an adherence model).
We also consider data privacy and security. Health data is sensitive, so any AI project must handle data in compliance with ethical standards, patient consent, and regulations. Often data is de-identified and aggregated for AI use. Once we have our dataset prepared – imagine for example we compile a table where each row is a patient and columns include their demographic info, lab results, and a label indicating if they experienced treatment failure – we’re ready to feed it to a model.
Model Training
Training an AI model means teaching the model to recognize patterns in the data. For supervised learning (which is common in our use cases), training involves showing the model many examples with the correct answer so it can learn to predict the answer for new examples.
Think of it like this: if we’re training a model to predict whether a patient will interrupt treatment, each patient in our training set is an example. The model looks at that patient’s features (age, number of missed appointments, etc.) and makes a prediction. We then compare the prediction to the true outcome (did they interrupt treatment or not?). If the prediction is wrong, the model adjusts its internal parameters to do better next time; if it’s right, those parameters are reinforced. This process repeats for thousands of examples. The mechanism for adjustment depends on the model type:
For a simple model like linear regression, training might involve an algorithm like ordinary least squares or gradient descent to find the best-fit line through the data.
For a neural network, training uses backpropagation and gradient descent: the network’s weights (connections) start random, and with each example, the error is calculated and then propagated backward through the network to update the weights in a direction that reduces error. Doing this in iterative cycles (epochs) eventually makes the network accurate on the training data.
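To make the loop just described concrete, here is a toy PyTorch training loop on synthetic data; the target relationship (y = 2x + 1) is invented purely so we can watch the weights converge:

```python
import torch
import torch.nn as nn

# Synthetic data: y = 2*x + 1 plus a little noise.
torch.manual_seed(0)
x = torch.randn(100, 1)
y = 2 * x + 1 + 0.1 * torch.randn(100, 1)

model = nn.Linear(1, 1)                       # weights start out (near) random
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):                      # repeat over the data many times (epochs)
    optimizer.zero_grad()
    prediction = model(x)
    loss = loss_fn(prediction, y)             # how wrong are we?
    loss.backward()                           # backpropagation: compute gradients of the error
    optimizer.step()                          # adjust weights in the direction that reduces error

print(model.weight.item(), model.bias.item())  # should end up close to 2 and 1
```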
During training, we often tweak model settings (called hyperparameters), such as how complex to make the model (e.g., how many trees in a random forest or how many layers in a neural network), the learning rate (how big each adjustment step is), etc. This is somewhat an art and science – we may try several configurations and see what yields the best result.
It’s important that we avoid overfitting during training. Overfitting means the model memorizes the training data too closely and fails to generalize to new data. For example, if our dataset is small, a complex model might just memorize each patient rather than learning general patterns, resulting in high accuracy on training data but poor performance on unseen patients. We combat this by techniques like cross-validation (training on part of the data, testing on another part), regularization (penalizing overly complex models), or simply by keeping the model simpler if needed.
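As a quick sketch, 5-fold cross-validation with scikit-learn looks like this (using a public dataset purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: train on 4/5 of the data, test on the held-out 1/5,
# rotate which fold is held out, then look at the spread of scores.
X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())
```

A large gap between the folds (or between training and validation scores) is a warning sign of overfitting.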
An opposite issue is underfitting, where the model is too simple to capture the underlying pattern – say we try to predict a complex outcome with just a straight line. Underfitting leads to poor performance on both training and new data. We aim for a model that is just right – capturing the signal but not the noise.
In summary, model training is where the AI “learns” from historical data. For instance, we might train a decision tree model to split patients based on their attributes (if CD4 count < X and missed visits > Y, then high risk, etc.). Or we train a deep learning model on thousands of radiology images to learn what tumors look like. This stage can be computationally intensive (especially for deep learning), but once done, we have a trained model ready to be evaluated and used.
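As a toy illustration of the decision-tree idea, the sketch below fits a small tree to synthetic data built around a rule like the one above and prints the learned splits; the feature names, thresholds, and label rule are all hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features; a real project would pull these from program data.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "cd4_count": rng.integers(50, 1200, size=300),
    "missed_visits": rng.integers(0, 6, size=300),
})
# Synthetic label roughly following the rule in the text, purely for illustration.
df["high_risk"] = ((df["cd4_count"] < 200) & (df["missed_visits"] > 2)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(df[["cd4_count", "missed_visits"]], df["high_risk"])

# The learned splits can be printed as human-readable rules.
print(export_text(tree, feature_names=["cd4_count", "missed_visits"]))
```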
Model Testing & Evaluation
After training a model, we need to assess how well it actually performs – before deploying it in the real world. This is where testing and evaluation come in. We typically set aside a portion of our data (or use fresh data not seen by the model) as a test set. The model makes predictions on this test data, and since we know the true answers for the test set, we can measure performance.
Key evaluation metrics include:
Accuracy: The proportion of correct predictions out of all predictions. For example, if our model correctly predicted treatment outcome for 90 out of 100 patients, it has 90% accuracy. Accuracy is simple to understand but can be misleading if the classes are imbalanced (e.g., if only 5% of patients interrupt treatment, a model that always predicts “no interruption” is 95% accurate by default, which isn’t actually useful).
Precision and Recall: For classification tasks, especially in health, these are critical. Precision asks: of those we predicted as positive (e.g., high risk patients), how many were actually positive? Recall (sensitivity) asks: of all actual positives, how many did we correctly identify? There is often a trade-off between precision and recall. In a scenario like identifying patients at risk of defaulting treatment, a high recall means we catch most of the ones who will default (important so we don’t miss people), though it might come at the cost of precision (we might flag some who would have been fine). We use these metrics to choose thresholds and to compare models.
Specificity: Especially for medical tests, we look at specificity (true negative rate) – e.g., if our model says someone won’t default, how often is it correct? This matters to ensure we’re not giving false reassurance.
F1 Score: The harmonic mean of precision and recall, useful as a single measure that balances both.
ROC AUC: For binary classification, the area under the ROC curve tells us how well the model separates the two classes overall.
Mean Squared Error / MAE: If it’s a regression (numeric prediction), we look at error metrics like mean squared error (average of squared differences between prediction and true value) or mean absolute error.
In practice, we often examine a confusion matrix (which breaks down counts of true vs predicted classes) and derive these stats from it. For example, with our HIV interruption prediction model, we might find it has 85% recall (sensitivity) – meaning it catches 85% of those who will interrupt – and perhaps 80% precision – meaning 20% of those flagged won’t actually interrupt (false positives). These numbers help us decide if the model is good enough or needs improvement.
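Computing these numbers with scikit-learn takes only a few lines; the sketch below uses made-up predictions for ten patients purely to show the calls:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true outcomes and model predictions for 10 patients
# (1 = interrupted treatment, 0 = did not).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))       # counts of true vs predicted classes
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```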
Crucially, we also watch out for overfitting signs: if a model does much better on training data than on test data, it likely overfit. A well-generalized model should perform comparably on training and test. Overfitting vs. underfitting is a core challenge: an overfit model might score 100% on training data but far lower on test data (it “memorized” rather than “learned”), whereas an underfit model might score poorly everywhere. We aim for a balanced fit.
Beyond metrics, evaluation can involve real-world trials. For instance, even if a model has good metrics, we might do a pilot where health workers use the model’s predictions and give feedback: Was it actually helpful? Did it integrate well into the workflow? Sometimes a model that’s statistically good might not be useful if it’s not interpretable or actionable. So, especially in global health, we also value interpretability and will favor models that can explain their reasoning (like decision trees or rule-based approaches) when needed, or we accompany a complex model with explanation tools (like SHAP values for feature importance in a prediction).
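As a rough sketch of the SHAP idea, the snippet below uses the third-party shap library and a public dataset purely for illustration; the exact structure of the returned values depends on the shap version, so treat this as a starting point rather than a recipe:

```python
import shap  # third-party library: pip install shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# TreeExplainer computes SHAP values for tree-based models: per-prediction
# scores showing how much each feature pushed the prediction up or down.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:5])  # explain the first five rows
print(shap_values)
```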
In summary, we test our AI like we test a diagnostic tool – how accurate is it, does it catch what it should (sensitivity) without too many false alarms (specificity/precision), and is it reliable on new data. Only after rigorous evaluation do we consider deploying an AI model in a live setting, and even then we monitor it, because data can drift over time and models may need recalibration.
Python Demonstration
To solidify our understanding, let’s walk through a hands-on Python demonstration of a simple AI application in healthcare. We will build a basic machine learning model using real medical data to show how the process works end-to-end.
Scenario: Suppose we want to create an AI model to help diagnose breast cancer from clinical data. We’ll use a classic dataset (the Breast Cancer Wisconsin dataset) which contains features computed from breast mass cell images (like cell size, texture, etc.) and a label indicating whether the tumor was benign or malignant. Our goal is to train a model that can predict “benign” vs “malignant” from these features. This mimics a tool that could assist doctors in evaluating tumor characteristics.
We will use Python with scikit-learn for this demo:
Load the dataset and examine it.
Split the data into a training set and a test set.
Train a machine learning model (we’ll use a Random Forest, which is often effective out-of-the-box for classification).
Evaluate the model’s accuracy and other metrics on the test set.
Below is the code and output for this process (comments in the code explain each step):
```python
# 1. Import necessary libraries and dataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target  # X are the features, y is the label (0=malignant, 1=benign in this dataset)
print("Total samples:", X.shape[0])
print("Feature names:", data.feature_names[:5], "...")
print("Class names:", data.target_names)

# 2. Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=50, random_state=42)  # 50 trees for simplicity
clf.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy on test data: {accuracy:.2f}")

# Print detailed classification report (precision, recall, etc.)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```
When we run this, we get an output like:
```
Total samples: 569
Feature names: ['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness'] ...
Class names: ['malignant' 'benign']

Accuracy on test data: 0.96

Classification Report:
              precision    recall  f1-score   support

   malignant       0.98      0.93      0.95        43
      benign       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114
```
Let’s interpret these results. The dataset had 569 examples in total, with various features (like “mean radius”, “mean texture”, etc. of the cell nuclei in the tumor). We trained on 455 samples and tested on 114 samples. The model we used, Random Forest, is an ensemble of decision trees – it’s often effective for medical data without heavy parameter tuning.
The accuracy on the test set is 0.96, which means 96% of the test tumors were correctly classified by our model as benign vs malignant. The classification report gives more detail:
For malignant tumors, the model’s precision is 0.98, meaning when it predicts “malignant,” it’s correct 98% of the time (very few false alarms). The recall is 0.93, so it caught 93% of all actual malignant cases (it missed a small fraction).
For benign tumors, precision 0.96 and recall 0.99, indicating it’s very good at identifying benign cases too (only 1% of benign were misclassified as malignant).
These numbers yield an overall 96% accuracy. The f1-scores (~0.95-0.97) show a good balance between precision and recall.
In practice, such a model could be used as a decision support tool. For example, when a new tumor’s features are input, the model might predict “malignant” with high confidence, prompting further investigation or early treatment. Of course, in a real deployment we would validate the model on independent data and integrate it with other clinical information. But this demo illustrates how, with just a few lines of Python, we:
Loaded health data,
Trained an AI model,
Achieved high accuracy in a diagnosis task.
This is a simple example, but the workflow is similar for other AI projects: data prep, training, and evaluation using Python tools. We could easily swap in a different algorithm or do additional steps (like hyperparameter tuning or cross-validation), but the core idea is that Python’s ecosystem makes it relatively straightforward to experiment with these models.
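For example, one possible extension is a small hyperparameter search over the Random Forest settings, sketched below using the same dataset and split as the demo; the parameter grid is illustrative, not a tuned recommendation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Same data and split as the demo above.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Try a few Random Forest settings and pick the best via 5-fold cross-validation.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```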
Takeaway: The Python demonstration shows the end-to-end of building a predictive model. In our daily work, we’ll be using similar code (often more complex) to develop AI solutions. Whether it’s predicting treatment interruption or analyzing an image, the process involves data and a model. The better our understanding of this process, the better we can interpret AI results and integrate them into global health interventions.