• Year 2022
  • Theme Health
  • Team Research

Machine Learning Predictive Analysis

Introduction

With increasing data accessibility, it is becoming more common to use Machine Learning (ML) to predict health indicators. We used nationally representative cross-sectional data on 66,013 Indian women to build ML models that predict indicators that are either costly to measure (such as anaemia in pregnant women) or prone to desirability bias in self-reported responses (such as domestic violence). Our models rely only on features that can be asked about with a relatively low risk of desirability bias, so health workers can use them to identify women at high risk on specific health or safety indicators. Using these predictive models, we developed an Android application that ASHAs (Accredited Social Health Activists) or similar last-mile health workers can deploy. The application enables the health worker to identify women at high risk on a given indicator and recommend interventions based on their risk level.

Key results

We achieved a prediction sensitivity of 78% for domestic violence (DV) cases. Our anaemia models achieved an overall accuracy of 60%, against a benchmark of 50% anaemia prevalence in pregnant Indian women. We also identified the 30-40 most significant predictors of these two indicators and used them in the questionnaire for our mobile application.

Description and methodology

We used data from the National Family Health Survey (NFHS-4 for DV and NFHS-5 for anaemia) to build and test our models. Each NFHS round sampled almost 7 lakh (700,000) Indian women aged 15-49. Around 3.5% of these women were pregnant at the time of the survey, and they form the base for our anaemia models.

NFHS collects information across a wide range of features. We performed a basic form of screening to ensure we did not consider features with a high number of missing values. Then, we transformed the remaining questions into more meaningful variables through feature engineering. Finally, we removed features with a high degree of multi-collinearity.
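The screening steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the column names, the 30% missing-value cutoff and the 0.9 correlation cutoff are all assumptions chosen for the example, and pairwise Pearson correlation is used here as a simple stand-in for whatever multicollinearity check was applied.

```python
import math

def missing_fraction(values):
    """Share of entries that are None (treated here as missing)."""
    return sum(v is None for v in values) / len(values)

def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def screen_features(columns, max_missing=0.3, max_abs_corr=0.9):
    """Drop sparse columns, then drop the later column of each highly correlated pair.

    Both thresholds are illustrative assumptions, not values from the study.
    """
    kept = [name for name, vals in columns.items()
            if missing_fraction(vals) <= max_missing]
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[other])) < max_abs_corr
               for other in selected):
            selected.append(name)
    return selected

# Toy data with hypothetical column names (not actual NFHS variable codes).
columns = {
    "age": [21, 34, 28, 40, 19, 31],
    "age_in_months": [252, 408, 336, 480, 228, 372],    # collinear with age
    "births_last_5y": [1, 2, 0, 3, 1, 2],
    "rarely_answered": [None, None, None, 1, None, 0],  # mostly missing
}
print(screen_features(columns))
```

In this toy run the mostly-missing column and the column that duplicates age are dropped, leaving `age` and `births_last_5y`.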

We built and tuned predictive models using logistic regression, random forest, gradient-boosted decision trees and neural networks. We trained our models on 70% of the sample and checked their performance on a 20% holdout (validation) sample to guard against overfitting. Finally, we evaluated the selected models on the remaining 10% holdout (test) sample to determine our best-performing algorithm.
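A 70/20/10 split of this kind can be sketched as follows. The function name, the fixed seed and the plain random shuffle are assumptions made for illustration; the text does not say whether the original split was stratified or how it was seeded.

```python
import random

def split_70_20_10(records, seed=0):
    """Shuffle records and split them into train, validation and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_train = len(shuffled) * 70 // 100     # integer arithmetic avoids float rounding
    n_val = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains (about 10%)
    return train, validation, test

# Stand-in record IDs; a real run would pass the survey rows instead.
records = list(range(1000))
train, validation, test = split_70_20_10(records)
print(len(train), len(validation), len(test))  # 700 200 100
```

The 20% sample is used for tuning and model selection, and the 10% sample is touched only once at the end, so the final score is not inflated by the selection step.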

Results and highlights

For our DV models, a gradient-boosted decision tree gave the best sensitivity (true positive rate) of 78%. The model also provided insight into the most important predictors of domestic violence: alcohol consumption by the husband and prior exposure to intergenerational violence (i.e. if the woman's father ever hit her mother).

For anaemia, a random forest model gave the best overall accuracy of 63%. Biomarkers such as blood pressure and glucose level were the most important features for predicting anaemia in pregnant women. However, we aim to use these models in our mobile application for real-time predictions, and measuring BP and glucose levels might be difficult and expensive in rural areas, so we excluded such variables from the final model. As a result, our final anaemia model has an accuracy of 60%. In this model, the number of births in the last five years and the months between the wedding and the first birth were the top predictors. Past pregnancy-related variables such as place of delivery were important predictors as well.
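The two metrics reported above, sensitivity (true positive rate) and overall accuracy, can be computed from predicted and actual labels as in the sketch below. The labels are made-up toy values, not study data, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def sensitivity(y_true, y_pred):
    """Share of actual positives that the model flags: TP / (TP + FN)."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)

def accuracy(y_true, y_pred):
    """Share of all cases classified correctly: (TP + TN) / total."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)

# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(sensitivity(y_true, y_pred))  # 0.75
print(accuracy(y_true, y_pred))     # 0.7
```

Sensitivity ignores false positives, so it is usually read alongside accuracy or specificity rather than on its own.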

[Figure: True positive rates for DV models]

Discussion

Building machine learning models gives us one way to systematically assess the relative importance of predictors from a high-dimensional feature space. We find that a woman's belief that spousal violence by the husband is justified for certain perceived transgressions, such as if the wife neglects her children, argues with the husband, burns food or goes out of the household, is associated with DV. From our anaemia models, we find that a higher number of births in the last 3 or 5 years or a lower birth weight of the previous child is associated with a significantly higher risk of anaemia in women.

Our mobile application is built on these final models and makes real-time predictions from a short 30-question survey. The questions correspond to the most significant predictors of each indicator. ASHAs or other health workers can use our application in rural areas with minimal or no internet coverage to track how each woman's risk changes over time.

Research funders

The Machine Learning Initiative was funded by a grant from the Bill and Melinda Gates Foundation (BMGF).
