I recently read a research paper about the use of machine learning to predict premature death. The research shows that machine learning can produce highly accurate predictions, but it also highlights a limitation: machine learning cannot tell us about cause and effect.
The researchers, Weng et al. from the University of Nottingham, used data from the UK Biobank, a database of historical medical data from more than half a million people, of whom about 14 thousand died, mainly from cancer, respiratory disease or heart disease.
Neural networks gave the best results at predicting premature death, closely followed by random decision forests. Both were more accurate than the traditional, non-machine-learning method (a Cox model).
The machine learning models put emphasis on some factors that differed from those of the traditional method; the random decision forest, for example, emphasised waist circumference and skin tone. It is tempting to conclude that the factors a model found to be important are the ones which cause premature death, but that would be incorrect. As explained in the research article, the machine learning models "only give an indication of whether there may be a “signal” in the data and not the direction of association, and should thus be interpreted with caution. Further analysis using causal epidemiological study designs is recommended."
Having said that, the "study shows the value of using ML, to explore a wide array of individual clinical, demographic, lifestyle and environmental risk factors, to produce a novel and holistic model that was not possible to achieve using standard approaches. This work suggests that use of ML should be more routinely considered when developing models for prognosis or diagnosis."
Perhaps a strength of machine learning will be its ability to find potential signals in medical (and other) data which can then be tested using causal study designs such as randomised clinical trials.
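To make the gap between a signal and a cause concrete, here is a minimal simulation sketch. It is not taken from the paper: the hidden cause, the marker, the thresholds and the function names are all invented for illustration. A hidden cause drives both a measurable marker and the outcome, so the marker is strongly associated with the outcome in observational data even though it has no effect on it. When the marker is assigned at random, as it would be in a randomised trial, the association disappears.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n, randomise_marker, seed=0):
    """Return the marker/outcome correlation in a synthetic population.

    Everything here is hypothetical: the outcome depends only on a hidden
    cause and never on the marker itself.
    """
    rng = random.Random(seed)
    markers, outcomes = [], []
    for _ in range(n):
        hidden_cause = rng.gauss(0, 1)
        if randomise_marker:
            # Marker assigned at random, independent of the hidden cause.
            marker = rng.gauss(0, 1)
        else:
            # Observational setting: marker tracks the hidden cause.
            marker = hidden_cause + rng.gauss(0, 0.5)
        # Outcome is driven by the hidden cause plus noise, not the marker.
        outcome = 1 if hidden_cause + rng.gauss(0, 0.5) > 1 else 0
        markers.append(marker)
        outcomes.append(outcome)
    return pearson(markers, outcomes)


observational = simulate(20000, randomise_marker=False)
randomised = simulate(20000, randomise_marker=True)
print(f"observational correlation: {observational:.2f}")
print(f"randomised correlation:    {randomised:.2f}")
```

A predictive model trained on the observational data would be right to lean on the marker, because it genuinely helps prediction. Only the randomised version shows that changing the marker would not change the outcome.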
Six of the 15 top rank risk factors (age, prior diagnosis of cancer, gender, smoking, FEV1, education) were identified in all three algorithms. The Cox model overlapped with either the random forest or the deep learning algorithm for seven risk factors (prior diagnosis of COPD, prior diagnosis of T2DM, prior diagnosis of CHD, diastolic and systolic blood pressure, BMI, Townsend deprivation index). Ethnicity and physical activity (represented by MET-min per week) were important predictors in the Cox model but were not identified as important in the random forest and deep learning models. Instead, the random forest model put emphasis on other measures of adiposity including waist circumference, body fat percentage, and interestingly included skin tone, and two measures of healthy diet (vegetable and fruit consumption). The deep learning model identified alcohol consumption, medication prescribing (digoxin, warfarin, statins), and environmental factors such as residential air pollution and job related hazardous exposures.
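The overlap described above is easier to follow when laid out as sets. The sketch below uses only the factors named in the text; the variable names are my own. Two caveats: the text does not say which of the two machine learning models shares each of the seven factors with the Cox model, so they are kept as one group, and the deep learning list is introduced with "such as", so it may not be complete.

```python
from itertools import combinations

# Factors identified by all three algorithms.
shared_by_all = {
    "age", "prior diagnosis of cancer", "gender",
    "smoking", "FEV1", "education",
}

# Factors the Cox model shares with either the random forest or the deep
# learning model (the text does not say which one for each factor).
cox_and_one_ml_model = {
    "prior diagnosis of COPD", "prior diagnosis of T2DM",
    "prior diagnosis of CHD", "diastolic blood pressure",
    "systolic blood pressure", "BMI", "Townsend deprivation index",
}

# Factors important only in the Cox model.
cox_only = {"ethnicity", "physical activity"}

# Factors emphasised only by the random forest.
random_forest_only = {
    "waist circumference", "body fat percentage", "skin tone",
    "vegetable consumption", "fruit consumption",
}

# Factors named for the deep learning model (possibly not exhaustive).
deep_learning_only = {
    "alcohol consumption", "digoxin", "warfarin", "statins",
    "residential air pollution", "job related hazardous exposures",
}

# The three Cox groups together should account for its top-ranked factors.
cox_top = shared_by_all | cox_and_one_ml_model | cox_only

# No factor should appear in more than one group.
groups = [shared_by_all, cox_and_one_ml_model, cox_only,
          random_forest_only, deep_learning_only]
all_disjoint = all(a.isdisjoint(b) for a, b in combinations(groups, 2))

print("Cox top-ranked factors:", len(cox_top))
print("Groups are disjoint:", all_disjoint)
```

The three Cox groups add up to the 15 top-ranked factors mentioned in the text, which is a quick check that the paragraph has been read correctly.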