Probing Decision Boundaries in Cancer Data Using Noise Injection and Counterfactual Analysis
Education and Training and Outreach
HPC Community Collaboration
HPC Training and Education
Machine Learning and Artificial Intelligence
Time: Sunday, 14 November 2021, 2:15pm - 2:45pm CST
Description: Advanced analyses and computations based on gene expression data are prone to errors because they depend on experimental design, chemical operations and measurements, and data analysis. Assembling and aggregating such data to create deep neural network models may further affect the accuracy of these analyses. For example, the CANDLE NT3 Benchmark uses a table of laboratory-obtained data that maps RNA expression data to a normal or tumor designation, and it is used to make predictions about given expression samples. In this work, we use the NT3 Benchmark to study how injecting bad data at different rates affects the resulting predictions. Our data manipulations include flipping classification labels (label noise) and introducing noise in gene expressions (feature noise). We present results for the performance of both the base NT3 Benchmark and NT3 with the addition of an abstention class in the presence of various types of injected noise. At higher noise levels, the ability of the base network to correctly predict the normal/tumor classification (as measured by the validation accuracy) degrades significantly. Using the abstaining classifier allows the model to learn when the labels have become unreliable and to abstain from providing a prediction in that case, while retaining accuracy.
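The two kinds of injected noise described above can be sketched roughly as follows. This is a minimal illustration, not the NT3 Benchmark's own code: the function names `flip_labels` and `add_feature_noise`, the use of Gaussian feature noise, and the `rate` and `scale` parameters are all assumptions made for the example.

```python
import numpy as np

def flip_labels(y, rate, rng):
    """Label noise: flip a fraction `rate` of binary labels (0 <-> 1).

    Hypothetical helper; assumes labels are encoded as 0 (normal) / 1 (tumor).
    """
    y_noisy = y.copy()
    n_flip = int(round(rate * len(y)))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]
    return y_noisy, idx

def add_feature_noise(X, rate, scale, rng, columns=None):
    """Feature noise: perturb a fraction `rate` of samples on chosen columns.

    Gaussian noise with standard deviation `scale` is an assumption;
    X is expected to be a float array of shape (samples, genes).
    """
    X_noisy = X.copy()
    n_rows = int(round(rate * X.shape[0]))
    rows = rng.choice(X.shape[0], size=n_rows, replace=False)
    cols = np.arange(X.shape[1]) if columns is None else np.asarray(columns)
    noise = rng.normal(0.0, scale, size=(n_rows, len(cols)))
    # np.ix_ selects the (rows x cols) sub-block to perturb
    X_noisy[np.ix_(rows, cols)] += noise
    return X_noisy, rows
```

Sweeping `rate` over a range of values and retraining or re-evaluating the model at each step is one way to trace how validation accuracy responds to increasing noise.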
Counterfactual examples are an example-based interpretability technique used by the explainable AI community. The technique aims to mirror human counterfactual reasoning by finding a minimal set of changes to an input example such that a machine learning model classifies the input into a different class. We demonstrate the use of counterfactual examples to identify the directions normal to the decision boundary ("from normal to tumor") and analyze the resulting "perturbation vectors" (the difference between the generated example and the original input) to identify specific overexpressed genes. Perturbation vectors were separated by class and clustered into groups. From the clustered perturbation vectors, we identify the features that are important for classification. Noise was then injected only on the genes corresponding to the counterfactual perturbation vector while keeping the label the same. We found that, for a trained NT3 model without abstention, this does in fact lead to a steeper degradation of accuracy than incremental noise injection on a randomly chosen set of indices.
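As a rough sketch of the perturbation-vector analysis, the snippet below assumes the counterfactual examples have already been generated by some method (not specified here) and are available as an array `X_cf` aligned row by row with the original inputs `X`. Ranking genes by mean absolute perturbation is an illustrative stand-in for the clustering step described above, and all function names are hypothetical.

```python
import numpy as np

def perturbation_vectors(X, X_cf):
    """Difference between each counterfactual example and its original input."""
    return X_cf - X

def top_features(deltas, k):
    """Indices of the k features with the largest mean absolute perturbation.

    Assumption: mean |delta| is used as a simple importance score.
    """
    score = np.abs(deltas).mean(axis=0)
    return np.argsort(score)[::-1][:k]

def noise_on_features(X, cols, scale, rng):
    """Inject Gaussian noise only on the selected feature columns.

    Labels are left untouched, so only the inputs change.
    """
    X_noisy = X.copy()
    X_noisy[:, cols] += rng.normal(0.0, scale, size=(X.shape[0], len(cols)))
    return X_noisy
```

To reproduce the kind of comparison described above, one could evaluate a trained model on `noise_on_features(X, top_features(deltas, k), scale, rng)` and on the same call with `k` randomly chosen column indices, then compare the two accuracy curves as `scale` grows.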
The top gene symbol is PLOD2, which a 2017 article describes as the highway of cancer cell migration. Other genes identified in the counterfactual analysis include LRTM1, RGS5, TP53I13, MAN1B1, TRRAP and TP53I13, which have all been found overexpressed and linked to studies of urothelial, lung, renal, bladder, ovarian and bone cancer, respectively. We believe that the other genes found here might serve as a good starting point and even lead to new discoveries in the area of cancer research.
The major contributions of this work include 1) a study of model performance under incremental noise injection into the input data, 2) the use of abstention classifiers to combat noisy data in the NT3 dataset, and 3) a technique to highlight the decision boundary of the NT3 model and identify key genes for cancer research with counterfactual analysis.