Perfluoroalkyl and polyfluoroalkyl substances (PFAS) are a diverse family of synthetic fluorinated aliphatic compounds ubiquitous in industrial applications and household consumer products. Despite their widespread use, little is known about the human toxicity potential of PFAS. Growing national concern has drawn attention to the persistence of PFAS in the environment and the prevalent contamination of drinking water supplies across the country. Because in vivo toxicity experiments are resource- and time-intensive, only limited toxicity data exist for the EPA's growing inventory of over 8,000 PFAS chemicals. In this talk, we will present our study evaluating the performance of multiple machine learning (ML) methods, including random forest (RF), deep neural networks (DNN), graph convolutional networks (GCN), and Gaussian process (GP), for predicting oral rat median lethal doses (LD50) of PFAS compounds. To address the scarcity of toxicity information on PFAS, publicly available datasets of oral rat LD50 for all organic compounds are aggregated and used to develop baseline ML source models. A set of 518 fluorinated compounds containing two or more C-F bonds with known oral rat LD50 is then used for transfer learning: knowledge gained from ensembles of the best-performing source model, the DNN, is leveraged to generate target models for more than 8,000 PFAS, with access to uncertainty estimates. To translate the uncertainty information into an automated model decision, the transfer-learned model is transformed into a selective prediction model, which is allowed to identify regions of prediction with greater confidence and abstain from those with high uncertainty using a calibrated cutoff rate. The deep transfer learning workflow for PFAS toxicity prediction developed in this study can be used for predicting many other toxicity endpoints and can potentially guide future experiments through its automated, uncertainty-informed model decision process.
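The selective prediction step can be illustrated with a minimal sketch. It is not the study's implementation: it assumes that uncertainty is taken as the spread of predictions across ensemble members and that the cutoff is calibrated to retain a chosen fraction of a held-out set. The function names (`ensemble_summary`, `calibrate_cutoff`, `selective_predict`) and the `coverage` parameter are hypothetical.

```python
from statistics import mean, pstdev


def ensemble_summary(member_predictions):
    """Summarize an ensemble: one inner list of predictions per model.

    Returns per-compound means and standard deviations; the standard
    deviation is used here as an assumed stand-in for model uncertainty.
    """
    per_compound = list(zip(*member_predictions))
    means = [mean(values) for values in per_compound]
    stds = [pstdev(values) for values in per_compound]
    return means, stds


def calibrate_cutoff(calibration_stds, coverage):
    """Choose the uncertainty cutoff that keeps roughly `coverage`
    (a fraction in (0, 1]) of the calibration compounds."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ranked = sorted(calibration_stds)
    keep = max(1, int(round(coverage * len(ranked))))
    return ranked[keep - 1]


def selective_predict(means, stds, cutoff):
    """Return the ensemble mean where uncertainty is within the cutoff,
    and None (abstain) where it is not."""
    return [m if s <= cutoff else None for m, s in zip(means, stds)]


# Three hypothetical ensemble members predicting three compounds.
members = [
    [2.0, 3.0, 1.0],
    [2.2, 3.0, 3.0],
    [1.8, 3.0, 2.0],
]
means, stds = ensemble_summary(members)
cutoff = calibrate_cutoff(stds, coverage=0.67)
decisions = selective_predict(means, stds, cutoff)
```

In this toy example the third compound, on which the members disagree most, is the one the model abstains from. In practice the cutoff would be calibrated on data not used for training.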
Vangelis Kourlitis (HEP), Tanwi Mallick (MCS)