Drug repositioning, the process of discovering, validating, and marketing previously approved drugs for new indications, is of growing interest to academia and industry due to the reduced time and costs associated with repositioned drugs. Computational methods for repositioning are appealing because they putatively nominate the most promising candidate drugs for a given indication. Comparing the wide array of computational repositioning methods, however, is a challenge due to inconsistencies in method validation in the field. Furthermore, a common simplifying assumption, that all novel predictions are false, is intellectually unsatisfying and hinders reproducibility. We address this assumption by providing a gold standard database, repoDB, that consists of both true positives (approved drugs) and true negatives (failed drugs). We have made the full database and all code used to prepare it publicly available, and have developed a web application that allows users to browse subsets of the data (http://apps.chiragjpgroup.org/repoDB/).
OBJECTIVE: To use natural language processing (NLP) in conjunction with the electronic medical record (EMR) to accurately identify patients with cerebral aneurysms and their matched controls.
METHODS: ICD-9 and Current Procedural Terminology codes were used to obtain an initial data mart of potential aneurysm patients from the EMR. NLP was then used to train a classification algorithm with .632 bootstrap cross-validation used for correction of overfitting bias. The classification rule was then applied to the full data mart. Additional validation was performed on 300 patients classified as having aneurysms. Controls were obtained by matching age, sex, race, and healthcare use.
RESULTS: We identified 55,675 of 4.2 million patients with ICD-9 and Current Procedural Terminology codes consistent with cerebral aneurysms. Of those, 16,823 patients had the term "aneurysm" occur near relevant anatomic terms. After training, a final algorithm consisting of 8 coded and 14 NLP variables was selected, yielding an overall area under the receiver-operating characteristic curve of 0.95. After the final algorithm was applied, 5,589 patients were classified as having aneurysms, and 54,952 controls were matched to those patients. The positive predictive value based on a validation cohort of 300 patients was 0.86.
CONCLUSIONS: We harnessed the power of the EMR by applying NLP to obtain a large cohort of patients with intracranial aneurysms and their matched controls. Such algorithms can be generalized to other diseases for epidemiologic and genetic studies.
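The overfitting correction mentioned in the methods (.632 bootstrap cross-validation) blends the optimistic resubstitution error with the out-of-bag bootstrap error. A minimal sketch of that estimator, using a toy dataset and a trivial threshold classifier as a stand-in for the paper's NLP classification algorithm (all data and the classifier here are hypothetical):

```python
import random

random.seed(0)

# Toy labeled data: feature x, noisy label y = 1 when x is large
data = [(x, int(x + random.gauss(0, 0.5) > 0))
        for x in (random.gauss(0, 1) for _ in range(200))]

def train_threshold(sample):
    """Fit a trivial threshold classifier: predict 1 when x > t.
    t is the midpoint between class means (a stand-in for a real model)."""
    m1 = [x for x, y in sample if y == 1]
    m0 = [x for x, y in sample if y == 0]
    return (sum(m1) / len(m1) + sum(m0) / len(m0)) / 2

def error(t, sample):
    return sum((x > t) != y for x, y in sample) / len(sample)

# Apparent (resubstitution) error: optimistic because train == test
t_full = train_threshold(data)
apparent = error(t_full, data)

# Out-of-bag error averaged over B bootstrap resamples
B, oob_errs = 50, []
for _ in range(B):
    idx = [random.randrange(len(data)) for _ in range(len(data))]
    boot = [data[i] for i in idx]
    held_out = [data[i] for i in set(range(len(data))) - set(idx)]
    if held_out:
        oob_errs.append(error(train_threshold(boot), held_out))
oob = sum(oob_errs) / len(oob_errs)

# The .632 estimator blends the two to correct overfitting bias
err_632 = 0.368 * apparent + 0.632 * oob
print(round(err_632, 3))
```

The 0.632 weight reflects the expected fraction of distinct observations appearing in a bootstrap resample.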
Objective. Rheumatoid arthritis (RA) patients develop autoantibodies against a spectrum of antigens, but the clinical significance of these autoantibodies is unclear. Using the phenome-wide association study (PheWAS) approach, we examined the association between autoantibodies and clinical subphenotypes of RA.
Methods. This study was conducted using a validated electronic medical record (EMR) RA cohort from 2 tertiary care centers. Using a published multiplex bead assay, we measured 36 autoantibodies targeting epitopes implicated in RA. We extracted all ICD-9 codes for each subject and grouped them using a published method into disease categories (PheWAS codes). We tested for the association of each autoantibody grouped by targeted protein with PheWAS codes. For significant associations (false discovery rate [FDR] ≤0.1), we reviewed 50 medical records of subjects with each PheWAS code to determine the positive predictive value (PPV).
Results. We studied 1006 RA subjects, mean age 61.0 years (SD 12.9) and 79.0% female. There were 3,568 unique ICD-9 codes grouped into 625 PheWAS codes; 206 PheWAS codes with a prevalence ≥3% were studied. PheWAS identified 24 significant associations of autoantibodies to epitopes at FDR ≤0.1. The strongest associations with the highest PPVs included autoantibodies against fibronectin with obesity (p=6.1x10−4, PPV 100%) and against fibrinogen with pneumonopathy (p=2.7x10−4, PPV 96%). The latter included diagnoses of cryptogenic organizing pneumonia and obliterative bronchiolitis.
Conclusion. We demonstrated the application of a bioinformatics method, the PheWAS, to screen for the clinical significance of RA-related autoantibodies. PheWAS identified potentially significant links between variations in levels of autoantibodies and comorbidities of interest in RA.
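The multiple-testing step above (calling associations significant at FDR ≤ 0.1) is typically done with the Benjamini-Hochberg step-up procedure. A minimal sketch on hypothetical p-values from testing one autoantibody against several PheWAS codes:

```python
def benjamini_hochberg(pvals, fdr=0.1):
    """Return indices of hypotheses declared significant at the given
    false discovery rate via the Benjamini-Hochberg step-up procedure."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m = len(pvals)
    k = 0  # largest rank whose p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * fdr:
            k = rank
    return sorted(order[:k])

# Hypothetical p-values, one per PheWAS code, for a single autoantibody
pvals = [0.0006, 0.0003, 0.04, 0.2, 0.5, 0.01, 0.7, 0.9, 0.03, 0.06]
print(benjamini_hochberg(pvals, fdr=0.1))  # → [0, 1, 2, 5, 8, 9]
```

All hypotheses up to the largest rank passing its threshold are rejected, which is why 0.04 and 0.06 survive here even though 0.04 alone would not.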
It is now well recognized that the effectiveness and potential risk of a treatment often vary by patient subgroup. Although trial-and-error and one-size-fits-all approaches to treatment selection remain common practice, much recent focus has been placed on individualized treatment selection based on patient information (La Thangue and Kerr, 2011; Ong et al., 2012). Genetic and molecular markers are becoming increasingly available to guide treatment selection for various diseases, including HIV and breast cancer (Mallal et al., 2008; Zujewski and Kamin, 2008). In recent years, many statistical procedures for developing individualized treatment rules (ITRs) have been proposed. However, less attention has been given to efficient selection of predictive biomarkers for treatment selection. The standard Wald test for interactions between treatment and the set of markers of interest may not work well when the marker effects are nonlinear. Furthermore, the interaction-based test is scale-dependent and may fail to capture markers useful for predicting individualized treatment differences. In this article, we propose to overcome these difficulties by developing a kernel machine (KM) score test that can efficiently identify markers predictive of treatment difference. Simulation studies show that the proposed KM-based score test is more powerful than the Wald test when there is a nonlinear effect among the predictors and when the outcome is binary with a nonlinear link function. Furthermore, when there is high correlation among predictors and when the number of predictors is not small, our method also outperforms the Wald test. The proposed method is illustrated with two randomized clinical trials.
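The core idea of a kernel machine score test, a quadratic form of residuals in a kernel built from the markers, with significance assessed against a null that breaks the treatment-marker interaction, can be sketched as follows. This is a simplified illustration, not the paper's exact statistic: the data are simulated, the sign-flip contrast and permutation calibration are one simple way to target treatment difference, and the Gaussian kernel bandwidth is fixed arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy trial: markers Z, binary treatment A, outcome with a nonlinear
# treatment-by-marker interaction (the signal a Wald test can miss)
n, p = 150, 3
Z = rng.normal(size=(n, p))
A = rng.integers(0, 2, size=n)
y = 0.5 * A + np.where(A == 1, np.sin(Z[:, 0]) * Z[:, 1], 0.0) \
    + rng.normal(scale=0.5, size=n)

def gaussian_kernel(Z, rho=1.0):
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / rho)

# Residuals under the null model (treatment main effect only)
r = y - np.array([y[A == a].mean() for a in A])

# Score-type statistic: quadratic form of arm-signed residuals in the
# marker kernel; it grows when similar-marker patients in the same arm
# have correlated residuals, i.e. markers predict treatment difference
K = gaussian_kernel(Z)
contrast = np.where(A == 1, 1.0, -1.0) * r
Q_obs = contrast @ K @ contrast

# Permutation null: shuffling treatment labels breaks the interaction
perms = 200
Q_null = np.empty(perms)
for b in range(perms):
    Ab = rng.permutation(A)
    cb = np.where(Ab == 1, 1.0, -1.0) * r
    Q_null[b] = cb @ K @ cb
pval = (1 + (Q_null >= Q_obs).sum()) / (1 + perms)
print(round(pval, 4))
```

Because the kernel absorbs nonlinearity in the marker effects, no parametric interaction form needs to be specified, which is the advantage over the Wald interaction test described above.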
Considerable interest has recently been focused on studying multiple phenotypes simultaneously in both epidemiological and genomic studies, either to capture the multidimensionality of complex disorders or to understand shared etiology of related disorders. We seek to identify multiple regulators or predictors that are associated with multiple outcomes when these outcomes may be measured on very different scales or composed of a mixture of continuous, binary, and not-fully observed elements. We first propose an estimation technique to put all effects on similar scales, and we induce sparsity on the estimated effects. We provide standard asymptotic results for this estimator and show that resampling can be used to quantify uncertainty in finite samples. We finally provide a multiple testing procedure which can be geared specifically to the types of multiple regulators of interest, and we establish that, under standard regularity conditions, the familywise error rate will approach 0 as sample size diverges. Simulation results indicate that our approach can improve over unregularized methods both in reducing bias in estimation and improving power for testing.
OBJECTIVE: Phenotyping algorithms are capable of accurately identifying patients with specific phenotypes from within electronic medical records systems. However, developing phenotyping algorithms in a scalable way remains a challenge due to the extensive human resources required. This paper introduces a high-throughput unsupervised feature selection method, which improves the robustness and scalability of electronic medical record phenotyping without compromising its accuracy.
METHODS: The proposed Surrogate-Assisted Feature Extraction (SAFE) method selects candidate features from a pool of comprehensive medical concepts found in publicly available knowledge sources. The target phenotype's International Classification of Diseases, Ninth Revision and natural language processing counts, acting as noisy surrogates to the gold-standard labels, are used to create silver-standard labels. Candidate features highly predictive of the silver-standard labels are selected as the final features.
RESULTS: Algorithms were trained to identify patients with coronary artery disease, rheumatoid arthritis, Crohn's disease, and ulcerative colitis using various numbers of labels to compare the performance of features selected by SAFE, a previously published automated feature extraction for phenotyping procedure, and domain experts. The out-of-sample area under the receiver operating characteristic curve and F-score from SAFE algorithms were remarkably higher than those from the other two, especially at small label sizes.
CONCLUSION: SAFE advances high-throughput phenotyping methods by automatically selecting a succinct set of informative features for algorithm training, which in turn reduces overfitting and the needed number of gold-standard labels. SAFE also potentially identifies important features missed by automated feature extraction for phenotyping or experts.
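The silver-standard idea above can be sketched in a few lines: extreme combined ICD/NLP counts stand in for gold-standard labels, and candidate features are screened by how well they predict those labels. Everything here is a hypothetical simplification; the cohort is simulated, the extremes-based labeling rule and the plain correlation screen are simple stand-ins for SAFE's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy cohort: per-patient counts of the phenotype's ICD-9 code and its
# NLP concept mentions (noisy surrogates), plus candidate features
n = 500
true_case = rng.random(n) < 0.3
icd = rng.poisson(np.where(true_case, 4, 0.3))
nlp = rng.poisson(np.where(true_case, 3, 0.4))
features = np.column_stack([
    rng.poisson(np.where(true_case, 2.0, 0.5)),  # informative
    rng.poisson(np.where(true_case, 1.5, 0.6)),  # informative
    rng.poisson(1.0, n),                         # pure noise
])

# Silver-standard labels: patients with very high combined surrogate
# counts are labeled 1, very low 0; the ambiguous middle is dropped
s = np.log1p(icd) + np.log1p(nlp)
hi, lo = np.quantile(s, 0.85), np.quantile(s, 0.4)
keep = (s >= hi) | (s <= lo)
silver = (s[keep] >= hi).astype(float)

# Screen: retain features strongly associated with the silver labels
corrs = [abs(np.corrcoef(features[keep, j], silver)[0, 1])
         for j in range(features.shape[1])]
selected = [j for j, c in enumerate(corrs) if c > 0.2]
print(selected)
```

No gold-standard labels are consumed at this stage; chart review is reserved for training the final algorithm on the selected features.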
When evaluating a new therapy versus a control via a randomized, comparative clinical study or a series of trials, heterogeneity of the study patient population may call for a pre-specified, predictive enrichment procedure to identify an "enrichable" subpopulation. For patients in this subpopulation, the therapy is expected to have a desirable overall risk-benefit profile. To develop and validate such a "therapy-diagnostic co-development" strategy, a three-step procedure may be conducted with three independent data sets from a series of similar studies or a single trial. At the first step, we create various candidate scoring systems based on the baseline information of the patients via, for example, parametric models using the first data set. Each individual score reflects an anticipated average treatment difference for future patients who share similar baseline profiles. A large score indicates that these patients tend to benefit from the new therapy. At the second step, a potentially promising, enrichable subgroup is identified using the totality of evidence from these scoring systems. At the final step, we validate such a selection via two-sample inference procedures for assessing the treatment effectiveness statistically and clinically with the third data set, the so-called holdout sample. When the study size is not large, one may combine the first two steps using a "cross-training-evaluation" process. Comprehensive numerical studies are conducted to investigate the operational characteristics of the proposed method. The entire enrichment procedure is illustrated with data from a cardiovascular trial evaluating a beta-blocker versus a placebo for treating chronic heart failure patients.
Matrix completion has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. Current literature on matrix completion focuses primarily on independent sampling models under which the individual observed entries are sampled independently. Motivated by applications in genomic data integration, we propose a new framework of structured matrix completion (SMC) to treat structured missingness by design. Specifically, our proposed method aims at efficient matrix recovery when a subset of the rows and columns of an approximately low-rank matrix are observed. We provide theoretical justification for the proposed SMC method and derive a lower bound for the estimation errors, which together establish the optimal rate of recovery over certain classes of approximately low-rank matrices. Simulation studies show that the method performs well in finite samples under a variety of configurations. The method is applied to integrate several ovarian cancer genomic studies with different extents of genomic measurement, which enables us to construct more accurate prediction rules for ovarian cancer survival.
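The structured missingness pattern above, where whole blocks of rows and columns are unobserved by design, admits a simple closed-form recovery in the exactly low-rank case: if the observed top-left block carries the full rank, the missing bottom-right block equals A21 A11+ A12 (with A11+ the pseudoinverse). A toy sketch of that special case (the simulated "expression" matrix and block sizes are arbitrary; SMC itself refines this idea for approximately low-rank matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build an exactly rank-2 "expression" matrix: rows = genes, cols = samples
n, m, k = 40, 30, 2
A = rng.normal(size=(n, k)) @ rng.normal(size=(k, m))

# Structured missingness by design: one study measured all genes on its
# samples (left blocks), the other measured only the first r genes
# (top-right block); the bottom-right block A22 is never observed.
r, c = 10, 12
A11, A12 = A[:r, :c], A[:r, c:]
A21, A22 = A[r:, :c], A[r:, c:]

# When rank(A11) equals rank(A), the missing block is recovered exactly
A22_hat = A21 @ np.linalg.pinv(A11) @ A12
err = np.linalg.norm(A22_hat - A22) / np.linalg.norm(A22)
print(err)
```

Independent-entry completion methods waste this block structure; exploiting it directly is what distinguishes SMC from the standard setting.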
Long-term follow-up is common in many medical investigations where the interest lies in predicting patients' risks for a future adverse outcome using repeatedly measured predictors over time. A key quantity is the likelihood of developing an adverse outcome among individuals who survived up to time s given their covariate information up to time s. Simple, yet reliable, methodology for updating the predicted risk of disease progression using longitudinal markers remains elusive. Two main approaches have been considered in the literature. One approach, based on joint modeling (JM) of failure time and longitudinal covariate process (Tsiatis and Davidian, 2004), derives such longitudinal predictive probability from the joint probability of a longitudinal marker and an event at a given time. A second approach, the partly conditional (PC) modeling (Zheng and Heagerty, 2005), directly models the predictive probability conditional on survival up to a landmark time and information accrued by that time. In this article, we propose new PC models for longitudinal prediction that are more flexible than joint modeling and improve the prediction accuracy over existing PC models. We provide procedures for making inference regarding future risk for an individual with longitudinal measures up to a given time. In addition, we conduct simulations to evaluate both JM and PC approaches in order to provide practical guidance on modeling choices. We use standard measures of predictive accuracy adapted to our setting to explore the predictiveness of the two approaches. We illustrate the performance of the two approaches on a dataset from the End Stage Renal Disease Study (ESRDS).
Identifying environmental exposures associated with blood pressure is a priority. Recently, we proposed the environment-wide association study to search for and replicate environmental factors associated with phenotypes. We conducted the environment-wide association study (EWAS) using the National Health and Nutrition Examination Surveys (1999–2012), which evaluated a total of 71,916 participants, to prioritize environmental factors associated with systolic and diastolic blood pressure. We searched for factors in participants from survey years 1999–2006 and tentatively replicated findings in participants from years 2007–2012. Finally, we estimated the overall association and performed a second meta-analysis using all survey years (1999–2012). For systolic blood pressure, self-reported alcohol consumption emerged as our top finding (a 0.04 mmHg increase in systolic blood pressure per 1 standard deviation increase in self-reported alcohol), though the effect size is small. For diastolic blood pressure, urinary cesium was tentatively replicated; however, this factor demonstrated high heterogeneity between populations (I2 = 51%). The paucity of associations across this wide an analysis calls for a broader search for environmental factors in blood pressure.
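The two-stage search-then-replicate design above can be sketched with a toy screen. This is a deliberate simplification: the data are simulated, exposures are tested one at a time with a plain t-statistic and normal-approximation thresholds, and the survey weighting, covariate adjustment, and meta-analysis of the real EWAS are omitted.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy survey: 2000 participants, 50 candidate exposures, one of which
# (exposure 0) truly shifts systolic blood pressure
n, p = 2000, 50
X = rng.normal(size=(n, p))
sbp = 120 + 0.3 * X[:, 0] + rng.normal(scale=1.0, size=n)

def t_stat(x, y):
    """t-statistic for the slope in a simple regression of y on x."""
    x = x - x.mean()
    y = y - y.mean()
    beta = (x @ y) / (x @ x)
    resid = y - beta * x
    se = np.sqrt(resid @ resid / (len(y) - 2) / (x @ x))
    return beta / se

# Two-stage screen: search in one survey wave, then tentatively
# replicate in a held-out wave (stricter threshold at discovery)
half = n // 2
hits = [j for j in range(p)
        if abs(t_stat(X[:half, j], sbp[:half])) > 3.0      # discovery
        and abs(t_stat(X[half:, j], sbp[half:])) > 1.96]   # replication
print(hits)
```

Requiring an exposure to clear both waves is what keeps the family of tens to hundreds of exposures from flooding the results with false positives.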
Objective Drug repositioning is a promising methodology for reducing the cost and duration of the drug discovery pipeline. We sought to develop a computational repositioning method leveraging annotations in the literature, such as Medical Subject Heading (MeSH) terms.
Methods We developed software to determine significantly co-occurring drug-MeSH term pairs and a method to estimate pair-wise literature-derived distances between drugs.
Results We found that literature-based drug-drug similarities predicted the number of shared indications across drug-drug pairs. Clustering drugs based on their similarity revealed both known and novel drug indications. We demonstrate the utility of our approach by generating repositioning hypotheses for the commonly used diabetes drug metformin.
Conclusion Our study demonstrates that literature-derived similarity is useful for identifying potential repositioning opportunities. We provided open-source code and deployed a free-to-use, interactive application to explore our database of similarity-based drug clusters (available at http://apps.chiragjpgroup.org/MeSHDD/).
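The literature-derived distance described above can be illustrated with a toy version: each drug is reduced to the set of MeSH terms significantly enriched in its literature, and pairwise distance is one minus a set-overlap coefficient. The drug profiles below are hypothetical, and the Dice coefficient is one simple choice of overlap measure, not necessarily the paper's exact metric.

```python
# Each drug is represented by the set of MeSH terms significantly
# enriched in its literature (hypothetical toy profiles)
profiles = {
    "metformin":   {"Diabetes Mellitus", "Insulin Resistance", "Neoplasms"},
    "glipizide":   {"Diabetes Mellitus", "Insulin Resistance", "Hypoglycemia"},
    "simvastatin": {"Hypercholesterolemia", "Cardiovascular Diseases"},
}

def mesh_distance(a, b):
    """Literature-derived distance: 1 minus the Dice coefficient of the
    two drugs' enriched-MeSH-term sets."""
    inter = len(profiles[a] & profiles[b])
    return 1 - 2 * inter / (len(profiles[a]) + len(profiles[b]))

for a, b in [("metformin", "glipizide"), ("metformin", "simvastatin")]:
    print(a, b, round(mesh_distance(a, b), 3))
```

Drugs with small pairwise distances cluster together, and a cluster mixing drugs of known and unknown indications is what generates a repositioning hypothesis.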
Repositioning of previously approved drugs is a promising methodology because it reduces the cost and duration of the drug development pipeline and reduces the likelihood of unforeseen adverse events. Computational repositioning is especially appealing because of the ability to rapidly screen candidates in silico and to reduce the number of possible repositioning candidates. What is unclear, however, is how useful such methods are in producing clinically efficacious repositioning hypotheses. Furthermore, there is no agreement in the field over the proper way to perform validation of in silico predictions, and in fact no systematic review of repositioning validation methodologies exists. To address this unmet need, we review the computational repositioning literature and capture studies in which authors claimed to have validated their work. Our analysis reveals widespread variation in the types of strategies, predictions made and databases used as 'gold standards'. We highlight a key weakness of the most commonly used strategy and propose a path forward for the consistent analytic validation of repositioning techniques.
It is a public health priority to identify the adverse and non-adverse associations between pharmaceutical medications and cancer. We search for and evaluate associations between all prescribed medications and longitudinal cancer risk in participants of the Swedish Cancer Register (N = 9,014,975). We associated 552 different medications with incident cancer risk (any, breast, colon, and prostate) during 5.5 years of follow-up (7/1/2005-12/31/2010) in two types of statistical models, time-to-event and case-crossover. After multiple hypotheses correction and replication, 141 (26%) drugs were associated with any cancer in a time-to-event analysis constraining drug exposure to 1 year before first cancer diagnosis and adjusting for history of medication use. In a case-crossover analysis, 36 drugs (7%) were associated with decreased cancer risk. 12 drugs were found in common in both analyses with concordant direction of association. We found 14%, 10%, and 7% of all drugs associated with colon, prostate, and breast cancers, respectively, in time-to-event models, but only 1%, 2%, and 0% for these cancers in case-crossover analyses. Pharmacoepidemiologic analyses of cancer risk are sensitive to modeling choices, and false-positive findings are a threat. Medication-wide analyses using different analytical models may help suggest consistent signals of increased cancer risk.
Machine learning techniques can be used to extract predictive models for diseases from electronic medical records (EMRs). However, the nature of EMRs makes it difficult to apply off-the-shelf machine learning techniques while still exploiting the rich content of the EMRs. In this paper, we explore the usage of a range of natural language processing (NLP) techniques to extract valuable predictors from uncoded consultation notes and study whether they can help to improve predictive performance.
We study a number of existing techniques for the extraction of predictors from the consultation notes, namely a bag-of-words-based approach and topic modeling. In addition, we develop a dedicated technique to match the uncoded consultation notes with a medical ontology. We apply these techniques as an extension to an existing pipeline to extract predictors from EMRs. We evaluate them in the context of predictive modeling for colorectal cancer (CRC), a disease known to be difficult to diagnose before performing an endoscopy.
Our results show that we are able to extract useful information from the consultation notes. The predictive performance of the ontology-based extraction method moves significantly beyond the benchmark of age and gender alone (area under the receiver operating characteristic curve (AUC) of 0.870 versus 0.831). We also observe more accurate predictive models by adding features derived from processing the consultation notes compared to solely using coded data (AUC of 0.896 versus 0.882), although the difference is not significant. The extracted features from the notes are shown to be equally predictive (i.e. there is no significant difference in performance) compared to the coded data of the consultations.
It is possible to extract useful predictors from uncoded consultation notes that improve predictive performance. Techniques linking text to concepts in medical ontologies to derive these predictors are shown to perform best for predicting CRC in our EMR dataset.
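The simplest of the note-processing techniques above, the bag-of-words representation, turns each uncoded note into a term-count vector that can be appended to the coded features of a predictive model. A minimal sketch (the consultation notes here are invented for illustration):

```python
from collections import Counter

# Toy uncoded consultation notes (hypothetical free text)
notes = [
    "patient reports rectal bleeding and weight loss",
    "routine check up no complaints",
    "abdominal pain and altered bowel habit for weeks",
]

# Bag-of-words extraction: a shared vocabulary over all notes, then
# each note becomes a vector of term counts in vocabulary order
vocab = sorted({w for note in notes for w in note.split()})

def bow(note):
    counts = Counter(note.split())
    return [counts[w] for w in vocab]

vectors = [bow(n) for n in notes]
print(len(vocab), vectors[0])
```

Topic modeling and ontology matching replace these raw counts with topic proportions or matched-concept indicators, respectively, but the downstream use as extra predictor columns is the same.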
Natural language processing, Predictive modeling, Uncoded consultation notes, Colorectal cancer
The National Health and Nutrition Examination Survey (NHANES) is a population survey implemented by the Centers for Disease Control and Prevention (CDC) to monitor the health of the United States; its data are publicly available in hundreds of files. This Data Descriptor describes a single unified and universally accessible data file, merging 255 separate files and stitching data across 4 surveys, encompassing 41,474 individuals and 1,191 variables. The variables consist of phenotype and environmental exposure information on each individual, specifically (1) demographic information, (2) physical exam results (e.g., height, body mass index), (3) laboratory results (e.g., cholesterol, glucose, and environmental exposures), and (4) questionnaire items. The Data Descriptor also describes a dictionary that enables analysts to find variables by category and human-readable description. The datasets are available on DataDryad, and a hands-on analytics tutorial is available on GitHub. Through a new big data platform, the BD2K Patient Centered Information Commons (http://pic-sure.org), we provide a new way to browse the dataset via a web browser (https://nhanes.hms.harvard.edu) and an application programming interface for programmatic access.
Background: Repositioning approved drugs and small molecules in novel therapeutic areas is of key interest to the pharmaceutical industry. A number of promising computational techniques have been developed to aid in repositioning; however, the majority of available methodologies require highly specific data inputs that preclude the use of many datasets and databases. There is a clear unmet need for a generalized methodology that enables the integration of multiple types of both gene expression data and database schemas.
Results: ksRepo eliminates the need for a single microarray platform as input and allows for the use of a variety of drug and chemical exposure databases. We tested ksRepo’s performance on a set of five prostate cancer datasets using the Comparative Toxicogenomics Database (CTD) as our database of gene-compound interactions. ksRepo successfully predicted significance for five frontline prostate cancer therapies, representing a significant enrichment from over 7000 CTD compounds, and achieved specificity similar to other repositioning methods.
Conclusions: We present ksRepo, which enables investigators to use any data inputs for computational drug repositioning. ksRepo is implemented in a series of four functions in the R statistical environment under a BSD3 license. Source code is freely available at http://github.com/adam-sam-brown/ksRepo. A vignette is provided to aid users in performing ksRepo analysis.
Keywords: Repositioning, Drug discovery, Prostate cancer, Gene expression
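The "ks" in ksRepo refers to a Kolmogorov-Smirnov-style enrichment statistic: a compound's database-annotated genes are located within a disease's ranked differential-expression list, and significance comes from comparing the maximal enrichment against shuffled rankings. A toy sketch of that idea (the gene list, compound gene set, and one-sided running-sum statistic here are illustrative simplifications, not ksRepo's exact implementation):

```python
import random

random.seed(0)

# Disease signature: all genes ranked by differential expression
genes = [f"g{i}" for i in range(1000)]
# Hypothetical compound whose annotated genes sit near the top
compound_genes = {f"g{i}" for i in range(0, 50, 2)}

def ks_statistic(ranked, gene_set):
    """One-sided KS-style enrichment: maximum gap between the running
    fraction of set genes seen and the uniform expectation."""
    hits, stat = 0, 0.0
    for pos, g in enumerate(ranked, start=1):
        if g in gene_set:
            hits += 1
        stat = max(stat, hits / len(gene_set) - pos / len(ranked))
    return stat

obs = ks_statistic(genes, compound_genes)

# Permutation null: shuffle the ranking to calibrate how extreme obs is
null = []
for _ in range(200):
    shuffled = genes[:]
    random.shuffle(shuffled)
    null.append(ks_statistic(shuffled, compound_genes))
pval = (1 + sum(s >= obs for s in null)) / (1 + 200)
print(round(pval, 3))
```

Because the statistic needs only a ranked gene list and a gene set, any expression platform and any gene-compound database (such as CTD) can be plugged in, which is the flexibility the abstract emphasizes.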
Background: It is hypothesized that environmental exposures and behaviour influence telomere length, an indicator of cellular ageing. We systematically associated 461 indicators of environmental exposures, physiology and self-reported behaviour with telomere length in data from the US National Health and Nutrition Examination Survey (NHANES) in 1999–2002. Further, we tested whether factors identified in the NHANES participants are also correlated with gene expression of telomere length modifying genes.
Methods: We correlated 461 environmental exposures, behaviours and clinical variables with telomere length, using survey-weighted linear regression, adjusting for sex, age, age squared, race/ethnicity, poverty level, education and born outside the USA, and estimated the false discovery rate to adjust for multiple hypotheses. We conducted a secondary analysis to investigate the correlation between identified environmental variables and gene expression levels of telomere-associated genes in publicly available gene expression samples.
Results: After correlating 461 variables with telomere length, we found 22 variables significantly associated with telomere length after adjustment for multiple hypotheses. Of these variables, 14 were associated with longer telomeres, including biomarkers of polychlorinated biphenyls [PCBs; 0.1 to 0.2 standard deviation (SD) increase for a 1 SD increase in PCB level, P < 0.002] and a form of vitamin A, retinyl stearate. Eight variables were associated with shorter telomeres, including biomarkers of cadmium, C-reactive protein and lack of physical activity. We could not conclude that PCBs are correlated with gene expression of telomere-associated genes.
Conclusions: Both environmental exposures and chronic disease-related risk factors may play a role in telomere length. Our secondary analysis found no evidence of association between PCBs/smoking and gene expression of telomere-associated genes. All correlations between exposures, behaviours and clinical factors and changes in telomere length will require further investigation into the biological influence of these exposures.
There are now hundreds of thousands of pathogenicity assertions that relate genetic variation to disease, but most of this clinically utilized variation has no accepted quantitative disease risk estimate. Recent disease-specific studies have used control sequence data to reclassify large amounts of prior pathogenic variation, but there is a critical need to scale up both the pace and feasibility of such pathogenicity reassessments across human disease. In this manuscript we develop a shareable computational framework to quantify pathogenicity assertions. We release a reproducible “digital notebook” that integrates executable code, text annotations, and mathematical expressions in a freely accessible statistical environment. We extend previous disease-specific pathogenicity assessments to over 6,000 diseases and 160,000 assertions in the ClinVar database. Investigators can use this platform to prioritize variants for reassessment and tailor genetic model parameters (such as prevalence and heterogeneity) to expose the uncertainty underlying pathogenicity-based risk assessments. Finally, we release a website that links users to pathogenic variation for a queried disease, supporting literature, and implied disease risk calculations subject to user-defined and disease-specific genetic risk models in order to facilitate variant reassessments.