%0 Journal Article %J JAMIA Open %D 2022 %T Streamlining statistical reproducibility: NHLBI ORCHID clinical trial results reproduction %A Serret-Larmande, Arnaud %A Kaltman, Jonathan R %A Avillach, Paul %X Reproducibility in medical research has been a long-standing issue. More recently, the COVID-19 pandemic has publicly underlined this fact as the retraction of several studies reached general media audiences. A significant number of these retractions occurred after in-depth scrutiny of the methodology and results by the scientific community. Consequently, these retractions have undermined confidence in the peer-review process, which is not considered sufficiently reliable to generate trust in the published results. This partly stems from opacity in published results, as the practical implementation of the statistical analysis often remains undisclosed. We present a workflow that uses a combination of informatics tools to foster statistical reproducibility: an open-source programming language, Jupyter Notebook, a cloud-based data repository, and an application programming interface can streamline an analysis and help to kick-start new analyses. We illustrate this principle by (1) reproducing the results of the ORCHID clinical trial, which evaluated the efficacy of hydroxychloroquine in COVID-19 patients, and (2) expanding on the analyses conducted in the original trial by investigating the association of premedication with biological laboratory results. Such workflows will be encouraged for future publications from National Heart, Lung, and Blood Institute-funded studies. %B JAMIA Open %V 5 %G eng %U https://doi.org/10.1093/jamiaopen/ooac001 %N 1 %0 Journal Article %J Scientific Data %D 2017 %T A standard database for drug repositioning %A Brown, Adam S. %A Patel, Chirag J. %X

Drug repositioning, the process of discovering, validating, and marketing previously approved drugs for new indications, is of growing interest to academia and industry due to the reduced time and costs associated with repositioned drugs. Computational methods for repositioning are appealing because they putatively nominate the most promising candidate drugs for a given indication. Comparing the wide array of computational repositioning methods, however, is a challenge due to inconsistencies in method validation in the field. Furthermore, a common simplifying assumption, that all novel predictions are false, is intellectually unsatisfying and hinders reproducibility. We address this assumption by providing a gold standard database, repoDB, that consists of both true positives (approved drugs) and true negatives (failed drugs). We have made the full database and all code used to prepare it publicly available, and have developed a web application that allows users to browse subsets of the data (http://apps.chiragjpgroup.org/repoDB/).

%B Scientific Data %V 4 %P 1-7 %G eng %U http://dx.doi.org/10.1038/sdata.2017.29 %N 170029 %0 Journal Article %J Stat Med %D 2017 %T Evaluating surrogate marker information using censored data %A Parast, Layla %A Cai, Tianxi %A Tian, Lu %X Given the long follow-up periods that are often required for treatment or intervention studies, the potential to use surrogate markers to decrease the required follow-up time is a very attractive goal. However, previous studies have shown that using inadequate markers or making inappropriate assumptions about the relationship between the primary outcome and surrogate marker can lead to inaccurate conclusions regarding the treatment effect. Currently available methods for identifying and validating surrogate markers tend to rely on restrictive model assumptions and/or focus on uncensored outcomes. Using such methods in practice when the primary outcome of interest is a time-to-event outcome is difficult because of censoring and missing surrogate information among those who experience the primary outcome before surrogate marker measurement. In this paper, we propose a novel definition of the proportion of treatment effect explained by surrogate information collected up to a specified time in the setting of a time-to-event primary outcome. Our proposed approach accommodates a setting where individuals may experience the primary outcome before the surrogate marker is measured. We propose a robust non-parametric procedure to estimate the defined quantity using censored data and use a perturbation-resampling procedure for variance estimation. Simulation studies demonstrate that the proposed procedures perform well in finite samples. We illustrate the proposed procedures by investigating two potential surrogate markers for diabetes using data from the Diabetes Prevention Program. %B Stat Med %8 2017 Jan 15 %G eng %1 http://www.ncbi.nlm.nih.gov/pubmed/28088843?dopt=Abstract %R 10.1002/sim.7220 %0 Journal Article %J Neurology %D 2017 %T Large-scale identification of patients with cerebral aneurysms using natural language processing %A Castro, Victor M %A Dligach, Dmitriy %A Finan, Sean %A Yu, Sheng %A Can, Anil %A Abd-El-Barr, Muhammad %A Gainer, Vivian %A Shadick, Nancy A %A Murphy, Shawn %A Cai, Tianxi %A Savova, Guergana %A Weiss, Scott T %A Du, Rose %X OBJECTIVE: To use natural language processing (NLP) in conjunction with the electronic medical record (EMR) to accurately identify patients with cerebral aneurysms and their matched controls. METHODS: ICD-9 and Current Procedural Terminology codes were used to obtain an initial data mart of potential aneurysm patients from the EMR. NLP was then used to train a classification algorithm, with .632 bootstrap cross-validation used to correct for overfitting bias. The classification rule was then applied to the full data mart. Additional validation was performed on 300 patients classified as having aneurysms. Controls were obtained by matching age, sex, race, and healthcare use. RESULTS: We identified 55,675 of 4.2 million patients with ICD-9 and Current Procedural Terminology codes consistent with cerebral aneurysms. Of those, 16,823 patients had the term aneurysm occur near relevant anatomic terms. After training, a final algorithm consisting of 8 coded and 14 NLP variables was selected, yielding an overall area under the receiver-operating characteristic curve of 0.95.
After the final algorithm was applied, 5,589 patients were classified as having aneurysms, and 54,952 controls were matched to those patients. The positive predictive value based on a validation cohort of 300 patients was 0.86. CONCLUSIONS: We harnessed the power of the EMR by applying NLP to obtain a large cohort of patients with intracranial aneurysms and their matched controls. Such algorithms can be generalized to other diseases for epidemiologic and genetic studies. %B Neurology %V 88 %P 164-168 %8 2017 Jan 10 %G eng %N 2 %1 http://www.ncbi.nlm.nih.gov/pubmed/27927935?dopt=Abstract %R 10.1212/WNL.0000000000003490 %0 Journal Article %J Arthritis & Rheumatology %D 2017 %T Phenome-Wide Association Study of Autoantibodies to Citrullinated and Noncitrullinated Epitopes in Rheumatoid Arthritis %A Liao, Katherine P %A Sparks, Jeffrey A %A Hejblum, Boris P. %A Kuo, I-Hsin %A Cui, Jing %A Lahey, Lauren J %A Cagan, Andrew %A Gainer, Vivian S %A Liu, Weidong %A Cai, T Tony %A Sokolove, Jeremy %A Cai, Tianxi %X

Objective. RA patients develop autoantibodies against a spectrum of antigens, but the clinical significance of these autoantibodies is unclear. Using the phenome-wide association study (PheWAS) approach, we examined the association between autoantibodies and clinical subphenotypes of RA.

Methods. This study was conducted using a validated electronic medical record (EMR) RA cohort from 2 tertiary care centers. Using a published multiplex bead assay, we measured 36 autoantibodies targeting epitopes implicated in RA. We extracted all ICD-9 codes for each subject and grouped them into disease categories (PheWAS codes) using a published method. We tested for the association of each autoantibody, grouped by targeted protein, with PheWAS codes. For significant associations (false discovery rate [FDR] ≤0.1), we reviewed 50 medical records of subjects with each PheWAS code to determine the positive predictive value (PPV).

Results. We studied 1006 RA subjects, mean age 61.0 years (SD 12.9), 79.0% female. There were 3,568 unique ICD-9 codes grouped into 625 PheWAS codes; 206 PheWAS codes with a prevalence ≥3% were studied. PheWAS identified 24 significant associations of autoantibodies to epitopes at FDR ≤0.1. The strongest associations with the highest PPVs included autoantibodies against fibronectin with obesity (p=6.1x10−4, PPV 100%) and against fibrinogen with pneumonopathy (p=2.7x10−4, PPV 96%). The latter included diagnoses for cryptogenic organizing pneumonia and obliterative bronchiolitis.

Conclusion. We demonstrated the application of a bioinformatics method, the PheWAS, to screen for the clinical significance of RA-related autoantibodies. The PheWAS identified potentially significant links between variations in levels of autoantibodies and comorbidities of interest in RA.

%B Arthritis & Rheumatology %V 69 %P 742–749 %8 2016 %G eng %U http://dx.doi.org/10.1002/art.39974 %0 Journal Article %J Biometrics %D 2016 %T Identifying predictive markers for personalized treatment selection %A Shen, Yuanyuan %A Cai, Tianxi %X It is now well recognized that the effectiveness and potential risk of a treatment often vary by patient subgroups. Although trial-and-error and one-size-fits-all approaches to treatment selection remain a common practice, much recent focus has been placed on individualized treatment selection based on patient information (La Thangue and Kerr, 2011; Ong et al., 2012). Genetic and molecular markers are becoming increasingly available to guide treatment selection for various diseases including HIV and breast cancer (Mallal et al., 2008; Zujewski and Kamin, 2008). In recent years, many statistical procedures for developing individualized treatment rules (ITRs) have been proposed. However, less focus has been given to efficient selection of predictive biomarkers for treatment selection. The standard Wald test for interactions between treatment and the set of markers of interest may not work well when the marker effects are nonlinear. Furthermore, the interaction-based test is scale-dependent and may fail to capture markers useful for predicting individualized treatment differences. In this article, we propose to overcome these difficulties by developing a kernel machine (KM) score test that can efficiently identify markers predictive of treatment difference. Simulation studies show that our proposed KM-based score test is more powerful than the Wald test when there are nonlinear effects among the predictors and when the outcome is binary with nonlinear link functions. Furthermore, when there is high correlation among predictors and when the number of predictors is not small, our method also outperforms the Wald test. The proposed method is illustrated with two randomized clinical trials. %B Biometrics %V 72 %P 1017-1025 %8 2016 Dec %G eng %N 4 %1 http://www.ncbi.nlm.nih.gov/pubmed/26999054?dopt=Abstract %R 10.1111/biom.12511 %0 Journal Article %J Biometrics %D 2016 %T Estimation and testing for multiple regulation of multivariate mixed outcomes %A Agniel, Denis %A Liao, Katherine P %A Cai, Tianxi %X Considerable interest has recently been focused on studying multiple phenotypes simultaneously in both epidemiological and genomic studies, either to capture the multidimensionality of complex disorders or to understand the shared etiology of related disorders. We seek to identify multiple regulators or predictors that are associated with multiple outcomes when these outcomes may be measured on very different scales or composed of a mixture of continuous, binary, and not-fully observed elements. We first propose an estimation technique to put all effects on similar scales, and we induce sparsity on the estimated effects. We provide standard asymptotic results for this estimator and show that resampling can be used to quantify uncertainty in finite samples. We finally provide a multiple testing procedure which can be geared specifically to the types of multiple regulators of interest, and we establish that, under standard regularity conditions, the familywise error rate will approach 0 as sample size diverges. Simulation results indicate that our approach can improve over unregularized methods both in reducing bias in estimation and improving power for testing.
%B Biometrics %V 72 %P 1194-1205 %8 2016 Dec %G eng %N 4 %1 http://www.ncbi.nlm.nih.gov/pubmed/26910481?dopt=Abstract %R 10.1111/biom.12495 %0 Journal Article %J J Am Med Inform Assoc %D 2016 %T Surrogate-assisted feature extraction for high-throughput phenotyping %A Yu, Sheng %A Chakrabortty, Abhishek %A Liao, Katherine P %A Cai, Tianrun %A Ananthakrishnan, Ashwin N %A Gainer, Vivian S %A Churchill, Susanne E %A Szolovits, Peter %A Murphy, Shawn N %A Kohane, Isaac S %A Cai, Tianxi %X OBJECTIVE: Phenotyping algorithms are capable of accurately identifying patients with specific phenotypes from within electronic medical records systems. However, developing phenotyping algorithms in a scalable way remains a challenge due to the extensive human resources required. This paper introduces a high-throughput unsupervised feature selection method, which improves the robustness and scalability of electronic medical record phenotyping without compromising its accuracy. METHODS: The proposed Surrogate-Assisted Feature Extraction (SAFE) method selects candidate features from a pool of comprehensive medical concepts found in publicly available knowledge sources. The target phenotype's International Classification of Diseases, Ninth Revision and natural language processing counts, acting as noisy surrogates to the gold-standard labels, are used to create silver-standard labels. Candidate features highly predictive of the silver-standard labels are selected as the final features. RESULTS: Algorithms were trained to identify patients with coronary artery disease, rheumatoid arthritis, Crohn's disease, and ulcerative colitis using various numbers of labels to compare the performance of features selected by SAFE, a previously published automated feature extraction for phenotyping procedure, and domain experts. The out-of-sample area under the receiver operating characteristic curve and F-score from SAFE algorithms were remarkably higher than those from the other two, especially at small label sizes. CONCLUSION: SAFE advances high-throughput phenotyping methods by automatically selecting a succinct set of informative features for algorithm training, which in turn reduces overfitting and the needed number of gold-standard labels. SAFE also potentially identifies important features missed by automated feature extraction for phenotyping or experts. %B J Am Med Inform Assoc %8 2016 Sep 15 %G eng %1 http://www.ncbi.nlm.nih.gov/pubmed/27632993?dopt=Abstract %R 10.1093/jamia/ocw135 %0 Journal Article %J Biometrics %D 2016 %T A predictive enrichment procedure to identify potential responders to a new therapy for randomized, comparative controlled clinical studies %A Li, Junlong %A Zhao, Lihui %A Tian, Lu %A Cai, Tianxi %A Claggett, Brian %A Callegaro, Andrea %A Dizier, Benjamin %A Spiessens, Bart %A Ulloa-Montoya, Fernando %A Wei, Lee-Jen %X To evaluate a new therapy versus a control via a randomized, comparative clinical study or a series of trials, due to heterogeneity of the study patient population, a pre-specified, predictive enrichment procedure may be implemented to identify an "enrichable" subpopulation. For patients in this subpopulation, the therapy is expected to have a desirable overall risk-benefit profile. To develop and validate such a "therapy-diagnostic co-development" strategy, a three-step procedure may be conducted with three independent data sets from a series of similar studies or a single trial. 
In the first step, we create various candidate scoring systems based on the baseline information of the patients via, for example, parametric models using the first data set. Each individual score reflects an anticipated average treatment difference for future patients who share similar baseline profiles. A large score indicates that these patients tend to benefit from the new therapy. In the second step, a potentially promising, enrichable subgroup is identified using the totality of evidence from these scoring systems. In the final step, we validate such a selection via two-sample inference procedures for assessing the treatment effectiveness statistically and clinically with the third data set, the so-called holdout sample. When the study size is not large, one may combine the first two steps using a "cross-training-evaluation" process. Comprehensive numerical studies are conducted to investigate the operational characteristics of the proposed method. The entire enrichment procedure is illustrated with the data from a cardiovascular trial to evaluate a beta-blocker versus a placebo for treating chronic heart failure patients. %B Biometrics %V 72 %P 877-87 %8 2016 Sep %G eng %N 3 %1 http://www.ncbi.nlm.nih.gov/pubmed/26689167?dopt=Abstract %R 10.1111/biom.12461 %0 Journal Article %J J Am Stat Assoc %D 2016 %T Structured Matrix Completion with Applications to Genomic Data Integration %A Cai, Tianxi %A Cai, T Tony %A Zhang, Anru %X Matrix completion has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. Current literature on matrix completion focuses primarily on independent sampling models under which the individual observed entries are sampled independently. Motivated by applications in genomic data integration, we propose a new framework of structured matrix completion (SMC) to treat structured missingness by design. Specifically, our proposed method aims at efficient matrix recovery when a subset of the rows and columns of an approximately low-rank matrix are observed. We provide theoretical justification for the proposed SMC method and derive a lower bound for the estimation errors, which together establish the optimal rate of recovery over certain classes of approximately low-rank matrices. Simulation studies show that the method performs well in finite samples under a variety of configurations. The method is applied to integrate several ovarian cancer genomic studies with different extents of genomic measurements, which enables us to construct more accurate prediction rules for ovarian cancer survival. %B J Am Stat Assoc %V 111 %P 621-633 %8 2016 %G eng %N 514 %1 http://www.ncbi.nlm.nih.gov/pubmed/28042188?dopt=Abstract %R 10.1080/01621459.2015.1021005 %0 Journal Article %J Biometrics %D 2016 %T On longitudinal prediction with time-to-event outcome: Comparison of modeling options %A Maziarz, Marlena %A Heagerty, Patrick %A Cai, Tianxi %A Zheng, Yingye %X Long-term follow-up is common in many medical investigations where the interest lies in predicting patients' risks for a future adverse outcome using repeatedly measured predictors over time. A key quantity is the likelihood of developing an adverse outcome among individuals who survived up to time s given their covariate information up to time s. Simple, yet reliable, methodology for updating the predicted risk of disease progression using longitudinal markers remains elusive. Two main approaches have been considered in the literature.
One approach, based on joint modeling (JM) of failure time and longitudinal covariate process (Tsiatis and Davidian, 2004), derives such longitudinal predictive probability from the joint probability of a longitudinal marker and an event at a given time. A second approach, the partly conditional (PC) modeling (Zheng and Heagerty, 2005), directly models the predictive probability conditional on survival up to a landmark time and information accrued by that time. In this article, we propose new PC models for longitudinal prediction that are more flexible than joint modeling and improve the prediction accuracy over existing PC models. We provide procedures for making inference regarding future risk for an individual with longitudinal measures up to a given time. In addition, we conduct simulations to evaluate both JM and PC approaches in order to provide practical guidance on modeling choices. We use standard measures of predictive accuracy adapted to our setting to explore the predictiveness of the two approaches. We illustrate the performance of the two approaches on a dataset from the End Stage Renal Disease Study (ESRDS). %B Biometrics %8 2016 Jul 20 %G eng %1 http://www.ncbi.nlm.nih.gov/pubmed/27438160?dopt=Abstract %R 10.1111/biom.12562 %0 Journal Article %J J Am Stat Assoc %D 2016 %T Comment: Addressing the Need for Portability in Big Data Model Building and Calibration %A Patel, Chirag J %A Dominici, Francesca %B J Am Stat Assoc %V 111 %P 127-129 %8 May 5 2016 %G eng %U https://dx.doi.org/10.1080/01621459.2016.1149406 %N 513 %0 Journal Article %J Sci Rep %D 2016 %T Environment-Wide Association Study of Blood Pressure in the National Health and Nutrition Examination Survey (1999-2012) %A McGinnis, Denise P %A Brownstein, John S %A Patel, Chirag J %X

Identifying environmental exposures associated with blood pressure is a priority. Recently, we proposed the environment-wide association study (EWAS) to search for and replicate environmental factors associated with phenotypes. We conducted an EWAS using the National Health and Nutrition Examination Surveys (1999–2012), which evaluated a total of 71,916 participants, to prioritize environmental factors associated with systolic and diastolic blood pressure. We searched for factors in participants from survey years 1999–2006 and tentatively replicated findings in participants from years 2007–2012. Finally, we estimated the overall association and performed a second meta-analysis using all survey years (1999–2012). For systolic blood pressure, self-reported alcohol consumption emerged as our top finding (a 0.04 mmHg increase in systolic blood pressure per 1 standard deviation increase in self-reported alcohol consumption), though the effect size is small. For diastolic blood pressure, urinary cesium was tentatively replicated; however, this factor demonstrated high heterogeneity between populations (I2 = 51%). The lack of associations across so wide an analysis calls for a broader search for environmental factors associated with blood pressure.

%B Sci Rep %V 6 %P 1-8 %8 Jul 26 2016 %G eng %U https://dx.doi.org/10.1038/srep30373 %N 30373 %0 Journal Article %J J Am Med Inform Assoc %D 2016 %T MeSHDD: Literature-based drug-drug similarity for drug repositioning %A Brown, AS %A Patel, Chirag J %X

Objective: Drug repositioning is a promising methodology for reducing the cost and duration of the drug discovery pipeline. We sought to develop a computational repositioning method leveraging annotations in the literature, such as Medical Subject Heading (MeSH) terms.

Methods: We developed software to determine significantly co-occurring drug-MeSH term pairs and a method to estimate pair-wise literature-derived distances between drugs.

Results: We found that literature-based drug-drug similarities predicted the number of shared indications across drug-drug pairs. Clustering drugs based on their similarity revealed both known and novel drug indications. We demonstrate the utility of our approach by generating repositioning hypotheses for the commonly used diabetes drug metformin.

Conclusion: Our study demonstrates that literature-derived similarity is useful for identifying potential repositioning opportunities. We provided open-source code and deployed a free-to-use, interactive application to explore our database of similarity-based drug clusters (available at http://apps.chiragjpgroup.org/MeSHDD/).

%B J Am Med Inform Assoc %8 Sep 27 2016 %G eng %U https://dx.doi.org/10.1093/jamia/ocw142 %0 Journal Article %J Brief Bioinform %D 2016 %T A review of validation strategies for computational drug repositioning %A Brown, AS %A Patel, CJ %X

Repositioning of previously approved drugs is a promising methodology because it reduces the cost and duration of the drug development pipeline and reduces the likelihood of unforeseen adverse events. Computational repositioning is especially appealing because of the ability to rapidly screen candidates in silico and to reduce the number of possible repositioning candidates. What is unclear, however, is how useful such methods are in producing clinically efficacious repositioning hypotheses. Furthermore, there is no agreement in the field over the proper way to perform validation of in silico predictions, and in fact there has been no systematic review of repositioning validation methodologies. To address this unmet need, we review the computational repositioning literature and capture studies in which the authors claimed to have validated their work. Our analysis reveals widespread variation in the types of strategies, predictions made, and databases used as 'gold standards'. We highlight a key weakness of the most commonly used strategy and propose a path forward for the consistent analytic validation of repositioning techniques.

%B Brief Bioinform %8 Nov 22 2016 %G eng %U https://dx.doi.org/10.1093/bib/bbw110 %0 Journal Article %J Sci Rep %D 2016 %T Systematic assessment of pharmaceutical prescriptions in association with cancer risk: a method to conduct a population-wide medication-wide longitudinal study %A Patel, Chirag J %A Ji, J %A Sundquist, J %A Ioannidis, John P. A. %A Sundquist, K %X

It is a public health priority to identify the adverse and non-adverse associations between pharmaceutical medications and cancer. We search for and evaluate associations between all prescribed medications and longitudinal cancer risk in participants of the Swedish Cancer Register (N = 9,014,975). We associated 552 different medications with incident cancer risk (any, breast, colon, and prostate) during 5.5 years of follow-up (7/1/2005-12/31/2010) in two types of statistical models, time-to-event and case-crossover. After multiple hypotheses correction and replication, 141 drugs (26%) were associated with any cancer in a time-to-event analysis constraining drug exposure to 1 year before first cancer diagnosis and adjusting for history of medication use. In a case-crossover analysis, 36 drugs (7%) were associated with decreased cancer risk. Twelve drugs were found in common in both analyses with a concordant direction of association. We found 14%, 10%, and 7% of all drugs associated with colon, prostate, and breast cancers, respectively, in time-to-event models, but only 1%, 2%, and 0% for these cancers in case-crossover analyses. Pharmacoepidemiologic analyses of cancer risk are sensitive to modeling choices, and false-positive findings are a threat. Medication-wide analyses using different analytical models may help suggest consistent signals of increased cancer risk.

%B Sci Rep %V 6 %P 1-14 %8 Aug 10 2016 %G eng %U https://dx.doi.org/10.1038/srep31308 %N 31308 %0 Journal Article %J Artif Intell Med %D 2016 %T Utilizing uncoded consultation notes from electronic medical records for predictive modeling of colorectal cancer %A Hoogendoorn, Mark %A Szolovits, Peter %A Moons, Leon %A Numans, Mattijs %X

Objective

Machine learning techniques can be used to extract predictive models for diseases from electronic medical records (EMRs). However, the nature of EMRs makes it difficult to apply off-the-shelf machine learning techniques while still exploiting the rich content of the EMRs. In this paper, we explore the usage of a range of natural language processing (NLP) techniques to extract valuable predictors from uncoded consultation notes and study whether they can help to improve predictive performance.

Methods

We study a number of existing techniques for the extraction of predictors from the consultation notes, namely a bag-of-words-based approach and topic modeling. In addition, we develop a dedicated technique to match the uncoded consultation notes with a medical ontology. We apply these techniques as an extension to an existing pipeline to extract predictors from EMRs. We evaluate them in the context of predictive modeling for colorectal cancer (CRC), a disease known to be difficult to diagnose before performing an endoscopy.

Results

Our results show that we are able to extract useful information from the consultation notes. The predictive performance of the ontology-based extraction method moves significantly beyond the benchmark of age and gender alone (area under the receiver operating characteristic curve (AUC) of 0.870 versus 0.831). We also observe more accurate predictive models when adding features derived from processing the consultation notes compared to solely using coded data (AUC of 0.896 versus 0.882), although the difference is not significant. The extracted features from the notes are shown to be equally predictive (i.e., there is no significant difference in performance) compared to the coded data of the consultations.

Conclusion

It is possible to extract useful predictors from uncoded consultation notes that improve predictive performance. Techniques linking text to concepts in medical ontologies to derive these predictors are shown to perform best for predicting CRC in our EMR dataset.

Keywords: Natural language processing, Predictive modeling, Uncoded consultation notes, Colorectal cancer

%B Artif Intell Med %V 69 %P 53-61 %8 March 31, 2016 %G eng %U http://dx.doi.org/10.1016/j.artmed.2016.03.003 %0 Journal Article %J Scientific Data %D 2016 %T A database of human exposomes and phenomes from the US National Health and Nutrition Examination Survey %A Patel, Chirag J %A Pho, Nam %A McDuffie, Michael %A Easton-Marks, Jeremy %A Kothari, Cartik %A Kohane, Isaac S %A Avillach, Paul %X

The National Health and Nutrition Examination Survey (NHANES) is a population survey implemented by the Centers for Disease Control and Prevention (CDC) to monitor the health of the United States; its data are publicly available in hundreds of files. This Data Descriptor describes a single unified and universally accessible data file, merging across 255 separate files and stitching data across 4 surveys, encompassing 41,474 individuals and 1,191 variables. The variables consist of phenotype and environmental exposure information on each individual, specifically (1) demographic information, (2) physical exam results (e.g., height, body mass index), (3) laboratory results (e.g., cholesterol, glucose, and environmental exposures), and (4) questionnaire items. The Data Descriptor also describes a dictionary to enable analysts to find variables by category and human-readable description. The datasets are available on DataDryad, and a hands-on analytics tutorial is available on GitHub. Through a new big data platform, the BD2K Patient Centered Information Commons (http://pic-sure.org), we provide a new way to browse the dataset via a web browser (https://nhanes.hms.harvard.edu) and an application programming interface for programmatic access.

%B Scientific Data %V 3 %P 160096 %8 Oct 25 2016 %G eng %U http://dx.doi.org/10.1038/sdata.2016.96 %0 Journal Article %J BMC Bioinformatics %D 2016 %T ksRepo: a generalized platform for computational drug repositioning %A Brown, AS %A Kong, SW %A Kohane, IS %A Patel, CJ %X

Background: Repositioning approved drugs and small molecules in novel therapeutic areas is of key interest to the pharmaceutical industry. A number of promising computational techniques have been developed to aid in repositioning; however, the majority of available methodologies require highly specific data inputs that preclude the use of many datasets and databases. There is a clear unmet need for a generalized methodology that enables the integration of multiple types of gene expression data and database schemas.

Results: ksRepo eliminates the need for a single microarray platform as input and allows for the use of a variety of drug and chemical exposure databases. We tested ksRepo’s performance on a set of five prostate cancer datasets using the Comparative Toxicogenomics Database (CTD) as our database of gene-compound interactions. ksRepo successfully predicted significance for five frontline prostate cancer therapies, representing a significant enrichment from over 7000 CTD compounds, and achieved specificity similar to other repositioning methods.

Conclusions: We present ksRepo, which enables investigators to use any data inputs for computational drug repositioning. ksRepo is implemented in a series of four functions in the R statistical environment under a BSD3 license. Source code is freely available at http://github.com/adam-sam-brown/ksRepo. A vignette is provided to aid users in performing ksRepo analysis.

Keywords: Repositioning, Drug discovery, Prostate cancer, Gene expression

%B BMC Bioinformatics %V 17 %P 1-5 %8 Feb 9 2016 %G eng %U https://dx.doi.org/10.1186/s12859-016-0931-y %N 78 %0 Journal Article %J Int J Epidemiol. %D 2016 %T Systematic correlation of environmental exposure and physiological and self-reported behaviour factors with leukocyte telomere length %A Patel, CJ %A Manrai, AK %A Corona, E %A Kohane, IS %X

Background: It is hypothesized that environmental exposures and behaviour influence telomere length, an indicator of cellular ageing. We systematically associated 461 indicators of environmental exposures, physiology and self-reported behaviour with telomere length in data from the US National Health and Nutrition Examination Survey (NHANES) in 1999–2002. Further, we tested whether factors identified in the NHANES participants are also correlated with gene expression of telomere length modifying genes.

Methods: We correlated 461 environmental exposures, behaviours and clinical variables with telomere length, using survey-weighted linear regression, adjusting for sex, age, age squared, race/ethnicity, poverty level, education, and birth outside the USA, and estimated the false discovery rate to adjust for multiple hypotheses. We conducted a secondary analysis to investigate the correlation between identified environmental variables and gene expression levels of telomere-associated genes in publicly available gene expression samples.

Results: After correlating 461 variables with telomere length, we found 22 variables significantly associated with telomere length after adjustment for multiple hypotheses. Of these variables, 14 were associated with longer telomeres, including biomarkers of polychlorinated biphenyls [PCBs; 0.1 to 0.2 standard deviation (SD) increase for a 1 SD increase in PCB level, P < 0.002] and a form of vitamin A, retinyl stearate. Eight variables were associated with shorter telomeres, including biomarkers of cadmium, C-reactive protein, and lack of physical activity. We could not conclude that PCBs are correlated with gene expression of telomere-associated genes.

Conclusions: Both environmental exposures and chronic disease-related risk factors may play a role in telomere length. Our secondary analysis found no evidence of association between PCBs/smoking and gene expression of telomere-associated genes. All correlations between exposures, behaviours and clinical factors and changes in telomere length will require further investigation regarding the biological influence of exposure.

%B Int J Epidemiol. %V pii: dyw043 %P 1-13 %8 Apr 8 2016 %G eng %U https://dx.doi.org/10.1093/ije/dyw043 %0 Journal Article %J Pac Symp Biocomput %D 2016 %T Reproducible and shareable quantifications of pathogenicity %A Manrai, Arjun K %A Wang, Brice L %A Patel, Chirag J %A Kohane, Isaac S %X

There are now hundreds of thousands of pathogenicity assertions that relate genetic variation to disease, but most of this clinically utilized variation has no accepted quantitative disease risk estimate. Recent disease-specific studies have used control sequence data to reclassify large amounts of prior pathogenic variation, but there is a critical need to scale up both the pace and feasibility of such pathogenicity reassessments across human disease. In this manuscript we develop a shareable computational framework to quantify pathogenicity assertions. We release a reproducible “digital notebook” that integrates executable code, text annotations, and mathematical expressions in a freely accessible statistical environment. We extend previous disease-specific pathogenicity assessments to over 6,000 diseases and 160,000 assertions in the ClinVar database. Investigators can use this platform to prioritize variants for reassessment and tailor genetic model parameters (such as prevalence and heterogeneity) to expose the uncertainty underlying pathogenicity-based risk assessments. Finally, we release a website that links users to pathogenic variation for a queried disease, supporting literature, and implied disease risk calculations subject to user-defined and disease-specific genetic risk models in order to facilitate variant reassessments.

%B Pac Symp Biocomput %V 21 %P 231-242 %8 Jan 2016 %G eng %U http://dx.doi.org/10.1142/9789814749411_0022 %0 Journal Article %J Biometrics %D 2015 %T Sparse kernel machine regression for ordinal outcomes %A Shen, Yuanyuan %A Liao, Katherine P %A Cai, Tianxi %K Arthritis, Rheumatoid %K Autoantibodies %K Biomarkers %K Computer Simulation %K Diagnosis, Computer-Assisted %K Humans %K Models, Statistical %K Peptides, Cyclic %K Regression Analysis %K Reproducibility of Results %K Sensitivity and Specificity %X Ordinal outcomes arise frequently in clinical studies when each subject is assigned to a category and the categories have a natural order. Classification rules for ordinal outcomes may be developed with commonly used regression models such as the full continuation ratio (CR) model (fCR), which allows the covariate effects to differ across all continuation ratios, and the CR model with a proportional odds structure (pCR), which assumes the covariate effects to be constant across all continuation ratios. For settings where the covariate effects differ between some continuation ratios but not all, fitting either fCR or pCR may lead to suboptimal prediction performance. In addition, these standard models do not allow for nonlinear covariate effects. In this article, we propose a sparse CR kernel machine (KM) regression method for ordinal outcomes where we use the KM framework to incorporate nonlinearity and impose sparsity on the overall differences between the covariate effects of continuation ratios to control for overfitting. In addition, we provide a data-driven rule to select an optimal kernel to maximize the prediction accuracy. Simulation results show that our proposed procedures perform well under both linear and nonlinear settings, especially when the true underlying model is in between the fCR and pCR models. We apply our procedures to develop a prediction model for levels of anti-CCP among rheumatoid arthritis patients and demonstrate the advantage of our method over other commonly used methods. %B Biometrics %V 71 %P 63-70 %8 2015 Mar %G eng %N 1 %1 http://www.ncbi.nlm.nih.gov/pubmed/25196727?dopt=Abstract %R 10.1111/biom.12223 %0 Journal Article %J Nat Biotechnol %D 2015 %T The digital phenotype %A Jain, SH %A Powers, BW %A Hawkins, JB %A Brownstein, JS %X

In the coming years, patient phenotypes captured to enhance health and wellness will extend to human interactions with digital technology.

%B Nat Biotechnol %V 33 %P 462–463 %8 May 12, 2015 %G eng %U http://dx.doi.org/10.1038/nbt.3223 %N 5 %0 Journal Article %J Biometrics %D 2015 %T Assessing incremental value of biomarkers with multi-phase nested case-control studies %A Zhou, QM %A Zheng, Y %A Chibnik, LB %A Karlson, EW %A Cai, T. %X

Accurate risk prediction models are needed to identify different risk groups for individualized prevention and treatment strategies. In the Nurses’ Health Study, to examine the effects of several biomarkers and genetic markers on the risk of rheumatoid arthritis (RA), a three-phase nested case-control (NCC) design was conducted, in which two sequential NCC subcohorts were formed, one nested within the other, and one set of new markers was measured on each of the subcohorts. Because of the potential cost associated with assaying biomarkers, one objective of the study is to evaluate the clinical value of novel biomarkers in improving upon existing risk models. In this paper, we develop robust statistical procedures for constructing risk prediction models for RA and estimating the incremental value (IncV) of new markers based on three-phase NCC studies. Our method also takes into account possible time-varying effects of biomarkers in risk modeling, which allows us to more robustly assess the biomarker utility and address the question of whether a marker is better suited for short-term or long-term risk prediction. The proposed procedures are shown to perform well in finite samples via simulation studies.

%B Biometrics %V 71 %P 1139–1149 %8 20 July 2015 %G eng %U http://dx.doi.org/10.1111/biom.12344 %0 Journal Article %J J Nutr %D 2015 %T Risk of Type 2 Diabetes Is Lower in US Adults Taking Chromium-Containing Supplements %A McIver, DJ %A Grizales, AM %A Brownstein, JS %A Goldfine, AB %X

Background: Dietary supplement use is widespread in the United States. Although it has been suggested in both in vitro and small in vivo human studies that chromium has potentially beneficial effects in type 2 diabetes (T2D), chromium supplementation in diabetes has not been investigated at the population level.

Objective: The objective of this study was to assess the use and potential benefits of chromium supplementation in T2D by examining NHANES data.

Methods: An individual was defined as having diabetes if he or she had a glycated hemoglobin (HbA1c) value of ≥6.5% or reported having been diagnosed with diabetes. Data on all consumed dietary supplements from the NHANES database were analyzed, with the odds ratio (OR) of having diabetes by chromium supplement use as the main outcome of interest.

Results: The NHANES for the years 1999–2010 included information on 62,160 individuals. After filtering the database for the required covariates (gender, ethnicity, socioeconomic status, body mass index, diabetes diagnosis, supplement usage, and laboratory HbA1c values), and when restricted to adults, the study cohort included 28,539 people. A total of 58.3% of people reported consuming a dietary supplement in the previous 30 d, 28.8% reported consuming a dietary supplement that contained chromium, and 0.7% consumed supplements that had “chromium” in the title. The odds of having T2D (HbA1c ≥6.5%) were lower in persons who had consumed chromium-containing supplements within the previous 30 d than in those who had not (OR: 0.73; 95% CI: 0.62, 0.86; P = 0.001). Supplement use alone (without chromium) did not influence the odds of having T2D (OR: 0.89; 95% CI: 0.77, 1.03; P = 0.11).

Conclusions: Over one-half the adult US population consumes nutritional supplements, and over one-quarter consumes supplemental chromium. The odds of having T2D were lower in those who, in the previous 30 d, had consumed supplements containing chromium. Given the magnitude of exposure, studies on safety and efficacy are warranted.

Keywords: chromium, diabetes, glucose intolerance, insulin resistance, dietary supplements, safety, NHANES

%B J Nutr %V 145 %P 2675-2682 %8 October 7, 2015 %G eng %U http://dx.doi.org/10.3945/jn.115.214569 %N 12