@Article{info:doi/10.2196/62831, author="Miyakoshi, Takashi and Ito, M. Yoichi", title="Association of Blood Glucose Data With Physiological and Nutritional Data From Dietary Surveys and Wearable Devices: Database Analysis", journal="JMIR Diabetes", year="2024", month="Dec", day="3", volume="9", pages="e62831", keywords="PhysioNet", keywords="Empatica", keywords="Dexcom", keywords="acceleration", keywords="heart rate", keywords="temperature", keywords="electrodermal activity", abstract="Background: Wearable devices can simultaneously collect data on multiple items in real time and are used for disease detection, prediction, diagnosis, and treatment decision-making. Several factors, such as diet and exercise, influence blood glucose levels; however, the relationship between blood glucose and these factors has yet to be evaluated in real practice. Objective: This study aims to investigate the association of blood glucose data with various physiological indices and nutritional values using wearable devices and dietary survey data from PhysioNet, a public database. Methods: Three analytical methods were used. First, the correlation of each physiological index was calculated and examined to determine whether their mean values or SDs affected the mean value or SD of blood glucose. To investigate the impact of each physiological indicator on blood glucose before and after the time of collection of blood glucose data, lag data were collected, and the correlation coefficient between blood glucose and each physiological indicator was calculated for each physiological index. Second, to examine the relationship between postprandial blood glucose rise and fall and physiological and dietary nutritional assessment indices, multiple regression analysis was performed on the relationship between the slope before and after the peak in postprandial glucose over time and physiological and dietary nutritional indices. 
Finally, as a supplementary analysis to the multiple regression analysis, a 1-way ANOVA was performed to compare the relationship between the upward and downward slopes of blood glucose and the groups above and below the median for each indicator. Results: The analysis revealed several indicators of interest: First, the correlation analysis of blood glucose and physiological indices indicated meaningful relationships: acceleration SD (r=--0.190 for lag data at --15-minute values), heart rate SD (r=--0.121 for lag data at --15-minute values), skin temperature SD (r=--0.121), and electrodermal activity SD (r=--0.237) for lag data at --15-minute values. Second, in multiple regression analysis, physiological indices (temperature mean: t=2.52, P=.01; acceleration SD: t=--2.06, P=.04; heart rate\_30 SD: t=--2.12, P=.04; electrodermal activity\_90 SD: t=1.97, P=.049) and nutritional indices (mean carbohydrate: t=6.53, P<.001; mean dietary fiber: t=--2.51, P=.01; mean sugar: t=--3.72, P<.001) were significant predictors. Finally, the results of the 1-way ANOVA corroborated the findings from the multiple regression analysis. Conclusions: Similar results were obtained from the 3 analyses, consistent with previous findings, and the relationship between blood glucose, diet, and physiological indices in the real world was examined. Data sharing facilitates the accessibility of wearable data and enables statistical analyses from various angles. This type of research is expected to be more common in the future. ", doi="10.2196/62831", url="https://diabetes.jmir.org/2024/1/e62831" } @Article{info:doi/10.2196/63562, author="Fisher, J. Joshua and Grace, Tegan and Castles, A. Nathan and Jones, A. Elizabeth and Delforce, J. Sarah and Peters, E. Alexandra and Crombie, K. Gabrielle and Hoedt, C. Emily and Warren, E. Kirby and Kahl, GS Richard and Hirst, J. Jonathan and Pringle, G. Kirsty and Pennell, E. 
Craig", title="Methodology for Biological Sample Collection, Processing, and Storage in the Newcastle 1000 Pregnancy Cohort: Protocol for a Longitudinal, Prospective Population-Based Study in Australia", journal="JMIR Res Protoc", year="2024", month="Nov", day="15", volume="13", pages="e63562", keywords="pregnancy cohort study", keywords="biobanking protocol", keywords="toenails", keywords="blood", keywords="microbiome", keywords="urine", keywords="hair", keywords="pregnancy", keywords="cohort study", abstract="Background: Research in the developmental origins of health and disease provides compelling evidence that adverse events during the first 1000 days of life from conception can impact life course health. Despite many decades of research, we still lack a complete understanding of the mechanisms underlying some of these associations. The Newcastle 1000 Study (NEW1000) is a comprehensive, prospective population-based pregnancy cohort study based in Newcastle, New South Wales, Australia, that will recruit pregnant women and their partners at 11-14 weeks' gestation, with assessments at 20, 28, and 36 weeks; birth; 6 weeks; and 6 months, in order to provide detailed data about the first 1000 days of life to investigate the developmental origins of noncommunicable diseases. Objective: The study aims to provide a longitudinal multisystem approach to phenotyping, supported by robust clinical data and collection of biological samples in NEW1000. Methods: This manuscript describes in detail the large variety of samples collected in the study and the method of collection, storage, and utility of the samples in the biobank, with a particular focus on incorporation of the samples into emerging and novel large-scale ``-omics'' platforms, including the genome, microbiome, epigenome, transcriptome, fragmentome, metabolome, proteome, exposome, and cell-free DNA and RNA. 
Specifically, this manuscript details the methods used to collect, process, and store biological samples, including maternal, paternal, and fetal blood, microbiome (stool, skin, vaginal, oral), urine, saliva, hair, toenail, placenta, colostrum, and breastmilk. Results: Recruitment for the study began in March 2021. As of July 2024, 1040 women and 684 partners were enrolled, with 922 infants born. The NEW1000 biobank contains 24,357 plasma aliquots from ethylenediaminetetraacetic acid (EDTA) tubes, 5284 buffy coat aliquots, 4000 plasma aliquots from lithium heparin tubes, 15,884 blood serum aliquots, 2977 PAX RNA tubes, 26,595 urine sample aliquots, 2280 fecal swabs, 17,687 microbiome swabs, 2356 saliva sample aliquots, 1195 breastmilk sample aliquots, 4007 placental tissue aliquots, 2680 hair samples, and 2193 nail samples. Conclusions: NEW1000 will generate a multigenerational, deeply phenotyped cohort with a comprehensive biobank of samples relevant to a large variety of analyses, including multiple -omics platforms. International Registered Report Identifier (IRRID): DERR1-10.2196/63562 ", doi="10.2196/63562", url="https://www.researchprotocols.org/2024/1/e63562" } @Article{info:doi/10.2196/58116, author="Mayito, Jonathan and Tumwine, Conrad and Galiwango, Ronald and Nuwamanya, Elly and Nakasendwa, Suzan and Hope, Mackline and Kiggundu, Reuben and Byonanebye, M. 
Dathan and Dhikusooka, Flavia and Twemanye, Vivian and Kambugu, Andrew and Kakooza, Francis", title="Combating Antimicrobial Resistance Through a Data-Driven Approach to Optimize Antibiotic Use and Improve Patient Outcomes: Protocol for a Mixed Methods Study", journal="JMIR Res Protoc", year="2024", month="Nov", day="8", volume="13", pages="e58116", keywords="antimicrobial resistance", keywords="AMR database", keywords="AMR", keywords="machine learning", keywords="antimicrobial use", keywords="artificial intelligence", keywords="antimicrobial", keywords="data-driven", keywords="mixed-method", keywords="patient outcome", keywords="drug-resistant infections", keywords="drug resistant", keywords="surveillance data", keywords="economic", keywords="antibiotic", abstract="Background: It is projected that drug-resistant infections will lead to 10 million deaths annually by 2050 if left unabated. Despite this threat, surveillance data from resource-limited settings are scarce and often lack antimicrobial resistance (AMR)--related clinical outcomes and economic burden. We aim to build an AMR and antimicrobial use (AMU) data warehouse, describe the trends of resistance and antibiotic use, determine the economic burden of AMR in Uganda, and develop a machine learning algorithm to predict AMR-related clinical outcomes. Objective: The overall objective of the study is to use data-driven approaches to optimize antibiotic use and combat antimicrobial-resistant infections in Uganda. 
We aim to (1) build a dynamic AMR and antimicrobial use and consumption (AMUC) data warehouse to support research in AMR and AMUC to inform AMR-related interventions and public health policy, (2) evaluate the trends in AMR and antibiotic use based on annual antibiotic and point prevalence survey data collected at 9 regional referral hospitals over a 5-year period, (3) develop a machine learning model to predict the clinical outcomes of patients with bacterial infectious syndromes due to drug-resistant pathogens, and (4) estimate the annual economic burden of AMR in Uganda using the cost-of-illness approach. Methods: We will conduct a study involving data curation, machine learning--based modeling, and cost-of-illness analysis using AMR and AMU data abstracted from procurement, human resources, and clinical records of patients with bacterial infectious syndromes at 9 regional referral hospitals in Uganda collected between 2018 and 2026. We will use data curation procedures, FLAIR (Findable, Linkable, Accessible, Interactable and Repeatable) principles, and role-based access control to build a robust and dynamic AMR and AMU data warehouse. We will also apply machine learning algorithms to model AMR-related clinical outcomes, advanced statistical analysis to study AMR and AMU trends, and cost-of-illness analysis to determine the AMR-related economic burden. Results: The study received funding from the Wellcome Trust through the Centers for Antimicrobial Optimisation Network (CAMO-Net) in April 2023. As of October 28, 2024, we completed data warehouse development, which is now under testing; completed data curation of the historical Fleming Fund surveillance data (2020-2023); and collected retrospective AMR records for 599 patients that contained clinical outcomes and cost-of-illness economic burden data across 9 surveillance sites for objectives 3 and 4, respectively. 
Conclusions: The data warehouse will promote access to rich and interlinked AMR and AMU data sets to answer AMR program and research questions using a wide evidence base. The AMR-related clinical outcomes model and cost data will facilitate improvement in the clinical management of AMR patients and guide resource allocation to support AMR surveillance and interventions. International Registered Report Identifier (IRRID): PRR1-10.2196/58116 ", doi="10.2196/58116", url="https://www.researchprotocols.org/2024/1/e58116" } @Article{info:doi/10.2196/56237, author="Amadi, David and Kiwuwa-Muyingo, Sylvia and Bhattacharjee, Tathagata and Taylor, Amelia and Kiragga, Agnes and Ochola, Michael and Kanjala, Chifundo and Gregory, Arofan and Tomlin, Keith and Todd, Jim and Greenfield, Jay", title="Making Metadata Machine-Readable as the First Step to Providing Findable, Accessible, Interoperable, and Reusable Population Health Data: Framework Development and Implementation Study", journal="Online J Public Health Inform", year="2024", month="Aug", day="1", volume="16", pages="e56237", keywords="FAIR data principles", keywords="metadata", keywords="machine-readable metadata", keywords="DDI", keywords="Data Documentation Initiative", keywords="standardization", keywords="JSON-LD", keywords="JavaScript Object Notation for Linked Data", keywords="OMOP CDM", keywords="Observational Medical Outcomes Partnership Common Data Model", keywords="data science", keywords="data models", abstract="Background: Metadata describe and provide context for other data, playing a pivotal role in enabling findability, accessibility, interoperability, and reusability (FAIR) data principles. By providing comprehensive and machine-readable descriptions of digital resources, metadata empower both machines and human users to seamlessly discover, access, integrate, and reuse data or content across diverse platforms and applications. 
However, the limited accessibility and machine-interpretability of existing metadata for population health data hinder effective data discovery and reuse. Objective: To address these challenges, we propose a comprehensive framework using standardized formats, vocabularies, and protocols to render population health data machine-readable, significantly enhancing their FAIRness and enabling seamless discovery, access, and integration across diverse platforms and research applications. Methods: The framework implements a 3-stage approach. The first stage is Data Documentation Initiative (DDI) integration, which involves leveraging the DDI Codebook metadata and documentation of detailed information for data and associated assets, while ensuring transparency and comprehensiveness. The second stage is Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) standardization. In this stage, the data are harmonized and standardized into the OMOP CDM, facilitating unified analysis across heterogeneous data sets. The third stage involves the integration of Schema.org and JavaScript Object Notation for Linked Data (JSON-LD), in which machine-readable metadata are generated using Schema.org entities and embedded within the data using JSON-LD, boosting discoverability and comprehension for both machines and human users. We demonstrated the implementation of these 3 stages using the Integrated Disease Surveillance and Response (IDSR) data from Malawi and Kenya. Results: The implementation of our framework significantly enhanced the FAIRness of population health data, resulting in improved discoverability through seamless integration with platforms such as Google Dataset Search. The adoption of standardized formats and protocols streamlined data accessibility and integration across various research environments, fostering collaboration and knowledge sharing. 
Additionally, the use of machine-interpretable metadata empowered researchers to efficiently reuse data for targeted analyses and insights, thereby maximizing the overall value of population health resources. The JSON-LD codes are accessible via a GitHub repository and the HTML code integrated with JSON-LD is available on the Implementation Network for Sharing Population Information from Research Entities website. Conclusions: The adoption of machine-readable metadata standards is essential for ensuring the FAIRness of population health data. By embracing these standards, organizations can enhance diverse resource visibility, accessibility, and utility, leading to a broader impact, particularly in low- and middle-income countries. Machine-readable metadata can accelerate research, improve health care decision-making, and ultimately promote better health outcomes for populations worldwide. ", doi="10.2196/56237", url="https://ojphi.jmir.org/2024/1/e56237", url="http://www.ncbi.nlm.nih.gov/pubmed/39088253" } @Article{info:doi/10.2196/50629, author="Maleki, Negar and Padmanabhan, Balaji and Dutta, Kaushik", title="Usability of Health Care Price Transparency Data in the United States: Mixed Methods Study", journal="J Med Internet Res", year="2024", month="Mar", day="29", volume="26", pages="e50629", keywords="price transparency", keywords="user experiments", keywords="schema analysis", keywords="health care", keywords="patients", keywords="algorithms", abstract="Background: Increasing health care expenditure in the United States has put policy makers under enormous pressure to find ways to curtail costs. Starting January 1, 2021, hospitals operating in the United States were mandated to publish transparent, accessible pricing information online about the items and services in a consumer-friendly format within comprehensive machine-readable files on their websites. 
Objective: The aims of this study are to analyze the available files on hospitals' websites, answering the question---is price transparency (PT) information as provided usable for patients or for machines?---and to provide a solution. Methods: We analyzed 39 main hospitals in Florida that have published machine-readable files on their website, including commercial carriers. We created an Excel (Microsoft) file that included those 39 hospitals along with the 4 most popular services---Current Procedural Terminology (CPT) 45380, 29827, and 70553 and Diagnosis-Related Group (DRG) 807---for the 4 most popular commercial carriers (Health Maintenance Organization [HMO] or Preferred Provider Organization [PPO] plans)---Aetna, Florida Blue, Cigna, and UnitedHealthcare. We conducted an A/B test using 67 MTurkers (randomly selected from US residents), investigating the level of awareness about PT legislation and the usability of available files. We also suggested format standardization, such as master field names using schema integration, to make machine-readable files consistent and usable for machines. Results: The poor usability and inconsistent formats of the current PT information yielded no evidence of its usefulness for patients or its quality for machines. This indicates that the information does not meet the requirements for being consumer-friendly or machine readable as mandated by legislation. Based on the responses to the first part of the experiment (PT awareness), it was evident that participants need to be made aware of the PT legislation. However, they believe it is important to know the service price before receiving it. Based on the responses to the second part of the experiment (human usability of PT information), the average number of correct responses was not equal between the 2 groups, that is, the treatment group (mean 1.23, SD 1.30) found more correct answers than the control group (mean 2.76, SD 0.58; t65=6.46; P<.001; d=1.52). 
Conclusions: Consistent machine-readable files across all health systems facilitate the development of tools for estimating customer out-of-pocket costs, aligning with the PT rule's main objective---providing patients with valuable information and reducing health care expenditures. ", doi="10.2196/50629", url="https://www.jmir.org/2024/1/e50629", url="http://www.ncbi.nlm.nih.gov/pubmed/38442238" } @Article{info:doi/10.2196/53857, author="Kilshaw, E. Robyn and Boggins, Abigail and Everett, Olivia and Butner, Emma and Leifker, R. Feea and Baucom, W. Brian R.", title="Benchmarking Mental Health Status Using Passive Sensor Data: Protocol for a Prospective Observational Study", journal="JMIR Res Protoc", year="2024", month="Mar", day="27", volume="13", pages="e53857", keywords="audio data", keywords="computational psychiatry", keywords="data repository", keywords="digital phenotyping", keywords="machine learning", keywords="passive sensor data", abstract="Background: Computational psychiatry has the potential to advance the diagnosis, mechanistic understanding, and treatment of mental health conditions. Promising results from clinical samples have led to calls to extend these methods to mental health risk assessment in the general public; however, data typically used with clinical samples are neither available nor scalable for research in the general population. Digital phenotyping addresses this by capitalizing on the multimodal and widely available data created by sensors embedded in personal digital devices (eg, smartphones) and is a promising approach to extending computational psychiatry methods to improve mental health risk assessment in the general population. 
Objective: Building on recommendations from existing computational psychiatry and digital phenotyping work, we aim to create the first computational psychiatry data set that is tailored to studying mental health risk in the general population; includes multimodal, sensor-based behavioral features; and is designed to be widely shared across academia, industry, and government using gold standard methods for privacy, confidentiality, and data integrity. Methods: We are using a stratified, random sampling design with 2 crossed factors (difficulties with emotion regulation and perceived life stress) to recruit a sample of 400 community-dwelling adults balanced across high- and low-risk for episodic mental health conditions. Participants first complete self-report questionnaires assessing current and lifetime psychiatric and medical diagnoses and treatment, and current psychosocial functioning. Participants then complete a 7-day in situ data collection phase that includes providing daily audio recordings, passive sensor data collected from smartphones, self-reports of daily mood and significant events, and a verbal description of the significant daily events during a nightly phone call. Participants complete the same baseline questionnaires 6 and 12 months after this phase. Self-report questionnaires will be scored using standard methods. Raw audio and passive sensor data will be processed to create a suite of daily summary features (eg, time spent at home). Results: Data collection began in June 2022 and is expected to conclude by July 2024. To date, 310 participants have consented to the study; 149 have completed the baseline questionnaire and 7-day intensive data collection phase; and 61 and 31 have completed the 6- and 12-month follow-up questionnaires, respectively. Once completed, the proposed data set will be made available to academic researchers, industry, and the government using a stepped approach to maximize data privacy. 
Conclusions: This data set is designed as a complementary approach to current computational psychiatry and digital phenotyping research, with the goal of advancing mental health risk assessment within the general population. This data set aims to support the field's move away from siloed research laboratories collecting proprietary data and toward interdisciplinary collaborations that incorporate clinical, technical, and quantitative expertise at all stages of the research process. International Registered Report Identifier (IRRID): DERR1-10.2196/53857 ", doi="10.2196/53857", url="https://www.researchprotocols.org/2024/1/e53857", url="http://www.ncbi.nlm.nih.gov/pubmed/38536220" } @Article{info:doi/10.2196/57779, author="Graham, Scott S. and Shiva, Jade and Sharma, Nandini and Barbour, B. Joshua and Majdik, P. Zoltan and Rousseau, F. Justin", title="Conflicts of Interest Publication Disclosures: Descriptive Study", journal="JMIR Data", year="2024", month="Oct", day="31", volume="5", pages="e57779", keywords="conflicts of interest", keywords="biomedical publishing", keywords="research integrity", keywords="dataset", keywords="COI", keywords="ethical", keywords="ethics", keywords="publishing", keywords="drugs", keywords="pharmacies", keywords="pharmacology", keywords="pharmacotherapy", keywords="pharmaceuticals", keywords="medication", keywords="disclosure", keywords="information science", keywords="library science", keywords="open data", abstract="Background: Multiple lines of previous research have documented that author conflicts of interest (COI) can compromise the integrity of the biomedical research enterprise. However, continuing research that would investigate why, how, and in what circumstances COI is most risky is stymied by the difficulty in accessing disclosure statements, which are not widely represented in available databases. 
Objective: In this study, we describe a new open access dataset of COI disclosures extracted from published biomedical journal papers. Methods: To develop the dataset, we used ClinCalc's Top 300 drugs lists for 2017 and 2018 to identify 319 of the most commonly used drugs. Search strategies for each product were developed using the National Library of Medicine's MeSH (Medical Subject Headings) browser and deployed using the eUtilities application programming interface in April 2021. We identified the 150 most relevant papers for each product and extracted COI disclosure statements from PubMed, PubMed Central, or retrieved papers as necessary. Results: Conflicts of Interest Publication Disclosures (COIPonD) is a new dataset that captures author-reported COI disclosures for biomedical research papers published in a wide range of journals and subspecialties. COIPonD captures author-reported disclosure information (including lack of disclosure) for over 38,000 PubMed-indexed papers published between 1949 and 2022. The collected papers are indexed by discussed drug products with a focus on the 319 most commonly used drugs in the United States. Conclusions: COIPonD should accelerate research efforts to understand the effects of COI on the biomedical research enterprise. In particular, this dataset should facilitate new studies of COI effects across disciplines and subspecialties. 
", doi="10.2196/57779", url="https://data.jmir.org/2024/1/e57779" } @Article{info:doi/10.2196/36874, author="Mbuagbaw, Lawrence and Jhuti, Diya and Zakaryan, Gohar and El-Kechen, Hussein and Rehman, Nadia and Youssef, Mark and Garcia, Cristian Michael and Arora, Vaibhav and Zani, Babalwa and Leenus, Alvin and Wu, Michael and Makanjuola, Oluwatoni", title="A Database of Randomized Trials on the HIV Care Cascade (CASCADE Database): Descriptive Study", journal="JMIR Data", year="2022", month="Aug", day="11", volume="3", number="1", pages="e36874", keywords="data set", keywords="database", keywords="cascade", keywords="HIV", keywords="antiretroviral therapy", keywords="adherence", keywords="retention", keywords="pragmatic", keywords="randomized controlled trial", keywords="randomized", keywords="clinical trial", keywords="review", keywords="evidence map", keywords="methodological research", keywords="open access", abstract="Background: The Joint United Nations Programme on HIV/AIDS has set targets for 2025 regarding people living with HIV. For these targets to be met, 95\% of people with HIV would need to know their HIV status, 95\% of people with HIV would need to be receiving antiretroviral therapy, and 95\% of people on antiretroviral therapy would need to be virally suppressed. Some countries are on track to meet these targets. However, within and across countries, several vulnerable populations may not meet these targets. This is in part because several approaches to improving the cascade of care after an HIV diagnosis are not tailored to and are not appropriate for vulnerable populations, such as men who have sex with men, sex workers, people who inject drugs, Black people, people in prisons, women, and youth. To inform research, policy, and practice, there is a need for curated data on HIV care cascade research. Objective: The CASCADE database is a repository of randomized clinical HIV trials. 
It was designed to inform, support, and improve HIV care cascade research, systematic reviews, and evidence maps. Methods: PubMed, Embase, CINAHL, PsycINFO, Web of Science, and the Cochrane Library were searched to obtain randomized trials that were designed to address at least one of the following care cascade outcomes: the initiation of therapy, adherence to medication, retention in care, and engagement in care. Data were screened and extracted in duplicate using DistillerSR software (Evidence Partners Incorporated) and were cataloged based on the following features: year, income level, setting, the delivery of the intervention, the population receiving the intervention, intervention type, and the level of pragmatism of the intervention. Results: A total of 298 HIV clinical trials are included in the CASCADE database, of which 162 (54.4\%) were conducted in high-income countries, and 116 (38.9\%) targeted vulnerable populations. Adherence to antiretroviral therapy was the most investigated HIV care cascade outcome (216/298, 72.5\%), followed by retention in care (34/298, 11.4\%). CASCADE has a user-friendly interface with simple and advanced search features. The CASCADE database has inspired 2 methodological papers and 13,567 website visits from over 10 countries. Conclusions: CASCADE is the first database dedicated to trials that focus on the HIV care cascade, and it can be used for systematic reviews, evidence maps, and methodological research. It is freely accessible, and the data can be downloaded in CSV format. 
", doi="10.2196/36874", url="https://data.jmir.org/2022/1/e36874" } @Article{info:doi/10.2196/30697, author="Foraker, Randi and Guo, Aixia and Thomas, Jason and Zamstein, Noa and Payne, RO Philip and Wilcox, Adam", title="The National COVID Cohort Collaborative: Analyses of Original and Computationally Derived Electronic Health Record Data", journal="J Med Internet Res", year="2021", month="Oct", day="4", volume="23", number="10", pages="e30697", keywords="synthetic data", keywords="protected health information", keywords="COVID-19", keywords="electronic health records and systems", keywords="data analysis", abstract="Background: Computationally derived (``synthetic'') data can enable the creation and analysis of clinical, laboratory, and diagnostic data as if they were the original electronic health record data. Synthetic data can support data sharing to answer critical research questions to address the COVID-19 pandemic. Objective: We aim to compare the results from analyses of synthetic data to those from original data and assess the strengths and limitations of leveraging computationally derived data for research purposes. Methods: We used the National COVID Cohort Collaborative's instance of MDClone, a big data platform with data-synthesizing capabilities (MDClone Ltd). We downloaded electronic health record data from 34 National COVID Cohort Collaborative institutional partners and tested three use cases, including (1) exploring the distributions of key features of the COVID-19--positive cohort; (2) training and testing predictive models for assessing the risk of admission among these patients; and (3) determining geospatial and temporal COVID-19--related measures and outcomes, and constructing their epidemic curves. We compared the results from synthetic data to those from original data using traditional statistics, machine learning approaches, and temporal and spatial representations of the data. 
Results: For each use case, the results of the synthetic data analyses successfully mimicked those of the original data such that the distributions of the data were similar and the predictive models demonstrated comparable performance. Although the synthetic and original data yielded overall nearly the same results, there were exceptions that included an odds ratio on either side of the null in multivariable analyses (0.97 vs 1.01) and differences in the magnitude of epidemic curves constructed for zip codes with low population counts. Conclusions: This paper presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in collaborative research for faster insights. ", doi="10.2196/30697", url="https://www.jmir.org/2021/10/e30697", url="http://www.ncbi.nlm.nih.gov/pubmed/34559671" } @Article{info:doi/10.2196/31122, author="Lee, Junghwan and Kim, Hyun Jae and Liu, Cong and Hripcsak, George and Natarajan, Karthik and Ta, Casey and Weng, Chunhua", title="Columbia Open Health Data for COVID-19 Research: Database Analysis", journal="J Med Internet Res", year="2021", month="Sep", day="30", volume="23", number="9", pages="e31122", keywords="COVID-19", keywords="open data", keywords="electronic health record", keywords="data science", keywords="research", keywords="data", keywords="access", keywords="database", keywords="symptom", keywords="cohort", keywords="prevalence", abstract="Background: COVID-19 has threatened the health of tens of millions of people all over the world. Massive research efforts have been made in response to the COVID-19 pandemic. Utilization of clinical data can accelerate these research efforts to combat the pandemic since important characteristics of the patients are often found by examining the clinical data. Publicly accessible clinical data on COVID-19, however, remain limited despite the immediate need. 
Objective: To provide shareable clinical data to catalyze COVID-19 research, we present Columbia Open Health Data for COVID-19 Research (COHD-COVID), a publicly accessible database providing clinical concept prevalence, clinical concept co-occurrence, and clinical symptom prevalence for hospitalized patients with COVID-19. COHD-COVID also provides data on hospitalized patients with influenza and general hospitalized patients as comparator cohorts. Methods: The data used in COHD-COVID were obtained from NewYork-Presbyterian/Columbia University Irving Medical Center's electronic health records database. Condition, drug, and procedure concepts were obtained from the visits of identified patients from the cohorts. Rare concepts were excluded, and the true concept counts were perturbed using Poisson randomization to protect patient privacy. Concept prevalence, concept prevalence ratio, concept co-occurrence, and symptom prevalence were calculated using the obtained concepts. Results: Concept prevalence and concept prevalence ratio analyses showed the clinical characteristics of the COVID-19 cohorts, confirming the well-known characteristics of COVID-19 (eg, acute lower respiratory tract infection and cough). The concepts related to the well-known characteristics of COVID-19 recorded high prevalence and high prevalence ratio in the COVID-19 cohort compared to the hospitalized influenza cohort and general hospitalized cohort. Concept co-occurrence analyses showed potential associations between specific concepts. In case of acute lower respiratory tract infection in the COVID-19 cohort, a high co-occurrence ratio was obtained with COVID-19--related concepts and commonly used drugs (eg, disease due to coronavirus and acetaminophen). 
Symptom prevalence analysis indicated symptom-level characteristics of the cohorts and confirmed that well-known symptoms of COVID-19 (eg, fever, cough, and dyspnea) showed higher prevalence than the hospitalized influenza cohort and the general hospitalized cohort. Conclusions: We present COHD-COVID, a publicly accessible database providing useful clinical data for hospitalized patients with COVID-19, hospitalized patients with influenza, and general hospitalized patients. We expect COHD-COVID to provide researchers and clinicians quantitative measures of COVID-19--related clinical features to better understand and combat the pandemic. ", doi="10.2196/31122", url="https://www.jmir.org/2021/9/e31122", url="http://www.ncbi.nlm.nih.gov/pubmed/34543225" } @Article{info:doi/10.2196/22446, author="Al Tamime, Reham and Weber, Ingmar", title="Tracking Exposure to Ads Amid the COVID-19 Pandemic: Development of a Public Google Ads Data Set", journal="JMIR Data", year="2021", month="Sep", day="14", volume="2", number="1", pages="e22446", keywords="COVID-19", keywords="coronavirus", keywords="SARS-CoV-2", keywords="panic buying", keywords="Google Ads", keywords="data", keywords="database", keywords="tracking", keywords="research", keywords="public availability", keywords="online behaviors", abstract="Background: The COVID-19 pandemic has had a substantial impact on economies, governments, businesses, and most importantly, people's health. To bring the spread of COVID-19 under control, strict lockdown measures have been implemented across the globe. These lockdown measures resulted in a spate of panic buying and increase in demand for hygiene products and other grocery items. Objective: In this paper, we describe a data set from Google Ads that looks at the presentation of ads to people while they browse the web during the COVID-19 pandemic. We are making the data set available to the research community. 
Methods: We started this ongoing data collection on March 28, 2020, leveraging Developer Tools' network requests to retrieve Google Ads data. We identified a list of items related and unrelated to panic buying. We then captured these items as targeting criteria under what people are actively researching or planning on Google Ads. Google Ads data has been filtered using additional targeting criteria such as country, gender, and parental status. Results: Since the inception of our collection, we have actively maintained and updated our repository on a monthly basis. In total, we have published over 4116 data points. This paper also presents basic statistics that reveal variations in Google Ads data across countries, gender, and parental status. Conclusions: We hope that this Google Ads data set can increase our understanding of ad exposure during the COVID-19 outbreak. In particular, this data set can lead to further studies that look at the relationship between exposure to ads, time spent web browsing, and health outcomes. ", doi="10.2196/22446", url="https://data.jmir.org/2021/1/e22446" }