Exploring clinical data resources for Healthcare Research, Artificial Intelligence, and Machine Learning applications

13 Mar 2023


Davide Rovati

Sr. Director of Operations

Clinical data is a vital aspect of the healthcare industry, providing valuable insights into patient treatment and medical research. However, it is complex and heterogeneous: it varies in temporal resolution, in the stage of treatment it covers, and in form, which may be structured or unstructured.

Structured data is organized into a tabular format, with rows representing individual entities (such as patients) and columns representing their characteristics or attributes. Unstructured data, on the other hand, lacks regular organization and includes various types such as clinical text (consisting of medical terminology and acronyms), images (diagnostic imaging such as CT scans and MRIs), and signals (measurements obtained from sensors at regular intervals, such as ECG).

In this article, we’ll review several sources that provide access to healthcare-related datasets, including the MIMIC database, the Centers for Medicare & Medicaid Services (CMS), OpenNeuro, PhysioBank, HealthData.gov, DeepLesion, and the NIH chest x-ray datasets. These resources offer valuable opportunities for researchers to utilize Artificial Intelligence (AI) and Machine Learning (ML) techniques to improve patient care and advance medical research.

Healthcare Machine Learning and AI applications in clinical data

Improving Patient Care and Medical research

The MIMIC database, or Medical Information Mart for Intensive Care, is a comprehensive collection of electronic medical records from patients admitted to intensive care units at the Beth Israel Deaconess Medical Center in Boston, MA. It contains data on over 40,000 individuals and is a valuable resource for healthcare professionals and researchers looking to improve patient care and advance medical research using machine learning techniques. The database is divided into two main modules. The “hosp module” covers patient demographics and hospital stays, as well as data from outside the hospital, such as outpatient laboratory tests. The “icu module” contains detailed data on intravenous and fluid inputs, procedures, and other charted information relevant to intensive care. Separating the data into two modules makes it easy to identify the source of each piece of information.
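To make the two-module layout concrete, here is a minimal sketch of linking an ICU-level record back to its hospital-level record through shared identifiers. All rows are invented toy data. The field names (`subject_id`, `hadm_id`, `stay_id`, `admission_type`, `los_days`) and the helper `link_icu_to_hosp` are assumptions for illustration and should be checked against the official MIMIC documentation before use.

```python
# Toy rows standing in for a hospital-level table and an ICU-level table.
# All values are invented; field names are assumptions for illustration.
hosp_admissions = [
    {"subject_id": 1, "hadm_id": 100, "admission_type": "EMERGENCY"},
    {"subject_id": 2, "hadm_id": 200, "admission_type": "ELECTIVE"},
]
icu_stays = [
    {"subject_id": 1, "hadm_id": 100, "stay_id": 9001, "los_days": 2.5},
    {"subject_id": 1, "hadm_id": 100, "stay_id": 9002, "los_days": 1.0},
]


def link_icu_to_hosp(admissions, stays):
    """Attach each ICU stay to its hospital admission, tagging the source module."""
    by_hadm = {row["hadm_id"]: row for row in admissions}
    linked = []
    for stay in stays:
        admission = by_hadm.get(stay["hadm_id"])
        if admission is None:
            continue  # ICU stay with no matching admission in this toy data
        linked.append({
            "subject_id": stay["subject_id"],
            "hadm_id": stay["hadm_id"],
            "stay_id": stay["stay_id"],
            # Prefixes record which module each value came from.
            "hosp.admission_type": admission["admission_type"],
            "icu.los_days": stay["los_days"],
        })
    return linked


linked_rows = link_icu_to_hosp(hosp_admissions, icu_stays)
for row in linked_rows:
    print(row)
```

Prefixing each value with its module name mirrors the point made above: a reader of the joined record can still tell where every value originated.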

Using CMS datasets for Healthcare Machine Learning research

The Centers for Medicare & Medicaid Services (CMS) is the US federal agency that administers public health insurance programs. It also makes a variety of healthcare-related datasets available to the public, which can be used for healthcare machine learning research. CMS administers the Medicare program, which provides healthcare coverage for seniors and certain disabled individuals, and the Medicaid program, which provides healthcare coverage for low-income individuals and families. CMS also plays a role in regulating the healthcare industry and ensuring that healthcare providers adhere to certain standards of care.

Accessing Neuroimaging data for collaborative Healthcare research

OpenNeuro is a platform that enables researchers to share and validate neuroimaging data (MRI, PET, MEG, EEG, and iEEG) that adheres to the BIDS (Brain Imaging Data Structure) standard. Developed by the Stanford Center for Reproducible Neuroscience, the platform offers access to 770 public datasets for researchers to use and study. By relying on a standardized format, OpenNeuro makes it easier for healthcare machine learning researchers to share and access data, collaborate, and build upon one another’s work.
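One practical benefit of a standardized format is that file names themselves become machine-readable. BIDS file names are built from underscore-separated key-value entities followed by a suffix and an extension. The sketch below parses such a name. The function `parse_bids_name` and the example file name are hypothetical, and the parser covers only this simple pattern rather than the full specification.

```python
def parse_bids_name(filename):
    """Split a BIDS-style file name into key-value entities, suffix, and extension."""
    # Everything after the first dot is treated as the extension (e.g. ".nii.gz").
    stem, dot, extension = filename.partition(".")
    parts = stem.split("_")
    entities = {}
    # All parts except the last are expected to look like "key-value".
    for part in parts[:-1]:
        key, sep, value = part.partition("-")
        if not sep:
            raise ValueError(f"not a key-value entity: {part!r}")
        entities[key] = value
    return {"entities": entities, "suffix": parts[-1], "extension": dot + extension}


# Hypothetical file name following the BIDS naming pattern.
parsed = parse_bids_name("sub-01_ses-02_task-rest_bold.nii.gz")
print(parsed)
```

Because every compliant dataset follows the same naming pattern, a loader written for one dataset can usually be pointed at another with little or no change.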

Biomedical data and Machine Learning research

PhysioBank is an extensive archive on the PhysioNet platform that provides access to digital recordings of physiological signals, time series data, and related information for use in biomedical and machine learning research. It includes data on cardiopulmonary, neural, and other biomedical signals from both healthy individuals and those with various medical conditions, as well as clinical and imaging data related to critical care. The data is collected from a variety of studies and contributed by members of the research community. For example, using publicly available data on PhysioBank, the authors of the article “Beat-to-Beat Fetal Heart Rate Analysis Using Portable Medical Device and Wavelet Transformation Technique” report that the fetal heart rate obtained by their proposed algorithm agrees with the baselines with an accuracy above 95%.
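As a simple illustration of what "beat-to-beat" analysis of such signals means, the sketch below converts a list of detected beat times into instantaneous heart rate values (60 divided by the interval between consecutive beats, in seconds). It is not the wavelet-based method from the cited article. The function name `beat_to_beat_heart_rate` and the timestamps are invented for illustration, and beat detection itself is assumed to have been done already.

```python
def beat_to_beat_heart_rate(beat_times_s):
    """Return heart rate in beats per minute for each pair of consecutive beats."""
    rates = []
    for earlier, later in zip(beat_times_s, beat_times_s[1:]):
        interval = later - earlier  # seconds between two consecutive beats
        if interval <= 0:
            raise ValueError("beat times must be strictly increasing")
        rates.append(60.0 / interval)
    return rates


# Invented beat timestamps in seconds.
beat_times = [0.0, 0.5, 1.0, 1.4]
rates = beat_to_beat_heart_rate(beat_times)
print([round(rate, 1) for rate in rates])
```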

Analyzing high-value Health data

HealthData.gov, operated by the U.S. Department of Health and Human Services, is a platform for sharing high-value health data with researchers and entrepreneurs, featuring a diverse range of datasets. Among the datasets most recently accessed by researchers are COVID-19 Diagnostic Laboratory Testing – PCR Testing – Time Series, COVID-19 Public Therapeutic Locator, COVID-19 Reported Patient Impact and Hospital Capacity by Facility, and COVID-19 Reported Patient Impact and Hospital Capacity by State. Using various datasets from HealthData.gov, the authors of the article “Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring” built an automated data quality platform that assesses incoming datasets and generates a quality label, score, and comprehensive report.

Enhancing the accuracy of Lesion Detection with AI and ML

The DeepLesion dataset, made available by the National Institutes of Health’s Clinical Center, is a large collection of CT images that is freely accessible to the scientific community for the purpose of enhancing the accuracy of lesion detection. The dataset consists of 32,120 axial CT slices from 10,594 CT scans, featuring 1-3 annotated lesions per image with bounding boxes and size measurements, totaling 32,735 lesions. These lesion annotations were extracted from the NIH’s picture archiving and communication system, making the dataset a valuable resource for researchers in the field of medical image analysis. The authors of one paper, which proposes a transformer-based network for lesion RECIST (Response Evaluation Criteria In Solid Tumors) diameter prediction and segmentation, report experiments on the DeepLesion dataset with promising results on two clinically relevant downstream tasks: 3D lesion segmentation and RECIST assessment in longitudinal studies.

Examples of key CT slices with overlaid lesion annotations for review purposes, taken from the DeepLesion dataset.
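Bounding-box annotations like these are commonly used to score detection models with intersection over union (IoU), the overlap between a predicted box and an annotated box divided by the area they jointly cover. The sketch below is a generic illustration, not a metric prescribed by the DeepLesion authors. The `(x_min, y_min, x_max, y_max)` box layout, the function name `box_iou`, and the coordinates are assumptions.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# Invented pixel coordinates for an annotated lesion box and a model prediction.
annotated = (10.0, 10.0, 50.0, 50.0)
predicted = (30.0, 30.0, 70.0, 70.0)
iou = box_iou(annotated, predicted)
print(round(iou, 3))
```

A detection is typically counted as correct when its IoU with an annotated box exceeds a chosen threshold; the threshold itself is a design choice that varies between studies.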

Utilizing Machine Learning for improved Lung Disease Diagnosis and Treatment

The National Institutes of Health (NIH) chest x-ray dataset is a large collection of anonymized chest x-ray images and accompanying data. The dataset includes over 100,000 images from more than 30,000 patients, including individuals with advanced lung disease. This dataset is a valuable resource for researchers and medical professionals working on the development of algorithms and techniques for analyzing chest x-ray images and improving the diagnosis and treatment of lung diseases.
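Since the dataset holds more images than patients, some patients contribute several images. When training a model, it is usually advisable to split by patient rather than by image, so that the same person never appears in both the training and the test set. The sketch below shows one way to do this deterministically by hashing the patient identifier. The file names, patient IDs, and the helper `assign_split` are invented for illustration and do not reflect the dataset's actual file layout.

```python
import hashlib


def assign_split(patient_id, test_fraction=0.2):
    """Deterministically assign a patient to 'train' or 'test' from a hash of the ID."""
    digest = hashlib.sha256(str(patient_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


# Invented (image file, patient ID) pairs; several images can share one patient.
images = [
    ("img_0001.png", "p001"),
    ("img_0002.png", "p001"),
    ("img_0003.png", "p002"),
    ("img_0004.png", "p003"),
    ("img_0005.png", "p003"),
    ("img_0006.png", "p004"),
]

splits = {"train": [], "test": []}
for image_name, patient_id in images:
    # Every image inherits the split of its patient.
    splits[assign_split(patient_id)].append(image_name)

print({name: len(files) for name, files in splits.items()})
```

Hashing the identifier, instead of shuffling the images, keeps the assignment stable when new images are added later.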