Data Only Package to healthyR • healthyR.data

Overview

To view the full wiki, click here: Full healthyR.data Wiki

healthyR.data is a comprehensive R package that provides healthcare administrative datasets and tools for accessing CMS (Centers for Medicare & Medicaid Services) hospital data. The package serves two primary purposes:

Built-in Healthcare Data: Provides a rich, realistic administrative dataset (healthyR_data) with 187,721 rows covering hospital visits, patient demographics, charges, payments, and quality metrics
CMS Data Access: Offers a suite of functions to fetch, download, and work with current CMS hospital data, including quality measures, outcomes, and provider information

Whether you’re testing healthcare analytics functions, teaching health informatics, or conducting research, healthyR.data provides the data infrastructure you need.

Features

Built-in Administrative Dataset

The healthyR_data dataset includes:

Patient Information: Medical Record Numbers (MRN), visit IDs, and visit dates
Financial Data: Charges, payments, adjustments, and amounts due
Clinical Metrics: Length of stay, service lines, readmission flags
Quality Indicators: Expected vs actual length of stay, outlier flags, readmission expectations
Payer Information: Insurance classifications and payer groupings

CMS Data Access Functions

The package provides multiple ways to access current CMS hospital data:

Meta Data Functions: Search and explore available CMS datasets
- get_cms_meta_data() - Search CMS data catalog
- get_provider_meta_data() - Search provider data
Data Download Functions: Fetch current hospital data
- current_hosp_data() - Download all current hospital data
- fetch_cms_data() - Fetch specific CMS datasets
- fetch_provider_data() - Fetch provider data via API
Specific Hospital Data Functions: Get targeted datasets
- current_asc_data() - Ambulatory Surgery Center data
- current_hcahps_data() - Hospital Consumer Assessment of Healthcare Providers and Systems
- current_hai_data() - Healthcare-Associated Infections
- current_readmission_data() - Hospital readmissions
- And 20+ more specific data extraction functions

Utility Functions

is_valid_url() - Validate URLs before data fetching
current_hosp_data_dict() - Get data dictionaries

Installation

Install the released version from CRAN:

install.packages("healthyR.data")

Install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("spsanderson/healthyR.data")

Quick Start

Using the Built-in Dataset

library(healthyR.data)
library(dplyr)

# Load the built-in dataset
df <- healthyR_data

# Explore the data structure
glimpse(df)
#> Rows: 187,721
#> Columns: 17
#> $ mrn                      <chr> "86069614", "60856527", "80673110", "55897373…
#> $ visit_id                 <chr> "3519249247", "3602225015", "3125290892", "38…
#> $ visit_start_date_time    <dttm> 2010-01-04 05:00:00, 2010-01-04 05:00:00, 20…
#> $ visit_end_date_time      <dttm> 2010-01-04, 2010-01-04, 2010-01-04, 2010-01-…
#> $ total_charge_amount      <dbl> 25983.88, 22774.05, 10690.45, 8788.02, 7325.1…
#> $ total_amount_due         <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 201.52, 20.00, …
#> $ total_adjustment_amount  <dbl> -20799.61, -12978.37, -7596.09, -7663.57, -60…
#> $ payer_grouping           <chr> "Medicare B", "Medicare HMO", "HMO", "Medicar…
#> $ total_payment_amount     <dbl> -5184.27, -9795.68, -3094.36, -1124.45, -1269…
#> $ ip_op_flag               <chr> "O", "O", "O", "O", "O", "O", "O", "O", "O", …
#> $ service_line             <chr> "General Outpatient", "General Outpatient", "…
#> $ length_of_stay           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ expected_length_of_stay  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ length_of_stay_threshold <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ los_outlier_flag         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ readmit_flag             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ readmit_expectation      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

# Analyze service lines by patient type
df %>% 
    count(ip_op_flag, service_line) %>%
    arrange(ip_op_flag, desc(n)) %>%
    rename(count = n) %>%
    head(10)
#> # A tibble: 10 × 3
#>    ip_op_flag service_line                                 count
#>    <chr>      <chr>                                        <int>
#>  1 I          Medical                                      64435
#>  2 I          Surgical                                     14916
#>  3 I          COPD                                          4398
#>  4 I          CHF                                           3871
#>  5 I          Pneumonia                                     3323
#>  6 I          Cellulitis                                    3311
#>  7 I          Major Depression/Bipolar Affective Disorders  2866
#>  8 I          Chest Pain                                    2766
#>  9 I          GI Hemorrhage                                 2404
#> 10 I          MI                                            2253

Analyzing Financial Data

# Analyze charges and payments by payer type
df %>%
    group_by(payer_grouping) %>%
    summarise(
        visits = n(),
        avg_charge = mean(total_charge_amount, na.rm = TRUE),
        avg_payment = mean(abs(total_payment_amount), na.rm = TRUE),
        .groups = "drop"
    ) %>%
    arrange(desc(visits)) %>%
    head(10)
#> # A tibble: 10 × 4
#>    payer_grouping visits avg_charge avg_payment
#>    <chr>           <int>      <dbl>       <dbl>
#>  1 Medicare A      52622     68452.      11861.
#>  2 Medicaid HMO    25484     37285.       5575.
#>  3 Blue Cross      24357     31561.      10374.
#>  4 Medicare B      22563     16136.       2531.
#>  5 Medicare HMO    18997     55526.       8443.
#>  6 HMO             17444     31407.       9405.
#>  7 Medicaid         8777     49428.       7602.
#>  8 Commercial       6567     35300.      12506.
#>  9 Self Pay         3649     24998.        662.
#> 10 Compensation     2502     40101.       6413.

Quality Metrics Analysis

# Examine length of stay outliers
df %>%
    filter(ip_op_flag == "I") %>%  # Inpatient only
    group_by(service_line) %>%
    summarise(
        total_visits = n(),
        avg_los = mean(length_of_stay, na.rm = TRUE),
        outlier_rate = mean(los_outlier_flag, na.rm = TRUE) * 100,
        readmit_rate = mean(readmit_flag, na.rm = TRUE) * 100,
        .groups = "drop"
    ) %>%
    arrange(desc(total_visits)) %>%
    head(10)
#> # A tibble: 10 × 5
#>    service_line                   total_visits avg_los outlier_rate readmit_rate
#>    <chr>                                 <int>   <dbl>        <dbl>        <dbl>
#>  1 Medical                               64435    5.72       0.205         12.8 
#>  2 Surgical                              14916    9.35       0.436         10.8 
#>  3 COPD                                   4398    5.28       0.0910        19.6 
#>  4 CHF                                    3871    6.42       0.103         21.1 
#>  5 Pneumonia                              3323    5.89       0.120         14.2 
#>  6 Cellulitis                             3311    4.78       0.242          9.09
#>  7 Major Depression/Bipolar Affe…         2866   10.4        0.105          5.58
#>  8 Chest Pain                             2766    2.05       0.145          8.28
#>  9 GI Hemorrhage                          2404    5.86       0.416         14.6 
#> 10 MI                                     2253    4.96       0.266         14.4

Working with CMS Data

Searching for CMS Datasets

library(healthyR.data)

# Search for datasets about hospital readmissions
meta_data <- get_cms_meta_data(
    .keyword = "readmission",
    .data_version = "current"
)

# View available datasets
meta_data %>%
    select(title, modified, media_type) %>%
    head()

Fetching CMS Data via API

# Get metadata for a specific dataset
cms_meta <- get_cms_meta_data(
    .title = "Unplanned Hospital Visits",
    .data_version = "current",
    .media_type = "API"
)

# Extract the data link
data_link <- cms_meta$data_link[1]

# Fetch the actual data
hospital_data <- fetch_cms_data(data_link)

glimpse(hospital_data)

Downloading Complete Hospital Data

# Download all current hospital data files (requires user to select directory)
all_hosp_data <- current_hosp_data()

# The result is a list of tibbles, one for each data file
names(all_hosp_data)

# Extract specific datasets
asc_data <- current_asc_data(
    all_hosp_data, 
    .data_sets = c("Facility", "State")
)

Working with Provider Data

# Search for provider datasets
provider_meta <- get_provider_meta_data(.keyword = "hospital")

# Fetch provider data using an identifier
provider_data <- fetch_provider_data("069d-826b", .limit = 100)

glimpse(provider_data)

Use Cases

healthyR.data is ideal for:

Healthcare Analytics: Test and develop healthcare analytics functions with realistic data
Education: Teach health informatics and data analysis courses
Research: Prototype healthcare research analyses before working with protected data
Package Development: Test healthcare R packages (like the healthyR package)
Quality Improvement: Analyze hospital quality metrics and performance indicators
Financial Analysis: Study healthcare billing, payments, and reimbursement patterns
Benchmarking: Compare your data against national hospital data from CMS

Data Dictionary

The healthyR_data dataset contains 187,721 rows and 17 variables:

Variable	Description
`mrn`	Medical Record Number (unique patient identifier)
`visit_id`	Unique hospital visit identifier
`visit_start_date_time`	Visit start date and time
`visit_end_date_time`	Visit end date and time
`total_charge_amount`	Total charges for the visit (USD)
`total_amount_due`	Amount still owed for the visit (USD)
`total_adjustment_amount`	Total adjustments to the account (USD)
`payer_grouping`	Insurance classification
`total_payment_amount`	Total payments received (USD)
`ip_op_flag`	Patient type (I=Inpatient, O=Outpatient)
`service_line`	Hospital service line
`length_of_stay`	Total days admitted to hospital
`expected_length_of_stay`	Expected days for admission
`length_of_stay_threshold`	LOS threshold for outlier classification
`los_outlier_flag`	Binary indicator if visit exceeded LOS threshold
`readmit_flag`	Binary indicator if readmitted within 30 days
`readmit_expectation`	Expected readmission rate from benchmark

Requirements

R >= 4.1.0
Dependencies: dplyr, rlang, utils, janitor, httr2, stringr, tidyr, stats

Documentation

Getting Help

If you encounter a bug or have a feature request:

Report issues on GitHub
Check the function reference for detailed documentation

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes:

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Please make sure to update tests as appropriate and follow the existing code style.

healthyR - Hospital data analysis workflow tools
healthyverse - Meta-package for healthcare analytics

License

MIT License - see LICENSE.md for details

Author

Steven P. Sanderson II, MPH
Email: [email protected]
ORCID: 0009-0006-7661-8247

Citation

If you use this package in your research, please cite:

citation("healthyR.data")

Note: The built-in healthyR_data dataset contains synthetic/de-identified data for demonstration and testing purposes. When working with CMS data functions, you’re accessing real, publicly available CMS hospital data.

healthyR.data