Mapping K-Means with healthyR.ai

code
rtip
healthyrai
kmeans
Author

Steven P. Sanderson II, MPH

Published

November 9, 2022

Introduction

K-Means is a clustering algorithm that can be used to find potential clusters in your data.

The algorithm requires you to choose K, the number of clusters, in advance. In practice you fit it for several values of K and compare the results to decide which value works best.

The R package {healthyR.ai} has a utility that does this.

Function

Let’s take a look at the full function call.

hai_kmeans_mapped_tbl(.data, .centers = 15)

kmeans_mapped_tbl(.data, .centers = 15)

You will notice that there are two functions. They are synonyms of each other, because this functionality is moving out of the {healthyR} package.

Parameters

The function takes the following arguments:

  • .data - The data to cluster. This should be the output of hai_kmeans_user_item_tbl() or its synonym, or at least be in user-item matrix format.
  • .centers - The maximum number of centers you want to map to the k-means function. The default is 15.
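To make the idea of mapping k-means over a range of centers concrete, here is a minimal base R sketch. It is not the {healthyR.ai} implementation: the helper name map_kmeans(), the use of the built-in USArrests data, and the nstart value are all assumptions made for illustration. It fits stats::kmeans() once for each K from 1 to a maximum and collects the same summary columns that appear later in this post.

```r
# Hypothetical helper for illustration only; NOT the healthyR.ai source code.
# Fits k-means for every K in 1..max_centers and collects summary statistics.
map_kmeans <- function(x, max_centers = 15) {
  ks <- seq_len(max_centers)

  # One kmeans fit per candidate number of centers
  fits <- lapply(ks, function(k) stats::kmeans(x, centers = k, nstart = 10))

  data.frame(
    centers      = ks,
    totss        = vapply(fits, function(f) f$totss, numeric(1)),
    tot.withinss = vapply(fits, function(f) f$tot.withinss, numeric(1)),
    betweenss    = vapply(fits, function(f) f$betweenss, numeric(1)),
    iter         = vapply(fits, function(f) as.integer(f$iter), integer(1))
  )
}

set.seed(123)          # k-means uses random starts, so fix the seed
x <- scale(USArrests)  # built-in numeric data, standardized (assumed example data)
res <- map_kmeans(x, max_centers = 8)
print(res)
```

The total sum of squares is the same for every K, while the within-cluster sum of squares is what changes as centers are added. That is the quantity you would typically compare across rows.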

Example

Let’s run an example.

library(healthyR.data)
library(healthyR.ai)
library(dplyr)

data_tbl <- healthyR_data %>%
  filter(ip_op_flag == "I") %>%
  filter(payer_grouping != "Medicare B") %>%
  filter(payer_grouping != "?") %>%
  select(service_line, payer_grouping) %>%
  mutate(record = 1) %>%
  as_tibble()

ui_tbl <- hai_kmeans_user_item_tbl(
  .data = data_tbl,
  .row_input = service_line,
  .col_input = payer_grouping,
  .record_input = record
)

kmeans_mapped_tbl <- hai_kmeans_mapped_tbl(ui_tbl)
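If the user-item matrix format is unfamiliar, the following base R sketch may help. Judging from the printed output in this post, each row is a service line and each column holds the share of that service line's records belonging to a payer grouping. That reading is an assumption, and the toy data below is made up for illustration. This is not how hai_kmeans_user_item_tbl() is implemented.

```r
# Toy data, invented for illustration; column names mirror the example above
toy <- data.frame(
  service_line   = c("Medical", "Medical", "Medical",
                     "Surgical", "Surgical", "Syncope"),
  payer_grouping = c("Blue Cross", "Medicare A", "Medicare A",
                     "Commercial", "Blue Cross", "Medicare A"),
  stringsAsFactors = FALSE
)

# Count records for each service line / payer combination
counts <- table(toy$service_line, toy$payer_grouping)

# Convert counts to row-wise proportions (each row sums to 1)
ui_mat <- prop.table(counts, margin = 1)

# One row per service line, one column per payer grouping
ui_df <- as.data.frame.matrix(ui_mat)
print(round(ui_df, 3))
```

Working with proportions rather than raw counts keeps high-volume service lines from dominating the distance calculations that k-means relies on.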

Let’s take a look at our data, the user-item matrix, and our k-means mapped tibble.

data_tbl
# A tibble: 116,823 × 3
   service_line  payer_grouping record
   <chr>         <chr>           <dbl>
 1 Medical       Blue Cross          1
 2 Schizophrenia Medicare A          1
 3 Syncope       Medicare A          1
 4 Pneumonia     Medicare A          1
 5 Chest Pain    Blue Cross          1
 6 Chest Pain    Blue Cross          1
 7 Surgical      Commercial          1
 8 Medical       Medicare A          1
 9 Alcohol Abuse Medicare A          1
10 Syncope       Medicare A          1
# … with 116,813 more rows
ui_tbl
# A tibble: 23 × 12
   service_line   Blue …¹ Comme…² Compe…³ Excha…⁴    HMO Medic…⁵ Medic…⁶ Medic…⁷
   <chr>            <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
 1 Alcohol Abuse   0.0941 0.0321  5.25e-4 0.0116  0.0788 0.158    0.367   0.173 
 2 Bariatric Sur…  0.317  0.0583  0       0.0518  0.168  0.00324  0.343   0.0485
 3 Carotid Endar…  0.0845 0.0282  0       0       0.0141 0        0.0282  0.648 
 4 Cellulitis      0.110  0.0339  1.18e-2 0.00847 0.0805 0.0869   0.192   0.355 
 5 Chest Pain      0.144  0.0391  2.90e-3 0.00543 0.112  0.0522   0.159   0.324 
 6 CHF             0.0295 0.00958 5.18e-4 0.00414 0.0205 0.0197   0.0596  0.657 
 7 COPD            0.0493 0.0228  2.28e-4 0.00548 0.0342 0.0461   0.172   0.520 
 8 CVA             0.0647 0.0246  1.07e-3 0.0107  0.0524 0.0289   0.0764  0.555 
 9 GI Hemorrhage   0.0542 0.0175  1.25e-3 0.00834 0.0480 0.0350   0.0855  0.588 
10 Joint Replace…  0.139  0.0179  3.36e-2 0.00673 0.0516 0        0.0874  0.5   
# … with 13 more rows, 3 more variables: `Medicare HMO` <dbl>,
#   `No Fault` <dbl>, `Self Pay` <dbl>, and abbreviated variable names
#   ¹​`Blue Cross`, ²​Commercial, ³​Compensation, ⁴​`Exchange Plans`, ⁵​Medicaid,
#   ⁶​`Medicaid HMO`, ⁷​`Medicare A`
kmeans_mapped_tbl
# A tibble: 15 × 3
   centers k_means  glance          
     <int> <list>   <list>          
 1       1 <kmeans> <tibble [1 × 4]>
 2       2 <kmeans> <tibble [1 × 4]>
 3       3 <kmeans> <tibble [1 × 4]>
 4       4 <kmeans> <tibble [1 × 4]>
 5       5 <kmeans> <tibble [1 × 4]>
 6       6 <kmeans> <tibble [1 × 4]>
 7       7 <kmeans> <tibble [1 × 4]>
 8       8 <kmeans> <tibble [1 × 4]>
 9       9 <kmeans> <tibble [1 × 4]>
10      10 <kmeans> <tibble [1 × 4]>
11      11 <kmeans> <tibble [1 × 4]>
12      12 <kmeans> <tibble [1 × 4]>
13      13 <kmeans> <tibble [1 × 4]>
14      14 <kmeans> <tibble [1 × 4]>
15      15 <kmeans> <tibble [1 × 4]>
kmeans_mapped_tbl %>%
  tidyr::unnest(glance)
# A tibble: 15 × 6
   centers k_means  totss tot.withinss betweenss  iter
     <int> <list>   <dbl>        <dbl>     <dbl> <int>
 1       1 <kmeans>  1.41       1.41    1.33e-15     1
 2       2 <kmeans>  1.41       0.592   8.17e- 1     1
 3       3 <kmeans>  1.41       0.372   1.04e+ 0     2
 4       4 <kmeans>  1.41       0.276   1.13e+ 0     2
 5       5 <kmeans>  1.41       0.202   1.21e+ 0     2
 6       6 <kmeans>  1.41       0.159   1.25e+ 0     3
 7       7 <kmeans>  1.41       0.124   1.28e+ 0     3
 8       8 <kmeans>  1.41       0.0884  1.32e+ 0     3
 9       9 <kmeans>  1.41       0.0745  1.33e+ 0     3
10      10 <kmeans>  1.41       0.0576  1.35e+ 0     2
11      11 <kmeans>  1.41       0.0460  1.36e+ 0     3
12      12 <kmeans>  1.41       0.0363  1.37e+ 0     2
13      13 <kmeans>  1.41       0.0272  1.38e+ 0     2
14      14 <kmeans>  1.41       0.0202  1.39e+ 0     3
15      15 <kmeans>  1.41       0.0161  1.39e+ 0     3
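One way to use this output when assessing K is to look at how quickly tot.withinss falls as centers are added. The sketch below copies the tot.withinss values printed above into a plain data frame, so it runs on its own. The names scree_tbl, pct_explained, and drop are introduced here for illustration. With the real object you would work from the unnested tibble instead.

```r
# Values copied from the unnested glance output printed above
scree_tbl <- data.frame(
  centers = 1:15,
  tot.withinss = c(1.41, 0.592, 0.372, 0.276, 0.202, 0.159, 0.124, 0.0884,
                   0.0745, 0.0576, 0.0460, 0.0363, 0.0272, 0.0202, 0.0161)
)
totss <- 1.41  # total sum of squares, constant across all rows above

# Share of total variation accounted for by the clustering at each K
scree_tbl$pct_explained <- 1 - scree_tbl$tot.withinss / totss

# Reduction in within-cluster sum of squares gained by adding one more center
scree_tbl$drop <- c(NA, -diff(scree_tbl$tot.withinss))

print(scree_tbl)

# Simple scree (elbow) plot using base graphics
plot(
  scree_tbl$centers, scree_tbl$tot.withinss,
  type = "b",
  xlab = "Number of centers (K)",
  ylab = "Total within-cluster sum of squares"
)
```

The largest single drop is from one center to two, and the gains shrink as more centers are added. Where the curve flattens enough to stop is a judgment call that depends on the data and how the clusters will be used.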

Voilà!