Complete Example

Guillermo Basulto-Elias

Whenever I start working on a dataset, I need to take a glance at the data to see what they look like and whether the format is the one I am expecting. I found myself coding similar lines over and over again with each dataset I put my hands on, so I decided to put those lines together in an R package that I and others can use. I called it glancedata.

There are already some great R packages for summarizing information. Two of the best, in my opinion, are skimr and GGally. In this vignette, I provide examples of the functions in glancedata as well as some of the functions in these two packages.

Below is a table with the functions shown in this vignette.

| Package | Function | Description |
|---|---|---|
| skimr | skim | Alternative to summary. Friendly with dplyr::group_by(). |
| glancedata | glance_data | Alternative to summary. Emphasizes missing data and binary variables. |
| glancedata | glance_data_in_workbook | Similar to glance_data. Creates a list of dataframes instead and saves an XLSX file. |
| glancedata | plot_numerical_vars | Creates a plot per numerical variable. It might be a histogram, density plot, qqplot, violin plot or scatterplot. |
| glancedata | plot_discrete_vars | Creates a plot per variable with up to 20 different values. This limit can be adjusted. |
| GGally | ggpairs | Creates plots for pairwise comparison of columns. |

Example (USArrests)

I am going to use a built-in dataset in R. I added some columns to it so we can see what happens with different types of columns. You may load your own dataset instead of sample_data.

The example we are going to use is USArrests. According to the dataset description, it contains arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973, along with the percentage of the population living in urban areas (type help("USArrests") in the console to see more details).

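library(dplyr)
library(tidyr)
library(knitr)

# state.x77 and USArrests both contain a column named Murder, so the
# USArrests copy is renamed to Murder1 to keep the column names unique
sample_data <- 
  tibble(State = state.name,
         Region = state.region) %>%
  bind_cols(as_tibble(state.x77)) %>%
  bind_cols(rename(USArrests, Murder1 = Murder))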

kable(head(sample_data))
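| State | Region | Population | Income | Illiteracy | Life Exp | Murder | HS Grad | Frost | Area | Murder1 | Assault | UrbanPop | Rape |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | South | 3615 | 3624 | 2.1 | 69.05 | 15.1 | 41.3 | 20 | 50708 | 13.2 | 236 | 58 | 21.2 |
| Alaska | West | 365 | 6315 | 1.5 | 69.31 | 11.3 | 66.7 | 152 | 566432 | 10.0 | 263 | 48 | 44.5 |
| Arizona | West | 2212 | 4530 | 1.8 | 70.55 | 7.8 | 58.1 | 15 | 113417 | 8.1 | 294 | 80 | 31.0 |
| Arkansas | South | 2110 | 3378 | 1.9 | 70.66 | 10.1 | 39.9 | 65 | 51945 | 8.8 | 190 | 50 | 19.5 |
| California | West | 21198 | 5114 | 1.1 | 71.71 | 10.3 | 62.6 | 20 | 156361 | 9.0 | 276 | 91 | 40.6 |
| Colorado | West | 2541 | 4884 | 0.7 | 72.06 | 6.8 | 63.9 | 166 | 103766 | 7.9 | 204 | 78 | 38.7 |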

Use of skimr

Many packages are useful for summarizing data. One of them is skimr, whose main function, skim, is an alternative to summary.

## Load package
library(skimr)

## Call main function
skim(sample_data)

Skim summary statistics
n obs: 50
n variables: 14

Variable type: character

variable missing complete n min max empty n_unique
State 0 50 50 4 14 0 50

Variable type: factor

variable missing complete n n_unique top_counts ordered
Region 0 50 50 4 Sou: 16, Wes: 13, Nor: 12, Nor: 9 FALSE

Variable type: integer

variable missing complete n mean sd p0 p25 p50 p75 p100 hist
Assault 0 50 50 170.76 83.34 45 109 159 249 337
UrbanPop 0 50 50 65.54 14.47 32 54.5 66 77.75 91

Variable type: numeric

variable missing complete n mean sd p0 p25 p50 p75 p100 hist
Area 0 50 50 70735.88 85327.3 1049 36985.25 54277 81162.5 566432
Frost 0 50 50 104.46 51.98 0 66.25 114.5 139.75 188
HS Grad 0 50 50 53.11 8.08 37.8 48.05 53.25 59.15 67.3
Illiteracy 0 50 50 1.17 0.61 0.5 0.62 0.95 1.58 2.8
Income 0 50 50 4435.8 614.47 3098 3992.75 4519 4813.5 6315
Life Exp 0 50 50 70.88 1.34 67.96 70.12 70.67 71.89 73.6
Murder 0 50 50 7.38 3.69 1.4 4.35 6.85 10.67 15.1
Murder1 0 50 50 7.79 4.36 0.8 4.08 7.25 11.25 17.4
Population 0 50 50 4246.42 4464.49 365 1079.5 2838.5 4968.5 21198
Rape 0 50 50 21.23 9.37 7.3 15.08 20.1 26.17 46
variable type formatted
Population numeric ▇▅▁▁▁▁▁▁
Income numeric ▁▅▅▇▆▃▁▁
Illiteracy numeric ▇▃▃▂▂▂▁▁
Life Exp numeric ▂▂▁▇▃▅▃▂
Murder numeric ▆▃▅▃▂▇▂▁
HS Grad numeric ▅▂▂▃▇▇▃▂
Frost numeric ▃▂▃▂▅▇▃▅
Area numeric ▇▃▁▁▁▁▁▁
Murder1 numeric ▇▇▇▇▃▇▁▃
Assault integer ▇▇▅▇▂▇▅▂
UrbanPop integer ▂▃▆▅▇▆▇▃
Rape numeric ▆▇▇▆▃▂▂▂
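The function table at the top of this vignette notes that skim is friendly with dplyr::group_by(). The following is a minimal sketch of that use, assuming skimr and dplyr are installed; the object names arrests_by_region and grouped_summary are only illustrative, and the exact layout of the printed summary depends on the installed skimr version.

```r
library(dplyr)
library(skimr)

# Illustrative data: USArrests plus the census region of each state.
# Both USArrests and state.region are ordered alphabetically by state.
arrests_by_region <- USArrests %>%
  mutate(Region = state.region) %>%
  group_by(Region)

# skim() on a grouped data frame summarizes each variable within each group
grouped_summary <- skim(arrests_by_region)
grouped_summary
```

Each variable is then summarized once per region rather than once for the whole dataset.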

glance_data


library(glancedata)

glance_data(sample_data)
#> # A tibble: 14 x 11
#>    name  type  distinct_values minimum   median maximum    mean       sd
#>    <chr> <chr>           <int>   <dbl>    <dbl>   <dbl>   <dbl>    <dbl>
#>  1 State cate~              50    NA      NA    NA      NA      NA      
#>  2 Regi~ fact~               4    NA      NA    NA      NA      NA      
#>  3 Popu~ nume~              50   365    2838.    2.12e4  4.25e3  4.46e+3
#>  4 Inco~ nume~              50  3098    4519     6.32e3  4.44e3  6.14e+2
#>  5 Illi~ nume~              20     0.5     0.95  2.80e0  1.17e0  6.10e-1
#>  6 Life~ nume~              47    68.0    70.7   7.36e1  7.09e1  1.34e+0
#>  7 Murd~ nume~              44     1.4     6.85  1.51e1  7.38e0  3.69e+0
#>  8 HS G~ nume~              47    37.8    53.2   6.73e1  5.31e1  8.08e+0
#>  9 Frost nume~              43     0     114.    1.88e2  1.04e2  5.20e+1
#> 10 Area  nume~              50  1049   54277     5.66e5  7.07e4  8.53e+4
#> 11 Murd~ nume~              43     0.8     7.25  1.74e1  7.79e0  4.36e+0
#> 12 Assa~ nume~              45    45     159     3.37e2  1.71e2  8.33e+1
#> 13 Urba~ nume~              36    32      66     9.10e1  6.55e1  1.45e+1
#> 14 Rape  nume~              48     7.3    20.1   4.60e1  2.12e1  9.37e+0
#> # ... with 3 more variables: na_proportion <dbl>, count <chr>,
#> #   sample_values <chr>
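To make the distinct_values and na_proportion columns more concrete, here is a rough base R sketch that computes similar quantities. It is not glancedata's implementation: summarise_columns and demo are hypothetical names, whether glance_data counts NA as a distinct value is not verified here, and missing values are injected into demo only so that na_proportion is non-zero for one column.

```r
# Hypothetical helper (not part of glancedata): one row per column
summarise_columns <- function(df) {
  data.frame(
    name = names(df),
    type = vapply(df, function(x) class(x)[1], character(1)),
    # Assumption: NA is not counted as a distinct value
    distinct_values = vapply(df, function(x) length(unique(x[!is.na(x)])),
                             integer(1)),
    na_proportion = vapply(df, function(x) mean(is.na(x)), numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Illustrative data built from datasets that ship with R
demo <- data.frame(State = state.name,
                   Region = state.region,
                   USArrests,
                   stringsAsFactors = FALSE)
demo$Assault[1:5] <- NA  # inject missing values in 5 of the 50 rows

col_summary <- summarise_columns(demo)
col_summary
```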

glance_data_in_workbook

library(glancedata)

glance_data_in_workbook(sample_data)
#> $all
#> # A tibble: 14 x 11
#>    name  type  distinct_values minimum   median maximum    mean       sd
#>    <chr> <chr>           <int>   <dbl>    <dbl>   <dbl>   <dbl>    <dbl>
#>  1 State cate~              50    NA      NA    NA      NA      NA      
#>  2 Regi~ fact~               4    NA      NA    NA      NA      NA      
#>  3 Popu~ nume~              50   365    2838.    2.12e4  4.25e3  4.46e+3
#>  4 Inco~ nume~              50  3098    4519     6.32e3  4.44e3  6.14e+2
#>  5 Illi~ nume~              20     0.5     0.95  2.80e0  1.17e0  6.10e-1
#>  6 Life~ nume~              47    68.0    70.7   7.36e1  7.09e1  1.34e+0
#>  7 Murd~ nume~              44     1.4     6.85  1.51e1  7.38e0  3.69e+0
#>  8 HS G~ nume~              47    37.8    53.2   6.73e1  5.31e1  8.08e+0
#>  9 Frost nume~              43     0     114.    1.88e2  1.04e2  5.20e+1
#> 10 Area  nume~              50  1049   54277     5.66e5  7.07e4  8.53e+4
#> 11 Murd~ nume~              43     0.8     7.25  1.74e1  7.79e0  4.36e+0
#> 12 Assa~ nume~              45    45     159     3.37e2  1.71e2  8.33e+1
#> 13 Urba~ nume~              36    32      66     9.10e1  6.55e1  1.45e+1
#> 14 Rape  nume~              48     7.3    20.1   4.60e1  2.12e1  9.37e+0
#> # ... with 3 more variables: na_proportion <dbl>, count <chr>,
#> #   sample_values <chr>
#> 
#> $summary
#> # A tibble: 2 x 2
#>   cat             n
#>   <chr>       <int>
#> 1 categorical     2
#> 2 numerical      12
#> 
#> $all_nas
#> # A tibble: 0 x 6
#> # ... with 6 variables: name <chr>, type <chr>, distinct_values <int>,
#> #   na_proportion <dbl>, count <chr>, sample_values <chr>
#> 
#> $single_value
#> # A tibble: 0 x 11
#> # ... with 11 variables: name <chr>, type <chr>, distinct_values <int>,
#> #   minimum <dbl>, median <dbl>, maximum <dbl>, mean <dbl>, sd <dbl>,
#> #   na_proportion <dbl>, count <chr>, sample_values <chr>
#> 
#> $binary
#> # A tibble: 0 x 11
#> # ... with 11 variables: name <chr>, type <chr>, distinct_values <int>,
#> #   minimum <dbl>, median <dbl>, maximum <dbl>, mean <dbl>, sd <dbl>,
#> #   na_proportion <dbl>, count <chr>, sample_values <chr>
#> 
#> $numerical
#> # A tibble: 12 x 10
#>    name  type  distinct_values minimum  median maximum   mean      sd
#>    <chr> <chr>           <int>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
#>  1 Popu~ nume~              50   365   2.84e+3  2.12e4 4.25e3 4.46e+3
#>  2 Inco~ nume~              50  3098   4.52e+3  6.32e3 4.44e3 6.14e+2
#>  3 Illi~ nume~              20     0.5 9.50e-1  2.80e0 1.17e0 6.10e-1
#>  4 Life~ nume~              47    68.0 7.07e+1  7.36e1 7.09e1 1.34e+0
#>  5 Murd~ nume~              44     1.4 6.85e+0  1.51e1 7.38e0 3.69e+0
#>  6 HS G~ nume~              47    37.8 5.32e+1  6.73e1 5.31e1 8.08e+0
#>  7 Frost nume~              43     0   1.14e+2  1.88e2 1.04e2 5.20e+1
#>  8 Area  nume~              50  1049   5.43e+4  5.66e5 7.07e4 8.53e+4
#>  9 Murd~ nume~              43     0.8 7.25e+0  1.74e1 7.79e0 4.36e+0
#> 10 Assa~ nume~              45    45   1.59e+2  3.37e2 1.71e2 8.33e+1
#> 11 Urba~ nume~              36    32   6.60e+1  9.10e1 6.55e1 1.45e+1
#> 12 Rape  nume~              48     7.3 2.01e+1  4.60e1 2.12e1 9.37e+0
#> # ... with 2 more variables: na_proportion <dbl>, sample_values <chr>
#> 
#> $categorical
#> # A tibble: 2 x 5
#>   name   distinct_values na_proportion count          sample_values        
#>   <chr>            <int>         <dbl> <chr>          <chr>                
#> 1 State               50             0 Too many uniq~ Alabama, Alaska, Ari~
#> 2 Region               4             0 North Central~ South, West, Northea~
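The $all_nas, $single_value and $binary elements are empty for sample_data because none of its columns is entirely missing, constant, or two-valued. The sketch below shows, in base R, how such a partition could be computed on a dataset that does have such columns. The rules are assumptions inferred from the element names, not glancedata's code, and partition_columns, demo and the added columns are hypothetical.

```r
# Hypothetical helper; the grouping rules are assumptions based on the
# element names in the output above
partition_columns <- function(df) {
  n_distinct <- vapply(df, function(x) length(unique(x[!is.na(x)])),
                       integer(1))
  na_prop <- vapply(df, function(x) mean(is.na(x)), numeric(1))
  is_num <- vapply(df, is.numeric, logical(1))
  list(
    all_nas      = names(df)[na_prop == 1],     # entirely missing
    single_value = names(df)[n_distinct == 1],  # constant columns
    binary       = names(df)[n_distinct == 2],  # two distinct values
    numerical    = names(df)[is_num],
    categorical  = names(df)[!is_num]
  )
}

# Illustrative data with one column of each special kind
demo <- data.frame(
  State = state.name,
  Region = state.region,
  USArrests,
  HighUrban = USArrests$UrbanPop > 65,  # two distinct values
  Country = "USA",                      # one distinct value
  Notes = NA,                           # entirely missing
  stringsAsFactors = FALSE
)

groups <- partition_columns(demo)
groups
```

Listing these groups separately makes it easy to spot columns that carry no information before any modelling starts.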

Future versions