6 R-package

Creating an R package organizes your code, data, and documentation into a cohesive structure, which improves reproducibility, sharing, and collaboration. A package is also easier to maintain and extend, so the work stays useful for future projects and for the broader data science community. In this chapter, I review the previous chapters for repetitive code, turn that code into functions, and combine those functions into my first R package: HanaahRTidy.


6.1 The Demo

To start creating the R package, I followed the “The Whole Game” demo by Hadley Wickham. The demo is a comprehensive guide to setting up a package, connecting it to Git, and writing the first function, and it introduces useful packages such as roxygen2 and testthat. For future reference, I wrote a brief summary of the packages and functions I learned about in the demo.
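The steps covered in the demo can be outlined as a short console session. This is a sketch under assumptions, not a record of the commands actually run for this package: the path and the file name split_datetime are placeholders, and the usethis and devtools calls are meant to be run interactively, one at a time.

```r
# Interactive outline of the package-building workflow (run line by line).
# The path below is a placeholder, not the location actually used.
library(devtools)  # also attaches usethis

create_package("~/HanaahRTidy")  # set up the package skeleton
use_git()                        # put the package under Git version control
use_r("split_datetime")          # create R/split_datetime.R for a first function
load_all()                       # load the package code for interactive testing
document()                       # build help files from roxygen2 comments
use_testthat()                   # set up the testthat infrastructure
use_test("split_datetime")       # create a test file for the function
test()                           # run the unit tests
check()                          # run R CMD check on the whole package
```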

6.2 RMarkdown-driven development

The “Start with RMarkdown” approach involves beginning with a pre-written RMarkdown file and its scripts, rather than starting from scratch.

Creating a complete, all-in-one package from an RMarkdown-based analysis has many advantages for reproducibility. Sharing an analysis typically means sharing an RMarkdown file, functions, tidy data, the steps used to tidy the data, the analysis itself, and the packages used. Combining these elements into an R package streamlines that process and makes the work easier to share and reuse for other purposes and datasets.
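As a small illustration of the idea, the sketch below shows a calculation that is copied once per column in an analysis script, followed by the same logic extracted into a function that could live in a package. The function name percent_missing() is hypothetical and is not part of HanaahRTidy; the built-in airquality dataset is used only because it ships with R.

```r
# Before: the same calculation is copied for each column in the RMarkdown file
ozone_missing <- sum(is.na(airquality$Ozone)) / nrow(airquality) * 100
solar_missing <- sum(is.na(airquality$Solar.R)) / nrow(airquality) * 100

# After: the logic lives in one function (hypothetical name) that can be
# moved into the R/ folder of a package and reused on any vector
percent_missing <- function(x) {
  sum(is.na(x)) / length(x) * 100
}

ozone_missing_fn <- percent_missing(airquality$Ozone)
solar_missing_fn <- percent_missing(airquality$Solar.R)
```

Once the function exists, the repeated lines in the RMarkdown file can be replaced by calls to it, which is the step that turns a script into package material.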

6.3 Package “HanaahRTidy”

The package can be installed from GitHub: devtools::install_github("Khadijadata/HanaahRTidy").

6.3.1 Choosing a name

I searched for a name that was both unique and held personal meaning. After careful thought, I decided to name it after my mother. Although the subject matter is unfamiliar to her, she consistently offers invaluable advice and feedback on my portfolio when I need it most.

6.3.2 Overview package

  1. split_datetime()

Description: Splits a datetime column into separate columns for date and time.
Usage: split_datetime(datetime_column)
Returns: A data frame with separate columns for date and time.

  2. convert_to_factor()

Description: Converts a column of textual values into a factor variable, useful for categorical data analysis.
Usage: convert_to_factor(data, column)
Returns: The input data frame with the specified column converted to a factor.

  3. import_data()

Description: Imports data from a specified file format (e.g., CSV, Excel) into a tibble for further analysis.
Usage: import_data(file_path)
Returns: A tibble containing the imported data.

  4. identify_missing_data()

Description: Summarizes missing data in each column of a data frame, providing counts and percentages of missing values.
Usage: identify_missing_data(data)
Returns: A data frame summarizing missing data statistics.
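To make the descriptions above more concrete, here is a minimal base-R sketch of how functions with these signatures could be written. These are not the actual HanaahRTidy implementations, which may differ. The _sketch suffix is added so the names do not mask the real functions, the import sketch handles CSV files only, and it returns a plain data frame rather than a tibble.

```r
# Hypothetical re-implementations in base R, for illustration only.

# Split a date-time vector into a data frame with Date and Time columns.
split_datetime_sketch <- function(datetime_column) {
  dt <- as.POSIXct(datetime_column, tz = "UTC")
  data.frame(
    Date = as.Date(dt, tz = "UTC"),
    Time = format(dt, "%H:%M:%S", tz = "UTC"),
    stringsAsFactors = FALSE
  )
}

# Convert one named column of a data frame to a factor.
convert_to_factor_sketch <- function(data, column) {
  stopifnot(is.data.frame(data), column %in% names(data))
  data[[column]] <- as.factor(data[[column]])
  data
}

# Read a CSV file into a data frame (other formats are not handled here).
import_data_sketch <- function(file_path) {
  ext <- tolower(tools::file_ext(file_path))
  if (ext != "csv") {
    stop("Unsupported file type in this sketch: ", ext)
  }
  read.csv(file_path, stringsAsFactors = FALSE)
}

# Count NA values per column and express them as a percentage of rows.
identify_missing_data_sketch <- function(data) {
  missing_count <- colSums(is.na(data))
  data.frame(
    Column = names(data),
    MissingCount = missing_count,
    MissingPercentage = missing_count / nrow(data) * 100
  )
}

# Example call on a single date-time string
split_datetime_sketch("2004-03-10 18:00:00")
```

Each sketch takes its input as an argument and returns a new object instead of modifying anything in place, which keeps the functions easy to test with testthat.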

6.4 Functionality

To test the functions, I apply them to the UCI Air Quality dataset.

The dataset contains hourly averaged readings from 5 chemical sensors in an air quality monitoring device placed in a polluted area in Italy. It covers one year, from March 2004 to February 2005, and includes concentrations of pollutants such as CO, hydrocarbons, benzene, NOx, and NO2. The data has some quality issues: missing values are tagged with -200. Because identify_missing_data() only detects missing values coded as NA, I demonstrate that function with the built-in “airquality” dataset instead.
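One way to make a function that looks for NA usable on data with a sentinel value is to recode the sentinel first. The sketch below does this in base R on a small toy data frame. The toy values and the object name readings are illustrative only and are not taken from the analysis in this chapter.

```r
# Toy data frame mimicking two sensor columns, with -200 marking a missing reading
readings <- data.frame(
  NOx = c(166L, 103L, -200L),
  NO2 = c(113L, -200L, -200L)
)

# Replace the -200 sentinel with NA in every numeric column
readings[] <- lapply(readings, function(col) {
  if (is.numeric(col)) {
    col[which(col == -200)] <- NA
  }
  col
})

# Count the recoded missing values per column
colSums(is.na(readings))
```

After this recoding step, a data frame can be passed to any function that relies on is.na() to find missing values.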

library(HanaahRTidy)
library(tidyverse)

# Importing the Air Quality data with import_data()
airway_data <- import_data("/Users/kzzba/Downloads/bookdown_khadija/AirQualityUCI.csv")
head(airway_data)
##        Date     Time  CO PT08S1 NMHC C6H6 PT08S2 NOx PT08S3 NO2 PT08S4 PT08S5
## 1 3/10/2004 18:00:00 2.6   1360  150 11.9   1046 166   1056 113   1692   1268
## 2 3/10/2004 19:00:00 2.0   1292  112  9.4    955 103   1174  92   1559    972
## 3 3/10/2004 20:00:00 2.2   1402   88  9.0    939 131   1140 114   1555   1074
## 4 3/10/2004 21:00:00 2.2   1376   80  9.2    948 172   1092 122   1584   1203
## 5 3/10/2004 22:00:00 1.6   1272   51  6.5    836 131   1205 116   1490   1110
## 6 3/10/2004 23:00:00 1.2   1197   38  4.7    750  89   1337  96   1393    949
##      T   RH     AH
## 1 13.6 48.9 0.7578
## 2 13.3 47.7 0.7255
## 3 11.9 54.0 0.7502
## 4 11.0 60.0 0.7867
## 5 11.2 59.6 0.7888
## 6 11.2 59.2 0.7848
# Combining Date and Time into one column to show the split_datetime() functionality.
# The dates are month/day/year (the data starts in March 2004), so mdy_hms() is used.
airway_data2 <- airway_data %>%
  mutate(DateTime = mdy_hms(paste(Date, Time))) %>%
  select(-Date, -Time) # Removing the original Date and Time columns

# Column "DateTime" made tidy by splitting it into date and time with split_datetime()
split_time_column <- split_datetime(airway_data2$DateTime)
head(split_time_column)
##         Date     Time
## 1 2004-03-10 18:00:00
## 2 2004-03-10 19:00:00
## 3 2004-03-10 20:00:00
## 4 2004-03-10 21:00:00
## 5 2004-03-10 22:00:00
## 6 2004-03-10 23:00:00
# Checking the datatypes 
str(airway_data)
## 'data.frame':    9357 obs. of  15 variables:
##  $ Date  : chr  "3/10/2004" "3/10/2004" "3/10/2004" "3/10/2004" ...
##  $ Time  : chr  "18:00:00" "19:00:00" "20:00:00" "21:00:00" ...
##  $ CO    : num  2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
##  $ PT08S1: int  1360 1292 1402 1376 1272 1197 1185 1136 1094 1010 ...
##  $ NMHC  : int  150 112 88 80 51 38 31 31 24 19 ...
##  $ C6H6  : num  11.9 9.4 9 9.2 6.5 4.7 3.6 3.3 2.3 1.7 ...
##  $ PT08S2: int  1046 955 939 948 836 750 690 672 609 561 ...
##  $ NOx   : int  166 103 131 172 131 89 62 62 45 -200 ...
##  $ PT08S3: int  1056 1174 1140 1092 1205 1337 1462 1453 1579 1705 ...
##  $ NO2   : int  113 92 114 122 116 96 77 76 60 -200 ...
##  $ PT08S4: int  1692 1559 1555 1584 1490 1393 1333 1333 1276 1235 ...
##  $ PT08S5: int  1268 972 1074 1203 1110 949 733 730 620 501 ...
##  $ T     : num  13.6 13.3 11.9 11 11.2 11.2 11.3 10.7 10.7 10.3 ...
##  $ RH    : num  48.9 47.7 54 60 59.6 59.2 56.8 60 59.7 60.2 ...
##  $ AH    : num  0.758 0.726 0.75 0.787 0.789 ...
# Convert Date and Time to factors with convert_to_factor()
airway_data <- convert_to_factor(airway_data, "Date")
airway_data <- convert_to_factor(airway_data, "Time")

# Checking the datatypes after using convert_to_factor()
str(airway_data)
## 'data.frame':    9357 obs. of  15 variables:
##  $ Date  : Factor w/ 391 levels "1/1/2005","1/10/2005",..: 153 153 153 153 153 153 155 155 155 155 ...
##  $ Time  : Factor w/ 24 levels "0:00:00","1:00:00",..: 11 12 14 15 16 17 1 2 13 18 ...
##  $ CO    : num  2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
##  $ PT08S1: int  1360 1292 1402 1376 1272 1197 1185 1136 1094 1010 ...
##  $ NMHC  : int  150 112 88 80 51 38 31 31 24 19 ...
##  $ C6H6  : num  11.9 9.4 9 9.2 6.5 4.7 3.6 3.3 2.3 1.7 ...
##  $ PT08S2: int  1046 955 939 948 836 750 690 672 609 561 ...
##  $ NOx   : int  166 103 131 172 131 89 62 62 45 -200 ...
##  $ PT08S3: int  1056 1174 1140 1092 1205 1337 1462 1453 1579 1705 ...
##  $ NO2   : int  113 92 114 122 116 96 77 76 60 -200 ...
##  $ PT08S4: int  1692 1559 1555 1584 1490 1393 1333 1333 1276 1235 ...
##  $ PT08S5: int  1268 972 1074 1203 1110 949 733 730 620 501 ...
##  $ T     : num  13.6 13.3 11.9 11 11.2 11.2 11.3 10.7 10.7 10.3 ...
##  $ RH    : num  48.9 47.7 54 60 59.6 59.2 56.8 60 59.7 60.2 ...
##  $ AH    : num  0.758 0.726 0.75 0.787 0.789 ...
# Checking for missing values with identify_missing_data() in the airquality dataset
identify_missing_data(airquality)
##          Column MissingCount MissingPercentage
## Ozone     Ozone           37         24.183007
## Solar.R Solar.R            7          4.575163
## Wind       Wind            0          0.000000
## Temp       Temp            0          0.000000
## Month     Month            0          0.000000
## Day         Day            0          0.000000