6 R-package
Creating an R package organizes your code, data, and documentation into a cohesive structure, which makes your work easier to reproduce, share, and collaborate on. A package is also easier to maintain and extend, so it stays useful for future projects and for the broader data science community. By packaging your work, you streamline your own workflow and contribute useful tools to others. In this chapter, I review the previous chapters for repetitive code, turn that code into functions, and combine those functions into my first R package: HanaahRTidy.
6.1 The Demo
To start creating the R package, I followed “The Whole Game” demo by Hadley Wickham. The demo gives a comprehensive guide to setting up a package, connecting it to Git, and writing a first function, and it covers useful packages such as roxygen2 and testthat. For future reference, I wrote a brief summary of the packages and functions I learned about in the demo.
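As a reminder of what that workflow looks like in practice, here is a minimal sketch. The console steps are shown as comments because they create files on disk, and the package path is a placeholder. The function count_missing() is a hypothetical example of my own, not a function from the demo or from HanaahRTidy; it only illustrates how roxygen2 comments and a testthat test fit around a function.

```r
# Typical console steps when setting up a package (comments only, since they
# write files and folders; "path/to/mypackage" is a placeholder):
#   usethis::create_package("path/to/mypackage")
#   usethis::use_git()
#   usethis::use_r("count_missing")      # creates R/count_missing.R
#   devtools::load_all()                 # try the function interactively
#   devtools::document()                 # builds help files from roxygen2 comments
#   usethis::use_testthat()
#   usethis::use_test("count_missing")   # creates tests/testthat/test-count_missing.R
#   devtools::test()
#   devtools::check()

# Hypothetical example function with roxygen2 documentation comments.

#' Count missing values in a vector
#'
#' @param x A vector.
#' @return A single integer: the number of `NA` values in `x`.
#' @export
count_missing <- function(x) {
  sum(is.na(x))
}

# A unit test as it could appear in tests/testthat/test-count_missing.R.
# The guard lets this sketch run even where testthat is not installed.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("count_missing() counts NA values", {
    testthat::expect_equal(count_missing(c(1, NA, 3, NA)), 2L)
    testthat::expect_equal(count_missing(character(0)), 0L)
  })
}
```

Inside a real package the test file would not need the guard, because testthat is declared as a suggested dependency by usethis::use_testthat().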
6.2 R-markdown driven development
The “Start with RMarkdown” approach involves beginning with a pre-written RMarkdown file and its scripts, rather than starting from scratch.
Creating a complete, all-in-one package from an RMarkdown-based analysis has many advantages for reproducibility. Typically, sharing an analysis means sharing an RMarkdown file, the functions, the tidy data, the steps taken to tidy the data, the analysis itself, and the packages used. Combining these elements into an R package streamlines the process and makes the work easier to share and to reuse for other purposes and datasets.
6.3 Package “HanaahRTidy”
The package can be installed from GitHub: devtools::install_github("Khadijadata/HanaahRTidy").
6.3.1 Choosing a name
I searched for a name that was both unique and held personal meaning. After careful thought, I decided to name it after my mother. Despite her struggles to comprehend it, she consistently offers invaluable advice and feedback on my portfolio when I need it most.
6.3.2 Overview package
- split_datetime()
  Description: Splits a datetime column into separate columns for date and time.
  Usage: split_datetime(datetime_column)
  Returns: A data frame with separate columns for date and time.
- convert_to_factor()
  Description: Converts a column of textual values into a factor variable, useful for categorical data analysis.
  Usage: convert_to_factor(data, column)
  Returns: The input data frame with the specified column converted to a factor.
- import_data()
  Description: Imports data from a specified file format (e.g., CSV, Excel) into a tibble for further analysis.
  Usage: import_data(file_path)
  Returns: A tibble containing the imported data.
- identify_missing_data()
  Description: Summarizes missing data in each column of a data frame, providing counts and percentages of missing values.
  Usage: identify_missing_data(data)
  Returns: A data frame summarizing missing data statistics.
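To make the descriptions above more concrete, the following is a minimal base R sketch of how three of these functions could be written. These are not the actual HanaahRTidy sources: the `_sketch` suffix marks them as illustrative stand-ins, and the assumption that the datetime input is a POSIXct vector is mine.

```r
# Illustrative stand-ins only; the real package functions may differ.

# Split a POSIXct vector into a Date column and a character time column.
split_datetime_sketch <- function(datetime_column) {
  stopifnot(inherits(datetime_column, "POSIXct"))
  data.frame(
    # Formatting first keeps the date in the vector's own time zone.
    Date = as.Date(format(datetime_column, "%Y-%m-%d")),
    Time = format(datetime_column, "%H:%M:%S")
  )
}

# Convert one named column of a data frame to a factor.
convert_to_factor_sketch <- function(data, column) {
  stopifnot(is.data.frame(data), column %in% names(data))
  data[[column]] <- as.factor(data[[column]])
  data
}

# Count NA values per column and express them as a percentage of all rows.
identify_missing_data_sketch <- function(data) {
  stopifnot(is.data.frame(data))
  missing_count <- colSums(is.na(data))
  data.frame(
    Column = names(data),
    MissingCount = unname(missing_count),
    MissingPercentage = unname(missing_count) / nrow(data) * 100
  )
}

# Usage on small inputs
dt <- as.POSIXct(c("2004-03-10 18:00:00", "2004-03-10 19:00:00"), tz = "UTC")
split_datetime_sketch(dt)
convert_to_factor_sketch(data.frame(group = c("a", "b", "a")), "group")
identify_missing_data_sketch(airquality)
```

Each sketch returns a new object instead of modifying its input, which keeps the functions easy to chain and to test.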
6.4 Functionality
To demonstrate the functions, I apply them to the “Air Quality” dataset (AirQualityUCI.csv).
The dataset contains hourly average readings from 5 chemical sensors in an air quality monitoring device placed in a polluted area in Italy. It covers one year, from March 2004 to February 2005. The data include concentrations of various pollutants such as CO, hydrocarbons, benzene, NOx, and NO2. There are some data quality issues: missing values are tagged with -200. Because identify_missing_data() only detects missing values coded as NA, I show that function with the built-in “airquality” dataset instead.
library(HanaahRTidy)
library(tidyverse)
# Import the Air Quality data with import_data()
airway_data <- import_data("/Users/kzzba/Downloads/bookdown_khadija/AirQualityUCI.csv")
head(airway_data)
## Date Time CO PT08S1 NMHC C6H6 PT08S2 NOx PT08S3 NO2 PT08S4 PT08S5
## 1 3/10/2004 18:00:00 2.6 1360 150 11.9 1046 166 1056 113 1692 1268
## 2 3/10/2004 19:00:00 2.0 1292 112 9.4 955 103 1174 92 1559 972
## 3 3/10/2004 20:00:00 2.2 1402 88 9.0 939 131 1140 114 1555 1074
## 4 3/10/2004 21:00:00 2.2 1376 80 9.2 948 172 1092 122 1584 1203
## 5 3/10/2004 22:00:00 1.6 1272 51 6.5 836 131 1205 116 1490 1110
## 6 3/10/2004 23:00:00 1.2 1197 38 4.7 750 89 1337 96 1393 949
## T RH AH
## 1 13.6 48.9 0.7578
## 2 13.3 47.7 0.7255
## 3 11.9 54.0 0.7502
## 4 11.0 60.0 0.7867
## 5 11.2 59.6 0.7888
## 6 11.2 59.2 0.7848
# Combine Date and Time into one column to show the split_datetime() functionality.
# The dates in the file are month/day/year (e.g. 3/10/2004 is 10 March 2004), so mdy_hms() is used.
airway_data2 <- airway_data %>%
  mutate(DateTime = mdy_hms(paste(Date, Time))) %>%
  select(-Date, -Time) # Remove the Date and Time columns that were originally in the dataset
# Column "DateTime" made tidy by splitting it into date and time with split_datetime()
split_time_column <- split_datetime(airway_data2$DateTime)
head(split_time_column)
## Date Time
## 1 2004-03-10 18:00:00
## 2 2004-03-10 19:00:00
## 3 2004-03-10 20:00:00
## 4 2004-03-10 21:00:00
## 5 2004-03-10 22:00:00
## 6 2004-03-10 23:00:00
# Checking the datatypes before using convert_to_factor()
str(airway_data)
## 'data.frame': 9357 obs. of 15 variables:
## $ Date : chr "3/10/2004" "3/10/2004" "3/10/2004" "3/10/2004" ...
## $ Time : chr "18:00:00" "19:00:00" "20:00:00" "21:00:00" ...
## $ CO : num 2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
## $ PT08S1: int 1360 1292 1402 1376 1272 1197 1185 1136 1094 1010 ...
## $ NMHC : int 150 112 88 80 51 38 31 31 24 19 ...
## $ C6H6 : num 11.9 9.4 9 9.2 6.5 4.7 3.6 3.3 2.3 1.7 ...
## $ PT08S2: int 1046 955 939 948 836 750 690 672 609 561 ...
## $ NOx : int 166 103 131 172 131 89 62 62 45 -200 ...
## $ PT08S3: int 1056 1174 1140 1092 1205 1337 1462 1453 1579 1705 ...
## $ NO2 : int 113 92 114 122 116 96 77 76 60 -200 ...
## $ PT08S4: int 1692 1559 1555 1584 1490 1393 1333 1333 1276 1235 ...
## $ PT08S5: int 1268 972 1074 1203 1110 949 733 730 620 501 ...
## $ T : num 13.6 13.3 11.9 11 11.2 11.2 11.3 10.7 10.7 10.3 ...
## $ RH : num 48.9 47.7 54 60 59.6 59.2 56.8 60 59.7 60.2 ...
## $ AH : num 0.758 0.726 0.75 0.787 0.789 ...
# Convert Date and Time to factors with convert_to_factor()
airway_data <- convert_to_factor(airway_data, "Date")
airway_data <- convert_to_factor(airway_data, "Time")
# Checking the datatypes after using convert_to_factor()
str(airway_data)
## 'data.frame': 9357 obs. of 15 variables:
## $ Date : Factor w/ 391 levels "1/1/2005","1/10/2005",..: 153 153 153 153 153 153 155 155 155 155 ...
## $ Time : Factor w/ 24 levels "0:00:00","1:00:00",..: 11 12 14 15 16 17 1 2 13 18 ...
## $ CO : num 2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
## $ PT08S1: int 1360 1292 1402 1376 1272 1197 1185 1136 1094 1010 ...
## $ NMHC : int 150 112 88 80 51 38 31 31 24 19 ...
## $ C6H6 : num 11.9 9.4 9 9.2 6.5 4.7 3.6 3.3 2.3 1.7 ...
## $ PT08S2: int 1046 955 939 948 836 750 690 672 609 561 ...
## $ NOx : int 166 103 131 172 131 89 62 62 45 -200 ...
## $ PT08S3: int 1056 1174 1140 1092 1205 1337 1462 1453 1579 1705 ...
## $ NO2 : int 113 92 114 122 116 96 77 76 60 -200 ...
## $ PT08S4: int 1692 1559 1555 1584 1490 1393 1333 1333 1276 1235 ...
## $ PT08S5: int 1268 972 1074 1203 1110 949 733 730 620 501 ...
## $ T : num 13.6 13.3 11.9 11 11.2 11.2 11.3 10.7 10.7 10.3 ...
## $ RH : num 48.9 47.7 54 60 59.6 59.2 56.8 60 59.7 60.2 ...
## $ AH : num 0.758 0.726 0.75 0.787 0.789 ...
# Checking for missing values in the built-in airquality dataset with identify_missing_data()
identify_missing_data(airquality)
## Column MissingCount MissingPercentage
## Ozone Ozone 37 24.183007
## Solar.R Solar.R 7 4.575163
## Wind Wind 0 0.000000
## Temp Temp 0 0.000000
## Month Month 0 0.000000
## Day Day 0 0.000000
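The -200 markers in the Air Quality data could be recoded to NA first, so that a function which only counts NA values would also pick them up. The following is a minimal base R sketch of that idea. The helper recode_sentinel_sketch() is hypothetical and not part of HanaahRTidy, and the small toy data frame is made up for illustration (it only borrows a few column names from the dataset).

```r
# Hypothetical helper: replace a sentinel value with NA in all numeric columns.
recode_sentinel_sketch <- function(data, sentinel = -200) {
  stopifnot(is.data.frame(data))
  numeric_cols <- vapply(data, is.numeric, logical(1))
  data[numeric_cols] <- lapply(data[numeric_cols], function(x) {
    # Existing NA values are left alone; only exact sentinel matches change.
    x[!is.na(x) & x == sentinel] <- NA
    x
  })
  data
}

# Toy data for illustration only, not rows from the real file
toy <- data.frame(
  NOx = c(166L, 103L, -200L),
  NO2 = c(113L, -200L, -200L),
  T   = c(13.6, 13.3, 11.9)
)

cleaned <- recode_sentinel_sketch(toy)
colSums(is.na(cleaned)) # number of recoded values per column
```

After this step, the cleaned data frame could be passed to identify_missing_data() in the same way as airquality above.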