Example Report Template for a Data Analysis Project

Author

Marco Reina and Alexandra Tejada-Strop

1 Description of data and data source

BMI was calculated as weight (kg) / (height (m))^2. BMI categories were assigned from the BMI score as follows: below 18.5 was labeled underweight, 18.5 to 24.9 healthy, 25 to 29.9 overweight, and 30 or above obese. Observations that did not meet these criteria because of data errors were labeled NA.
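The formula and category cutoffs above can be sketched in R as follows. This is a minimal illustration, not the code used to build the dataset: the function names `bmi_score` and `bmi_category` and the `max_plausible` cutoff are assumptions introduced here. The sketch uses left-closed intervals, so a score between 24.9 and 25 counts as healthy; that is also an assumption.

```r
# Sketch only: function names and the plausibility cutoff are assumptions,
# not taken from the original analysis.

# BMI = weight (kg) / (height (m))^2; height is supplied in centimeters
bmi_score <- function(weight_kg, height_cm) {
  weight_kg / (height_cm / 100)^2
}

# Assign categories; missing scores and scores above `max_plausible`
# (an assumed cutoff for data errors) become NA
bmi_category <- function(bmi, max_plausible = 100) {
  bmi[!is.na(bmi) & bmi > max_plausible] <- NA
  cut(
    bmi,
    breaks = c(-Inf, 18.5, 25, 30, Inf),
    labels = c("underweight", "healthy", "overweight", "obese"),
    right = FALSE  # left-closed intervals: [18.5, 25), [25, 30), [30, Inf)
  )
}

# Weights and heights taken from the first rows of the raw data shown later
bmi <- bmi_score(weight_kg = c(80, 70, 55), height_cm = c(180, 175, 6))
bmi_category(bmi)
```

The third observation has a recorded height of 6, which yields a BMI above 15000, so the sketch labels it NA rather than obese.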

2 Methods

A reproducible data analysis repository was created using a GitHub template provided by the professor. The raw dataset was modified by adding one new numerical variable and one new categorical variable. Collaborative analysis was conducted by granting a classmate direct access to the repository. The collaborator updated the data processing pipeline to use the updated dataset, which included the new variables. The collaborator performed necessary data cleaning and saved the new processed data. Exploratory data analysis was extended to include a boxplot by the new categorical variable (BMI_cat) and a scatterplot relating the new numerical variable (BMI score) to weight. Resulting figures were saved, committed and pushed to the shared repository.

3 Code analyzing new variables

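# Load ggplot2, which provides ggplot(), geom_boxplot(), geom_point(), and ggsave()
library(ggplot2)

# Load cleaned data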
data_location <- here::here("data","processed-data","processeddata2.rds")
mydata <- readRDS(data_location)

# Quick check
print(names(mydata))
[1] "Height"  "Weight"  "Gender"  "BMI"     "BMI_cat"
# Boxplot: Height by BMI category
p_box <- ggplot(mydata, aes(x = BMI_cat, y = Height)) +
  geom_boxplot() +
  ylim(100, NA)

ggsave(
  filename = here::here("results", "figures", "boxplot_height_by_BMIcat.png"),
  plot = p_box,
  width = 7,
  height = 5
)

# Scatterplot: Weight vs BMI
p_scatter <- ggplot(mydata, aes(x = Weight, y = BMI)) +
  geom_point() +
  ylim(15, NA)

ggsave(
  filename = here::here("results", "figures", "scatter_weight_vs_BMI.png"),
  plot = p_scatter,
  width = 7,
  height = 5
)

4 Cleaning

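Raw data were screened for wrong, missing, and incorrectly formatted values. The non-numeric height entry ("sixty") was removed and Height was converted to numeric. A height recorded as 6, apparently in feet, was converted to centimeters. The observation with an impossible weight (7000 kg) and all observations with missing values were excluded, as were observations whose Gender or BMI_cat was coded "NA" or "N". BMI was then converted to numeric and rounded to one decimal. The resulting dataset contained only valid, consistently formatted observations suitable for downstream analysis.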

Code:

# dplyr provides the %>% pipe used in the cleaning steps below
library(dplyr)

data_location <- here::here("data","raw-data","exampledata2.xlsx")
rawdata <- readxl::read_excel(data_location)

codebook <- readxl::read_excel(data_location, sheet ="Codebook")
print(codebook) 
# A tibble: 5 × 3
  `Variable Name` `Variable Definition`                 `Allowed Values`        
  <chr>           <chr>                                 <chr>                   
1 Height          height in centimeters                 numeric value >0 or NA  
2 Weight          weight in kilograms                   numeric value >0 or NA  
3 Gender          identified gender (male/female/other) M/F/O/NA                
4 BMI             BMI score                             15 to 30                
5 BMI_cat         BMI category based on score           underweight, healthy, o…
dplyr::glimpse(rawdata)
Rows: 14
Columns: 5
$ Height  <chr> "180", "175", "sixty", "178", "192", "6", "156", "166", "155",…
$ Weight  <dbl> 80, 70, 60, 76, 90, 55, 90, 110, 54, 7000, NA, 45, 55, 50
$ Gender  <chr> "M", "O", "F", "F", "NA", "F", "O", "M", "N", "M", "F", "F", "…
$ BMI     <chr> "24.691358024691358", "22.857142857142858", "NA", "23.98687034…
$ BMI_cat <chr> "healthy", "healthy", "NA", "healthy", "healthy", "NA", "obese…
summary(rawdata)
    Height              Weight          Gender              BMI           
 Length:14          Min.   :  45.0   Length:14          Length:14         
 Class :character   1st Qu.:  55.0   Class :character   Class :character  
 Mode  :character   Median :  70.0   Mode  :character   Mode  :character  
                    Mean   : 602.7                                        
                    3rd Qu.:  90.0                                        
                    Max.   :7000.0                                        
                    NA's   :1                                             
   BMI_cat         
 Length:14         
 Class :character  
 Mode  :character  
                   
                   
                   
                   
head(rawdata)
# A tibble: 6 × 5
  Height Weight Gender BMI                BMI_cat
  <chr>   <dbl> <chr>  <chr>              <chr>  
1 180        80 M      24.691358024691358 healthy
2 175        70 O      22.857142857142858 healthy
3 sixty      60 F      NA                 NA     
4 178        76 F      23.98687034465345  healthy
5 192        90 NA     24.4140625         healthy
6 6          55 F      15277.777777777777 NA     
#skimr::skim(rawdata)

d1 <- rawdata %>% dplyr::filter( Height != "sixty" ) %>% 
                  dplyr::mutate(Height = as.numeric(Height))

#skimr::skim(d1)
hist(d1$Height)

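# Height is numeric at this point, so compare against the number 6;
# the value is treated as feet and converted to centimeters (30.48 cm per foot)
d2 <- d1 %>% dplyr::mutate( Height = replace(Height, Height == 6, round(6*30.48,0)) )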
#skimr::skim(d2)

d3 <- d2 %>%  dplyr::filter(Weight != 7000) %>% tidyr::drop_na()
#skimr::skim(d3)

d3$Gender <- as.factor(d3$Gender)  
#skimr::skim(d3)

d4 <- d3 %>% dplyr::filter( !(Gender %in% c("NA","N")) ) %>% droplevels()
#skimr::skim(d4)

# one more cleaning step to remove NA from BMI_cat, by MR
d5 <- d4 %>% dplyr::filter( !(BMI_cat %in% c("NA","N")) ) %>% droplevels()
#skimr::skim(d5)

# turn BMI into a numeric value and round to 1 decimal, by MR
d6 <- d5 %>% dplyr::mutate(BMI = round(as.numeric(BMI), 1))
#skimr::skim(d6)

processeddata <- d6
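
Before saving, the processed data can be checked against the codebook rules printed above (Height and Weight numeric and greater than 0, Gender one of M/F/O). The following is a minimal sketch of such a check. `check_processed()` is a hypothetical helper, and the toy data frame is made up for illustration; neither is part of the original pipeline.

```r
# Sketch only: check_processed() is a hypothetical helper, and the toy data
# frame below is made up for illustration (it is not the project data).
check_processed <- function(df) {
  stopifnot(
    is.numeric(df$Height), all(df$Height > 0),
    is.numeric(df$Weight), all(df$Weight > 0),
    is.factor(df$Gender), all(df$Gender %in% c("M", "F", "O")),
    is.numeric(df$BMI),
    !any(is.na(df))  # no missing values should remain after cleaning
  )
  invisible(TRUE)
}

toy <- data.frame(
  Height  = c(180, 175),
  Weight  = c(80, 70),
  Gender  = factor(c("M", "O")),
  BMI     = c(24.7, 22.9),
  BMI_cat = c("healthy", "healthy")
)
check_processed(toy)  # stops with an error if any rule is violated
```

In the real pipeline the same call would presumably be made on `processeddata` before it is written with `saveRDS()`; the analysis code in section 3 reads the file from `data/processed-data/processeddata2.rds`.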

5 Statistical analysis

Linear models were fitted as follows:

1. Height as outcome, weight as predictor
2. Height as outcome, weight and gender as predictors
3. Height as outcome, BMI and BMI_cat as predictors
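
Written out, the three models take roughly the following form. This is a sketch under the usual linear-model assumptions. The indicator coding reflects R's default treatment contrasts, with F and healthy as reference levels, as suggested by the term names in the result tables (GenderM, GenderO, BMI_catobese, BMI_catoverweight). The coefficient symbols are notation introduced here.

```latex
\begin{align}
\text{Height}_i &= \beta_0 + \beta_1\,\text{Weight}_i + \varepsilon_i \\
\text{Height}_i &= \beta_0 + \beta_1\,\text{Weight}_i
  + \beta_2\,\mathbf{1}[\text{Gender}_i = \text{M}]
  + \beta_3\,\mathbf{1}[\text{Gender}_i = \text{O}] + \varepsilon_i \\
\text{Height}_i &= \beta_0 + \beta_1\,\text{BMI}_i
  + \beta_2\,\mathbf{1}[\text{BMI\_cat}_i = \text{obese}]
  + \beta_3\,\mathbf{1}[\text{BMI\_cat}_i = \text{overweight}] + \varepsilon_i
\end{align}
```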

6 Results

Code:

############################
#### First model fit
# fit linear model using height as outcome, weight as predictor

lmfit1 <- lm(Height ~ Weight, mydata)  

# place results from fit into a data frame with the tidy function
lmtable1 <- broom::tidy(lmfit1)

#look at fit results
print(lmtable1)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  140.       19.3        7.24 0.000351
2 Weight         0.333     0.257      1.29 0.243   
# save fit results table  
table_file1 <- here::here("results", "tables", "resulttable1.rds")
saveRDS(lmtable1, file = table_file1)

############################
#### Second model fit
# fit linear model using height as outcome, weight and gender as predictors

lmfit2 <- lm(Height ~ Weight + Gender, mydata)  

# place results from fit into a data frame with the tidy function
lmtable2 <- broom::tidy(lmfit2)

#look at fit results
print(lmtable2)
# A tibble: 4 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  137.       23.5       5.85  0.00427
2 Weight         0.298     0.328     0.910 0.414  
3 GenderM        7.05     16.0       0.440 0.683  
4 GenderO        4.18     18.9       0.221 0.836  
# save fit results table  
table_file2 <- here::here("results", "tables", "resulttable2.rds")
saveRDS(lmtable2, file = table_file2)

############################
#### Third model fit, fitted by MR
# fit linear model using height as outcome, BMI and BMI_cat as predictors

lmfit3 <- lm(Height ~ BMI + BMI_cat, mydata)  

# place results from fit into a data frame with the tidy function
lmtable3 <- broom::tidy(lmfit3)

#look at fit results
print(lmtable3)
# A tibble: 4 × 5
  term              estimate std.error statistic p.value
  <chr>                <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)          75.7      32.3       2.34 0.0794 
2 BMI                   4.21      1.43      2.95 0.0421 
3 BMI_catobese        -76.6      23.3      -3.28 0.0305 
4 BMI_catoverweight   -49.6       8.06     -6.16 0.00352
# save fit results table  
table_file3 <- here::here("results", "tables", "resulttable3.rds")
saveRDS(lmtable3, file = table_file3)

p_box

p_scatter