Example Report Template for a Data Analysis Project
Author
Marco Reina and Alexandra Tejada-Strop
1 Description of data and data source
BMI was calculated using the following formula: weight (kg) / (height (m))^2. BMI categories were assigned based on the BMI score as follows: under 18.5 was labeled underweight, 18.5 to 24.9 was labeled healthy, 25 to 29.9 was labeled overweight, and over 30 was labeled obese. Observations that did not meet these criteria due to errors were labeled NA.
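The BMI formula and category cut-offs described above can be sketched in base R as follows. This is only an illustration, not the code used to build the dataset: the data frame `toy` and its values are hypothetical, height is assumed to be recorded in centimeters (as in the codebook) and converted to meters, and a BMI of exactly 30.0 is assumed to fall in the obese category.

```r
# Hypothetical toy data (not from the project dataset); Height in cm, Weight in kg
toy <- data.frame(
  Height = c(180, 165, 170, 160, NA),
  Weight = c(55, 60, 80, 90, 70)
)

# BMI = weight (kg) / (height (m))^2, rounded to 1 decimal
toy$BMI <- round(toy$Weight / (toy$Height / 100)^2, 1)

# Assign categories; right = FALSE makes each interval closed on the left,
# e.g. [18.5, 25) is "healthy". Treating 30.0 as "obese" is an assumption.
toy$BMI_cat <- cut(
  toy$BMI,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("underweight", "healthy", "overweight", "obese"),
  right = FALSE
)

# Rows with a missing height or weight end up with NA for BMI and BMI_cat
print(toy)
```

Using `cut()` keeps the thresholds in one place, so a change to the category definitions only requires editing `breaks` and `labels`.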
2 Methods
A reproducible data analysis repository was created using a GitHub template provided by the professor. The raw dataset was modified by adding one new numerical variable and one new categorical variable. Collaborative analysis was conducted by granting a classmate direct access to the repository. The collaborator updated the data processing pipeline to use the updated dataset, which included the new variables. The collaborator performed necessary data cleaning and saved the new processed data. Exploratory data analysis was extended to include a boxplot by the new categorical variable (BMI_cat) and a scatterplot relating the new numerical variable (BMI score) to weight. Resulting figures were saved, committed and pushed to the shared repository.
Raw data were screened for wrong, missing, and incorrectly formatted values. Observations with impossible or missing weight values were excluded. The resulting dataset contained only valid, consistently formatted observations suitable for downstream analysis.
# A tibble: 5 × 3
  `Variable Name` `Variable Definition`                 `Allowed Values`
  <chr>           <chr>                                 <chr>
1 Height          height in centimeters                 numeric value >0 or NA
2 Weight          weight in kilograms                   numeric value >0 or NA
3 Gender          identified gender (male/female/other) M/F/O/NA
4 BMI             BMI score                             15 to 30
5 BMI_cat         BMI category based on score           underweight, healthy, o…
d2 <- d1 %>% dplyr::mutate( Height = replace(Height, Height == "6", round(6*30.48, 0)) )
#skimr::skim(d2)
d3 <- d2 %>% dplyr::filter(Weight != 7000) %>% tidyr::drop_na()
#skimr::skim(d3)
d3$Gender <- as.factor(d3$Gender)
#skimr::skim(d3)
d4 <- d3 %>% dplyr::filter( !(Gender %in% c("NA", "N")) ) %>% droplevels()
#skimr::skim(d4)
# one more cleaning step to remove NA from BMI_cat, by MR
d5 <- d4 %>% dplyr::filter( !(BMI_cat %in% c("NA", "N")) ) %>% droplevels()
#skimr::skim(d5)
# turn BMI into a numeric value and round to 1 decimal, by MR
d6 <- d5 %>% dplyr::mutate(BMI = round(as.numeric(BMI), 1))
#skimr::skim(d6)
processeddata <- d6
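One way to confirm that a cleaned dataset conforms to the allowed values in the codebook is to encode each rule as a logical check and count the violations. The sketch below is an illustration only: the data frame `checkdata` and its values are hypothetical, NA is read as a true missing value, and BMI is checked against the 15 to 30 range exactly as the codebook states it, with a missing BMI counted as a violation.

```r
# Hypothetical data with a few deliberate rule violations
checkdata <- data.frame(
  Height = c(180, 165, -5),
  Weight = c(80, NA, 70),
  Gender = c("M", "F", "X"),
  BMI    = c(24.7, 22.0, 31.5)
)

# TRUE where a value satisfies the corresponding codebook rule
valid <- data.frame(
  Height = is.na(checkdata$Height) | checkdata$Height > 0,
  Weight = is.na(checkdata$Weight) | checkdata$Weight > 0,
  Gender = is.na(checkdata$Gender) | checkdata$Gender %in% c("M", "F", "O"),
  # the codebook lists no NA for BMI, so a missing BMI counts as a violation here
  BMI    = !is.na(checkdata$BMI) & checkdata$BMI >= 15 & checkdata$BMI <= 30
)

# Number of rule violations per variable
violations <- colSums(!valid)
print(violations)
```

A check of this kind could be run on `processeddata` after the cleaning steps, with any non-zero count pointing to a rule that the cleaning did not enforce.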
5 Statistical analysis
Linear models were fitted as follows:
1. Height as outcome, weight as predictor
2. Height as outcome, weight and gender as predictors
3. Height as outcome, BMI and BMI_cat as predictors
6 Results
Code:
############################
#### First model fit
# fit linear model using height as outcome, weight as predictor
lmfit1 <- lm(Height ~ Weight, mydata)
# place results from fit into a data frame with the tidy function
lmtable1 <- broom::tidy(lmfit1)
# look at fit results
print(lmtable1)
# save fit results table
table_file1 = here("results", "tables", "resulttable1.rds")
saveRDS(lmtable1, file = table_file1)

############################
#### Second model fit
# fit linear model using height as outcome, weight and gender as predictors
lmfit2 <- lm(Height ~ Weight + Gender, mydata)
# place results from fit into a data frame with the tidy function
lmtable2 <- broom::tidy(lmfit2)
# look at fit results
print(lmtable2)
# save fit results table
table_file2 = here("results", "tables", "resulttable2.rds")
saveRDS(lmtable2, file = table_file2)

############################
#### Third model fit, fitted by MR
# fit linear model using height as outcome, BMI and BMI_cat as predictors
lmfit3 <- lm(Height ~ BMI + BMI_cat, mydata)
# place results from fit into a data frame with the tidy function
lmtable3 <- broom::tidy(lmfit3)
# look at fit results
print(lmtable3)