This README describes the files contained in this repository. Follow the instructions in this file and in the CodeBook file to (re)create the tidy data set file required for the JHU Getting and Cleaning Data course project.

#### This repository contains 3 files at the root level:

- GettingAndCleaningData-master/
  - CodeBook.md - A code book that describes the variables, the data, and any transformations or work performed to clean up the data
  - README.md - This file, which explains how all of the scripts work and are connected in this repository
  - run_analysis.R - The R script containing the code to produce the tidy data set, given the raw data files

#### It also contains 2 directories. The first directory, raw_data_files, contains the raw data files provided in the UCI HAR Dataset:

- GettingAndCleaningData-master/raw_data_files/
  - activity_labels.txt - Links the activity code to the activity name (6 x 2)
  - features.txt - List of the variables collected in the test and training data sets (561 x 2)
  - subject_test.txt - Each row identifies the (test) subject who performed the activity for each window sample; its range is from 1 to 30 (2947 x 1)
  - subject_train.txt - Each row identifies the (train) subject who performed the activity for each window sample; its range is from 1 to 30 (7352 x 1)
  - X_test.txt - Raw test data set (2947 x 561)
  - X_train.txt - Raw training data set (7352 x 561)
  - y_test.txt - Activity code for each row of the test data set (2947 x 1)
  - y_train.txt - Activity code for each row of the train data set (7352 x 1)

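
As a quick sanity check, the dimensions listed above can be compared against the files on disk. The following is only a sketch: `has_expected_dims` is a hypothetical helper (it is not part of run_analysis.R), and the code assumes the working directory is the repository root and that the files are whitespace-delimited with no header row.

```r
# Expected dimensions (rows, columns) as listed in this README.
expected <- list(
  "activity_labels.txt" = c(6, 2),
  "features.txt"        = c(561, 2),
  "subject_test.txt"    = c(2947, 1),
  "subject_train.txt"   = c(7352, 1),
  "X_test.txt"          = c(2947, 561),
  "X_train.txt"         = c(7352, 561),
  "y_test.txt"          = c(2947, 1),
  "y_train.txt"         = c(7352, 1)
)

# Hypothetical helper: TRUE if the file at `path` has the expected shape.
has_expected_dims <- function(path, dims) {
  actual <- dim(read.table(path, header = FALSE))
  identical(as.numeric(actual), as.numeric(dims))
}

# Assumption: the working directory is the repository root.
raw_dir <- "raw_data_files"
for (f in names(expected)) {
  path <- file.path(raw_dir, f)
  if (file.exists(path)) {
    status <- if (has_expected_dims(path, expected[[f]])) "OK" else "unexpected shape"
    cat(f, status, "\n")
  } else {
    cat(f, "not found\n")
  }
}
```

Reading the two X_*.txt files this way may take a few seconds, since each has 561 columns.
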
#### The second directory, tidy_data_file, contains the tidy data set text file we were required to produce.

- GettingAndCleaningData-master/tidy_data_file/
  - tidy_data.txt - The tidy data set (181 x 68)

In order to use these files to recreate the tidy data set, follow the instructions below (which assume you know how to use GitHub):

- Click the GitHub link to this repository provided in the Evaluation area.
- Either download the zip version of the repository or clone it to your machine.
- Unzip the file, if required. This will recreate the folder structure listed above.
- In R, set your working directory to the "GettingAndCleaningData-master" folder. (NB: If you cloned the repository, the '-master' part of the folder name will not appear.)
- The run_analysis.R script requires the following packages to be loaded:
  - data.table
  - dplyr
- Sourcing the run_analysis.R file will cause the script to run and the tidy data set file to be (re)created. Note that the script overwrites the tidy data set file each time. The script takes approximately 30 seconds to run (depending on your hardware, see below).
- The tidy data set is stored in a variable called "tidy", which you can explore in R. Alternatively, you can open the tidy_data.txt file in a text editor.
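
The last few steps above could look roughly like this in an R session. This is a sketch rather than part of the repository: `missing_packages` is a hypothetical helper, and the code assumes the working directory has already been set to the repository root.

```r
# Packages this README says run_analysis.R needs.
required <- c("data.table", "dplyr")

# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing <- missing_packages(required)
if (length(missing) > 0) {
  message("Install these packages first: ", paste(missing, collapse = ", "))
} else if (file.exists("run_analysis.R")) {
  # Assumption: the working directory is the repository root.
  # system.time() reports how long the script takes on this machine.
  elapsed <- system.time(source("run_analysis.R"))
  print(elapsed["elapsed"])
  # "tidy" is the variable this README says the script creates.
  print(dim(tidy))
} else {
  message("run_analysis.R not found; set the working directory to the repository root.")
}
```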
The run_analysis.R script contains comments throughout that explain what each section of the script does. Here is a summary of those comments:

This R script, called run_analysis.R, performs the following tasks:

1. Reads the raw data files and merges them into one data set,
2. Replaces the activity codes with the activity names,
3. Uses the "features.txt" file to appropriately label the columns/variables,
4. Extracts only the mean and std variables from the larger data set,
5. Groups the data by subject and activity and then calculates the mean for each mean/std column/variable,
6. Writes out the "tidy" data to a text file.

The script cleans up after itself in each block of code, so it should not take more than 60MB of memory to run.
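
To make the six steps concrete, here is a minimal sketch on a tiny made-up data set. It uses base R only, so it runs without data.table or dplyr, and it is not the code in run_analysis.R. The toy values, the feature-name pattern used in step 4, and the output file name are all assumptions for illustration.

```r
# Toy stand-ins for the raw files; every value below is made up.
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-max()-X")
activity_labels <- data.frame(code = 1:2, activity = c("WALKING", "SITTING"),
                              stringsAsFactors = FALSE)
x_test  <- data.frame(V1 = c(1, 3), V2 = c(10, 30), V3 = c(5, 7))
x_train <- data.frame(V1 = c(2, 4), V2 = c(20, 40), V3 = c(6, 8))
y_test  <- c(1, 1); y_train <- c(2, 2)   # activity codes
s_test  <- c(1, 1); s_train <- c(2, 2)   # subject ids

# 1. Merge the test and training rows into one data set.
measurements <- rbind(x_test, x_train)
codes        <- c(y_test, y_train)
subjects     <- c(s_test, s_train)

# 2. Replace the activity codes with the activity names.
activity <- activity_labels$activity[match(codes, activity_labels$code)]

# 3. Label the measurement columns using the feature names.
names(measurements) <- features

# 4. Keep only the mean() and std() variables (assumed name pattern).
keep     <- grepl("mean\\(\\)|std\\(\\)", features)
selected <- measurements[, keep, drop = FALSE]

# 5. Average each remaining variable per subject and activity.
tidy <- aggregate(selected,
                  by = list(subject = subjects, activity = activity),
                  FUN = mean)
tidy <- tidy[order(tidy$subject, tidy$activity), ]

# 6. Write the result to a text file (a temp file here, so nothing is overwritten).
out_file <- file.path(tempdir(), "tidy_toy.txt")
write.table(tidy, file = out_file, row.names = FALSE)

# Mirror the "cleans up after itself" idea: drop intermediates no longer needed.
rm(measurements, selected, codes, subjects, activity, keep)
```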

While most of the operations have been optimized for speed of execution, speed was not a specified consideration for this project. As stated above, this script takes approximately 30 seconds to run on a 2.6 GHz Intel Core i7 MacBook Pro with 8 GB of RAM.

In accordance with the JHU Honor Code, I certify that my answers here are my own work, and that I have appropriately acknowledged all external sources (if any) that were used in this work.