diff --git a/.gitignore b/.gitignore index 807ea25..2ff9d56 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,7 @@ .Rproj.user .Rhistory .RData + +Data/~$Full_ForeignAssistanceData.xlsx + +Exercises/R/~$fragilestatesindex-2015.xlsx diff --git a/Data/fragilestatesindex-2006to2014.xlsx b/Data/fragilestatesindex-2006to2014.xlsx new file mode 100644 index 0000000..96155b5 Binary files /dev/null and b/Data/fragilestatesindex-2006to2014.xlsx differ diff --git a/Data/fragilestatesindex-2015.xlsx b/Data/fragilestatesindex-2015.xlsx new file mode 100644 index 0000000..05aa14e Binary files /dev/null and b/Data/fragilestatesindex-2015.xlsx differ diff --git a/Exercises/R/exploringData.Rmd b/Exercises/R/exploringData.Rmd index 8f19ce2..21090e0 100644 --- a/Exercises/R/exploringData.Rmd +++ b/Exercises/R/exploringData.Rmd @@ -1,30 +1,260 @@ --- -title: "exploring data" +title: "Merging clean FAD data with Fragile States Indices" author: "Laura Hughes" date: "December 14, 2015" -output: html_document +output: + html_document: + toc: true --- -```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE) +## Overview +In previous modules, we imported in part of [U.S. Foreign Assistance Disbursements data](http://beta.foreignassistance.gov/) and whipped it into shape for data analysis. **Here, we'll merge this data with other data to explore relationships: the country GDP, and the Fragile States Index.** + +### The data +**[Foreign Assistance Disbursements](http://beta.foreignassistance.gov/)**: data on where the U.S. +**[Fragile States Index](http://fsi.fundforpeace.org/)** Fund for Peace's indicators for the stability of countries +**[GDP]** + +### New functions we'll cover in this module +* readxl::read_excel +* tidyr::left_join +* colnames +* c +* ifelse +* %in% +* data.table::%like% + +## Import the data +```{r import functions, message = FALSE} +# Workhorse libraries for data science: data cleanup, summaries, merging, etc. 
+library(dplyr) # Filter, create new variables, summarise, ... Basically, anything you can think to do to a dataset +library(tidyr) # Reshape and merge datasets +library(data.table) +library(stringr) # String manipulation + +# Incredibly powerful plotting library built off of the "Grammer of Graphics" +library(ggplot2) + +# Libraries to help import files +library(haven) # Imports in files from Stata, SAS, and SPSS +library(readr) # An advanced form of the base 'read.csv' file with some added functionality. +library(readxl) # Function to import in multiple sheets from Excel + +library(knitr) # Helper function to produce this RMarkdown document. +``` + +```{r importData} +fileName = '~/GitHub/StataTraining/Exercises/Stata/StataTraining.csv' +fad = read_csv(fileName) + +# Each year of data is located in a separate sheet within the Excel workbook. To keep things simple, we'll download the data from 2006 and 2014. +fragileFile = '~/GitHub/StataTraining/Data/fragilestatesindex-2006to2014.xlsx' +fragile2014 = read_excel(fragileFile, sheet = 1) +fragile2006 = read_excel(fragileFile, sheet = 9) + +# View the results +# in browser: View(fragile2006) +kable(head(fragile2006)) +``` + +Looks pretty good. + +####As a small thing, we'll change the column names to remove the spaces so it's somewhat less obnoxious. + +The function `colnames` has two purposes. If you just type `colnames(fragile2006)`, it'll return the name of the columns. If you set that equal to a list (a series of strings using the function `c`), it'll replace the column names. +```{r changeNames} + +colnames(fragile2006) = c('idx', 'country', 'fragileIdx', 'demographic', 'refugees', + 'grpGrievances', 'humanFlight', 'unevenDvpt', 'poverty', + 'stateLegit', 'publicServices', 'humanRights', 'securityApparatus', + 'factionalizedElites', 'ExtIntervention') + +``` + +## Cool. Let's try to merge those together. 
+We'll use what's called a 'left join' -- basically, we'll keep all the rows in the data frame that is the first (left) argument to the function, and for every matching country within the second data frame (`fragile2006`), it'll copy those rows. + +So if you have two instances of Bangladesh as the benefitting country, it'll copy whatever value for Bangladesh is in `fragile2006` twice, one for each row. + +The **key** -- where we specify which column to use as the index to match between the two data frames -- is given by the `by` argument. +```{r mergeFragile} +# We have to specify what's the common variable between the two data frames using the parameter 'by'. +mergedData = left_join(fad, fragile2006, by = c('BenefitingCountry' = 'country')) + +``` + +### Did the merge work properly? +* If it did, the number of rows of the resulting data frame should be the same, but the number of columns increased by the fragile2006 columns. +* Each of the rows within `mergedData` should have a value for the categories we added. If the merge didn't work, the variables we added (like `fragileIdx`) will be filled with `NA`. + * If everything within `fragileIndex` has `NA`s, something went really, really wrong + * If a few rows have NAs, that means either the country from `fad` isn't located in `fragile2006`, or it's named something different in the two (sigh). + +#### So how do we check that? +1. Check the dimensions (number of rows and columns) of the datasets before and after the merge. +2. See if there are any NAs within `fragileIndex`. +3. If there are `NA`s, filter them and see which countries they correspond to. +4. Then we'll check if there's something similar within `fragile2006`, or if it doesn't exist. + +####1. Check the dimensions +```{r checkDim} +# R has three functions to check dimensions: dim, nrow, ncol. +# As you can maybe guess, dim returns a vector containing the number of rows and then number of columns. 
+# nrow returns a single value (the number of rows) in the data frame +# and ncol returns the number of columns. + +# Note: if you're using RStudio, you can also see the dimensions of the data frame in the 'Environment' tab +dim(fad) +dim(fragile2006) +dim(mergedData) + +# If we want to get a little fancy, we can work in some basic logic to test if we're getting the right result. +# We'd expect the number of rows to be equal to the fad since we're doing a left join-- it'll only save the rows (and ALL the rows) from the first argument, which is the fad. +fadRows = nrow(fad) +mergedRows = nrow(mergedData) + +if (mergedRows == fadRows) { + print('Woo hoo! We have the right number of rows.') +} else { + print(':( Incorrect number of rows') +} + +# For the number of columns, we'd expect it to be the sum of the columns in both data frames - 1. +# Where does the -1 come from? Remember that you're merging the data sets together, so the two +fadCols = ncol(fad) +fragileCols = ncol(fragile2006) +mergedCols = ncol(mergedData) + + +if (mergedCols == fadCols + fragileCols - 1) { + print('Woo hoo! We have the right number of columns.') +} else { + print(':( Weird number of columns') +} + +# And if you want to be SUPER fancy, you can combine the logic of those two together using the logic operator AND, given by the '&'. +if (mergedCols == fadCols + fragileCols - 1 & + mergedRows == fadRows) { + print('Woo hoo! Dimensions look right.') +} else { + print(':( Wrong dimensions') +} + +``` + +#### 2. Are there any NAs? +```{r checkNAs} +# Like most things in R, there are many ways to do this. + +# Method 1: look at a summary of the data. +summary(mergedData$fragileIdx) + +# Method 2: count the numbers of NAs +# is.na will return a vector of booleans (true/falses). We can then count how many rows are true (as in, the number of rows missing data). 
+isMissing = is.na(mergedData$fragileIdx) +numNAs = sum(isMissing) # Since booleans are essentially 0's and 1's, if you sum up the vector, you'll get the number of NAs within fragileIdx +print(numNAs) + +# Method 3: count NAs using dplyr. +mergedData %>% + count(fragileIdx, sort = TRUE) # counts each number per group, sorted descendingly + +# As an aside... this method shows every value for the fragileIdx, which is cool, but overkill for what we want. We can add in logic to count only the values that are NAs. + +# It'll give us the same answer, but focusing only on what's important. +mergedData %>% + count(is.na(fragileIdx), sort = TRUE) # counts each number per group, sorted descendingly + +``` +*(sigh)* As expected, whenever you're trying to merge things together, things are rarely standardized and works straight out of the box. This is another reason why **unique ids** are so useful. We'll have to fix the country names so they merge in together. First, let's see where the problem(s) is. + +#### 3. What are the countries that are making things annoying? +```{r badCountries} +# Let's filter out the rows that are missing fragile state data, group by the number of countries, and count how many rows there are. + +mergedData %>% + filter(is.na(fragileIdx)) %>% + count(BenefitingCountry, sort = TRUE) + +# How many different countries are potentially screwed up? +missingCountries = mergedData %>% + filter(is.na(fragileIdx)) %>% + count(BenefitingCountry, sort = TRUE) + +nrow(missingCountries) ``` +From looking at the data, we can see that there are some things that really shouldn't have a fragile index (so the merge worked properly). -## R Markdown +* The most commonly missing Benefiting "Country" is "Worldwide." That makes sense-- there's no Fragile Index for the whole world. +* Similarly, there are entries from USAID's regions, like "East Asia Region". Since the money didn't go to a single country, again, we can't merge in country-level data. 
-This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see . +But there's also some things that we'll clearly need to fix. -When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this: +* Sudan is listed in the FAD dataset as "Sudan, Pre-2011 Election" -```{r cars} -summary(cars) +And then there's some baffling ones. + +* Ghana seems pretty normal for a name. Why isn't it merging propely? + +#### 4. Let's check what those look like in the Fragile Index dataset. Do those countries exist? Do they have different names? + +First, let's get rid of those entries that pretty clearly aren't a problem. +```{r createCountryTag} +# We know regions and Worldwide are clearly a problem. + +# Identify all the Benefitting countries with "Region" or "Worldwide" in their title. + +# Filter Benefitting Countries with 'Region' in their title +regionNames = mergedData %>% + filter(BenefitingCountry %like% 'Region' | + BenefitingCountry == 'Worldwide') %>% + select(BenefitingCountry) + +# Pull out only the unique names. +notCountry = unique(regionNames) + + +# Create a tag called isCountry to indicate whether a dispersement went to a single country or a region. +mergedData2 = mergedData %>% + mutate(isCountry = ifelse(BenefitingCountry %in% notCountry$BenefitingCountry, + FALSE, TRUE)) + +# Okay, cool. Now let's see how many problems we still have: +mergedData2 %>% + filter(is.na(fragileIdx), + isCountry == TRUE) %>% + count(BenefitingCountry, sort = TRUE) ``` -## Including Plots +Better. But still problems. Let's look at a couple countries we know have problems merging, like Ghana. + +```{r fragileCountries} +# Print out the name for Ghana. This will search for any country containing "Ghana" somewhere in it. 
+ghana = fragile2006 %>% + filter(country %like% 'Ghana') %>% + select(country) + +print(ghana) + +# So Ghana exists. It's spelled correctly... what's the deal? Maybe there are extra spaces in the name. +# Let's check the number of characters in that answer. +nchar(ghana) -You can also embed plots, for example: +# AHA! Ghana only has 5 letters... but it has 6 within the Fragile States dataset. So at least part of the problem is that we have extra spaces. Let's strip those out of fragile2006, remerge, and check. -```{r pressure, echo=FALSE} -plot(pressure) +# Down with spaces. +fragile2006 = fragile2006 %>% + mutate(country_cleaned = str_trim(country)) + +mergedData3 = left_join(fad, fragile2006, by = c('BenefitingCountry' = 'country_cleaned')) %>% # merge + # add in our tag for whether it's a country or region + mutate(isCountry = ifelse(BenefitingCountry %in% notCountry$BenefitingCountry, + FALSE, TRUE)) + +mergedData3 %>% + filter(is.na(fragileIdx), + isCountry == TRUE) %>% + count(BenefitingCountry, sort = TRUE) ``` -Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot. + +Getting better. Sadly, I think we've reached the end of what can be done by computer, and now we have to match more or less by hand. diff --git a/Exercises/R/exploringData.html b/Exercises/R/exploringData.html index 3bf057c..bd9aac2 100644 --- a/Exercises/R/exploringData.html +++ b/Exercises/R/exploringData.html @@ -12,7 +12,7 @@ -exploring data +Merging clean FAD data with Fragile States Indices @@ -62,30 +62,460 @@ +
+ +
+ +
+

Overview

+

In previous modules, we imported part of the U.S. Foreign Assistance Disbursements data and whipped it into shape for analysis. Here, we’ll merge it with two other datasets to explore relationships: country GDP and the Fragile States Index.

+
+

The data

+

Foreign Assistance Disbursements: data on where the U.S. Fragile States Index Fund for Peace’s indicators for the stability of countries [GDP]

+
+
+

New functions we’ll cover in this module

+
    +
  • readxl::read_excel
  • +
  • dplyr::left_join
  • +
  • colnames
  • +
  • c
  • +
  • ifelse
  • +
  • %in%
  • +
  • data.table::%like%
  • +
+
+
+
+

Import the data

+
# Workhorse libraries for data science: data cleanup, summaries, merging, etc.
+library(dplyr) # Filter, create new variables, summarise, ... Basically, anything you can think to do to a dataset
+library(tidyr) # Reshape and merge datasets
+library(data.table)
+library(stringr) # String manipulation
+
+# Incredibly powerful plotting library built off of the "Grammar of Graphics"
+library(ggplot2)
+
+# Libraries to help import files
+library(haven) # Imports in files from Stata, SAS, and SPSS
+library(readr) # An advanced form of the base 'read.csv' function with some added functionality.
+library(readxl) # Function to import in multiple sheets from Excel
+
+library(knitr) # Helper function to produce this RMarkdown document.
+
fileName = '~/GitHub/StataTraining/Exercises/Stata/StataTraining.csv'
+fad = read_csv(fileName)
+
+# Each year of data is located in a separate sheet within the Excel workbook.  To keep things simple, we'll import only the data from 2006 and 2014.
+fragileFile = '~/GitHub/StataTraining/Data/fragilestatesindex-2006to2014.xlsx'
+fragile2014 = read_excel(fragileFile, sheet = 1)
+fragile2006 = read_excel(fragileFile, sheet = 9)
+
+# View the results
+# in browser: View(fragile2006)
+kable(head(fragile2006))
+| NA | Failed States Index 2006 | Total | Demographic Pressures | Refugees and IDPs | Group Grievance | Human Flight | Uneven Development | Poverty and Economic Decline | Legitimacy of the State | Public Services | Human Rights | Security Apparatus | Factionalized Elites | External Intervention |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| 1 | Sudan | 112.3 | 9.6 | 9.7 | 9.7 | 9.1 | 9.2 | 7.5 | 9.5 | 9.5 | 9.8 | 9.8 | 9.1 | 9.8 |
+| 2 | Congo, D.R. | 110.1 | 9.5 | 9.5 | 9.1 | 8.0 | 9.0 | 8.1 | 9.0 | 9.0 | 9.5 | 9.8 | 9.6 | 10.0 |
+| 3 | Cote d’Ivoire | 109.2 | 8.8 | 7.6 | 9.8 | 8.5 | 8.0 | 9.0 | 10.0 | 8.5 | 9.4 | 9.8 | 9.8 | 10.0 |
+| 4 | Iraq | 109.0 | 8.9 | 8.3 | 9.8 | 9.1 | 8.7 | 8.2 | 8.5 | 8.3 | 9.7 | 9.8 | 9.7 | 10.0 |
+| 5 | Zimbabwe | 108.9 | 9.7 | 8.9 | 8.5 | 9.0 | 9.2 | 9.8 | 8.9 | 9.5 | 9.5 | 9.4 | 8.5 | 8.0 |
+| 6 | Chad | 105.9 | 9.0 | 9.0 | 8.5 | 8.0 | 9.0 | 7.9 | 9.5 | 9.0 | 9.1 | 9.4 | 9.5 | 8.0 |
+

Looks pretty good.

+
+

As a small thing, we’ll change the column names to remove the spaces so it’s somewhat less obnoxious.

+

The function colnames has two purposes. If you just type colnames(fragile2006), it’ll return the names of the columns. If you assign a character vector to it (a series of strings combined with the function c), it’ll replace the column names.

+
colnames(fragile2006) = c('idx', 'country', 'fragileIdx', 'demographic', 'refugees',
+                          'grpGrievances', 'humanFlight', 'unevenDvpt', 'poverty', 
+                          'stateLegit', 'publicServices', 'humanRights', 'securityApparatus', 
+                          'factionalizedElites', 'ExtIntervention')
+
+
+
+

Cool. Let’s try to merge those together.

+

We’ll use what’s called a ‘left join’ – basically, we’ll keep all the rows in the data frame that is the first (left) argument to the function, and wherever a row’s country matches one in the second data frame (fragile2006), it’ll copy over that country’s values.

+

So if you have two instances of Bangladesh as the benefitting country, it’ll copy whatever value for Bangladesh is in fragile2006 twice, one for each row.

+

The key – where we specify which column to use as the index to match between the two data frames – is given by the by argument.

+
# We have to specify what's the common variable between the two data frames using the parameter 'by'.
+mergedData = left_join(fad, fragile2006, by = c('BenefitingCountry' = 'country'))
+
+
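To make the left-join behavior concrete, here is a minimal sketch on toy data. The data frames `toyFad` and `toyFragile` and the Bangladesh value are made up for illustration; they are not taken from the real datasets.

```r
library(dplyr)

# Toy stand-ins for fad and fragile2006 (hypothetical names and values).
toyFad = data.frame(
  BenefitingCountry = c('Bangladesh', 'Bangladesh', 'Worldwide'),
  amount = c(100, 250, 75),
  stringsAsFactors = FALSE
)
toyFragile = data.frame(
  country = c('Bangladesh', 'Sudan'),
  fragileIdx = c(96.3, 112.3),
  stringsAsFactors = FALSE
)

# Keep every row of toyFad; copy fragileIdx over wherever the country matches.
toyMerged = left_join(toyFad, toyFragile, by = c('BenefitingCountry' = 'country'))
print(toyMerged)
```

Both Bangladesh rows should pick up the same `fragileIdx`, 'Worldwide' should get `NA` because nothing matches it, and Sudan should not appear at all since it is only in the right-hand data frame.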

Did the merge work properly?

+
    +
  • If it did, the number of rows of the resulting data frame should be the same, but the number of columns increased by the fragile2006 columns.
  • +
  • Each of the rows within mergedData should have a value for the categories we added. If the merge didn’t work, the variables we added (like fragileIdx) will be filled with NA. +
      +
    • If everything within fragileIdx has NAs, something went really, really wrong
    • +
    • If a few rows have NAs, that means either the country from fad isn’t located in fragile2006, or it’s named something different in the two (sigh).
    • +
  • +
+
+

So how do we check that?

+
    +
  1. Check the dimensions (number of rows and columns) of the datasets before and after the merge.
  2. +
  3. See if there are any NAs within fragileIdx.
  4. +
  5. If there are NAs, filter them and see which countries they correspond to.
  6. +
  7. Then we’ll check if there’s something similar within fragile2006, or if it doesn’t exist.
  8. +
+
+
+

1. Check the dimensions

+
# R has three functions to check dimensions: dim, nrow, ncol.
+# As you can maybe guess, dim returns a vector containing the number of rows and then number of columns.  
+# nrow returns a single value (the number of rows) in the data frame
+# and ncol returns the number of columns.
+
+# Note: if you're using RStudio, you can also see the dimensions of the data frame in the 'Environment' tab
+dim(fad)
+
## [1] 2417   13
+
dim(fragile2006)
+
## [1] 147  15
+
dim(mergedData)
+
## [1] 2417   27
+
# If we want to get a little fancy, we can work in some basic logic to test if we're getting the right result.
+# We'd expect the number of rows to be equal to the fad since we're doing a left join-- it'll only save the rows (and ALL the rows) from the first argument, which is the fad.
+fadRows = nrow(fad)
+mergedRows = nrow(mergedData)
+
+if (mergedRows == fadRows) {
+  print('Woo hoo!  We have the right number of rows.')
+} else {
+  print(':(  Incorrect number of rows')
+}
+
## [1] "Woo hoo!  We have the right number of rows."
+
+# For the number of columns, we'd expect it to be the sum of the columns in both data frames - 1.
+# Where does the -1 come from?  Remember that you're merging the data sets together, so the two key columns (BenefitingCountry and country) collapse into a single column.
+fadCols = ncol(fad)
+fragileCols = ncol(fragile2006)
+mergedCols = ncol(mergedData)
+
+
+if (mergedCols == fadCols + fragileCols - 1) {
+  print('Woo hoo!  We have the right number of columns.')
+} else {
+  print(':(  Weird number of columns')
+}
+
## [1] "Woo hoo!  We have the right number of columns."
+
# And if you want to be SUPER fancy, you can combine the logic of those two together using the logic operator AND, given by the '&'.
+if (mergedCols == fadCols + fragileCols - 1 &
+    mergedRows == fadRows) {
+  print('Woo hoo!  Dimensions look right.')
+} else {
+  print(':(  Wrong dimensions')
+}
+
## [1] "Woo hoo!  Dimensions look right."
+
+
+
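The same dimension checks can be written more compactly with `stopifnot`, which halts with an error as soon as one condition is false. This is only a sketch on toy data; `left`, `right`, and `joined` are hypothetical names, and the row-count check assumes each key appears at most once in the right-hand data frame (duplicated keys on the right would add rows).

```r
library(dplyr)

# Toy data frames (hypothetical); 'key' and 'id' play the role of the country columns.
left = data.frame(key = c('a', 'b', 'b'), x = 1:3, stringsAsFactors = FALSE)
right = data.frame(id = c('a', 'b'), y = c(10, 20), z = c(TRUE, FALSE),
                   stringsAsFactors = FALSE)

joined = left_join(left, right, by = c('key' = 'id'))

# Each condition must be TRUE, otherwise R stops with an error naming the failed check.
stopifnot(
  nrow(joined) == nrow(left),                   # a left join keeps every left-hand row
  ncol(joined) == ncol(left) + ncol(right) - 1  # the two key columns collapse into one
)
```

When combining two single TRUE/FALSE values inside `if`, `&&` is the conventional operator; `&` also works here because each side has length one.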

2. Are there any NAs?

+
# Like most things in R, there are many ways to do this.
+
+# Method 1: look at a summary of the data.
+summary(mergedData$fragileIdx)
+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
+##   24.80   79.20   88.30   86.79   94.50  112.30    1100
+
# Method 2: count the numbers of NAs
+# is.na will return a vector of booleans (true/falses).  We can then count how many rows are true (as in, the number of rows missing data).
+isMissing = is.na(mergedData$fragileIdx)
+numNAs = sum(isMissing) # Since booleans are essentially 0's and 1's, if you sum up the vector, you'll get the number of NAs within fragileIdx
+print(numNAs)
+
## [1] 1100
+
# Method 3: count NAs using dplyr.
+mergedData %>% 
+  count(fragileIdx, sort = TRUE) # counts each number per group, sorted descendingly
+
## Source: local data frame [69 x 2]
+## 
+##    fragileIdx     n
+##         (dbl) (int)
+## 1          NA  1100
+## 2        79.2    58
+## 3        88.6    45
+## 4        99.8    42
+## 5        91.9    41
+## 6        89.2    39
+## 7        94.4    39
+## 8        94.5    37
+## 9        78.3    34
+## 10      104.6    34
+## ..        ...   ...
+
# As an aside... this method shows every value for the fragileIdx, which is cool, but overkill for what we want.  We can add in logic to count only the values that are NAs.
+
+# It'll give us the same answer, but focusing only on what's important.
+mergedData %>% 
+  count(is.na(fragileIdx), sort = TRUE) # counts each number per group, sorted descendingly
+
## Source: local data frame [2 x 2]
+## 
+##   is.na(fragileIdx)     n
+##               (lgl) (int)
+## 1             FALSE  1317
+## 2              TRUE  1100
+

(sigh) As expected, whenever you’re trying to merge things together, things are rarely standardized and rarely work straight out of the box. This is another reason why unique ids are so useful. We’ll have to fix the country names so they merge together. First, let’s see where the problems are.

+
+
+

3. What are the countries that are making things annoying?

+
+# Let's keep only the rows that are missing fragile state data, group them by country, and count how many rows each country has.
+
+mergedData %>% 
+  filter(is.na(fragileIdx)) %>% 
+  count(BenefitingCountry, sort = TRUE)
+
## Source: local data frame [81 x 2]
+## 
+##           BenefitingCountry     n
+##                       (chr) (int)
+## 1                 Worldwide   208
+## 2                     Ghana    44
+## 3          East Asia Region    41
+## 4  Sudan, Pre-2011 Election    40
+## 5                   Senegal    38
+## 6                    Kosovo    35
+## 7                   Georgia    32
+## 8                      Mali    29
+## 9                 Nicaragua    28
+## 10            Africa Region    27
+## ..                      ...   ...
+
# How many different countries are potentially screwed up?
+missingCountries = mergedData %>% 
+  filter(is.na(fragileIdx)) %>% 
+  count(BenefitingCountry, sort = TRUE)
+
+nrow(missingCountries)
+
## [1] 81
+
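Another way to list the keys that failed to match, without looking for `NA`s after the fact, is `dplyr::anti_join`, which keeps only the left-hand rows that have no partner on the right. This is a swapped-in alternative, not what the module does; the toy data frames below are hypothetical.

```r
library(dplyr)

# Toy stand-ins for fad and fragile2006 (hypothetical rows).
toyFad = data.frame(
  BenefitingCountry = c('Sudan, Pre-2011 Election', 'Sudan, Pre-2011 Election',
                        'Worldwide', 'Iraq'),
  stringsAsFactors = FALSE
)
toyFragile = data.frame(
  country = c('Sudan', 'Iraq'),
  fragileIdx = c(112.3, 109.0),
  stringsAsFactors = FALSE
)

# anti_join returns the rows of toyFad whose country has no match in toyFragile.
unmatched = toyFad %>%
  anti_join(toyFragile, by = c('BenefitingCountry' = 'country')) %>%
  count(BenefitingCountry, sort = TRUE)

print(unmatched)
```

One caveat with the `NA`-based check: it would also flag a country that matched but genuinely had a missing index value, while `anti_join` only reports keys that did not match.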

From looking at the data, we can see that there are some things that really shouldn’t have a fragile index (so the merge worked properly).

+
    +
  • The most commonly missing Benefiting “Country” is “Worldwide.” That makes sense– there’s no Fragile Index for the whole world.
  • +
  • Similarly, there are entries from USAID’s regions, like “East Asia Region”. Since the money didn’t go to a single country, again, we can’t merge in country-level data.
  • +
+

But there are also some things that we’ll clearly need to fix.

+
    +
  • Sudan is listed in the FAD dataset as “Sudan, Pre-2011 Election”
  • +
+

And then there are some baffling ones.

+
    +
  • Ghana seems pretty normal for a name. Why isn’t it merging properly?
  • +
+
+
+

4. Let’s check what those look like in the Fragile Index dataset. Do those countries exist? Do they have different names?

+

First, let’s get rid of those entries that pretty clearly aren’t a problem.

+
+# We know regions and Worldwide aren't single countries, so they can never match the country-level data.
+
+# Identify all the Benefitting countries with "Region" or "Worldwide" in their title.
+
+# Filter Benefitting Countries with 'Region' in their title
+regionNames = mergedData %>% 
+  filter(BenefitingCountry %like% 'Region' |
+           BenefitingCountry == 'Worldwide') %>% 
+  select(BenefitingCountry) 
 
-
-

R Markdown

-

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

-

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

-
summary(cars)
-
##      speed           dist       
-##  Min.   : 4.0   Min.   :  2.00  
-##  1st Qu.:12.0   1st Qu.: 26.00  
-##  Median :15.0   Median : 36.00  
-##  Mean   :15.4   Mean   : 42.98  
-##  3rd Qu.:19.0   3rd Qu.: 56.00  
-##  Max.   :25.0   Max.   :120.00
+# Pull out only the unique names.
+notCountry = unique(regionNames)
+
+
+# Create a tag called isCountry to indicate whether a disbursement went to a single country or a region.
+mergedData2 = mergedData %>% 
+  mutate(isCountry = ifelse(BenefitingCountry %in% notCountry$BenefitingCountry,
+                FALSE, TRUE))
+
+# Okay, cool.  Now let's see how many problems we still have:
+mergedData2 %>% 
+  filter(is.na(fragileIdx),
+         isCountry == TRUE) %>% 
+  count(BenefitingCountry, sort = TRUE)
+
## Source: local data frame [59 x 2]
+## 
+##                BenefitingCountry     n
+##                            (chr) (int)
+## 1                          Ghana    44
+## 2       Sudan, Pre-2011 Election    40
+## 3                        Senegal    38
+## 4                         Kosovo    35
+## 5                        Georgia    32
+## 6                           Mali    29
+## 7                      Nicaragua    28
+## 8                       Cambodia    26
+## 9                        Armenia    25
+## 10 Congo, Democratic Republic of    25
+## ..                           ...   ...
+
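The tagging step above could also be sketched in a single `mutate`, using `stringr::str_detect` in place of `data.table`'s `%like%` (both do a pattern match on each string). This is an alternative under stated assumptions, not the module's method, and `toyFad` is a hypothetical stand-in for the real data.

```r
library(dplyr)
library(stringr)

# Toy benefiting "countries" (hypothetical rows).
toyFad = data.frame(
  BenefitingCountry = c('Worldwide', 'East Asia Region', 'Ghana', 'Africa Region'),
  stringsAsFactors = FALSE
)

# A row is a country unless its name contains 'Region' or is exactly 'Worldwide'.
# The logical expression already yields TRUE/FALSE, so no ifelse is needed.
tagged = toyFad %>%
  mutate(isCountry = !(str_detect(BenefitingCountry, 'Region') |
                         BenefitingCountry == 'Worldwide'))

print(tagged)
```

Only Ghana should come out tagged as a country here.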

Better. But still problems. Let’s look at a couple countries we know have problems merging, like Ghana.

+
# Print out the name for Ghana.  This will search for any country containing "Ghana" somewhere in it.
+ghana = fragile2006 %>% 
+  filter(country %like% 'Ghana') %>% 
+  select(country)
+
+print(ghana)
+
## Source: local data frame [1 x 1]
+## 
+##   country
+##     (chr)
+## 1  Ghana
+
# So Ghana exists.  It's spelled correctly... what's the deal?  Maybe there are extra spaces in the name.
+# Let's check the number of characters in that answer.
+nchar(ghana)
+
## country 
+##       6
+
+# AHA!  Ghana only has 5 letters... but the name has 6 characters within the Fragile States dataset.  So at least part of the problem is that we have extra spaces.  Let's strip those out of fragile2006, remerge, and check.
+
+# Down with spaces.
+fragile2006 = fragile2006 %>% 
+  mutate(country_cleaned = str_trim(country))
+
+mergedData3 = left_join(fad, fragile2006, by = c('BenefitingCountry' = 'country_cleaned')) %>% # merge
+  # add in our tag for whether it's a country or region
+  mutate(isCountry = ifelse(BenefitingCountry %in% notCountry$BenefitingCountry, 
+                FALSE, TRUE))
+
+mergedData3 %>% 
+  filter(is.na(fragileIdx),
+         isCountry == TRUE) %>% 
+  count(BenefitingCountry, sort = TRUE)
+
## Source: local data frame [36 x 2]
+## 
+##                BenefitingCountry     n
+##                            (chr) (int)
+## 1       Sudan, Pre-2011 Election    40
+## 2                         Kosovo    35
+## 3  Congo, Democratic Republic of    25
+## 4                     Madagascar    25
+## 5                        Lesotho    17
+## 6                    Timor-Leste    17
+## 7             West Bank and Gaza    17
+## 8                     Cape Verde    13
+## 9                       Djibouti     9
+## 10                        Guyana     7
+## ..                           ...   ...
+
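Rather than checking names one at a time with `nchar`, padded names can be flagged all at once by comparing each name with its trimmed version. A minimal sketch, using made-up names with deliberately added spaces:

```r
library(stringr)

# Hypothetical country names; two of them carry stray spaces.
countries = c('Ghana ', 'Iraq', ' Chad')

# A name is padded if trimming it changes it.
hasPadding = countries != str_trim(countries)

print(countries[hasPadding])
print(nchar(countries))
```

`str_trim` removes whitespace from both ends by default, so this should catch leading as well as trailing spaces.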

Getting better. Sadly, I think we’ve reached the end of what can be done by computer, and now we have to match more or less by hand.

+
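One way the by-hand matching might be organized is a small lookup vector that maps FAD names to Fragile States names, applied before the merge. This is only a sketch: `nameFixes` and `countryForMerge` are hypothetical names, and whether "Sudan, Pre-2011 Election" should really be treated as "Sudan" is an analytical assumption to confirm, not something the data settles.

```r
library(dplyr)

# Hand-built lookup: FAD spelling on the left, Fragile States spelling on the right.
nameFixes = c(
  'Sudan, Pre-2011 Election' = 'Sudan',
  'Congo, Democratic Republic of' = 'Congo, D.R.'
)

# Toy stand-in for fad (hypothetical rows).
toyFad = data.frame(
  BenefitingCountry = c('Sudan, Pre-2011 Election',
                        'Congo, Democratic Republic of', 'Iraq'),
  stringsAsFactors = FALSE
)

# Use the corrected name where one exists; otherwise keep the original.
toyFad = toyFad %>%
  mutate(countryForMerge = ifelse(BenefitingCountry %in% names(nameFixes),
                                  unname(nameFixes[BenefitingCountry]),
                                  BenefitingCountry))

print(toyFad)
```

The new column could then serve as the key in `left_join`, leaving the original `BenefitingCountry` untouched so the fix stays easy to audit.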
-
-

Including Plots

-

You can also embed plots, for example:

-

-

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

diff --git a/Exercises/R/importData.Rmd b/Exercises/R/importData.Rmd index 2766c6e..2d7358a 100644 --- a/Exercises/R/importData.Rmd +++ b/Exercises/R/importData.Rmd @@ -2,30 +2,35 @@ title: "Import and clean Foreign Assistance" author: "Laura Hughes" date: "December 14, 2015" -output: html_document -toc: true +output: + html_document: + toc: true + --- ## Overview **In this module, we'll import in the data and start to check that it looks properly.** -The +This module is a running commentary of what I encountered along the ways-- so a real-life depiction of running into problems and troubleshooting them. ### The data ### Functions we'll cover in this module * library -* read_csv +* [http://blog.rstudio.org/2015/04/09/readr-0-1-0/](readr::read_csv) * head -* kable -* +* [http://yihui.name/knitr/](knitr::kable) +* summary +* [https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html](dplyr::glimpse) ## Import in helper functions to make R more powerful R out of the box (called 'base R') is great. But the really, really powerful thing about R is that they have a community of people helping write other functions to expand R's toolkit and make it better, faster, and more powerful. Before we start playing with data, we'll import some of the most useful functions to import, clean, manipulate, and visualize data. -```{r import fucntions} +*Note: if you haven't installed the packages (groups of functions), you'll get an error if you try to load them with `library`. To install `dplyr`, for instance, you should first run `install.packages('dplyr')`.* + +```{r import functions} # Workhorse libraries for data science: data cleanup, summaries, merging, etc. library(dplyr) # Filter, create new variables, summarise, ... 
Basically, anything you can think to do to a dataset library(tidyr) # Reshape and merge datasets @@ -38,13 +43,13 @@ library(ggplot2) library(haven) # Imports in files from Stata, SAS, and SPSS library(readr) # An advanced form of the base 'read.csv' file with some added functionality. -library(knitr) # Helper function to produce this +library(knitr) # Helper function to produce this RMarkdown document. ``` ## Let's get started and get data into R! ```{r importData} -fileName = '~/GitHub/StataTraining/Data/StataTraining.csv' +fileName = '~/GitHub/StataTraining/Exercises/Stata/StataTraining.csv' spent = read_csv(fileName) # Print out a table of what the data looks like. @@ -53,23 +58,24 @@ spent = read_csv(fileName) # There's also a function called 'head' which allows you to look at a little bit of the data so you don't have to see the full thing. It'll show you the first 6 rows of any dataset. -# 'kable' is a function which formats info into a neat table which +# 'kable' is a function which formats info into a neat table which can be kable(head(spent)) ``` +Everything looks good -- the column names look normal, and the data look more or less right. -```{r fix import} -# Read the help documentation to figure out where the arguments going. +That's not always the case... this time, we got lucky. Within `read_csv`, there's an argument called `col_names` boolean (true/false) argument to make sure that the top row is interpreted as being the header (column names). + +```{r help} +# Read the help documentation to figure out where the arguments go. help("read_csv") -spent2 = read_csv(fileName, - col_names = TRUE, skip = 2) +spent2 = read_csv(fileName, col_names = TRUE) -# Check that this looks better +# Check that this looks the same kable(head(spent2)) ``` -Excellent! That looks a lot better. --- @@ -152,8 +158,8 @@ str_replace_all(c('$3'), '3', '') # Hmm. That does. # !! Moment of clarity-- I remember $ are special in regex expressions. 
Normally, you fix this by adding \ before the special character. - -str_replace_all(c('$3'), '\$', '') +#! Note: throws an error. +# str_replace_all(c('$3'), '\$', '') # Gives error message: Error: '\$' is an unrecognized escape in character string starting "'\$" # Ugh. Time to turn to StackExchange --> R uses double escape sequences. Problem solved! diff --git a/Exercises/R/importData.html b/Exercises/R/importData.html index d5934de..a3816fe 100644 --- a/Exercises/R/importData.html +++ b/Exercises/R/importData.html @@ -12,7 +12,7 @@ -Import and clean data +Import and clean Foreign Assistance @@ -62,14 +62,44 @@ +

Overview

+

In this module, we’ll import the data and start checking that it looks right.

+

This module is a running commentary on what I encountered along the way, so it’s a real-life depiction of running into problems and troubleshooting them.

+
+

The data

+
+

Import in helper functions to make R more powerful

@@ -96,249 +126,273 @@

Import in helper functions to make R more powerful

library(haven) # Imports in files from Stata, SAS, and SPSS library(readr) # An advanced form of the base 'read.csv' file with some added functionality. -library(knitr) # Helper function to produce this +library(knitr) # Helper function to produce this RMarkdown document.

Let’s get started and get data into R!

-
spent = read_csv('~/GitHub/StataTraining/Data/Full_ForeignAssistanceData_spent.csv')
+
fileName = '~/GitHub/StataTraining/Data/StataTraining.csv'
+spent = read_csv(fileName)
 
 # Print out a table of what the data looks like.
 # At the command line, you can type View(spent), which pulls up a table to view the data.
 
+
 # There's also a function called 'head' which allows you to look at a little bit of the data so you don't have to see the full thing.  It'll show you the first 6 rows of any dataset.
-knitr::kable(head(spent))
+ +# 'kable' is a function which formats info into a neat table which can be +kable(head(spent))
- - - - - - - - - - + + + + + + + + + + + + - - - - - - - - - - + + + + + + + + + + + + - - - - - - - - - - + + + + + + + + + + + + - - + + - - - - - - + + + + + + + + - - + + - - - - - - + + + + + + + + - - + + - - - + + + - + + + - - + + - - - - - - - + + + + + + + + +
SpentNANANANANANANANANAFiscalYearQTRFiscalYearTypeAccountAgencyOperatingUnitBenefitingCountryCategorySectorAmountSpentSpent2
NANANANANANANANANANA20091DisbursementsDevelopment AssistanceUSAIDEgyptEgyptDemocracy, Human Rights, and GovernanceGood Governance$84,157.6884157.6884157.68
Fiscal YearQTRFiscal Year TypeAccount NameAgency NameOperating UnitBenefiting CountryCategorySectorAmount20112ObligationsDevelopment AssistanceusaidUgandaUgandaEducation and Social ServicesSocial Services$(964.73)-964.73-964.73
200612012NA Disbursements N/AMCCArmeniaArmeniaMulti-SectorMulti-Sector - Unspecified$53,790.31IAFCosta RicaCosta RicaNAEconomic Opportunity$102,500.00102500.00102500.00
2006420132 Disbursements N/AMCCArmeniaArmeniaMulti-SectorMulti-Sector - Unspecified$60,580.98usaidBoliviaBoliviaPeace and SecurityCounter-Narcotics$(156.11)-156.11-156.11
2006320133 Disbursements N/A MCCEl SalvadorEl SalvadorMulti-SectorBeninBeninmulti-sector Multi-Sector - Unspecified$9,473.20$78,673.5278673.5278673.52
2006420094 DisbursementsN/AMCCGeorgiaGeorgiaProgram ManagementDirect Administrative Costs$262,977.00Development AssistanceUSAIDBoliviaBoliviaEducation and Social ServicesBasic Education$67,839.1067839.1067839.10
-
-

Wait… that doesn’t give me what I want.

-

Instead of having the column names (“Fiscal Year” …) at the top, there’s a gap.

-
-

Why is that?

-

It looks like there are two extra lines at the top.

-
-
-

How do I fix that?

-

I need to specify where the row containing the column names is, and to skip the first rows. Luckily, ‘read_csv’ has an argument to ‘skip’ the number of lines you specify, and a ‘col_names’ boolean (true/false) argument to make sure that the top row is interpreted as being the header (column names).

-
# Read the help documentation to figure out where the arguments going.
+

Everything looks good – the column names look normal, and the data look more or less right.

+

That’s not always the case… this time, we got lucky. Within read_csv, there’s a boolean (true/false) argument called col_names that makes sure the top row is interpreted as the header (column names).
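As a minimal sketch of what col_names changes, the example below reads the same small CSV twice. The CSV text is made up for illustration (it only mimics three columns of the real file), and it is passed as literal text wrapped in `I()`, which assumes a reasonably recent version of readr.

```r
library(readr)

# Hypothetical three-column CSV, supplied as literal text instead of a file path.
csv_text <- I("FiscalYear,QTR,Amount\n2009,1,\"$84,157.68\"\n2011,2,\"$(964.73)\"\n")

# col_names = TRUE (the default): the first row supplies the column names.
with_header <- read_csv(csv_text, col_names = TRUE)
colnames(with_header)

# col_names = FALSE: the first row is treated as data, and readr
# generates placeholder names (X1, X2, ...) instead.
without_header <- read_csv(csv_text, col_names = FALSE)
colnames(without_header)
```

With the header row treated as data, every column is read as text, which is usually the first hint that the header setting is wrong.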

+
# Read the help documentation to figure out where the arguments go.
 help("read_csv")
 
-spent2 = read_csv('~/GitHub/StataTraining/Data/Full_ForeignAssistanceData_spent.csv',
-                 col_names = TRUE, skip = 2)
+spent2 = read_csv(fileName, col_names = TRUE)
 
-# Check that this looks better
+# Check that this looks the same
 kable(head(spent2))
- + - - - - - + + + + + + + - + - - - - - - - + + + + + + + + + - - - - - - - - - - + + + + + + + + + + + + - - + + - - - - - - + + + + + + + + - - + + - - - - - - + + + + + + + + - + - - - - - + + + + + + + - + - - - - - - - + + + + + + + + +
Fiscal YearFiscalYear QTRFiscal Year TypeAccount NameAgency NameOperating UnitBenefiting CountryFiscalYearTypeAccountAgencyOperatingUnitBenefitingCountry Category Sector AmountSpentSpent2
20062009 1 DisbursementsN/AMCCArmeniaArmeniaMulti-SectorMulti-Sector - Unspecified$53,790.31Development AssistanceUSAIDEgyptEgyptDemocracy, Human Rights, and GovernanceGood Governance$84,157.6884157.6884157.68
20064DisbursementsN/AMCCArmeniaArmeniaMulti-SectorMulti-Sector - Unspecified$60,580.9820112ObligationsDevelopment AssistanceusaidUgandaUgandaEducation and Social ServicesSocial Services$(964.73)-964.73-964.73
200632012NA Disbursements N/AMCCEl SalvadorEl SalvadorMulti-SectorMulti-Sector - Unspecified$9,473.20IAFCosta RicaCosta RicaNAEconomic Opportunity$102,500.00102500.00102500.00
2006420132 Disbursements N/AMCCGeorgiaGeorgiaProgram ManagementDirect Administrative Costs$262,977.00usaidBoliviaBoliviaPeace and SecurityCounter-Narcotics$(156.11)-156.11-156.11
20062013 3 Disbursements N/A MCCGeorgiaGeorgiaEconomic DevelopmentInfrastructure$1,554,454.80BeninBeninmulti-sectorMulti-Sector - Unspecified$78,673.5278673.5278673.52
20062009 4 DisbursementsN/AMCCHondurasHondurasEconomic DevelopmentInfrastructure$107,582.30Development AssistanceUSAIDBoliviaBoliviaEducation and Social ServicesBasic Education$67,839.1067839.1067839.10
-

Excellent! That looks a lot better.

+
-
-
-

Now let’s start to take a deeper look at the data.

+
+

Now let’s start to take a deeper look at the data.

Let’s take another look at our data and make sure that everything is imported correctly.

R has many redundant ways of doing things. Sometimes they’re exactly the same (with different details under the hood), and sometimes they’re slightly different and complementary.

In this case, ‘glimpse’ and ‘summary’ provide two quick looks at what a dataset looks like.

# 'glimpse' gives you the name of the variable, its type, and the initial values.
 glimpse(spent2)
-
## Observations: 62,537
-## Variables: 10
-## $ Fiscal Year        (int) 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2...
-## $ QTR                (int) 1, 4, 3, 4, 3, 4, 1, 4, 3, 1, 4, 2, 2, 4, 3...
-## $ Fiscal Year Type   (chr) "Disbursements", "Disbursements", "Disburse...
-## $ Account Name       (chr) "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "...
-## $ Agency Name        (chr) "MCC", "MCC", "MCC", "MCC", "MCC", "MCC", "...
-## $ Operating Unit     (chr) "Armenia", "Armenia", "El Salvador", "Georg...
-## $ Benefiting Country (chr) "Armenia", "Armenia", "El Salvador", "Georg...
-## $ Category           (chr) "Multi-Sector", "Multi-Sector", "Multi-Sect...
-## $ Sector             (chr) "Multi-Sector - Unspecified", "Multi-Sector...
-## $ Amount             (chr) "$53,790.31", "$60,580.98", "$9,473.20", "$...
+
## Observations: 2,417
+## Variables: 12
+## $ FiscalYear        (int) 2009, 2011, 2012, 2013, 2013, 2009, 2013, 20...
+## $ QTR               (int) 1, 2, NA, 2, 3, 4, 2, 3, 3, 3, 4, 4, 2, 2, 4...
+## $ FiscalYearType    (chr) "Disbursements", "Obligations", "Disbursemen...
+## $ Account           (chr) "Development Assistance", "Development Assis...
+## $ Agency            (chr) "USAID", "usaid", "IAF", "usaid", "MCC", "US...
+## $ OperatingUnit     (chr) "Egypt", "Uganda", "Costa Rica", "Bolivia", ...
+## $ BenefitingCountry (chr) "Egypt", "Uganda", "Costa Rica", "Bolivia", ...
+## $ Category          (chr) "Democracy, Human Rights, and Governance", "...
+## $ Sector            (chr) "Good Governance", "Social Services", "Econo...
+## $ Amount            (chr) "$84,157.68", "$(964.73)", "$102,500.00", "$...
+## $ Spent             (dbl) 84157.68, -964.73, 102500.00, -156.11, 78673...
+## $ Spent2            (dbl) 84157.68, -964.73, 102500.00, -156.11, 78673...
# 'summary' similarly gives you the type of data, but also for numerical data, gives you quick stats on the range of the data, the mean, and the distribution.
 summary(spent2)
-
##   Fiscal Year        QTR        Fiscal Year Type   Account Name      
-##  Min.   :2006   Min.   :0.000   Length:62537       Length:62537      
-##  1st Qu.:2010   1st Qu.:1.000   Class :character   Class :character  
-##  Median :2011   Median :2.000   Mode  :character   Mode  :character  
-##  Mean   :2011   Mean   :2.432                                        
-##  3rd Qu.:2012   3rd Qu.:3.000                                        
-##  Max.   :2014   Max.   :4.000                                        
-##                 NA's   :1223                                         
-##  Agency Name        Operating Unit     Benefiting Country
-##  Length:62537       Length:62537       Length:62537      
+
##    FiscalYear         QTR         FiscalYearType       Account         
+##  Min.   : 2006   Min.   : 0.000   Length:2417        Length:2417       
+##  1st Qu.: 2010   1st Qu.: 2.000   Class :character   Class :character  
+##  Median : 2011   Median : 3.000   Mode  :character   Mode  :character  
+##  Mean   : 2138   Mean   : 3.043                                        
+##  3rd Qu.: 2012   3rd Qu.: 4.000                                        
+##  Max.   :20014   Max.   :40.000                                        
+##                  NA's   :46                                            
+##     Agency          OperatingUnit      BenefitingCountry 
+##  Length:2417        Length:2417        Length:2417       
 ##  Class :character   Class :character   Class :character  
 ##  Mode  :character   Mode  :character   Mode  :character  
 ##                                                          
@@ -346,21 +400,29 @@ 

Now let’s start to take a deeper look at the data.

## ## ## Category Sector Amount -## Length:62537 Length:62537 Length:62537 +## Length:2417 Length:2417 Length:2417 ## Class :character Class :character Class :character ## Mode :character Mode :character Mode :character ## ## ## -##
-
-

D’oh! The main column we care about – amount spent – isn’t a number in our dataset. Instead, it’s a string (a series of alphanumeric characters.)

-
-
Why doesn’t R import the amount as number??
+## +## Spent Spent2 +## Min. :-24693158 Min. :-24693158 +## 1st Qu.: 24088 1st Qu.: 24088 +## Median : 236642 Median : 236642 +## Mean : 2615908 Mean : 2615908 +## 3rd Qu.: 1000000 3rd Qu.: 1000000 +## Max. :979741604 Max. :979741604 +## NA's :12 NA's :12
+
+

D’oh! The main column we care about – amount spent – isn’t a number in our dataset. Instead, it’s a string (a series of alphanumeric characters.)

+
+

Why doesn’t R import the amount as a number?

Since amounts are given as things like $5,238.23 in the dataset, they’re imported as strings. The dollar sign isn’t a number, so R assumes the variable is a series of characters.
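A quick base R check makes this concrete. It is only an illustration, using the example amount from the sentence above rather than a value read from the file.

```r
# The example amount from the text, stored the way R imports it.
amount <- "$5,238.23"
amount_class <- class(amount)

# Direct conversion fails: '$' and ',' are not part of a valid number,
# so as.numeric() returns NA (the coercion warning is suppressed here).
direct <- suppressWarnings(as.numeric(amount))

amount_class
is.na(direct)
```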

-
-
Before we go any further, we need to fix that. The easy way first.
+
+

Before we go any further, we need to fix that. The easy way first.

The readr package is pretty nifty. Using it, we can tell the importer function that the Amount column is a number, and it’ll ignore anything that isn’t a digit, a minus sign, or a decimal point.

# Simple way-- reimport the data. If you specify the data format for each of the columns, it should take care of the rest.
 
@@ -369,29 +431,30 @@ 
Before we go any further, we need to fix that. The easy way first.
# We'll use the 'numeric' class, since it's a sloppy parser that ignores everything except numbers, -, and . -spent3 = read_csv('~/GitHub/StataTraining/Data/Full_ForeignAssistanceData_spent.csv', +spent3 = read_csv(fileName, col_names = TRUE, skip = 2, - col_types = 'iicccccccn') - -glimpse(spent3)
-
## Observations: 62,537
-## Variables: 10
-## $ Fiscal Year        (int) 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2...
-## $ QTR                (int) 1, 4, 3, 4, 3, 4, 1, 4, 3, 1, 4, 2, 2, 4, 3...
-## $ Fiscal Year Type   (chr) "Disbursements", "Disbursements", "Disburse...
-## $ Account Name       (chr) "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "...
-## $ Agency Name        (chr) "MCC", "MCC", "MCC", "MCC", "MCC", "MCC", "...
-## $ Operating Unit     (chr) "Armenia", "Armenia", "El Salvador", "Georg...
-## $ Benefiting Country (chr) "Armenia", "Armenia", "El Salvador", "Georg...
-## $ Category           (chr) "Multi-Sector", "Multi-Sector", "Multi-Sect...
-## $ Sector             (chr) "Multi-Sector - Unspecified", "Multi-Sector...
-## $ Amount             (dbl) 53790.31, 60580.98, 9473.20, 262977.00, 155...
+ col_types = 'iicccccccn')
+
## Warning: Unnamed `col_types` should have the same length as `col_names`.
+## Using smaller of the two.
+
## Warning: 2415 parsing failures.
+## row col  expected     actual
+##   1  -- 1 columns 12 columns
+##   2  -- 1 columns 12 columns
+##   3  -- 1 columns 12 columns
+##   4  -- 1 columns 12 columns
+##   5  -- 1 columns 12 columns
+## ... ... ......... ..........
+## .See problems(...) for more details.
+
glimpse(spent3)
+
## Observations: 2,415
+## Variables: 1
+## $ 2011 (int) 2012, 2013, 2013, 2009, 2013, 2010, 2010, 2011, 2009, 201...
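The warnings above appear to stem from two mismatches with the new file: `skip = 2` now skips the header row (the file no longer has extra lines at the top), and the ten-character `col_types` string no longer matches the twelve columns. A hedged sketch of a call that avoids both is shown below. The inline CSV is a made-up stand-in for the first rows of the file, and Amount is deliberately left as a character column because it still needs cleaning.

```r
library(readr)

# Hypothetical stand-in for the first rows of StataTraining.csv (12 columns).
csv_text <- I(paste(
  "FiscalYear,QTR,FiscalYearType,Account,Agency,OperatingUnit,BenefitingCountry,Category,Sector,Amount,Spent,Spent2",
  "2009,1,Disbursements,Development Assistance,USAID,Egypt,Egypt,\"Democracy, Human Rights, and Governance\",Good Governance,\"$84,157.68\",84157.68,84157.68",
  "2011,2,Obligations,Development Assistance,usaid,Uganda,Uganda,Education and Social Services,Social Services,$(964.73),-964.73,-964.73",
  sep = "\n"
))

# One type code per column: 2 integers, 8 character columns, 2 doubles.
# No 'skip' argument, because the header is already the first row.
spent3 <- read_csv(csv_text, col_names = TRUE, col_types = "iiccccccccdd")

ncol(spent3)
spent3$Spent
```

For the real file, the same `col_types` string would be passed together with `fileName`, assuming the column order shown in the `glimpse` output.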
-
-
That’s a nice in-built function. But you can accomplish the same thing (albeit with more work) using string manipulation.
-

We’ll use some functions within stringr and dplyr to: 1. Strip out the $ and save the value as a new variable. 2. Convert the string to a number and save as a new variable.

-
-
Part 1: Get rid of the $!
+
+

That’s a nice in-built function. But you can accomplish the same thing (albeit with more work) using string manipulation.

+

We’ll use some functions within stringr and dplyr to: 1. Strip out the $ and , and save the value as a new variable. 2. Convert the string to a number and save as a new variable.

+
+
Part 1: Get rid of the $! (and the ,’s)
# More complicated way -- import the data, remove the dollar sign, and convert the strings to numbers.
 
 # We'll work from spent2.
@@ -401,47 +464,82 @@ 
Part 1: Get rid of the $!
# Let's look. glimpse(spent4)
-
## Observations: 62,537
-## Variables: 11
-## $ Fiscal Year        (int) 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2...
-## $ QTR                (int) 1, 4, 3, 4, 3, 4, 1, 4, 3, 1, 4, 2, 2, 4, 3...
-## $ Fiscal Year Type   (chr) "Disbursements", "Disbursements", "Disburse...
-## $ Account Name       (chr) "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "...
-## $ Agency Name        (chr) "MCC", "MCC", "MCC", "MCC", "MCC", "MCC", "...
-## $ Operating Unit     (chr) "Armenia", "Armenia", "El Salvador", "Georg...
-## $ Benefiting Country (chr) "Armenia", "Armenia", "El Salvador", "Georg...
-## $ Category           (chr) "Multi-Sector", "Multi-Sector", "Multi-Sect...
-## $ Sector             (chr) "Multi-Sector - Unspecified", "Multi-Sector...
-## $ Amount             (chr) "$53,790.31", "$60,580.98", "$9,473.20", "$...
-## $ Amount2            (chr) "$53,790.31", "$60,580.98", "$9,473.20", "$...
+
## Observations: 2,417
+## Variables: 13
+## $ FiscalYear        (int) 2009, 2011, 2012, 2013, 2013, 2009, 2013, 20...
+## $ QTR               (int) 1, 2, NA, 2, 3, 4, 2, 3, 3, 3, 4, 4, 2, 2, 4...
+## $ FiscalYearType    (chr) "Disbursements", "Obligations", "Disbursemen...
+## $ Account           (chr) "Development Assistance", "Development Assis...
+## $ Agency            (chr) "USAID", "usaid", "IAF", "usaid", "MCC", "US...
+## $ OperatingUnit     (chr) "Egypt", "Uganda", "Costa Rica", "Bolivia", ...
+## $ BenefitingCountry (chr) "Egypt", "Uganda", "Costa Rica", "Bolivia", ...
+## $ Category          (chr) "Democracy, Human Rights, and Governance", "...
+## $ Sector            (chr) "Good Governance", "Social Services", "Econo...
+## $ Amount            (chr) "$84,157.68", "$(964.73)", "$102,500.00", "$...
+## $ Spent             (dbl) 84157.68, -964.73, 102500.00, -156.11, 78673...
+## $ Spent2            (dbl) 84157.68, -964.73, 102500.00, -156.11, 78673...
+## $ Amount2           (chr) "$84,157.68", "$(964.73)", "$102,500.00", "$...
-
-
Huh. That’s not what we expected. The $ are still there! What’s going on?
+
+
Huh. That’s not what we expected. The $ are still there! What’s going on?

This is actually a pretty typical problem to encounter. We’ll have to track down this problem.

In this case, it turns out that on the backend, stringr uses what are called regular expressions (https://en.wikipedia.org/wiki/Regular_expression) to figure out what to replace. In regex (as it’s known), certain characters like $ are special characters that denote special behavior. The dollar sign anchors a match to the end of the string; used together with ^, which anchors the start, it gives exact matching. If you search for ‘blue’ in a list containing blue and blueberry, you’ll get both words. If instead you search for ‘^blue$’, you’ll only get blue (the exact match).

To get around this problem, you put what’s called an escape character in front of the character you want to match. This character (\ in most programming languages; \\ in R strings) tells the regex engine to interpret the next character literally, here as a plain dollar sign, rather than as a special character.
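The anchoring and escaping behavior described above can be checked on a toy vector. This is a small sketch using stringr functions (`str_detect`, `str_replace_all`, `fixed`); the variable names are only for illustration.

```r
library(stringr)

words <- c("blue", "blueberry")

# Unanchored pattern: matches any string that contains "blue".
contains_blue <- str_detect(words, "blue")      # TRUE TRUE

# Anchored pattern: ^ pins the start and $ pins the end, so only "blue" matches.
exactly_blue <- str_detect(words, "^blue$")     # TRUE FALSE

# To match a literal dollar sign, escape it. The backslash is doubled because
# R's string parser consumes one backslash before the regex engine sees the pattern.
escaped <- str_replace_all("$3", "\\$", "")     # "3"

# Alternative: fixed() tells stringr to treat the pattern as plain text, not regex.
literal <- str_replace_all("$3", fixed("$"), "")  # "3"
```

Using `fixed()` sidesteps escaping altogether, which can be easier to read when the pattern contains no regex features.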

-
# How'd I figure that out?  Let's start with a simple test 
-
+
# How'd I figure that out?  Let's start with a simple test:
+str_replace_all(c('$3'), '$', '')
+
## [1] "$3"
+
# That doesn't work.  Let's try replacing the 3.
+str_replace_all(c('$3'), '3', '')
+
## [1] "$"
+
# Hmm.  That does.  
+
+# !! Moment of clarity-- I remember $ are special in regex expressions.  Normally, you fix this by adding \ before the special character.
+#! Note: throws an error.
+# str_replace_all(c('$3'), '\$', '')
+
+# Gives error message: Error: '\$' is an unrecognized escape in character string starting "'\$"
+# Ugh.  Time to turn to StackExchange --> R uses double escape sequences.  Problem solved!
+str_replace_all(c('$3'), '\\$', '')
+
## [1] "3"
+
# Note: the magrittr pipe operator, that funky %>%, is amazing.  It passes the result on its left to the function on its right, so you can chain dplyr operations together instead of nesting them or saving intermediate results.
 spent5 = spent2 %>% 
-  mutate(Amount2 = str_replace(Amount, '\\$', ''))
+  mutate(Amount2 = str_replace(Amount, '\\$', ''),
+         Amount3 = str_replace_all(Amount2, ',', '')) # '_all' removes every comma, not just the first
 
 # Let's look.
 glimpse(spent5)
-
## Observations: 62,537
-## Variables: 11
-## $ Fiscal Year        (int) 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2...
-## $ QTR                (int) 1, 4, 3, 4, 3, 4, 1, 4, 3, 1, 4, 2, 2, 4, 3...
-## $ Fiscal Year Type   (chr) "Disbursements", "Disbursements", "Disburse...
-## $ Account Name       (chr) "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", "...
-## $ Agency Name        (chr) "MCC", "MCC", "MCC", "MCC", "MCC", "MCC", "...
-## $ Operating Unit     (chr) "Armenia", "Armenia", "El Salvador", "Georg...
-## $ Benefiting Country (chr) "Armenia", "Armenia", "El Salvador", "Georg...
-## $ Category           (chr) "Multi-Sector", "Multi-Sector", "Multi-Sect...
-## $ Sector             (chr) "Multi-Sector - Unspecified", "Multi-Sector...
-## $ Amount             (chr) "$53,790.31", "$60,580.98", "$9,473.20", "$...
-## $ Amount2            (chr) "53,790.31", "60,580.98", "9,473.20", "262,...
+
## Observations: 2,417
+## Variables: 14
+## $ FiscalYear        (int) 2009, 2011, 2012, 2013, 2013, 2009, 2013, 20...
+## $ QTR               (int) 1, 2, NA, 2, 3, 4, 2, 3, 3, 3, 4, 4, 2, 2, 4...
+## $ FiscalYearType    (chr) "Disbursements", "Obligations", "Disbursemen...
+## $ Account           (chr) "Development Assistance", "Development Assis...
+## $ Agency            (chr) "USAID", "usaid", "IAF", "usaid", "MCC", "US...
+## $ OperatingUnit     (chr) "Egypt", "Uganda", "Costa Rica", "Bolivia", ...
+## $ BenefitingCountry (chr) "Egypt", "Uganda", "Costa Rica", "Bolivia", ...
+## $ Category          (chr) "Democracy, Human Rights, and Governance", "...
+## $ Sector            (chr) "Good Governance", "Social Services", "Econo...
+## $ Amount            (chr) "$84,157.68", "$(964.73)", "$102,500.00", "$...
+## $ Spent             (dbl) 84157.68, -964.73, 102500.00, -156.11, 78673...
+## $ Spent2            (dbl) 84157.68, -964.73, 102500.00, -156.11, 78673...
+## $ Amount2           (chr) "84,157.68", "(964.73)", "102,500.00", "(156...
+## $ Amount3           (chr) "84157.68", "(964.73)", "102500.00", "(156.1...
+

Ahh, bliss. But those “” around the number and the ‘chr’ designation still indicate that the data is a character, not a number. Time to fix that.

+
+

Part 2: convert the character to a number.

+
spent6 = spent5 %>% 
+  mutate(Amount3 = as.numeric(Amount3))
+
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
+## coercion
+
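# Final approach: turn the accounting-style '(' into a minus sign, strip every
+# character that isn't a letter, digit, underscore, '.', or '-', then convert.
+spent = spent2 %>% 
+  mutate(amount = str_replace(Amount, '\\(', '-'), 
+         amount2 = str_replace_all(amount, '[^\\w.\\-]', ''),
+         amt = as.numeric(amount2)) # tricky: '\w' matches any 'word' character (a letter, digit, or underscore)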
+
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
+## coercion
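The whole cleanup can also be wrapped in one small function. The sketch below is not part of the original module: `parse_amount` is a hypothetical helper name, and it assumes that parentheses always mark negative amounts, as in `$(964.73)`.

```r
library(stringr)

# Hypothetical helper: convert strings like "$84,157.68" or "$(964.73)" to numbers.
parse_amount <- function(x) {
  x <- str_replace(x, fixed("("), "-")        # "$(964.73)" -> "$-964.73)"
  x <- str_replace_all(x, "[^0-9.\\-]", "")   # keep only digits, '.', and '-'
  as.numeric(x)                               # missing values stay NA
}

amounts <- c("$84,157.68", "$(964.73)", "$102,500.00", NA)
parsed <- parse_amount(amounts)
parsed
```

Inside `mutate`, this would be used as `mutate(amt = parse_amount(Amount))`, which keeps the regex details in one place.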