LibGuides: Intro to R Programming: Data Wrangling

What is data wrangling?

After importing your data from an external file into R, the next task is to work with the data set. Knowing how to manipulate data (often known as data wrangling or munging) in R is an important skill. R has useful functions to manipulate, extract, or summarize large data sets, which you can further use for statistical analyses or making graphs.

Why would we want to extract specific rows or columns? We may only want to graph part of the data, not everything. This is important especially for large data sets with many columns. We may also want to clean up a large data set, and remove unnecessary rows/columns of information.

R offers a wide range of packages dedicated to data manipulation, with more options than many other data management programs (e.g., Excel or OpenRefine).

Inspecting Data Frames

To inspect data frames by size, use the following functions:

R Function	Description
dim(NameOfDataset)	Returns # of rows as the first element & # of columns as the second element (gives dimensions of the object)
nrow(NameOfDataset)	Returns # of rows
ncol(NameOfDataset)	Returns # of columns
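
As a quick illustration of the size functions in the table above, here is a minimal sketch using a small data frame. The object name toy and its values are made up for demonstration and do not come from a real data set.

```r
# Toy data frame with made-up values (hypothetical, for illustration only)
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  year    = c(1952, 1957, 1952, 1957),
  pop     = c(100, 200, 300, 400)
)

dim(toy)   # rows first, then columns: 4 3
nrow(toy)  # 4
ncol(toy)  # 3
```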

To inspect data frames by content, use the following functions:

R Function	Description
head(NameOfDataset)	Shows first 6 rows
head(NameOfDataset, n=number)	Shows the specified # of top rows
tail(NameOfDataset)	Shows last 6 rows
tail(NameOfDataset, n=number)	Shows the specified # of bottom rows
names(NameOfDataset) OR colnames(NameOfDataset)	Returns column names
rownames(NameOfDataset)	Returns row names
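
The content functions in the table above can be tried on any data frame. The sketch below assumes a hypothetical 10-row data frame called toy, so that the default of 6 rows is visible.

```r
# Hypothetical 10-row data frame (made-up values)
toy <- data.frame(id = 1:10, value = seq(10, 100, by = 10))

head(toy)         # first 6 rows (ids 1 to 6)
head(toy, n = 3)  # first 3 rows
tail(toy)         # last 6 rows (ids 5 to 10)
tail(toy, n = 2)  # last 2 rows (ids 9 and 10)

names(toy)        # column names: "id" "value"
colnames(toy)     # same result as names() for a data frame
rownames(toy)     # row names, here the default "1" through "10"
```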

To view summary information of data frames, use the following functions:

R Function	Description
str(NameOfDataset)	Structure of the object; gives info about the class, length, & content of each column
summary(NameOfDataset)	Summary statistics for each column
glimpse(NameOfDataset)	From the dplyr package; returns the number of rows & columns, names & class of each column, and previews of values
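
A minimal sketch of the summary functions, using a made-up data frame named toy (an assumption for illustration). Only the base R functions are run here; glimpse() is left as a comment because it needs the dplyr package to be installed and loaded.

```r
# Hypothetical data frame with made-up values
toy <- data.frame(
  country = c("A", "B", "C"),
  lifeExp = c(50, 60, 70),
  stringsAsFactors = FALSE
)

str(toy)      # class of the object, number of rows/columns, type of each column
summary(toy)  # per-column summaries (min, median, mean, max for numeric columns)

# With dplyr loaded, the equivalent overview would be:
# library(dplyr)
# glimpse(toy)
```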

Extracting Content of the Data Frame

Positional Indexes

Extract parts of a data frame using square brackets [ ], known as positional indexes. Provide a value for the position of the rows and columns to extract inside the [ ].

For example, to extract the 9th row, and 2nd column: NameOfDataset[9,2]

To extract an entire row with all column information, leave the column value blank: NameOfDataset[9,]

To extract an entire column with all row information, leave the row value blank: NameOfDataset[,2]

To extract a sequence of consecutive rows or columns, use the colon operator (:). For example, rows 5 through 9 of columns 1 and 2: NameOfDataset[5:9,1:2]
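
The positional indexes described above can be sketched on a small example. The data frame toy and its columns are hypothetical and exist only to make each result easy to check.

```r
# Hypothetical 10-row data frame (made-up values)
toy <- data.frame(
  id     = 1:10,
  letter = letters[1:10],
  score  = (1:10) * 2,
  stringsAsFactors = FALSE
)

toy[9, 2]      # single value: row 9, column 2 ("i")
toy[9, ]       # entire 9th row, returned as a one-row data frame
toy[, 2]       # entire 2nd column, returned as a vector
toy[5:9, 1:2]  # rows 5 through 9 of columns 1 and 2
```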

Extracting Columns

There are several ways to extract a column (its value for every row), by name or by position:

Specify the name in double quotation marks " ": countries <- NameofData[,"country"]
Use $ to specify the name of the column in the data set: NameofData$country
Use double brackets to specify the name of the column: countries <- NameofData[["country"]]
Extract several columns by their column numbers using c(): NameofData[c(1,3,5)]
Use subset() and specify column names with c(" "): subset(NameofData, select=c("country","year"))
In the dplyr package, use select() with the data set and column names: select(NameofData, country, year)
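
To compare the base R options in the list above, here is a short sketch on a made-up data frame (the name toy and its values are assumptions). The first three approaches return the same vector, while c() and subset() return smaller data frames.

```r
# Hypothetical data frame with made-up values
toy <- data.frame(
  country = c("A", "B", "C"),
  year    = c(1952, 1957, 1962),
  pop     = c(100, 200, 300),
  stringsAsFactors = FALSE
)

by_name   <- toy[, "country"]   # column name in quotation marks
by_dollar <- toy$country        # $ operator
by_double <- toy[["country"]]   # double brackets

identical(by_name, by_dollar)   # TRUE: all three give the same vector
identical(by_name, by_double)   # TRUE

cols_1_3 <- toy[c(1, 3)]                                 # columns 1 and 3
two_cols <- subset(toy, select = c("country", "year"))   # columns by name
```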

Extracting Rows

To extract the rows that meet a condition, use the filter() function from the dplyr package.

First specify the name of the column, and then the value to match with ==
For example: filter(NameofData, country == "United States")

Add criteria to filtering using operators > (greater than), < (less than), >= (greater than or equal), <= (less than or equal), & (both sets of conditions), | (vertical bar representing 'or' to filter rows that meet either condition).

For example: filter(NameofData, continent == "Asia", pop>50000, lifeExp>5)
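
The filtering rules above can be sketched as follows, assuming the dplyr package is installed. The data frame toy is hypothetical and its population and life expectancy figures are made up, not real statistics. Conditions separated by commas inside filter() are combined the same way as with &.

```r
library(dplyr)

# Hypothetical data frame; the numbers are made up for illustration
toy <- data.frame(
  country   = c("United States", "Japan", "Nepal", "France"),
  continent = c("Americas", "Asia", "Asia", "Europe"),
  pop       = c(90000, 80000, 40000, 30000),
  lifeExp   = c(70, 75, 60, 72),
  stringsAsFactors = FALSE
)

# Rows where a column equals a value
us_rows <- filter(toy, country == "United States")

# Several conditions separated by commas: all must be true
asia_large <- filter(toy, continent == "Asia", pop > 50000, lifeExp > 5)

# The same idea written with & (both conditions)
asia_large_and <- filter(toy, continent == "Asia" & pop > 50000)

# | keeps rows that meet either condition
asia_or_large <- filter(toy, continent == "Asia" | pop > 50000)
```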

To filter rows containing certain values/text in a column, use the filter() function with grepl()

For instance, to filter all countries with 'United' in their names: filter(data, grepl("United", country))
To filter rows with either ‘52’ or ‘57’ in the year column: filter(data, grepl("52|57", year))
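
To see why grepl() works for filtering, it helps to look at what it returns: one TRUE or FALSE per row. The sketch below uses a made-up data frame named toy and base R square-bracket subsetting as a stand-in for filter(), so it runs without dplyr. This is an illustrative equivalent, not a replacement for the filter() calls above.

```r
# Hypothetical data frame with made-up values
toy <- data.frame(
  country = c("United States", "United Kingdom", "Japan"),
  year    = c(1952, 1957, 1962),
  stringsAsFactors = FALSE
)

# grepl() returns a logical vector: TRUE where the pattern is found
has_united <- grepl("United", toy$country)   # TRUE TRUE FALSE

# Keep only the rows where the pattern matched
united_rows <- toy[has_united, ]

# "52|57" is a regular expression meaning "52 or 57"
rows_52_57 <- toy[grepl("52|57", toy$year), ]
```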

Exporting a New Data Table

After extracting information from a raw data set, you may want to create a new data table.

To create a comma-separated values (CSV) file from a data frame, use write_csv() from the readr package, specifying the name of the data set object and the file name. Include the full directory path if you want the file written somewhere other than the working directory.

For example: write_csv(newdataset, file="C://User/new_gapminder_data.csv")

To create a tab-separated text file, use write.table(), specifying the file name, the delimiter/separator, and whether you want row names.

For example: write.table(newdataset, file = "new_gapminder_data.txt", sep = "\t", row.names = FALSE), where sep = "\t" sets a tab as the separator (other options include ";", "|", ":", and ","). Setting row.names = TRUE adds a column that labels each row numerically (1, 2, 3, etc.), so you will typically use row.names = FALSE.
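
A minimal round-trip sketch of the write.table() call described above. The data frame toy, the file name toy_data.txt, and the use of a temporary directory are all assumptions made so the example can run anywhere without leaving files behind.

```r
# Hypothetical data frame with made-up values
toy <- data.frame(
  country = c("A", "B"),
  pop     = c(100, 200),
  stringsAsFactors = FALSE
)

# Write to a temporary directory instead of the working directory
out_file <- file.path(tempdir(), "toy_data.txt")
write.table(toy, file = out_file, sep = "\t", row.names = FALSE)

# Inspect the raw lines: one header line plus one line per row
file_lines <- readLines(out_file)

# Read the file back in to confirm the contents survived the round trip
check <- read.table(out_file, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)
```

Reading the file back with read.table() is an easy way to confirm that the separator and row.names settings produced the layout you expected.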