After importing your data from an external file into R, the next task is to work with the data set. Knowing how to manipulate data (often known as data wrangling or munging) in R is an important skill. R has useful functions to manipulate, extract, or summarize large data sets, which you can further use for statistical analyses or making graphs.
Why would we want to extract specific rows or columns? We may only want to graph part of the data, not everything. This is especially important for large data sets with many columns. We may also want to clean up a large data set by removing unnecessary rows or columns of information.
R has a wide range of packages dedicated to data manipulation, and offers more options than other data management programs (e.g. Excel or OpenRefine).
To inspect data frames by size, use the following functions:
R Function | Description |
---|---|
dim(NameOfDataset) | Returns # of rows as the first element & # of columns as the second element (gives dimensions of the object) |
nrow(NameOfDataset) | Returns # of rows |
ncol(NameOfDataset) | Returns # of columns |
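As a minimal sketch of the size functions above, the following uses iris, a data set that ships with R, as a stand-in for NameOfDataset. The variable names are only illustrative.

```r
# iris is a built-in data set; it stands in for NameOfDataset here
dims <- dim(iris)       # integer vector: rows first, columns second
n_rows <- nrow(iris)    # number of rows only
n_cols <- ncol(iris)    # number of columns only

print(dims)
print(n_rows)
print(n_cols)
```

Note that dim() returns both values at once, so dims[1] matches nrow() and dims[2] matches ncol().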
To inspect data frames by content, use the following functions:
R Function | Description |
---|---|
head(NameOfDataset) | Shows first 6 rows |
head(NameOfDataset, n=number) | Specifies # of top rows to show |
tail(NameOfDataset) | Shows last 6 rows |
tail(NameOfDataset, n=number) | Specifies # of bottom rows to show |
names(NameOfDataset) OR colnames(NameOfDataset) | Returns column names |
rownames(NameOfDataset) | Returns row names |
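The content functions in this table could be tried out as follows. This is a short sketch that again assumes the built-in iris data set in place of NameOfDataset.

```r
# Preview the top and bottom of the data frame
first_six   <- head(iris)          # default: first 6 rows
first_three <- head(iris, n = 3)   # first 3 rows
last_six    <- tail(iris)          # default: last 6 rows
last_two    <- tail(iris, n = 2)   # last 2 rows

# Column and row labels
column_names <- names(iris)        # same result as colnames(iris)
row_labels   <- rownames(iris)     # row names, returned as character strings

print(first_three)
print(column_names)
```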
To view summary information of data frames, use the following functions:
R Function | Description |
---|---|
str(NameOfDataset) | Structure of the object; gives info about the class, length, & content of each column |
summary(NameOfDataset) | Summary statistics for each column |
glimpse(NameOfDataset) | Returns the number of rows & columns, names & class of each column, and previews of values (requires the dplyr package to be loaded) |
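A brief sketch of the summary functions, assuming the built-in iris data set as NameOfDataset. Since glimpse() is not part of base R, the example only calls it when the dplyr package happens to be installed.

```r
# Structure: class, length, and a preview of each column
str(iris)

# Summary statistics for each column
column_summary <- summary(iris)
print(column_summary)

# glimpse() comes from dplyr, so only call it if that package is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(iris)
}
```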
Positional Indexes
Extract parts of a data frame with square brackets [ ], using positional indexes. Inside the [ ], give the position of the rows first and the position of the columns second, separated by a comma.
For example, to extract the value in the 9th row and 2nd column: NameOfDataset[9,2]
To extract an entire row with all column information, leave the column value blank: NameOfDataset[9,]
To extract an entire column with all row information, leave the row value blank: NameOfDataset[,2]
To extract a sequence of consecutive rows or columns, use the colon operator (:). For example, to extract rows 5 to 9 of columns 1 and 2: NameOfDataset[5:9,1:2]
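The four indexing patterns above can be sketched as follows, assuming the built-in iris data set (a base R data.frame) in place of NameOfDataset.

```r
# [row, column]: rows come first, columns second
single_value  <- iris[9, 2]       # value in the 9th row, 2nd column
ninth_row     <- iris[9, ]        # entire 9th row, all columns
second_column <- iris[, 2]        # entire 2nd column, all rows
block         <- iris[5:9, 1:2]   # rows 5 to 9 of columns 1 and 2

print(single_value)
print(ninth_row)
print(block)
```

For a base data.frame, extracting a single column this way returns a plain vector rather than a one-column data frame.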
Extracting Columns
There are several options to extract the information for each row of a column by writing the name of the column, such as the $ operator or square brackets with the column name in quotes.
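A few base R ways of extracting a column by name are sketched below. The built-in iris data set and its Species column are assumptions chosen for illustration, and these are not necessarily the options the original text listed.

```r
# Three ways to get the Species column as a vector
species_a <- iris$Species         # $ operator
species_b <- iris[["Species"]]    # double square brackets
species_c <- iris[, "Species"]    # name in the column position

# Single brackets without a comma keep the result as a data frame
species_df <- iris["Species"]

# Several columns at once, selected by name
two_cols <- iris[, c("Sepal.Length", "Species")]

print(head(species_a))
print(head(two_cols))
```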
Extracting Rows
To keep only the rows that meet a condition on one or more columns, use the filter() function from the dplyr package.
Add criteria to the filter using the operators > (greater than), < (less than), >= (greater than or equal to), <= (less than or equal to), & ('and', to keep rows that meet both conditions), and | (vertical bar representing 'or', to keep rows that meet either condition).
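The operators above might be combined with filter() as in the following sketch. It assumes the dplyr package is installed and uses the built-in iris data set; the cut-off values are arbitrary and chosen only for illustration.

```r
library(dplyr)  # assumption: dplyr is installed; it provides filter()

# Rows where sepal length is greater than 7
long_sepals <- filter(iris, Sepal.Length > 7)

# Rows that meet both conditions (&)
both <- filter(iris, Sepal.Length >= 6 & Petal.Width < 1.5)

# Rows that meet either condition (|)
either <- filter(iris, Sepal.Length > 7.5 | Petal.Width <= 0.1)

print(nrow(long_sepals))
print(nrow(both))
print(nrow(either))
```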
To filter rows containing certain values or text in a column, use the filter() function together with grepl(), which returns TRUE for each row where the column matches a text pattern.
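A minimal sketch of text-based filtering, assuming dplyr is installed and using the Species column of the built-in iris data set. The patterns "vir" and "^v" are illustrative choices.

```r
library(dplyr)  # assumption: dplyr is installed; grepl() itself is base R

# Keep rows whose Species contains the text "vir"
contains_vir <- filter(iris, grepl("vir", Species))

# Keep rows whose Species starts with "v" (^ anchors the pattern to the start)
starts_with_v <- filter(iris, grepl("^v", Species))

print(unique(contains_vir$Species))
print(unique(starts_with_v$Species))
```

grepl() interprets the pattern as a regular expression by default; passing fixed = TRUE treats it as literal text instead.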
After extracting information from a raw data set, you may want to create a new data table.
To create a comma-separated values (CSV) file from a data frame, use write_csv() from the readr package, specifying the name of the data set object and the file name. Include the directory path in the file name if you want to write the file somewhere other than the working directory.
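Writing a CSV file could look like the sketch below. It assumes the readr package is installed; the file name iris_subset.csv is hypothetical, and the file is written to a temporary directory so the example does not touch the working directory.

```r
library(readr)  # assumption: readr is installed; it provides write_csv()

# A small extract to save: the first 10 rows of the built-in iris data set
subset_rows <- iris[1:10, ]

# Hypothetical output path; replace with your own directory and file name
csv_path <- file.path(tempdir(), "iris_subset.csv")

# First argument is the data frame, second is the file to write
write_csv(subset_rows, csv_path)

print(file.exists(csv_path))
```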
To create a tab-separated text file, use write.table(), specifying the file name, the delimiter/separator, and whether you want row names.
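A short sketch of writing a tab-separated file with base R's write.table(). The file name iris_subset.txt is hypothetical, and the choice to drop row names and quotes is an assumption made for this example.

```r
# A small extract to save: the first 10 rows of the built-in iris data set
subset_rows <- iris[1:10, ]

# Hypothetical output path in a temporary directory
txt_path <- file.path(tempdir(), "iris_subset.txt")

write.table(
  subset_rows,
  file = txt_path,
  sep = "\t",          # tab as the delimiter/separator
  row.names = FALSE,   # do not write the row names as a column
  quote = FALSE        # do not wrap text values in quotation marks
)

print(file.exists(txt_path))
```

Setting row.names = TRUE instead would add the row names as an extra first column in the output file.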