Chapter 2 Basic Operations in R

Before you start developing your code, you need to understand how to work with R and RStudio. This includes work patterns, language components, basic commands and RStudio shortcuts. Understanding the software and how to take advantage of the platform is essential for the development of data-based research scripts. This is the main chapter for those who are not familiar with R or other programming languages.

In this chapter, we will go through the initial steps from the point of view of someone who has never worked with R and possibly never had contact with another programming language. Those already familiar with the program will not find novel information here, so I suggest skipping to the next chapter. It is recommended, however, that you at least check the topics discussed here so that you can confirm your knowledge about the features of the program.

2.1 Working With R

The greatest difficulty a new user experiences when starting to develop routines in R is the format of work. Our interaction with computers has been simplified over the years and we are currently comfortable with the point-and-click format. That is, if you want to perform some operation on the computer, just point the mouse to the specific location on the screen and click the button that performs the operation. Visual cues and a series of steps in this direction allow the execution of complex tasks. But be aware that this form of interaction is just one layer above what actually happens on the computer. Behind all these clicks, there is a command being executed. Any common task, such as opening a PDF file, opening a spreadsheet document or directing a browser to a web page, has an underlying call to a command. This command was created by the program developer to run within your operating system.

While this visual and motor interaction format has its benefits in facilitating and popularizing the use of computers, it is not flexible or effective when working with computational procedures. By knowing the commands available to the user, it is possible to create a file containing several instructions in sequence and, in the future, simply request that the computer execute this file using the recorded procedures. There is no need to repeat a point-and-click operation by hand. You need to spend some time creating the program but, in the future, it will always execute the recorded procedure in the same way. In the medium and long term, using a script (a sequence of commands) instead of a point-and-click interface brings a significant gain in productivity. Going further, the risk of human error in executing the procedure is almost nil because the commands and their sequence are recorded in the text file and will always be executed in the same way. This is one of the main reasons why programming languages are popular in science. All steps of data-based research can be replicated.

In the use of R, the ideal format of work is to merge the use of the mouse with commands. R and RStudio have some functionality with the mouse, but their capacity is optimized when we perform operations using code. When a group of commands is organized in a smart way, we have an R script that should preferably produce something important to us at the end of its execution. In Finance, this can be the updated value of an investment, the calculation of the risk of a portfolio, the historical performance of an investment strategy, the results of academic research, among many other possibilities.

Like other software, R allows us to import data and export files. We can use code to import a dataset stored in a local file (or on the web), analyze this data and save the results to later import them into a technical report. In fact, we can use RStudio to write a dynamic report, where code and content are integrated, using knitr and Sweave (Leisch 2002). For example, the book you're reading was written using knitr and the bookdown package (Xie 2016). The book is compiled by executing the R code, and the outputs are recorded within the text. All figures and data tasks in the book can be updated with the execution of a simple command. Needless to say, by using the capabilities of R and RStudio, you will work smarter and faster.
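As a minimal sketch of this import, analyze and export cycle, the code below writes a small table to a temporary csv file, reads it back and computes an average. The object names, the temporary file and the price values are illustrative assumptions, not data from the book.

```r
# hypothetical data: three daily closing prices (illustrative values)
my.df <- data.frame(ref.date = c('2017-11-22', '2017-11-23', '2017-11-24'),
                    price = c(10.5, 10.7, 10.6))

# export the table to a temporary csv file
my.file <- tempfile(fileext = '.csv')
write.csv(my.df, file = my.file, row.names = FALSE)

# import the file back into R and run a simple analysis
df.imported <- read.csv(my.file)
avg.price <- mean(df.imported$price)

# show the result
cat('Average price:', avg.price)
```

In a real script, the temporary file would be replaced by the path of your own data file.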

2.2 Objects in R

In R, everything is an object, and each type of object has its properties. For example, the daily market closing prices of a stock can be represented as a numerical vector, where each element is a price recorded at the end of a trading day. Dates and times related to these prices can be represented as text (string) or one of the datetime classes. Finally, we can represent the price data and the dates together by storing them in a single object of type dataframe, which is nothing more than a table with rows and columns. These objects are part of the R ecosystem, and it is through their manipulation that we take full advantage of the software.
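The three kinds of objects mentioned above can be sketched as follows. The prices and dates are made-up values used only for illustration, and the object names are hypothetical.

```r
# numeric vector with closing prices (illustrative values)
prices <- c(10.5, 10.7, 10.6)

# dates stored with the Date class
ref.dates <- as.Date(c('2017-11-22', '2017-11-23', '2017-11-24'))

# a dataframe holding both, one row per trading day
df.prices <- data.frame(ref.date = ref.dates, price = prices)

# check the class of each object
print(class(prices))
## [1] "numeric"
print(class(ref.dates))
## [1] "Date"
print(class(df.prices))
## [1] "data.frame"
```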

While we represent data as objects in R, a special type is a function, which stores a pre-established procedure that is available to the user. R has an extremely large number of functions, which enable the user to perform a wide range of operations. For example, the basic commands of R, available in the package base, add up to a total of 1217 functions. Each function has its own name and a programmer can write their own functions. For example, the mean function is a procedure that calculates the average value of the elements of a vector. If we wanted to calculate the average value of the sequence 1, 2, 3, 4, 5, simply insert the following command in the prompt (bottom left of RStudio) and press enter:

mean(1:5, na.rm = TRUE)
## [1] 3

The : symbol used above creates a sequence starting at 1 and ending at 5 (more details about this operator in a later section). Note that the mean function is used with opening and closing parentheses. These parentheses enclose the entries (inputs), that is, the information sent to the function to produce something. Note that each entry is separated by a comma, as in MyFct(input1, input2, input3, ...). We also set option na.rm = TRUE. This is a specific directive for the mean function to ignore elements of type NA (not available), if they exist. This specific type of object will also be discussed in a future chapter.
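To see what na.rm changes, consider a small sketch with a vector that contains one missing value. The vector and the object names are made up for illustration.

```r
# a vector with one missing value
x <- c(1, 2, NA, 4, 5)

# without na.rm, the missing value propagates to the result
mean.with.na <- mean(x)
print(mean.with.na)
## [1] NA

# with na.rm = TRUE, the NA is dropped before averaging
mean.no.na <- mean(x, na.rm = TRUE)
print(mean.no.na)
## [1] 3
```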

Functions are at the heart of R and we will dedicate a large part of this book to them. You can use the available functions or write your own. You can also publish your functions and let other people use your code. In a later chapter, we will learn how to use functions to do data analysis in an efficient way.
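As a preview of writing your own function, here is a minimal sketch. The function name calc.ret and its input values are hypothetical and chosen only for illustration; it assumes a price vector with at least two elements.

```r
# a hypothetical function that calculates simple returns from prices
calc.ret <- function(prices) {
  n <- length(prices)
  # each price divided by the previous one, minus 1
  ret <- prices[2:n] / prices[1:(n - 1)] - 1
  return(ret)
}

# use it like any other function
my.ret <- calc.ret(c(100, 110, 99))
print(my.ret)
## [1]  0.1 -0.1
```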

2.3 International and Local Formats

Before beginning to explain the use of R and RStudio, it is important to highlight some rules of formatting numbers, Latin characters and date formats.

• decimal: Following an international notation, the decimal point in R is defined by the period symbol (.), as in 2.5, and not by a comma, as in 2,5. In some countries, this might not be the case. This difference can create a lot of confusion and errors at the beginning. Some software, such as Microsoft Excel, does the conversion automatically when the data is imported. This, however, is generally an exception. As a general rule of using R, only use commas to separate the inputs of a function. Under no circumstances should the comma symbol be used as the decimal point separator. Always give priority to the international format because it will be compatible with the vast majority of data. Other researchers may experience some difficulty in understanding your code if you use your local notation for the decimal.

• Latin characters: Due to its international standard, R has problems understanding Latin characters, such as the cedilla and accents. If you can avoid it, do not use these characters in the names of your variables or files. In character objects (text), you can use them without problems as long as the encoding is correctly specified (e.g. UTF-8, Latin1). Given that, it is recommended that the R code be written in the English language. This automatically eliminates the use of Latin characters and facilitates the usability of the code by people outside of your country.

• date format: Dates in R are formatted according to the YYYY-MM-DD pattern, where YYYY is the year in four digits, MM is the month and DD is the day. An example is 2017-11-26. This may not be the case in your country. When importing local datasets, make sure the dates are in this format or do a conversion. Again, while you can work with your local format of dates in R, it is best to use the international notation. The conversion between one format and another is quite easy and will be presented in chapter 3.
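As a rough sketch of the conversions mentioned in the list above, the code below turns a number written with a decimal comma and a date written as DD/MM/YYYY into their international forms. The input values and object names are made up for illustration.

```r
# a number written with a decimal comma, stored as text
local.number <- '2,5'

# replace the comma by a period and convert to numeric
my.number <- as.numeric(sub(',', '.', local.number, fixed = TRUE))
print(my.number)
## [1] 2.5

# a date written in the DD/MM/YYYY format
local.date <- '26/11/2017'

# convert to the Date class, which prints as YYYY-MM-DD
my.date <- as.Date(local.date, format = '%d/%m/%Y')
print(my.date)
## [1] "2017-11-26"
```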

If you want to learn more about your local format in R, use the following command by typing it in the prompt and pressing enter:

Sys.localeconv()
##     decimal_point     thousands_sep          grouping
##               "."                ""                ""
##   int_curr_symbol   currency_symbol mon_decimal_point
##            "USD "               "$"               "."
## mon_thousands_sep      mon_grouping     positive_sign
##               ","        "\003\003"                ""
##     negative_sign   int_frac_digits       frac_digits
##               "-"               "2"               "2"
##     p_cs_precedes    p_sep_by_space     n_cs_precedes
##               "1"               "0"               "1"
##    n_sep_by_space       p_sign_posn       n_sign_posn
##               "0"               "1"               "1"

The output of Sys.localeconv() shows how R interprets decimal points and the thousands separator, among other things. As you can see from the previous output, this book was compiled using the US notation for currency and the dot for decimals. As mentioned before, it is good policy to follow international notation, especially for the decimal point. If necessary, you can change your local format to the US/international notation using the following command.

Sys.setlocale("LC_ALL", "English")

Note, however, that you'll need to run this command every time R starts, or incorporate it in the initialization of the software. Also be aware that the locale name "English" is valid on Windows; on Linux and macOS a name such as "en_US.UTF-8" is typically required.

2.4 Types of Files in R

Like any other programming platform, R has a file ecosystem and each type of file has a different purpose. In the vast majority of cases, however, the work will focus mostly on two types: .R and .RData files. Next, I provide a description of various file extensions. The items in the list are ordered by importance. Note that we omit graphic files such as .png, .jpg, .gif and data storage files (.csv, .xlsx, ...), among others, as they are not exclusive to R.

• Files with extension .R: text files containing several instructions for R. These are the files that will contain the sequence of commands that make up the main script and subroutines of the data research. Examples: My-Research.R, My_Functions.R.

• Files with extension .RData: files that store data in R native format. These files are used to save (write) objects so that they can be reused in different sessions. For example, you can use a .RData file to save a table after processing and cleaning up the raw database. This file can be later loaded for a subsequent analysis.
Examples: My_data.RData, Research_Results.RData.

• Files with extension .Rmd, .md and .Rnw: files used for editing dynamic documents related to the Rmarkdown and markdown formats. The use of these files allows the creation of documents where text and code output are integrated. This is an advanced topic and will not be covered in this book. For those interested, I suggest reading Baumer et al. (2014) and a tutorial at this link. Example: My_Report.Rmd.

• Files with extension .Rproj: project files for RStudio, used when working on, for example, a new package, a shiny application or a book. This is also an advanced topic and will not be dealt with here. While you can use the functionalities of RStudio projects to write R scripts, it is not a necessity. For those interested in learning more about this functionality, I suggest the RStudio manual. Example: MyProject.Rproj.

2.5 Explaining the RStudio Screen

After installing the two programs, R and RStudio, open RStudio by double clicking its icon. Be aware that R also has its own interface and this often causes confusion. You should find the correct shortcut for RStudio by going through your software folders. In Windows, you can search for RStudio using the Start button and typing RStudio. After opening RStudio, the resulting window should look like Figure 2.1.

Note that RStudio automatically detected the installation of R and initialized it on the left side of your screen. If you do not see something like this on the screen of RStudio:

R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

then R was not installed correctly. Repeat the installation steps in the previous chapter and confirm the startup message on the lower left side of RStudio.

As a first exercise, click File, New File, and R Script. A text editor should appear on the left side of the screen. It is there that we will enter our commands, which are executed from top to bottom, in the same direction that we normally read text. As a side note, all .R files created in RStudio are just text files and can be edited in other editors as well. It is not uncommon for experienced programmers to use one program to write code and another to run it. The resulting screen should look like Figure 2.2.

The main items/panels of the RStudio screen in Figure 2.2 are:

• Script Editor: located on the upper left side of the screen. This panel is used to write scripts and functions, mostly in files with the .R extension;

• R prompt: located on the left side, below the script editor. It displays the prompt of R, which can also be used to give commands to R. The main function of the prompt is to test code and display the results of the commands entered in the script editor;

• Environment: located on the top-right of the screen. Shows all objects, including variables and functions, currently available to the user. Also note a History panel, which shows the history of commands previously executed by the user;

• Panel Packages: shows the packages installed and loaded by R. Besides Packages, this panel has four other tabs: Files, to load and view system files; Plots, to view statistical figures created in R; Help, to access the help system; and Viewer, to display dynamic and interactive results, such as a web page.

As an introductory exercise, let's initialize two objects in R. Inside the prompt (lower left side), insert the following commands and press enter at the end of each.
The <- symbol is nothing more than the result of joining < (less than) with - (the minus sign). The ' symbol represents a single quotation mark. Its position depends on the keyboard layout: on some layouts it is found under the escape (esc) key, while on US layouts it is next to the enter key.

# set x
x <- 1

# set y
y <- 'My humble text'

If done correctly, notice that two objects appeared in the environment panel, one called x with a value of 1, and another called y with the text content "My humble text". Notice how we used specific symbols to define objects x and y. The use of double quotes (" ") or single quotes (' ') defines objects of the class character. Numbers are defined by the value itself. As will be discussed later, each object in R has a class and each class has a different behaviour. After sending the previous commands to R, the history tab has been updated.

Now, let's show the value of x on the screen. To do this, type the following command:

# print contents of x
print(x)
## [1] 1

The print function is one of the main functions for displaying values in the prompt of R. The text displayed as [1] indicates the index of the first element shown on that line. To verify this, enter the following command, which will show a lengthy sequence of numbers on the screen:

# print a sequence
print(50:100)
##  [1]  50  51  52  53  54  55  56  57  58  59  60  61  62  63
## [15]  64  65  66  67  68  69  70  71  72  73  74  75  76  77
## [29]  78  79  80  81  82  83  84  85  86  87  88  89  90  91
## [43]  92  93  94  95  96  97  98  99 100

In this case, we use the : symbol in 50:100 to create a sequence starting at 50 and ending at 100. Note that on the left side of each line, we have the values 1, 15, 29 and 43. These represent the index of the first element presented in the line. For example, the fifteenth element of 50:100 is 64.

2.6 Running Scripts from RStudio

Now, let's combine all the previously typed code into a single file by copying and pasting all commands into the editor's screen (upper left side). The result looks like Figure 2.3.
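Assuming the commands were pasted in the same order in which they were typed in the prompt, the contents of the editor would look roughly like the following sketch:

```r
# set x
x <- 1

# set y
y <- 'My humble text'

# print contents of x
print(x)

# print a sequence
print(50:100)
```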
After pasting all the commands in the editor, save the .R file to a personal folder where you have read and write permissions. One possibility is to save it in the My Documents folder with a name like 'MyFirstRScript.R'. This saved file, which at the moment does nothing special, records the steps of a simple algorithm that creates several objects and shows their content. In the future, this file can grow considerably by containing all stages of the data analysis, such as importing data, cleaning it, performing the analysis and exporting tables and figures.

2.6.1 RStudio shortcuts

In RStudio, there are some predefined and time-saving shortcuts for running code from the editor. To execute an entire script, simply press control + shift + s. This is the source command. With RStudio open, I suggest testing this key combination and checking how the code saved in a .R file is executed. The output of the script is shown in the prompt of R. The result in RStudio should look like Figure 2.4.

Another very useful command is line-by-line code execution. In this case, the whole file is not executed, but only the line where the cursor is located. For that, just press control + enter. This shortcut is very useful in developing scripts because it allows each line of the code to be tested before running the entire program. As an example of usage, point the cursor to the print(x) line and press control + enter. As you will notice, only the line print(x) was executed. Therefore, before running the whole script, you can test it line by line and check for possible errors.

Next, I highlight these and other RStudio shortcuts, which are also very useful.
• control + shift + s: executes (source) the current RStudio file;

• control + shift + enter: executes the current file with echo, showing the commands on the prompt;

• control + enter: executes the current line or selection, showing the commands on the prompt;

• control + alt + b: executes the code from the beginning of the file to the line where the cursor is;

• control + alt + e: executes the code from the line where the cursor is until the end of the file.

My suggestion is to use these shortcuts from day one. They greatly facilitate the use of the program. For those who like to use the mouse, an alternate way to execute code is to click the source button in the upper-right corner of the text editor. If you want to set your own shortcuts in RStudio, go to the Tools menu and select Modify Keyboard Shortcuts. One personal suggestion here is to set the source command to F5, which is used by several other programs as an "execute" shortcut.

If you want to run code in a .R file within another .R file, you can use the source command. For example, imagine that you have a main script with your data analysis and another script that performs some support operation, such as importing data to R. These operations have been split into separate files as a way of organizing the code. To run the support script, just call it with function source in the main script, as in the following code:

# execute import script
source('import-data.R')

In this case, all code in import-data.R will be executed. This is equivalent to manually opening file import-data.R and hitting control + shift + s.

2.7 Testing and Debugging Code

The development of code follows a cycle. At first, you will write a line of code in a script, try it using control + enter and check the output. A new line of code is written once the previous line worked as expected. The cycle is clear: writing code is followed by line execution, followed by result checking, then modifying and repeating if necessary. This is a normal and expected process.
You need to make sure that every line of code is correctly specified before moving to the next one.

When you are trying to find an error in a preexisting script, R offers some tools for controlling and assessing its execution. This is especially useful when you have a long and complicated script. The simplest and easiest tool that R and RStudio offer is the code breakpoint. In RStudio, you can click on the left side of the script editor and a red circle will appear, as in Figure 2.5. This red circle indicates a code breakpoint that will force the code to stop at that line. You can use it to test existing code and check its objects at a certain part of the execution. When the execution hits the breakpoint, the prompt will change to Browse[1]> and you'll be able to try new code or verify the content of the objects. From the Console, you have the option to continue the execution to the next breakpoint or stop it. The same result can be achieved using function browser. Have a look:

# set x
x <- 1

# set y
browser()
y <- 'My humble text'

# print contents of x
print(x)

The practical result is the same as using RStudio's red circle, but it gives you more control for the case of several commands in the same line.

2.8 Creating Simple Objects

One of the most basic and most used commands in R is the creation of objects. As shown in previous sections, you can define an object using the <- command, which is verbally translated to assign. For example, consider the following code:

# set x
x <- 123

# set x, y and z in one line
my.x <- 1 ; my.y <- 2; my.z <- 3

We can read the first command as "the value 123 is assigned to x". The direction of the arrow defines where the value is stored. For example, using 123 -> x also works, although this is not recommended as the code becomes less readable. Also notice that you can create several objects within the same line by separating the commands with a semicolon.

The use of an arrow symbol <- for object definition is specific to R.
The reason for this choice was that, at the time of conception of the S language, keyboards with a key that directly defined the arrow symbol were available and used. This means that the programmer only had to hit one key on the keyboard in order to type the arrow symbol. Modern keyboards, however, are different. If you find it troublesome to use this symbol, you can use shortcuts as well. In Windows, the shortcut for the symbol <- is alt plus -.

You can also use the = symbol to define objects, as in x = 123, but the use of = for this specific purpose is not recommended. The symbol of equality has a special use within the definition of function arguments. This case will be better explained and demonstrated in a future section.

The name of the object is important in R. With the exception of very specific cases, the user can name objects as they like. This freedom, however, can be a problem. It is desirable to always give short names that make sense for the content of the script and which are simple to understand. This facilitates the understanding of the code by other users and is part of the suggested set of rules for structuring code. Note that all objects created in this book have names in English and a specific formatting, where the white space between words is replaced by a dot, as in my.x <- 1 and name.of.file <- 'MyDataFile.csv'.

R executes the code looking for objects available in the environment, including functions. Be aware that R is case sensitive, that is, object m is different from M. If we try to access an object that does not exist, R will return an error message and stop the execution. Have a look:

print(z)
## Error in print(z): object 'z' not found

The error occurred because object z does not exist in the current environment. If we create a variable z as z <- 321 and repeat the command print(z), we will not have the same error message.

2.9 Creating Vectors

In the previous examples, we created simple objects such as x <- 1 and x <- 'abc'.
While this is sufficient to demonstrate the basic commands in R, in practice, such commands are very limited. A real problem of data analysis will certainly have a greater volume of information.

One of the most used procedures in R is the creation of atomic vectors. These are objects that can have several elements. All elements of an atomic vector must have the same class, which justifies its atomic property. An example would be the representation of a series of daily stock prices as an atomic vector of the class numeric. Once you have a vector, you can manipulate it any way you want.

Atomic vectors are created in R using the c command, which comes from the verb combine. For example, if we wanted to combine the values 1, 2 and 3 in one object, we could do it with the following command:

# create numeric atomic vector
x <- c(1, 2, 3)

# print it
print(x)
## [1] 1 2 3

This command works the same way for any other class of object, such as character:

# create character atomic vector
y <- c('text 1', 'text 2', 'text 3', 'text 4')

# print it
print(y)
## [1] "text 1" "text 2" "text 3" "text 4"

The only restriction on the use of the c command is that all elements must have the same class. If we insert data from different classes in a call to c(), R will try to convert all elements into the same class following its own logic. If the conversion of all elements to a single class is not possible, an error message is returned. Note the following example, where numeric values are set in the first and second element of x and a character in the last element.

# a mixed vector
x <- c(1, 2, '3')

# print result of forced conversion
print(x)
## [1] "1" "2" "3"

The values of x are all of type character. The use of the class command confirms this result:

# print class of x
class(x)
## [1] "character"

2.10 Knowing Your Environment and Objects

After using various commands, further development of the script requires you to understand what objects are available and what their contents are.
You can find this information simply by looking at the upper right screen of RStudio. However, there is a command that shows the same information in the prompt. In order to know what objects are currently available in R's memory, you can use command ls. Note the following example:

# set some objects
x <- 1
y <- 2
z <- 3

# print all objects in the environment
print(ls())
## [1] "x" "y" "z"

The objects x, y and z were created and are available in the current working environment. If we had other objects, they would also appear in the output of ls. Notice that the object returned from ls is a character vector.

To display the content of each object, just enter the names of the objects and press enter in the prompt:

# print objects by their name
x
## [1] 1
y
## [1] 2
z
## [1] 3

Typing the object name on the screen has the same effect as using the print command. In fact, when executing the sole name of a variable in the prompt or script, R internally passes the object to the print function.

In R, all objects belong to a class. As previously mentioned, to find the class of an object, simply use the class function. In the following example, x is an object of the class numeric, y is a text (character) object and my.fct is a function object.

# set objects
x <- 1
y <- 'a'
my.fct <- function(){}

# print their classes
print(class(x))
## [1] "numeric"
print(class(y))
## [1] "character"
print(class(my.fct))
## [1] "function"

Another way to learn more about an object is to check its textual representation. Every object in R has a textual representation and we can find it with function str. Note that str prints its result directly, so there is no need to wrap it in print:

# print the textual representation of a vector
str(1:10)
##  int [1:10] 1 2 3 4 5 6 7 8 9 10

This function is particularly useful when trying to understand the details of a more complex object, such as a dataframe. We will learn more about using function str for learning the contents of a dataframe in chapter 4.
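The inspection commands of this section can be combined on a more complex object. The sketch below builds a small dataframe with made-up values (the object name and its columns are illustrative assumptions) and checks it with ls, class and str.

```r
# a small dataframe with illustrative values
my.df <- data.frame(ticker = c('ABC', 'XYZ'),
                    price = c(10.5, 20.1),
                    stringsAsFactors = FALSE)

# list the objects available in the current environment
print(ls())

# check the class of the dataframe
print(class(my.df))
## [1] "data.frame"

# show the structure: one line per column, with its type and first values
str(my.df)
```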
2.11 Displaying and Formatting Output

So far, we saw that you can show the value of an R object on the screen in two ways. You can either enter its name in the prompt or use the print function. Explaining it further, the print function focuses on the presentation of objects and can be customized for any type. For example, if we had an object of a class called MyTable to represent a specific type of table, we could create a function called print.MyTable that would show a table on the screen with a special format for the row and column names. Function print, therefore, is oriented towards presenting objects and the user can customize it for different classes. The base package, which is automatically initialized with R, contains several print functions for various kinds of objects.

However, there are other specific functions to display text in the prompt. The main one is cat (concatenate and print). This function takes a text as input, processes it for specific symbols and displays the result on the screen. Function cat is more powerful and customizable than print. For example, if we wanted to show the text The value of x is 2 on screen using a numerical object, we could do it as follows:

# set x
x <- 2

# print customized message
cat('The value of x is', x)
## The value of x is 2

You can also customize the screen output using specific commands. For example, if we wanted to break a line in the screen output, we could do it through the use of the reserved character \n:

# set text with break line
my.text <- ' First Line,\n Second line'

# print it
cat(my.text)
##  First Line,
##  Second line

Note that the use of print would not result in the same effect, as this command displays the text as it is, without processing it for specific symbols:

print(my.text)
## [1] " First Line,\n Second line"

Another example in the use of specific commands for text is to add a tab space with the symbol \t.
See an example next:

# set text with tab
my.text <- 'A->\t<-B'

# concatenate and print it!
cat(my.text)
## A->    <-B

We've only scratched the surface on the possible ways to manipulate text output. Other ways to manipulate text output based on specific symbols can be found in the official R manual, available on the book website.

2.11.1 Customizing the Output

Another way to customize text output is using specific functions to manipulate objects of the class character. For that, there are two very useful functions: paste and format.

Function paste glues a series of objects together. It is a very useful function, and will be used intensely for the rest of the examples in this book. Consider the following example:

# set some text objects
my.text.1 <- 'I am a text'
my.text.2 <- 'very beautiful'
my.text.3 <- 'and informative.'

# paste all objects together and print
cat(paste(my.text.1, my.text.2, my.text.3))
## I am a text very beautiful and informative.

The previous result is not far from what we did in the example with the cat function. Note, however, that the paste function adds a space between each text. If we did not want this space, we could use function paste0 as in:

# example of paste0
cat(paste0(my.text.1, my.text.2, my.text.3))
## I am a textvery beautifuland informative.

Another very useful possibility with the paste function is to insert a text or symbol between the junction of texts. For example, if we wanted to add a comma (,) between each item to be pasted, we could do this by using the input option sep as follows:

# example using the argument sep
cat(paste(my.text.1, my.text.2, my.text.3, sep = ', '))
## I am a text, very beautiful, and informative.

If we had an atomic vector with all elements to be glued in a single object, we could achieve the same result using the collapse argument. See an example next.
# set character object
my.text <- c('I am a text', 'very beautiful', 'and informative.')

# example of using the collapse argument in paste
cat(paste(my.text, collapse = ', '))
## I am a text, very beautiful, and informative.

Going forward, command format is used to format numbers and dates. It is especially useful when we create tables and we want to present the numbers in a visually appealing way. By default, R presents a set number of digits after the decimal point:

# example of decimal points in R
cat(1/3)
## 0.3333333

If we wanted only two digits on the screen, we could use the following code:

# example of using format on numerical objects
cat(format(1/3, digits = 2))
## 0.33

Likewise, if we wanted to use a scientific format in the display, we could do the following:

# example of using scientific format
cat(format(1/3, scientific = TRUE))
## 3.333333e-01

Function format has many more options. If you need your numbers to come out in a specific way, have a look at the help manual for this function. It is also a generic function and can be used for many types of objects.

2.12 Finding the Size of Objects

In the practice of programming with R, it is very important to know the size of the objects being used. Here, size means the number of individual elements. This information serves not only to assist the programmer in checking possible code errors, but also to know the length of iteration procedures such as loops, which will be treated in a later chapter of this book.

In R, the size of an object can be checked with the use of four main functions: length, nrow, ncol and dim.
Function length is intended for objects with a single dimension, such as atomic vectors:

# create atomic vector
x <- c(2, 3, 3, 4, 2, 1)

# get length of x
n <- length(x)

# display message
cat('The size of x is', n)
## The size of x is 6

For objects with more than one dimension, such as matrices, use functions nrow, ncol and dim (dimension) to find the number of rows (first dimension) and the number of columns (second dimension). See the difference in usage below.

# create a matrix
M <- matrix(1:20, nrow = 4, ncol = 5)

# print matrix
print(M)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20

# calculate size in different ways
my.nrow <- nrow(M)
my.ncol <- ncol(M)
my.n.elements <- length(M)

# display message
cat('The number of lines in M is', my.nrow)
## The number of lines in M is 4

cat('The number of columns in M is', my.ncol)
## The number of columns in M is 5

cat('The number of elements in M is', my.n.elements)
## The number of elements in M is 20

The dim function shows the dimension of the object, resulting in a numeric vector as output. This function should be used when the object has more than two dimensions. In practice, however, such cases are rare. An example is given next:

# get dimension of M
my.dim <- dim(M)

# print it
print(my.dim)
## [1] 4 5

In the case of objects with more than two dimensions, we can use the array function to create the object and dim to find its size.
Have a look at the next example:

# create an array with three dimensions
my.array <- array(1:9, dim = c(3, 3, 3))

# print it
print(my.array)
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
##
## , , 3
##
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

# display its dimensions
print(dim(my.array))
## [1] 3 3 3

An important note here is that functions length, nrow, ncol and dim are not intended to find the number of characters in a text. This is a common mistake. For example, if we had a character object and used the length function on it, the result would be the following:

# set text object
my.char <- 'abcde'

# print result of length
print(length(my.char))
## [1] 1

This occurred because the length function returns the number of elements in an object. In this case, my.char has only one element. To find out the number of characters in the object, we use the nchar function as follows:

# find the number of characters in a character object
print(nchar(my.char))
## [1] 5

2.13 Selecting the Elements of an Atomic Vector

After creating an atomic vector of a class, it is possible that the user is interested in only one or more elements of it. For example, if we were updating the value of an investment portfolio, our interest in a vector containing stock prices is only for the latest price. All other prices are not relevant to our analysis and therefore can be ignored.

The selection of pieces of an atomic vector is called indexing and it is accomplished with the use of square brackets ([ ]). Consider the following example:

# set x
my.x <- c(1, 5, 4, 3, 2, 7, 3.5, 4.3)

If we wanted only the third element of my.x, we use the bracket operator as follows:

# get third element of x
elem.x <- my.x[3]

# print it
print(elem.x)
## [1] 4

The procedure of indexing also works with vectors.
If we are only interested in the last and penultimate values of my.x, we use the following code:

# get last and penultimate value of my.x
piece.x.1 <- my.x[ (length(my.x)-1):length(my.x) ]

# print it
print(piece.x.1)
## [1] 3.5 4.3

A cautionary note. A particular property of the R language is that if a non-existing element is accessed, the program returns the value NA (not available). See the next example code, where we attempt to obtain the fourth value of a vector with only three components.

# set object
my.vec <- c(1, 2, 3)

# print non-existing fourth element
print(my.vec[4])
## [1] NA

It is important to know this behaviour because the lack of treatment of these cases can lead to problems that are difficult to identify in more complex code. In other programming languages, attempting to access non-existing elements generally returns an error and cancels the execution of the rest of the code. In the case of R, given that access to non-existent elements does not generate an error or warning message, it is possible that this will create a problem in other parts of the script, as NA objects are contagious. That is, anything that interacts with NA will also become NA. The user should pay attention every time that NA values are found unexpectedly. An inspection of the length and indexing of vectors may be required.

The use of indices is very useful when you are looking for items of a vector that satisfy some condition. For example, if we wanted to find out all values in my.x that are greater than 3, we could use the following command:

# find all values in my.x that are greater than 3
piece.x.2 <- my.x[my.x > 3]

# print it
print(piece.x.2)
## [1] 5.0 4.0 7.0 3.5 4.3

It is also possible to index elements by more than one condition using the logical operators & (and) and | (or).
For example, if we wanted the values of my.x greater than 2 and lower than 4, we could use the following command:

# find all values of my.x that are greater than 2 and lower than 4
piece.x.3 <- my.x[ (my.x > 2) & (my.x < 4) ]

# print it
print(piece.x.3)
## [1] 3.0 3.5

Likewise, if we wanted all items that are lower than 3 or greater than 6, we use:

# find all values of my.x that are lower than 3 or higher than 6
piece.x.4 <- my.x[ (my.x < 3) | (my.x > 6) ]

# print it
print(piece.x.4)
## [1] 1 2 7

Moreover, logical indexing also works across different objects. That is, we can use a logical condition in one object to select items from another:

# set my.x and my.y
my.x <- c(1, 4, 6, 8, 12)
my.y <- c(-2, -3, 4, 10, 14)

# find all elements of my.x where my.y is higher than 0
my.piece.x <- my.x[ my.y > 0 ]

# print it
print(my.piece.x)
## [1] 6 8 12

Looking more closely at the indexing process, it is worth noting that, when we use a data indexing condition, we are in fact creating a variable of the logical type. This object takes only two values: TRUE and FALSE. Have a look at the code presented next, where we create a logical object, print it and present its class.

# create a logical object
my.logical <- my.y > 0

# print it
print(my.logical)
## [1] FALSE FALSE TRUE TRUE TRUE

# find its class
class(my.logical)
## [1] "logical"

2.14 Removing Objects from the Memory

After creating several variables, the R environment can become full of content that’s already been used and is dispensable. In this case, it is desirable to clear the memory to erase objects that are no longer needed. Generally, this is accomplished at the beginning of a script, so that every time the script runs, the memory will be cleared before any calculation. In addition to cleaning the computer’s memory, it also helps to avoid possible errors in the code. In most cases, cleaning the working environment should be performed only once at the beginning of the script.
For example, given an object x, we can delete it from memory with the command rm, as shown next:

# set x
x <- 1

# print all available objects
ls()
##  [1] "elem.x"        "M"             "my.array"
##  [4] "my.char"       "my.dim"        "my.engine"
##  [7] "my.fct"        "my.logical"    "my.ncol"
## [10] "my.n.elements" "my.nrow"       "my.out.width"
## [13] "my.piece.x"    "my.str"        "my.text"
## [16] "my.text.1"     "my.text.2"     "my.text.3"
## [19] "my.vec"        "my.x"          "my.y"
## [22] "my.z"          "n"             "piece.x.1"
## [25] "piece.x.2"     "piece.x.3"     "piece.x.4"
## [28] "stay.quiet"    "x"             "y"
## [31] "z"

# remove x
rm('x')

# print again all available objects
ls()
##  [1] "elem.x"        "M"             "my.array"
##  [4] "my.char"       "my.dim"        "my.engine"
##  [7] "my.fct"        "my.logical"    "my.ncol"
## [10] "my.n.elements" "my.nrow"       "my.out.width"
## [13] "my.piece.x"    "my.str"        "my.text"
## [16] "my.text.1"     "my.text.2"     "my.text.3"
## [19] "my.vec"        "my.x"          "my.y"
## [22] "my.z"          "n"             "piece.x.1"
## [25] "piece.x.2"     "piece.x.3"     "piece.x.4"
## [28] "stay.quiet"    "y"             "z"

Note that after executing the command rm('x'), the name x no longer appears in the output of ls(). In practical situations, however, it is desirable to clean up all the memory used by all objects created in R. We can achieve this goal with the following code:

rm(list = ls())

The term list in rm(list = ls()) is an argument of function rm that defines which objects will be deleted. The ls() command returns the names of all the currently available objects. Therefore, by chaining together both commands, we erase all objects currently available in the environment. As mentioned before, it is good programming policy to always start the script by clearing the memory. However, you should only wipe out all of R’s memory if you have already saved the results of interest or if you can replicate them.

2.15 Displaying and Setting the Working Directory

Like other programming platforms, R always works in a directory. If no directory is set, a default value is used when R starts up.
R uses the current directory to search for files when loading data or other R scripts. It is also in this directory that R saves any output we want if we do not explicitly define an address on the computer. This output can be a graphic file, a text file or a spreadsheet-like file. A good programming policy is to change the working directory to the same place where the script is located. In chapter 11 we will further discuss the topic of file and folder organization. To show the current working directory, use function getwd:

# get current dir
my.dir <- getwd()

# display it
print(my.dir)
## C:/Dropbox/My Books/pafdR (en)/Book Content

The result of the previous code shows the folder in which this book was written and compiled. As you can see, the book files are saved in a subfolder of my Dropbox directory. From the path, you should also realize that I’m working in a Windows OS. The root directory C:/ gives that information away.

The change of working directory is performed with the setwd command. For example, if we wanted to change our working directory to C:/My Research/, simply type in the prompt:

# set where to change directory
my.d <- 'C:/My Research/'

# change it
setwd(my.d)

In simple cases such as the one above, remembering the directory name is easy. In practical cases, however, the working directory can be in a deeper directory of the file system. In this situation, an efficient strategy to locate the path is to use a file explorer, like Windows Explorer. To do so, open the explorer application and navigate to the location where you want to work with your script. Place the cursor in the address bar and select the whole path. Press control + c to copy the address to the clipboard. Go back to your code and paste it in. An important step here: Windows uses the backslash to set addresses on the computer, while R uses the forward slash. If you try to use backslashes, an error is displayed on the screen. See the following example.
# set directory (WRONG WAY)
my.d <- 'C:\My Research\'
##
## Error: '\M' is an unrecognized escape in character string

This message means that R was not able to understand the use of backslashes. Inside a text, the backslash is a reserved symbol that starts an escape sequence, such as \n and \t, so a single backslash cannot be used as a path separator. Therefore, after copying the address, modify all backslashes to forward slashes, as in the following code:

# set directory (CORRECT WAY)
my.d <- 'C:/My Research/'

# change dir
setwd(my.d)

You can also use double backslashes \\ but this is not recommended as it is not compatible with other operating systems.

Another important piece of information is that you can also use relative paths. For example, if you are working in a folder that contains a subdirectory called Data, you can enter this subfolder with the code:

# change to subfolder
setwd('Data')

Another possibility is to go to a previous level of directory using .., as in:

# change to previous level
setwd('..')

So, if you are working in directory C:/My Research/ and execute the command setwd('..'), the current folder becomes C:/, which is one level above C:/My Research/.

Another, more modern, way of setting the directory is to use RStudio API functions. This is a set of functions that only work inside RStudio and provide information about the current file, project and much more. To find out the path of the current R script being edited in RStudio and set the working directory to there, you can write:

my.path <- dirname(rstudioapi::getActiveDocumentContext()$path)
setwd(my.path)

This way, the script will change the directory to its own location, no matter where you copy it. Be aware, however, that this trick only works in RStudio script editor and within a saved file. It will not work from the prompt.
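As a recap of this section, the commands getwd and setwd and the relative paths discussed earlier can be combined in a small sketch. This is only an illustration under a few assumptions: R's temporary folder (tempdir()) stands in for a real project folder, and the subfolder name Data is borrowed from the earlier example. All functions used here belong to base R.

```r
# store the current working directory so we can restore it later
old.dir <- getwd()

# use the session's temporary folder as a stand-in for a project folder
my.d <- tempdir()

# create a subfolder called Data (no warning if it already exists)
dir.create(file.path(my.d, 'Data'), showWarnings = FALSE)

# change to the project folder, then to the subfolder (relative path)
setwd(my.d)
setwd('Data')

# store and print the name of the current folder
cur.folder <- basename(getwd())
print(cur.folder)

# go back one level with '..'
setwd('..')

# restore the original working directory
setwd(old.dir)
```

Function file.path builds the address with forward slashes, which avoids the backslash problem described earlier. Saving the original directory in old.dir and restoring it at the end keeps the rest of the session unaffected.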

2.16 Cancelling Code Execution

Whenever R is running some code, a visual cue in the shape of a small red circle will appear in the right corner of the prompt. If you look closely, it shows the word stop. This button is not only an indicator of running code but also a shortcut for cancelling its execution. Another way to cancel an execution is to click on the prompt with the mouse and press the escape (esc) key on the keyboard.

To try it out, run the next chunk of code in RStudio and cancel its execution using esc.

for (i in 1:100) {
  cat('\nRunning code (please make it stop by hitting esc!)')
  Sys.sleep(1)
}

In the previous code, we used a for loop to display the message '\nRunning code (please make it stop by hitting esc!)' every second. For now, do not worry about the code and functions used in the example. We will discuss the use of loops in chapter 8.

2.17 Code Comments

In R, comments are set using the hash symbol #. Anything after this symbol, up to the end of the line, will not be processed by R. This gives you freedom to write whatever you want within the script. An example:

# this is a comment (R will not parse it)
# this is another comment (R will not parse it)

x <- 'abc' # this is an inline comment 

Comments are a way to communicate any important information that cannot be directly inferred from the code. In general, you should avoid using comments that are too obvious or too generic. For example:

# read csv file
df <- read.csv('MyDataFile.csv')

As you can see, it is quite obvious from line df <- read.csv('MyDataFile.csv') that the code is reading a .csv file. The name of the function already states that. So, the comment was not a good one as it did not add any new information for the user. A better approach to commenting would be to set the author and description of the script, and to better explain the origin and last update of the data file. Have a look:

# Script for analyzing a dataset
# Author: Mr data analyst (dontspamme@emailprovider.com)
# Last script update: 2017-03-10
#
# File downloaded from www.sitewithdatafiles.com/data-files/
# The description of the data goes here
# Last file update: 2017-03-10
df <- read.csv('MyDataFile.csv')

So, by reading the comments, the user will know the purpose of the script, who wrote it and the date of the last edit. It also includes the origin of the data file and the date of the latest update. If the user wants to update the data, all they have to do is go to the referred website and download the new file. If the data file is updated, a new date should be placed in “Last file update”.

Another use of comments is to set sections in the code, such as in:

# Script for analyzing a dataset
# Author: Mr data analyst (dontspamme@emailprovider.com)
# Last script update: 2017-03-10
#
# File downloaded from www.sitewithdatafiles.com/data-files/
# The description of the data goes here
# Last file update: 2017-03-10
...

# Clean data
# - remove outliers
# - remove unnecessary columns

...

# Report results
# - remove outliers
# - remove unnecessary columns

...

This way, once you need to change a particular part of the code, you can look for the related section in the comments. If you share code with other people, you’ll soon realize that comments are essential and expected. They help transmit information that is not available from the code. A note here: throughout the book you’ll see that the code comments are, most of the time, a bit obvious. This is intentional, as clear and direct messages are important for new users, who are part of the audience of this book.

2.18 Looking for Help

A common task in the use of R is to seek help. Even advanced users often seek instructions on specific tasks, whether it is to better understand the details of some functions or simply to study a new procedure. The use of the R help system is part of everyday routine with the software.

You can get help by using the help panel in RStudio or directly from the prompt. Simply enter the question mark next to the object on which you want help, as in ?mean. In this case, object mean is a function and the use of the help command will open a panel on the right side of RStudio.

In R, the help screen of a function looks like the one shown in Figure 2.6. It presents a general description of the function, explains its input arguments and the format of the output. The help screen follows with references and suggestions for other related functions. More importantly, examples of usage are given last and can be copied to the prompt or script in order to accelerate the learning process.

If we are looking for help on a given text and not a function name, we can use double question marks, as in ??"standard deviation". This operation will search for the occurrence of the term in the help files of all installed packages and it is very useful to learn how to perform a particular task. In this case, we looked for the available functions to calculate the standard deviation of a vector.

As a suggestion of usage, the easiest and most direct way to learn a new function is trying out the examples in the manual. This way, you can see which type of input objects the function expects and what type of output it gives. Once you have it working, read the help screen to understand if it does exactly what you expected and what are the options for its use. If the function performs the desired procedure, you can copy and paste the code example for your own script, adjusting where necessary.
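The help tools described above can also be reached through regular function calls. The following is a minimal sketch using only functions from packages base and utils. The calls that open a help page or run the manual's examples are wrapped in interactive(), an assumption made here so that the code can also run safely outside an interactive session.

```r
# list objects whose name contains 'mean' in the attached packages
found <- apropos('mean')
print(head(found))

# show the input arguments of function sd without opening its help page
sd.args <- args(sd)
print(sd.args)

# the next calls only make sense in an interactive session
if (interactive()) {
  # same as ?mean
  help('mean')

  # run the examples from the help page of function mean
  example('mean')
}
```

Function apropos is a quick alternative when you only remember part of a function name, while example runs the code from the help page so you can inspect the inputs and outputs directly.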

Another very important source of help is the Internet itself. Sites like stackoverflow and specific mailing lists, whose content is also on the Internet, are a valuable source of information. If you find a problem that could not be solved by reading the standard help files, the next logical step is to seek a solution using your error message or the description of the problem in search engines. In many cases, your problem, no matter how specific it is, has already occurred and has been solved by other users. In fact, it is more surprising not to find the solution for a programming problem on the internet, than the other way around.

2.19 R Packages

One of the greatest benefits of using R is its package collection. A package is nothing more than a group of procedures aimed at solving a particular computational problem. R has at its core a collaborative philosophy. Users provide their code for others to use. And, most importantly, all packages are free. For example, consider a case where the user is interested in accessing data about historical inflation in the USA. They can install and use an R package that is specifically designed for importing economic statistics for a country.

Every function in R belongs to a package. When R initializes, packages stats, graphics, grDevices, utils, datasets, methods and base are loaded by default. Almost every function we have used so far belongs to the package base. R packages can be accessed and installed from different sources, the main ones being CRAN (the Comprehensive R Archive Network), R-Forge and Github. The quantity and diversity of R packages increases every day. At the time of publication, the author of this book has six packages available on CRAN:

• GetHFData - Allows direct access to high frequency financial transaction data from Bovespa (Brazilian Financial Exchange);

• GetTDData - Enables access to prices and yields of bonds issued by the Brazilian government;

• RndTexExams - Enables the creation and correction of single choice exams with randomized content;

• BatchGetSymbols - Package for easy access to daily data from Yahoo! Finance and Google Finance;

• Predatory - Package to identify predatory journals based on the Beall site data;

• pafdR - Provides code, data and exercises for this book.
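Going back to the statement that every function in R belongs to a package, this can be checked from the prompt. The sketch below uses only base R functions. It assumes a default R session, where functions mean and sd are not masked by other packages.

```r
# packages (and other environments) attached in the current session
print(search())

# find the package in which a function is defined
# (this works for regular functions; primitives such as sum have no environment)
pkg.mean <- environmentName(environment(mean))
pkg.sd <- environmentName(environment(sd))

# display messages
cat('Function mean comes from package', pkg.mean, '\n')
cat('Function sd comes from package', pkg.sd, '\n')
```

In a default session, the output of search() should include the packages listed above as loaded at startup.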

CRAN is the official repository of R and it is built by the community. Anyone can send a package. However, there is an evaluation process to ensure that certain strict rules about code format are respected. For those interested in creating and distributing packages, clear and easy to learn material on how to create and send packages to CRAN is presented on the site R packages. Complete rules are available on the CRAN website. The suitability of the code to CRAN standards is the developer’s responsibility. From personal experience, sending and publishing a package on CRAN demands a significant amount of work, especially in the first submission. After that, it becomes a lot easier. Don’t be angry if your package is rejected. My own packages were rejected several times before entering CRAN. Listen to what the maintainers tell you and try fixing all problems before resubmitting. If you’re having issues that you cannot solve or find a solution for on the Internet, look for help in the R-packages mailing list. You’ll be surprised at how accessible and helpful the R community can be.

The complete list of packages available on CRAN, along with a brief description, can be accessed at the packages link on the R site. A practical way to check if there is a package that does a specific procedure is to load the previous page and search in your browser for a keyword. If there is a package that does what you want, it is very likely that the keyword is used in the description of the package. Another important source for finding packages is Task Views. There you can find the most important packages for a given area of expertise. See the Task Views screen in Figure 2.7.

Unlike CRAN, R-Forge and Github have no restriction on the code sent to their repository and, because of this, these repositories tend to be chosen by developers. Responsibility in the use, however, is with the user. In practice, it is very common for developers to maintain a development version on Github or R-Forge and the official version in CRAN. When the development version reaches a certain stage of maturity, it is then sent to CRAN.

The most interesting part of this is that the packages can be accessed and installed directly from the prompt using the internet. To find out the current number of packages on CRAN, type and execute the following commands in the prompt:

# get matrix with available packages
df.cran.pkgs <- available.packages()

# find the number of packages
n.cran.packages <- nrow(df.cran.pkgs)

# print it
print(n.cran.packages)
## [1] 11886

If asked about which mirror to use, simply select the one closest to you. Currently (2017-11-26 13:38:53), there are 11886 packages available on the CRAN servers. We can see some details of the first three packages in df.cran.pkgs with function print and some indexing:

# print information about the first three packages
print(df.cran.pkgs[1:3, ])
##        Package  Version Priority
## A3     "A3"     "1.0.0" NA
## abbyyR "abbyyR" "0.5.1" NA
## abc    "abc"    "2.1"   NA
##        Depends
## A3     "R (>= 2.15.0), xtable, pbapply"
## abbyyR "R (>= 3.2.0)"
## abc    "R (>= 2.10), abc.data, nnet, quantreg, MASS, locfit"
##        Imports                                  LinkingTo
## A3     NA                                       NA
## abbyyR "httr, XML, curl, readr, plyr, progress" NA
## abc    NA                                       NA
##        Suggests                               Enhances
## A3     "randomForest, e1071"                  NA
## abbyyR "testthat, rmarkdown, knitr (>= 1.11)" NA
## abc    NA                                     NA
##        License              License_is_FOSS
## A3     "GPL (>= 2)"         NA
## abbyyR "MIT + file LICENSE" NA
## abc    "GPL (>= 3)"         NA
##        License_restricts_use OS_type Archs MD5sum
## A3     NA                    NA      NA    NA
## abbyyR NA                    NA      NA    NA
## abc    NA                    NA      NA    NA
##        NeedsCompilation File
## A3     "no"             NA
## abbyyR "no"             NA
## abc    "no"             NA
##        Repository
## A3     "https://cloud.r-project.org/src/contrib"
## abbyyR "https://cloud.r-project.org/src/contrib"
## abc    "https://cloud.r-project.org/src/contrib"

In short, object df.cran.pkgs contains the names of the packages, their current versions and their dependencies, along with various other information.

You can also check the number of locally installed packages in R with the installed.packages command:

# find number of packages currently installed
n.local.packages <- nrow(installed.packages())

# print it
print(n.local.packages)
## [1] 391

In this case, the computer on which the book was written has 391 packages currently installed. This value is probably different from yours. Give it a try!
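The object returned by installed.packages can also be inspected with the indexing tools seen earlier in this chapter. The following sketch checks whether a given package is installed and shows the name and version of a few packages. Package stats is used here only because it ships with every R installation. The printed rows will differ from one computer to another.

```r
# matrix with one row per installed package
local.pkgs <- installed.packages()

# is a given package installed? (row names are package names)
is.installed <- 'stats' %in% rownames(local.pkgs)
print(is.installed)

# name and version of the first three installed packages
print(local.pkgs[1:3, c('Package', 'Version')])
```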

2.19.1 Installing Packages from CRAN

To install a package, simply use the command install.packages. You only need to do it once for each new package. As an example, we will install a package called quantmod that will be used in future chapters.

# install package quantmod
install.packages("quantmod")

That’s it! After executing this simple command, package quantmod and all of its dependencies will be installed and the functions related to the package will be ready for use once the package is loaded in a script. Note that we defined the package name in the installation as if it were text with the use of quotation marks (" "). If the installed package is dependent on another package, R detects this dependency and automatically installs the missing packages. Thus, all the requirements for using the installed package will already be satisfied and everything will work perfectly. It is possible, however, that a package has an external dependency. As an example, package RndTexExams depends on the existence of a LaTeX installation. These cases are usually announced in the description of the package and an error informs that a requirement is missing. External dependencies for R packages are not common, but they do happen.
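Since a package only needs to be installed once, a script that is shared with other people can check what is missing before calling install.packages. The code below is one possible sketch of this idea. The function name find.missing.pkgs is hypothetical and created only for this illustration, and packages stats and utils are used as examples because they ship with R, so nothing is downloaded when the code runs.

```r
# return the packages from 'pkgs' that are not installed locally
# (hypothetical helper, written only for this example)
find.missing.pkgs <- function(pkgs) {
  # requireNamespace returns TRUE if the package can be loaded
  is.available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is.available]
}

# packages needed by the script (stats and utils ship with R)
my.pkgs <- c('stats', 'utils')

# find which ones are missing
to.install <- find.missing.pkgs(my.pkgs)

# install only what is missing
if (length(to.install) > 0) {
  install.packages(to.install)
}
```

In your own script, you would replace the contents of my.pkgs with the packages you actually use, such as quantmod.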

2.19.2 Installing Packages from Github

To install a package hosted in Github, you must install the devtools package, available on CRAN:

# install devtools
install.packages('devtools')

After that, load up the package devtools and use the function install_github to install a package directly from Github. In the following example, we install the development version of the package ggplot2, whose official version is also available at CRAN:

# load up devtools
library(devtools)

# install ggplot2 from github
install_github("hadley/ggplot2")

Note that the username of the developer is also included. In this case, the hadley name belongs to the developer of ggplot2, Hadley Wickham. Throughout the book, you will notice that this name appears several times. Hadley is a prolific and competent developer of several R packages and currently works for RStudio.

2.19.3 Loading Packages

Within a script, use function library to load a package, as in the following example.

# load package quantmod
library(quantmod)

After running this command, all functions of the package will be available to the user. In this case, it is not necessary to use " " to load the package. If the package you want to use is not available, R will throw an error message. See an example next, where we try to load a non-existing package called unicorn.

library(unicorn)
## Error in library(unicorn): there is no package called 'unicorn'

Remember this error message. It will appear every time a package is not found. If you got the same message when running code from this book, you need to check what are the required packages of the example and install them using install.packages, as in install.packages('unicorn').

If you use a specific package function and do not want to load all functions from the package, you can do it through the special symbol ::, as in the following example.

# example of using a function without loading package
fortunes::fortune(10)
##
## Overall, SAS is about 11 years behind R and S-Plus in
## statistical capabilities (last year it was about 10 years
## behind) in my estimation.
##    -- Frank Harrell (SAS User, 1969-1991)
##       R-help (September 2003)

In this case, we use the function fortune from the package fortunes, which shows on screen a potentially funny phrase chosen from the R mailing list. For our example, we selected message number 10. One interesting use of the package fortunes is to display a different message every time R starts. As mentioned before, you can find many tutorials on how to achieve this effect by searching on the web for “customizing R startup”.
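The same :: syntax works for any installed package, including those that come with R. As a small sketch that does not require installing anything, we can call functions from packages stats and utils by their full address:

```r
# call sd from package stats using its full address
my.sd <- stats::sd(c(1, 2, 3))
print(my.sd)
## [1] 1

# call head from package utils in the same way
my.head <- utils::head(letters, 3)
print(my.head)
## [1] "a" "b" "c"
```

Writing the package name explicitly also removes any ambiguity when two loaded packages have functions with the same name.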

Another way of loading a package is using the require function. A call to require has a different behaviour than a call to library. When using library, if the package is not found in the local libraries, it returns an error. This means that the script stops and no further code is evaluated. As for require, if a package is not found, it returns an object with value FALSE and the rest of the code is evaluated. So, in order to avoid code being executed without its explicit dependencies, it is advised to always use library for loading packages in scripts.
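The difference between both functions can be seen in the following sketch. The package name notARealPackage123 is hypothetical and assumed not to exist on your computer. The argument character.only = TRUE is used so that the package name can be passed as a text object, and tryCatch is used only to capture the error from library, so that the example itself does not stop.

```r
# hypothetical package name, assumed not to exist
pkg.name <- 'notARealPackage123'

# require returns FALSE and the script keeps running
# (any warning message is suppressed here)
out.require <- suppressWarnings(
  require(pkg.name, character.only = TRUE, quietly = TRUE)
)
print(out.require)
## [1] FALSE

# library throws an error; here we catch it to inspect the message
out.library <- tryCatch(
  library(pkg.name, character.only = TRUE),
  error = function(e) conditionMessage(e)
)
print(out.library)
```

Without tryCatch, the call to library would stop the script at that line, which is the desired behaviour when a dependency is missing.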

The use of require is left for loading up packages inside of functions. If you create a custom function that requires procedures from a particular package, you must load the package within the scope of the function. For example, see the following code, where we create a new function called my.fct that depends on the package quantmod:

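my.fct <- function(x){
  # load package quantmod (provides function getSymbols)
  require(quantmod)

  # get the data for ticker x as an object and return it
  df <- getSymbols(x, auto.assign = FALSE)
  return(df)
}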

In this case, the first time that my.fct is called, it loads up the package quantmod and all of its functions. Using require inside a function is good programming policy because the function becomes self-contained, making it easier to use in the future. This is the first time that the complete definition of a function in R is presented. Do not worry about it now. We will explain it further in chapter 8.
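Since require only returns FALSE when a package is missing, a function can also check its dependency explicitly and stop with a clear message. The code below is a sketch of this variation and not the approach used in the rest of this book. The function my.median is hypothetical, and package stats stands in for quantmod so that the example runs on any R installation.

```r
# hypothetical function that checks its dependency before using it
my.median <- function(x) {
  # stop with a clear message if the package is not installed
  if (!requireNamespace('stats', quietly = TRUE)) {
    stop('Package stats is required but is not installed.')
  }

  # call the function by its full address
  out <- stats::median(x)
  return(out)
}

# try it out
print(my.median(c(1, 3, 2)))
## [1] 2
```

Here, requireNamespace checks that the package is available without attaching all of its functions, and the :: symbol is used to call the one function that is needed.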

2.19.4 Upgrading Packages

Over time, it is natural that packages available on CRAN are upgraded to accommodate new features, correct bugs and adapt to changes. Thus, it is recommended that users update their installed packages to a new version over the internet. In R, this procedure is quite easy. A direct way of upgrading packages is to click the button update located in the package panel, lower right corner of RStudio, as shown in Figure 2.8.

The user can also update packages through the prompt. Simply type the command update.packages() and press enter, as shown below.

# update all installed packages
update.packages()

The command update.packages compares the versions of the installed packages with the versions available in CRAN. If it finds any difference, the new versions are downloaded and installed. In an interactive session, R asks for confirmation before updating each package. After running the command, all packages will be synchronized with the versions available in CRAN.
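Before upgrading, it can be useful to check which version of a package is currently installed. The sketch below uses packageVersion for the package stats, which is shipped with R. The lines that contact CRAN are left as comments because they need an internet connection, and the repository address in them is only an assumption; any CRAN mirror works.

```r
# check the installed version of a package
my.ver <- packageVersion("stats")
print(my.ver)

# list installed packages with a newer version in CRAN (needs internet)
# my.old <- old.packages(repos = "https://cloud.r-project.org")

# update all packages without asking for confirmation (needs internet)
# update.packages(ask = FALSE, repos = "https://cloud.r-project.org")
```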

2.20 Using Code Completion with tab

A very useful feature of RStudio is code completion. This is an editing tool that facilitates the search for names of objects, packages, function arguments and files. Its usage is very simple. After you type the first character of a name, just press the tab key (left side of the keyboard, above capslock) and a number of options will appear. See Figure 2.9 where, after entering the letter f and pressing tab, a window appears with a list of object names that begin with that letter.

This also works for packages. To check it, type library(r) in the prompt or editor, place the cursor in between the parentheses and press tab. The result should look something like Figure 2.10, shown next.

Note that a description of the package or object is also offered by the code completion tool. This greatly facilitates the day to day work, as memorizing the names of packages and R objects is not an easy task. Using tab decreases the time spent looking up names and also helps to avoid coding errors.

The use of this tool becomes even more beneficial when objects and functions are named with some sort of pattern. In the rest of the book, you will notice that objects tend to be named with the prefix my., as in my.x and my.num. Using this naming rule (or any other) facilitates the lookup of names of objects created by the user. You can just type my., press tab, and a list of all objects previously created by the user will appear.

You can also find files and folders on your computer using tab. To try it, write the command my.file <- "" in the prompt or a script, point the cursor to the middle of the quotes and press the tab key. A screen with the files and folders from the current working directory should appear, as shown in Figure 2.11. You can use the keyboard arrow keys to navigate.

Autocomplete can also be used to find the names and descriptions of function arguments. To try it out, write cat() and place the cursor inside the parentheses. After that, press tab. The result should be similar to Figure 2.12. By using tab inside of a function, we have the names of all arguments and their description. This is the same information found in the help files.

Summing up, using code completion will make you more productive. You’ll find names of files, objects, arguments and packages much faster. Use it whenever possible.

2.21 Interacting with Files and the Operating System

In many data analysis situations, it will be necessary to interact with files in the computer, either by creating new folders, decompressing and compressing files, listing and removing files from the hard drive of the computer or any other type of operation. In most cases, R will interact with files containing data.

2.21.1 Listing Files and Folders

To list files from your computer, use the function list.files, where the path argument sets the directory to list the files from. For the compilation of the book, I’ve created a directory called data. This folder contains all the data needed to recreate the book’s examples. You can check the files in the subfolder data with the following code:

# list files in data folder
my.f <- list.files(path = "data", full.names = TRUE)
print(my.f)
##  [1] "data/AdjustedPrices-InternacionalIndices.RDATA"
##  [2] "data/BovStocks_2011-12-01_2016-11-29.csv"
##  [3] "data/BovStocks_2011-12-01_2016-11-29.RData"
##  [4] "data/example_gethfdata.RDATA"
##  [5] "data/FileWithLatinChar.txt"
##  [6] "data/grunfeld.csv"
##  [7] "data/HFData_6_Assets_15 min.RData"
##  [8] "data/HFData.csv"
##  [9] "data/MktIndices_and_Symbols.csv"
## [10] "data/MySQLiteDatabase.SQLITE"
## [11] "data/SP500_2011-11-13_2016-11-11.csv"
## [12] "data/SP500.csv"
## [13] "data/SP500-Excel.xlsx"
## [14] "data/SP500-Stocks_long.csv"
## [15] "data/SP500-Stocks_wide.csv"
## [16] "data/SP500-Stocks-WithRet.RData"
## [17] "data/TDData.csv"
## [18] "data/temp.csv"
## [19] "data/temp.RData"
## [20] "data/temp.txt"
## [21] "data/temp.xlsx"
## [22] "data/temp_xts.RData"

There are several files with different extensions in this directory. These files contain data that will be used in future chapters. When using list.files, it is recommended to set input full.names as TRUE. This option makes sure that the names returned by the function contain the full path of the found files. This facilitates further manipulation, such as reading and importing information from data files. It is worth noting that you can also list the files recursively, that is, list all files from all subfolders contained in the original address. To check it, try using the following code on your computer:

# list all files for all subfolders (IT MAY TAKE SOME TIME...)
list.files(path = getwd(), recursive = TRUE, full.names = TRUE)

The previous command will list all files in the current folder and its subfolders. Depending on the current working directory, it may take some time to finish. If you executed it, be patient or just cancel it by pressing esc.
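The effect of inputs full.names and recursive is easier to see in a small, controlled folder. The following sketch creates a temporary directory with one subfolder and two empty files. All folder and file names used here are hypothetical and only serve the example.

```r
# create a temporary folder with one subfolder and two empty files
my.tmp <- file.path(tempdir(), "list-files-example")
dir.create(file.path(my.tmp, "sub"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(my.tmp, "a.csv")))
invisible(file.create(file.path(my.tmp, "sub", "b.csv")))

# names only: files and folders found directly in my.tmp
my.f.short <- list.files(path = my.tmp)
print(my.f.short)
## [1] "a.csv" "sub"

# full paths, which can be passed directly to other functions
my.f.full <- list.files(path = my.tmp, full.names = TRUE)
print(file.exists(my.f.full))
## [1] TRUE TRUE

# recursive search: files from all subfolders, folders themselves omitted
my.f.rec <- list.files(path = my.tmp, recursive = TRUE)
print(my.f.rec)
## [1] "a.csv"     "sub/b.csv"
```

Note that, without full.names = TRUE, the returned names are relative to the path input and not to the working directory, so a later call that reads the files would fail unless the working directory is my.tmp.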

To list folders (directories) on your computer, use the command list.dirs. See below.

# store names of directories
my.dirs <- list.dirs(recursive = FALSE)

# print it
print(my.dirs)
##  [1] "./_book"
##  [2] "./_bookdown_files"
##  [3] "./data"
##  [4] "./eqs"
##  [5] "./fig_ggplot"
##  [6] "./figs"
##  [7] "./ftp files"
##  [8] "./latex_files"
##  [9] "./many_datafiles"
## [10] "./ProcAnFinDataR_ed_1_cache"
## [11] "./ProcAnFinDataR_ed_1_files"
## [12] "./Removed chapters"
## [13] "./.Rproj.user"
## [14] "./Scripts"
## [15] "./tabs"

The command list.dirs, called with recursion disabled, listed all directories of the current path without entering their subfolders. The output shows the directories that I have used to write this book. It includes the output directory of the book (./_book) and the directory with the data (./data), among others. In this same directory, you can find the chapters of the book, organized by files and based on the RMarkdown language (.Rmd file extension). To list only files with the extension .Rmd, we can use the pattern input in function list.files as follows:

# list all files with extension .Rmd
list.files(pattern = "\\.Rmd$")
##  [1] "00-Preface.Rmd"
##  [2] "01-Introduction.Rmd"
##  [3] "02-BasicOperations.Rmd"
##  [4] "03-BasicObjects.Rmd"
##  [5] "04-DataStructureObjects.Rmd"
##  [6] "05-Financial-data-and-common-operations.Rmd"
##  [7] "06-ImportingExportingLocal.Rmd"
##  [8] "07-ImportingInternet.Rmd"
##  [9] "08-Programming.Rmd"
## [10] "09-Figures.Rmd"
## [11] "10-Models.Rmd"
## [12] "11-ResearchScripts.Rmd"
## [13] "12-references.Rmd"
## [14] "index.Rmd"
## [15] "ProcAnFinDataR_ed_1.Rmd"
## [16] "_Welcome.Rmd"

The files presented above contain all the contents of this book, including this specific paragraph, located in file 02-BasicOperations.Rmd!
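One detail about input pattern is that list.files interprets it as a regular expression, not as a wildcard of the type used in file managers. The sketch below, based on a temporary folder with hypothetical file names, shows a regular expression that only matches names ending in .Rmd. It also uses function glob2rx, from package utils, to translate a wildcard into the equivalent regular expression.

```r
# create a temporary folder with three empty files
my.tmp <- file.path(tempdir(), "pattern-example")
dir.create(my.tmp, showWarnings = FALSE)
my.names <- c("notes.Rmd", "notes.Rmd.bak", "Rmd-list.txt")
invisible(file.create(file.path(my.tmp, my.names)))

# regular expression: a literal dot, followed by Rmd, at the end of the name
my.rmd <- list.files(path = my.tmp, pattern = "\\.Rmd$")
print(my.rmd)
## [1] "notes.Rmd"

# translate a wildcard into a regular expression and use it
my.rx <- glob2rx("*.Rmd")
my.rmd.glob <- list.files(path = my.tmp, pattern = my.rx)
print(my.rmd.glob)
## [1] "notes.Rmd"
```

Escaping the dot and anchoring the expression with $ avoids matching files such as notes.Rmd.bak, where the text .Rmd appears in the middle of the name.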
