TDM 10100: Project 2 — 2022
The R[1] environment is a powerful tool to perform data analysis. R’s language is often compared to Python. Both languages have their advantages and disadvantages, and both are worth learning.
In this project we will dive in head first and learn some of the basics while solving data-driven problems.
5 basic types of data
-
Values like 1.5 are called numeric values, real numbers, decimal numbers, etc.
-
Values like 7 are called integers or whole numbers.
-
Values TRUE or FALSE are called logical values or Boolean values.
-
Texts consist of sequences of words (also called strings), and words consist of sequences of characters.
-
Values such as 3 + 2i[2] are called complex numbers. We usually do not encounter these in The Data Mine.
R and Python both have their advantages and disadvantages. A key part of learning data science methods is to understand the situations in which R is a more helpful tool to use, or Python is a more helpful tool to use. Both of them are good for their own purposes. In a similar way, hammers and screwdrivers and drills and many other tools are useful for construction, but they all have their own individual purposes. |
Context: In the last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some examples.
In this project, we will continue to use R within Jupyter Lab to solve problems. Soon, you will see how powerful R is and why it is often more effective than using spreadsheets as a tool for data analysis.
Scope: r, vectors, lists, indexing
Dataset(s)
The following questions will use the following dataset(s):
myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1995.csv", stringsAsFactors = TRUE)
ONE
The data that we may be working with does not always come to us neat and clean[3]. It is important to get a good understanding of the dataset(s) with which you are working. This is the best first step to help solve any data-driven problems.
Insider Knowledge
Datasets can be thought or as one or more observations of one or more variables. For most datasets, each row is an observation and each column is a variable. (There may be some datasets do not follow that convention.)
We are going to use the read.csv
function to load our datasets into a dataframe named …
We want to use functions such as head
, tail
, dim
, summary
, str
, class
, to get a better understanding of our dataframe(DF).
Helpful Hints
#looks at the head of the dataframe
head(myDF)
#looks at the tail of the dataframe
tail(myDF)
#returns the type of data in a column of the dataframe, for instance, the type of data in the column that stores the destination airports of the flights
class(myDF$Dest)
-
How many columns does this dataframe have?
-
How many rows does this dataframe have?
-
What type/s of data are in this dataframe (example: numerical values, and/or text strings, etc.)
-
Code used to solve this problem.
-
Output from running the code.
-
The answers to the three questions above
TWO
We can create a new vector[4] containing all of the origin airports (i.e., the airports where the flights departed) from the column myDF$Origin
of the data frame myDF
.
#takes the selected information from the dataframe and puts it into a new vector called `myairports`
myairports <- myDF$Origin
Insider Knowledge
A vector is a simple way to store a sequence of data. The data can be numeric data, logical data, textual data, etc.
To assist with this question, please also see the end of the video from Question 1 (above).
-
What type of data is in the vector
myairports
? -
The vector
myairports
contains all of the airports where flights departed in 1995. Print the first 250 of those airports. [Do not print all of the airports, because there are 5327435 such values!] How many of the first 250 flights departed from O’Hare? -
How many flights departed by O’Hare altogether in 1995?
-
Code used to solve this problem.
-
Output from running the code.
-
The answers to the 3 questions above.
THREE
Indexing
Insider Knowledge
Accessing data can be done in many ways, one of those ways is called indexing. Typically we use brackets [ ] when indexing. By doing this we can select or even exclude specific elements. For example we can select a specific column and a certian range within the column. Some examples of symbols to help us select elements include:
* < less than
* > greater than
* ⇐ less than or equal to
* >= greater than or equal to
* == is equal
* != is not equal
It is also important to note that indexing in R begins at 1. (This means that the first row of the dataframe will be numbered starting at 1.)
Helpful Hints
#finding data by their indices
myDF$Distance[row_index_start:row_index_end,]
#creates a new vector with the specific info
mynewvector <- myDF$putcolumnnamehere
#all of the data from row 3
myDF[3,]
#all of the data in all of the rows, with columns between myfirstcolumn and mylastcolumn
myDF[,myfirstcolumn:mylastcolumn]
#and/or
#the first 250 values from column 17
head(myDF[,17], n=250)
#puts all variables that are less than 6 from the dataframe
longdistances = myDF$Distance[myDF$Distance > 2000]
-
How many flights departed from Indianapolis (
IND
) in 1995? How many flights landed there? -
Consider the flight data from row 894 the data frame. What airport did it depart from? Where did it arrive?
-
How many flights have a distance of less than 200 miles?
-
Code used to solve this problem.
-
Output from running the code.
-
The answers to the 3 questions above.
FOUR
Summarizing vectors using tables
The table
command is helpful to know, for summarizing large quantities of data.
Insider Knowledge
It is useful to use functions in R and see how they behave, and then to take a function of the result, and take a function of that result, etc. For instance, it is common to summarize a vector in a table, and then sort the results, and then take the first few largest or smallest values. Remember also that R is a case-sensitive language.
table(myDF$Origin) # summarizes how many flights departed from each airport
sort(table(myDF$Origin)) # sorts those results in numeric order
tail(sort(table(myDF$Origin)),n=10) # finds the 10 most popular airports, according to the number of flights that departed from each airport.
-
Rank the airline companies (in the column
myDF$UniqueCarrier
) according to their popularity, i.e., according to the number of flights on each airline). -
Which are the three most popular airlines from 1995?
-
Now find the ten airplanes that had the most flights in 1995. List them in order, from most popular to least popular. Do you notice anything unusual about the results?
-
Code used to solve this problem.
-
Output from running the code.
-
The answers to the 3 questions above.
FIVE
Basic graph types are helpful for visualizing data. They can be an important tool in discovering insights into the data you are working with.
R has a number of tools built in for basic graphs, such as scatter plots, bar charts, histograms, etc.
Insider Knowledge
A dot plot, also known as a dot chart, is similar to a bar chart or a scatter plot. In R, the categories are displayed along the vertical axis and the corresponding values are displayed according to the horizontal axis.
We can assign groups a color to help differentiate while plotting a dot chart
We can also plot a column that we find interesting as well to take a look at what the data might show us.
For example if we wanted to see if there was a difference in days of the week and number of flights, we would use hist
.
mydays<- myDF$DayOfWeek
hist(mydays)
Helpful Hints
mycities <- tail(sort(table(myDF$Origin)),n=10)
dotchart(mycities, pch = 21, bg = "green", pt.cex = 1.5)
-
Pick a column of data that you are interested in studying, or a question that you want answered. Create either a
plot
, or adotchart
. Before making the plot, think about how many dots will be displayed on yourplot
ordotchart
. If you try to display millions of dots, you might cause your Jupyter Lab session to freeze or crash. It is useful to think ahead and to consider how your plot might look, before you accidentally try to display millions of dots. -
Descibe any patterns you may see in your plot or your dotchart. If there are none, that is okay, and you can just write "there seem to be no patterns."
-
Code used to solve this problem.
-
Output from running the code.
-
The plot or dotchart and your commentary about what you created and what you observed.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |