STAT 39000: Project 1 — Fall 2020
Motivation: In this project we will jump right into an R review by breaking one larger data-wrangling problem into discrete parts. There is a slight emphasis on writing functions and dealing with strings. At the end of this project we will have greatly simplified a dataset, making it easy to dig into.
Context: We just started the semester and are digging into a large dataset, and in doing so, reviewing R concepts we’ve previously learned.
Scope: data wrangling in R, functions
Make sure to read about and use the template found here, and read the important information about project submissions here.
You can find useful examples that walk you through relevant material in The Examples Book:
It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
It is highly recommended that you use rstudio.scholar.rcac.purdue.edu/. Simply click on the link and log in using your Purdue account credentials.
We decided to move away from ThinLinc and away from the version of RStudio used last year (desktop.scholar.rcac.purdue.edu). That version of RStudio is known to have some strange issues when running code chunks.
Remember the very useful documentation shortcut ?. To use, simply type ? in the console, followed by the name of the function you are interested in.
You can also look for package documentation by using help(package=PACKAGENAME), so for example, to see the documentation for the package ggplot2, we could run:
help(package=ggplot2)
Sometimes it can be helpful to see the source code of a defined function. A function is any chunk of organized code that is used to perform an operation. Source code is the underlying R, C, or C++ code that is used to create the function. To see the source code of a defined function, type the function’s name without the (). For example, if we were curious about what the function Reduce does, we could run:
Reduce
Occasionally this will be less useful, as the resulting code will be code that calls C code we can’t see. Other times it will allow you to understand the function better.
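As a small illustration of the point above, the sketch below defines a toy function (square, a made-up name that is not part of the project) and inspects it. It also shows what happens for a function such as sum, which is implemented in C, so R has no R-level source to print.

```r
# A toy function, defined only for this illustration.
square <- function(x) {
  x^2
}

# Typing the name without () prints the R source of the function.
square

# The source is also available as text, one element per line.
src <- deparse(square)
src

# Some base functions are implemented in C, so only a stub is shown.
sum
is.primitive(sum)
```

For functions like sum, the printed stub is all that the console can show; the C code lives in the R sources.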
Dataset:
/class/datamine/data/airbnb
Often (maybe even the majority of the time) data doesn’t come in one nice file or database. Explore the datasets in /class/datamine/data/airbnb.
Questions
Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like
Question 1
You may have noted that, for each country, city, and date we can find 3 files: calendar.csv.gz, listings.csv.gz, and reviews.csv.gz (for now, we will ignore all files in the "visualisations" folders).
Let’s take a look at the data in each of the three types of files. Pick a country, city, and date, and read the first 50 rows of each of the 3 datasets (calendar.csv.gz, listings.csv.gz, and reviews.csv.gz). Provide 1-2 sentences explaining the type of information found in each, and what variable(s) could be used to join them.
Depending on the country that you pick, the listings and/or the reviews might not display properly in RMarkdown, so you do not need to display the first 50 rows of the listings and/or reviews in your RMarkdown document. It is OK to just display the first 50 rows of the calendar entries.
To read a compressed csv, simply use the read.csv function:
dat <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz")
head(dat)
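The call above reads the whole file. As a hedged sketch of one way to read only the first 50 rows, read.csv accepts an nrows argument. The example builds a small compressed file in a temporary location so that it runs anywhere; the toy columns (id, price) are made up and are not taken from the Airbnb files.

```r
# Build a small compressed CSV in a temporary location (toy data only).
toy_path <- tempfile(fileext = ".csv.gz")
toy <- data.frame(id = 1:200, price = round(runif(200, 20, 300), 2))
con <- gzfile(toy_path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# nrows limits how many data rows read.csv parses; the header is not counted.
first_50 <- read.csv(toy_path, nrows = 50)
dim(first_50)
```

Limiting nrows also keeps the read fast on large files, since parsing stops after the requested rows.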
Let’s work towards getting this data into an easier format to analyze. From now on, we will focus on the listings.csv.gz datasets.
- Chunk of code used to read the first 50 rows of each dataset.
- 1-2 sentences briefly describing the information contained in each dataset.
- Name(s) of variable(s) that could be used to join them.
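To make the idea of a join variable concrete, here is a minimal sketch using merge on two toy tables. The table and column names (homes, bookings, home_id, home_ref) are hypothetical and deliberately do not match the Airbnb files; identifying the real key columns is part of the question.

```r
# Toy tables with hypothetical column names (not taken from the Airbnb files).
homes <- data.frame(home_id = c(1, 2, 3),
                    city = c("A", "B", "C"),
                    stringsAsFactors = FALSE)
bookings <- data.frame(home_ref = c(1, 1, 3),
                       night = c("2020-01-01", "2020-01-02", "2020-01-01"),
                       stringsAsFactors = FALSE)

# by.x / by.y name the key column in each table when the names differ.
joined <- merge(bookings, homes, by.x = "home_ref", by.y = "home_id")
joined
```

By default merge keeps only rows whose key appears in both tables, so the home with no bookings is dropped here.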
Question 2
Write a function called get_paths_for_country that, given a string with the country name, returns a vector with the full paths for all listings.csv.gz files, starting with /class/datamine/data/airbnb/….
For example, the output from get_paths_for_country("united-states") should have 28 entries. Here are the first 5 entries in the output:
[1] "/class/datamine/data/airbnb/united-states/ca/los-angeles/2019-07-08/data/listings.csv.gz"
[2] "/class/datamine/data/airbnb/united-states/ca/oakland/2019-07-13/data/listings.csv.gz"
[3] "/class/datamine/data/airbnb/united-states/ca/pacific-grove/2019-07-01/data/listings.csv.gz"
[4] "/class/datamine/data/airbnb/united-states/ca/san-diego/2019-07-14/data/listings.csv.gz"
[5] "/class/datamine/data/airbnb/united-states/ca/san-francisco/2019-07-08/data/listings.csv.gz"
- Chunk of code for your get_paths_for_country function.
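One possible building block for this question is list.files with recursive = TRUE, sketched below on a tiny directory tree created in a temporary folder. This is only an illustration under stated assumptions: the helper name find_listing_paths, the base argument, and the folder names are all hypothetical, and this is not necessarily the intended solution.

```r
# Build a tiny directory tree in a temporary folder (hypothetical layout and names).
base_dir <- file.path(tempdir(), "listing-demo")
dirs <- file.path(base_dir, "some-country",
                  c("aa/city-one/2019-01-01/data", "bb/city-two/2019-02-01/data"))
for (d in dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(d, c("calendar.csv.gz", "listings.csv.gz", "reviews.csv.gz")))
}

# Hypothetical helper: all files named exactly listings.csv.gz below one country folder.
find_listing_paths <- function(country, base = base_dir) {
  list.files(file.path(base, country),
             pattern = "^listings\\.csv\\.gz$",
             recursive = TRUE,
             full.names = TRUE)
}

paths <- find_listing_paths("some-country")
paths
```

The pattern argument is a regular expression, so the dots are escaped and the name is anchored with ^ and $ to avoid matching other files.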
Question 3
Write a function called get_data_for_country that, given a string with the country name, returns a data.frame containing all of the listings data for that country. Use your previously written function to help you.
- Chunk of code for your get_data_for_country function.
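A common pattern for combining several files into one data.frame is to read each file with lapply and stack the results with rbind. The sketch below uses two small temporary files with made-up columns. It assumes every file has the same columns, which rbind requires.

```r
# Write two small CSV files with identical columns (toy data, hypothetical names).
paths <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
write.csv(data.frame(id = 1:3, city = "one"), paths[1], row.names = FALSE)
write.csv(data.frame(id = 4:5, city = "two"), paths[2], row.names = FALSE)

# Read each file into a list of data.frames, then stack them row-wise.
pieces <- lapply(paths, read.csv, stringsAsFactors = FALSE)
combined <- do.call(rbind, pieces)
dim(combined)
```

Passing stringsAsFactors = FALSE keeps text columns as character in every piece, which avoids mismatched factor levels when the pieces are stacked.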
Question 4
Use your get_data_for_country function to get the data for a country of your choice, and make sure to name the data.frame listings. Take a look at the following columns: host_is_superhost, host_has_profile_pic, host_identity_verified, and is_location_exact. What is the data type for each column? (You can use class or typeof or str to see the data type.)
These columns would make more sense as logical values (TRUE/FALSE/NA).
Write a function called transform_column that, given a column containing lowercase "t"s and "f"s, transforms it to logical (TRUE/FALSE/NA) values. Note that NA values for these columns appear as blank (""), and we need to be careful when transforming the data. Test your function on the column host_is_superhost.
- Chunk of code for your transform_column function.
- Type of transform_column(listings$host_is_superhost).
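The sketch below shows one way such a conversion could work on a toy vector. The helper name to_logical is hypothetical, and treating every value other than "t" or "f" as NA is an assumption made for this illustration.

```r
# Hypothetical helper: map "t" -> TRUE, "f" -> FALSE, anything else -> NA.
to_logical <- function(x) {
  x <- as.character(x)          # also handles factor input
  out <- rep(NA, length(x))     # logical vector of NAs
  out[x %in% "t"] <- TRUE
  out[x %in% "f"] <- FALSE
  out
}

flags <- c("t", "f", "", "t", NA)
converted <- to_logical(flags)
converted
class(converted)
```

Using %in% rather than == means that missing inputs never produce NA in the index, so blanks and NA values simply stay NA in the result.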
Question 5
Create a histogram for response rates (host_response_rate) for super hosts (where host_is_superhost is TRUE). If your listings do not contain any super hosts, load data from a different country. Note that we first need to convert host_response_rate from a character containing "%" signs to a numeric variable.
- Chunk of code used to answer the question.
- Histogram of response rates for super hosts.
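As a rough sketch of the conversion step, the example below strips the percent sign from a toy character vector and bins the result. The values, and the assumption that missing rates appear as "N/A" or "", are illustrative only; check how missing values are actually coded in your data.

```r
# Toy response rates; the "N/A" and "" entries are assumed examples of missing values.
rates_chr <- c("100%", "95%", "N/A", "", "80%")

# Mark the assumed missing-value codes as NA, then strip the percent sign.
rates_chr[rates_chr %in% c("N/A", "")] <- NA
rates <- as.numeric(sub("%", "", rates_chr, fixed = TRUE))
rates

# plot = FALSE returns the bin counts without opening a graphics device;
# drop that argument to draw the histogram.
h <- hist(rates, breaks = seq(0, 100, by = 10), plot = FALSE)
h$counts
```

Setting the missing-value codes to NA before calling as.numeric avoids the coercion warning that non-numeric strings would otherwise trigger.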