# HomeWork01old

Hacker Dojo Machine Learning

Homework 1

Mike Bowles, PhD & Patricia Hoffman, PhD.

1) This question uses the data at myfirstdata.csv

a) Read the data into R using data<-read.csv("myfirstdata.csv",header=FALSE). Note, you first need to specify your working directory using the setwd() command. Determine whether each of the two attributes (columns) is treated as qualitative (categorical) or quantitative (numeric) using R. Explain how you can tell using R. (An example of how to use setwd() is given in lecture1.r.)

b) What is the specific problem that causes one of these two attributes to be read in as qualitative (categorical) when it seems it should be quantitative (numeric)? (Remember how to ask for help: in the R console, type ?is.factor.)
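The type checks in parts a and b can be tried on a toy example first. The sketch below is not the homework data: csv_text and toy are made-up names, and the values are invented. It also assumes you want the older behavior in which text columns become factors. Since R 4.0.0, read.csv() defaults to stringsAsFactors=FALSE, so a column containing text is read as character unless you ask for factors explicitly.

```r
# Hypothetical two-column data, inlined so the sketch runs without any file.
csv_text <- "1,10\n2,20\n3,n/a\n4,40"

# stringsAsFactors = TRUE reproduces the pre-4.0.0 default of read.csv().
toy <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = TRUE)

str(toy)             # one line per column, showing its storage type
sapply(toy, class)   # class of each column
is.factor(toy[, 2])  # TRUE when the column is treated as categorical
levels(toy[, 2])     # the distinct values R found in that column
```

Looking at levels() is usually the quickest way to spot the entries that kept a column from being read as numeric.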

c) Use the command plot() in R to make a plot for each column by entering plot(data[,1]) and plot(data[,2]). Because one variable is read in as quantitative (numeric) and the other as qualitative (categorical) these two plots are showing completely different things by default. Explain exactly what is being plotted in each of the two cases. Include these two plots in your homework.

d) Read the data into Excel. Excel should have no problem opening the file directly since it is .csv. Create a new column that is equal to the second column plus 10. What is the result for the problem observations (rows) you identified in part b? What specific outcome does Excel display?

2) This question uses the data at twomillion.csv

a) Read the data into R using data<-read.csv("twomillion.csv",header=FALSE). Note, you first need to specify your working directory using the setwd() command. Extract a simple random sample with replacement of 10,000 observations (rows). Show your R commands for doing this.
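One possible way to draw a simple random sample of rows with replacement is to sample row indices with sample(). The sketch below uses a made-up stand-in data frame (toy) rather than twomillion.csv, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Stand-in for the real data: one numeric column, 50,000 rows (made up).
toy <- data.frame(V1 = rnorm(50000))

# Draw 10,000 row indices uniformly at random, with replacement.
idx <- sample(nrow(toy), size = 10000, replace = TRUE)

# drop = FALSE keeps the result a data frame even with a single column.
toy_sample <- toy[idx, , drop = FALSE]

nrow(toy_sample)
```

Because replace = TRUE, the same row may appear more than once in the sample.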

b) For your sample, use the functions mean(), max(), var() and quantile(,.25) to compute the mean, maximum, variance and 1st quartile respectively. Show your R code and the resulting values.
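As a syntax reminder, the four functions can be called on any numeric vector. The vector x below is made up for illustration. Note that quantile() supports several interpolation rules (its type argument, default 7), so other software may report a slightly different quartile for the same data.

```r
x <- c(3, 1, 4, 1, 5, 9, 2, 6)  # made-up values, not the homework data

mean(x)            # arithmetic mean
max(x)             # largest value
var(x)             # sample variance (divides by n - 1)
quantile(x, 0.25)  # 1st quartile; with the default type = 7 this is 1.75
```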

c) Compute the same quantities in part b on the entire data set and show your answers. How much do they differ from your answers in part b?

d) Save your sample from R to a csv file using the command write.csv(). Then open this file with Excel and compute the mean, maximum, variance and 1st quartile. Provide the values and name the Excel functions you used to compute these.
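A minimal sketch of write.csv(), assuming a small made-up data frame and a hypothetical file name in a temporary directory. Setting row.names = FALSE leaves out the extra column of row labels that write.csv() would otherwise add, which keeps the file to a single data column when it is opened in a spreadsheet.

```r
toy_sample <- data.frame(V1 = c(2.5, 3.1, 4.7))  # made-up values

# Hypothetical output location; in practice, use your working directory.
out_file <- file.path(tempdir(), "toy_sample.csv")

write.csv(toy_sample, out_file, row.names = FALSE)

readLines(out_file)  # header line followed by one line per row
```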

e) Exactly what happens if you try to open the full data set with Excel?

3) This question uses a sample of 1,500 California house prices at CA_house_prices.csv
and a sample of 10,000 Ohio house prices at OH_house_prices.csv. Download both data sets to your computer. Note that the house prices are in thousands of dollars.

a) Use R to produce a single graph displaying a boxplot for each set (as in ICE #16). Include the R commands and the plot. Put your name in the title of the plot (for example, main="Britney Spears' Boxplots").
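One way to get several boxplots on a single graph is to pass boxplot() a named list, with one element per group. The vectors ca and oh below are made-up prices in thousands of dollars, not the homework data.

```r
ca <- c(250, 480, 620, 900, 1200, 1750, 3100)  # made-up values
oh <- c(60, 85, 110, 130, 155, 210, 400)       # made-up values

# The list names become the labels under each box.
b <- boxplot(list(California = ca, Ohio = oh),
             main = "Your Name's Boxplots",
             ylab = "Price (thousands of dollars)")
```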

b) Use R to produce a frequency histogram for only the California house prices. Use intervals of width \$500,000 beginning at 0 and ending at \$3.5 million. Include the R commands and the plot. Put your name in the title of the plot.
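Interval boundaries for hist() can be set with the breaks argument. The sketch below uses made-up prices (ca) and assumes the values are stored in thousands of dollars, so an interval of width \$500,000 corresponds to a step of 500.

```r
ca <- c(250, 480, 620, 900, 1200, 1750, 3100)  # made-up values, in thousands

# Breaks at 0, 500, ..., 3500; every value must fall inside this range.
h <- hist(ca,
          breaks = seq(0, 3500, by = 500),
          main = "Your Name's Histogram",
          xlab = "Price (thousands of dollars)")

h$counts  # number of observations in each interval
```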

c) Use R to plot the ECDF of the California houses and Ohio houses on the same graph (as in ICE #11). Include a legend. Include the R commands and the plot. Put your name in the title of the plot.
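A rough sketch of overlaying two ECDFs: plot the first with plot(ecdf(...)), add the second with lines(), then call legend(). The data, colors and legend position are arbitrary choices for illustration.

```r
ca <- c(250, 480, 620, 900, 1200, 1750, 3100)  # made-up values
oh <- c(60, 85, 110, 130, 155, 210, 400)       # made-up values

# xlim spans both samples so neither curve is cut off.
plot(ecdf(ca), verticals = TRUE, do.points = FALSE, col = "blue",
     xlim = range(c(ca, oh)),
     main = "Your Name's ECDFs",
     xlab = "Price (thousands of dollars)",
     ylab = "Cumulative proportion")
lines(ecdf(oh), verticals = TRUE, do.points = FALSE, col = "red")

legend("bottomright", legend = c("California", "Ohio"),
       col = c("blue", "red"), lty = 1)
```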

4) This question uses the data at football.csv
Download it to your computer. This data set gives the total number of wins for each of the 117 Division 1A college football teams for the 2003 and 2004 seasons.

a) Use plot() in R to make a scatter plot for this data with 2003 wins on the x-axis and 2004 wins on the y-axis. Use the range 0 to 12 for both the x-axis and y-axis. Include the R commands and the plot. Put your name in the title of the plot.

b) Why are there fewer than 117 points visible on your graph in part a? Describe the solution we discussed in class to deal with this problem (but don't actually do it).

c) Compute the correlation in R using the function cor().

d) How does the value in part c change if you add 10 to all the values for 2004?

e) How does the value in part c change if you multiply all the 2004 values by 2?

f) How does the value in part c change if you multiply all the 2004 values by -2?
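Parts d through f rest on a general property of the Pearson correlation: it is unchanged when a constant is added to one variable or when that variable is multiplied by a positive constant, and it changes sign when the multiplier is negative. The sketch below checks this on made-up vectors x and y rather than the football data.

```r
x <- c(1, 3, 4, 6, 8)  # made-up values
y <- c(2, 3, 5, 5, 9)  # made-up values

r <- cor(x, y)

cor(x, y + 10)  # shifting y: same as r
cor(x, 2 * y)   # positive scaling of y: same as r
cor(x, -2 * y)  # negative scaling of y: equals -r
```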

5) This question uses the sample of 10,000 Ohio house prices at OH_house_prices.csv
Download the data set to your computer. Note that the house prices are in thousands of dollars.

a) What is the median value? Is it larger or smaller than the mean?

b) What does your answer to part a suggest about the shape of the distribution (right-skewed or left-skewed)?

c) How does the median change if you add 10 (thousand dollars) to all the values?

d) How does the median change if you multiply all the values by 2?
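The same kind of check works for the median. The vector prices below is made up, in thousands of dollars, with one large value so that the mean and median differ noticeably.

```r
prices <- c(90, 110, 125, 140, 400)  # made-up values

median(prices)       # 125
mean(prices)         # 173, pulled upward by the one large value
median(prices + 10)  # 135: the median shifts by the added constant
median(prices * 2)   # 250: the median scales with the multiplier
```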

6) This question uses the following people's ages: 19,23,30,30,45,25,24,20. Store them in R using the syntax ages<-c(19,23,30,30,45,25,24,20).

a) Compute the standard deviation in R using the sd() function.

b) Compute the same value by hand and show all the steps.

c) Using R, how does the value in part a change if you add 10 to all the values?

d) Using R, how does the value in part a change if you multiply all the values by 100?
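For the by-hand calculation, recall that sd() computes the sample standard deviation, which divides the sum of squared deviations by n - 1 rather than n. The sketch below walks through the steps on a different, made-up vector x so the intermediate values can be compared with your own arithmetic for the ages.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values, not the ages above

n       <- length(x)            # 8 observations
x_bar   <- mean(x)              # 5
sq_devs <- (x - x_bar)^2        # squared deviations from the mean
ss      <- sum(sq_devs)         # 32
by_hand <- sqrt(ss / (n - 1))   # sqrt(32 / 7)

by_hand
sd(x)  # matches by_hand
```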