R Programming Homework Help

R is a language and environment for statistical computing and graphics. In this worked example, we use R for data analysis, manipulation and visualization. Data describing technology usage around the world were collected from global agencies. Such data are often messy and incomplete, so they are cleaned before analysis and visualization. We made the data from the different files comparable and produced visualizations to examine data quality. Finally, we interpreted and reported the results of the analysis.

We will use the R programming language to manipulate the data and draw conclusions from it. At the end we will perform a linear regression on the data we have chosen.

First, we created three .csv files from the three .xlsx files that were provided, because comma-separated files are easy to read into R. All three .csv files were loaded and converted to data frames. Some of the columns were omitted, as instructed in the assignment.

The three data frames created from the .csv files had the following numbers of columns:

>length(CellPhones)

[1] 23

>length(InternetUsers)

[1] 23

>length(PerComp)

[1] 18

R script

#**********************************************
# Assignment Name: week 12
# Data Scientist Name:
# Completion Date:
# Project Description: This project uses R to manipulate data tables.
#           The cleaned data is then examined and described.
#           Linear regression is performed on the data and conclusions are made.
# Consultations:
# Citations:
#   R Graphics: Multiple Graphs and par(mfrow=(A,B))
#   http://stackoverflow.com/questions/4862178/remove-rows-with-nas-in-data-frame
#   http://stackoverflow.com/questions/12486264/average-across-columns-in-r-excluding-nas
#   http://onlinestatbook.com/2/describing_bivariate_data/pearson.html
#***********************************************

# loading the cell phone data
D <- read.csv("Cp.csv", header = TRUE)
head(D)
# choosing the columns: the location column and the 22 yearly columns
cn <- D[, 1]
cy <- D[, 27:48]
# creating the data frame
CellPhones <- data.frame(cn, cy)
head(CellPhones)

The first two data frames, CellPhones and InternetUsers, cover the same years, 1990-2011, because the data manipulation was performed so as to make them match. The third data frame, PerComp, has only 18 columns, covering the 17 years from 1990 to 2006. The remaining years, 2007-2011, therefore had to be estimated according to the following formulas, in which each year stands for that year's column:

  • 2007=2006+(((2006-2000)/2000)*2006)
  • 2008=2007+(((2006-2001)/2001)*2006)
  • 2009=2008+(((2006-2002)/2002)*2006)
  • 2010=2009+(((2006-2003)/2003)*2006)
  • 2011=2010+(((2006-2004)/2004)*2006)
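The formulas above can be tried on a small example. The following is a minimal sketch, not the script's own code: the data frame `pc`, its values and the helper `extrapolate` are made up for illustration, and the `X2000`-style column names only assume what `read.csv` typically produces for year headers.

```r
# Hypothetical toy data: PC users for two made-up locations.
pc <- data.frame(
  X2000 = c(10, 20), X2001 = c(12, NA), X2002 = c(15, 25),
  X2003 = c(16, 28), X2004 = c(18, 30), X2006 = c(20, 40)
)

# Hypothetical helper: previous year plus the relative change between the
# reference year and 2006, scaled by the 2006 value.
extrapolate <- function(df, target, previous, reference) {
  growth <- (df$X2006 - df[[reference]]) / df[[reference]]
  df[[target]] <- df[[previous]] + growth * df$X2006
  df
}

pc <- extrapolate(pc, "X2007", "X2006", "X2000")
pc <- extrapolate(pc, "X2008", "X2007", "X2001")

pc$X2007  # 20 + ((20 - 10) / 10) * 20 = 40 for the first location
pc$X2008  # the second value is NA because X2001 is missing in that row
```

The second location shows why a single missing predictor leaves the estimated column empty for that row: any arithmetic involving NA returns NA.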

These new variables are added to the data frame PerComp, which now has the same number of columns as the other two:

>length(PerComp)

[1] 23

Since it is not wise to monitor the changes in each year individually, as each market and its consumption can vary drastically from year to year, I created three year groups. The first group covers the data from 1990-1996, the second from 1997-2003 and the third from 2004-2011. Each group holds the mean of the values for the years it covers, computed with the rowMeans function in R. I have not yet removed the missing values, as doing so would make the prediction of PerComp for the years 2007-2011 extremely poor. In other words, each of the nine groups (three from CellPhones, three from InternetUsers and three from PerComp) still has some missing values in it.

Namely the missing values in the groups are caused by:

  • In the data frames CellPhones and InternetUsers, a group value is missing only if all the years in that year group were missing in the original data.
  • In the data frame PerComp, missing values are caused either by data that is really missing or by a missing predictor used in estimating the missing columns. If even one of the predictors is missing, the estimated column is filled with NA. Therefore, this data frame carries the least information, as most of it is omitted.
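Both causes of missing values can be reproduced on a few made-up numbers. This is only an illustrative sketch; the data frame `groups` and its column names are hypothetical.

```r
# Hypothetical year columns for three locations.
groups <- data.frame(y1990 = c(1, NA, NA), y1991 = c(3, 5, NA))

# With na.rm = TRUE the mean uses whatever values are present;
# a row with no values at all yields NaN, which R also treats as missing.
group_mean <- rowMeans(groups, na.rm = TRUE)
group_mean
is.na(group_mean)

# Arithmetic with a missing predictor propagates the missing value.
estimate <- groups$y1991 + (groups$y1991 - groups$y1990)
estimate
```

Only the third row, where every year is missing, ends up without a group mean, while the estimate is lost as soon as one of its inputs is missing.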

Now that I have three variables, each with three year groups, I have merged them into one data frame. This data frame has 10 columns, as the location variable is included:

>length(data)

[1] 10

The total number of rows in this data frame is:

>nrow(data)

[1] 185

This data has missing values in it. At this point it was necessary to remove the NAs so that the statistics could be computed. I removed every row of the merged data frame that had a missing value in at least one column. The new data table has far fewer observations than the previous one:

>nrow(data)

[1] 108

Even though quite a lot of information was lost because rows were omitted for having a single missing value, this was necessary so that the descriptive statistics could be computed.
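The effect of merging and then dropping incomplete rows can be seen on a tiny example. This sketch uses made-up data frames `cells` and `net`; only `merge` and `na.omit` come from the script.

```r
# Hypothetical group tables sharing a "country" column.
cells <- data.frame(country = c("A", "B", "C"), cg1 = c(1.0, NA, 3.0))
net   <- data.frame(country = c("A", "B", "C"), ig1 = c(0.5, 0.7, NaN))

# merge() joins on the column name the two tables have in common.
merged <- merge(cells, net)
nrow(merged)

# na.omit() drops every row with at least one NA (NaN counts as missing).
complete <- na.omit(merged)
nrow(complete)
complete
```

Here two of the three rows are lost even though each has only one missing value, which is the same trade-off described above.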

I have created the descriptive statistics table which is shown below:

The variables are named as follows: c stands for cell phone users, i for internet users and p for personal computer users, while the letter g is short for group. There are three groups of each variable, corresponding to the three time periods.

           cg1       cg2       cg3       ig1       ig2       ig3       pg1       pg2       pg3
Min.       0         0.04689   3.388     0         0.0951    0.8286    0.02333   0.13      0.465
1st Qu.    0.01353   2.96124   47.087    0.01742   1.0637    9.0648    0.29      1.244     8.312
Median     0.25124   11.53497  80.678    0.13799   3.8648    24.1396   1.25      4.543     22.252
Mean       1.6345    21.04866  77.667    1.03065   10.7953   32.3304   4.56173   11.654    55.377
3rd Qu.    1.81743   34.47994  107.071   0.94704   16.3192   54.0559   4.40336   13.316    90.245
Max.       13.57574  73.30983  161.68    11.8016   51.6256   90.2737   37.7      58.43     581.505
Std. Dev.  2.894409  21.86306  35.59035  2.024625  14.11633  26.6115   7.353599  15.3171   79.66879

In the table there are means, medians, minimums, maximums and standard deviations of each year group of each of the variables.
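A table of this shape can also be assembled in a single step. The sketch below is an alternative to the `summary` call plus loop used in the script, shown on a made-up data frame `toy`; it relies only on base R's `sapply`, `quantile`, `mean` and `sd`.

```r
# Hypothetical data with two of the nine variables.
toy <- data.frame(cg1 = c(0, 1, 2, 5), ig1 = c(0.1, 0.2, 0.4, 0.9))

# For every column, collect the five quantiles, the mean and the
# standard deviation; sapply() binds the results into one matrix.
stats <- sapply(toy, function(x) {
  c(quantile(x, c(0, 0.25, 0.5, 0.75, 1)), Mean = mean(x), SD = sd(x))
})
round(stats, 3)
```

Each column of `stats` corresponds to one variable and each row to one statistic, the same layout as the table above.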
# loading the Internet data
D <- read.csv("Internet.csv", header = TRUE)
# creating the data frame
InternetUsers <- data.frame(D)
head(InternetUsers)
# loading the personal computer data
D <- read.csv("Pc.csv", header = TRUE)
head(D)
# choosing columns
p <- D[, 1:18]
# creating the data frame
PerComp <- data.frame(p)
# creating predictions
# a prediction is NA if one of its predictors is NA
# I have not excluded these failures because we would get wrong results otherwise
# all NAs will be omitted later, when the final data frame is created
PerComp$X2007 <- PerComp$X2006 + (((PerComp$X2006 - PerComp$X2000) / PerComp$X2000) * PerComp$X2006)
PerComp$X2008 <- PerComp$X2007 + (((PerComp$X2006 - PerComp$X2001) / PerComp$X2001) * PerComp$X2006)
PerComp$X2009 <- PerComp$X2008 + (((PerComp$X2006 - PerComp$X2002) / PerComp$X2002) * PerComp$X2006)
PerComp$X2010 <- PerComp$X2009 + (((PerComp$X2006 - PerComp$X2003) / PerComp$X2003) * PerComp$X2006)
PerComp$X2011 <- PerComp$X2010 + (((PerComp$X2006 - PerComp$X2004) / PerComp$X2004) * PerComp$X2006)
head(PerComp)
It is clearly seen in this table that the mean of each variable increases from one year group to the next. The standard deviations and medians increase accordingly. Only the last year group of the personal computer users variable has an extremely high maximum and thus a very large standard deviation.
Pearson’s Product Moment Correlation results are shown in the table below:

     cg1       cg2       cg3       ig1       ig2       ig3       pg1       pg2       pg3
cg1  1
cg2  0.775607  1
cg3  0.503776  0.767462  1
ig1  0.826473  0.652487  0.417134  1
ig2  0.865827  0.848995  0.592029  0.87648   1
ig3  0.756207  0.888611  0.76612   0.738805  0.916572  1
pg1  0.748333  0.747019  0.509166  0.832976  0.873389  0.805167  1
pg2  0.824182  0.850373  0.586851  0.826223  0.951015  0.884521  0.928464  1
pg3  0.290429  0.362988  0.238245  0.273848  0.443159  0.444679  0.368071  0.441712  1
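Because a correlation matrix is symmetric, only its lower triangle needs to be shown. The following sketch illustrates one way to print it that way in base R; the data frame `toy` and its values are made up, and the blanking step is an illustration rather than part of the original script.

```r
# Hypothetical data: b follows a closely, c3 is unrelated.
toy <- data.frame(
  a  = 1:10,
  b  = 1:10 + rep(c(1, -1), 5),
  c3 = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

# Pearson correlation matrix, as in the script.
r <- cor(toy, method = "pearson")

# Blank out the redundant upper triangle before printing.
r[upper.tri(r)] <- NA
print(round(r, 3), na.print = "")
```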

I can see that there are some strong correlations between variables. Most of the strong correlations are between two adjacent year groups of the same variable, which was expected, as these data are not truly independent but rather dependent, as in a time series.
However, some different variables are also strongly correlated. For example, personal computer users in the second time period are strongly correlated with cell phone users in the first time period (0.82). The same can be said for personal computer users and internet users in the second period (0.95). Plots give a visual insight into the relationships between the variables.

I have plotted each pair of variables from the same time period. The first plot shows how the number of cell phone users depends on the number of Internet users in each of the three time groups.

It can be seen that in each time period, the number of cell phone users increases with the number of internet users. This is clear from the linear regression line drawn through each of the plots. In each plot the line is rising, meaning that these two variables are positively correlated.

#checking the dimensions of the dataframes

length(CellPhones)

length(InternetUsers)

length(PerComp)

nrow(CellPhones)

nrow(InternetUsers)

nrow(PerComp)

 

# creating groups by years
# rowMeans() calculates the mean of each row across the selected columns
# NAs are omitted from the calculations
# cell phone groups
cellphoneYG9096 <- rowMeans(CellPhones[, 2:8], na.rm = TRUE)
cellphoneYG9703 <- rowMeans(CellPhones[, 9:15], na.rm = TRUE)
cellphoneYG0411 <- rowMeans(CellPhones[, 16:23], na.rm = TRUE)
# internet groups
# columns 2:8 are used for 1990-1996 so that the groups do not overlap
internetYG9096 <- rowMeans(InternetUsers[, 2:8], na.rm = TRUE)
internetYG9703 <- rowMeans(InternetUsers[, 9:15], na.rm = TRUE)
internetYG0411 <- rowMeans(InternetUsers[, 16:23], na.rm = TRUE)
# PC groups
personalYG9096 <- rowMeans(PerComp[, 2:8], na.rm = TRUE)
personalYG9703 <- rowMeans(PerComp[, 9:15], na.rm = TRUE)
personalYG0411 <- rowMeans(PerComp[, 16:23], na.rm = TRUE)

Next, I produced three scatterplots showing how the number of internet users depends on the number of personal computer users in each of the three year groups.

Again there is a positive linear correlation, which implies that the more PC users there are, the more Internet users there will be.

In the third graph, corresponding to the year group from 2004 to 2011, I have excluded one observation that I identified as an outlier. I did so by changing the range of the x-axis so as not to show that particular observation.
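Narrowing the axis range hides a point from view, but a regression line fitted to the full data is still influenced by it. The sketch below, on made-up numbers, compares the slope fitted with and without such an observation; the data frame `d` and the cut-off of 400 used for filtering are assumptions for illustration only.

```r
# Hypothetical data: the last location has an extreme number of PC users.
d <- data.frame(
  pc  = c(5, 10, 20, 30, 40, 580),
  net = c(8, 15, 25, 35, 50, 45)
)

# Fit on all rows: this is the line abline() draws even when xlim hides the point.
fit_all <- lm(net ~ pc, data = d)

# Fit after actually removing the outlying row.
fit_trim <- lm(net ~ pc, data = d[d$pc <= 400, ])

coef(fit_all)["pc"]
coef(fit_trim)["pc"]
```

If the outlier should not affect the fitted line, filtering the rows before calling `lm` is one option; whether that is appropriate depends on why the observation is extreme.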

I have plotted the scatterplots of the remaining combination of variables, showing how the number of PC users depends on the number of cell phone users.

As can be seen, there is a positive correlation between the two variables at each time period.

By using these three sets of scatterplots I have visualized the relationships between the three variables. The conclusion is that an increase in one kind of technology is accompanied by an increase in the other kinds of technology in each epoch.

I can also clearly see that in the earliest epoch, from 1990 to 1996, there were a lot of observations with low numbers of technology users. With the progress of time these clusters have dispersed across the whole scatterplot, implying that the technology had become more accessible and that more people were using it.

# attaching data so that I can access variables easily
attach(data)

# creating the summary of all the columns corresponding to variables
summary(data[, 2:10])

# creating a vector of zeros to store all the standard deviations
deviations <- rep(0, times = 9)

# loop to calculate and store the standard deviation of each variable
for (i in 2:10) {
  deviations[i - 1] <- sd(data[, i])
}
# listing all standard deviations from first to last variable
deviations

# creating the matrix of Pearson's correlation coefficients
cor(data[, 2:10], method = "pearson")

# parameters of the plot display device:
# three rows and one column
par(mfrow = c(3, 1))

# plotting three plots of three year groups of cell phones vs Internet users
plot(ig1, cg1, xlab = "internet users", ylab = "Cellular phone users", main = "years 1990-1996")

 

Now that I have explored all the available data I will construct a linear regression model to test a hypothesis.

My assumption addresses the question of technology availability to consumers. Namely, I assume that the lack of internet and the lack of available cell phones resulted in an increased relative number of PC users. In other words, I assume that good internet and the availability of cell phones decrease the relative number of PC users.

I will test my assumption by examining how the number of PC users depends on the difference between the number of internet users and the number of cell phone users. In other words, I will test whether, to some degree, people prefer cell phone usage with wireless internet coverage over PC usage with good internet. I will do so by monitoring the behavior of the same linear regression model over the three time periods. Since cell phone technology wasn't as accessible in the first two time periods as it is in the third, I assume that the model will behave drastically differently across these three epochs.

I have created a variable equal to the difference between Internet users and cell phone users for each of the epochs. As the dependent variable in the model I will use the number of PC users from the corresponding epoch.

Every hypothesis test ends with the null hypothesis being either rejected or not rejected. My null hypothesis is that the number of PC users does not depend on the difference between Internet users and cell phone users.
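One detail matters when regressing on a difference in R: inside a model formula, `-` removes a term instead of subtracting, so the difference has to be computed beforehand or wrapped in `I()`. The following sketch shows this on made-up numbers; the data frame `toy` and its column names are hypothetical and do not come from the script.

```r
# Hypothetical values for PC, internet and cell phone users.
toy <- data.frame(
  pg = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  ig = c(2, 4, 5, 8, 9, 11),
  cg = c(1, 2, 2, 4, 4, 5)
)

# Option 1: store the difference in its own variable first.
toy$x <- toy$ig - toy$cg
fit_var <- lm(pg ~ x, data = toy)

# Option 2: protect the arithmetic with I() inside the formula.
fit_I <- lm(pg ~ I(ig - cg), data = toy)

# Without I(), the formula means "ig, with the term cg removed",
# so this model regresses pg on ig alone.
fit_plain <- lm(pg ~ (ig - cg), data = toy)

coef(fit_var)
coef(fit_I)
names(coef(fit_plain))
```

The first two fits give identical coefficients, while the third contains only an `ig` term, which is why a precomputed difference variable is the safer choice.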

I ran my linear regression for the first time period, with pg1 as the response (y1) and the difference ig1-cg1 as the predictor (x1), and the summary is:

Call:
lm(formula = y1 ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max
-9.0102 -0.6171 -0.4776 -0.0275  9.7286

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.46163    0.24585   1.878   0.0632 .
x1           0.33216    0.03643   9.117 5.46e-15 ***

Residual standard error: 2.177 on 106 degrees of freedom
Multiple R-squared:  0.4395,    Adjusted R-squared:  0.4342
F-statistic: 83.11 on 1 and 106 DF,  p-value: 5.458e-15

From the regression summary I see that the p-value is extremely small, practically zero. This means that the difference between Internet users and cell phone users is a significant predictor of the number of PC users. Thus, the null hypothesis must be rejected in this epoch.

# data frame for cell phones with four columns: location and three year groups
# (the objects c, i and p are data frames; R still finds the function c() when it is called)
c <- data.frame(CellPhones[, 1], cellphoneYG9096, cellphoneYG9703, cellphoneYG0411)
colnames(c) <- c("country", "cg1", "cg2", "cg3")
# data frame for internet with four columns: location and three year groups
i <- data.frame(InternetUsers[, 1], internetYG9096, internetYG9703, internetYG0411)
colnames(i) <- c("country", "ig1", "ig2", "ig3")
# data frame for PCs with four columns: location and three year groups
p <- data.frame(PerComp[, 1], personalYG9096, personalYG9703, personalYG0411)
colnames(p) <- c("country", "pg1", "pg2", "pg3")

# creating one big data frame with 10 columns
# first column is the location, other columns are three year groups of three variables:
# cell phones, internet and PCs
data <- merge(x = c, y = i)
data <- merge(x = data, y = p)
# checking to see if the merged data frame is good
head(data)
# there are NAs in the data frame
nrow(data)
# omit all the rows containing NAs since it is impossible to do statistics otherwise
data <- na.omit(data)

# adding a straight line, which is the linear fit of these two variables
abline(lm(cg1 ~ ig1))
plot(ig2, cg2, xlab = "internet users", ylab = "Cellular phone users", main = "years 1997-2003")
abline(lm(cg2 ~ ig2))
plot(ig3, cg3, xlab = "internet users", ylab = "Cellular phone users", main = "years 2004-2011")
abline(lm(cg3 ~ ig3))

# three plots of Internet users vs PC users
plot(pg1, ig1, xlab = "PC users", ylab = "Internet users", main = "years 1990-1996")
# adding linear regression line
abline(lm(ig1 ~ pg1))
plot(pg2, ig2, xlab = "PC users", ylab = "Internet users", main = "years 1997-2003")
abline(lm(ig2 ~ pg2))
# xlim keeps one outlying observation out of view
plot(pg3, ig3, xlab = "PC users", ylab = "Internet users", xlim = c(0, 400), main = "years 2004-2011")
abline(lm(ig3 ~ pg3))

# three plots of PC users vs cell phone users
plot(cg1, pg1, xlab = "Cellular phone users", ylab = "PC users", main = "years 1990-1996")
# adding linear regression line
abline(lm(pg1 ~ cg1))
plot(cg2, pg2, xlab = "Cellular phone users", ylab = "PC users", main = "years 1997-2003")
abline(lm(pg2 ~ cg2))
plot(cg3, pg3, xlab = "Cellular phone users", ylab = "PC users", ylim = c(0, 300), main = "years 2004-2011")
abline(lm(pg3 ~ cg3))

I ran the regression for the second epoch and the summary is:

Call:
lm(formula = y2 ~ x2)

Residuals:
    Min      1Q  Median      3Q     Max
-28.206 -10.617  -5.015   5.283  54.296

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.1102     1.9804   5.610 1.63e-07 ***
x2            3.5728     0.4116   8.681 5.16e-14 ***

Residual standard error: 16.79 on 106 degrees of freedom
Multiple R-squared:  0.4155,    Adjusted R-squared:   0.41
F-statistic: 75.35 on 1 and 106 DF,  p-value: 5.16e-14

The p-value of the regression is very small, which implies that the number of PC users depends strongly on the difference between Internet users and cell phone users. The null hypothesis has to be rejected in the second epoch.

Running the regression in the third epoch yields:

Call:
lm(formula = y3 ~ x3)

Residuals:
    Min      1Q  Median      3Q     Max
-73.424 -29.697   3.875  28.672  83.615

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 76.67945    3.87920  19.767   <2e-16 ***
x3           0.02830    0.05159   0.549    0.584

Residual standard error: 35.71 on 106 degrees of freedom
Multiple R-squared:  0.00283,   Adjusted R-squared:  -0.006577
F-statistic: 0.3009 on 1 and 106 DF,  p-value: 0.5845

The p-value of this regression is 0.5845, which is well above the 0.1 threshold. Thus the null hypothesis cannot be rejected in the third epoch: there is no evidence that the number of PC users depends on the difference between Internet users and cell phone users in this epoch.
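The decision for each epoch can also be read programmatically from the fitted model instead of from the printed summary. This is a minimal sketch on made-up data; the helper `slope_p_value`, the vectors and the variable `alpha` are hypothetical, with the 0.1 threshold taken from the text.

```r
# Hypothetical helper: p-value of the slope in a one-predictor model.
slope_p_value <- function(fit) {
  summary(fit)$coefficients[2, "Pr(>|t|)"]
}

x <- 1:10
y_trend <- 2 * x + c(0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.2, -0.2)
y_flat  <- c(5, 3, 6, 4, 5, 4, 6, 3, 5, 4)

alpha <- 0.1  # threshold used in the text

p_trend <- slope_p_value(lm(y_trend ~ x))
p_flat  <- slope_p_value(lm(y_flat ~ x))

p_trend < alpha  # TRUE means the null hypothesis is rejected
p_flat < alpha   # FALSE means it cannot be rejected
```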

To summarize the results, the number of PC users strongly depended on the difference between Internet users and cell phone users in the first two epochs, and it does not depend on it in the third epoch.

This means that when the number of cell phones was very small, as in the first epoch, the availability of Internet and the lack of cell phones induced a great number of PC users.

In the second epoch the number of cell phones was still relatively small, and the availability of internet induced a great number of PC users.

In the third epoch, cell phones became an everyday item and the internet became available on cell phones. In this epoch the number of PC users does not depend at all on the difference between cell phone users and Internet users. Almost all cell phone users have internet access, so this difference is negligible. Therefore, people tend to have a cell phone, while the number of PCs remained more or less the same.

These regression results showed that the mere availability of cell phones has induced a great increase in the number of cell phones, while PCs have remained in slow and constant growth. They also show that mobile internet has made the number of internet users an unreliable predictor of a particular technology preference. An increase in the number of Internet users implies an increase in users of both PCs and cell phones.

For further research I suggest dividing PC internet users and cell phone internet users into two distinct groups so that a more detailed analysis can be performed.