School of Mathematics and Statistics
Te Kura M¯atai Tatauranga
STAT 393
TEST 1
3 September 2020
• Start any time between 8:00am, 3 September to 8:00am, 4 September, 2020.
• MUST submit the test in THREE HOURS since the time of the start.
• Submit a pdf file ONLINE on Blackboard using the formatting file name: surname+two
letters of the first name.pdf. For example: nguyenth.pdf is the name of my submitted file.
• Attempt all questions.
• Questions with ** indicates the possibility of being a little bit more challenging.
• The bonus question is not a part of the total 100 marks, but if you do it right, you will get
an extra 2 marks.
• R markdown is RECOMMENDED for questions that ask for R codes and output.
• The realestate.csv dataset is given in another link.
• Useful statistical tables are attached from page 5 to page 8. Use them if necessary and if
you use the table, do mention that the numbers you have got are from the tables.
1
1. Consider the simple linear model:
y
β
,
n×1 = Xn×2 2×1 + εn×1
where E(ε) = 0 and Var(ε) = σ2In and r(X) = 2. Note that we do not assume here that ε
follows a multivariate normal distribution.
(a) What is (y − Xβ)T (y − Xβ) in linear models?
(b) Consider the following estimator for β:
β =
b
(XT X)−1XT y.
i. Show that β
b is the least square estimator of β, that minimizes
(y − Xβ)T (y − Xβ).
ii. Show that β
b is an unbiased estimator of β.
(c) Define the hat matrix H = X(XT X)−1XT and the projection matrix P = I − H. Show
that
i. Both H and P are symmetric and idempotent.
ii. H and P are orthogonal, i.e., HP = 0.
iii. tr(H) = p and tr(P) = n − p.
iv. ** Denote the fitted value as y = Xβ = Hy and the residuals as e = y − y = Py.
b
b
b
b
Show that y and e are independent, i.e.,
b
b
Cov(y, e) = 0.
b b
Hints: Use the following formula for two random vectors/variables y and z
Cov(y, z) = E[(y − E(y))(z − E(z))T ].
2. Consider the following n pairs of observations (x ,
i yi), i = 1, . . . , n. To avoid the data
x
=
i being degerenate, we assume that not all xi
c for i = 1, . . . , n, where c ∈ R is a
real-value number.
(a) If the model
iid
y = µ + ε ,
ε
i
i
i ∼ N (0, σ2),
(1)
fits the given data, then what does it tell us?
(b) Suppose that the given n pairs of observations satisfy
y = µ + β + ε ,
i
xi
i
(2)
where for all i = 1, . . . , n, E(εi) = 0 and Var(εi) = σ2, no normality assumption
imposed here.
2
i. Formulate the linear regression in (2) in terms of linear model
y
β
.
n×1 = Xn×p p×1 + εn×1
Specifying the corresponding response vector y, the design matrix X and the
model parameter β clearly.
ii. Determine the rank of X.
iii. ** Argue that the matrix XT X is invertible, i.e., det(XT X) , 0.
(c) Consider the following n pairs of observations (x ,
i yi), i = 1, . . . , n satisfying
iid
y = β + ε ,
ε
i
xi
i
i ∼ N (0, σ2).
(3)
Note that this model is different from model (2) because it does not have an intercept.
i. Formulate the linear regression in (3) in terms of linear model
y
β
.
n×1 = Xn×p p×1 + εn×1
Specifying the corresponding design matrix X and the model parameter β clearly.
ii. Use β =
b
(XT X)−1XT y to find ˆ
β in terms of xi and yi.
iii. Define SSR(β) = yT Hy and SSE(β) = yT Py with H and P being the hat and
projection matrices respectively, defined as in question 1(c). Derive SSR and
SSE for model (3) in terms of xi and yi.
iv. Use SSR and SSE to construct a test statistic for testing H0 : β = 0 against
H1 : β , 0. What is the distribution of the test statistic?
v. For the following data set
i
1
2
3
4
5
6
7
8
xi
2
2
3
5
8
4
6
5
yi
0.2
0.3
0.85
1
0.9
0.3
0.5
0.4
we have SSR = 0.206 and SSE = 0.4811. To test H0 : β = 0 against H1 : β , 0,
what do you conclude at 5% significance level?
(d) Suppose that a quadratic term is added to the model (2) as follows:
iid
y = µ + β + γ
+ ε ,
ε
i
xi
x2
∼ N(0, σ2)
(4)
i
i
i
i. Should the model (4), which is non-linear in x, still be called a linear model?
Explain?
ii. Write down the model (4) in matrix form y
β
n×1 = Xn×p p×1 + εn×1 with the
specified design matrix X and vector of parameters β.
iii. (Bonus question) Can you think of any specific practical example where it is
appropriate to consider fitting the model (4) to the data?
3
3. The realestate.csv dataset was built for regression analysis to investigate the relationship
between house age, location, distance to nearest MRT station and house price, where
only house age is extracted as the predictor of the main interest. House ages are in years
and house prices are in $10000 USD. The following simple linear regression model is
then performed:
Price = β + β
0
1 × Age + ε,
ε ∼ N(0, σ2)
(5)
(a) Read the data and fit the model by using an appropriate lm() function and reports
the least square estimates (LSE) of the parameters of the model.
(b) Should the Maximum Likelihood Estimates of the parameters be the same as the
LSE?
(c) Plot the fitted regression line and write down the predicted model for the house price
based on the age. What is the trend for the house price when the houses are getting
older? Make a specific interpretation.
(d) To test H
=
0 : β1
0 against H1 : β1 , 0, what is the test statistic to use and its
distribution? Give the formula. Report the test value and its corresponding p-value?
What is your conclusion at 5% significance level?
(e) In R, work out the 95% confidence interval for β1. Interpret the result.
(f) What is the adjusted R2? Interpret the result.
(g) In R, manually calculate SSR and SST and F-test to compare with the F-test obtained
from the lm() function.
4
Table of Standard Normal Probabilities
P(0 ≤ Z ≤ z) where Z ∼ N(0, 1)
z
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.0
0.0000
0.0040
0.0080
0.0120
0.0160
0.0199
0.0239
0.0279
0.0319
0.0359
0.1
0.0398
0.0438
0.0478
0.0517
0.0557
0.0596
0.0636
0.0675
0.0714
0.0753
0.2
0.0793
0.0832
0.0871
0.0910
0.0948
0.0987
0.1026
0.1064
0.1103
0.1141
0.3
0.1179
0.1217
0.1255
0.1293
0.1331
0.1368
0.1406
0.1443
0.1480
0.1517
0.4
0.1554
0.1591
0.1628
0.1664
0.1700
0.1736
0.1772
0.1808
0.1844
0.1879
0.5
0.1915
0.1950
0.1985
0.2019
0.2054
0.2088
0.2123
0.2157
0.2190
0.2224
0.6
0.2257
0.2291
0.2324
0.2357
0.2389
0.2422
0.2454
0.2486
0.2517
0.2549
0.7
0.2580
0.2611
0.2642
0.2673
0.2704
0.2734
0.2764
0.2794
0.2823
0.2852
0.8
0.2881
0.2910
0.2939
0.2967
0.2995
0.3023
0.3051
0.3078
0.3106
0.3133
0.9
0.3159
0.3186
0.3212
0.3238
0.3264
0.3289
0.3315
0.3340
0.3365
0.3389
1.0
0.3413
0.3438
0.3461
0.3485
0.3508
0.3531
0.3554
0.3577
0.3599
0.3621
1.1
0.3643
0.3665
0.3686
0.3708
0.3729
0.3749
0.3770
0.3790
0.3810
0.3830
1.2
0.3849
0.3869
0.3888
0.3907
0.3925
0.3944
0.3962
0.3980
0.3997
0.4015
1.3
0.4032
0.4049
0.4066
0.4082
0.4099
0.4115
0.4131
0.4147
0.4162
0.4177
1.4
0.4192
0.4207
0.4222
0.4236
0.4251
0.4265
0.4279
0.4292
0.4306
0.4319
1.5
0.4332
0.4345
0.4357
0.4370
0.4382
0.4394
0.4406
0.4418
0.4429
0.4441
1.6
0.4452
0.4463
0.4474
0.4484
0.4495
0.4505
0.4515
0.4525
0.4535
0.4545
1.7
0.4554
0.4564
0.4573
0.4582
0.4591
0.4599
0.4608
0.4616
0.4625
0.4633
1.8
0.4641
0.4649
0.4656
0.4664
0.4671
0.4678
0.4686
0.4693
0.4699
0.4706
1.9
0.4713
0.4719
0.4726
0.4732
0.4738
0.4744
0.4750
0.4756
0.4761
0.4767
2.0
0.4772
0.4778
0.4783
0.4788
0.4793
0.4798
0.4803
0.4808
0.4812
0.4817
2.1
0.4821
0.4826
0.4830
0.4834
0.4838
0.4842
0.4846
0.4850
0.4854
0.4857
2.2
0.4861
0.4864
0.4868
0.4871
0.4875
0.4878
0.4881
0.4884
0.4887
0.4890
2.3
0.4893
0.4896
0.4898
0.4901
0.4904
0.4906
0.4909
0.4911
0.4913
0.4916
2.4
0.4918
0.4920
0.4922
0.4925
0.4927
0.4929
0.4931
0.4932
0.4934
0.4936
2.5
0.4938
0.4940
0.4941
0.4943
0.4945
0.4946
0.4948
0.4949
0.4951
0.4952
2.6
0.4953
0.4955
0.4956
0.4957
0.4959
0.4960
0.4961
0.4962
0.4963
0.4964
2.7
0.4965
0.4966
0.4967
0.4968
0.4969
0.4970
0.4971
0.4972
0.4973
0.4974
2.8
0.4974
0.4975
0.4976
0.4977
0.4977
0.4978
0.4979
0.4979
0.4980
0.4981
2.9
0.4981
0.4982
0.4982
0.4983
0.4984
0.4984
0.4985
0.4985
0.4986
0.4986
3.0
0.4987
0.4987
0.4987
0.4988
0.4988
0.4989
0.4989
0.4989
0.4990
0.4990
3.1
0.4990
0.4991
0.4991
0.4991
0.4992
0.4992
0.4992
0.4992
0.4993
0.4993
3.2
0.4993
0.4993
0.4994
0.4994
0.4994
0.4994
0.4994
0.4995
0.4995
0.4995
3.3
0.4995
0.4995
0.4995
0.4996
0.4996
0.4996
0.4996
0.4996
0.4996
0.4997
3.4
0.4997
0.4997
0.4997
0.4997
0.4997
0.4997
0.4997
0.4997
0.4997
0.4998
3.5
0.4998
0.4998
0.4998
0.4998
0.4998
0.4998
0.4998
0.4998
0.4998
0.4998
3.6
0.4998
0.4998
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
3.7
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
3.8
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
0.4999
3.9
0.5000
0.5000
0.5000
0.5000
0.5000
0.5000
0.5000
0.5000
0.5000
0.5000
5
Table of Inverse Normal Probabilities
p = P(0 ≤ Z ≤ z) where Z ∼ N(0, 1)
p
z
p
z
p
z
p
z
p
z
0.00
0.0000
0.10
0.2533
0.20
0.5244
0.30
0.8416
0.40
1.2816
0.01
0.0251
0.11
0.2793
0.21
0.5534
0.31
0.8779
0.41
1.3408
0.02
0.0502
0.12
0.3055
0.22
0.5828
0.32
0.9154
0.42
1.4051
0.03
0.0753
0.13
0.3319
0.23
0.6128
0.33
0.9542
0.43
1.4758
0.04
0.1004
0.14
0.3585
0.24
0.6433
0.34
0.9945
0.44
1.5548
0.05
0.1257
0.15
0.3853
0.25
0.6745
0.35
1.0364
0.45
1.6449
0.06
0.1510
0.16
0.4125
0.26
0.7063
0.36
1.0803
0.46
1.7507
0.07
0.1764
0.17
0.4399
0.27
0.7388
0.37
1.1264
0.47
1.8808
0.08
0.2019
0.18
0.4677
0.28
0.7722
0.38
1.1750
0.48
2.0537
0.09
0.2275
0.19
0.4959
0.29
0.8064
0.39
1.2265
0.49
2.3263
0.470
1.8808
0.480
2.0537
0.490
2.3263
0.4990
3.0902
0.49990
3.7190
0.471
1.8957
0.481
2.0749
0.491
2.3656
0.4991
3.1214
0.49991
3.7455
0.472
1.9110
0.482
2.0969
0.492
2.4089
0.4992
3.1559
0.49992
3.7750
0.473
1.9268
0.483
2.1201
0.493
2.4573
0.4993
3.1947
0.49993
3.8082
0.474
1.9431
0.484
2.1444
0.494
2.5121
0.4994
3.2389
0.49994
3.8461
0.475
1.9600
0.485
2.1701
0.495
2.5758
0.4995
3.2905
0.49995
3.8906
0.476
1.9774
0.486
2.1973
0.496
2.6521
0.4996
3.3528
0.49996
3.9444
0.477
1.9954
0.487
2.2262
0.497
2.7478
0.4997
3.4316
0.49997
4.0128
0.478
2.0141
0.488
2.2571
0.498
2.8782
0.4998
3.5401
0.49998
4.1075
0.479
2.0335
0.489
2.2904
0.499
3.0902
0.4999
3.7190
0.49999
4.2649
6
Inf
254.31
19.496
8.5264
5.6281
4.3650
3.6689
3.2298
2.9276
2.7067
2.5379
2.4045
2.2962
2.2064
2.1307
2.0658
2.0096
1.9604
1.9168
1.8780
1.8432
1.8117
1.7831
1.7570
1.7330
1.7110
1.6906
1.6717
1.6541
1.6376
1.6223
1.5089
1.3893
1.2539
1.0000
60
252.20
19.479
8.5720
5.6877
4.4314
3.7398
3.3043
3.0053
2.7872
2.6211
2.4901
2.3842
2.2966
2.2229
2.1601
2.1058
2.0584
2.0166
1.9795
1.9464
1.9165
1.8894
1.8648
1.8424
1.8217
1.8027
1.7851
1.7689
1.7537
1.7396
1.6373
1.5343
1.4290
1.3180
5%.
of 30
250.10
19.462
8.6166
5.7459
4.4957
3.8082
3.3758
3.0794
2.8637
2.6996
2.5705
2.4663
2.3803
2.3082
2.2468
2.1938
2.1477
2.1071
2.0712
2.0391
2.0102
1.9842
1.9605
1.9390
1.9192
1.9010
1.8842
1.8687
1.8543
1.8409
1.7444
1.6491
1.5543
1.4591
20
248.01
19.446
8.6602
5.8025
4.5581
3.8742
3.4445
3.1503
2.9365
2.7740
2.6464
2.5436
2.4589
2.3879
2.3275
2.2756
2.2304
2.1906
2.1555
2.1242
2.0960
2.0707
2.0476
2.0267
2.0075
1.9898
1.9736
1.9586
1.9446
1.9317
1.8389
1.7480
1.6587
1.5705
probability
a 15
with
245.95
19.429
8.7029
5.8578
4.6188
3.9381
3.5107
3.2184
3.0061
2.8450
2.7186
2.6169
2.5331
2.4630
2.4034
2.3522
2.3077
2.2686
2.2341
2.2033
2.1757
2.1508
2.1282
2.1077
2.0889
2.0716
2.0558
2.0411
2.0275
2.0148
1.9245
1.8364
1.7505
1.6664
12
243.91
19.413
8.7446
5.9117
4.6777
3.9999
3.5747
3.2839
3.0729
2.9130
2.7876
2.6866
2.6037
2.5342
2.4753
2.4247
2.3807
2.3421
2.3080
2.2776
2.2504
2.2258
2.2036
2.1834
2.1649
2.1479
2.1323
2.1179
2.1045
2.0921
2.0035
1.9174
1.8337
1.7522
exceeded 10
is
241.88
19.396
8.7855
5.9644
4.7351
4.0600
3.6365
3.3472
3.1373
2.9782
2.8536
2.7534
2.6710
2.6022
2.5437
2.4935
2.4499
2.4117
2.3779
2.3479
2.3210
2.2967
2.2747
2.2547
2.2365
2.2197
2.2043
2.1900
2.1768
2.1646
2.0772
1.9926
1.9105
1.8307
9
which
240.54
19.385
8.8123
5.9988
4.7725
4.0990
3.6767
3.3881
3.1789
3.0204
2.8962
2.7964
2.7144
2.6458
2.5876
2.5377
2.4943
2.4563
2.4227
2.3928
2.3660
2.3419
2.3201
2.3002
2.2821
2.2655
2.2501
2.2360
2.2229
2.2107
2.1240
2.0401
1.9588
1.8799
8
f2).
freedom
238.88
19.371
8.8452
6.0410
4.8183
4.1468
3.7257
3.4381
3.2296
3.0717
2.9480
2.8486
2.7669
2.6987
2.6408
2.5911
2.5480
2.5102
2.4768
2.4471
2.4205
2.3965
2.3748
2.3551
2.3371
2.3205
2.3053
2.2913
2.2783
2.2662
2.1802
2.0970
2.0164
1.9384
,d
of 7
f1
236.77
19.353
8.8867
6.0942
4.8759
4.2067
3.7870
3.5005
3.2927
3.1355
3.0123
2.9134
2.8321
2.7642
2.7066
2.6572
2.6143
2.5767
2.5435
2.5140
2.4876
2.4638
2.4422
2.4226
2.4047
2.3883
2.3732
2.3593
2.3463
2.3343
2.2490
2.1665
2.0868
2.0096
(d
F
grees 6
∼
de
X
233.99
19.330
8.9406
6.1631
4.9503
4.2839
3.8660
3.5806
3.3738
3.2172
3.0946
2.9961
2.9153
2.8477
2.7905
2.7413
2.6987
2.6613
2.6283
2.5990
2.5727
2.5491
2.5277
2.5082
2.4904
2.4741
2.4591
2.4453
2.4324
2.4205
2.3359
2.2541
2.1750
2.0986
df2
5
df1,
30.16
where
19.296
9.0135
6.2561
5.0503
4.3874
3.9715
3.6875
3.4817
3.3258
3.2039
3.1059
3.0254
2.9582
2.9013
2.8524
2.8100
2.7729
2.7401
2.7109
2.6848
2.6613
2.6400
2.6207
2.6030
2.5868
2.5719
2.5581
2.5454
2.5336
2.4495
2.3683
2.2899
2.2141
χ)
with 4
≤
F
X
19.247
9.1172
6.3882
5.1922
4.5337
4.1203
3.8379
3.6331
3.4780
3.3567
3.2592
3.1791
3.1122
3.0556
3.0069
2.9647
2.9277
2.8951
2.8661
2.8401
2.8167
2.7955
2.7763
2.7587
2.7426
2.7278
2.7141
2.7014
2.6896
2.6060
2.5252
2.4472
2.3719
of
224.582
≤
3
(0
alue
P
v
215.71
19.164
9.2766
6.5914
5.4095
4.7571
4.3468
4.0662
3.8625
3.7083
3.5874
3.4903
3.4105
3.3439
3.2874
3.2389
3.1968
3.1599
3.1274
3.0984
3.0725
3.0491
3.0280
3.0088
2.9912
2.9752
2.9604
2.9467
2.9340
2.9223
2.8387
2.7581
2.6802
2.6049
the 2
is
199.50
19.000
9.5521
6.9443
5.7861
5.1433
4.7374
4.4590
4.2565
4.1028
3.9823
3.8853
3.8056
3.7389
3.6823
3.6337
3.5915
3.5546
3.5219
3.4928
3.4668
3.4434
3.4221
3.4028
3.3852
3.3690
3.3541
3.3404
3.3277
3.3158
3.2317
3.1504
3.0718
2.9957
ution
alue 1
v
161.45
18.513
10.128
7.7086
6.6079
5.9874
5.5914
5.3177
5.1174
4.9646
4.8443
4.7472
4.6672
4.6001
4.5431
4.4940
4.4513
4.4139
4.3807
4.3512
4.3248
4.3009
4.2793
4.2597
4.2417
4.2252
4.2100
4.1960
4.1830
4.1709
4.0847
4.0012
3.9201
3.8415
-Distrib
ulated \df1
F
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
40
60
tab
120
Inf
df2
5%
The
7
Student’s t Distribution
The value in bold at the head of each column gives the probability that t will exceed the tabulated
value with df degrees of freedom.
e.g. with 7 d.f., P(t > 2.998) = 1%
2p
0.20
0.10
0.05
0.02
0.01
0.002
0.001
df
p
0.10
0.05
0.025
0.01
0.005
0.001
0.0005
1
3.078
6.314
12.706
31.821
63.657
318.309
636.619
2
1.886
2.920
4.303
6.965
9.925
22.327
31.599
3
1.638
2.353
3.182
4.541
5.841
10.215
12.924
4
1.533
2.132
2.776
3.747
4.604
7.173
8.610
5
1.476
2.015
2.571
3.365
4.032
5.893
6.869
6
1.440
1.943
2.447
3.143
3.707
5.208
5.959
7
1.415
1.895
2.365
2.998
3.499
4.785
5.408
8
1.397
1.860
2.306
2.896
3.355
4.501
5.041
9
1.383
1.833
2.262
2.821
3.250
4.297
4.781
10
1.372
1.812
2.228
2.764
3.169
4.144
4.587
11
1.363
1.796
2.201
2.718
3.106
4.025
4.437
12
1.356
1.782
2.179
2.681
3.055
3.930
4.318
13
1.350
1.771
2.160
2.650
3.012
3.852
4.221
14
1.345
1.761
2.145
2.624
2.977
3.787
4.140
15
1.341
1.753
2.131
2.602
2.947
3.733
4.073
16
1.337
1.746
2.120
2.583
2.921
3.686
4.015
17
1.333
1.740
2.110
2.567
2.898
3.646
3.965
18
1.330
1.734
2.101
2.552
2.878
3.610
3.922
19
1.328
1.729
2.093
2.539
2.861
3.579
3.883
20
1.325
1.725
2.086
2.528
2.845
3.552
3.850
21
1.323
1.721
2.080
2.518
2.831
3.527
3.819
22
1.321
1.717
2.074
2.508
2.819
3.505
3.792
23
1.319
1.714
2.069
2.500
2.807
3.485
3.768
24
1.318
1.711
2.064
2.492
2.797
3.467
3.745
25
1.316
1.708
2.060
2.485
2.787
3.450
3.725
26
1.315
1.706
2.056
2.479
2.779
3.435
3.707
27
1.314
1.703
2.052
2.473
2.771
3.421
3.690
28
1.313
1.701
2.048
2.467
2.763
3.408
3.674
29
1.311
1.699
2.045
2.462
2.756
3.396
3.659
30
1.310
1.697
2.042
2.457
2.750
3.385
3.646
40
1.303
1.684
2.021
2.423
2.704
3.307
3.551
50
1.299
1.676
2.009
2.403
2.678
3.261
3.496
60
1.296
1.671
2.000
2.390
2.660
3.232
3.460
120
1.289
1.658
1.980
2.358
2.617
3.160
3.373
Inf
1.282
1.645
1.960
2.326
2.576
3.090
3.291
8
STAT 393 Test2 2020 Trimester 2
Name (Student ID): xxxxxxxxx (xxxxxxxxxx)
Q1.
Auto data set in the library ISLR contains 392 observations on the following 9 variables.
• mpg: miles per gallon
• cylinders: Number of cylinders between 4 and 8
• displacement: Engine displacement (cu. inches)
• horsepower: Engine horsepower
• weight: Vehicle weight (lbs.)
• acceleration: Time to accelerate from 0 to 60 mph (sec.)
• year: Model year (modulo 100)
• origin: Origin of car (1. American, 2. European, 3. Japanese)
• name: Vehicle name
We print the first 6 observations from Auto data set in the library ISLR.
library(ISLR)
head(Auto)
# Print the first 6 observations
##
mpg cylinders displacement horsepower weight acceleration year origin
## 1
18
8
307
130
3504
12.0
70
1
## 2
15
8
350
165
3693
11.5
70
1
## 3
18
8
318
150
3436
11.0
70
1
## 4
16
8
304
150
3433
12.0
70
1
## 5
17
8
302
140
3449
10.5
70
1
## 6
15
8
429
198
4341
10.0
70
1
##
name
## 1 chevrolet chevelle malibu
## 2
buick skylark 320
## 3
plymouth satellite
## 4
amc rebel sst
## 5
ford torino
## 6
ford galaxie 500
Suppose we fit the following multiple linear regression model
mpg =
β
+
β
+
ε
i
0 +
β1 ∗ weight
i
2 ∗ year
i
i,
i = 1
, . . . , n.
where we assume the error terms
i,
i = 1
, . . . , n, are mutually independent.
1. [25 marks] Let us denote the matrix form of the model
Y =
Xβ +
ε.
Do the following questions.
1
a. [2 marks] State the distribution of the error term
ε.
b. [10 marks] Let ˆ
β = (
XT X)−1
XT y be the LSE of
β. Show that
E( ˆ
β) =
β and show that
V ar( ˆ
β) =
σ2(
XT X)−1. Finally, state the distribution of ˆ
β.
c. [10 marks] Derive the distribution of ˆ
y =
X ˆ
β. Show your working.
d. [3 marks] What is the degrees of freedom associated with
SSE = (
y − ˆ
y)
T (
y − ˆ
y)?
2. [15 marks] Use R to do the following questions.
a. [5 marks] Find the design matrix
X. Print the first 10 row of the design matrix.
b. [3 marks] Calculate the LSE of
β.
c. [3 marks] Calculate the predicted values ˆ
y of
y. Print the first 10 predicted values.
d. [4 marks] Calculate the SSE and RSE.
3. [10 marks] Fit a multiple linear regression model with response variable mpg and covariates weight and
year.
a. [2 marks] Use summary() function to print the result.
b. [2 marks] Print the 99% confidence intervals of the coefficients of these values.
c. [6 marks] Interpret the p-values and the 99% confidence intervals for corresponding to weights
and year.
4. [5 marks] Suppose a new data is given as below.
new_cars =
data.frame(weight =
c(3500, 5000), year =
c(76, 81))
new_cars
##
weight year
## 1
3500
76
## 2
5000
81
Calculate the predicted value ˆ
y for the new data. Then compute the 99% confidence interval and the 99%
prediction interval for the value ˆ
y.
9. [5 marks] Fit the null model
Y =
β0 +
ε
and the full model
Y =
β0 +
β1 ∗ weight +
β2 ∗ year +
ε.
Then compare these models with anova() function. Interpret the p-value from the output.
10. [5 marks] Compare the full model given above and the model containing all variables in the Auto data
set. Then compare these models with anova() function. Interpret the p-value from the output.
[Q1 Total: 65 marks]
Q2.
A data set
Qmixed.csv containing the following variables:
• id: Patients ID.
• treat: Indicating
treatment or
control.
• time: Before (
pre) or after (
post) treatment.
• score: Score indicating health condition of patients. Higher score is considered better.
dflong <-
read.csv("Qmixed.csv", header=TRUE)
dflong
##
X id
treat time score
## 1
1
1
treat
pre
20
## 2
2
1
treat post
70
2
## 3
3
2
treat
pre
10
## 4
4
2
treat post
50
## 5
5
3
treat
pre
60
## 6
6
3
treat post
90
## 7
7
4
treat
pre
20
## 8
8
4
treat post
60
## 9
9
5
treat
pre
10
## 10 10
5
treat post
50
## 11 11
6 control
pre
50
## 12 12
6 control post
20
## 13 13
7 control
pre
10
## 14 14
7 control post
10
## 15 15
8 control
pre
40
## 16 16
8 control post
30
## 17 17
9 control
pre
20
## 18 18
9 control post
50
## 19 19 10 control
pre
10
## 20 20 10 control post
10
The data set
Qmixed.csv can be downloaded from the test folder on the blackboard.
library(dplyr)
library("ggplot2")
ggplot(dflong,
aes(x=
reorder(time,
desc(time)) , y=score, group = id, color=
factor(treat)) )
+
geom_line()
+geom_point()
+ xlab("time")
+ scale_colour_discrete("treat")
75
treat
50
control
score
treat
25
pre
post
time
We consider the following model
Yij =
β0 +
β1 · treat
i +
β2 · time
j +
β3 · treat
i · time
j +
bi +
εij, j = 1
, 2
, i = 1
, . . . , 10
,
where
bi ∼
N (0
, σ2)
, ε
b
ij ∼
N (0
, σ2)
and
bi, εij are mutually independent for all
i and
j.
1. [10 marks] Let us denote the matrix from of the model by
Y =
Xβ +
Zb +
ε.
a. [6 marks] Use R to find the fixed effect design matrix
X and the random effects design matrix
Z.
3
b. [2 marks] State the distribution of the error term
ε.
c. [2 marks] Write down components of the fixed effects parameter
β and the random effects parameter
b.
2. [15 marks] The marginal model takes in the following matrix form
Y =
Xβ +
ε∗
, ε∗ ∼
N (
0, Σ)
.
a. [8 marks] Derive an expression of the variance matrix
Σ in the marginal model. Show your working
(You need to show calculation involved in the derivation of the variance matrix).
b. [5 marks] Use the expression of the variance matrix
Σ and R to compute the matrix
Σ (use the
values ˆ
σb = 14
.14214 and ˆ
σ = 11
.40175 for the computation).
c. [2 marks] Identify the covariance structure of the model.
3. [15 marks] Use the matrix formula and R to compute and print the followings.
a. [5 marks] The GLS estimate of
β.
b. [5 marks] Estimate of
b.
c. [5 marks] The predicted value of
y.
4. [5 marks] Use lme() function in the library nlme to fit the model. Then print the estimate of the fixed
effect coefficient along with the corresponding p-values. Comment on the effect of the treatment on the
scores.
[Q2 Total: 45 marks]
[Test 2 Total = Q1 Total + Q2 Total: 110 marks]
4
SCHOOL OF MATHEMATICS AND STATISTICS
STAT 393
Assignment 1
Due: 4pm Friday 23 July 2021
Linear Models
This assignment is worth 10%
1. In assessing the impact of rainfall on the growth pattern of tussock grass (Chionochloa
rubra), several square metre plots were observed in 9 locations. 4 locations were in one
block A, and 5 in a second block B. In each plot the annual rainfall (x) and the number of
shoots per square metre (y) were recorded.
Block
A
A
A
A
B
B
B
B
B
Annual rainfall (mm), x
47
26
116
178
19
75
160
31
12
No. shoots/m2, y
15.1
14.1
13.2
12.7
14.6
13.8
11.9
14.8
15.3
(a) Draw a scatterplot of the data, using distinct symbols for each block. Attach the plot
created in R.
(b) Ignoring the grouping of the plots into blocks, fit a linear regression model y = α + β x
in R. State the equation of the fitted regression line to 4 significant figures.
(c) Draw the regression line on your scatterplot. Attach the plot created in R.
(d) Write down the model/design matrix (X ) and β for the regression model in (b) in the
form y = X β + ε .
(e) The ANCOVA model allowing for different regression relationships in each block is
yi j = µ + αi + β0xi j + βixi j + εi j,
where i refers to Block: i = 1 (A), and 2 (B). Treating Block A as the reference block.
Write down the constraints on the parameters.
(f) The model in (e) can be expressed in the form y = X β + ε . Write down y, X , and β for
the rainfall data.
(g) Fit the model in (e) in R. State the parameter estimates, including σ to 4 significant
b
figures.
(h) Are the slopes of two blocks the same? Justify your answer.
2. Let
a
11
a12 a13
b11 b12 b13
A = a21 a22 a23 ,
B = b21 b22 b23
a31 a32 a33
b31 b32 b33
(a) Find AB and BA.
(b) If both A and B are symmetric, are AB and BA equal?
(c) Show that tr(AB) = tr(BA).
3. Let
1
5
0
B = 1
0
5
1
0
5
(a) Are the column vectors of B linearly independent?
(b) What is the rank of B?
(c) Is B full row rank?
4. If y ∼ N(µ , Σ) and A has full row rank, it follows that Ay ∼ N(Aµ , AΣAT ) (Theorem 2 in the
notes on p. 39).
Use this result to find the joint distribution of w1 and w2, where w1 = y1 + y2 − 2y3 and
w2 = y1 + y2 given that y ∼ N(µ , Σ), with
1
4
1
0
µ = 0
and
Σ = 1
2
1 .
0
0
1
3
5. Let y
2
T
n×1 ∼ N(µ , σ I), where µ
= (µ, µ, . . . , µ). Find the distribution of the mean ¯y using
Theorem 2 in the notes on p. 39.
SCHOOL OF MATHEMATICS AND STATISTICS
STAT 393
Assignment 2
Due: 4pm Tuesday 10 August 2021
Linear Models
This assignment is worth 10%
Note: For any part of the assignment that requires using R, include R code and output. Make sure
that your answers are clearly refered to the appropriate place in the R output. My suggestion is as
follows:
• copy and paste R code and output into a word document, or
• create the pdf that comes from knitting the file using R Markdown.
You could print the document and write your answers. Alternatively, you could type your answers
in the document.
1. Low birth weight, defined as birth weight less than 2500 grams, is an outcome that has been
of concern to physicians. This is due to the fact that infant mortality rates and birth defect
rates are very high for low birth weight babies. A woman’s behavior during pregnancy (e.g.,
smoking habits) can greatly alter the chances of carrying the baby to term and, consequently,
of delivering a baby of normal birth weight.
(a) Consider the following variables:
• y: baby’s weight (kg),
• smoke: smoking status during pregnancy (0 for no; 1 for yes),
• race: with three categories (1, 2, and 3),
• weight: weight of mother at last menstrual period (kg).
Generate the data using the R code (a2_Q1_data.R), available on Blackboard. Note:
run the code without changing anything. For example, if the order of lines changes, the
dataset becomes different. Attach the dataset.
(b) Let yi be the baby’s weight, smokei be the smoking status, weighti be the weight of
mother at last mentrual period, and racei be the race of the ith mother. Consider the
following models:
M1: yi = α + β smokei + γ weighti + εi
M2: yi = α + β smokei + γ weighti + τrace + ε
i
i
iid
where
2
εi ∼ N(0, σ ). Fit models M1 and M2, respectively, using lm() in R, with the
constraint τ3 = 0. Run summary(). For each of the models, give coefficients estimates,
coefficients standard errors, p-values, and residual standard error. Note: Don’t forget
to setup race as a factor in R using babydata$race <- factor(race).
(c) Consider the model M2 from part (b). Show the model matrix X in R. Write down the
vector of parameters β in the matrix form y = X β + ε .
(d) Consider the model M2 from part (b). Use the matrix method to reproduce the following
results given by summary():
i. coefficients estimates,
ii. residual standard error, and
iii. coefficients standard errors.
(e) Consider the model M2 from part (b). State in words the interpretation of
i. ˆ
β
ii. ˆ
γ
iii. ˆ
τ1
iv. ˆ
τ2
2. Consider a linear model y
2
i = β xi + εi, i = 1, . . . , n, with εi ∼ N(0, σ ) for a given n-pairs of
values (xi, yi), i = 1, . . . , n. Show that the MLE of β has the form
n
ˆ
∑
x
i=1 iyi
β =
n
∑
x2
i=1 i
STAT393 Assignment 3
Data set: Housing Values in Suburbs of Boston
The Boston data frame has 506 rows and 14 columns.
This data frame contains the following columns:
• crim: per capita crime rate by town.
• zn: proportion of residential land zoned for lots over 25,000 sq.ft.
• indus: proportion of non-retail business acres per town.
• chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
• nox: nitrogen oxides concentration (parts per 10 million).
• rm: average number of rooms per dwelling.
• age: proportion of owner-occupied units built prior to 1940.
• dis: weighted mean of distances to five Boston employment centres.
• rad: index of accessibility to radial highways.
• tax: full-value property-tax rate per $10,000.
• ptratio: pupil-teacher ratio by town.
• black: 1000(Bk - 0.63)ˆ2 where Bk is the proportion of blacks by town.
• lstat: lower status of the population (percent).
• medv: median value of owner-occupied homes in $1000s.
Use MASS library to get Boston data set.
library(MASS)
head(Boston)
##
crim zn indus chas
nox
rm
age
dis rad tax ptratio
black lstat
## 1 0.00632 18
2.31
0 0.538 6.575 65.2 4.0900
1 296
15.3 396.90
4.98
## 2 0.02731
0
7.07
0 0.469 6.421 78.9 4.9671
2 242
17.8 396.90
9.14
## 3 0.02729
0
7.07
0 0.469 7.185 61.1 4.9671
2 242
17.8 392.83
4.03
## 4 0.03237
0
2.18
0 0.458 6.998 45.8 6.0622
3 222
18.7 394.63
2.94
## 5 0.06905
0
2.18
0 0.458 7.147 54.2 6.0622
3 222
18.7 396.90
5.33
## 6 0.02985
0
2.18
0 0.458 6.430 58.7 6.0622
3 222
18.7 394.12
5.21
##
medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
1. [5 marks] We will fit the model
medv =
β0 +
β1lstat +
β2age +
ε.
1
Use function model.matrix() to create the design matrix
X of the model. Then print the fist 10 rows of the
matrix.
2. [5 marks] Calculate the LSE
ˆ
β = (
XT X)−1
XT y.
3. [5 marks] Calculate the predicted values
ˆ
y =
X ˆ
β.
Then print the first 10 predicted values.
4. [5 marks] Calculate the SSE
SSE = (
y − ˆ
y)
T (
y − ˆ
y)
.
5. [5 marks] Calculate RSE (residual standard error)
s
SSE
RSE =
,
n −
p
where
n is the number of observations and
p is the number of
βs.
6. [5 marks] Calculate the variance matrix of ˆ
β
V ar( ˆ
β) =
σ2(
XT X)−1
.
Then calculate
SE( ˆ
β0)
, SE( ˆ
β1)
, SE( ˆ
β2).
7. [5 marks] Calculate 95% confidence intervals of
β0,
β1,
β2. The formula is
ˆ
βi ±
tn−
p,0
.975
SE( ˆ
βi)
.
Give interpretations of the confidence intervals.
8. [5 marks] Calculate
t-test statistic and its
p-value for testing
H0 :
βi = 0 vs
H1 :
βi 6= 0
for
i = 0
, 1
, 2. Give interpretations of the results.
9. [5 marks] Calculate SST (total sum of squares) and SSR (regression sum of squares)
SST = (
y − ¯
y1)
T (
y − ¯
y1)
and
SSR = (ˆ
y − ¯
y1)
T (ˆ
y − ¯
y1)
.
Check the equation
SST =
SSR +
SSE.
10. [5 marks] Do the ANOVA
F -test of the hypothesis
H0 :
β1 =
β2 = 0 vs
H1 : at least one of
β1 and
β2 is non-zero
.
11. [5 marks] Compute
R2 and adjusted-
R2
SSR
R2 =
SST
and
¯
SSE/(
n −
p)
R2 = 1 −
.
SST /(
n − 1)
Give interpretations of
R2 and adjusted-
R2.
12. [5 marks] Use lm() function to fit the model in 1.
2
model <- lm(medv ~ lstat + age, data = Boston)
Check that the following code reproduce the calculation in 1-11.
model.matrix(model)[1:10,]
summary(model)$coef
summary(model)$sigma
summary(model)$r.squared
summary(model)$adj.r.squared
summary(model)$fstatistic
confint(model)
vcov(model)
13. [5 marks] Use predict() function to compute confidence interval and prediction interval of
y at the
value lstat=mean(lstat) and age=mean(age).
14. [5 marks] We wish to compare 3 models.
• Model 1
medv =
β0 +
ε.
• Model 2
medv =
β0 +
β1lstat +
ε.
• Model 3
medv =
β0 +
β1lstat +
β2age +
ε.
Use anova() function to do the sequential
F -tests.
15. [5 marks] Use AIC, BIC functions to compare 3 models in the last question.
16. [5 marks] Get
R2 and adjusted-
R2 for the 3 models. Hint: use summary(model_name)$r.squared and
summary(model_name)$adj.r.squared.
[Total 80 marks]
3
STAT393 Assignment 4
Q1.
Data set dental contains the average sugar consumption (Sugar) and the estimated mean number of decayed,
missing and filled teeth at age 12 years (DMFT) for 90 countries (Country). The variable Indus indicates the
country is industrialized (Ind) and non-industrialized (NonInd).
1. [2 marks] Read dental.csv file in R. Then print summary(dental).
dental <- read.csv(file = "(change this part)/dental.csv", header=TRUE)
dental$Country <- as.factor(dental$Country)
dental$Indus <- as.factor(dental$Indus)
summary(dental)
2. [5 marks] Use the codes to create plots. Then make comments of plots.
par(mfrow=c(1,2))
plot( DMFT ~ Sugar, las=1, data=dental, pch=ifelse( Indus=="Ind", 19, 1),
xlab="Mean annual sugar consumption\n(kg/person/year)",
ylab="Mean DMFT at age 12")
legend("topleft", pch=c(19, 1), legend=c("Indus.","Non-indus."))
boxplot(DMFT ~ Indus, data=dental, las=1, ylab="Mean DMFT at age 12", xlab="Type of country")
3. [5 marks] Fit the model with the interaction
DMFT =
β0 +
β1Sugar +
β2Indus +
β3Sugar ∗ Indus +
ε.
Then use anova() function to create ANOVA table for the fitted model. Interpret the ANOVA table.
model <- lm(DMFT ~ Sugar*Indus, data=dental)
anova(model)
4. [5 marks] Print estimated values of coefficients (Estimate) along with Std. Erroe, t bvalue and
p-value. Interpret the p-values in the output. Also derive the estimated regression line for industrialized
countries and non- industrialized countries, separately. Comment how much change in DMFT is related
to one unit increase in Sugar in each group of countries.
Q2.
We continue to work on the dental data set.
1. [5 marks] Run the following code and give comments.
model <- lm(DMFT ~ Sugar*Indus, data=dental)
par(mfrow=c(1,3))
scatter.smooth(rstandard(model)~fitted(model),
xlab="Fitted values", ylab="Standardized residuals", las=1)
qqnorm(rstandard(model), las=1)
qqline(rstandard(model))
plot(cooks.distance(model), type="h", las=1)
1
im <- influence.measures(model)
summary(im)
3. [5 marks] The following code fit log transformed DMFT. Run the codes below and make comments.
model_log <- lm(log(DMFT) ~ Sugar*Indus, data=dental)
par(mfrow=c(1,3))
scatter.smooth(rstandard(model_log)~fitted(model_log),
xlab="Fitted values", ylab="Standardized residuals", las=1)
qqnorm(rstandard(model_log), las=1)
qqline(rstandard(model_log))
plot(cooks.distance(model_log), type="h", las=1)
im_log <- influence.measures(model_log)
summary(im_log)
4. [5 marks] Use AIC() and BIC() functions to compare the two models. Give comments.
5. [10 marks] Give interpretation of the final model.
[Total 42 marks]
2
SCHOOL OF MATHEMATICS AND STATISTICS
Te Kura Matai Tatauranga
STAT 393
Practice test 1 – August 2021
15 points 2/3
1. Write down your name and your favourite restaurant in Wellington on a piece of paper.
2. Run the R code (practice_test1.R). Copy and paste R code and output into a Word document.
Write down your name and student ID. You can also use R Markdown to run the R code.
Save (1) and (2) as one pdf document. The instructions below show how to save (1) and (2) as a single
pdf document. Upload the pdf file. Note: If R Markdown is used, create the pdf that comes from knitting
the file.
Test 1 Instructions:
• The test is open book, meaning you can have your notes and any Blackboard resources
open.
• You need to include your R code and output. Make sure that you write the Question
numbers on the R output. You can use R Markdown to run the R code and write your
answers.
• For hand-written asnwers, write down your answers on your own paper. Make sure
your answers numbered as well.
• Your written answers can be typed, hand-written or a mixture of the two. You need to
save your R code and output and written answers as one pdf file. To create the pdf,
you can take photos of hand-written answers on your phone, then copy and paste them
with your R code and output to a Word document that you save as a pdf.
• If your written answers are a mixture of hand-written and R Markdown, you may
create a Word document that comes from knitting the file. Then, you need to follow
the previous step to create one pdf file. If you knit R Markdown to a pdf file, you need
to use another methods to merge it with your hand-written pdf files.
• Check that your photos/answers are easy for us to read. If we cannot read them, they
will not be marked.
• Upload the pdf file of your answers following the instructions on Blackboard. It is
important that your answers are included in only ONE pdf document.
1
SCHOOL OF MATHEMATICS AND STATISTICS
Te Kura Matai Tatauranga
STAT 393
Test 1 – August 2021
15 points 2/3
Test 1 Instructions:
• The test is open book, meaning you can have your notes and any Blackboard resources
open.
• You need to include your R code and output. Make sure that you write the Question
numbers on the R output. You can use R Markdown to run the R code and write your
answers.
• For hand-written answers, write down your answers on your own paper. Make sure
your answers numbered as well.
• Your written answers can be typed, hand-written or a mixture of the two. You need to
save your R code and output and written answers as one pdf file. To create the pdf,
you can take photos of hand-written answers on your phone, then copy and paste them
with your R code and output to a Word document that you save as a pdf.
• If your written answers are a mixture of hand-written and R Markdown, you may
create a Word document that comes from knitting the file. Then, you need to follow
the previous step to create one pdf file. If you knit R Markdown to a pdf file, you need
to use another methods to merge it with your hand-written pdf files.
• Check that your photos/answers are easy for us to read. If we cannot read them, they
will not be marked.
• Upload the pdf file of your answers following the instructions on Blackboard. It is
important that your answers are included in only ONE pdf document.
-----------------------------------------------------------------------------
1. [50 marks] High level of stress and anxiety in workers are an important issue for a company. To
understand this issue, a study was conducted to collect the information about workers’ anxiety
scores based on a survey. Consider the following variables:
• y: anxiety score,
• comptype: company type (1: private-sector; 2: public-sector),
• citysize: size of city (1: small; 2: medium; 3: large),
• nworkers: total number of workers at company.
Generate the data using the R code (test1_Q1.R), available on Blackboard.
(a) Attach the dataset. Note : You need to generate your own dataset. Before you run the
R code, change the seed using your student ID. If you use the dataset without changing
the seed, you will receive a 0 mark.
(b) Treat both comptype and citysize as categorical. Show how you do it in R.
1
(c) Let yi be the anxiety score, ci be the company type, si be the size of city, and wi be the number
of workers at company for the ith worker. Consider the model:
yi = µ + α wi + βs + ε
i
i
iid
where
2
εi ∼ N(0, σ ). Fit the model using lm() in R, with the constraint β1 = 0. Run
summary(). Show your R output. Complete the following table (upto 3dp.):
Parameters
Estimate
Standard Error
p-value
µ
α
β1
β2
β3
(d) Consider the model in (1c). Show the model matrix X in R. Write down the vector of param-
eters β in the matrix form y = X β + ε.
(e) Consider the model in (1c). Use the matrix method to reproduce the following results given
by summary():
i. coefficients estimates, and
ii. residual standard error.
(f) Suppose that the uncorrected SSR = 61939, the uncorrected total sum of squares SST =
61951, and the SSE = 12. Fill in the following table:
Source
df
SS (uncorrected)
MS
Model
Error
Total
(g) Write down the null and alternative hypotheses for the test statistic
MSR (uncorrected)
F =
.
MSE
What is the distribution of the test statistic under the null hypothesis?
(h) Calculate the corrected SSR, the corrected total sum of squares SST, and the SSE using (1f)
to fill in the following table. Show your work.
Source
df
SS (corrected)
MS
Model
Error
Total
(i) Refer to (1c). Consider the following model:
yi = µ + α wi + βs + γ + ε
i
ci
i
iid
where
2
εi ∼ N(0, σ ). Fit the model using lm() in R, with the constraints β3 = 0 and γ2 = 0.
Run summary(). Show your R output. State in words the interpretation of
i. α
b
ii. γb1
2. [5 marks] Define Type I error, Type II error, and Power of a test.
2
3. [15 marks] Consider a test statistic W that follows a central F distribution with (2, 6) degrees of
freedom under H0. Under H1, W follows a non-central F distribution with (2, 6) degrees of
freedom with the non-centrality parameter λ . Give that λ = 2.5, calculate the power of the test at
a 5% significance level. What do you expect when λ increases and when a 10% significance level
is used? Justify your answer.
4. [30 marks] The linear model is defined by
y = X β + ε
where ε ∼ N(0, 2
σ I)
with n × 1 data vector y and n × p model matrix X .
(a) The MLE of
2
2
σ
is ˜
σ
= 1 (y − X
n
b
β)T (y − X b
β), where b
β is the MLE of β . Is it biased for
2
2
σ ? If yes, give the unbiased estimator of σ .
(b) Let the hat matrix be
H = X (X T X )−1X T = X X −
Show that the hat matrix is symmetric, and idempotent.
(c) Let the projection matrix be
P = I − H = I − X (X T X )−1X T = I − X X −
Show that
HP = 0
That is, H and P are orthogonal to each other.
(d) Consider H
2
0: β = 0 vs. H1: β 6= 0. Suppose we know the value of σ , find a test statistic
for the null hypothesis H0: β = 0.
3
STAT393 Test 2 (20 October 2021)
Q1. Credit Card Balance Data
In question 1, we work with Credit Card Balance Data. The aim here is to predict which customers will
default on their credit card debt.
• ID: Identification
• Income: Income in $10,000’s
• Student: A factor with levels No and Yes indicating whether the individual was a student
• Balance: Average credit card balance in $.
set.seed(123)
library(ISLR)
sample <- sample(1:400, 50)
Credit <- Credit[sample,c("ID", "Income", "Student", "Balance")]
head(Credit)
##
ID Income Student Balance
## 179 179 28.316
No
453
## 14
14 43.682
No
1081
## 195 195 30.406
No
0
## 306 306 24.460
No
0
## 118 118 91.362
No
1341
## 299 299 20.791
No
0
dim(Credit)
## [1] 50 4
str(Credit)
## 'data.frame':
50 obs. of 4 variables:
## $ ID
: int 179 14 195 306 118 299 229 244 399 374 ...
## $ Income : num 28.3 43.7 30.4 24.5 91.4 ...
## $ Student: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
## $ Balance: int 453 1081 0 0 1341 0 156 856 0 1003 ...
We fit two models.
Model1: Balance =
β0 +
β1Income +
β2Student +
ε
and
Model2: Balance =
β0 +
β1Income +
β2Student +
β3(Income × Student) +
ε.
The following outputs from R may be useful to answer the questions in Q1.
y <- Credit$Balance
X1 <- model.matrix(Balance ~ Income + Student, data= Credit)
X2 <- model.matrix(Balance ~ Income + Student + Income:Student, data= Credit)
n <- length(y)
1
p1 <- ncol(X1)
p2 <- ncol(X2)
• ˆ
β for model1 and model2:
beta_hat1 <- solve(t(X1)%*%X1) %*% t(X1) %*% y
beta_hat1
##
[,1]
## (Intercept) 285.322506
## Income
4.193717
## StudentYes 529.370842
beta_hat2 <- solve(t(X2)%*%X2) %*% t(X2) %*% y
beta_hat2
##
[,1]
## (Intercept)
278.612796
## Income
4.341373
## StudentYes
1268.073852
## Income:StudentYes -18.279093
• SSE for model 1 and model2 and SST.
y_hat1 <- X1%*%beta_hat1
y_hat2 <- X2%*%beta_hat2
y_bar <- mean(y)
SSE1 <- t(y-y_hat1) %*% (y-y_hat1)
SSE2 <- t(y-y_hat2) %*% (y-y_hat2)
SST <- t(y-y_bar) %*% (y-y_bar)
c(SSE1=SSE1, SSE2=SSE2, SST=SST)
##
SSE1
SSE2
SST
## 7232589 7077041 9204506
• RSE for model 1 and model2.
RSE1 <- sqrt(SSE1/(n-p1))
RSE2 <- sqrt(SSE2/(n-p2))
c(RSE1=RSE1, RSE2=RSE2)
##
RSE1
RSE2
## 392.2816 392.2355
• Variance matrix of ˆ
β for model1 and model2:
Var1 <- as.numeric(RSE1ˆ2)*solve(t(X1) %*% X1)
Var1
##
(Intercept)
Income StudentYes
## (Intercept)
8814.4584 -120.355511 -3955.52584
## Income
-120.3555
2.648581
13.42831
## StudentYes
-3955.5258
13.428308 41884.62395
Var2 <- as.numeric(RSE2ˆ2)*solve(t(X2) %*% X2)
Var2
##
(Intercept)
Income StudentYes Income:StudentYes
## (Intercept)
8856.9150 -121.307127 -8856.9150
121.307127
2
## Income
-121.3071
2.669523
121.3071
-2.669523
## StudentYes
-8856.9150 121.307127 581593.7577
-13355.263390
## Income:StudentYes
121.3071
-2.669523 -13355.2634
330.473946
• Diagnostics plots for model 2.
m2 <- lm(Balance ~ Income + Student + Income:Student, data= Credit)
plot(m2)
Residuals vs Fitted
91
118
500
0
Residuals
−500
153
−1000
400
600
800
1000
1200
Fitted values
lm(Balance ~ Income + Student + Income:Student)
3
Normal Q−Q
2
91
118
1
0
ed residuals
−1
Standardiz
−2
153
−2
−1
0
1
2
Theoretical Quantiles
lm(Balance ~ Income + Student + Income:Student)
Scale−Location
153
1.5
ls
a
91
118
u
id
s
re
1.0
d
e
iz
rd
a
d
0.5
n
ta
S
0.0
400
600
800
1000
1200
Fitted values
lm(Balance ~ Income + Student + Income:Student)
4
Residuals vs Leverage
2
1
374
1
383
0.5
0
ed residuals
0.5
−1
1
−2
Standardiz
Cook's distance
153
0.0
0.2
0.4
0.6
Leverage
lm(Balance ~ Income + Student + Income:Student)
Q1.1 [10 marks] Use model1 and do the test whether the average
Balance is significantly different between
Student and
Non-student. Use
α = 0
.05.
•
1) State the hypothesis;
•
2) calculate the value of the test statistic and state the distribution of the test statistic;
•
3) Find the p-value;
•
4) Make a conclusion.
Q1.2 [10 marks] Do the F-test to test whether the interaction term has significantly contribute to a prediction
of the average balance. Use
α = 0
.05.
Q1.3 [10 marks] Use model 2, write down the estimated regression line for
Student and
Non-student
separately.
Q1.4 [10 marks] Compute
R2 and adjusted-
R2 for model1 and model2 separately. Interpret
R2s and conclude
which model is better and give reasons why.
Q1.5 [10 marks] Give interpretation of diagnostics plots (Residuals vs Fitted, Normal Q-Q, Scale-Location,
Residuals vs Leverage).
Q2. Linear mixed effects model
Suppose we have paired data: for each individual (Ind=1,2) we have response values (Resp) for the pre-
treatment (Treat=0) and the post-treatment (Treat=1). We wish to know the effect of the treatment.
library("ggplot2")
df <- data.frame(Treat = c(0, 1, 0, 1),
Resp = c(8, 20, 3, 7),
Ind = c(1, 1, 2, 2))
df
5
##
Treat Resp Ind
## 1
0
8
1
## 2
1
20
1
## 3
0
3
2
## 4
1
7
2
ggplot(df, aes(x = Treat, y = Resp, color = factor(Ind), group = Ind)) +
geom_point(size = 3) + geom_line() + labs(color = "Individual")
20
15
Individual
1
Resp
10
2
5
0.00
0.25
0.50
0.75
1.00
Treat
We will fit the following random intercept linear mixed effects model
Resp
=
β
ij
1 +
β2Treat
ij +
bi +
εij ,
i = 1
, 2
, j = 1
, 2
.
We assume
εij ∼
N (0
, σ2) and
bi ∼
N (0
, σ2), and
b
b
i,
εij i = 1
, 2,
j = 1
, 2 are mutually independent.
We denote the response variable Resp
by
y
ij
ij and let
y
11
ε11
β
b
y
ε
β =
1
, b =
1
, y =
12
, ε = 12
β
2
b2
y
ε
21
21
y22
ε22
Then the linear mixed effects model can be written in the form
y =
Xβ +
Zb +
ε.
Q2.1 [10 marks] Find the design matrices
X and
Z. Show your working. (Do not use R for this question.)
Q2.2 [10 marks] What is the distribution of
ε ?
Q2.3 [10 marks] Show that the distribution of
y is
y ∼
N (
Xβ, Σ)
,
6
where
Σ =
σ2
ZZT +
σ2
I
b
4
and
I4 is the 4-dimensional identity matrix.
Q2.4 [10 marks] Use the value
σ2 = 1 and
σ2 = 2 to calculate the covariance matrix
b
Σ =
σ2
ZZT +
σ2
I
b
4
.
Identity the structure of the covariance matrix.
7
SCHOOL OF MATHEMATICS AND STATISTICS
STAT 393
Assignment 1
Due: 11:59pm Thursday 28 July 2022
Linear Models
This assignment is worth 10%
1. In assessing the impact of rainfall on the growth pattern of tussock grass (Chionochloa
rubra), several square metre plots were observed in 9 locations. 4 locations were in one
block A, and 5 in a second block B. In each plot the annual rainfall (x) and the number of
shoots per square metre (y) were recorded.
Block
A
A
A
A
B
B
B
B
B
Annual rainfall (mm), x
47
26
116
178
19
75
160
31
12
No. shoots/m2, y
15.1
14.1
12.3
12.7
14.6
13.8
11.9
14.8
15.3
(a) Draw a scatterplot of the data, using distinct symbols for each block. Attach the plot
created in R.
(b) Ignoring the grouping of the plots into blocks, fit a linear regression model y = α + β x
in R. State the equation of the fitted regression line to 4 significant figures.
(c) Draw the regression line on your scatterplot. Attach the plot created in R.
(d) Write down the model/design matrix (X ) and β for the regression model in (b) in the
form y = X β + ε .
(e) The ANCOVA model allowing for different regression relationships in each block is
yi j = µ + αi + β0xi j + βixi j + εi j,
where i refers to Block: i = 1 (A), and 2 (B). Treating Block B as the reference block.
Write down the constraints on the parameters.
Hint: In R, use
Block <- factor(Block)
lm(y~ relevel(Block, ref="B") + x + relevel(Block, ref="B"):x, x=T)
to fit the model.
(f) The model in (e) can be expressed in the form y = X β + ε . Write down y, X , and β for
the rainfall data.
(g) Fit the model in (e) in R. State the parameter estimates, including σ to 4 significant
b
figures.
(h) Are the slopes of two blocks the same? Justify your answer.
2. Let
1
5
0
B = 1
0
5
1
0
5
(a) Are the column vectors of B linearly independent?
(b) What is the rank of B?
(c) Is B full row rank?
3. For a n × p matrix A of full column rank (n ≥ p), we define the generalised inverse of A by
A− = (AT A)−1AT
(a) State the dimensions of the following matrices:
A
AT
AT A
(AT A)−1
A−
AA−
A−A
(b) Show that AA− is symmetric
(c) Show that AA− is idempotent
(d) What is the rank of AA−?
(e) Show that A−A = I
4. If y ∼ N(µ , Σ) and A has full row rank, it follows that Ay ∼ N(Aµ , AΣAT ) (Theorem 2 in the
notes on p. 39).
Use this result to find the joint distribution of w1 and w2, where w1 = y1 + y2 − 2y3 and
w2 = y1 + y2 given that y ∼ N(µ , Σ), with
1
4
1
0
µ = 0
and
Σ = 1
2
1 .
0
0
1
3
5. Let y
2
T
n×1 ∼ N(µ , σ I), where µ
= (µ, µ, . . . , µ). Find the distribution of the mean ¯y using
Theorem 2 in the notes on p. 39.
SCHOOL OF MATHEMATICS AND STATISTICS
STAT 393
Assignment 2
Due: 11:59pm Thursday 11 August 2022
Linear Models
This assignment is worth 10%
Note: For any part of the assignment that requires using R, include R code and output. Make sure
that your answers are clearly refered to the appropriate place in the R output. My suggestion is as
follows:
• copy and paste R code and output into a word document, or
• create the pdf that comes from knitting the file using R Markdown.
You could print the document and write your answers. Alternatively, you could type your answers
in the document.
1. Low birth weight, defined as birth weight less than 2500 grams, is an outcome that has been
of concern to physicians. This is due to the fact that infant mortality rates and birth defect
rates are very high for low birth weight babies. A woman’s behavior during pregnancy (e.g.,
smoking habits) can greatly alter the chances of carrying the baby to term and, consequently,
of delivering a baby of normal birth weight.
(a) Consider the following variables:
• y: baby’s weight (kg),
• smoke: smoking status during pregnancy (0 for no; 1 for yes),
• race: with three categories (1, 2, and 3),
• weight: weight of mother at last menstrual period (kg).
Generate the data using the R code (a2_Q1_data.R), available on Blackboard. Note:
run the code without changing anything. For example, if the order of lines changes, the
dataset becomes different. Attach the dataset.
(b) Let yi be the baby’s weight, smokei be the smoking status, weighti be the weight of
mother at last mentrual period, and racei be the race of the ith mother. Consider the
following models:
M1: yi = α + β smokei + γ weighti + εi
M2: yi = α + β smokei + γ weighti + τrace + ε
i
i
iid
where
2
εi ∼ N(0, σ ). Fit models M1 and M2, respectively, using lm() in R, with the
constraint τ1 = 0. Run summary(). For each of the models, give coefficients estimates,
coefficients standard errors, p-values, and residual standard error. Note: Don’t forget
to setup race as a factor in R using babydata$race <- factor(race).
(c) Consider the model M2 from part (b). Show the model matrix X in R. Write down the
vector of parameters β in the matrix form y = X β + ε .
(d) Consider the model M2 from part (b). Use the matrix method to reproduce the following
results given by summary():
i. coefficients estimates,
ii. residual standard error, and
iii. coefficients standard errors.
(e) Consider the model M2 from part (b). State in words the interpretation of
i. ˆ
β
ii. ˆ
γ
iii. ˆ
τ2
iv. ˆ
τ3
2. Consider a linear model y
2
i = β xi + εi, i = 1, . . . , n, with εi ∼ N(0, σ ) for a given n-pairs of
values (xi, yi), i = 1, . . . , n. Show that the MLE of β has the form
n
ˆ
∑
x
i=1 iyi
β =
n
∑
x2
i=1 i
STAT 393 Assignment 3: due by 11:59pm, Tuesday 4 October
This assignment needs you to use R.
You should include all your R code and output. Ensure that
you refer to relevant code and output explicitly. As with earlier assignments, you can either copy
and paste R code and output into a Word document that you then save as a pdf file, or you can create a pdf
by knitting a file in R Markdown. Ensure your submission is your own, independent work. Any sources of
information that you use which are external to STAT 393 course materials must be correctly cited/referenced.
Data set: Housing Values in Suburbs of Boston
The Boston data frame has 506 rows (observations) on 14 variables (columns):
• crim: per capita crime rate by town.
• zn: proportion of residential land zoned for lots over 25,000 sq.ft.
• indus: proportion of non-retail business acres per town.
• chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
• nox: nitrogen oxides concentration (parts per 10 million).
• rm: average number of rooms per dwelling.
• age: proportion of owner-occupied units built prior to 1940.
• dis: weighted mean of distances to five Boston employment centres.
• rad: index of accessibility to radial highways.
• tax: full-value property-tax rate per $10,000.
• ptratio: pupil-teacher ratio by town.
• black: 1000(Bk - 0.63)ˆ2 where Bk is the proportion of blacks by town.
• lstat: lower status of the population (percent).
• medv: median value of owner-occupied homes in $1000s.
Install the MASS package to access the Boston data set.
library(MASS)
attach(Boston)
head(Boston)
##
crim zn indus chas
nox
rm
age
dis rad tax ptratio
black lstat
## 1 0.00632 18
2.31
0 0.538 6.575 65.2 4.0900
1 296
15.3 396.90
4.98
## 2 0.02731
0
7.07
0 0.469 6.421 78.9 4.9671
2 242
17.8 396.90
9.14
## 3 0.02729
0
7.07
0 0.469 7.185 61.1 4.9671
2 242
17.8 392.83
4.03
## 4 0.03237
0
2.18
0 0.458 6.998 45.8 6.0622
3 222
18.7 394.63
2.94
## 5 0.06905
0
2.18
0 0.458 7.147 54.2 6.0622
3 222
18.7 396.90
5.33
## 6 0.02985
0
2.18
0 0.458 6.430 58.7 6.0622
3 222
18.7 394.12
5.21
##
medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
1
1. [2 marks] Define a new variable, the reciprocal of lstat, as follows:
lstatinv <- 1/lstat
2. [12 marks] Fit the following six models, each with explanatory variables entered in alphabetical order:
model1:
medv =
β0 +
β2dis +
β3lstat +
β4ptratio +
β5rm +
ε
model2:
medv =
β0 +
β1age +
β2dis +
β3lstat +
β4ptratio +
β5rm +
ε
model3:
medv =
β0 +
β2dis +
β3lstatinv +
β4ptratio +
β5rm +
ε
model4:
medv =
β0 +
β1age +
β2dis +
β3lstatinv +
β4ptratio +
β5rm +
ε
model5:
medv =
β0 +
β2dis +
β3 log(lstatinv) +
β4ptratio +
β5rm +
ε
model6:
medv =
β0 +
β1age +
β2dis +
β3 log(lstatinv) +
β4ptratio +
β5rm +
ε
Include R output from the summary() command for each of the six models.
For the next six questions (Q3-Q8), recall that if modelA has explanatory variables that are a subset of those
in modelB, a sequential F-test can be performed with the R command anova(modelA, modelB). Also recall
that anova(modelA) will break down the total sum of squares (SST) into terms explained sequentially by
modelA, which add to give the regression sum of squares (SSR), along with the residual (or estimated error)
sum of squares, SSE. Further, remember that models with the same dependent variable can be compared
using AIC(), BIC(), adjusted
R2 [i.e. ( ¯
R2)], or residual standard error (RSE).
3. [2 marks] Based on AIC values, rank the six models fitted in Question 2 from best to worst, for the
prediction/explanation of medv values.
4. [2 marks] Based on adjusted
R2 values, rank the six models fitted in Question 2 from best to worst, for
the prediction/explanation of medv values.
5. [2 marks] Based on BIC values, rank the six models fitted in Question 2 from best to worst, for the
prediction/explanation of medv values.
6. [6 marks] Comment on/discuss the similarities and/or differences in the model rankings that you have
provided in Questions 3 to 5.
7. [3 marks] Explain why the explanatory variable age is statistically significant in anova(model6) but
not significant in summary(model6).
8. [3 marks] Discuss the result from anova(model5, model6) and the statistical significance of the
explanatory variable age in summary(model6). Include in your discussion explicit commentary about
the F-statistic from anova(model5, model6) and the t-statistic on age in summary(model6).
Consider only model3 in Questions 9-17, that is:
model3:
medv =
β0 +
β2dis +
β3lstatinv +
β4ptratio +
β5rm +
ε
9. [2 marks] Give an interpretation of the
R2 value for model3.
10. [5 marks] Use the predict() function to compute a 95% confidence interval and a 95% prediction
interval for the response variable (medv) in model3, evaluated at the mean values of all the explanatory
variables in the model.
For the next four questions (Q11-14), use matrix methods to reproduce for model3 some of the standard
results from the lm() command.
11. [2 marks] Use the function model.matrix() to create the design matrix
X of model3, then print the
first 10 rows of the matrix
X.
12. [2 marks] Calculate and print the LSE
ˆ
β = (
XT X)−1
XT y
2
13. [2 marks] Calculate the predicted values
ˆ
y =
X ˆ
β
Print out the first 10 predicted values.
14. [8 marks] Calculate and print the residual standard error (RSE), then calculate the covariance matrix
of ˆ
β,
V ar( ˆ
β) =
σ2(
XT X)−1
.
Given
V ar( ˆ
β), calculate and print the standard errors of the estimated regression coefficients, SE( ˆ
β).
For the next three questions (Q15-17), make use of the diagnostic metrics available in the augment() function
from package broom. You will probably find it helpful to use syntax from the tidyverse package as well; for
example, to add index numbers to the rows of data, as in the lecture notes (see pp.6-7 in “Diagnostics for
linear regression”). You should also include specific diagnostic plots, as appropriate. As above, only consider
model3 for these questions:
model3:
medv =
β0 +
β2dis +
β3lstatinv +
β4ptratio +
β5rm +
ε
15.
a. [2 marks] State the maximum value of the response variable medv in the Boston data.
b. [2 marks] Print out diagnostic information about the five observations with the largest fitted values
from model3.
16.
a. [2 marks] Print out diagnostic information about the five observations with the largest (absolute
value) residuals from model3.
b. [2 marks] Print diagnostic plot 1, which displays Residuals vs Fitted values.
c. [3 marks] How many observations have
standardized residuals greater than 3 in absolute value,
indicating they may be possible outliers?
d. [2 marks] What percentage of the data (to 2dp) are possible outliers, using the definition from
Q16c?
17.
a. [2 marks] Print out diagnostic information about the five observations with the largest leverages
from model3.
b. [2 marks] Print diagnostic plot 5, which displays the (standardized) Residuals vs Leverages.
c. [4 marks] Calculate the value 2
p/n that identifies observations with “high leverage”. How many
times bigger (to 2dp) than 2
p/n is the highest leverage value?
18. [6 marks] Given your answers to Questions 15-17, discuss any features of the Boston data which are
problematic for prediction using any of the linear regression models fitted in this assignment. Despite
those issues, which of the six models fitted in Question 2 do you recommend as ‘best’, and why?
[Total: 80 marks]
3
SCHOOL OF MATHEMATICS AND STATISTICS
Te Kura Matai Tatauranga
STAT 393
Test 1 – August 2022
15 points 2/3
Test 1 Instructions:
• The test is open book, open notes.
• You need to include your R code and output. Make sure that you write the Question
numbers on the R output. You can use R Markdown to run the R code and write your
answers.
• For hand-written answers, write down your answers on your own paper. Make sure
your answers are numbered as well.
• Your written answers can be typed, hand-written or a mixture of the two. You need to
save your R code and output and written answers as one pdf file. To create the pdf,
you can take photos of hand-written answers on your phone, then copy and paste them
with your R code and output to a Word document that you save as a pdf.
• If your written answers are a mixture of hand-written and R Markdown, you may
create a Word document that comes from knitting the file. Then, you need to follow
the previous step to create one pdf file. If you knit R Markdown to a pdf file, you need
to use another method to merge it with your hand-written pdf files.
• Check that your photos/answers are easy for us to read. If we cannot read them, they
will not be marked.
• Upload the pdf file of your answers following the instructions on Blackboard. It is
important that your answers are included in only ONE pdf document.
• You do not need to type mathematical formulae or tables in a Word document, because
it might take too long to do so. Hand-written answers are fine.
• You need to generate your own dataset to answer questions. The instruction is in
Question 1(a).
-----------------------------------------------------------------------------
1. [80 marks] A study was conducted to investigate several psychological variables which have been
suggested as causes or effects of debt. We consider the following variables:
• y (prodebt): score on a scale of attitudes to debt (high values = favourable to debt)
• incomegp: income group (1=lowest, 5=highest)
• house: security of housing tenure (1=rent, 2=mortgage, 3=owned outright)
• manage: self-rating of money management skill (high values = high skill)
• locintrn: score on a locus of control scale (low values = external, high values = internal)
A locus of control scale refers to how much individuals believe that they can control events affect-
ing them. A high value implies a strong internal locus of control that individuals believe events in
their life primarily from their own actions. A low value implies a weak internal locus of control
that events are controlled by environmental factors which individuals cannot influence.
1
(a) Generate the data using the R code (test1_Q1.R), available on Blackboard. Attach the
dataset. Note : You need to generate your own dataset. Before you run the R code,
change the seed using your student ID. If you use the dataset without changing the seed,
you will receive a 0 mark.
(b) Treat house as categorical. Show how you do it in R.
(c) Let yi be the prodebt score, xi be the income group, hi be the security of housing tenure, mi
be the self-rating of money management skill, and ci be the score on a locus of control scale
for the ith subject. Consider the model:
yi = α + β1 xi + γh + β
i
2 mi + β3 ci + εi
iid
where
2
εi ∼ N(0, σ ).
Fit the model using lm() in R, with the constraint γ3 = 0. Run
summary(). Show your R output. Complete the following table (upto 4dp.):
Parameters
Estimate
Standard Error
p-value
α
β1
γ1
γ2
γ3
β2
β3
(d) Consider the model in (1c). Show the model matrix X in R. Write down the vector of param-
eters β in the matrix form y = X β + ε.
(e) Consider the model in (1c). Use the matrix method to reproduce the following results given
by summary():
i. coefficients estimates, and
ii. residual standard error.
h
i
(f) Consider the model in (1c). Use the matrix method to find d
Var b
β . What is the estimate of
Cov( b
β1, b
β2)?
(g) State in words the interpretation of
i. γb1
(h) Use matrix methods, calculate the corrected total sum of squares SST, the corrected SSR,
and the SSE in R.
(i) Write down the null and alternative hypotheses for F = MSR (corrected) . Find the value of F
MSE
and the p-value. Make your conclusion at a 5% level.
(j) When the null hypothesis is not true, the test statistic F = MSR (corrected) follows a non-
MSE
central F distribution. Suppose the non-centrality parameter has value of 3.4. Find the
power of the test at a 5% significance level.
2. [20 marks] The linear model is defined by
y = X β + ε
where ε ∼ N(0, 2
σ I)
with n × 1 data vector y and n × p model matrix X .
(a) Define H = X (X T X )−1X T (hat) and P = I − H = I − X (X T X )−1X T (projection).
2
i. Give the expressions for the fitted values of y and residuals using H and/or P.
ii. Express
2
σ
using H and/or P.
b
(b) Write down the distribution of y, including the mean and variance/covariance matrix.
(c) For the following model:
yi = µ + β xi + γx2 +
i
εi,
where i = 1, 2, 3, 4, 5. Write down the model matrix (X ) and parameter vector β.
3
STAT 393 Test 2 (7 November 2022)
Test 2 Instructions
For some questions you will need to include your
R code and output. Make sure that you write the
Question numbers on the R output. If you wish, you can use R Markdown to run the R code and generate
your answers.
For hand-written answers, write down answers on your own paper. Make sure your written answers are
numbered clearly as well.
Your written answers can be typed, hand-written or a mixture of the two. You need to save your R code plus
output and written answers as
one pdf file. To create the pdf, you can take photos of hand-written answers
on your phone, then copy and paste them with your R code and output to a Word document that you save
as a pdf.
If your written answers are a mixture of hand-written and R Markdown, you need to create a single pdf file,
for example by entering separate files into a single Word document that you then save as a pdf.
Check that your photos/answers are easy for us to read. If we cannot read them, they can not be marked.
Upload the pdf file of your answers following the instructions on Blackboard. It is important that your
answers are included in only
ONE pdf document.
You do not need to type mathematical formulae or tables in a Word document, because it might take too
long to do so. Hand-written answers are fine, as long as the writing is legible.
In question 1 you will need to use a sample from a dataset in the “ISLR” add-on package in R.
So, if it has not already been installed, install the “ISLR” add-on package in R using the following command:
install.packages("ISLR") # this will not be necessary, if it has been done previously
You need to sample your own dataset to answer question 1, using your Student ID. The R code for sampling
the data is included in the question. Before you run the R code,
change the seed to be your Student ID.
If you use the dataset without setting the seed to your Student ID, you will receive a mark of
zero for that question.
Q1.
In question 1, you will work with your own unique sample, generated using your Student ID, from the ISLR
data: Credit Card Balance Data. You will compare some models for Balance, which is the average credit
card balance in $.
The 12 variables in the dataset are:
• ID: Identification
• Income: Income in $10,000’s
• Limit: Credit limit
1
• Rating: Credit rating
• Cards: Number of credit cards
• Age: Age in years
• Education: Number of years of education
• Gender: A factor with levels Male and Female
• Student: A factor with levels No and Yes indicating whether the individual was a student
• Married: A factor with levels No and Yes indicating whether the individual was married
• Ethnicity: A factor with levels African American, Asian, and Caucasian indicating the individual’s
ethnicity
• Balance: Average credit card balance in $.
Q1.1. Generate your sample of data using the R code below. Note: You need to generate your
own dataset.
Before you run the R code,
change the seed to be your Student ID. If you use the dataset without
changing the seed, you will receive a mark of zero. Include the first few lines of your data in your answer.
#
# You need to replace "456" in set.seed() with your Student ID.
# For example, if your Student ID is 123456789, use
# set.seed(123456789)
#
set.seed(456)
# another warning - change the seed to be your Student ID!
sample <- sample(1:400, 50)
library(ISLR)
Credit <- Credit[sample,]
attach(Credit)
head(Credit)
str(Credit)
Q1.2. Now fit four linear models, using the R code below. The first has explanatory variables Income, Limit
and Student, while three further models respectively add the interaction terms Income × Student, Limit
× Student, and both Income × Student and Limit × Student. Include the summary output from all four
models in your answer.
m1<-lm(Balance ~ Income + Limit + Student, data= Credit)
summary(m1)
m2<-lm(Balance ~ Income + Limit + Student + Income:Student, data= Credit)
summary(m2)
m3<-lm(Balance ~ Income + Limit + Student + Limit:Student, data= Credit)
summary(m3)
m4<-lm(Balance ~ Income + Limit + Student + Income:Student + Limit:Student, data= Credit)
summary(m4)
AIC(m1,m2,m3,m4)
2
Q1.3. Use AIC to compare models m1, m2, m3, m4. Which model is preferred?
Q1.4. Use sequential ANOVA tests to compare models m1, m2, m3, and then models m1, m2, m4. In each case,
which models are preferred?
Q1.5. Consider model m1; that is, the linear model with explanatory variables Income, Limit and Student.
1.5. a. For model m1, what is the value of the penalty term that is added to -2 times the maximised
log-likelihood, to give the AIC value?
1.5. b. For model m1; what is the value of the maximised log-likelihood?
1.5. c. For model m1, show how to calculate the BIC value, starting from the maximised log-likelihood you
calculated in part b.
1.5. d. Describe/list the parameters that are counted in the penalty term, when the values of AIC and BIC
are calculated for model m1.
Q1.6. For model m1, obtain the model matrix
X in R then use
matrix methods to reproduce the following
results.
1.6. a. The coefficient estimates.
1.6. b. The residual standard error.
1.6. c. The covariance matrix d
Var(b
β).
1.6. d. The correlation matrix
cor(b
β).
[ Hint: use the function cov2cor() ]
d
1.6. e. What is the correlation between the estimated coefficients for Limit and Income?
Q1.7. What proportion of the variation in Balance is explained using model m1? Now remove the explanatory
variable Limit from model m1, leaving only Income and Student as explanatory variables. What proportion
of the variation in Balance is explained using this new, reduced model?
Q1.8. Using model m3, which includes the interaction term Limit × Student, write down the estimated
regression lines for predicting Balance in the separate cases of
Non-students and
Students separately.
That is, include the values of the estimated coefficients in the two prediction equations that you write down.
3
Q2. Do not use R for this question.
Suppose we have paired data: for each individual (Ind=1,2,3) we have response values (Resp) for pre-
treatment (Treat=0) and post-treatment (Treat=1). We wish to estimate the effect of the treatment.
library("ggplot2")
df <- data.frame(Treat = rep(c(0, 1), times=3),
Resp = c(8, 21, 3, 7, 5, 14),
Ind = rep(c(1, 2, 3), each=2))
df
##
Treat Resp Ind
## 1
0
8
1
## 2
1
21
1
## 3
0
3
2
## 4
1
7
2
## 5
0
5
3
## 6
1
14
3
ggplot(df, aes(x = Treat, y = Resp, color = factor(Ind), group = Ind)) +
geom_point(size = 3) + geom_line() + labs(color = "Individual")
20
15
Individual
1
Resp
2
10
3
5
0.00
0.25
0.50
0.75
1.00
Treat
We will consider the following random intercept linear mixed effects model:
Resp
=
β
ij
1 +
β2Treat
ij +
bi +
εij ,
i = 1
, 2
, 3
, j = 1
, 2
.
We assume
εij ∼
N (0
, σ2) and
bi ∼
N (0
, σ2), with
b
b
i,
εij i = 1
, 2
, 3,
j = 1
, 2 mutually independent.
4
Denote the response variable Resp
by
y
ij
ij and let
y
11
ε11
y
ε
b
12
12
β
1
y
ε
β =
1
, b =
b
21
, ε = 21
β
2
,
y =
2
y
ε
b
22
22
3
y
ε
31
31
y32
ε32
Then the linear mixed effects model can be written in the form:
y =
Xβ +
Zb +
ε.
Q2.1. Find the design matrices
X and
Z. Show your working. (Remember: do not use R for this question.)
Q2.2. What is the distribution of
ε?
Q2.3. Show that the distribution of
y is
y ∼
N (
Xβ, Σ)
,
where
Σ =
σ2
ZZT +
σ2
I
b
6
and
I6 is the 6-dimensional identity matrix.
Q2.4. Use the values
σ2 = 1 and
σ2 = 2 to calculate the covariance matrix
b
Σ =
σ2
ZZT +
σ2
I
b
6
.
Hence identify (i.e. name) the correlation structure in this covariance matrix.
Q3. Do not use R for this question.
Q3.1. Consider a linear model with subjects indexed by
i = 1
, 2 and repeated measurements on each subject
indexed by
j = 1
, 2
, 3
, 4. Hence there are 8 observations in total. A general model for repeated measurements
has dependence between observations on each subject, but observations on different subjects are independent.
Suppose the overall covariance structure for the data is given in the covariance matrix Σ =
σ2
V , where
V
V =
0
0
.
0
V0
The matrix
V0 corresponds to the covariance structure for each subject. Give examples of the form of the
4 × 4 matrix
V0 for the following covariance structures:
• Independent
• Compound symmetry
• First-order auto-regressive AR(1)
• Unstructured
Q3.2. Consider a model
y =
Xβ +
ε, with E(
ε) = 0 and Var(
ε) =
σ2
V . Show that the weighted least
squares estimator
˜
βW = (
XT W X)−1
XT W y
is unbiased, whatever the choice of symmetric matrix
W .
5
Document Outline