.chapter16<-function(i=0){ " i Chapter 16: Distributions, i Hypothesis Tests - ---------------------------- -- --------------------- 1 Uniform distribution 21 Sample vs. Population 2 : 4 functions 22 One-sided vs. two-sided tests 3 : examples 23 significant level vs. confidence interval 4 Normal distributions 24 Three decision rules 5 : 4 functions 25 Usages of those 4 distributions 6 : examples 26 Critical T-value and p-value 7 : normality test 27 Relationship between T-value and P-value 8 Student-t distribution 28 Explain mean=0 9 : 4 distribution 29 t.test for mean return = 0 10 chisq distribution 30 11 31 F-test for equal variance 12 32 Durbin-Watson Autocorrelation 13 F-distribution 33 Durbin-Watson Autocorrelation Test 14 34 Grange Causality Test 15 F-distribution table 35 Grange Causality Test (an example) 16 36 17 37 18 38 19 Videos 39 20 Links 40 Example #1:>.c16 # find out the list Example #2:>.c16() # the same as the above Example #3:>.c16(1) # see the first explanation ";.zchapter16(i)} .n16chapter<-40 .zchapter16<-function(i){ if(i==0){ print(.c16) }else{ .printEachQ(16,i,.n16chapter) } } .c16<-.chapter16 .C16EXPLAIN4<-"density vs. cumulative distributions ///////////////////////////////////// ///////////////////////////////////// " .C16EXPLAIN5<-"Normal distributions ///////////////////////////////////// Normal and student-t distribution are symmetric. Normal distributions' density function is dnorm() --------------------------------------------- x<-seq(-3,3,0.1) y<-dnorm(x) # for cumulative normal distribution is pnorm() plot(x,y) ///////////////////////////////////// " .C16EXPLAIN6<-"student-t distributions ///////////////////////////////////// Student-t distributions' density function is dt(). assume df=60 (df is degree of freedom) --------------------------------------------- x<-seq(-3,3,0.1) y<-dt(x,60) # for cumulative t-distribution is pt() plot(x,y,main='student-t') Student-t distributions' density function is dt(). assume df=60 (df is degree of freedom) --------------------------------------------- x<-seq(-3,3,0.1) y<-dt(x,60) # for cumulative t-distribution is pt() plot(x,y,main='student-t') ///////////////////////////////////// " .C16EXPLAIN6<-"Generate a set random number from a normal distribution ///////////////////////////////////// set.seed(123) # everyone get the same set of values x<-rnorm(500, mean=0,sd=0.2) ///////////////////////////////////// " .C16EXPLAIN7<-"Chisq distribution ///////////////////////////////////// Chisq distributions' density function is dchisq(). Assume df=60 --------------------------------------------- x<-seq(0,70,0.5) y<-dchisq(x,60) # for cumulative is pchisq() plot(x,y,main='chisq') ///////////////////////////////////// " .C16EXPLAIN8<-"F-distribution ///////////////////////////////////// F-distributions' density funciton is df(). 
.C16EXPLAIN8<-"F-distribution
/////////////////////////////////////
The F-distribution's density function is df(). Assume df1=5, df2=10.
---------------------------------------------
x<-seq(0,7,0.5)
y<-df(x,5,10)     # the cumulative function is pf()
plot(x,y,type='l',main='F density')
/////////////////////////////////////
"
.C16EXPLAIN9<-"F-distribution table
/////////////////////////////////////
alpha = 0.05.  Columns are df1 (numerator df); rows are df2 (denominator df).
 df2        1        2        3        4        5        6        7        8        9
   1 161.4476    199.5 215.7073 224.5832 230.1619  233.986 236.7684 238.8827 240.5433
   2  18.5128       19  19.1643  19.2468  19.2964  19.3295  19.3532   19.371  19.3848
   3   10.128   9.5521   9.2766   9.1172   9.0135   8.9406   8.8867   8.8452   8.8123
   4   7.7086   6.9443   6.5914   6.3882   6.2561   6.1631   6.0942    6.041   5.9988
   5   6.6079   5.7861   5.4095   5.1922   5.0503   4.9503   4.8759   4.8183   4.7725
   6   5.9874   5.1433   4.7571   4.5337   4.3874   4.2839   4.2067   4.1468    4.099
   7   5.5914   4.7374   4.3468   4.1203   3.9715    3.866    3.787   3.7257   3.6767
   8   5.3177    4.459   4.0662   3.8379   3.6875   3.5806   3.5005   3.4381   3.3881
   9   5.1174   4.2565   3.8625   3.6331   3.4817   3.3738   3.2927   3.2296   3.1789
  10   4.9646   4.1028   3.7083    3.478   3.3258   3.2172   3.1355   3.0717   3.0204
  11   4.8443   3.9823   3.5874   3.3567   3.2039   3.0946   3.0123    2.948   2.8962
  12   4.7472   3.8853   3.4903   3.2592   3.1059   2.9961   2.9134   2.8486   2.7964
  13   4.6672   3.8056   3.4105   3.1791   3.0254   2.9153   2.8321   2.7669   2.7144
  14   4.6001   3.7389   3.3439   3.1122   2.9582   2.8477   2.7642   2.6987   2.6458
  15   4.5431   3.6823   3.2874   3.0556   2.9013   2.7905   2.7066   2.6408   2.5876
http://www.socr.ucla.edu/applets.dir/f_table.html
/////////////////////////////////////
"
.C16EXPLAIN12<-"When conducting a statistical test, we usually look at a T-value, P-value, or F-value.
/////////////////////////////////////
1) T-value
   If we choose a 5% level, the critical T-value is about 2 or -2
   (for a two-sided test with a large number of degrees of freedom).
   If our T-value is bigger than 2 or less than -2, we reject the null hypothesis.
2) P-value
   If we choose a 5% level, the critical p-value is 5%.
   If our p-value is less than 5%, we reject the null hypothesis.
3) F-value
   If our F-value is bigger than the critical F-value, we reject the null hypothesis.
/////////////////////////////////////
"
.C16EXPLAIN19<-"Videos
/////////////////////////////////////
StatsCast: What is a t-test? (9m56s)
https://www.youtube.com/watch?v=0Pd3dc1GcHc
Student's t-test (10m10s)
https://www.youtube.com/watch?v=pTmLQvMM-1M
t-test in Microsoft Excel
https://www.youtube.com/watch?v=BlS11D2VL_U
/////////////////////////////////////
"
.C16EXPLAIN20<-"Links
/////////////////////////////////////
Statistical Significance (T-Test)
http://docs.statwing.com/examples-and-definitions/t-test/statistical-significance/
T-test
http://www.investopedia.com/terms/t/t-test.asp
Student's t-test
https://en.wikipedia.org/wiki/Student%27s_t-test
/////////////////////////////////////
"
.C16EXPLAIN21<-"Sample vs. Population
/////////////////////////////////////
When estimating variance, we have two formulae: one for a sample and the
other for the population.
Assume that we have n observations, x1, x2, ..., xn.

           x1 + x2 + x3 + ... + xn
   mean = -------------------------
                      n

Variance based on the population:
           (x1-mean)^2 + (x2-mean)^2 + ... + (xn-mean)^2
   var1 = -----------------------------------------------
                                n

Variance based on the sample:
           (x1-mean)^2 + (x2-mean)^2 + ... + (xn-mean)^2
   var2 = -----------------------------------------------
                              n - 1

The same is true for the standard deviation:
   std1 = sqrt(var1)   # standard deviation based on the population
   std2 = sqrt(var2)   # standard deviation based on the sample
/////////////////////////////////////
"
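# A minimal sketch illustrating the sample vs. population formulae in the
# explanation above. The function name .c16varianceDemo and its default data
# are hypothetical; note that R's built-in var() and sd() use the sample
# (n-1) version.
.c16varianceDemo<-function(x=c(0.012,0.02,0.01,0.005,0.006,0.012,0.009)){
    n<-length(x)
    m<-mean(x)
    var1<-sum((x-m)^2)/n        # variance based on the population (divide by n)
    var2<-sum((x-m)^2)/(n-1)    # variance based on the sample (divide by n-1)
    cat('population variance =',var1,' std =',sqrt(var1),'\n')
    cat('sample     variance =',var2,' std =',sqrt(var2),'\n')
    cat('R var() gives       =',var(x),' (matches the sample formula)\n')
}
# Usage:  .c16varianceDemo()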
.C16EXPLAIN22<-"Null hypothesis vs. alternative hypothesis (hypotheses)
/////////////////////////////////////
We use H0 for the null hypothesis and Ha for the alternative hypothesis.
Below are a few examples.
1) H0   : x_mean = x0
   Ha(1): x_mean < x0
   Ha(2): x_mean > x0
2) H0   : x_mean > x0
   Ha   : x_mean < x0
3) H0   : var = var0
   Ha(1): var < var0
   Ha(2): var > var0
4) H0   : var > var0
   Ha   : var < var0
/////////////////////////////////////
"
.C16EXPLAIN23<-"alpha: significance level vs. confidence interval
/////////////////////////////////////
Alpha, the significance level, usually takes one of three values: 1%, 5%, or 10%.
Confidence level = 1 - alpha (alpha = 5% corresponds to a 95% confidence interval).

One-sided vs. two-sided tests
-----------------------------
The following hypothesis test is two-sided:
   H0   : x_mean = x0
   Ha(1): x_mean < x0
   Ha(2): x_mean > x0
When estimating the critical value, we use alpha/2.
Below is an example of a one-sided test:
   H0 : x_mean > x0
   Ha : x_mean < x0
/////////////////////////////////////
"
.C16EXPLAIN24<-"Three decision rules
/////////////////////////////////////
For example, we have a null hypothesis H0: x_mean = x0.
1) Range rule:
   if x0 is inside the range            --> accept H0
   if x0 is outside the range           --> reject H0
2) Distance rule:
   if the distance > critical distance  --> reject H0
   if the distance < critical distance  --> accept H0
3) Shaded-area rule:
   if the area < alpha                  --> reject H0
   if the area > alpha                  --> accept H0

Note #1: How to estimate the range?
   Range --> mean +- margin of error
   lower bound = mean - margin of error
   upper bound = mean + margin of error
   margin of error = t(critical value) * stdErr
               std
   stdErr = --------
             sqrt(n)
/////////////////////////////////////
"
.C16EXPLAIN25<-"Usages of those 4 distributions
/////////////////////////////////////
Case #1:  x_mean = x0, std is given      -> apply a normal distribution
Case #2:  x_mean > x0, std is given      -> apply a normal distribution
Case #3:  x_mean < x0, std is given      -> apply a normal distribution
Case #4:  x_mean = x0, std is not given  -> apply a t-distribution
Case #5:  x_mean > x0, std is not given  -> apply a t-distribution
Case #6:  x_mean < x0, std is not given  -> apply a t-distribution
Case #7:  var = var0                     -> apply a chisq distribution
Case #8:  var > var0                     -> apply a chisq distribution
Case #9:  var < var0                     -> apply a chisq distribution
Case #10: var1 = var2                    -> apply an F-distribution
Case #11: var1 > var2                    -> apply an F-distribution
                  var1
   f-statistic = ------ , where var1 > var2
                  var2
/////////////////////////////////////
"
.C16EXPLAIN26<-"Critical T-value and p-value
/////////////////////////////////////
p<-0.05
x<-qt(p/2, 40)    # critical t-value for a two-sided test with df=40
print(x)
[1] -2.021075
/////////////////////////////////////
"
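# A minimal sketch tying together the three decision rules (range, distance,
# shaded area) and the critical t-value from qt(). The function name
# .c16decisionDemo and its default inputs are hypothetical; it tests
# H0: x_mean = mu0 at a given alpha, so all three rules should agree.
.c16decisionDemo<-function(x=rnorm(50,mean=0.3,sd=1),mu0=0,alpha=0.05){
    n<-length(x)
    stdErr<-sd(x)/sqrt(n)                 # standard error
    tCritical<-qt(1-alpha/2,df=n-1)       # critical t-value (two-sided)
    # 1) Range rule: is mu0 inside [mean - margin, mean + margin]?
    margin<-tCritical*stdErr
    cat('range    :',mean(x)-margin,'to',mean(x)+margin,'\n')
    # 2) Distance rule: compare |t| with the critical t-value
    tValue<-(mean(x)-mu0)/stdErr
    cat('t-value  :',tValue,'  critical:',tCritical,'\n')
    # 3) Shaded-area rule: compare the p-value with alpha
    pValue<-2*pt(-abs(tValue),df=n-1)
    cat('p-value  :',pValue,'  alpha   :',alpha,'\n')
    cat('reject H0:',abs(tValue)>tCritical,'\n')
}
# Usage:  .c16decisionDemo()     # compare with t.test(x, mu=0)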
.C16EXPLAIN27<-"Explain the concept of 'does the mean equal zero?'
/////////////////////////////////////
Assume the first set of values has a mean of 0.05, while the second has a
mean of 0.01. Which one is closer to zero? Many would answer that 0.01 is
closer to zero. However, this is not necessarily correct: whether a mean is
statistically different from zero depends on its standard error, not on its
size alone.
Let's look at some hypothetical values:
0.012, 0.02, 0.01, 0.005, 0.006, 0.012, 0.009, 0.009, 0.001, 0.013, 0.012, 0.011, 0.01
Obviously, the mean of this set is not zero. If we ask 'is the next-period
return zero?', just by looking at those values we could say NO.
The second set of values is given below:
0.30, -0.10, -0.135, -0.105, 0.35, 0.32, -0.10, -0.05, 0.25, -0.13, 0.21, 0.28, 0.21
If we ask 'is the next-period return zero?', just by looking at those values
we cannot tell.

Methodology:
Step 1: calculate the mean
Step 2: calculate the standard deviation (std)
Step 3: calculate the standard error
                std
   stdErr = ---------
             sqrt(n)
   where std is from Step 2, sqrt() is the square root, and n is the number
   of observations.
Assume that we choose a 5% significance level.
We calculate the 95% confidence interval:
   [mean - 2*stdErr, mean + 2*stdErr]
If 0 is inside the interval, the mean is not statistically different from zero.
/////////////////////////////////////
"
.C16EXPLAIN28<-"t.test() for mean=0
/////////////////////////////////////
x<-read.csv('http://datayyy.com/data_csv/ibmDaily.csv')
n<-nrow(x)
head(x,2)
        Date     Open     High      Low    Close Adj.Close Volume
1 1962-01-02 7.713333 7.713333 7.626667 7.626667  0.670064 387200
2 1962-01-03 7.626667 7.693333 7.626667 7.693333  0.675921 288000
p<-x$Adj.Close
ret<-p[2:n]/p[1:(n-1)]-1
t.test(ret,mu=0)

        One Sample t-test
data:  ret
t = 3.771, df = 14222, p-value = 0.0001633
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.000240639 0.000761588
sample estimates:
   mean of x
0.0005011135
/////////////////////////////////////
"
.C16EXPLAIN31<-"F-test for equal variance
/////////////////////////////////////
x<-rnorm(500,mean=0,sd=0.5)
y<-rnorm(500,mean=0.5,sd=0.2)
var.test(x,y)

        F test to compare two variances
data:  x and y
F = 5.5419, num df = 499, denom df = 499, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 4.648984 6.606210
sample estimates:
ratio of variances
          5.541856
/////////////////////////////////////
"
.C16EXPLAIN32<-"Autocorrelation and the Durbin-Watson test
/////////////////////////////////////
Usually, closing stock prices are autocorrelated, while stock returns are
negatively autocorrelated.

DW test       Description
-----------   --------------------------
close to 2    not autocorrelated
> 2           negatively autocorrelated
< 2           positively autocorrelated
/////////////////////////////////////
"
.C16EXPLAIN33<-"Durbin-Watson autocorrelation test
/////////////////////////////////////
library(lmtest)
x<-read.csv('http://canisius.edu/~yany/data/ibmDaily.csv')
n<-nrow(x)
p<-x$Adj.Close
ret<-p[2:n]/p[1:(n-1)]-1
n2<-length(ret)
m<-as.integer(n2/2)
x<-rep(c(-1,1),m)
dwtest(ret[1:(2*m)]~x)

        Durbin-Watson test
data:  ret[1:(2 * m)] ~ x
DW = 1.452, p-value = 0.107
alternative hypothesis: true autocorrelation is greater than 0
/////////////////////////////////////
"
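# A minimal sketch showing what the Durbin-Watson statistic measures:
# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), which is close to 2 when the
# series is not autocorrelated. The function name .c16dwDemo is hypothetical
# and uses simulated data rather than the IBM file in the example above.
.c16dwDemo<-function(n=500){
    set.seed(123)
    e<-rnorm(n)                                     # independent errors --> DW near 2
    dw<-sum(diff(e)^2)/sum(e^2)
    cat('DW, independent series    :',dw,'\n')
    e2<-stats::filter(rnorm(n),0.8,method='recursive')  # AR(1) series, rho = 0.8
    dw2<-sum(diff(e2)^2)/sum(e2^2)
    cat('DW, autocorrelated series :',dw2,' (well below 2)\n')
}
# Usage:  .c16dwDemo()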
.C16EXPLAIN34<-"Granger causality test
/////////////////////////////////////
A causes (leads to) B                     ----> (logical reasoning)
The existence of A could help explain B   ----> (Granger causality)

The Granger causality test is a statistical hypothesis test for determining
whether one time series is useful in forecasting another, first proposed in
1969. Ordinarily, regressions reflect 'mere' correlations, but Clive Granger
argued that causality in economics could be tested for by measuring the
ability to predict the future values of a time series using prior values of
another time series. Since the question of 'true causality' is deeply
philosophical, and because of the post hoc ergo propter hoc fallacy of
assuming that one thing preceding another can be used as a proof of
causation, econometricians assert that the Granger test finds only
'predictive causality'.
https://en.wikipedia.org/wiki/Granger_causality
/////////////////////////////////////
"
.C16EXPLAIN35<-"Granger causality test (an example)
/////////////////////////////////////
library(lmtest)
data(ChickEgg)
dim(ChickEgg)
head(ChickEgg)
# chicken ---> egg, i.e., use chicken to explain egg
grangertest(egg ~ chicken,order=3, data=ChickEgg)

Granger causality test
Model 1: egg ~ Lags(egg, 1:3) + Lags(chicken, 1:3)
Model 2: egg ~ Lags(egg, 1:3)
  Res.Df Df      F Pr(>F)
1     44
2     47 -3 0.5916 0.6238

# egg ---> chicken
grangertest(chicken ~ egg,order=3, data=ChickEgg)

Granger causality test
Model 1: chicken ~ Lags(chicken, 1:3) + Lags(egg, 1:3)
Model 2: chicken ~ Lags(chicken, 1:3)
  Res.Df Df     F   Pr(>F)
1     44
2     47 -3 5.405 0.002966 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
/////////////////////////////////////
"
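# A minimal sketch of the idea behind grangertest(): if lagged values of x
# help predict y, then x 'Granger-causes' y. The data below are simulated,
# and the function name .c16grangerDemo is hypothetical; it requires the
# lmtest package used in the example above.
.c16grangerDemo<-function(n=200){
    if(!requireNamespace('lmtest',quietly=TRUE)) stop('please install lmtest')
    set.seed(123)
    x<-rnorm(n)
    y<-c(0,0.8*x[1:(n-1)])+rnorm(n,sd=0.5)   # y depends on the previous value of x
    d<-data.frame(x=x,y=y)
    print(lmtest::grangertest(y~x,order=1,data=d))  # x ---> y: expect a small p-value
    print(lmtest::grangertest(x~y,order=1,data=d))  # y ---> x: expect a large p-value
}
# Usage:  .c16grangerDemo()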