Appendix B  Statistical expressions
This appendix provides the details on computing the statistics in the Test Response System. The statistics are summarized below.
Statistics in Test Response System

Statistic                       Course and section    Course only
Alpha                                                 Part 2
Count of items                                        Parts 1, 2
Count of students                                     Parts 1, 2
Guessing penalty                Parts 4, 5
Item difficulty                                       Part 2
Item-test correlation                                 Part 2
Kuder-Richardson                                      Part 2
Mean score                      Parts 3, 4            Part 2
Response frequencies                                  Parts 1, 2
Score: Raw                      Parts 3, 4, 8, 11
Score: Percentage               Parts 4, 5, 9, 12
Score: New maximum              Parts 6, 7, 10, 13
Standard deviation                                    Part 2
Standard error of measurement                         Part 2
Student rank                    Parts 3, 4
Unweighted results              Part 4
Variance                                              Part 2
Weighted results                Part 4


The examples below use the data from the example described in the printout guide. Note that these data are not the same as the data used to illustrate the Topics examples.
Mean score
\[ M = \frac{\sum X}{N} \]
where M is the mean score for each section or for the entire class,
N is the number of students, and
X is a student's score.
In the example, the calculation of the mean is:
\[ M = \frac{45}{10} = 4.5 \]
Standard deviation
The standard deviation indicates how widely the scores on the test are spread around the mean.
\[ SD_{t} = \sqrt{\frac{\sum (X - M)^{2}}{N}} \]
where SD_{t} is the standard deviation of the test,
M is the mean of all students' scores,
X is a student's score, and
N is the number of students.
In the example the calculation is:
\[ SD_{t} = \sqrt{\frac{18.5}{10}} = \sqrt{1.85} \approx 1.36 \]
See Part 2 for more details.
Item-test correlation and correction (Magnusson, p. 200)
The item-test correlation is a point-biserial correlation coefficient that indicates the extent to which an item measures the same attribute as the test as a whole. The item-test correlation, r_{it}, is the correlation between the scores on item i and the total scores on all the items. Correlations of at least +.30 are desirable.
\[ r_{it} = \frac{M_{c} - M_{w}}{SD_{t}} \, SD_{i} \]
where r_{it} is the item-test correlation of an item i with the total test score,
M_{c} is the mean score of students who answered the item correctly,
M_{w} is the mean score of students who answered the item incorrectly,
SD_{t} is the standard deviation of the test,
SD_{i} is the standard deviation of the item, \( SD_{i} = \sqrt{pq} \), where
p is the proportion of students who answered the item correctly, and
q is the proportion of students who answered the item incorrectly.
See Part 2 for more details.
Since the item contributes to the total score, its contribution should be removed from the total before the correlation is interpreted. To correct for the inclusion of item i, the following correction is applied (Magnusson, p. 212):
\[ r_{i(t-i)} = \frac{r_{it} SD_{t} - SD_{i}}{\sqrt{SD_{t}^{2} + SD_{i}^{2} - 2 r_{it} SD_{t} SD_{i}}} \]
where r_{i(t-i)} is the corrected item-test correlation of an item,
SD_{t} is the standard deviation of the test,
SD_{i} is the standard deviation of an item, \( \sqrt{pq} \), where
p is the proportion of students who answered the item correctly and
q is the proportion of students who answered the item incorrectly, and
r_{it} is the unadjusted item-test correlation.
The table below contains the data required to calculate the correction for Question 1. This item was answered correctly by students Ann, Bob, Cam, Don, Fay, Guy, Hal, Ian and Joy, whose mean score for all items was 4.44. Eve, the only student who answered the item incorrectly, received a score of 5.0 for all items.
The calculation of the item-test correlation for Question 1 is:
\[ r_{it} = \frac{4.44 - 5.00}{1.36} \sqrt{(0.9)(0.1)} \approx -0.12 \]
The data required to calculate the correction for Question 1

                    Correctly answered by . . .
Question               Ann   Bob   Cam   Don   Eve   Fay   Guy   Hal   Ian   Joy    Σx     p     q
1                                                                                    9   0.9   0.1
2                                                                                    7   0.7   0.3
3                                                                                    6   0.6   0.4
4                                                                                    5   0.5   0.5
5                                                                                    3   0.3   0.7
6                                                                                    6   0.6   0.4
7                                                                                    7   0.7   0.3
8                                                                                    2   0.2   0.8
Score (X)                7     6     5     5     5     4     4     4     3     2    45
Deviation (X - M)²    6.25  2.25  0.25  0.25  0.25  0.25  0.25  0.25  2.25  6.25    Σ(X - M)² = 18.5

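The correction to exclude Question 1 is:
\[ r_{i(t-i)} = \frac{(-0.1225)(1.360) - 0.300}{\sqrt{1.85 + 0.09 - 2(-0.1225)(1.360)(0.300)}} \approx -0.33 \]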
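A minimal Python sketch of the item-test correlation and its correction for Question 1 follows. It assumes the point-biserial form (M_c - M_w) / SDt × sqrt(pq) and the usual part-whole correction, with SDt computed using N in the denominator. The function names are hypothetical, and the set of students who answered correctly is taken from the description of Question 1 above.

```python
import math

# Total scores (X) of the ten students in the example.
scores = {"Ann": 7, "Bob": 6, "Cam": 5, "Don": 5, "Eve": 5,
          "Fay": 4, "Guy": 4, "Hal": 4, "Ian": 3, "Joy": 2}
# Question 1 was answered correctly by everyone except Eve.
correct_q1 = {name for name in scores if name != "Eve"}


def item_test_correlation(scores, correct):
    """Return (r_it, SDt, SDi) for one dichotomously scored item."""
    xs = list(scores.values())
    n = len(xs)
    m = sum(xs) / n
    sd_t = math.sqrt(sum((x - m) ** 2 for x in xs) / n)  # N divisor (assumed)
    right = [x for name, x in scores.items() if name in correct]
    wrong = [x for name, x in scores.items() if name not in correct]
    p = len(right) / n
    q = 1 - p
    sd_i = math.sqrt(p * q)
    m_c = sum(right) / len(right)
    m_w = sum(wrong) / len(wrong)
    return (m_c - m_w) / sd_t * sd_i, sd_t, sd_i


def corrected_correlation(r_it, sd_t, sd_i):
    """Remove the item's own contribution from the total score."""
    numerator = r_it * sd_t - sd_i
    denominator = math.sqrt(sd_t ** 2 + sd_i ** 2 - 2 * r_it * sd_t * sd_i)
    return numerator / denominator


r_it, sd_t, sd_i = item_test_correlation(scores, correct_q1)
r_corrected = corrected_correlation(r_it, sd_t, sd_i)
print(f"r_it = {r_it:.2f}, corrected = {r_corrected:.2f}")
```

Both values are negative and well below the +.30 guideline, because the one student who missed Question 1 scored above the class mean.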
Kuder-Richardson 20 measurement of reliability (dichotomous answers) (Magnusson, p. 116)
See Part 2 for more details.
The Kuder-Richardson Formula 20 (KR20) is an estimate of the test's reliability. It normally varies between 0.00 and 1.00.
\[ KR20 = \frac{k}{k - 1}\left(1 - \frac{\sum pq}{SD_{t}^{2}}\right) \]
where KR20 is the Kuder-Richardson 20 reliability estimate of the test, assuming dichotomous answers (answers with only two possible outcomes, i.e., correct or incorrect),
k is the number of items in the test,
p is the proportion of students who answered the item correctly,
q is the proportion of students who answered the item incorrectly, and
SD_{t} is the standard deviation of the test.
The calculation of Kuder-Richardson in the example is:
\[ KR20 = \frac{8}{8 - 1}\left(1 - \frac{1.61}{1.85}\right) \approx 0.15 \]
The statistical significance of KR20 is assessed using the F distribution with the formula,
\[ F = \frac{1}{1 - KR20} \]
In the example the computation for F of Kuder-Richardson is
\[ F = \frac{1}{1 - 0.148} \approx 1.17 \]
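The KR20 estimate and its F value can be reproduced with the short Python sketch below. It assumes the standard KR20 formula with a test variance of 18.5 / 10 = 1.85 (N in the denominator) and takes F = 1 / (1 - KR20); both choices are assumptions, adopted because they reproduce the values of .15 and 1.17 reported for the example. The function names are illustrative.

```python
# Proportion of students answering each of the 8 items correctly
# (the p column of the example table).
p_values = [0.9, 0.7, 0.6, 0.5, 0.3, 0.6, 0.7, 0.2]
test_variance = 18.5 / 10  # SDt squared, N in the denominator (assumed)


def kr20(p_values, test_variance):
    """Kuder-Richardson Formula 20 for dichotomously scored items."""
    k = len(p_values)
    sum_pq = sum(p * (1 - p) for p in p_values)
    return k / (k - 1) * (1 - sum_pq / test_variance)


def f_ratio(reliability):
    """F value used to test whether the reliability differs from zero."""
    return 1 / (1 - reliability)


reliability = kr20(p_values, test_variance)
F = f_ratio(reliability)
print(f"KR20 = {reliability:.2f}, F = {F:.2f}")  # KR20 = 0.15, F = 1.17
```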
The probability of the F value is obtained from the SAS PROBF function (SAS, p. 579), which returns the probability that an observation from an F distribution is less than or equal to a given value x. The probability of a larger F value therefore has the following form,
1 - PROBF(x, ndf, ddf),
where x is the F value of the Kuder-Richardson statistic,
ndf is the degrees of freedom in the numerator (number of students in the class - 1), and
ddf is the degrees of freedom in the denominator ((number of students - 1) × (number of items - 1)).
In the example the result of .3270 is obtained with the following parameters:
F = 1.17
ndf = 10 - 1 = 9
ddf = (n - 1)(k - 1) = (10 - 1)(8 - 1) = 63.
Under the null hypothesis (KR20 = 0, no test reliability), the probability of obtaining a statistic at least as large as the one observed (KR20 = .15) is .3270. By convention, only probabilities below .05 are considered significant. Therefore the data provide no evidence that this examination is reliable.
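For readers without access to SAS, the quantity 1 - PROBF(x, ndf, ddf) can be approximated in plain Python. The sketch below is not the SAS implementation: it relies on the standard identity between the F distribution and the regularized incomplete beta function and integrates the beta density with Simpson's rule. The function name `f_upper_tail` and the `steps` parameter are hypothetical, and the routine assumes both degrees of freedom are greater than 2.

```python
import math


def f_upper_tail(x, ndf, ddf, steps=2000):
    """Approximate P(F > x) for an F distribution with (ndf, ddf) df.

    P(F <= x) equals the regularized incomplete beta function
    I_z(ndf/2, ddf/2) with z = ndf*x / (ndf*x + ddf). The integral is
    evaluated with Simpson's rule (steps must be even; assumes ndf, ddf > 2).
    """
    if x <= 0:
        return 1.0
    a, b = ndf / 2, ddf / 2
    z = ndf * x / (ndf * x + ddf)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def density(t):
        # Beta(a, b) density; it is zero at both ends when a, b > 1.
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp((a - 1) * math.log(t)
                        + (b - 1) * math.log(1 - t) - log_beta)

    h = z / steps
    total = density(0.0) + density(z)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return 1 - total * h / 3


p_kr20 = f_upper_tail(1.17, 9, 63)
print(f"P(F > 1.17) = {p_kr20:.3f}")
```

With F = 1.17, ndf = 9 and ddf = 63 the result should be close to the .3270 reported above; a small difference is expected because F is rounded here. If SciPy is available, `scipy.stats.f.sf(x, ndf, ddf)` computes the same upper-tail probability.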
Raw coefficient alpha (continuous answers) (Nunnally, p. 214)
See Part 2 for more details.
Cronbach's α is a more general measurement of reliability than Kuder-Richardson. It applies to a continuous distribution of values and is therefore appropriate for measuring the reliability of attitudes, opinions or behaviour with the five-point scale used in the Test Response System. Its computation is confined to students who answered all the items. Because of these computational differences, α and KR20 are not directly comparable in TRS reports.
There are two ways to express Cronbach's α. In the first, perfect reliability is given the value 1 and the error component is the ratio of the sum of the item variances to the variance of the student scores.
\[ \alpha = \frac{k}{k - 1}\left(1 - \frac{\sum SD_{i}^{2}}{SD_{t}^{2}}\right) \]
where α is Cronbach's α,
k is the number of items on the test,
SD_{i} is the standard deviation of an item, and
SD_{t} is the standard deviation of the test.
A second approach is equivalent to the first but expresses the relation as a ratio of two ratios composed of the variances and covariances of the item scores.
\[ \alpha = \frac{k(\bar{c}/\bar{v})}{1 + (k - 1)(\bar{c}/\bar{v})} \]
where α is Cronbach's α,
k is the number of items on the test,
\bar{c} is the average covariance, and
\bar{v} is the average variance.
The computation is confined to students who answered all items. As a result, it is difficult to compare the results between Kuder-Richardson and Cronbach's alpha unless we use an artificial example. Only four students answered all questions: Ann, Cam, Don and Fay.
Notice that the data, shown in the table below, represent values on a scale of 1 to 5, unlike the binary correct-incorrect dichotomy used to assess reliability of academic performance in the Kuder-Richardson example.
Data for the example, representing values on a scale of 1 to 5

                             Q1     Q2     Q3     Q4     Q5     Q6     Q7     Q8   Total
Ann                           1      2      3      4      5      1      2      1      19
Cam                           1      2      5      4      5      5      2      2      26
Don                           1      2      3      4      1      3      2      4      20
Fay                           1      2      3      3      4      1      3      5      22
Test variance SD_{t}^2                                                              9.58
Item variance SD_{i}^2     0.00   0.00   1.00   0.25   3.58   3.67   0.25   3.33   12.08

The computation using the ratio of the sum of the item variances to the test variance is
\[ \alpha = \frac{8}{8 - 1}\left(1 - \frac{12.08}{9.58}\right) \approx -0.30 \]
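The first form of coefficient alpha can be recomputed from the four complete response records with the Python sketch below. It uses `statistics.variance`, which divides by N - 1 and so matches the item and test variances shown in the table (for example 9.58 for the totals). The function name `cronbach_alpha` is illustrative.

```python
from statistics import variance  # sample variance, N - 1 in the denominator

# Ratings on Q1..Q8 for the four students who answered every item.
ratings = {
    "Ann": [1, 2, 3, 4, 5, 1, 2, 1],
    "Cam": [1, 2, 5, 4, 5, 5, 2, 2],
    "Don": [1, 2, 3, 4, 1, 3, 2, 4],
    "Fay": [1, 2, 3, 3, 4, 1, 3, 5],
}


def cronbach_alpha(rows):
    """alpha = k/(k-1) * (1 - sum of item variances / test variance)."""
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    test_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / test_variance)


alpha = cronbach_alpha(list(ratings.values()))
print(f"alpha = {alpha:.4f}")  # alpha = -0.2981
```

The sum of the item variances (12.08) exceeds the test variance (9.58), which is why the coefficient comes out negative for these artificial data.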
The second method, mathematically equivalent to the method shown above, uses a ratio of variance and covariance. Covariance is the mean product of the deviations of a pair of items about their respective means. It is calculated by dividing the total covariation by N - 1, where N is the number of students who answered all items on the test. An example of covariance is shown below for items 3 and 4:
Covariance of items 3 and 4

                                                Q3 (X)   Variation  Q4 (Y)   Variation       Covariation
                                                          (X - X̄)            (Y - Ȳ)   (X - X̄)(Y - Ȳ)
Ann                                                  3        -0.5       4        0.25            -0.125
Cam                                                  5         1.5       4        0.25             0.375
Don                                                  3        -0.5       4        0.25            -0.125
Fay                                                  3        -0.5       3       -0.75             0.375
Mean (X̄, Ȳ)                                        3.5                3.75
Sum of squared variation                                         3                0.75
Variance = Sum of squared variation ÷ (N - 1)                    1                0.25
Total covariation                                                                                    0.5
Covariance = Total covariation ÷ (N - 1)                                                          0.1667

Covariance matrix for all questions in the example

             Q2       Q3       Q4       Q5       Q6       Q7       Q8
Q1            0        0        0        0        0        0        0
Q2                     0        0        0        0        0        0
Q3                         0.1667   0.8333   1.6667  -0.1667  -0.6667
Q4                                 -0.0833      0.5    -0.25  -0.6667
Q5                                          -0.1667   0.0833       -2
Q6                                                      -0.5  -0.6667
Q7                                                             0.6667

The mean covariance (28 cells) is -.0446. Coefficient alpha also requires the mean variance of the items. This value is derived from the total squared deviation for each item, divided by N - 1; its mean for the 8 items is 1.5104. The calculation in the example is:
\[ \alpha = \frac{8(-0.0446/1.5104)}{1 + (8 - 1)(-0.0446/1.5104)} \approx -0.30 \]
The negative coefficient is an anomaly arising from an extremely small sample and artificial data.
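The covariance-based form can be checked in the same way. The Python sketch below computes every pairwise covariance with N - 1 in the denominator, averages the 28 covariances and the 8 item variances, and applies alpha = k(c/v) / (1 + (k - 1)(c/v)), where c and v are the mean covariance and mean variance. This expression is the standard covariance form of coefficient alpha and is assumed here to be the one intended; the function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, variance

# Ratings on Q1..Q8 for the four students who answered every item.
rows = [
    [1, 2, 3, 4, 5, 1, 2, 1],  # Ann
    [1, 2, 5, 4, 5, 5, 2, 2],  # Cam
    [1, 2, 3, 4, 1, 3, 2, 4],  # Don
    [1, 2, 3, 3, 4, 1, 3, 5],  # Fay
]


def covariance(x, y):
    """Total covariation divided by N - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def alpha_from_covariances(rows):
    """Return (alpha, mean covariance, mean variance)."""
    items = list(zip(*rows))
    k = len(items)
    mean_var = mean([variance(col) for col in items])
    mean_cov = mean([covariance(x, y) for x, y in combinations(items, 2)])
    ratio = mean_cov / mean_var
    return k * ratio / (1 + (k - 1) * ratio), mean_cov, mean_var


items = list(zip(*rows))
cov_q3_q4 = covariance(items[2], items[3])
alpha, mean_cov, mean_var = alpha_from_covariances(rows)
print(f"cov(Q3, Q4) = {cov_q3_q4:.4f}")
print(f"mean covariance = {mean_cov:.4f}, mean variance = {mean_var:.4f}")
print(f"alpha = {alpha:.4f}")
```

The result agrees with the variance-ratio form, as expected, since the two expressions are algebraically equivalent.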
The statistical significance of α is assessed with the F distribution (as for KR20). The F value for α is obtained from the formula,
\[ F = \frac{1}{1 - \alpha} \]
In the example the computation for F of α is
\[ F = \frac{1}{1 - (-0.29814)} \approx 0.7703 \]
The probability of the F value is obtained from the SAS PROBF function (SAS, p. 579), which returns the probability that an observation from an F distribution is less than or equal to a given value x. The probability of a larger F value therefore has the following form,
1 - PROBF(x, ndf, ddf),
where x is the F value of the raw coefficient alpha,
ndf is the degrees of freedom in the numerator (number of students in the class - 1), and
ddf is the degrees of freedom in the denominator ((number of students - 1) × (number of items - 1)).
In the example the result of .6439 is obtained with the following parameters:
F = .7703
ndf = 10 - 1 = 9
ddf = (n - 1)(k - 1) = (10 - 1)(8 - 1) = 63.
There is a high probability (.6439) that sampling error could account for the obtained α. We conclude that α does not differ significantly from zero and that the examination is not reliable.
Standard error of measurement
See Part 2 for more details.
The standard error of measurement (SE) is an estimate of the error component in a student's score that arises from imperfections in the test and from factors such as illness, distractions, and fatigue.
\[ SE = SD_{t}\sqrt{1 - \text{Reliability coefficient}} \]
where SE is the standard error of measurement,
SD_{t} is the standard deviation of the test, and
Reliability coefficient is Kuder-Richardson 20 or Cronbach's α.
The calculation in the example, using KR20, is:
\[ SE = 1.36\sqrt{1 - 0.15} \approx 1.25 \]
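A short Python sketch of the standard error of measurement follows. It assumes the usual form SE = SDt × sqrt(1 - reliability), with SDt = sqrt(18.5 / 10) from the example scores and the rounded KR20 value of .15 as the reliability coefficient. The function name is illustrative.

```python
import math


def standard_error(sd_t, reliability):
    """Standard error of measurement: SDt * sqrt(1 - reliability)."""
    return sd_t * math.sqrt(1 - reliability)


sd_t = math.sqrt(18.5 / 10)  # standard deviation of the test, N divisor
kr20 = 0.15                  # reliability reported for the example (rounded)
se = standard_error(sd_t, kr20)
print(f"SE = {se:.2f}")  # SE = 1.25
```

Because the reliability is so low in this example, the standard error is nearly as large as the standard deviation of the test itself.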