Is the Test Reliable?

This is an excerpt from Evidence-Based Practice in Athletic Training by Scot Raab & Deborah Craig.


The first term that is important for a consumer of research to understand is reliability. By strict definition, reliability refers to consistency in measures attained by people or instruments.1,2 Basically, does the tool you are using produce the same outcome each time you use it? An easy example is a pair of tape cutters. If every time you use the cutters to remove tape, you achieve a nice, clean cut, you would consider these cutters reliable. However, if sometimes they cut the tape and other times they do not, you would never know what outcome to expect; you would therefore consider these cutters unreliable. Reliability is important when you are trying to make a clinical decision. If you are performing an anterior drawer test on an ankle and one time you perform the test you feel laxity and the next time you fail to feel laxity, you will have difficulty making a decision. You may question whether the athlete changed, whether you applied different degrees of force, or whether the test itself is unreliable.

Reliability also relates to questionnaires. A well-structured item on a questionnaire should be easy to understand and not subject to misinterpretation. Let's look at an example of an item that uses a Likert scale of 1 to 4, with 1 being low or disagree and 4 being high or strongly agree. Here is the statement: Some of the most important tasks of an athletic trainer are administration duties and the bidding process. Although we can all agree that athletic trainers (ATs) often have some administrative duties and likely need to bid on purchases, this is a poor item that would probably be unreliable. One reader might believe that the most important duties of an AT relate to prevention and mark 1 for this item. Another reader might focus on the bidding portion of the statement and, believing that it's not that important, also mark 1. The researcher in this case has a value of 1 on two occasions but for very different reasons. However, an AT with administrative duties might focus on the administrative portion of the statement and rate this a 4. These varied scores based on individual interpretations of the statement would lead to low reliability and prohibit the researcher from drawing a valid conclusion.

Perhaps a better statement would look like this: Athletic trainers should have administrative skills. The scores for this may also vary, but they are less likely to vary because of misinterpretation. The variance in the scores would most likely be due to a variance in the opinions of the respondents. Thus, it is important to know that a study you are reviewing reports reliability values; without them, it's very difficult to trust the findings and make a judgment based on them.

As an evidence-based clinician, you should have at least a basic familiarity with several types of reliability measures. Four primary types are presented here: internal reliability, test-retest reliability, interrater reliability, and intrarater reliability.

Internal Reliability

Internal reliability is often an issue with surveys and questionnaires that employ a Likert scale system of values. The researcher wants to know whether the participant is responding to similar questions consistently. That is, did the person answer different questions about the same topic or construct in a similar manner?1 Let's assume that the researcher has a 30-item survey to find out athletes' opinions of the school's coaching staff. On item 4, which states that the school's coaching staff sincerely cares about athletes, an athlete might respond with a 1 (low or disagree). However, on item 22, which asks whether the athlete agrees that the coaches exhibit a caring nature, the same athlete might respond with a 4 (high or strongly agree). This is an example of two items asking about the same construct and receiving two very different responses, when the researcher would expect these answers to be similar. However, just because one person completes a survey in this fashion does not mean that the survey is unreliable. Only if a large number of participants answer the survey in this fashion would it be considered low in internal reliability. Such a pattern indicates that the participants misunderstood the items, that they deliberately answered in an odd fashion, or that the items need to be dropped or reworded to increase consistency. This touches on one assumption of survey research, which is that participants answer honestly.
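The excerpt does not name a statistic for internal reliability. One commonly reported choice is Cronbach's alpha, and the sketch below assumes that statistic purely for illustration. The function name and the two-item response data are hypothetical, standing in for items such as 4 and 22 in the coaching survey.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of participants' item scores.

    responses: one row per participant, one column per survey item.
    Assumes at least two items and that total scores are not all equal.
    """
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # regroup scores by item
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

# Hypothetical Likert (1 to 4) answers to two items about the same construct.
consistent = [[1, 1], [2, 2], [3, 3], [4, 4]]      # items agree
inconsistent = [[1, 4], [4, 1], [2, 2], [3, 3]]    # some athletes contradict themselves

print(cronbach_alpha(consistent))
print(cronbach_alpha(inconsistent))
```

When participants answer the paired items alike, alpha approaches 1; when many of them contradict themselves, alpha drops sharply, which is the pattern the paragraph above describes as low internal reliability.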

Test-Retest Reliability

The next form of reliability is test-retest reliability. As the name implies, it entails administering the test or assessment on more than one occasion. Test-retest reliability refers to the ability of an instrument or assessment to produce similar outcomes in the same group of people on multiple occasions, within a period short enough that the measurement of interest would not be expected to change.1,2 Let's assume that you use a dynamometer to assess the triceps extension strength in a group of athletes. The athletes are healthy college males between 18 and 24 years of age who are free of injury or disease and not participating in a strength training program. You would not expect their triceps strength to change in a week's time, so the scores of the first test and those of the second test should be similar. In this case, large changes in strength measures are likely due to unreliable readings from the dynamometer or to test administrator error.

Not all test-retest measures are as easy to visualize or straightforward as the strength testing example. If you take a standardized test such as the SAT or GRE twice, a week apart, and you don't study between the tests, your scores will likely be similar. That's because taking a large standardized test covering topics you might not know won't teach you the topics. If a person scores low in algebra on the SAT, merely taking the SAT again won't improve his ability to apply algebra. This is not the case with certain types of cognitive assessment. A standard practice for the assessment of concussion often includes a computerized concussion assessment. These types of tests, however, result in athletes learning how to take the test, subsequently increasing their scores on follow-up trials. This is referred to as a practice effect.

Two of the most common cognitive-based tests in clinical athletic training practice are the ImPACT Concussion Management Software (ImPACT Applications, Pittsburgh, PA) and Concussion Sentinel (CogState Ltd, Victoria, Australia). These tests show low to moderate test-retest reliability. This means that the scores of participants completing the assessments varied between tests more than is considered acceptable, which diminishes their clinical relevance. Test-retest reliability is often reported in the literature as a value between 0 and 1, with a value of 1 indicating the greatest reliability, or the fewest random effects that are unaccounted for in the results.3 The closer the reported values are to 1, the more trust you can put in the assessments.
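As a rough illustration of how a test-retest value near 1 can arise, the sketch below correlates two sessions of dynamometer readings. It uses Pearson's correlation as a simple stand-in; published studies often report other coefficients (for example, an intraclass correlation), and Pearson's r can also be negative, so treat this as an assumption rather than the method behind any value cited here. The function name and the strength readings are hypothetical.

```python
from math import sqrt

def pearson_r(first, second):
    """Pearson correlation between paired scores from two test sessions."""
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    # Sum of cross-products and sums of squares of the deviations.
    cross = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    ss_1 = sum((a - mean_1) ** 2 for a in first)
    ss_2 = sum((b - mean_2) ** 2 for b in second)
    return cross / sqrt(ss_1 * ss_2)

# Hypothetical triceps extension readings for five athletes, one week apart.
week_1 = [30, 35, 40, 45, 50]
week_2 = [31, 34, 41, 44, 51]

print(round(pearson_r(week_1, week_2), 3))
```

Because each athlete's second reading stays close to the first, the coefficient lands close to 1. A practice effect that raised every score by about the same amount would not lower this particular coefficient, which is one reason it is only a stand-in.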

Interrater Reliability

Consensus between professionals on how to best treat an injury is important. It does, however, require that similar results or conditions be interpreted the same way. If you are an experienced AT reviewing injury reports upon returning to work after a few days off, you want to trust that other ATs completed certain orthopedic assessments the same way you would. More important, you hope that those tests would render the same outcome regardless of the person performing them. This is interrater reliability: the agreement between clinicians who perform independent assessments of a condition.6 It is important to note that the consensus should be reached by both clinicians, but independently, without cognitive influence from each other. When the second AT is assessing an injury using the same assessment as the first, she should not know the outcome of the first.6 This ensures that the person performing the second assessment is not influenced by the outcome of the first assessment.

Also important when determining interrater reliability is that the athlete not recall the first assessment and respond differently to the second. This could be problematic when the assessment is on an injured body part and resulted in discomfort; the athlete may respond differently to the second assessment out of fear of re-creating the discomfort.

When reviewing articles and trying to determine the value of the results, remember that the higher the values for interrater reliability are, the more consistent the outcomes will be between assessors. Detailed examples are provided in subsequent chapters.
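To make the idea of an interrater value concrete, here is a minimal sketch using Cohen's kappa, a commonly reported agreement statistic for two raters. The excerpt does not specify a statistic, so this choice is an assumption, and the function name, the labels "lax" and "firm", and the ratings themselves are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    # Proportion of athletes on whom the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical anterior drawer findings from two ATs on the same eight ankles,
# each recorded without knowledge of the other's result.
at_1 = ["lax", "lax", "firm", "firm", "lax", "firm", "firm", "lax"]
at_2 = ["lax", "lax", "firm", "firm", "lax", "firm", "lax", "lax"]

print(cohen_kappa(at_1, at_2))  # 0.75
```

The two clinicians disagree on one ankle out of eight, so raw agreement is 0.875, but kappa is lower because some agreement would occur by chance alone.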

Intrarater Reliability

Intrarater reliability relates to the reproducibility of a clinical measure when performed by the same clinician on more than one occasion on the same athlete.2 The purpose of determining intrarater reliability is to assess whether the clinician is consistently using a reliable method of assessment. If the assessment method is not established as reliable, we would return to test-retest reliability to establish trust in the assessment or instrument (recall the tape cutters). An example would be assessing the range of motion of a knee. The bony landmarks used to measure knee flexion and the mechanics of the goniometer will not change from day 1 to day 2. However, range of motion assessments often vary when repeated by the same clinician. Imagine the difficulty in creating a treatment plan when on day 1 an athlete is assessed with 100° of flexion and on day 2 has only 90° of flexion. Variations in observed outcomes will confound the care provided to athletes.
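A simple way to see how much one clinician's repeated measurements drift is to average the absolute day-to-day differences. This is only an illustrative summary, not a formal intrarater reliability coefficient, and the function name and goniometer readings below are hypothetical (the first pair mirrors the 100° versus 90° example).

```python
def mean_absolute_difference(day_1, day_2):
    """Average absolute change between paired measurements, in degrees."""
    return sum(abs(a - b) for a, b in zip(day_1, day_2)) / len(day_1)

# Hypothetical knee flexion readings (degrees) taken by the same clinician
# on the same four athletes on two consecutive days.
day_1 = [100, 120, 95, 110]
day_2 = [90, 118, 97, 111]

print(mean_absolute_difference(day_1, day_2))  # 3.75
```

A small average difference suggests the clinician is measuring consistently; a single large discrepancy, such as the 10° gap for the first athlete, is the kind of variation that complicates a treatment plan.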
