Evaluation Methodology of Educational Prototypes

Developing cost-effective methodologies for evaluating educational prototypes was a central question of this dissertation. The conclusions reached on this question are presented below.

Videotaping:
The importance of videotaping usability evaluations was confirmed. The multimedia files of critical incidents, the list of problems, and the detailed quantification of the problems detected were all generated from the videotapes. Videotaping also allows the problems encountered to be verified later, and it is an efficient way of archiving user interaction for future reference.

Videotapes also serve as an essential communication medium when it is difficult to persuade developers and managers that a certain usability problem is in fact a problem. Seeing a real user struggle with the problem convinces them [Pauch 1991].

The interaction of observer and subjects:
The importance of the observer being physically present, and of the quality of the interaction between observer and participants, was evident in this study. For most of the participants, the think-aloud process was not easy, and it was very important to have someone willing to help, to prompt them through the evaluation, or simply to address their speech to.

The critical incident and think-aloud techniques proved to be efficient ways of obtaining feedback from evaluation participants, particularly experts.

One disadvantage of the "think-aloud" method was that it did not lend itself very well to most types of performance measurement. Its strength, on the other hand, was the wealth of qualitative data that could be collected from a fairly small number of users. The subjects' comments also often contained vivid and explicit quotes that could be used to make the results more readable and memorable.

Number of Subjects:
It is clear that the more subjects are included in usability evaluations, the more generalizable the results become. However, in most situations it is not cost-efficient or viable to evaluate many people while preserving the richness and depth, or thickness, of feedback obtained in this study.

There is a trade-off between the statistical significance and the level of context of the results, and it needs to be taken into consideration. A recommendation would be to have a minimum of one group of four to six users and one group of four to six experts. The inclusion of more participants should be dictated by the relative availability of people, time, and equipment.

Questionnaires:
Questionnaires are recommended if they can be kept short and if the terminology used is familiar to the subjects. Judging from the experts' feedback in this study, existing generic questionnaires seemed to be of limited value for usability evaluations.

Questionnaires, however, if done appropriately and combined with other qualitative instruments, can be useful in terms of indicating general strengths and weaknesses when testing educational prototypes.

From the usability perspective, questionnaires remain indirect methods, since they study not the user interface itself but only users' opinions about it.

List of Problems:
The generation of a detailed list of usability problems is recommended, as this instrument was rated as one of the most valuable tools by the experts in this study.

The availability of a list of this nature can simplify the work of instructional designers when developing educational prototypes. Such a list should indicate the location of each problem, its incidence, and a clear description. The list could be connected via hyperlinks to multimedia files for each observer, giving the designers fast and efficient random access to the problems detected [Nielsen 1994].

A careful analysis of the List of Problems indicated the existence of four main categories of problems:

a) interface problems;

b) instructional problems;

c) language problems; and

d) programming problems.

This preliminary taxonomy of problems could be developed further.
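
As a rough illustration of how such a list might be kept in machine-readable form, the sketch below records the three recommended fields (location, incidence, description) together with the four categories named above. The class names, the `clip` field, and all sample entries are hypothetical and are not taken from the study's actual list.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The four categories named in the text above.
    INTERFACE = "interface"
    INSTRUCTIONAL = "instructional"
    LANGUAGE = "language"
    PROGRAMMING = "programming"


@dataclass
class Problem:
    location: str       # where in the prototype the problem occurred (format is an assumption)
    incidence: int      # number of participants who ran into the problem
    description: str    # clear description of the problem
    category: Category
    clip: str = ""      # hypothetical path to the multimedia file showing the incident


def count_by_category(problems):
    """Return (category, total incidence) pairs, most frequent first."""
    totals = Counter()
    for p in problems:
        totals[p.category] += p.incidence
    return totals.most_common()


# Invented sample entries, for illustration only.
problems = [
    Problem("Lesson 2, screen 4", 5, "Next button is hard to find",
            Category.INTERFACE, "clips/p01.mov"),
    Problem("Lesson 1, quiz", 3, "Feedback does not explain the wrong answer",
            Category.INSTRUCTIONAL),
    Problem("Main menu", 2, "Label uses unfamiliar jargon", Category.LANGUAGE),
    Problem("Lesson 2, screen 1", 1, "Icon meaning is ambiguous", Category.INTERFACE),
]

for category, total in count_by_category(problems):
    print(f"{category.value}: {total}")
```

Summing incidence by category in this way gives a quick view of where the problems concentrate, and the `clip` field is one possible place to hold the hyperlink to the corresponding multimedia file.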

Use of Statistical Tools:
Descriptive and non-parametric statistics are used more frequently in usability studies, due to the small number of subjects, and due to the relative ease of execution and interpretation of the results.
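
As one illustration of a non-parametric comparison suited to groups of this size, the sketch below computes the Mann-Whitney U statistic for two small independent groups. The source does not state which tests were applied in the study; the choice of test, the function name `mann_whitney_u`, and the ratings are assumptions made for illustration only.

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U statistic for two independent samples.

    Counts, over all pairs (x from a, y from b), how often x exceeds y,
    with ties counted as one half, and returns the smaller of the two
    possible U values.
    """
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)


# Invented questionnaire ratings (1 = poor, 5 = good) for two groups of five.
users = [3, 4, 2, 5, 3]
experts = [2, 1, 3, 2, 2]

print(mann_whitney_u(users, experts))
```

The resulting U would then be compared against a table of critical values for the two group sizes. No assumption about the shape of the underlying distribution is needed, which is what makes tests of this kind practical with four to six subjects per group.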

The reliability of usability studies can be a problem because of the huge differences between subjects' responses. It is not uncommon to find that the best user is 10 to 15 times faster than the slowest user [Egan 1988]. Usability testing therefore often puts designers in situations where they have to make decisions on the basis of fairly unreliable data, which is still better than making decisions with no data at all.
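
To make the spread between subjects concrete, the following sketch summarizes a small set of task completion times with descriptive statistics that are robust to extreme values. The function name `summarize_times` and the timing data are invented for illustration; they are not measurements from this study.

```python
from statistics import median


def summarize_times(times):
    """Descriptive summary of task times; expects at least two values."""
    ordered = sorted(times)
    n = len(ordered)
    lower = ordered[: n // 2]          # lower half (middle value excluded if n is odd)
    upper = ordered[(n + 1) // 2 :]    # upper half
    return {
        "n": n,
        "median": median(ordered),
        "iqr": median(upper) - median(lower),
        "slowest_to_fastest": ordered[-1] / ordered[0],
    }


# Invented completion times in seconds for six subjects.
task_times = [42, 55, 61, 78, 120, 480]

summary = summarize_times(task_times)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With data like these, the median and interquartile range describe the typical subject far better than the mean would, since a single very slow subject dominates the mean in a group of six.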

Cluster analysis is more complex, and its results are more difficult to convert into practical solutions for the developers. In this study, several combinations of the questionnaire answers were clustered in an attempt to find similarities between the kinds of problems found by users.

However, clustering methods are not standardized and may be implemented differently. A further problem with testing clusters is the difficulty of specifying what the null hypothesis should be.

Perhaps a better way of determining clusters would be to examine the validity of various solutions to the data, or to carry out replication studies [Kirakowski and Corbett 1990].
