The second Andersen prototype system was tested in 2005 with 13 users (six boys and seven girls) from the target user population of children and teenagers aged 10-18. All users were Danish schoolchildren aged between 11 and 16, with an average age of 13 years.
The test method used was a controlled in-laboratory user test. Each user test session took 60-75 minutes. Sessions began with a brief introduction by the experimenter to the system setup and the available input modalities, followed by calibration of the headset microphone to the user’s voice. Each user tested the system in two different test conditions: a free-style conversation condition followed by a condition based on a conversation problems handout.
At the beginning of each session, the experimenter demonstrated both gesture-only behaviour (point, line, circle) and multimodal input, the latter with a single example that combined speech and gesture, such as "what is in this picture?" spoken while gesturing at a picture. Users were also told that they had to speak English, and were briefly told what Andersen knows about. Otherwise, they received no instruction at all in how to speak to the system.
Then followed 15 minutes of free-style interaction in which it was entirely up to the user to decide what to talk to Andersen about. In the following break, the user was asked to study a handout which listed 11 proposals on what the user could try to find out about Andersen’s knowledge domains, make him do, or explain to him. It was stressed that the user was not required to try to follow all the proposals. Rather, the user could pick those he or she liked, having a good time in the process. The second session had a duration of 20 minutes.
Following the two sessions with Andersen, each user was interviewed separately about his/her background, experiences from interacting with Andersen, views on system usability, proposals for system improvements, etc.
A total of 26 conversations corresponding to 8 hours of speech were recorded, logged and captured on video. Users’ input speech was subsequently transcribed. In the following, Table 1 shows technical and contractual evaluation criteria and results based on the collected data. Table 2 shows technical evaluation criteria and results at component level. Table 3 shows usability criteria and results measured via user interviews. Table 4 shows usability criteria and results based on user test data other than interviews. A full report on the evaluation of the Andersen system can be found in NICE Deliverable D7.2.
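As a rough cross-check of the session figures above, the following Python sketch recomputes the number of conversations and the scheduled interaction time from the stated protocol (13 users, each with a 15-minute and a 20-minute condition). The variable names are illustrative, and the calculation assumes that every condition ran for its nominal length, which the source does not confirm. The reported 8 hours of recorded speech is a measured total, so only approximate agreement should be expected.

```python
# Nominal protocol figures taken from the text; names are illustrative.
USERS = 13
CONDITION_MINUTES = (15, 20)  # free-style condition, then handout-based condition

# One conversation per user per test condition.
conversations = USERS * len(CONDITION_MINUTES)

# Scheduled interaction time, assuming each condition ran its nominal length.
scheduled_hours = USERS * sum(CONDITION_MINUTES) / 60

print(conversations)              # 26
print(round(scheduled_hours, 1))  # 7.6
```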
Table 1. Technical and contractual evaluation criteria and results at system level. In Column 2, "r" means revised formulation of a deliverable D7.1 criterion, "s" means a split of a D7.1 criterion into several distinct criteria.
| Number | Technical and contractual criteria | Explanation | Evaluation |
|---|---|---|---|
| 1 | Technical robustness | Quantitative; how often does the system crash; how often does it produce a bug which prevents continued interaction (e.g. a loop) | About 12 crashes distributed over various modules. Due to their particular causes, the crashes were unevenly distributed across the 26 user sub-sessions |
| 2 | Handling of out-of-domain input | Qualitative; to which extent does the system react reasonably to out-of-domain input | Out-of-domain handling enabled for user names, nationalities, fairytale names, game names, and explanations of fairytales and games |
| 3 | r,s Real-time performance, spoken part | Quantitative; how long does it usually take to get reaction from the system to spoken input | Mostly real-time. However, up to 10-14 seconds delay when the recogniser does not realise that the user stops talking and thus stays open for the max duration of 15 seconds |
| 4 | r,s Real-time performance, gesture part | Quantitative; how long does it usually take to get reaction from the system to gesture input | Analysis of the GR log files indicates that the mean time interval between the detection of a gesture (startOfGesture message produced by the GR) and the resulting message sent by the GR to the GI module was 47 ms (13093 ms / 281 GR frames). Furthermore, only one of the 13 users mentioned a small delay in the processing of gestures |
| 5 | Barge-in | Quantitative; is barge-in implemented | No barge-in. The intended environment of use in museums is considered hostile to barge-in |
| 6 | Number of characters | Characters in the game | One (HCA) |
| 7 | Number of emotions which can be expressed by characters | Quantitative; how many different emotions can be conveyed in principle | Four: neutral, happy, sad, angry |
| 8 | Actual emotion expression verbally and non-verbally | Quantitative; how many different emotions are actually conveyed verbally and non-verbally | Verbally: neutral, angry, happy, sad. Non-verbally: neutral, angry, sad |
| 9 | s Number of input modalities | Quantitative; how many input modalities does the system allow | Three: speech, 2D gesture, keyboard (arrow keys and F2) |
| 10 | s Number of output modalities | Quantitative; how many output modalities does the system allow | Six: speech, lip movements (visual speech), facial expression, hand/arm gesture, gaze, autonomous locomotion |
| 11 | Synchronisation of output | Qualitative; is output properly synchronised | Yes, except for a slight delay in onset of lip movements |
| 12 | Number of domains | Quantitative; how many domains can HCA talk about (his life, his fairytales, etc.) | His works (mostly his fairytales), his life, including childhood games and games users like, his physical and personal presence, his study including the objects in there, the user, and generic input including meta-communication |
| 13 | Number of different plots/scenes available | Quantitative; how many different plots/scenes can the user choose among | N/A |
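The mean gesture latency reported in row 4 of Table 1 follows directly from the two logged totals. A minimal sketch of that arithmetic is given here; the variable names are illustrative and not taken from the GR log format.

```python
# Totals reported in Table 1, row 4; variable names are illustrative.
total_latency_ms = 13093   # summed delay between gesture detection and the GR message to the GI
gesture_frames = 281       # number of GR frames (logged gestures)

mean_latency_ms = total_latency_ms / gesture_frames

print(round(mean_latency_ms))  # 47
```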
Table 2. Technical evaluation criteria and results at component level. In Column 2, "n" means new criterion, "r" means revised formulation of a deliverable D7.1 criterion, "s" means a split of a D7.1 criterion into several distinct criteria.
| Number | Technical component evaluation | Evaluation |
|---|---|---|
| Speech recogniser | ||
| 1 | n Perfect input utterance recognition | Danish group (4 gender-balanced target group users, randomly chosen from among the 13 users in the user test, having Danish as their first language), test condition 2: average 23%. English group (4 new target group users, gender-balanced, having English as their first language), test condition 2: average 33% |
| 2 | n Understanding of user input | Danish group: 49%; English group: 60% |
| 3 | n Understanding of user input + handling of non-understood input through meta-communication | Danish group: 85%; English group: 87% |
| 4 | Word error rate for English | Average over both test conditions = 70.73%; test condition 1 = 80.09%; test condition 2 = 61.38% |
| 5 | Vocabulary coverage for English | Out-of-vocabulary words: 2.5% |
| 6 | Perplexity of English language model | Not available |
| 7 | r Real-time performance | In principle yes, but delays were sometimes caused by the recogniser remaining open for 15 seconds although the user stopped speaking earlier |
| Gesture recogniser | ||
| 8 | Recognition accuracy regarding gesture type | Blind labelling of logged gesture shapes gave a correct gesture recognition rate of 87.2% (245/281). Several noisy shapes were observed. |
| 9 | Number of recognition failures | 36/281 = 12.8% of the gesture shapes were not assigned to the same class by blind labelling and by the GR module |
| Natural language understanding | ||
| 10 | Lexical coverage, English | Not available |
| 11 | NLU robustness, English | Perfect recognition, all thirteen users = 27%; understanding robustness = 34%; utterances understood = 47% |
| 12 | Topic spotter error rate, English | N/A. No topic spotter needed in PT2 due to its ontology-based design |
| 13 | Anaphora resolution error rate, English | N/A. No anaphora resolution in PT2 |
| Gesture interpretation | ||
| 14 | Selection of referenced objects error rate | Failure in processing of gesture-only input for referenceable objects involved the GI in only 4% of the cases |
| Input fusion | ||
| 15 | Robustness to temporal distortion between input modalities | 21 errors in the processing of multimodal behaviours which were due to unexpected delays between speech and gesture. They account for 43% of multimodal errors. 85% of these 21 errors were due to delays in start of speech which proved inappropriate when compared to the video |
| 16 | Fusion error rate | 40% of multimodal behaviours from an interaction point of view (75% from the point of view of the IF fusionStatus). For the processing of gesture-only behaviours, 13 cases were merged with wrong detection of speech |
| 17 | Cases in which events have not been merged and should have | 48 cases amounting to 75% of all multimodal error cases |
| 18 | Cases in which events have been merged and should not have | 25% of all multimodal error cases |
| 19 | Recognised modality combination error rate | Not considered relevant for evaluation |
| Character module | ||
| 20 | Meta-communication facilities | User input facilities: repeat, correct, clarify. System output facilities: repeat, rephrase, change topic, end conversation, i.e. a graceful degradation chain of context-dependent outputs, Kukbox, specific handling of why, where, and when questions. System-internal facilities: low speech recognition confidence score, high speech recognition confidence score |
| 21 | Handling of initiative | Fully mixed initiative. The user can take the initiative any time s/he wants and the system will follow |
| 22 | Performance of conversational history | Distributed discourse context and domain context histories in the character module. The former ensure graceful degradation to user input, appropriate reaction to repeated insults, and ability to remember the latest output. The latter ensure that HCA will not on his own initiative say the same thing twice and that certain implications of user input are taken into account |
| 23 | Handling of changes in emotion | HCA’s emotional state is updated for each user input |
| Response generation | ||
| 24 | Coverage of action set (non-verbal action) | 130 out of 150 available non-verbal behaviour primitives used |
| Graphical rendering (animation) | ||
| 25 | Synchronisation with speech output | Eleven visemes used |
| 26 | s Naturalness of animation, facial | Up to 5 non-verbal primitives are used per output turn out of 74 available |
| 27 | s Naturalness of animation, gesture | Up to 17 non-verbal primitives per output turn out of 50-60 available primitives for gesture |
| 28 | s Naturalness of animation, movement | Used in scripts. A script contains up to 40 lines of behaviour descriptions |
| Text-to-speech | ||
| 29 | Speech quality, English | Good |
| 30 | Intelligibility, English | Good |
| 31 | Naturalness, English | Fairly good, missing pauses in some places, prosody jumps and mispronunciation of homographs |
| Integration | ||
| 32 | Communication among modules | Good, except for the missing integration of NCA, CF and CA |
| 33 | Message broker | Works well |
| 34 | Processing time per module | Real-time, except when the recogniser remains open for 15 seconds although the user has stopped speaking; this results in perceived delays in answering |
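Two of the component-level measures in Table 2 lend themselves to a short worked example. The sketch below first recomputes the gesture recogniser rates from the reported counts (245 of 281 shapes agreeing with blind labelling), and then shows one common way of computing a word error rate: word-level edit distance divided by the length of the reference transcription. The source does not state how its word error rate was calculated, so `word_error_rate` is an assumed, conventional definition and not necessarily the project's scoring procedure; the example sentence pair is likewise hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumed, conventional definition; not taken from the evaluation report.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Gesture recogniser counts reported in Table 2, rows 8 and 9.
agreeing_shapes, logged_shapes = 245, 281
accuracy_pct = 100 * agreeing_shapes / logged_shapes
failure_pct = 100 * (logged_shapes - agreeing_shapes) / logged_shapes

print(round(accuracy_pct, 1))  # 87.2
print(round(failure_pct, 1))   # 12.8

# Hypothetical transcription pair: one reference word ("in") is dropped.
print(word_error_rate("what is in this picture", "what is this picture"))  # 0.2
```

Under this definition the rate can exceed 100% when the hypothesis contains many inserted words, which is one reason it is reported separately from the utterance-level understanding measures.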
Table 3. Usability criteria and results measured via user interviews. In Column 2, "n" means new criterion, "r" means revised formulation of a deliverable D7.1 criterion, "s" means a split of a D7.1 criterion into several distinct criteria.
| Number | Basic usability criteria | Explanation | Evaluation |
|---|---|---|---|
| 1 | Speech understanding adequacy | Subjective; how well does the system understand speech input | Fair, larger vocabulary and grammar desirable |
| 2 | Gesture understanding adequacy | Subjective; how well does the system understand gesture input | Good, but more objects should perhaps be active |
| 3 | n Combined speech/gesture understanding adequacy | Subjective; how well does the system understand combined speech/gesture input | Good, as long as the pointed-to object is active. Only about half of the users spoke while pointing |
| 4 | Output voice quality | Subjective; how intelligible and natural is the system output voice | Good, easy to understand |
| 5 | Output phrasing adequacy | Subjective; how adequate are the system’s output formulations | Good, occasionally slightly too long output |
| 6 | Animation quality | Subjective; how natural is the animated output | Lip synchrony okay, improvements of other animation aspects needed |
| 7 | Quality of graphics | Subjective; how good are the graphics | Rather good |
| 8 | Ease of use of input devices | Subjective; how easy are the input devices to use, such as the touch screen | Easy |
| Number | Core usability criteria | Explanation | Evaluation |
| 9 | r How natural is it to communicate via the available modalities | Subjective; how natural is it to communicate via the available modalities | Natural to use speech and touch screen |
| 10 | Output behaviour naturalness | Subjective; character believability, coordination and synchronisation of verbal and non-verbal behaviour, display of emotions, dialogue initiative and flow, non-communicative function, etc. | Looks like the real HCA, lip synchrony okay, display of emotions very limited, Non-Communicative Action somewhat odd for a 55-year-old man |
| 11 | r Ease of use of the game: How well did users complete the scenario tasks? | Subjective; how easy is it for the user to find out what to do and how to interact | Rather easy to interact with the system but somewhat difficult for several users to find out what to talk about. The problem sheet (2nd test condition) was felt to provide useful support |
| 12 | s Error handling adequacy, spoken part | Subjective; how good is the system at detecting errors relating to spoken input and how well does it handle them | Improvements needed |
| 13 | s Error handling adequacy, gesture-only part | Subjective; how good is the system at detecting and handling errors relating to gesture input | No error handling |
| 14 | Entertainment value | Subjective; this measure includes game quality and originality, interest taken in the game, feeling like playing again, time spent playing, user game initiative, etc. | Fun, good entertainment value |
| 15 | Educational value | Subjective; to which extent did the user learn from interacting with the system | Learned something, e.g. about HCA’s life or English |
| 16 | User satisfaction | Subjective; how satisfied is the user with the system | Rather good |
Table 4. Usability criteria and results based on user test data other than interviews. In Column 2, "n" means new criterion, "r" means revised formulation of a D7.1 criterion, "s" means a split of a D7.1 criterion into several distinct criteria.
| Number | Basic usability criteria | Explanation | Evaluation |
|---|---|---|---|
| 1 | s Frequency of interaction problems, spoken part | Quantitative; how often does a problem occur related to spoken interaction (e.g. the user is not understood or is misunderstood) | Danish user group: system misunderstandings, average = 15%. English user group: system misunderstandings, average = 13% |
| 2 | s Frequency of interaction problems, gesture part | Quantitative; how often does a problem occur related to gesture interaction | The answers to the question "Was he aware of what you pointed to and did he answer?" were all positive. The comparative analysis of the videos and the log files reveals that 51% of the gesture-only behaviours were successful from an interaction point of view and that 62% of the failures were due to gestures on non-referenceable objects. |
| 3 | s Frequency of interaction problems, graphics rendering part | Quantitative; how often does a problem occur related to graphics | An overheating graphics card made body parts fall off in the test with the first two users. At most five crashes were due to graphics |
| 4 | Sufficiency of domain coverage | Subjective; how well does the system cover the domains it announces to the user | Coverage is insufficient for travels, modern technology, and some personal questions |
| 5 | r Number of objects the subject(s) interacted with through gesture | Quantitative; serves to check to which extent the possibilities offered by the system are also used by users | All 18 referenceable objects were gestured at. In addition, a further 16 objects were gestured at. Each user gestured at between 6% and 89% of the 18 referenceable objects (average 62%). Two users did very little gesturing |
| 6 | r Average frequency of domains addressed by users in the conversation in percentage of number of turns | Quantitative; serves to check which domains users actually address and how often | User = 9.0; life = 8.1; works = 9.6; study = 15.6; hca = 5.7; generic = 51.9 |
| Number | Core usability criteria | Explanation | Evaluation |
| 7 | r Conversation success | Quantitative; how often is a transaction exchange between the user and the system successful | See Table 2 |
| 8 | Sufficiency of the system’s reasoning capabilities | Subjective; how good is the system at reasoning about user input | Needs identified in PT1 implemented. More reasoning concerning how much has been said about a topic already would be good |
| 9 | Scope of user modelling | Subjective; to which extent does the system exploit what it learns about the user | Very limited |
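The percentages in rows 5 and 6 of Table 4 can be sanity-checked with a few lines of Python. The sketch below sums the reported domain shares and converts the reported minimum, average and maximum percentages of referenceable objects into approximate object counts. The dictionary keys follow the labels in the table; the conversion to whole objects is an illustrative assumption, since the source reports only percentages, and the average need not correspond to a whole number of objects.

```python
# Domain shares in percent of turns, as reported in Table 4, row 6.
domain_share_pct = {
    "user": 9.0, "life": 8.1, "works": 9.6,
    "study": 15.6, "hca": 5.7, "generic": 51.9,
}

# The shares should add up to roughly 100%; the small gap is presumably rounding.
total_pct = round(sum(domain_share_pct.values()), 1)
print(total_pct)  # 99.9

# Reported minimum, average and maximum share of the 18 referenceable
# objects gestured at per user (Table 4, row 5), as approximate counts.
REFERENCEABLE_OBJECTS = 18
approx_objects = {
    share: round(REFERENCEABLE_OBJECTS * share / 100)
    for share in (6, 62, 89)
}
print(approx_objects)  # {6: 1, 62: 11, 89: 16}
```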