The second Andersen prototype system was tested in 2005 with 13 users (six boys and seven girls) from the target user population of children and teenagers aged 10-18 years. All users were Danish schoolchildren aged between 11 and 16, with an average age of 13 years.

The test method used was a controlled in-laboratory user test. Each user test session took 60-75 minutes. Sessions began with a brief introduction by the experimenter to the system setup and the available input modalities, and with calibration of the headset microphone to the user's voice. Each user tested the system in two different test conditions: a free-style conversation condition followed by a condition based on a handout of conversation problems.

At the beginning of each session, the experimenter demonstrated both gesture-only behaviour (point, line, circle) and multimodal input, using a single example of a combination, such as "what is in this picture?" combined with a gesture towards a picture. Users were also told that they had to speak English, and were briefly told what Andersen knows about. Otherwise, however, they were given no instruction in how to speak to the system.

Then followed 15 minutes of free-style interaction in which it was entirely up to the user to decide what to talk to Andersen about. In the following break, the user was asked to study a handout listing 11 proposals for what the user could try to find out about Andersen's knowledge domains, make him do, or explain to him. It was stressed that the user was not required to follow all the proposals; rather, the user could pick those he or she liked, having a good time in the process. The second test condition lasted 20 minutes.

Following the two sessions with Andersen, each user was interviewed separately about his/her background, experiences from interacting with Andersen, views on system usability, proposals for system improvements, etc.

A total of 26 conversations, corresponding to 8 hours of speech, were recorded, logged, and captured on video. Users' input speech was subsequently transcribed. In the following, Table 1 shows technical and contractual evaluation criteria and results based on the collected data. Table 2 shows technical evaluation criteria and results at component level. Table 3 shows usability criteria and results measured via user interviews. Table 4 shows usability criteria and results based on user test data other than interviews. A full report on the evaluation of the Andersen system can be found in NICE Deliverable D7.2.

Table 1. Technical and contractual evaluation criteria and results at system level. In Column 2, "r" means revised formulation of a deliverable D7.1 criterion, "s" means a split of a D7.1 criterion into several distinct criteria.

| Number | | Technical and contractual criteria | Explanation | Evaluation |
|---|---|---|---|---|
| 1 | | Technical robustness | Quantitative; how often does the system crash; how often does it produce a bug which prevents continued interaction (e.g. a loop) | About 12 crashes distributed over various modules. Due to their particular causes, the crashes were unevenly distributed across the 26 user sub-sessions |
| 2 | | Handling of out-of-domain input | Qualitative; to what extent does the system react reasonably to out-of-domain input | Out-of-domain handling enabled for user names, nationalities, fairytale names, game names, and explanations of fairytales and games |
| 3 | r,s | Real-time performance, spoken part | Quantitative; how long does it usually take to get a reaction from the system to spoken input | Mostly real-time. However, up to 10-14 seconds of delay when the recogniser does not realise that the user has stopped talking and thus stays open for the maximum duration of 15 seconds |
| 4 | r,s | Real-time performance, gesture part | Quantitative; how long does it usually take to get a reaction from the system to gesture input | Analysis of the GR log files indicates that the mean interval between the detection of a gesture (the startOfGesture message produced by the GR) and the resulting message sent by the GR to the GI module was 47 ms (13093 ms / 281 GR frames; the computation is sketched below the table). Moreover, only one of the 13 users mentioned a small delay in the processing of gesture |
| 5 | | Barge-in | Quantitative; is barge-in implemented | No barge-in. The intended environment of use in museums is considered hostile to barge-in |
| 6 | | Number of characters | Characters in the game | One (HCA) |
| 7 | | Number of emotions which can be expressed by characters | Quantitative; how many different emotions can be conveyed in principle | Four: neutral, happy, sad, angry |
| 8 | | Actual emotion expression verbally and non-verbally | Quantitative; how many different emotions are actually conveyed verbally and non-verbally | Verbally: neutral, angry, happy, sad. Non-verbally: neutral, angry, sad |
| 9 | s | Number of input modalities | Quantitative; how many input modalities does the system allow | Three: speech, 2D gesture, keyboard (arrow keys and F2) |
| 10 | s | Number of output modalities | Quantitative; how many output modalities does the system allow | Six: speech, lip movements (visual speech), facial expression, hand/arm gesture, gaze, autonomous locomotion |
| 11 | | Synchronisation of output | Qualitative; is output properly synchronised | Yes, except for a slight delay in the onset of lip movements |
| 12 | | Number of domains | Quantitative; how many domains can HCA talk about (his life, his fairytales, etc.) | His works (mostly his fairytales); his life, including childhood games and games users like; his physical and personal presence; his study, including the objects in there; the user; and generic input, including meta-communication |
| 13 | | Number of different plots/scenes available | Quantitative; how many different plots/scenes can the user choose among | N/A |
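
The latency figure in row 4 is simply the total measured delay divided by the number of gesture frames (13093 ms / 281 ≈ 47 ms), where each delay runs from the GR's startOfGesture message to the resulting message sent to the GI module. A minimal sketch of that computation, assuming (hypothetically) that the timestamp pairs have already been extracted from the GR log files; the real log format is not reproduced here:

```python
# Hypothetical (start, end) timestamp pairs in milliseconds, one per gesture
# frame: 'start' is when the GR emits startOfGesture, 'end' is when the GR
# sends its resulting message to the GI module.
gesture_events = [
    (120_340, 120_389),
    (131_902, 131_947),
    (145_210, 145_258),
]

# Mean latency = sum of per-gesture delays / number of gesture frames.
delays = [end - start for start, end in gesture_events]
mean_delay_ms = sum(delays) / len(delays)
print(f"Mean GR-to-GI latency: {mean_delay_ms:.0f} ms over {len(delays)} frames")
```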

Table 2. Technical evaluation criteria and results at component level. In Column 2, "n" means new criterion, "r" means revised formulation of a deliverable D7.1 criterion, "s" means a split of a D7.1 criterion into several distinct criteria.

| Number | | Technical component evaluation | Evaluation |
|---|---|---|---|
| | | Speech recogniser | |
| 1 | n | Perfect input utterance recognition | Danish group (4 gender-balanced target-group users randomly chosen from among the 13 test users, having Danish as their first language), test condition 2: average 23%. English group (4 new gender-balanced target-group users, having English as their first language), test condition 2: average 33% |
| 2 | n | Understanding of user input | Danish group: 49%. English group: 60% |
| 3 | n | Understanding of user input + handling of non-understood input through meta-communication | Danish group: 85%. English group: 87% |
| 4 | | Word error rate for English | Average over both test conditions = 70.73%. Test condition 1 = 80.09%. Test condition 2 = 61.38%. (The standard WER computation is sketched below the table) |
| 5 | | Vocabulary coverage for English | Out-of-vocabulary words: 2.5% |
| 6 | | Perplexity of English language model | Not available |
| 7 | r | Real-time performance | In principle yes, but delays were sometimes caused by the recogniser remaining open for 15 seconds although the user had stopped speaking earlier |
| | | Gesture recogniser | |
| 8 | | Recognition accuracy regarding gesture type | Blind labelling of logged gesture shapes yielded 87.2% correct recognition of gestures (245/281). Several noisy shapes were observed |
| 9 | | Number of recognition failures | 36/281 = 12.8% of the gesture shapes were not classified in the same class by blind labelling and by the GR module |
| | | Natural language understanding | |
| 10 | | Lexical coverage, English | Not available |
| 11 | | NLU robustness, English | Perfect recognition, all thirteen users = 27%. Understanding robustness = 34%. Utterances understood = 47% |
| 12 | | Topic spotter error rate, English | N/A. No topic spotter needed in PT2 due to its ontology-based design |
| 13 | | Anaphora resolution error rate, English | N/A. No anaphora resolution in PT2 |
| | | Gesture interpretation | |
| 14 | | Selection of referenced objects error rate | Failures in the processing of gesture-only input for referenceable objects involved the GI in only 4% of the cases |
| | | Input fusion | |
| 15 | | Robustness to temporal distortion between input modalities | 21 errors in the processing of multimodal behaviours were due to unexpected delays between speech and gesture; they account for 43% of multimodal errors. 85% of these 21 errors were due to delays in the start of speech which proved inappropriate when compared against the video |
| 16 | | Fusion error rate | 40% of multimodal behaviours from an interaction point of view (75% from the point of view of the IF fusionStatus). In the processing of gesture-only behaviours, 13 cases were merged with wrongly detected speech |
| 17 | | Cases in which events have not been merged but should have been | 48 cases, amounting to 75% of all multimodal error cases |
| 18 | | Cases in which events have been merged but should not have been | 25% of all multimodal error cases |
| 19 | | Recognised modality combination error rate | Not considered relevant for evaluation |
| | | Character module | |
| 20 | | Meta-communication facilities | User input facilities: repeat, correct, clarify. System output facilities: repeat, rephrase, change topic, end conversation, i.e. a graceful-degradation chain of context-dependent outputs, Kukbox, specific handling of why, where, and when questions. System-internal facilities: low speech recognition confidence score, high speech recognition confidence score |
| 21 | | Handling of initiative | Fully mixed initiative. The user can take the initiative at any time and the system will follow |
| 22 | | Performance of conversational history | Distributed discourse context and domain context histories in the character module. The former ensures graceful degradation in response to user input, appropriate reaction to repeated insults, and the ability to remember the latest output. The latter ensures that HCA will not on his own initiative say the same thing twice and that certain implications of user input are taken into account |
| 23 | | Handling of changes in emotion | HCA's emotional state is updated for each user input |
| | | Response generation | |
| 24 | | Coverage of action set (non-verbal action) | 130 out of 150 available non-verbal behaviour primitives used |
| | | Graphical rendering (animation) | |
| 25 | | Synchronisation with speech output | Eleven visemes used |
| 26 | s | Naturalness of animation, facial | Up to 5 non-verbal primitives used per output turn, out of 74 available |
| 27 | s | Naturalness of animation, gesture | Up to 17 non-verbal primitives used per output turn, out of 50-60 available gesture primitives |
| 28 | s | Naturalness of animation, movement | Used in scripts. A script contains up to 40 lines of behaviour descriptions |
| | | Text-to-speech | |
| 29 | | Speech quality, English | Good |
| 30 | | Intelligibility, English | Good |
| 31 | | Naturalness, English | Fairly good; missing pauses in some places, prosody jumps, and mispronunciation of homographs |
| | | Integration | |
| 32 | | Communication among modules | Good, except for the missing integration of NCA, CF and CA |
| 33 | | Message broker | Works well |
| 34 | | Processing time per module | Real-time, except when the recogniser remains open for 15 seconds although the user has stopped speaking; this results in perceived delays in answering |
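
The word error rates in row 4 follow the standard definition: the word-level edit distance between the reference transcription and the recogniser hypothesis (substitutions + deletions + insertions), divided by the number of words in the reference. A minimal sketch, assuming whitespace-tokenised transcriptions; the function name and example utterance are illustrative and not taken from the NICE code base:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a five-word reference gives WER = 20%.
print(word_error_rate("what is in this picture", "what is on this picture"))
```

Note that, being a ratio to the reference length, WER can exceed 100% when the hypothesis contains many insertions.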

Table 3. Usability criteria and results measured via user interviews. In Column 2, "n" means new criterion, "r" means revised formulation of a deliverable D7.1 criterion, "s" means a split of a D7.1 criterion into several distinct criteria.

| Number | | Usability criteria | Explanation | Evaluation |
|---|---|---|---|---|
| | | Basic usability criteria | | |
| 1 | | Speech understanding adequacy | Subjective; how well does the system understand speech input | Fair; a larger vocabulary and grammar are desirable |
| 2 | | Gesture understanding adequacy | Subjective; how well does the system understand gesture input | Good, but more objects should perhaps be active |
| 3 | n | Combined speech/gesture understanding adequacy | Subjective; how well does the system understand combined speech/gesture input | Good, as long as the pointed-to object is active. Only about half of the users spoke while pointing |
| 4 | | Output voice quality | Subjective; how intelligible and natural is the system output voice | Good, easy to understand |
| 5 | | Output phrasing adequacy | Subjective; how adequate are the system's output formulations | Good; occasionally slightly too long output |
| 6 | | Animation quality | Subjective; how natural is the animated output | Lip synchrony okay; improvements to other animation aspects needed |
| 7 | | Quality of graphics | Subjective; how good is the graphics | Rather good |
| 8 | | Ease of use of input devices | Subjective; how easy are the input devices, such as the touch screen, to use | Easy |
| | | Core usability criteria | | |
| 9 | r | Naturalness of communication via the available modalities | Subjective; how natural is it to communicate via the available modalities | Natural to use speech and touch screen |
| 10 | | Output behaviour naturalness | Subjective; character believability, coordination and synchronisation of verbal and non-verbal behaviour, display of emotions, dialogue initiative and flow, non-communicative function, etc. | Looks like the real HCA; lip synchrony okay; display of emotions very limited; Non-Communicative Action somewhat odd for a 55-year-old man |
| 11 | r | Ease of use of the game: how well did users complete the scenario tasks? | Subjective; how easy is it for the user to find out what to do and how to interact | Rather easy to interact with the system, but somewhat difficult for several users to find out what to talk about. The problem sheet (2nd test condition) was felt to provide useful support |
| 12 | s | Error handling adequacy, spoken part | Subjective; how good is the system at detecting errors relating to spoken input and how well does it handle them | Improvements needed |
| 13 | s | Error handling adequacy, gesture-only part | Subjective; how good is the system at detecting and handling errors relating to gesture input | No error handling |
| 14 | | Entertainment value | Subjective; this measure includes game quality and originality, interest taken in the game, feeling like playing again, time spent playing, user game initiative, etc. | Fun; good entertainment value |
| 15 | | Educational value | Subjective; to what extent did the user learn from interacting with the system | Learned something, e.g. about HCA's life or English |
| 16 | | User satisfaction | Subjective; how satisfied is the user with the system | Rather good |

Table 4. Usability criteria and results based on other user test data than interviews. In Column 2, "n" means new criterion, "r" means revised formulation of a D7.1 criterion, "s" means a split of a D7.1 criterion into several distinct criteria.

| Number | | Usability criteria | Explanation | Evaluation |
|---|---|---|---|---|
| | | Basic usability criteria | | |
| 1 | s | Frequency of interaction problems, spoken part | Quantitative; how often does a problem occur related to spoken interaction (e.g. the user is not understood or is misunderstood) | Danish user group: system misunderstandings, average = 15%. English user group: system misunderstandings, average = 13% |
| 2 | s | Frequency of interaction problems, gesture part | Quantitative; how often does a problem occur related to gesture interaction | The answers to the question "Was he aware of what you pointed to and did he answer?" were all positive. Comparative analysis of the videos and the log files reveals that 51% of the gesture-only behaviours were successful from an interaction point of view and that 62% of the failures were due to gestures on non-referenceable objects |
| 3 | s | Frequency of interaction problems, graphics rendering part | Quantitative; how often does a problem occur related to graphics | An overheating graphics card made body parts fall off in the test with the first two users. At most five crashes were due to graphics |
| 4 | | Sufficiency of domain coverage | Subjective; how well does the system cover the domains it announces to the user | Coverage is insufficient for travels, modern technology, and some personal questions |
| 5 | r | Number of objects the subject(s) interacted with through gesture | Quantitative; serves to check to what extent the possibilities offered by the system are also used by users | All 18 referenceable objects were gestured at, as were a further 16 objects. Each user gestured at between 6% and 89% of the 18 referenceable objects (average 62%). Two users did very little gesturing |
| 6 | r | Average frequency of domains addressed by users in the conversation, in percentage of the number of turns | Quantitative; serves to check which domains users actually address and how often | User = 9.0; life = 8.1; works = 9.6; study = 15.6; hca = 5.7; generic = 51.9. (A tallying sketch follows below the table) |
| | | Core usability criteria | | |
| 7 | r | Conversation success | Quantitative; how often is a transaction exchange between the user and the system successful | See Table 2 |
| 8 | | Sufficiency of the system's reasoning capabilities | Subjective; how good is the system at reasoning about user input | Needs identified in PT1 were implemented. More reasoning about how much has already been said on a topic would be good |
| 9 | | Scope of user modelling | Subjective; to what extent does the system exploit what it learns about the user | Very limited |
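
The domain frequencies in row 6 express, for each domain, the percentage of user turns addressing it (the six figures sum to roughly 100%). A minimal sketch of the tally, assuming (hypothetically) that each user turn in the transcriptions carries one domain label; the label sequence below is invented for illustration:

```python
from collections import Counter

# Hypothetical per-turn domain labels; the label set matches row 6
# (user, life, works, study, hca, generic), the sequence is invented.
turn_domains = ["generic", "life", "works", "generic", "study",
                "user", "generic", "hca", "generic", "study"]

# Count turns per domain and report each as a percentage of all turns.
counts = Counter(turn_domains)
total = len(turn_domains)
for domain, n in counts.most_common():
    print(f"{domain}: {100 * n / total:.1f}% of turns")
```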
