What do we need to examine now?
- Will multiple raters assign the same scores to the same output?
- Can multiple labs independently set up the evaluation and obtain the same results?
- Can a single rater reliably reproduce the same result for the same output?
- What is the relationship between scores and user satisfaction or intelligibility?
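
One way to make the rater-agreement and score-vs-satisfaction questions measurable is sketched below. The choice of statistics is an assumption, not something these notes prescribe: Cohen's kappa for agreement between two sets of categorical ratings (two raters, or one rater on two occasions), and Pearson correlation between scores and a satisfaction measure. The function names `cohen_kappa` and `pearson` are hypothetical helpers written for this illustration.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two lists of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Fraction of items on which the two ratings match exactly.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each list's label frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both lists use a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

Calling `cohen_kappa` on one rater's scores from two sessions addresses the single-rater question, and on two different raters' scores the multi-rater one. Kappa treats scores as unordered categories, so for ordinal scales a weighted variant may be more appropriate.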