What do we need to examine now?
- Will multiple raters come up with the same scores for the same output?
- Can multiple labs set up on their own and do the same?
- Can a rater come up with the same result reliably?
- What is the relationship between scores and user satisfaction/intelligibility?
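
The rater-agreement questions above can be made concrete with a chance-corrected agreement statistic. The notes do not name one, so the sketch below assumes Cohen's kappa for two raters; the function name `cohen_kappa` and the 1-5 scores are hypothetical illustrations, not data from any study.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two sets of ratings of the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: fraction of items given identical scores.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if each rater scored independently at their own rates.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one and the same category throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical scores (1-5 scale) from two raters on the same six outputs.
rater_1 = [5, 4, 4, 2, 1, 3]
rater_2 = [5, 4, 3, 2, 1, 3]
kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Passing two scoring sessions by the same rater instead of two different raters gives a rough check on the third question (whether one rater is consistent over time). The last question, about user satisfaction, would need a correlation measure rather than an agreement measure.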