One of the most talked-about subjects on the UX forums I’m a part of is the need for both qualitative and quantitative data. The stories that designers and researchers alike tell about their workplaces never cease to amaze me: many describe one type of research being done at the expense of the other. This reflects a fundamental misunderstanding, and it’s one that can be fixed!
I have a range of interdisciplinary research experience but admittedly don’t have a ton of quantitative UX research experience. I’ve typically relied on simpler metrics like time-on-task, task completion, and something akin to “open-ended” feedback.
Despite being somewhat familiar with many study metrics, I wanted to learn more. I can almost always count on seeing a post about why the NPS is insufficient, but there’s far less about the merits of the other contenders. Specifically, I wanted to compare and contrast the pros and cons of study and task metrics and hear about scenarios in which a seasoned quantitative research professional has used them.
There’s a lot of jargon in the UX industry, so first and foremost, let’s define a few things as Jeff Sauro presents them in this book:
UX - the combination of all the behaviors and attitudes people have while interacting with an interface
UX Benchmarking - evaluating an interface using a standard set of metrics to gauge its relative performance
Before jumping into the actual metrics, Sauro zooms out and talks about the overall structure of a benchmarking study that we ought to consider. He differentiates between retrospective studies (more about attitudes) and task-based studies (more about interactions) and the pros/cons of both. He then moves into stand-alone vs. comparative studies; this is where things start to get more complex.
A stand-alone study is just what it sounds like: a single, isolated study. A comparative study, on the other hand, can test an interface with different participants (between-subjects approach), test multiple interfaces with the same participants (within-subjects approach), or use some combination of the two (mixed approach). Each has pros and cons to weigh: carryover effects, sequence effects, impact on attitudes, and comparative judgment.
The structure of the study itself will largely depend on goals and resources (recruiting is difficult!). Sauro does a fantastic job of outlining the pros and cons of the varying structures.
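To make the within-subjects idea concrete for myself: one common way to soften sequence and carryover effects is to counterbalance the order in which participants see each interface. Here’s a minimal sketch, with made-up interface names and participant counts (my own illustration, not an example from the book):

```python
import itertools
import random

# Hypothetical interfaces being compared in a within-subjects benchmark.
interfaces = ["Site A", "Site B", "Site C"]

# Every possible presentation order; cycling through them means no interface
# is systematically seen first or last, which counterbalances sequence effects.
orders = list(itertools.permutations(interfaces))
random.shuffle(orders)


def assign_order(participant_index: int) -> tuple:
    """Return a counterbalanced interface order for the nth participant."""
    return orders[participant_index % len(orders)]


for p in range(6):
    print(f"Participant {p + 1}: {assign_order(p)}")
```

With more than a handful of interfaces the full set of permutations explodes, which is where partial counterbalancing (e.g., a Latin square) usually comes in.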
System Usability Scale (SUS) - the 10-item questionnaire asking users to rate agreement with statements about perceived usability on a 5-point scale (strongly disagree to strongly agree). The SUS is an older scale, so there’s a large pool of data to compare captured SUS scores against! (Its scoring is sketched after these definitions.)
Standardized User Experience Percentile Rank Questionnaire (SUPR-Q) - 8-item psychometrically validated questionnaire designed to measure the quality of the website user experience, mostly on a 5-point scale (strongly disagree to strongly agree). It includes the NPS item and is also widely accepted, so there is normative data to compare your own site against. Advantages over the SUS: it’s shorter, it measures trust and appearance, and it yields a normalized (percentile) score.
Net Promoter Score (NPS) - single-item question measuring customer loyalty: how likely users are to recommend the product, on a 0-10 scale. Hotly debated.
Usability Metric for User Experience Lite (UMUX-Lite) - 2-item questionnaire akin to the SUS, available in 5- and 7-point response variations. Ideal for benchmarking internal IT products.
Standardized User Experience Percentile Rank Questionnaire for mobile apps (SUPR-Qm) - 16-item instrument (4-6 items usually suffice) for measuring mobile app experiences; a broader measure was needed because many mobile apps are highly domain-specific.
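To keep the first two of these straight, here’s how SUS and NPS responses are scored using the standard published formulas (the sample responses are invented, and the code is my own sketch rather than anything from the book):

```python
def sus_score(responses):
    """Score one participant's 10 SUS items (each 1-5) on the 0-100 scale.

    Odd-numbered items are positively worded (contribution = response - 1);
    even-numbered items are negatively worded (contribution = 5 - response).
    """
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 1 else (5 - r)
                for i, r in enumerate(responses, start=1))
    return total * 2.5


def nps(ratings):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(r >= 9 for r in ratings)
    detractors = sum(r <= 6 for r in ratings)
    return 100 * (promoters - detractors) / len(ratings)


# Made-up responses, just to show the arithmetic.
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # 85.0
print(nps([10, 9, 8, 6, 9, 3, 10, 7]))            # 25.0
```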
Task Completion Rate - binary measure (0-1): can users complete their assigned task?
Task Time - timing users to measure efficiency: average completion time, average time on task, or average time to failure
Task Ease - a measure of a participant’s attitude toward an experience immediately after the task; the Single Ease Question (SEQ) is an example: a 7-point scale asking users to rate the perceived difficulty of a task
Single Usability Metric (SUM) - the aggregation of post-task metrics into a composite score (post-task metrics tend to be correlated with each other, e.g., speed and completion)
Confidence - a participant’s self-rated confidence that they completed a task successfully, on a 7-point scale; this can be used to benchmark between versions of a product: does confidence increase or decrease?
Disasters - when users report high confidence of having completed a task successfully despite failing the task (counted in the sketch after these task metrics)
Open-Ended - asking participants to briefly describe the task experience; this feels like the equivalent of the “other” option on a survey. It can make stakeholders aware of new information!
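And here’s a rough sketch of how several of these task metrics might be pulled out of raw results. The data, the field layout, and the confidence cutoff for counting a “disaster” are all my own assumptions for illustration, and I’ve skipped the formal SUM standardization:

```python
import math

# Hypothetical results for one task, one row per participant:
# (completed?, task time in seconds, SEQ ease 1-7, post-task confidence 1-7)
results = [
    (1, 42.0, 6, 7),
    (1, 58.0, 5, 6),
    (0, 95.0, 3, 7),   # failed while highly confident -> a "disaster"
    (1, 37.0, 7, 7),
    (0, 120.0, 2, 2),
]

completion_rate = sum(done for done, *_ in results) / len(results)

# Task times tend to be right-skewed, so a geometric mean is a common summary.
geo_mean_time = math.exp(sum(math.log(t) for _, t, *_ in results) / len(results))

mean_ease = sum(ease for _, _, ease, _ in results) / len(results)

# "Disasters": failed the task but reported high confidence (here, 6 or 7;
# the exact cutoff is my own assumption, not a rule from the book).
disasters = sum(1 for done, _, _, conf in results if not done and conf >= 6)

print(f"Completion rate: {completion_rate:.0%}")      # 60%
print(f"Geometric mean time: {geo_mean_time:.1f} s")  # ~63 s
print(f"Mean ease (SEQ): {mean_ease:.1f} / 7")        # 4.6
print(f"Disasters: {disasters}")                      # 1
```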
Exactly what I was looking for. Sauro guides readers through the entire benchmarking process, making this a complete introductory guide. At times he argues that quantitative analysis should probably be handed off to a third party if you want the best results, but he still provides the information you need to do it yourself, which I appreciate. I learned a number of things from this book beyond familiarizing myself with the metrics themselves. Sauro discusses, in depth, the pros and cons of testing with a completed product vs. a prototype vs. a participant’s own system with their own data. He points out that participants using their own information (e.g., buying something from an ecommerce site) might be hesitant to complete the procedure, and this is when a prototype or ‘canned’ system could be more appropriate. He also stresses the importance of randomization and notes the human tendency to scan the top and bottom of lists; this is vital to keep in mind when prioritizing tasks or listing things generally.
One of the hardest things to get right in my experience thus far has been writing task scenarios; Sauro seems to agree that it’s a very delicate balance of providing just enough information without telling participants how to proceed. He goes on to provide some statistics based on his own experience, for example: roughly 10% of participant feedback (in surveys or unmoderated studies) has to be thrown out due to “cheating.” This can be identified, but only with carefully sequenced questions; he points out that a great way to combat it is effective screening and recruiting (the hardest thing!).
Lastly, Sauro states that in order to get the most data, branching and logic are ideal tools for extracting information; predicting and isolating question sequences is an essential part of research, but it’s challenging and takes practice. Overall, I enjoyed this book; at times it felt more like a textbook than a casual read (which I appreciate), and it was absolutely packed with helpful tips and tools to apply to all kinds of scenarios. I took 18 pages of notes, which pretty much says it all. This will likely be a book I revisit frequently.