4.67 Satisfaction Is Pretty Good, Right? Right?

Summary: This article explores what a star rating or satisfaction score could mean, and explains how to design surveys so you know what they mean.

Star Rating Or Satisfaction Score: What Does The Number Mean?

Fictitious story. Imagine you're running a transportation service for individuals with disabilities, so they can get to and from the places they need to go to live meaningful lives. This is a shared service with limited seats and a limited pool of drivers. You even have an app for users to schedule and monitor rides. You are data-driven, so after each ride, the app prompts users to rate their drivers on a one-to-five-star scale based on how satisfied they were. You build a dashboard and monitor this over time. The average comes out to 4.67. You initially set an overall target of 4.3 as a minimum and 4.6 as a stretch goal. You beat your stretch goal! Yay. Simple: everything is running smoothly because a 4.67 satisfaction score is pretty good, right? Right?

It Depends

Well, "the devil is in the details," as some say... Humans are complex. Two people can look at the same question, same context, same everything, and yet come to a different interpretation. Not to mention Artificial Intelligence (AI). The ingredients are all there. Yet, something is off...

So, is a 4.67 satisfaction score good? Those who work with me on data (especially surveys and evaluations) can probably already hear the answer:

It depends on how you interpret the result, and what you're planning to do about it.

If you're not planning to take any action, it's a pretty good result. But then why are we collecting the data in the first place?

What Does A 4.67 Star Rating Mean?

Let's hope that you're not planning actions based on a single metric (let alone an average you magically created out of stars), but assume that number means a lot to your organization. Let's look at the pros of the single-question approach first:

  1. You care.
    You show you care about the customers and make sure all drivers behave according to the strict standards you set.
  2. You collect data.
    Your data collection is scalable, consistent, and "reliable" as long as the app works.
  3. You don't overwhelm customers with long surveys.
    Single question. Always the same, always at the end, at the same time, right after a ride ends. Consistency is key.
  4. You monitor your data.
    Not just as a single metric but trending over time. Good start!
  5. You segment your data.
    By vehicle, by route, by driver, etc., and you have a proactive plan to act immediately if something happens. Good thing you have a data strategy (see the sketch after this list).
  6. You plan to make decisions and act on results.
    You have no idea how many dashboards die in the long run without any meaningful decision made based on them.
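
To make points 4 and 5 a bit more concrete, here is a minimal sketch of what "trending over time" and "segmenting" could look like in practice. It assumes a hypothetical ride_ratings.csv export with ride_date, driver_id, route_id, and stars columns; the file and column names are illustrative, not part of the story.

```python
import pandas as pd

# Hypothetical export of individual ride ratings.
ratings = pd.read_csv("ride_ratings.csv", parse_dates=["ride_date"])

# The single dashboard number (the 4.67-style overall average).
overall = ratings["stars"].mean()

# Trend over time: a monthly average instead of one all-time number.
monthly = ratings.groupby(ratings["ride_date"].dt.to_period("M"))["stars"].mean()

# Segments: a struggling driver or route shouldn't hide behind a healthy overall average.
by_driver = ratings.groupby("driver_id")["stars"].agg(["count", "mean"])
by_route = ratings.groupby("route_id")["stars"].agg(["count", "mean"])

print(round(overall, 2))
print(by_driver.sort_values("mean").head())  # lowest-rated drivers first
```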

What Can Go Wrong With This Approach?

Oh, the details... Before we get into them, let's start with an experiment. Wherever you are, reading this article, right now: Say the word "fine" out loud. Just simply say the word. Hopefully you didn't cause any concern. Now, imagine the following scenarios where the answer is the same word, "fine." You don't need to say it out loud unless you really want to entertain the people around you.

a) Bored mother scenario
After three missed calls from your mother, you finally pick up the phone just to tell her you're busy when she asks: "How are you doing?" - Fine.

b) Manager scare scenario
Your manager asks you to come into their office (or a quick one-on-one virtual call) unexpectedly and puts the question out there up front: "How are you doing?" - Fine (?)

c) Your call is important to us scenario
After three transfers and 45 minutes on hold with customer service, the agent in the fourth department finally answers the call. With brimming enthusiasm, the agent opens the conversation: "How are you doing?" - Fine!

Context And Perception Matter

What does this experiment have to do with satisfaction surveys? Context and perception matter! Who asks you the question, when they ask you the question, how they ask you the question, how often they ask you the question... All the details matter.

Your answer may be the same, but what you mean by it may not be. When you are in a direct conversation with someone, they can read your tone, your body language, etc. But sending out a survey question is different. You're losing the context. Are you sure you're measuring what you think you're measuring? Are you sure your data is reliable? Are you sure your "insights" are correct? Bamm, it's a lot to consider!

In my data literacy workshops, I refer to these potential issues collectively as BAMM (biases, assumptions, myths, and misconceptions).
Here are some of the details about what can go wrong from end to end when you get BAMM'ed:

  1. Lack of context
    You have an agenda and a goal in mind. However, it would take too much time to explain the context, so you just summarize it in a question. All the context stays in your head. On paper, it's a single sentence, up for interpretation.
  2. Selection bias
    You need to decide on your audience. Everyone? Every time? A sample? Anonymous, pseudo-anonymous, tracking user IDs? This brings data privacy and data security into the mix.
  3. Misconceptions and misinterpretations
    You need to then decide the exact words you're using. Every. Single. Word. Matters. (Have you ever tried to get a consensus on a simple survey question across marketing, legal, product, HR, etc.?)
  4. Data classification misconceptions
    You need to decide what kind of data you're collecting. The type of question you're going to ask will determine the data type (not going into data classification here, but you should). True or False? Likert scale? Slider? Single select? Multi-select? Matrix? Open text? Combination?
  5. Timing of the survey
    Finally, you land on a question and the type. Who's going to get this question? When? How?
  6. Validity issues
    In our story, you decide to include the question in the app, right after a ride ends, focusing on the driver. Data can be valid for one purpose but not for another. For example, it's fine to use DISC letters to have a conversation about preferences, but they shouldn't be used to pigeonhole people into jobs.
  7. Interpretation and context
    The customer receives the question. Remember the "fine" experiment? The context in which the customer answers the question matters, but you will not know anything about it because all you get is the number of stars. Stars can capture emotions unrelated to what you're actually asking.
  8. Biases
    Conscious and unconscious factors may interfere with how customers respond. For example, road rage is often an impulsive reaction to past experiences.
  9. Loaded questions
    Every. Word. Matters. For loaded questions, you get loaded answers. For example, wording the question with positive words such as "Tell us about how great our customer service representative..." can influence the answer.
  10. Ambiguity
    What's one star vs. two stars? The customer selects the number of stars. In your mind, there's an associated context with each star. One is a showstopper and requires immediate intervention. Five is a great experience. Well, again, it's in your mind. I know people who never give one or five stars; they reserve those for extreme events.
  11. Data manipulation
    You receive the data. However, we're not talking about stars anymore. You turn the five-star ratings into numbers, treating them as a smooth scale of one to five. Is it really the same to get from three stars to four as to get from four to five? Technically, you just introduced a rounding error: if you treat your data as a continuous one-to-five range but only let customers select whole stars, you're rounding their answers into whole numbers for them.
  12. Using rounded values
    You calculate the average. Rounding is fine, but you should be careful using rounded values for further calculations. Basically, you force customers to select a whole number, but then you claim the second decimal of the average is significant? Also, is it going to be the mean? The median? Are you going to look at the distribution? Outliers? The shape of your data? Or just the plain, single number (see the sketch below)?

And the list could go on...
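
As a small illustration of points 11 and 12, here is a sketch with made-up ratings showing how two very different sets of whole-star ratings can produce the same tidy-looking average, and why the median and the shape of the distribution deserve a look:

```python
from statistics import mean, median
from collections import Counter

# Made-up star ratings: both sets average 4.7, but they tell different stories.
steady = [5, 5, 4, 5, 4, 5, 5, 4, 5, 5]     # consistently good rides
polarized = [5, 5, 5, 5, 5, 5, 5, 5, 5, 2]  # mostly great, one very bad ride

for name, stars in [("steady", steady), ("polarized", polarized)]:
    print(name, round(mean(stars), 2), median(stars), Counter(stars))
    # same mean (4.7), same median (5), very different distributions
```

Nothing in the single averaged number would tell you that the second set contains a ride that may require immediate intervention.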

What Other Biases Might Interfere?

Your app pops the satisfaction question at the end of the ride. This can potentially lead to survivorship bias because you'll only get feedback when there was a ride. What about cancellations? Wouldn't you want to know how satisfied your customers are when they had to cancel a ride?
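
Here is a small sketch of how much the dashboard average might silently ignore; the data, column names, and numbers are purely illustrative:

```python
import pandas as pd

# Made-up schedule of rides: cancelled and unrated rides carry no stars.
rides = pd.DataFrame({
    "ride_id": [1, 2, 3, 4, 5, 6],
    "status": ["completed", "completed", "cancelled", "completed", "cancelled", "completed"],
    "stars": [5, 4, None, 5, None, None],
})

rated = rides["stars"].notna()
print("average over rated rides only:", round(rides.loc[rated, "stars"].mean(), 2))    # 4.67
print("share of scheduled rides that produced a rating:", rated.mean())                # 0.5
print("cancellations the rating never sees:", (rides["status"] == "cancelled").sum())  # 2
```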

Generally, people tend to submit more positive responses in satisfaction scores than they actually feel. This may come down to a combination of factors: social expectations, wanting to keep the service because there is no alternative, selecting the answer they think is expected rather than how they feel, etc. If you have multiple questions, the order of questions can interfere; the first answer may "anchor" the rest. The order of options can also be problematic. There are ways to mitigate biases, but only if you are aware of their potential existence and have a plan ahead of time.
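
One such plan can be as simple as randomizing the question order per respondent, so a fixed order doesn't anchor everyone the same way. The sketch below shows one way to do it; the questions are hypothetical and not part of the story:

```python
import random

QUESTIONS = [
    "How satisfied were you with the driver?",
    "How satisfied were you with booking the ride?",
    "How satisfied were you with the vehicle?",
]
SCALE = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]  # scale options stay in order

def build_survey(respondent_seed: int):
    rng = random.Random(respondent_seed)  # reproducible per respondent
    order = QUESTIONS[:]
    rng.shuffle(order)
    return [(question, SCALE) for question in order]

for question, scale in build_survey(respondent_seed=42):
    print(question)
```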

How Could You Improve Your Question To Mitigate Biases?

One approach is to provide a conditional open-text question when the answer is not the preferred one. If you do that with a single question, it can help customers expand on their selection; just make sure it's optional. Now you have both quantitative and qualitative data to work with. It is more nuanced.
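
A minimal sketch of that flow, with an assumed threshold and hypothetical names, could look like this:

```python
from dataclasses import dataclass
from typing import Callable, Optional

FOLLOW_UP_BELOW = 4  # assumption: one to three stars triggers the optional follow-up

@dataclass
class RideFeedback:
    stars: int
    comment: Optional[str] = None  # qualitative detail, only sometimes present

def collect_feedback(stars: int, ask: Callable[[str], str]) -> RideFeedback:
    """ask() stands in for whatever UI hook shows the optional text prompt."""
    comment = None
    if stars < FOLLOW_UP_BELOW:
        # The customer can skip the text and still submit the rating.
        comment = ask("What could we have done better?").strip() or None
    return RideFeedback(stars=stars, comment=comment)

# Stubbed-in usage:
print(collect_feedback(2, ask=lambda prompt: "The driver was 20 minutes late"))
print(collect_feedback(5, ask=lambda prompt: ""))
```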

But if you have multiple questions using the same method within a survey, it can come across as annoying because it extends the length of the survey. People already dislike surveys, so when they perceive you as "cheating" on the length, it can get ugly.

Final Word About The 4.67

Back to our story. Interpreting 4.67 as the overall satisfaction score with the ride can be misleading. Always make sure you measure what you intended to measure and that it provides actionable insights for the purpose it was created for. If you ask about the driver, the data is about the driver and not about the ride itself. Personally, for learning surveys, I've found that Will Thalheimer's approach provides more actionable and meaningful data while mitigating many of the factors mentioned above [1].

References:

[1] Learner Surveys and Learning Effectiveness with Will Thalheimer

Originally published at www.linkedin.com