# Likert Scale Analysis

Recently, I worked with researchers at Deaconess Hospital and the University of Southern Indiana. I consulted on a project to analyze Likert data and discovered the long-standing controversy on the topic, which was intriguing. Here, I share a little background on Likert data, considerations about the analysis, and basic R code to perform an analysis. In the future, I’ll make available a workflow for analyzing Likert data, and develop an R Shiny app to analyze and visualize the results for a standard survey.

knitr::opts_chunk$set(error = TRUE, tidy = TRUE)
wd <- "/Users/jdstallings/Google Drive/Blog/dataindeed/content/post"  # or dirname(rstudioapi::getActiveDocumentContext()$path); sets the working directory to wherever you put this document
setwd(wd)

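.cran_packages <- c("ggplot2", "sjPlot", "sjmisc", "gridExtra", "cowplot", "grid",
"plyr", "data.table", "dplyr", "tidyr", "foreign", "coin", "car", "easyGgplot2",
"mvnormtest", "HH", "readr", "plotly", "tables", "pander", "likert", "FSA")  # packages used in this Rmarkdown document (readr, plotly, tables, pander, likert and FSA are called further down)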

sapply(.cran_packages, require, character.only = TRUE)

## Introduction

In 1932, Rensis Likert (pronounced ‘Lik-ert’) developed Likert items to measure respondents’ attitudes to a particular question or statement. For example, a Likert item is typically composed of a statement and a series of responses:

The Likert responses are typically considered ordered categorical (i.e., ordinal) data, because they convey size, order, rank or sequence. Responses typically range conceptually from positive to negative, and often there is no actual measurable distance between the responses. We cannot assume that a respondent perceives the distance between Strongly Approve (1) and Approve (2) to equal the distance between Approve (2) and Undecided (3), despite the distance on the scale being the same.

Responses could also use seven or nine answers for more granularity, or only four (or another even number) to avoid neutral or undecided answers, forcing the respondent to select a positive or negative response. Likert responses are not continuous (i.e., there are no actual decimal points in Likert responses), and they are constrained at their ends (i.e., 1-5 is the range in the figure above; there are no responses below the value of 1 or above the value of 5).
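In R, these properties map naturally onto an ordered factor. The sketch below uses a handful of made-up responses (the vector `responses` is hypothetical and not taken from any dataset in this post) to show that order comparisons and a rank-based median work, while the mean is simply not defined.

```r
# Hypothetical responses to a single five-point Likert item (illustrative only)
likert_levels <- c("Strongly Approve", "Approve", "Undecided",
                   "Disapprove", "Strongly Disapprove")
responses <- factor(
  c("Approve", "Strongly Approve", "Undecided", "Approve", "Disapprove"),
  levels = likert_levels, ordered = TRUE
)

# Counts per category, in scale order
freq <- table(responses)

# Order comparisons are meaningful for ordinal data:
# how many answers are "Approve" or more positive?
n_at_least_approve <- sum(responses <= "Approve")

# A median can be taken from the ranks; type = 1 always returns an observed category
med <- quantile(responses, probs = 0.5, type = 1)

# The mean is not defined for a factor: R warns and returns NA
avg <- suppressWarnings(mean(responses))

print(freq)
print(as.character(med))
print(avg)
```

Storing the answers this way keeps R from silently averaging category codes; you have to convert to numbers explicitly if you decide to treat the data as interval.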

## The Controversy

In 1946, S.S. Stevens addressed the classification of scales of measurement. Stevens exquisitely described the theory of nominal, ordinal, interval and ratio data, their empirical operations, mathematical group structure and permissible statistics. The table below summarizes his key discussion points:

Stevens’ discussion on the analysis of ordinal data is very clear:

As a matter of fact, most of the scales used widely and effectively by psychologists are ordinal scales. In the strictest propriety the ordinary statistics involving means and standard deviations ought not to be used with these scales, for these statistics imply a knowledge of something more than the relative rank-order of data. On the other hand, for this ‘illegal’ statisticizing there can be invoked a kind of pragmatic sanction: In numerous instances it leads to fruitful results. While the outlawing of this procedure would probably serve no good purpose, it is proper to point out that means and standard deviations computed on an ordinal scale are in error to the extent that the successive intervals on the scale are unequal in size. When only the rank-order of the data is known, we should proceed cautiously with our statistics, and especially with the conclusions we draw from them.

Stevens’ discussion on the analysis of interval data:

Most psychological measurement aspires to create interval scales, and it sometimes succeeds. The problem usually is to devise operations for equalizing the units of the scales - a problem not always easy of solution but one for which there are several possible modes of attack. Only occasionally is there concern for the location of a ‘true’ zero point, because the human attributes measured by psychologists usually exist in a positive degree that is large compared with the range of its variation. In this respect these attributes are analogous to temperature as it is encountered in everyday life. Intelligence, for example, is usefully assessed on ordinal scales which try to approximate interval scales, and it is not necessary to define what zero intelligence would mean.

Thus, although we assign numeric values to the responses, i.e., Strongly Approve (1), Approve (2), Undecided (3), Disapprove (4), and Strongly Disapprove (5), they cannot be treated in the same manner as interval data, because they are ordinal in nature as described by Stevens. Consider the following questions:

• What is Strongly Disapprove minus Undecided?
• If we calculate $$5-3$$, resulting in $$2$$, how is the answer interpreted?
• Is it equal to Approve?
• What is Disapprove divided by Undecided?
• If we calculate $$4/3$$ resulting in $$1.33$$, how is the answer interpreted?
• Is it slightly less than Strongly Disapprove and Strongly Approve, respectively?
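One way to see why these questions have no good answer: for ordinal data, any order-preserving recoding of the labels is just as valid as 1 to 5. The sketch below uses two made-up groups (hypothetical values chosen purely for illustration) and shows that the comparison of means flips under such a recoding, while the comparison of medians does not.

```r
# Two hypothetical groups answering one item coded 1-5 (illustrative values)
group_a <- c(3, 3, 3, 3, 3)
group_b <- c(2, 2, 2, 2, 5)

# An order-preserving recoding: same ranking of categories, different spacing
recode <- c(1, 2, 3, 4, 20)
group_a_re <- recode[group_a]
group_b_re <- recode[group_b]

# Means: which group looks "higher" depends on the arbitrary coding
a_higher_mean_original <- mean(group_a) > mean(group_b)
a_higher_mean_recoded  <- mean(group_a_re) > mean(group_b_re)

# Medians: the ordering of the two groups survives the recoding
a_higher_median_original <- median(group_a) > median(group_b)
a_higher_median_recoded  <- median(group_a_re) > median(group_b_re)

print(c(a_higher_mean_original, a_higher_mean_recoded))
print(c(a_higher_median_original, a_higher_median_recoded))
```

The recoding vector `c(1, 2, 3, 4, 20)` is deliberately extreme; the point is only that conclusions drawn from means depend on spacing that ordinal data do not supply.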

Clearly, we are assigning numerals to Likert responses to represent facts and conventions about the responses that are ranked (i.e., 1 is a more positive response/attitude than 5). The measurement of those numerals should directly relate to the type of scale (i.e., ordinal or interval). In the spirit of W.M. Kuzon Jr, M.G. Urbanchek, and S. McCabe (1996), the average of Strongly Approve and Approve is not Approve-and-a-half, regardless of whether we assign integers to represent Strongly Approve and Approve! Thus, in agreement with H.M. Marcus-Roberts and F.S. Roberts (1987), the answers to the Likert item with ordinal responses may have “meaningful statements”, but the basic empirical operations used to calculate means and standard deviations render “meaningless statistics.” Consider the following from P.A. Bishop and R.L. Herron (2015):

In practice, however, Likert scale response data are often treated as if they were interval data. In 1990, T.R. Knapp addressed the long-standing controversy in depth, addressing key considerations on both the conservative (i.e., “pro-Stevens”) and liberal (i.e., “anti-Stevens”) sides. Basically, the liberal practitioners argue that despite having strictly ordinal data, the differences between the responses can be considered equal, and the data therefore treated as interval data. After all, B.O. Baker, C.D. Hardyck, and L.F. Petrinovich (1966), S. Labovitz (1967), G.V. Glass, P.D. Peckham and J.D. Sanders (1972) and others have shown empirically that doing so (i.e., using parametric tests on ordinal data) matters little. Many modern-day practitioners, such as J. Carifio and R. Perla (2008) and G. Norman (2010), continue to argue adamantly for the anti-Stevens position, in that the violation of appropriateness is justified due to the improvement in robustness. Certainly, as discussed by G.M. Sullivan and A.R. Artino (2013), improving the responses provided to begin with or implementing measurable visual analogue responses (VARs) helps to justify the use of parametric tests. Consider the following:

Knapp concludes, however, that arguments from “empirical robustness” are no longer convincing in light of H.M. Marcus-Roberts and F.S. Roberts (1987). Regardless of which side you decide to align with (conservative/pro-Stevens vs. liberal/anti-Stevens), Knapp ends his review with a set of guidelines:

1. Choose the measurement perspective. If the goal is to interpret the results in a scale-free manner, then follow the Stevens approach.

2. Practical concession to ‘illegal statisticizing’ is OK when there are numerous data.

3. When making up the scale and your goal is to have an interval scale, consider Stevens and ensure your scale reflects your intention.
• Is it continuous?
• Does it resemble an actual scale with units?
• Is there a zero value (even if arbitrary)?
• Do transformations preserve the order/meaningfulness?
4. Consider meaningfulness when considering the use of descriptive statistics. Knapp writes:

If you have to, forgo traditional statistics such as means, standard deviations, and Pearson r’s; there are always Tukey’s (1977) very creative exploratory data analysis (EDA) techniques, and Agresti (1984) has written a whole book on the analysis of ordinal data.

In 2001, A. Agresti published a fantastic book on “Categorical Data Analysis.” You may also consider Hoaglin, Mosteller and J.W. Tukey (2001) “Understanding Robust and Exploratory Data Analysis” instead.

5. The type of scale of data claimed will determine the use of non-parametric or parametric tests.

In conclusion, I recommend taking time to familiarize yourself with Stevens and Knapp. Ultimately, you have to decide how you want to analyze the data. Below, I provide code to visualize Likert data, and conduct both non-parametric and parametric analyses.

## Basic Likert Data Analysis

### Dataset

To demonstrate the Likert analysis workflow, I’ll use the MACH-IV dataset, which is freely available for download. It contains answers to the MACH-IV test (Christie, R. & Geis, F. (1970) “Studies in Machiavellianism”. NY: Academic Press), a tool used to measure Machiavellian personality traits. The test consists of twenty Likert items. Each item is a statement, and you must indicate how accurately it applies to you. The data are presented in wide form with each row as a single observation.

• Answers: 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree (0 = they skipped). For the purposes of this analysis, we will consider the responses to be ordinal.
• Age: Entered into a text field.
• Gender: Chosen from a drop down list (1 = male, 2 = female, 3 = other; 0 = no answer)
• Questions:
• Q1. Never tell anyone the real reason you did something unless it is useful to do so.
• Q2. The best way to handle people is to tell them what they want to hear.
• Q3. One should take action only when sure it is morally right.
• Q4. Most people are basically good and kind.
• Q5. It is safest to assume that all people have a vicious streak and it will come out when they are given a chance.
• Q6. Honesty is the best policy in all cases.
• Q7. There is no excuse for lying to someone else.
• Q8. Generally speaking, people won’t work hard unless they’re forced to do so.
• Q9. All in all, it is better to be humble and honest than to be important and dishonest.
• Q10. When you ask someone to do something for you, it is best to give the real reasons for wanting it rather than giving reasons which carry more weight.
• Q11. Most people who get ahead in the world lead clean, moral lives.
• Q12. Anyone who completely trusts anyone else is asking for trouble.
• Q13. The biggest difference between most criminals and other people is that the criminals are stupid enough to get caught.
• Q14. Most people are brave.
• Q15. It is wise to flatter important people.
• Q16. It is possible to be good in all respects.
• Q17. P.T. Barnum was wrong when he said that there’s a sucker born every minute.
• Q18. It is hard to get ahead without cutting corners here and there.
• Q19. People suffering from incurable diseases should have the choice of being put painlessly to death.
• Q20. Most people forget more easily the death of their parents than the loss of their property.

Perhaps our research question will be to determine whether men and women demonstrate a significant difference in Machiavellian personality traits. We will examine each question individually, and then evaluate total score, treating that measure as interval data. Let’s start by downloading the dataset:

url <- "http://personality-testing.info/_rawdata/MACH2.zip"  # web address of dataset

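if (!file.exists("./MACH2.zip")) {
download.file(url, destfile = "./MACH2.zip", mode = "wb")  # download the archive only if it is not already present
}
unzip("MACH2.zip", exdir = "./")  # unzip datafiles into a folder /MACH2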

ds <- "/MACH2/data.csv"  # the data set of interest
cb <- "/MACH2/codebook.txt"  # the code to the questions

dt <- data.frame(read_csv(paste(wd, ds, sep = "")))  # import data into a data.frame

dtQ <- data.table(read_lines(paste(wd, cb, sep = ""))[3:22])  # reading the lines of the questions from the text file

dtQ[, `:=`(c("question", "question_text"), tstrsplit(V1, "[0-9]. ", fixed = TRUE))]  # splits the 'QX.' from the question text at the first '.' after a digit.

colnames(dt)[1:20] <- dtQ$question_text[1:20]  # colnames to full questions
posQ <- colnames(dt)[c(1:2, 5, 8, 12:13, 15, 17:20)]  # positive questions
negQ <- colnames(dt)[c(3:4, 6:7, 9:11, 14, 16)]  # negative questions

### Explore data and remove outliers

Before we begin the analysis, let’s explore the data and remove outliers in the demographic and total score columns. The scores range from a minimum of $$20$$ to a maximum of $$100$$ points. When we assign gender as a factor with two levels (male and female), those respondents who did not answer will be dropped when we remove NAs. Obviously there are issues with age, which ranges from -9 to 999999. We will limit the respondents to ages from 10 up to (but not including) 110 years old. Similarly, the range for elapsed time is very large. We will limit seconds_elapsed to between 60 seconds (3 seconds per question) and 1200 seconds (60 seconds per question). Finally, once we limit respondents to those who answered every question and create factors and levels for each question, most of the outlying observations will be removed. First, let’s take a look at gender:

Number of Respondents By Gender

dt[, 1:20] <- lapply(dt[, 1:20], factor, levels = c(1, 2, 3, 4, 5), labels = c("strongly disagree", "disagree", "neutral", "agree", "strongly agree"))  # change columns to factors
dt$gender <- factor(dt$gender, levels = c(1, 2), labels = c("male", "female"))  # change to factor
plot_ly(data = as.data.frame(dt), x = ~gender)

There are quite a few more male respondents than female. Now let’s take a look at age:

Distribution of Age

dt <- dt[dt$age < 110 & dt$age >= 10, ]  # remove observations with ages outside of 10 - 109.
dt <- na.omit(dt)  # remove respondents with NAs
plot_ly(data = as.data.frame(dt), x = ~age)  # histogram of age

Most of the respondents are fairly young. Next, let’s take a look at seconds_elapsed:

Distribution of Seconds Elapsed

dt <- dt[dt$seconds_elapsed < 1200 & dt$seconds_elapsed >= 60, ]  # remove observations outside of 60 - 1200 seconds.
plot_ly(data = as.data.frame(dt), x = ~seconds_elapsed)  # histogram of seconds_elapsed

The vast majority of respondents spend less than 4 minutes on the questionnaire. Finally, let’s take a look at the distribution of scores:

Distribution of Scores (Machiavellianism Index)

plot_ly(data = as.data.frame(dt), x = ~score)  # histogram of score

The total scores appear approximately normally distributed. The scores are a composite of all the answers, and the range is 20-100. The table below reports the mean and standard deviation of age, seconds_elapsed and score by gender:

pt <- tabular(gender ~ (age + seconds_elapsed + score) * (mean + sd), data = dt)
print(pt)  # plain-text version of the table
pander(pt)  # the same table rendered through pander

| gender | age mean | age sd | seconds_elapsed mean | seconds_elapsed sd | score mean | score sd |
|---|---|---|---|---|---|---|
| male | 28.72 | 11.59 | 238.5 | 155.7 | 67.89 | 13.03 |
| female | 29.88 | 12.17 | 255.1 | 175.1 | 62.51 | 12.41 |

### Descriptive Statistics

According to Stevens, the appropriate descriptive statistic for ordinal data is the median. But first, one simple method to visualize the percentages by gender is to use the likert function in the likert package. These figures allow you to group by a factor, and provide color-coordinated horizontal bar charts to compare the percent of answers.

dt_pos <- dt[posQ]  # select only the 11 positive questions according to the codebook
dt_pos$gender <- dt$gender  # add gender to create a standalone data.frame for this analysis
mach_l1 <- likert::likert(dt_pos[, 1:4, drop = FALSE], grouping = dt_pos$gender)  # view the first 4 (1:4 questions) and group them by gender
likert.bar.plot(mach_l1)
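
The percentages behind bar charts like these can also be tabulated directly, together with the median category per group. A minimal base R sketch, using simulated responses rather than the MACH-IV data (the objects `sim_gender` and `sim_answer` are hypothetical):

```r
set.seed(42)  # make the simulated data reproducible

answer_levels <- c("strongly disagree", "disagree", "neutral",
                   "agree", "strongly agree")

# Simulated respondents (hypothetical data, not the MACH-IV responses)
sim_gender <- factor(sample(c("male", "female"), 200, replace = TRUE),
                     levels = c("male", "female"))
sim_answer <- factor(sample(answer_levels, 200, replace = TRUE),
                     levels = answer_levels, ordered = TRUE)

# Row percentages: share of each answer within each gender
counts <- table(sim_gender, sim_answer)
pct <- round(100 * prop.table(counts, margin = 1), 1)

# Median category per gender, taken from the ordered factor
med_by_gender <- tapply(sim_answer, sim_gender, function(x)
  as.character(quantile(x, probs = 0.5, type = 1)))

print(pct)
print(med_by_gender)
```

With real data, `sim_gender` and `sim_answer` would be replaced by the gender column and one question column of the cleaned data frame.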

Similarly, you can look at the first 4 negative questions. These graphics take up quite a bit of room.

dt_neg <- dt[negQ]
dt_neg$gender <- dt$gender
mach_l2 <- likert::likert(dt_neg[, 1:4, drop = FALSE], grouping = dt_neg$gender)
likert.bar.plot(mach_l2)

In the following examples, I’ve selected the first 2 questions from males to show the summary tables that sjPlot provides with the sjt.frq function.

attach(dt)  # must attach the dataset to analyze
sjt.frq(dt[dt$gender == "male", c(1:2)], emph.md = TRUE, show.summary = FALSE, emph.quart = TRUE,
no.output = FALSE)  # sjPlot provides quite a few options! emph.md emphasizes the median, emph.quart draws a line at the quartiles. If show.summary = TRUE, it provides mean, sigma and other values at the bottom of the table.
## Error in sjt.frq(dt[dt$gender == "male", c(1:2)], emph.md = TRUE, show.summary = FALSE, : could not find function "sjt.frq"

Similarly, the first 2 questions by females.

sjt.frq(dt[dt$gender == "female", c(1:2)], emph.md = TRUE, show.summary = TRUE, emph.quart = TRUE,
no.output = FALSE)
## Error in sjt.frq(dt[dt$gender == "female", c(1:2)], emph.md = TRUE, show.summary = TRUE, : could not find function "sjt.frq"

Rather than the tabular function above, you can use the sjt.grpmean function in sjPlot to evaluate the interval data. It produces very nice looking tables, with many options for analysis!

sjt.grpmean(var.cnt = age, var.grp = gender, digits = 1, no.output = FALSE)
## Error in sjt.grpmean(var.cnt = age, var.grp = gender, digits = 1, no.output = FALSE): could not find function "sjt.grpmean"
sjt.grpmean(var.cnt = seconds_elapsed, var.grp = gender, digits = 1, no.output = FALSE)
## Error in sjt.grpmean(var.cnt = seconds_elapsed, var.grp = gender, digits = 1, : could not find function "sjt.grpmean"
sjt.grpmean(var.cnt = score, var.grp = gender, digits = 1, no.output = FALSE)
## Error in sjt.grpmean(var.cnt = score, var.grp = gender, digits = 1, no.output = FALSE): could not find function "sjt.grpmean"

## Statistical Inference Techniques

The Wilcoxon test is performed in R with the wilcox.test function, which compares two groups defined by a binary factor, such as gender (M/F), or paired measurements, such as a pre-post analysis. For example, if only pre-scores ($$x$$) are given, or if both pre-scores ($$x$$) and post-scores ($$y$$) are given and paired is TRUE (i.e., the same person’s answers), a Wilcoxon signed rank test is performed of the null hypothesis that the distribution of $$x$$ (in the one-sample case) or of $$x - y$$ (in the paired two-sample case) is symmetric about $$\mu$$. With two independent groups, as in our example, wilcox.test performs a Wilcoxon rank sum test instead. Our example analysis will use gender and a single Likert question.

library(FSA)
dt_wc <- dt
dt_wc$`Q1. Never tell anyone the real reason you did something unless it is useful to do so.` <- as.numeric(dt_wc$`Q1. Never tell anyone the real reason you did something unless it is useful to do so.`)
pander(Summarize(`Q1. Never tell anyone the real reason you did something unless it is useful to do so.` ~ gender, data = dt_wc, digits = 3))

| gender | n | mean | sd | min | Q1 | median | Q3 | max |
|---|---|---|---|---|---|---|---|---|
| male | 7692 | 3.538 | 1.229 | 1 | 3 | 4 | 5 | 5 |
| female | 4079 | 3.099 | 1.252 | 1 | 2 | 3 | 4 | 5 |

wt <- wilcox.test(as.numeric(dt_wc[, 1]) ~ as.factor(gender), data = dt_wc, alternative = "two.sided", exact = FALSE)
pandoc.table(wt)

Table continues below

| statistic | parameter | p.value | null.value |
|---|---|---|---|
| c(W = 18789658) | NULL | 4.62656668408054e-74 | c(location shift = 0) |

| alternative | method | data.name |
|---|---|---|
| two.sided | Wilcoxon rank sum test with continuity correction | as.numeric(dt_wc[, 1]) by as.factor(gender) |

The Kruskal-Wallis test is performed in R with the kruskal.test function, which handles grouping factors with more than two levels, such as education and certification level (Associate’s, Bachelor’s, Master’s, or Doctorate). The current study does not contain such a factor, so I randomly relabeled the gender of 2000 respondents as unknown to create a third level.

dt_kw <- dt
dt_kw$gender <- factor(dt_kw$gender, levels = c("male", "female", "unknown"))  # create a third category
dt_kw$gender[sample(nrow(dt_kw), 2000)] <- "unknown"  # label

dt_kw$`Q1. Never tell anyone the real reason you did something unless it is useful to do so.` <- as.numeric(dt_kw$`Q1. Never tell anyone the real reason you did something unless it is useful to do so.`)

pander(Summarize(`Q1. Never tell anyone the real reason you did something unless it is useful to do so.` ~
gender, data = dt_kw, digits = 3))

| gender | n | mean | sd | min | Q1 | median | Q3 | max |
|---|---|---|---|---|---|---|---|---|
| male | 6385 | 3.546 | 1.225 | 1 | 3 | 4 | 5 | 5 |
| female | 3386 | 3.105 | 1.254 | 1 | 2 | 3 | 4 | 5 |
| unknown | 2000 | 3.353 | 1.262 | 1 | 2 | 4 | 4 | 5 |

kt <- kruskal.test(as.numeric(dt_kw[, 1]) ~ as.factor(gender), data = dt_kw)

pandoc.table(kt)

Table continues below

| statistic | parameter | p.value |
|---|---|---|
| c(Kruskal-Wallis chi-squared = 277.85233366503) | c(df = 2) | 4.62521955006316e-61 |

| method | data.name |
|---|---|
| Kruskal-Wallis rank sum test | as.numeric(dt_kw[, 1]) by as.factor(gender) |
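
If you want to try the two rank-based tests without downloading the data, the sketch below runs them on simulated 1 to 5 responses. Everything in it (the data frame `sim`, the group labels, the response probabilities) is hypothetical and only meant to show the shape of the calls.

```r
set.seed(7)  # make the simulated data reproducible

# Simulated 1-5 responses for three hypothetical groups (not the MACH-IV data)
probs_a <- c(0.10, 0.15, 0.20, 0.30, 0.25)
probs_b <- c(0.20, 0.25, 0.25, 0.20, 0.10)
sim <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 100)),
  answer = c(sample(1:5, 100, replace = TRUE, prob = probs_a),
             sample(1:5, 100, replace = TRUE, prob = probs_b),
             sample(1:5, 100, replace = TRUE, prob = probs_b))
)

# Two groups: Wilcoxon rank sum test.
# exact = FALSE because Likert data contain many ties.
two_groups <- droplevels(subset(sim, group != "c"))
wt_sim <- wilcox.test(answer ~ group, data = two_groups, exact = FALSE)

# Three groups: Kruskal-Wallis rank sum test
kt_sim <- kruskal.test(answer ~ group, data = sim)

print(wt_sim$p.value)
print(kt_sim$p.value)
```

Both functions return an object of class `htest`, so the statistic, degrees of freedom and p-value can be pulled out by name instead of being read off a printed table.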

The one-way ANOVA is a parametric test of differences between group means, and it is appropriate only when ordinal data (Likert data) are treated as, or transformed to, interval data. The data, however, still have to meet the assumptions of the test (normality, etc.). For example, the ordinal data can be ordered, and then a table is produced and the data converted to probability or frequency. For this analysis we use the score values, grouped by a three-level gender factor.

For the interval data, such as scores, we can use the HH package to conduct an ANOVA. The documentation for the hov and hovPlot functions states:

Oneway analysis of variance makes the assumption that the variances of the groups are equal. Brown and Forsyth, 1974 present the recommended test of this assumption. The Brown and Forsyth test statistic is the F statistic resulting from an ordinary one-way analysis of variance on the absolute deviations from the median.

The “trellis” object with three panels containing boxplots for each group: The observed data “y”, the data with the median subtracted “y-med(y)”, and the absolute deviations from the median “abs(y-med(y))”. The Brown and Forsyth test statistic is the F statistic resulting from an ordinary one-way analysis of variance on the data points in the third panel.
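
The description quoted above can be reproduced by hand in base R, which makes it clearer what the test does. The sketch below is only an illustration on simulated scores (the vectors `sim_score` and `sim_group` are hypothetical): it computes the absolute deviations from each group median, runs an ordinary one-way ANOVA on them, and then fits the one-way ANOVA on the scores themselves with `aov`.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated interval-like scores for three hypothetical groups
sim_score <- c(rnorm(50, mean = 68, sd = 13),
               rnorm(50, mean = 62, sd = 12),
               rnorm(50, mean = 65, sd = 20))
sim_group <- factor(rep(c("male", "female", "unknown"), each = 50))

# Absolute deviations from each group's median (the "third panel" in the quote)
abs_dev <- abs(sim_score - ave(sim_score, sim_group, FUN = median))

# An ordinary one-way ANOVA on those deviations gives the F statistic described above
bf <- anova(lm(abs_dev ~ sim_group))
bf_F <- bf[["F value"]][1]
bf_p <- bf[["Pr(>F)"]][1]

# The one-way ANOVA on the scores themselves
fit <- aov(sim_score ~ sim_group)

print(c(F = bf_F, p = bf_p))
print(summary(fit))
```

A small p-value for the deviations suggests unequal variances, in which case `oneway.test(sim_score ~ sim_group)`, which does not assume equal variances by default, may be a safer choice than `aov`.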

library(HH)

dt_gen <- dt

dt_gen$gender <- factor(dt_gen$gender, levels = c("male", "female", "unknown"))  # create a third category
dt_gen$gender[sample(nrow(dt_gen), 2000)] <- "unknown"  # label

hovPlot(as.numeric(score) ~ as.factor(gender), data = dt_gen)  # example test of the assumptions

hv <- hov(as.numeric(score) ~ as.factor(gender), data = dt_gen)  # example parametric test

pandoc.table(hv)

Table continues below

| statistic | parameters | p.value |
|---|---|---|
| c(F = 7.32049970867287) | c(df:as.factor(gender) = 2, df:Residuals = 11768) | 0.000664849651096874 |

| alternative | method | data.name |
|---|---|---|
| variances are not identical | hov: Brown-Forsyth | as.numeric(score) |

set.seed(101)

fac_num <- function(x) {
match(as.character(x), c("strongly disagree", "disagree", "neutral", "agree", "strongly agree"))  # map answer labels to 1-5; helper only, the sapply call below does the same conversion
}

dt_gen_num <- sapply(dt_gen[, c(1:20)], function(x) as.integer(factor(as.character(x), levels = c("strongly disagree", "disagree", "neutral", "agree", "strongly agree"))))  # convert the 20 answer columns to integers 1-5
dt_gen_num <- cbind(dt_gen_num, dt_gen[, c(21:24)])  # re-attach the remaining columns
colnames(dt_gen_num) <- colnames(dt_gen)

mat <- dt_gen_num[, c(1:20)]  # answers only
myPCA <- prcomp(mat, scale. = FALSE, center = FALSE)  # principal component analysis on the uncentered, unscaled answers
plot(myPCA)
myPCA$rotation  # loadings
                                                                                                                                                             PC1

myPCA$x  # scores
         PC1           PC2           PC3           PC4           PC5