For decades, much work in cognitive linguistics, psycholinguistics, and corpus linguistics has placed great emphasis on the frequencies with which words and other linguistic elements occur in language; such work has argued, among other things, that frequency is related to how linguistic elements are learned, processed, and mentally represented.
More recently, interest has been revived in the notion of dispersion, i.e. measures that are probabilistically related to frequency but quantify the evenness/clumpiness/burstiness with which a word is distributed in a corpus. Here we will compute a measure of dispersion for a handful of words in the British Component of the International Corpus of English (ICE-GB). This measure is called DP (for Deviation of Proportions), for which you need the following information:

- the sizes of the corpus parts, normalized so that they sum to 1 (this will be called expected below);
- the frequencies of the word in question in each corpus part, normalized against the word’s overall frequency (this will be called observed below).
For example, if you have a corpus that consists of three parts A, B, and C (with, say, 4000, 3000, and 3000 words respectively), and the word in question occurs in it 6 times, namely 1 time in A, 2 times in B, and 3 times in C, then:
In R:
corpus.part.sizes <- c(4000, 3000, 3000); (expected <- corpus.part.sizes/sum(corpus.part.sizes))
## [1] 0.4 0.3 0.3
(observed <- 1:3/6)
## [1] 0.1666667 0.3333333 0.5000000
The measure DP is then computed as follows:
sum(abs(observed-expected))/2
## [1] 0.2333333
DP falls into the range of [0, 1]: values close to 0 indicate that the word is distributed across the corpus parts pretty much as the parts’ sizes would lead one to expect (an even distribution), whereas values close to 1 indicate that the word is concentrated in only one or very few parts (an uneven/clumpy distribution).
(Some might find it desirable to actually use 1-DP because the majority of other dispersion measures have that kind of orientation, i.e. one where low and high values mean ‘uneven’ and ‘even distribution’ respectively.)
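If you want this computation in a reusable form, here is a minimal helper function; its name dp and its argument names are just illustrative, not part of any package:
dp <- function(freqs, part.sizes) { # freqs: the word's frequency in each corpus part
expected <- part.sizes/sum(part.sizes) # relative part sizes
observed <- freqs/sum(freqs) # the word's relative frequencies in the parts
sum(abs(observed-expected))/2 # DP
}
dp(freqs=1:3, part.sizes=c(4000, 3000, 3000)) # 0.2333333, as above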
If we want to determine the dispersion of several words in a corpus, we need to

- define the corpus files;
- compute expected, i.e. the (relative) size of each corpus file;
- compute observed for each word of interest, i.e. its (relative) frequencies in all corpus files;
- compute DP from observed and expected.

The above tasks require the following functions: c, dir, or some file-choosing function to define the corpus files, and then a for-loop where we scan in each file, set it to lower case with tolower, and use gsub or gregexpr/regmatches or exact.matches.2 to extract the words; expected comes from the size of each file (length, within or after the for-loop), and observed for each word of interest comes from retrieving its frequencies in all files from the somehow/somewhere: [...]

That means, before we enter the for-loop, we need a collector that will collect the corpus, and after the for-loop we need a collector that will collect the DP-values. Let’s break this down. As you will see, there are multiple ways of doing this; I am outlining only the first one here and we might develop the other ones later:

- define a vector words.of.interest with some words of interest;
- define a collector for the corpus (a list) and start a for-loop over each file, in which we
  - scan in the file and set it to lower case,
  - use gregexpr/regmatches to retrieve everything between { and } (i.e. the words),
  - store those words in the list collecting the corpus;
- (after the for-loop) compute expected from the sizes of the list elements;
- (after the for-loop) compute observed for each word of interest by counting its frequencies in all files from the collector, and from those compute the DP-values.

We begin with a clean workspace:
rm(list=ls(all=TRUE))
We define the words of interest:
words.of.interest <- c("the", "on", "with", "of", "for", "at", "forward", "pigmeat", "ozzie", "accidie")
We unzip and define the corpus files from the International Corpus of English, the British Component:
unzip("files/ICEGB_sampled.zip", # unzip this zip archive
exdir="files") # into the files folder/directory
head(corpus.files <- dir( # make corpus.files the content of
"files/ICEGB_sampled", # this directory
recursive=TRUE, # browse all sub-folders
full.names=TRUE) # return full names
) # end of head
## [1] "files/ICEGB_sampled/S1A-001.COR" "files/ICEGB_sampled/S1A-002.COR"
## [3] "files/ICEGB_sampled/S1A-003.COR" "files/ICEGB_sampled/S1A-004.COR"
## [5] "files/ICEGB_sampled/S1A-005.COR" "files/ICEGB_sampled/S1A-006.COR"
We define a collector structure corpus to collect ‘the whole corpus’; this will be a list that will contain as many character vectors as the corpus has files:
corpus <- vector( # make corpus a
mode="list", # list
length=length(corpus.files)) # with 1 part per corpus file
# name each part by the corpus file
names(corpus) <- sub(".COR$", "", basename(corpus.files), perl=TRUE)
# check the structure
head(corpus)
## $`S1A-001`
## NULL
##
## $`S1A-002`
## NULL
##
## $`S1A-003`
## NULL
##
## $`S1A-004`
## NULL
##
## $`S1A-005`
## NULL
##
## $`S1A-006`
## NULL
Then, we loop over each file name; within the loop, we scan in the corpus file (lower-cased), use gregexpr to find everything after a { till before the next }, and use regmatches to retrieve those words:
system.time({ # measure the time of everything that follows
for(i in seq(corpus.files)) { # access each corpus file
# we read in each corpus file
current.corpus.file <- tolower(scan( # make current.corpus file the lower case version of
corpus.files[i], # the i-th/current corpus file
what=character(), # as a character vector
sep="\n", # with linebreaks as separators between vector elements
quote="", # no quote characters
quiet=TRUE)) # and no feedback about the number of elements read
# we find where the words are ...
locations.of.words <- gregexpr( # make the locations of the words the result of gregexpring
"(?<={)[^}]+", # for not-} after a {
current.corpus.file, # in current.corpus.file
perl=TRUE) # using Perl-compatible regular expressions
# ... and get them
current.words <- regmatches( # make current.words the matches
current.corpus.file, # in current.corpus.file
locations.of.words) # that the gregexpr just found
# we save the words in the file in the list collecting the whole corpus
corpus[[i]] <- tolower(unlist(current.words))
cat("\f", i/length(corpus.files)) # output to the screen the % of files dealt w/ now
} # end of for
}) # approximately 12 seconds
## 0.002 0.004 0.006 0.008 0.01 [...] 0.992 0.994 0.996 0.998 1
## user system elapsed
## 12.164 0.000 12.164
object.size(corpus) # 28,278,272
## 28278272 bytes
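To see in isolation what the gregexpr/regmatches combination in the loop extracts, here is a toy example; the input line is made up and merely mimics the curly-brace markup of the .COR files:
toy.line <- "<sent> {the} <art> {cat} <n> {sat} <v>" # a made-up line with {word} markup
m <- gregexpr("(?<={)[^}]+", toy.line, perl=TRUE) # find the stretches of non-} after each {
regmatches(toy.line, m) # a list containing "the" "cat" "sat"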
We’re not doing any further cleaning here (for things such as numbers, pauses (“<,>”), unclear words, etc.).
expected

From this we can now compute expected, i.e. the relative file sizes, which are the same for all words of interest: they will be the lengths of each list element:
expected <- sapply( # make expected the result of applying
corpus, # to each list element within corpus
length) # the function length
# and then convert those into relative lengths/frequencies
head(expected <- expected/sum(expected))
## S1A-001 S1A-002 S1A-003 S1A-004 S1A-005 S1A-006
## 0.002047279 0.002018379 0.002133981 0.002135846 0.001980156 0.001919558
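As a quick sanity check on the objects just created:
length(expected) # one relative size per corpus file
sum(expected) # 1: the relative file sizes add up to 1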
observed and DP

Now we can turn to observed for each word of interest and, then, DP. First, we create a collector for the DP-values:
DP.values <- rep(NA, length(words.of.interest)) # create a collector for the DP-values
names(DP.values) <- words.of.interest # use the words for its names
Then, we loop over each word of interest (calling it i) and use sapply to look into each list element, with an anonymous function that checks how many (sum) of the words in the list element are (==) the current word i:
for (i in words.of.interest) { # for each word of interest
observed <- sapply( # make observed the result of applying to
corpus, # each list element within corpus
# an anonymous function that takes the list element & counts how many of its words are the current i
function (currlistelem) sum(currlistelem==i)) # i.e. i's frequency in that file
observed <- observed/sum(observed) # change that into relative lengths/frequencies, too
DP.values[i] <- sum(abs(observed-expected))/2 # compute and store DP
}
Let’s check the result:
DP.values # check result
## the on with of for at forward pigmeat
## 0.1557128 0.1704746 0.1745771 0.1978785 0.1988233 0.2018111 0.7386465 0.9966447
## ozzie accidie
## 0.9976535 0.9977551
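If you prefer the reversed orientation mentioned above, where high values mean ‘evenly distributed’, the conversion is a one-liner (output not shown):
round(1-DP.values, 3) # 1-DP for all words of interest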
In some sense, that was the easiest way to program this, but it is actually terrible in terms of R programming and memory management. Why do we store the whole corpus like we did?! Why do we store multiple occurrences of the same word when we can store it once with its frequency? This is what we change now.
The collector structure corpus is actually set up in the same way as before, an empty list with an element for each corpus file:
corpus <- vector( # make corpus a
mode="list", # list
length=length(corpus.files)) # with 1 part per corpus file
# name each part by the corpus file
names(corpus) <- sub(".COR$", "", basename(corpus.files), perl=TRUE)
# check the structure
head(corpus)
## $`S1A-001`
## NULL
##
## $`S1A-002`
## NULL
##
## $`S1A-003`
## NULL
##
## $`S1A-004`
## NULL
##
## $`S1A-005`
## NULL
##
## $`S1A-006`
## NULL
And the loop etc. is also the same, with one tiny exception: now we don’t save all the words themselves, we save their frequency table:
system.time({ # measure the time of everything that follows
for(i in seq(corpus.files)) { # access each corpus file
# we read in each corpus file
current.corpus.file <- tolower(scan( # make current.corpus file the lower case version of
corpus.files[i], # the i-th/current corpus file
what=character(), # as a character vector
sep="\n", # with linebreaks as separators between vector elements
quote="", # no quote characters
quiet=TRUE)) # and no feedback about the number of elements read
# we find where the words are ...
locations.of.words <- gregexpr( # make the locations of the words the result of gregexpring
"(?<={)[^}]+", # for not-} after a {
current.corpus.file, # in current.corpus.file
perl=TRUE) # using Perl-compatible regular expressions
# ... and get them
current.words <- regmatches( # make current.words the matches
current.corpus.file, # in current.corpus.file
locations.of.words) # that the gregexpr just found
# we save the frequency list of the words in the file in the list collecting the whole corpus # <---
corpus[[i]] <- table(tolower(unlist(current.words))) # <---
# this was # <---
# corpus[[i]] <- tolower(unlist(current.words)) # <---
cat("\f", i/length(corpus.files)) # output to the screen the % of files dealt w/ now
} # end of for
}) # approximately 13 seconds
## 0.002 0.004 0.006 0.008 0.01 [...] 0.992 0.994 0.996 0.998 1
## user system elapsed
## 12.863 0.021 12.884
object.size(corpus) # 24,074,424
## 24074424 bytes
This took maybe a little more time (but very little), but needs approximately 15% less RAM – with bigger data, that is a trade-off decision you will have to make.
expected

Before, we had all the words of a corpus file in each list element, so we used length to get the size of a file. Now we have frequencies in the list elements, so we sum:
expected <- sapply( # make expected the result of applying
corpus, # to each list element within corpus
sum) # the function sum
# and then convert those into relative lengths/frequencies
head((expected <- expected/sum(expected)))
## S1A-001 S1A-002 S1A-003 S1A-004 S1A-005 S1A-006
## 0.001493844 0.001399911 0.001533236 0.001621109 0.001530206 0.001469603
observed and DP

Before, we had all the words of a corpus file in each list element, so we used sum(...==i) (within sapply) to sum up how often each word was in each corpus file. Now we have frequencies in the list elements, so we use the subsetting function [ (also within sapply) to retrieve the already computed frequency of each word in each corpus file. However, since the absence of a word will return NA, we then change NA to 0; the rest is as before.
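To see the behaviour of [ with unattested words in isolation, here is a toy example with two small made-up frequency tables:
toy <- list(f1=table(c("a", "a", "b")), f2=table(c("b", "c"))) # two made-up frequency tables
sapply(toy, "[", "a") # 2 for f1, NA for f2, because "a" is unattested there

Now the actual loop: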
DP.values <- rep(NA, length(words.of.interest)) # create a collector for the DP-values
names(DP.values) <- words.of.interest # use the words for its names
for (i in words.of.interest) { # for each word of interest
observed <- sapply( # make observed the result of applying to
corpus, # to each list element within corpus
"[", # the subsetting function # <---
i) # namely subsetting for exactly the current word # <---
observed[is.na(observed)] <- 0 # change NA for unattested words into 0 # <---
observed <- observed/sum(observed) # change that into relative lengths/frequencies, too
DP.values[i] <- sum(abs(observed-expected))/2 # compute and store DP
}
DP.values # check result
## the on with of for at forward pigmeat
## 0.1290733 0.1824955 0.1710262 0.1677500 0.1865404 0.2179838 0.7345712 0.9961609
## ozzie accidie
## 0.9977123 0.9976729
This is how you could compute dispersions for all words using the above logic. First, you’d have to make sure that R knows you now want results for all words, i.e. words.of.interest must contain each word type in the corpus. This is how that can be done quickly:
words.of.interest <- sort( # make words.of.interest
unique( # the unique types
unlist( # of the unlisted
sapply( # list when you apply
corpus, # to each frequency table in corpus
names)))) # the function names
Or we do this with the pipe:
library(magrittr)
words.of.interest <- # make words.of.interest by
corpus %>% # taking the list corpus
sapply(names) %>% # retrieving the names from the tables
unlist %>% # unlisting the whole thing
unique %>% # get only unique word types
sort # & sort them alphabetically
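Before we run the loop over all of these, it may be worth checking how many word types we are now dealing with (output not shown here):
length(words.of.interest) # the number of word types in the corpus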
We create our collector again with the new size:
DP.values <- rep(NA, length(words.of.interest)) # create a collector for the DP-values
names(DP.values) <- words.of.interest # use the words for its names
Then, we run the above loop, which takes around 6 minutes on a pretty fast AMD Ryzen with 64GB of fast RAM:
system.time({ # measure how long this takes
for (i in words.of.interest) { # for each word of interest
observed <- sapply( # make observed the result of applying to
corpus, # to each list element within corpus
"[", # the subsetting function
i) # namely subsetting for exactly the current word
observed[is.na(observed)] <- 0 # change NA for unattested words into 0
observed <- observed/sum(observed) # change that into relative lengths/frequencies, too
DP.values[i] <- sum(abs(observed-expected))/2 # compute and store DP
}}) # approximately 6 minutes
## user system elapsed
## 365.169 0.133 365.311
summary(DP.values) # check result
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1193 0.9954 0.9975 0.9890 0.9979 0.9988
DP.values[c("the", "snort")]
## the snort
## 0.1290733 0.9974214
And now, how would one really do this? One would first create a term-document matrix tdm.
# reconstruct one long vector with every word token in the corpus from the frequency tables
words <- unname(unlist(sapply(corpus, function(af) rep(names(af), af), USE.NAMES=FALSE)))
# and a parallel vector stating which file each of those tokens comes from
files <- rep(
names(corpus), # each file name, repeated
sapply(corpus, sum)) # as many times as the file has tokens
tdm <- table(words, files) # cross-tabulate words & files; check e.g. tdm[53400:53409, 1:10]
But how big is that thing?!
object.size(tdm) # 124,463,744
## 124463744 bytes
Whoa, this is more than 5 times as big as our smaller version of corpus before … So how long does handling this take now?!
system.time({ # measure how long this takes
DP.values <- apply( # make DP.values the result of applying
tdm, # to the term-document matrix
1, # row-wise, i.e. to each word
function (af) sum(abs((af/sum(af))-expected))/2 ) # the DP computation from above
}) # approximately 1 second
## user system elapsed
## 0.942 0.040 0.982
summary(DP.values) # check result
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1193 0.9954 0.9975 0.9890 0.9979 0.9988
DP.values[c("the", "snort")]
## the snort
## 0.1290733 0.9974214
What a trade-off: using a representation of the data that is 5.16 times as large as the smallest one we could manage, we sped up processing by a factor of several hundred … Which also means that, when people tell you R is slow, often it’s just that their code is not optimal …
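By the way, the same row-wise computation can also be written without apply, purely with matrix arithmetic; the following is only a sketch (DP.values.alt is a name introduced here just for illustration), but under the assumptions above it computes the same values:
rel <- tdm/rowSums(tdm) # each word's relative frequencies across the files
diffs <- sweep(rel, 2, expected[colnames(tdm)]) # subtract each file's expected proportion
DP.values.alt <- rowSums(abs(diffs))/2 # DP for every word at once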
Housekeeping:
unlink("files/ICEGB_sampled", recursive=TRUE)
sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 22.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods compiler
## [8] base
##
## other attached packages:
## [1] magrittr_2.0.3
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 jsonlite_1.8.4 evaluate_0.20
## [5] cachem_1.0.7 rlang_1.0.6 cli_3.6.0 rstudioapi_0.14
## [9] jquerylib_0.1.4 bslib_0.4.2 rmarkdown_2.20 tools_4.2.2
## [13] xfun_0.37 yaml_2.3.7 fastmap_1.1.1 htmltools_0.5.4
## [17] knitr_1.42 sass_0.4.5