Ling 201, session 02: R basics (key)

Author

Affiliation

UC Santa Barbara & JLU Giessen

Published

14 Jan 2025 12-34-56

rm(list=ls(all=TRUE)); library(magrittr)

1 Exercise 01

Generate a data frame abc that contains the letters from “a” to “j” in the first column and the integers from 10 to 1 in the second column. Make sure the first column is called “LETTER” and the second “NUMBER” and that columns with categorical variables are factors.

Here is the most stepwise way to do this: We first set up the vectors …

LETTER <-        # create a data structure called LETTER with
   letters[1:10] # the first 10 elements of the inbuilt vector letters
NUMBER <- # create a data structure called NUMBER
   10:1   # with the integers from 10 to 1 in it

… and then put them in the data frame:

abc <- data.frame(        # create a data frame abc
   LETTER,                # with LETTER as the 1st column
   NUMBER,                # with NUMBER as the 2nd column
   stringsAsFactors=TRUE) # and make categorical variables factors (!)

Here’s how to do this in one go, w/out creating LETTER and NUMBER separately first:

abc <- data.frame(        # create a data frame abc
   LETTER=letters[1:10],  # with LETTER as the 1st column, w/ letters from a to j in there
   NUMBER=10:1,           # with NUMBER as the 2nd column, w/ numbers from 10 to 1 in there
   stringsAsFactors=TRUE) # and make categorical variables factors (!)
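To confirm that the result meets the requirements of the exercise, one can inspect the column names and classes. This is just a minimal sketch; it rebuilds abc so that it runs on its own:

```r
abc <- data.frame(
   LETTER=letters[1:10],
   NUMBER=10:1,
   stringsAsFactors=TRUE)
colnames(abc)      # should be "LETTER" and "NUMBER"
sapply(abc, class) # LETTER should be a factor, NUMBER an integer vector
```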

2 Exercise 02

Load the text file _input/dataframe1.csv into a data frame example such that the first row is recognized as containing the column names and columns with categorical variables are factors.

str(example <- read.delim(  # read in the data frame example & show its structure
   "_input/dataframe1.csv", # from this file and 
   stringsAsFactors=TRUE))  # make categorical variables factors (!)

'data.frame':   12 obs. of  4 variables:
 $ CASE        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ GRMRELATION : Factor w/ 2 levels "obj","subj": 1 1 1 1 1 1 2 2 2 2 ...
 $ LENGTH      : int  2 2 10 6 7 4 3 9 9 9 ...
 $ DEFINITENESS: Factor w/ 2 levels "def","indef": 1 1 1 2 2 2 1 1 1 2 ...
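Note that the call above never says that the first row contains the column names: read.delim assumes header=TRUE and sep="\t" by default. The following sketch illustrates this with a small made-up file in a temporary location (the file and the name toy are assumptions for illustration, not part of the course materials):

```r
# write a tiny tab-delimited toy file to a temporary location
tmp <- tempfile(fileext=".csv")
writeLines(c("CASE\tGRMRELATION\tLENGTH",
             "1\tobj\t2",
             "2\tsubj\t9"), tmp)
# no header=... or sep=... needed: the defaults already fit
toy <- read.delim(tmp, stringsAsFactors=TRUE)
str(toy)
```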

3 Exercise 03

Extract from this data frame example

the second and third column;

# a.
example$GRMRELATION # column 2: individually

 [1] obj  obj  obj  obj  obj  obj  subj subj subj subj subj subj
Levels: obj subj

example$LENGTH      # column 3: individually

 [1]  2  2 10  6  7  4  3  9  9  9  7  9

example[,2:3] # columns 2 & 3 jointly (way 1)

   GRMRELATION LENGTH
1          obj      2
2          obj      2
3          obj     10
4          obj      6
5          obj      7
6          obj      4
7         subj      3
8         subj      9
9         subj      9
10        subj      9
11        subj      7
12        subj      9

example[,c("GRMRELATION", "LENGTH")] # columns 2 & 3 jointly (way 2)

   GRMRELATION LENGTH
1          obj      2
2          obj      2
3          obj     10
4          obj      6
5          obj      7
6          obj      4
7         subj      3
8         subj      9
9         subj      9
10        subj      9
11        subj      7
12        subj      9

the third and fourth row.

# b.
example[3:4,]

  CASE GRMRELATION LENGTH DEFINITENESS
3    3         obj     10          def
4    4         obj      6        indef
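One detail that may be worth knowing here: when you extract a single column with square brackets, R simplifies the result to a vector unless you say drop=FALSE. A small sketch with a made-up data frame toy:

```r
toy <- data.frame(CASE=1:3, LENGTH=c(2, 10, 6))
class(toy[,2])             # simplified to a plain vector
class(toy[,2, drop=FALSE]) # still a data frame with one column
class(toy[,1:2])           # two or more columns: always a data frame
```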

4 Exercise 04

Split the data frame example up according to the content of the second column (enter ?split at the R prompt for help) and call the result example.split.

example.split <- split( # make example.split the result of splitting up
   example,             # the data frame example
   example$GRMRELATION) # depending on the values of the column GRMRELATION
# separate manual alternatives:
subset(                # show a subset
   example,            # of the data frame example, namely
   example$GRMRELATION=="obj") # when GRMRELATION is "obj"

  CASE GRMRELATION LENGTH DEFINITENESS
1    1         obj      2          def
2    2         obj      2          def
3    3         obj     10          def
4    4         obj      6        indef
5    5         obj      7        indef
6    6         obj      4        indef

subset(                  # show a subset
   example,              # of the data frame example, namely
   example$GRMRELATION=="subj") # when GRMRELATION is "subj"

   CASE GRMRELATION LENGTH DEFINITENESS
7     7        subj      3          def
8     8        subj      9          def
9     9        subj      9          def
10   10        subj      9        indef
11   11        subj      7        indef
12   12        subj      9        indef
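The result of split is a list of data frames, one per level of the splitting factor, whose components can be accessed by name. Here is a sketch with a small made-up data frame toy (not the course data):

```r
toy <- data.frame(
   CASE=1:4,
   GRMRELATION=c("obj", "obj", "subj", "subj"),
   stringsAsFactors=TRUE)
toy.split <- split(toy, toy$GRMRELATION)
class(toy.split) # a list ...
names(toy.split) # ... with one component per factor level
toy.split$subj   # the rows where GRMRELATION is "subj"
```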

5 Exercise 05

Change the value at the intersection of the third row and the fourth column into “indef” and save the changed data frame into _output/dataframe2.csv such that you can easily load/edit it in spreadsheet software.

example[3,4] <- "indef"
write.table(                      # write the data frame
   example,                       # example
   sep="\t", eol="\n",            # with tabs between columns & line breaks
   row.names=FALSE, quote=FALSE,  # no row names & quotes (for factors)
   file="_output/dataframe2.csv") # into this file
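The assignment above works without complaint because "indef" is already a level of the factor DEFINITENESS. With a value that is not yet a level, R issues a warning and inserts NA instead. The sketch below uses a made-up data frame toy and a made-up value "generic" to show this and one possible way around it:

```r
toy <- data.frame(DEFINITENESS=c("def", "indef", "def"),
                  stringsAsFactors=TRUE)
toy[1,1] <- "indef"   # fine: "indef" is an existing level
toy[2,1] <- "generic" # not a level: R warns & inserts NA
toy
# one way around that: register the new level first
levels(toy$DEFINITENESS) <- c(levels(toy$DEFINITENESS), "generic")
toy[2,1] <- "generic"
toy
```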

6 Exercise 06

Generate the following data frame and call it EPP (for English personal pronouns):

PRONOUN	PERSON	NUMBER
I	1	sg
you	2	sg
he	3	sg
she	3	sg
it	3	sg
we	1	pl
you	2	pl
they	3	pl

(EPP <- data.frame(        # create (& then show) a data frame EPP
   PRONOUN=c("I", "you", "he", "she", "it", "we", "you", "they"),
   PERSON =c(1:3, 3, 3, 1:3),
   NUMBER =rep(c("sg", "pl"), c(5, 3)),
   stringsAsFactors=TRUE)) # make categorical variables factors (!)

  PRONOUN PERSON NUMBER
1       I      1     sg
2     you      2     sg
3      he      3     sg
4     she      3     sg
5      it      3     sg
6      we      1     pl
7     you      2     pl
8    they      3     pl

7 Exercise 07

Extract from this data frame

the value of the 4th row and the 2nd column;

EPP[4,2]

[1] 3

the values of the 3rd to 4th rows and the 1st to 2nd columns;

EPP[3:4, 1:2]

  PRONOUN PERSON
3      he      3
4     she      3

the rows that have plural pronouns in them;

EPP[EPP$NUMBER=="pl",] # or

  PRONOUN PERSON NUMBER
6      we      1     pl
7     you      2     pl
8    they      3     pl

subset(          # show a subset
   EPP,          # of the data frame EPP, namely
   NUMBER=="pl") # when NUMBER is "pl"

  PRONOUN PERSON NUMBER
6      we      1     pl
7     you      2     pl
8    they      3     pl

the rows with 1st and 3rd person pronouns.

# note: this does NOT work, the output is incomplete:
# EPP[EPP$PERSON==c(1, 3),]
EPP[(EPP$PERSON==1    # of EPP, the rows when PERSON is 1
     |                # or
     EPP$PERSON==3),] # when PERSON is 3

  PRONOUN PERSON NUMBER
1       I      1     sg
3      he      3     sg
4     she      3     sg
5      it      3     sg
6      we      1     pl
8    they      3     pl

# alternatives
EPP[EPP$PERSON!=2,] # of EPP, the rows when PERSON is not 2

  PRONOUN PERSON NUMBER
1       I      1     sg
3      he      3     sg
4     she      3     sg
5      it      3     sg
6      we      1     pl
8    they      3     pl

subset(                 # show a subset
   EPP,                 # of the data frame EPP, namely
   PERSON %in% c(1, 3)) # when PERSON is in the set 1, 3

  PRONOUN PERSON NUMBER
1       I      1     sg
3      he      3     sg
4     she      3     sg
5      it      3     sg
6      we      1     pl
8    they      3     pl
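Why does the version with ==c(1, 3) return incomplete output? Because == compares element by element and recycles the shorter vector, so PERSON is checked against 1, 3, 1, 3, … in that order rather than against the set of 1 and 3. A sketch with a stand-alone vector that has the same values as EPP$PERSON:

```r
PERSON <- c(1, 2, 3, 3, 3, 1, 2, 3) # same values as EPP$PERSON
PERSON == c(1, 3)          # compared pairwise against 1, 3, 1, 3, ...
which(PERSON == c(1, 3))   # only some of the wanted positions
which(PERSON %in% c(1, 3)) # all of them
```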

8 Exercise 08

Generate a vector FREQS of the frequencies with which the personal pronouns in EPP occurred in a small corpus: I: 8426, you: 9462, he: 6394, she: 4234, it: 6040, we: 2305, you: 8078, they: 2998. Then, make this vector the fourth column of EPP.

# shortest way:
EPP$FREQS <- c(8426, 9462, 6394, 4234, 6040, 2305, 8078, 2998)
# longest way:
FREQS <- c(8426, 9462, 6394, 4234, 6040, 2305, 8078, 2998)
EPP[,4] <- FREQS
colnames(EPP)[4] <- "FREQS"

9 Exercise 09

Save the data frame into _output/dataframe3.csv such that you can easily load/edit it in spreadsheet software.

write.table(                      # write the data frame
   EPP,                           # EPP
   sep="\t", eol="\n",            # with tabs between columns & line breaks
   row.names=FALSE, quote=FALSE,  # no row names & quotes (for factors)
   file="_output/dataframe3.csv") # into this file
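If you want to make sure that a file written like this can be read back in unchanged, a round trip is a simple check. The sketch below uses a made-up data frame toy and a temporary file rather than the _output folder:

```r
toy <- data.frame(PRONOUN=c("I", "you"), PERSON=1:2,
                  stringsAsFactors=TRUE)
tmp <- tempfile(fileext=".csv") # temporary file for illustration
write.table(toy, file=tmp, sep="\t", eol="\n",
            row.names=FALSE, quote=FALSE)
readLines(tmp) # header line plus one tab-separated line per row
back <- read.delim(tmp, stringsAsFactors=TRUE)
all.equal(toy, back) # do we get the same data frame back?
```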

10 Exercise 10

The file _input/dataframe4.csv contains data for the VERB into VERBing construction in the BNC (e.g., He [_V1 forced] him into [_V2 speaking] about it). For each instance of one such construction, the file contains

a column called BNC: the file where the instance was found (A06 in the first case);
a column called VERB_LEMMA: the lemma of the finite verb (force);
a column called ING_FORM: the gerund (speaking);
a column called ING_LEMMA: the lemma of the gerund (speak);
a column called ING_TAG: the part-of-speech tag of the gerund (VVG in the first case).

Load this file into a data frame COV and display the first six rows of COV. Correct the typo in line 3 (use take).

summary(COV <- read.delim(  # summarize the data structure COV imported
   "_input/dataframe4.csv", # from this file
   stringsAsFactors=TRUE))  # and make categorical variables factors (!)

      BNC         VERB_LEMMA        ING_FORM      ING_LEMMA       ING_TAG
 HH3    :  11   force  : 101   thinking : 146   think  : 147   VVG    :1239
 K5D    :  11   trick  :  92   believing: 104   believe: 104   NN1-VVG: 158
 CBG    :  10   fool   :  77   making   :  62   make   :  62   AJ0-VVG: 108
 EUU    :  10   talk   :  62   giving   :  54   give   :  54   VDG    :  49
 HGM    :  10   mislead:  57   accepting:  51   accept :  51   VBG    :  23
 HXE    :  10   coerce :  52   doing    :  49   do     :  49   VHG    :  15
 (Other):1538   (Other):1159   (Other)  :1134   (Other):1133   (Other):   8

COV[1:6,] # or: head(COV)  # look at the first 6 rows

  BNC VERB_LEMMA ING_FORM ING_LEMMA ING_TAG
1 A06      force speaking     speak     VVG
2 A08      nudge    being        be     VBG
3 A0C       talk   taking       tak     VVG
4 A0F      bully   taking      take     VVG
5 A0H  influence   trying       try     VVG
6 A0H     delude thinking     think     VVG

COV$ING_LEMMA[3] <- "take" # fix the typo in row 3

But something’s missing here – pay attention when you do the next exercise!

11 Exercise 11

What

is the quickest way of identifying the numbers of verb lemma types and -ing lemma types?

str(COV) # but the following is MUCH better

'data.frame':   1600 obs. of  5 variables:
 $ BNC       : Factor w/ 929 levels "A06","A08","A0C",..: 1 2 3 4 5 5 6 6 7 8 ...
 $ VERB_LEMMA: Factor w/ 208 levels "activate","aggravate",..: 76 126 186 26 96 51 75 149 152 186 ...
 $ ING_FORM  : Factor w/ 422 levels "abandoning","abdicating",..: 354 49 382 382 395 387 387 133 209 175 ...
 $ ING_LEMMA : Factor w/ 417 levels "abandon","abdicate",..: 349 41 378 378 390 383 383 378 207 173 ...
 $ ING_TAG   : Factor w/ 10 levels "AJ0-NN1","AJ0-VVG",..: 10 7 10 10 10 10 10 1 10 10 ...

length(                    # how many
   levels(COV$VERB_LEMMA)) # different levels of VERB_LEMMA are there?

[1] 208

length(                    # how many
   unique(COV$VERB_LEMMA)) # unique types of VERB_LEMMA are there?

[1] 208

length(                   # how many
   levels(COV$ING_LEMMA)) # different levels of ING_LEMMA  are there?

[1] 417

length(                   # how many
   unique(COV$ING_LEMMA)) # unique types of ING_LEMMA  are there?

[1] 416

Why are the results for ING_LEMMA conflicting?

# a. continued
# you overwrote the level "tak", which is why it's not listed as a unique type for ING_LEMMA anymore
# but R still remembers it as, so to speak, a possible/potential level
# thus, you should use droplevels, then it'll work:
COV <- droplevels(COV)
length(levels(COV$ING_LEMMA))

[1] 416

length(unique(COV$ING_LEMMA))

[1] 416
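The same behavior can be reproduced in miniature. The sketch below uses a made-up three-element factor f to show that overwriting the only instance of a level leaves that level registered until droplevels removes it:

```r
f <- factor(c("take", "tak", "think"))
f[2] <- "take"     # overwrite the only instance of "tak"
length(levels(f))  # 3: the level "tak" is still registered
length(unique(f))  # 2: but no element has that value anymore
table(f)           # "tak" shows up with a frequency of 0
f <- droplevels(f) # discard unused levels
length(levels(f))  # 2
```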

is the most frequent verb lemma?

summary(COV) # but the following is much better

      BNC         VERB_LEMMA        ING_FORM      ING_LEMMA       ING_TAG
 HH3    :  11   force  : 101   thinking : 146   think  : 147   VVG    :1239
 K5D    :  11   trick  :  92   believing: 104   believe: 104   NN1-VVG: 158
 CBG    :  10   fool   :  77   making   :  62   make   :  62   AJ0-VVG: 108
 EUU    :  10   talk   :  62   giving   :  54   give   :  54   VDG    :  49
 HGM    :  10   mislead:  57   accepting:  51   accept :  51   VBG    :  23
 HXE    :  10   coerce :  52   doing    :  49   do     :  49   VHG    :  15
 (Other):1538   (Other):1159   (Other)  :1134   (Other):1133   (Other):   8

tail(                         # show the tail of the
   sort(                      # sorted
      table(COV$VERB_LEMMA)), # frequency table of VERB_LEMMA
   1)                         # namely the last item


force
  101

# but this is the best because it returns exactly what was asked for:
names(                              # show the names of
   table(COV$VERB_LEMMA))[          # the frequency table of VERB_LEMMA, but only those
      which(                        # where
         table(COV$VERB_LEMMA) ==   # the frequency table of VERB_LEMMA is
         max(table(COV$VERB_LEMMA)) # the max of that table
      ) # end of which()
   ] # end of subset

[1] "force"

This is a good example to introduce the pipe (%>%) from the package magrittr that we loaded at the top. Here is a simpler example that uses the tail(1) approach:

COV$VERB_LEMMA %>% table %>% sort %>% tail(1) %>% names

[1] "force"

… and here’s the one that would be able to handle situations where more than one verb lemma has the same highest frequency; check this out:

COV$VERB_LEMMA %>% table %>% sort %>% "["(.==max(.)) %>% names

[1] "force"

Here’s the proof that it indeed works in such situations:

qwe <- c("a", "a", "b", "b", "c")
qwe %>% table %>% sort %>% "["(.==max(.)) %>% names

[1] "a" "b"
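For comparison, here is a sketch of the same logic without the pipe: storing the table once avoids computing it three times, as the nested version further above does. The name tab is just chosen for illustration; which.max is shown only to point out that it reports a single winner even when there is a tie:

```r
qwe <- c("a", "a", "b", "b", "c")
tab <- table(qwe)         # frequency table, computed only once
names(tab)[tab==max(tab)] # all types sharing the highest frequency
names(which.max(tab))     # which.max() only reports the first of them
```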

is the most frequent -ing lemma with this verb lemma?

with(COV, # avoid having to write COV$... all the time
     # the approach using tail
     ING_LEMMA[VERB_LEMMA=="force"] %>%
     table %>% sort %>% tail(1) %>% names)

[1] "make"

with(COV, # avoid having to write COV$... all the time
     # the approach using what's-the-max
     ING_LEMMA[VERB_LEMMA=="force"] %>%
     table %>% sort %>% "["(.==max(.)) %>% names)

[1] "make"

12 Exercise 12

Changing and saving COV:

Delete the column with the corpus files; the new data frame is to be called COV.2.

COV.2 <- COV[,2:5]

Delete the rows with the four rarest tags; the new data frame is to be called COV.3.

# step 1: determine the four rarest tags
COV.2$ING_TAG %>% table %>% sort %>% head # why do I not say head(4)?

.
AJ0-NN1     CJS     UNC     NN1     VHG     VBG
      1       1       2       4      15      23

# step 2: determine the vector of deletees
deletees <- which(COV.2$ING_TAG # the deletees are where the value for ING_TAG
   %in% # is a member of this set:
   c("AJ0-NN1", "CJS", "UNC", "NN1"))
# much better than this:
# deletees <- which(            # the deletees are where
#    COV.2$ING_TAG=="AJ0-NN1" | # COV.2$ING_TAG is "AJ0-NN1" or where
#    COV.2$ING_TAG=="CJS"     | # COV.2$ING_TAG is "CJS" or where
#    COV.2$ING_TAG=="UNC"     | # COV.2$ING_TAG is "UNC" or where
#    COV.2$ING_TAG=="NN1")      # COV.2$ING_TAG is "NN1"
# step 3: delete
COV.3 <- COV.2[-deletees,]
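A word of caution on subsetting with a negated which: this works here because there are matches, but if nothing matched, deletees would be empty and the negative index would return zero rows instead of all of them. The sketch below shows this with a made-up data frame toy and a logical-index alternative that does not have this problem:

```r
toy <- data.frame(ING_TAG=c("VVG", "VBG", "VVG"), stringsAsFactors=TRUE)
deletees <- which(toy$ING_TAG %in% "UNC") # no row has this tag
length(deletees)                   # 0
nrow(toy[-deletees, , drop=FALSE]) # 0 rows: an empty index keeps nothing
# logical indexing keeps all rows when nothing matches
nrow(toy[!(toy$ING_TAG %in% "UNC"), , drop=FALSE]) # 3 rows
```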

From COV.3, create a new data frame COV.4 which is sorted according to, first, the column VERB_LEMMA (ascending) and, second, the ING_LEMMA (descending).

oi <- order(               # make oi (for order.index) the order
   COV.3$VERB_LEMMA,       # to sort by COV.3$VERB_LEMMA, then break ties by
   -rank(COV.3$ING_LEMMA)) # (reversed) COV.3$ING_LEMMA
COV.4 <- COV.3[oi,]        # make COV.4 the re-sorted data frame
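The reason for -rank(...) is that a minus sign cannot be applied to a factor directly, whereas rank returns numbers whose sign can be flipped. Here is how this plays out in a made-up four-row data frame toy (the column names VERB and ING are assumptions for illustration):

```r
toy <- data.frame(
   VERB=c("talk", "force", "force", "talk"),
   ING =c("do",   "make",  "be",    "think"),
   stringsAsFactors=TRUE)
oi <- order(      # the order
   toy$VERB,      # to sort by VERB (ascending), then break ties by
   -rank(toy$ING)) # ING (descending)
toy[oi,] # force/make, force/be, talk/think, talk/do
```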

Save the changed data frame into a text file _output/dataframe5.csv; use tab stops as separators, newlines as line breaks, and make sure there are no row numbers and no quotes.

# d. Save
write.table(                      # write the data frame
   COV.4,                         # COV.4
   sep="\t", eol="\n",            # with tabs between columns & line breaks
   row.names=FALSE, quote=FALSE,  # no row names & quotes (for factors)
   file="_output/dataframe5.csv") # into this file

13 Session info

sessionInfo()

R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base

other attached packages:
[1] STGmisc_1.0    Rcpp_1.0.14    magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] digest_0.6.37     fastmap_1.2.0     xfun_0.50         knitr_1.49
 [5] htmltools_0.5.8.1 rmarkdown_2.29    cli_3.6.3         rstudioapi_0.17.1
 [9] tools_4.4.2       evaluate_1.0.3    yaml_2.3.10       rlang_1.1.4
[13] jsonlite_1.8.9    htmlwidgets_1.6.4 MASS_7.3-64