Ling 201, session 02: R basics (key)
1 Exercise 01
Generate a data frame abc that contains the letters from “a” to “j” in the first column and the integers from 10 to 1 in the second column. Make sure the first column is called “LETTER” and the second “NUMBER” and that columns with categorical variables are factors.
Here is the most stepwise way to do this: We first set up the vectors …
… and then put them in the data frame:
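The code for these two steps is not shown here. A minimal sketch, assuming the built-in constant letters supplies the first column (the key itself may build the vectors differently):

```r
# Step 1: set up the two vectors
LETTER <- factor(letters[1:10])  # "a" to "j", stored as a factor (categorical variable)
NUMBER <- 10:1                   # the integers from 10 down to 1

# Step 2: put them into the data frame abc;
# the columns take their names from the vectors
abc <- data.frame(LETTER, NUMBER)
str(abc)
```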
Here’s how to do this in one go, without creating LETTER and NUMBER separately first:
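The one-step code is not shown here either. One way it could look, assuming the same stringsAsFactors=TRUE convention that the later exercises use:

```r
abc <- data.frame(          # create the data frame abc
   LETTER=letters[1:10],    # first column: "a" to "j"
   NUMBER=10:1,             # second column: 10 down to 1
   stringsAsFactors=TRUE)   # make categorical variables factors
str(abc)
```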
2 Exercise 02
Load the text file _input/dataframe1.csv into a data frame example such that the first row is recognized as containing the column names and columns with categorical variables are factors.
str(example <- read.delim( # read in the data frame example & show its structure
"_input/dataframe1.csv", # from this file and
stringsAsFactors=TRUE)) # make categorical variables factors (!)
'data.frame': 12 obs. of 4 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ GRMRELATION : Factor w/ 2 levels "obj","subj": 1 1 1 1 1 1 2 2 2 2 ...
$ LENGTH : int 2 2 10 6 7 4 3 9 9 9 ...
$ DEFINITENESS: Factor w/ 2 levels "def","indef": 1 1 1 2 2 2 1 1 1 2 ...
3 Exercise 03
Extract from this data frame example
- the second and third column;
[1] obj obj obj obj obj obj subj subj subj subj subj subj
Levels: obj subj
[1] 2 2 10 6 7 4 3 9 9 9 7 9
GRMRELATION LENGTH
1 obj 2
2 obj 2
3 obj 10
4 obj 6
5 obj 7
6 obj 4
7 subj 3
8 subj 9
9 subj 9
10 subj 9
11 subj 7
12 subj 9
GRMRELATION LENGTH
1 obj 2
2 obj 2
3 obj 10
4 obj 6
5 obj 7
6 obj 4
7 subj 3
8 subj 9
9 subj 9
10 subj 9
11 subj 7
12 subj 9
- the third and fourth row.
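The commands behind the output above are not shown. A minimal sketch of how both extractions could be done; the data frame below is a stand-in rebuilt from the output printed in this key, so that the code runs without the input file:

```r
# stand-in for the imported data frame, rebuilt from the printed output
example <- data.frame(
   CASE=1:12,
   GRMRELATION=rep(c("obj", "subj"), each=6),
   LENGTH=c(2L, 2L, 10L, 6L, 7L, 4L, 3L, 9L, 9L, 9L, 7L, 9L),
   DEFINITENESS=rep(rep(c("def", "indef"), each=3), 2),
   stringsAsFactors=TRUE)

example[, 2]                           # the second column as a vector (a factor)
example[, 3]                           # the third column as a vector
example[, 2:3]                         # both columns, selected by position
example[, c("GRMRELATION", "LENGTH")]  # both columns, selected by name
example[3:4, ]                         # the third and fourth row
```

Selecting by name is more robust than selecting by position if the column order ever changes.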
4 Exercise 04
Split the data frame example up according to the content of the second column (enter ?split at the R prompt for help) and call the result example.split.
example.split <- split( # make example.split the result of splitting up
example, # the data frame example
example$GRMRELATION) # depending on the values of the column GRMRELATION
# separate manual alternatives:
subset( # show a subset
example, # of the data frame example, namely
example$GRMRELATION=="obj") # when GRMRELATION is "obj"
CASE GRMRELATION LENGTH DEFINITENESS
1 1 obj 2 def
2 2 obj 2 def
3 3 obj 10 def
4 4 obj 6 indef
5 5 obj 7 indef
6 6 obj 4 indef
subset( # show a subset
example, # of the data frame example, namely
example$GRMRELATION=="subj") # when GRMRELATION is "subj"
CASE GRMRELATION LENGTH DEFINITENESS
7 7 subj 3 def
8 8 subj 9 def
9 9 subj 9 def
10 10 subj 9 indef
11 11 subj 7 indef
12 12 subj 9 indef
5 Exercise 05
Change the value at the intersection of the third row and the fourth column into “indef” and save the changed data frame into _output/dataframe2.csv so that you can easily load and edit it in spreadsheet software.
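No code is shown for this exercise. A possible sketch, under two assumptions: the file should be tab-separated (the key reads its input files with read.delim), and, so that the example runs anywhere, it writes to a temporary folder rather than to _output/. The data frame is a stand-in rebuilt from the output printed in this key.

```r
# stand-in for the imported data frame, rebuilt from the printed output
example <- data.frame(
   CASE=1:12,
   GRMRELATION=rep(c("obj", "subj"), each=6),
   LENGTH=c(2L, 2L, 10L, 6L, 7L, 4L, 3L, 9L, 9L, 9L, 7L, 9L),
   DEFINITENESS=rep(rep(c("def", "indef"), each=3), 2),
   stringsAsFactors=TRUE)

# "indef" is already a level of the factor DEFINITENESS, so this assignment works
example[3, 4] <- "indef"

# assumption: temporary location; the exercise asks for "_output/dataframe2.csv"
out_file <- file.path(tempdir(), "dataframe2.csv")
write.table(example, file=out_file,
   sep="\t",         # tab-separated columns open cleanly in spreadsheet software
   quote=FALSE,      # no quotes around strings
   row.names=FALSE)  # no row numbers
```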
6 Exercise 06
Generate the following data frame and call it EPP (for English personal pronouns):
PRONOUN | PERSON | NUMBER |
---|---|---|
I | 1 | sg |
you | 2 | sg |
he | 3 | sg |
she | 3 | sg |
it | 3 | sg |
we | 1 | pl |
you | 2 | pl |
they | 3 | pl |
(EPP <- data.frame( # create (& then show) a data frame EPP
PRONOUN=c("I", "you", "he", "she", "it", "we", "you", "they"),
PERSON =c(1:3, 3, 3, 1:3),
NUMBER =rep(c("sg", "pl"), c(5, 3)),
stringsAsFactors=TRUE)) # make categorical variables factors (!)
PRONOUN PERSON NUMBER
1 I 1 sg
2 you 2 sg
3 he 3 sg
4 she 3 sg
5 it 3 sg
6 we 1 pl
7 you 2 pl
8 they 3 pl
7 Exercise 07
Extract from this data frame
- the value of the 4th row and the 2nd column;
- the values of the 3rd to 4th rows and the 1st to 2nd columns;
- the rows that have plural pronouns in them;
PRONOUN PERSON NUMBER
6 we 1 pl
7 you 2 pl
8 they 3 pl
PRONOUN PERSON NUMBER
6 we 1 pl
7 you 2 pl
8 they 3 pl
- the rows with 1st and 3rd person pronouns.
# note: this does NOT work, the output is incomplete:
# EPP[EPP$PERSON==c(1, 3),]
EPP[(EPP$PERSON==1 # of EPP, the rows when PERSON is 1
| # or
EPP$PERSON==3),] # when PERSON is 3
PRONOUN PERSON NUMBER
1 I 1 sg
3 he 3 sg
4 she 3 sg
5 it 3 sg
6 we 1 pl
8 they 3 pl
PRONOUN PERSON NUMBER
1 I 1 sg
3 he 3 sg
4 she 3 sg
5 it 3 sg
6 we 1 pl
8 they 3 pl
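Why the commented-out comparison with == is incomplete: R recycles the shorter vector c(1, 3) along PERSON, so each element is compared with only one of the two values. A small sketch using the PERSON values of EPP:

```r
PERSON <- c(1, 2, 3, 3, 3, 1, 2, 3)  # the PERSON column of EPP

# == compares position by position against 1, 3, 1, 3, 1, 3, 1, 3 (recycling)
by.equals <- PERSON == c(1, 3)
which(by.equals)                     # 1 4 8: rows 3, 5, and 6 are missed

# %in% asks for each element whether it is a member of the set 1, 3
by.in <- PERSON %in% c(1, 3)
which(by.in)                         # 1 3 4 5 6 8
```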
subset( # show a subset
EPP, # of the data frame EPP, namely
PERSON %in% c(1, 3)) # when PERSON is in the set 1, 3
PRONOUN PERSON NUMBER
1 I 1 sg
3 he 3 sg
4 she 3 sg
5 it 3 sg
6 we 1 pl
8 they 3 pl
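For the first three items of this exercise, the commands are not shown above. A minimal sketch of how they could be written, recreating EPP as in exercise 06:

```r
EPP <- data.frame(
   PRONOUN=c("I", "you", "he", "she", "it", "we", "you", "they"),
   PERSON =c(1:3, 3, 3, 1:3),
   NUMBER =rep(c("sg", "pl"), c(5, 3)),
   stringsAsFactors=TRUE)

EPP[4, 2]                   # the value in the 4th row and the 2nd column
EPP[3:4, 1:2]               # the 3rd to 4th rows and the 1st to 2nd columns
EPP[EPP$NUMBER=="pl", ]     # the rows with plural pronouns, via logical indexing
subset(EPP, NUMBER=="pl")   # the same rows, via subset
```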
8 Exercise 08
Generate a vector FREQS of the frequencies with which the personal pronouns in EPP occurred in a small corpus: I: 8426, you: 9462, he: 6394, she: 4234, it: 6040, we: 2305, you: 8078, they: 2998. Then, make this vector the fourth column of EPP.
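No code is shown for this exercise. One way to do it, recreating EPP as in exercise 06 and adding the new column by name:

```r
EPP <- data.frame(
   PRONOUN=c("I", "you", "he", "she", "it", "we", "you", "they"),
   PERSON =c(1:3, 3, 3, 1:3),
   NUMBER =rep(c("sg", "pl"), c(5, 3)),
   stringsAsFactors=TRUE)

# the frequencies, in the same order as the rows of EPP
FREQS <- c(8426, 9462, 6394, 4234, 6040, 2305, 8078, 2998)

EPP$FREQS <- FREQS   # add FREQS as a new (fourth) column
EPP
```

An alternative with the same result would be EPP <- cbind(EPP, FREQS).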
9 Exercise 09
Save the data frame into _output/dataframe3.csv so that you can easily load and edit it in spreadsheet software.
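No code is shown for this exercise. A sketch that saves the data frame and reads it back in as a check, assuming a tab-separated file and, so that the example runs anywhere, a temporary folder instead of _output/:

```r
EPP <- data.frame(
   PRONOUN=c("I", "you", "he", "she", "it", "we", "you", "they"),
   PERSON =c(1:3, 3, 3, 1:3),
   NUMBER =rep(c("sg", "pl"), c(5, 3)),
   FREQS  =c(8426, 9462, 6394, 4234, 6040, 2305, 8078, 2998),
   stringsAsFactors=TRUE)

# assumption: temporary location; the exercise asks for "_output/dataframe3.csv"
out_file <- file.path(tempdir(), "dataframe3.csv")
write.table(EPP, file=out_file, sep="\t", quote=FALSE, row.names=FALSE)

# round trip: read the file back in and inspect its structure
EPP.check <- read.delim(out_file, stringsAsFactors=TRUE)
str(EPP.check)
```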
10 Exercise 10
The file _input/dataframe4.csv contains data for the VERB into VERBing construction in the BNC (e.g., He [V1 forced] him into [V2 speaking] about it). For each instance of one such construction, the file contains
- a column called BNC: the file where the instance was found (A06 in the first case);
- a column called VERB_LEMMA: the lemma of the finite verb (force);
- a column called ING_FORM: the gerund (speaking);
- a column called ING_LEMMA: the lemma of the gerund (speak);
- a column called ING_TAG: the part-of-speech tag of the gerund (VVG in the first case).
Load this file into a data frame COV and display the first six rows of COV. Correct the typo in line 3 (use take).
summary(COV <- read.delim( # summarize the data structure COV imported
"_input/dataframe4.csv", # from this file
stringsAsFactors=TRUE)) # and make categorical variables factors (!)
BNC VERB_LEMMA ING_FORM ING_LEMMA ING_TAG
HH3 : 11 force : 101 thinking : 146 think : 147 VVG :1239
K5D : 11 trick : 92 believing: 104 believe: 104 NN1-VVG: 158
CBG : 10 fool : 77 making : 62 make : 62 AJ0-VVG: 108
EUU : 10 talk : 62 giving : 54 give : 54 VDG : 49
HGM : 10 mislead: 57 accepting: 51 accept : 51 VBG : 23
HXE : 10 coerce : 52 doing : 49 do : 49 VHG : 15
(Other):1538 (Other):1159 (Other) :1134 (Other):1133 (Other): 8
BNC VERB_LEMMA ING_FORM ING_LEMMA ING_TAG
1 A06 force speaking speak VVG
2 A08 nudge being be VBG
3 A0C talk taking tak VVG
4 A0F bully taking take VVG
5 A0H influence trying try VVG
6 A0H delude thinking think VVG
But something’s missing here – pay attention when you do the next exercise!
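The commands for displaying the rows and correcting the typo are not shown above. A minimal sketch, using a six-row stand-in built from the rows printed above instead of the full file:

```r
# stand-in for COV: the first six rows as printed above, typo included
COV <- data.frame(
   BNC=c("A06", "A08", "A0C", "A0F", "A0H", "A0H"),
   VERB_LEMMA=c("force", "nudge", "talk", "bully", "influence", "delude"),
   ING_FORM=c("speaking", "being", "taking", "taking", "trying", "thinking"),
   ING_LEMMA=c("speak", "be", "tak", "take", "try", "think"),
   ING_TAG=c("VVG", "VBG", "VVG", "VVG", "VVG", "VVG"),
   stringsAsFactors=TRUE)

head(COV)                    # display the first six rows

# correct the typo in row 3; this works directly on the factor
# because "take" already exists as a level of ING_LEMMA
COV$ING_LEMMA[3] <- "take"
COV[3, ]                     # check the corrected row
```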
11 Exercise 11
What
- is the quickest way of identifying the numbers of verb lemma types and -ing lemma types?
'data.frame': 1600 obs. of 5 variables:
$ BNC : Factor w/ 929 levels "A06","A08","A0C",..: 1 2 3 4 5 5 6 6 7 8 ...
$ VERB_LEMMA: Factor w/ 208 levels "activate","aggravate",..: 76 126 186 26 96 51 75 149 152 186 ...
$ ING_FORM : Factor w/ 422 levels "abandoning","abdicating",..: 354 49 382 382 395 387 387 133 209 175 ...
$ ING_LEMMA : Factor w/ 417 levels "abandon","abdicate",..: 349 41 378 378 390 383 383 378 207 173 ...
$ ING_TAG : Factor w/ 10 levels "AJ0-NN1","AJ0-VVG",..: 10 7 10 10 10 10 10 1 10 10 ...
[1] 208
[1] 208
[1] 417
[1] 416
Why are the results for ING_LEMMA conflicting?
[1] 416
[1] 416
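A likely explanation, sketched on a toy factor (an assumption, since the code behind the two results above is not shown): correcting a value in a factor does not remove the old level. Counting levels therefore still includes the now-unused level tak, whereas counting unique values does not. Dropping unused levels makes the two counts agree.

```r
ING_LEMMA <- factor(c("speak", "be", "tak", "take", "try", "think"))
ING_LEMMA[3] <- "take"                 # typo corrected, but the level "tak" remains

n.levels <- nlevels(ING_LEMMA)         # 6: counts all levels, including unused ones
n.unique <- length(unique(ING_LEMMA))  # 5: counts only the values that occur

ING_LEMMA <- droplevels(ING_LEMMA)     # remove unused levels
n.levels.after <- nlevels(ING_LEMMA)   # 5: now both counts agree
```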
- is the most frequent verb lemma?
BNC VERB_LEMMA ING_FORM ING_LEMMA ING_TAG
HH3 : 11 force : 101 thinking : 146 think : 147 VVG :1239
K5D : 11 trick : 92 believing: 104 believe: 104 NN1-VVG: 158
CBG : 10 fool : 77 making : 62 make : 62 AJ0-VVG: 108
EUU : 10 talk : 62 giving : 54 give : 54 VDG : 49
HGM : 10 mislead: 57 accepting: 51 accept : 51 VBG : 23
HXE : 10 coerce : 52 doing : 49 do : 49 VHG : 15
(Other):1538 (Other):1159 (Other) :1134 (Other):1133 (Other): 8
tail( # show the tail of the
sort( # sorted
table(COV$VERB_LEMMA)), # frequency table of VERB_LEMMA
1) # namely the last item
force
101
# but this is the best because it returns exactly what was asked for:
names( # show the names of
table(COV$VERB_LEMMA))[ # the frequency table of VERB_LEMMA, but only those
which( # where
table(COV$VERB_LEMMA) == # the frequency table of VERB_LEMMA is
max(table(COV$VERB_LEMMA)) # the max of that table
) # end of which()
] # end of subset
[1] "force"
This is a good example to introduce the pipe (%>%) from the package magrittr that we loaded at the top. Here is a simpler example that uses the tail(1) approach:
… and here’s the version that can handle ties, i.e. situations where more than one verb lemma has the same highest frequency:
Here’s the proof that it indeed works in such situations:
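The piped code for these three steps is not shown here. A minimal sketch of what it could look like, using a made-up toy vector (not the course data) in which two lemmas tie for the highest frequency:

```r
library(magrittr)   # provides the pipe %>%

# made-up toy vector: "force" and "trick" both occur twice
VERB_LEMMA <- factor(c("force", "trick", "force", "fool", "trick"))

# the tail(1) approach: always returns a single name, even when there is a tie
top.tail <- VERB_LEMMA %>% table %>% sort %>% tail(1) %>% names
top.tail

# the what's-the-max approach: returns every lemma that reaches the maximum
top.max <- VERB_LEMMA %>% table %>% "["(.==max(.)) %>% names
top.max             # "force" "trick"
```

The pipe passes the result on its left as the first argument of the function on its right, so the steps read in the order in which they are carried out rather than from the inside out.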
- is the most frequent -ing lemma with this verb lemma?
with(COV, # avoid having to write COV$... all the time
# the approach using tail
ING_LEMMA[VERB_LEMMA=="force"] %>%
table %>% sort %>% tail(1) %>% names)
[1] "make"
with(COV, # avoid having to write COV$... all the time
# the approach using what's-the-max
ING_LEMMA[VERB_LEMMA=="force"] %>%
table %>% sort %>% "["(.==max(.)) %>% names)
[1] "make"
12 Exercise 12
Changing and saving COV:
- Delete the column with the corpus files; the new data frame is to be called COV.2.
- Delete the rows with the four rarest tags; the new data frame is to be called COV.3.
# step 1: determine the four rarest tags
COV.2$ING_TAG %>% table %>% sort %>% head # why do I not say head(4)?
.
AJ0-NN1 CJS UNC NN1 VHG VBG
1 1 2 4 15 23
# step 2: determine the vector of deletees
deletees <- which(COV.2$ING_TAG # the deletees are where the values of ING_TAG
%in% # are members of this set:
c("AJ0-NN1", "CJS", "UNC", "NN1"))
# much better than this:
# deletees <- which( # the deletees are where
# COV.2$ING_TAG=="AJ0-NN1" | # COV.2$ING_TAG is "AJ0-NN1" or where
# COV.2$ING_TAG=="CJS" | # COV.2$ING_TAG is "CJS" or where
# COV.2$ING_TAG=="UNC" | # COV.2$ING_TAG is "UNC" or where
# COV.2$ING_TAG=="NN1") # COV.2$ING_TAG is "NN1"
# step 3: delete
COV.3 <- COV.2[-deletees,]
- From COV.3, create a new data frame COV.4 which is sorted according to, first, the column VERB_LEMMA (ascending) and, second, the column ING_LEMMA (descending).
- Save the changed data frame into a text file _output/dataframe5.csv; use tab stops as separators and newlines as line breaks, and make sure the file contains neither row numbers nor quotes.
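The code for deleting the column, sorting, and saving is not shown above. A minimal sketch of these steps on a made-up four-row toy data frame (not the course data); it writes to a temporary folder rather than to _output/ so that it runs anywhere:

```r
# made-up toy stand-in for COV
COV <- data.frame(
   BNC=c("A06", "A08", "A0C", "A0F"),
   VERB_LEMMA=c("talk", "force", "talk", "force"),
   ING_LEMMA=c("give", "make", "take", "think"),
   stringsAsFactors=TRUE)

COV.2 <- COV[, -1]   # delete the first column (the corpus files)
COV.3 <- COV.2       # toy data: there are no rare tags to delete here

# sort by VERB_LEMMA (ascending), then by ING_LEMMA (descending);
# as.integer() converts the factor to its level codes so that it can be negated
COV.4 <- COV.3[order(COV.3$VERB_LEMMA, -as.integer(COV.3$ING_LEMMA)), ]
COV.4

# assumption: temporary location; the exercise asks for "_output/dataframe5.csv"
out_file <- file.path(tempdir(), "dataframe5.csv")
write.table(COV.4, file=out_file,
   sep="\t", eol="\n",   # tab stops as separators, newlines as line breaks
   quote=FALSE,          # no quotes
   row.names=FALSE)      # no row numbers
```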
13 Session info
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets compiler methods
[8] base
other attached packages:
[1] STGmisc_1.0 Rcpp_1.0.14 magrittr_2.0.3
loaded via a namespace (and not attached):
[1] digest_0.6.37 fastmap_1.2.0 xfun_0.50 knitr_1.49
[5] htmltools_0.5.8.1 rmarkdown_2.29 cli_3.6.3 rstudioapi_0.17.1
[9] tools_4.4.2 evaluate_1.0.3 yaml_2.3.10 rlang_1.1.4
[13] jsonlite_1.8.9 htmlwidgets_1.6.4 MASS_7.3-64