Central question: How many X does the phrase some X next to a Y refer to? Your predictors are
OBJECT: the sizes of the objects X: large vs. small;
REFPOINT: the sizes of the reference points Y: large vs. small.
Analyze the data properly with a regression model and summarize the results (briefly). [Difficulty level: 1]
rm(list=ls(all.names=TRUE))
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_11a_some.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## CASE OBJECT REFPOINT ESTIMATE
## Min. : 1.00 large:8 large:8 Min. : 2.0
## 1st Qu.: 4.75 small:8 small:8 1st Qu.:38.5
## Median : 8.50 Median :44.0
## Mean : 8.50 Mean :51.5
## 3rd Qu.:12.25 3rd Qu.:73.0
## Max. :16.00 Max. :91.0
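One possible starting point is a linear model with both size predictors and their interaction. The sketch below runs on simulated stand-in data (the ESTIMATE values are random and are not the values in the exercise file); the names sim, m_full, and nd are illustrative assumptions.

```r
# Simulated stand-in for the exercise data: a balanced 2x2 design with 16 cases.
# The ESTIMATE values are random placeholders, NOT the real data.
set.seed(1)
sim <- expand.grid(OBJECT=c("large", "small"),
                   REFPOINT=c("large", "small"),
                   REP=1:4)                      # 4 cases per design cell
sim$ESTIMATE <- round(runif(nrow(sim), min=2, max=91))

# Linear model with both main effects and their interaction
m_full <- lm(ESTIMATE ~ OBJECT * REFPOINT, data=sim)

# drop1 respects marginality, so it tests only the interaction here
drop1(m_full, test="F")

# Predicted estimates for all four combinations of predictor levels
nd <- expand.grid(OBJECT=levels(sim$OBJECT), REFPOINT=levels(sim$REFPOINT))
nd$PRED <- predict(m_full, newdata=nd)
nd
```

With the real data, the same calls would be applied to d instead of sim, followed by the usual residual diagnostics.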
Central question: What determines the number of praises in child-caretaker interaction? The data come from recordings of different children and contain the following variables:
PRAISES: the response variable, the number of times the children are praised by their caretakers;
CHILD: the name of each child;
SEX: the sex of each child;
CAN: the number of verb phrases where the caretakers use can when speaking about actions of the child;
WANT: the number of verb phrases where the caretakers use want when speaking about actions of the child;
SHOULD_SHALL: the number of verb phrases where the caretakers use should/shall when speaking about actions of the child;
DIRECTIVE: the number of verb phrases where the caretakers use a directive when speaking about actions of the child;
SUCCESS: the number of times the child does something as intended;
FAILURE: the number of times the child does something not as intended.
You now want to determine to what degree the number of praises is a function of SEX.
Analyze the data properly and summarize the results (briefly). [Difficulty level: 3]
rm(list=ls(all.names=TRUE))
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_11b_praises.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## CHILD SEX PRAISES CAN WANT
## aRetha : 1 f:15 Min. : 0.0 Min. : 0.000 Min. : 0.00
## aRnold : 1 m:13 1st Qu.: 2.0 1st Qu.: 1.000 1st Qu.: 0.75
## baRbara: 1 Median : 5.0 Median : 4.000 Median : 2.00
## beRnard: 1 Mean : 5.5 Mean : 4.321 Mean : 3.25
## chRis : 1 3rd Qu.: 7.5 3rd Qu.: 5.250 3rd Qu.: 6.00
## chRissy: 1 Max. :13.0 Max. :18.000 Max. :10.00
## (Other):22
## SHOULD_SHALL DIRECTIVE SUCCESS FAILURE
## Min. :0.0000 Min. : 0.00 Min. : 0.000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.: 4.000 1st Qu.:1.000
## Median :0.0000 Median :12.00 Median : 6.500 Median :3.000
## Mean :0.8929 Mean :15.61 Mean : 7.679 Mean :3.286
## 3rd Qu.:1.2500 3rd Qu.:19.50 3rd Qu.:10.000 3rd Qu.:5.250
## Max. :6.0000 Max. :46.00 Max. :18.000 Max. :8.000
##
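Because PRAISES is a count, one candidate model is a Poisson regression. The sketch below uses simulated counts (not the real data) with the same group sizes as in the summary above; the dispersion check is a common rule of thumb, and sim, m_null, m_sex, and disp are illustrative names.

```r
# Simulated stand-in: 15 girls and 13 boys with random praise counts (NOT the real data)
set.seed(2)
sim <- data.frame(SEX=factor(rep(c("f", "m"), times=c(15, 13))))
sim$PRAISES <- rpois(nrow(sim), lambda=ifelse(sim$SEX == "f", 6, 5))

# Poisson regression of the count response on SEX, compared against a null model
m_null <- glm(PRAISES ~ 1,   family=poisson, data=sim)
m_sex  <- glm(PRAISES ~ SEX, family=poisson, data=sim)
anova(m_null, m_sex, test="Chisq")    # likelihood ratio test for SEX

# Rough overdispersion check: values well above 1 suggest the Poisson
# assumption is questionable and another count model may be needed
disp <- sum(residuals(m_sex, type="pearson")^2) / df.residual(m_sex)
disp

# Coefficients on the original count scale (intercept = mean for f, slope = rate ratio m/f)
exp(coef(m_sex))
```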
Central question: is the choice of of- vs. s-genitives (the car of my father vs. my father’s car) dependent in some way on the animacy of the possessor (my father) and/or the possessed (the car)? Your predictors are
POSSESSOR: the animacy of the possessor: abstract vs. animate vs. concrete;
POSSESSED: the animacy of the possessed: abstract vs. animate vs. concrete.
Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]
rm(list=ls(all.names=TRUE))
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_11c_genitives.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## CASE POSSESSOR POSSESSED GENITIVE
## Min. : 1.00 abstract:139 abstract:206 of:150
## 1st Qu.: 75.75 animate :118 animate : 20 s :150
## Median :150.50 concrete: 43 concrete: 74
## Mean :150.50
## 3rd Qu.:225.25
## Max. :300.00
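With a binary response and two categorical predictors, a binary logistic regression is one option, and it is worth checking the cell frequencies first because some predictor levels are rare (e.g., only 20 animate possesseds). The sketch below uses a balanced simulated data set (not the real data); sim, cells, m_main, m_int, and lrt are illustrative names.

```r
# Simulated stand-in: every POSSESSOR x POSSESSED cell has 20 cases (NOT the real data)
set.seed(3)
animacy <- c("abstract", "animate", "concrete")
sim <- expand.grid(POSSESSOR=animacy, POSSESSED=animacy, REP=1:20)
sim$GENITIVE <- factor(sample(c("of", "s"), nrow(sim), replace=TRUE))

# Check how many cases each predictor combination has before fitting an interaction
cells <- table(sim$POSSESSOR, sim$POSSESSED)
cells

# glm models the probability of the second factor level of the response, here "s"
m_main <- glm(GENITIVE ~ POSSESSOR + POSSESSED, family=binomial, data=sim)
m_int  <- glm(GENITIVE ~ POSSESSOR * POSSESSED, family=binomial, data=sim)

# Likelihood ratio test: does the interaction improve on the main-effects model?
lrt <- anova(m_main, m_int, test="Chisq")
lrt
```

In the real data the cross-tabulation will be far from balanced, which may limit how reliably the interaction can be estimated.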
Central question: is the choice of try to- vs. try and-constructions (I’m gonna try to fix this problem vs. I’m gonna try and fix this problem, which is in the column TRY) dependent in some way on the following 3 predictors and all their interactions:
MODE: whether the data represent spoken (spk) or written (wrt) English;
VARIETY: whether the data represent American (amer) or British English (brit);
CLAUSE: does the clause in which try is used with to or and already involve another to (as in we’re going -> to <- try and beat this thing) or not (other)?
(Source: Hommerberg, Charlotte & Gunnel Tottie. 2007. Try to or Try and? Verb complementation in British and American English. ICAME Journal 31. 45-64.)
Analyze the data as we discussed and summarize the results (briefly). [Difficulty level: 1]
rm(list=ls(all.names=TRUE))
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_11d_try-corp.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## CASE TRY VARIETY MODE CLAUSE
## Min. : 1 and:1631 amer:1187 spk:2257 other:1662
## 1st Qu.: 808 to :1598 brit:2042 wrt: 972 to :1567
## Median :1615
## Mean :1615
## 3rd Qu.:2422
## Max. :3229
Central question: is the choice of I vs. you, which is represented in the column MATCH, dependent in some way on the following 3 predictors and all their pairwise interactions:
SEX: whether the speaker is female or male;
SENTENCE: where in the file I or you was used on a scale from 0 (first sentence) to 1 (last sentence);
DISTANCE: where in the sentence I or you was used on a scale from 0 (first character) to ≈1 (last character).
The following loads the data and prepares the variable DISTANCE:
rm(list=ls(all.names=TRUE))
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_11e_IvsYou.csv", # this file
stringsAsFactors=FALSE)) # don't change categorical variables into factors (!)
## CASE FILE SPEAKER SEX
## Min. : 1 Length:21102 Length:21102 Length:21102
## 1st Qu.: 5276 Class :character Class :character Class :character
## Median :10552 Mode :character Mode :character Mode :character
## Mean :10552
## 3rd Qu.:15827
## Max. :21102
## SENTENCE PRECEDING MATCH SUBSEQUENT
## Min. :0.0000 Length:21102 Length:21102 Length:21102
## 1st Qu.:0.2394 Class :character Class :character Class :character
## Median :0.5147 Mode :character Mode :character Mode :character
## Mean :0.5014
## 3rd Qu.:0.7573
## Max. :1.0000
d$SENTLENGTH <- nchar(d$PRECEDING) +
nchar(d$MATCH) +
nchar(d$SUBSEQUENT)
d$DISTANCE <- nchar(d$PRECEDING)/d$SENTLENGTH
d <- d[,c(1:3,7,4:5,9:10)]; d[,2:5] <- lapply(d[,2:5], as.factor)
summary(d)
## CASE FILE SPEAKER MATCH SEX
## Min. : 1 KRL :4610 PS5VN : 1248 i : 2 : 1043
## 1st Qu.: 5276 KRH :3590 PS62L : 852 I :11637 f: 6676
## Median :10552 KRT :3093 PS63K : 785 you: 8619 m:12480
## Mean :10552 KRP :1997 PS5T8 : 655 You: 844 u: 903
## 3rd Qu.:15827 KR0 :1445 PS5VL : 647
## Max. :21102 KRG :1385 PS59B : 632
## (Other):4982 (Other):16283
## SENTENCE SENTLENGTH DISTANCE
## Min. :0.0000 Min. : 1.0 Min. :0.0000
## 1st Qu.:0.2394 1st Qu.: 65.0 1st Qu.:0.0351
## Median :0.5147 Median : 141.0 Median :0.2453
## Mean :0.5014 Mean : 181.3 Mean :0.3197
## 3rd Qu.:0.7573 3rd Qu.: 250.0 3rd Qu.:0.5600
## Max. :1.0000 Max. :1353.0 Max. :0.9978
##
Analyze the data properly and summarize the results (briefly). [Difficulty level: 4]
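The summary above shows that MATCH has four spellings (i, I, you, You) and that SEX has an empty level and a level u in addition to f and m, so some cleaning is probably needed before modeling. The sketch below shows one way to do that on a small toy data frame; whether cases with unknown speaker sex should be excluded is an analytical decision and is only assumed here, and toy and clean are illustrative names.

```r
# Toy data mimicking the problematic levels seen in summary(d) (NOT the real data)
toy <- data.frame(
  MATCH=c("i", "I", "you", "You", "I", "you"),
  SEX  =c("f", "m", "",    "u",   "m", "f"),
  stringsAsFactors=FALSE)

# Collapse the capitalization variants into two response levels: "i" and "you"
toy$MATCH <- factor(tolower(toy$MATCH))

# Assumption: treat everything other than f/m as missing speaker sex
toy$SEX[!toy$SEX %in% c("f", "m")] <- NA
toy$SEX <- factor(toy$SEX)

# Keep only complete cases and drop factor levels that are no longer used
clean <- droplevels(toy[complete.cases(toy), ])
summary(clean)
```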
Central question: Do n-grams returned early by an algorithm (BINRANK: early) get rated better (ordinal response: RATING) than n-grams returned late by that algorithm (BINRANK: late) if one controls for the length of the n-gram (SIZE)? The data frame contains the following variables:
RATING: the response variable, integers from 1 to 7;
SIZE: the number of parts of each n-gram;
BINRANK: the main predictor as per the above.
Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]
rm(list=ls(all.names=TRUE))
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_11f_MERGE.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## CASE RATING SIZE BINRANK
## Min. : 1.0 Min. :1.000 Min. :2.00 early:800
## 1st Qu.: 400.8 1st Qu.:1.000 1st Qu.:2.75 late :800
## Median : 800.5 Median :3.000 Median :3.50
## Mean : 800.5 Mean :3.758 Mean :3.50
## 3rd Qu.:1200.2 3rd Qu.:7.000 3rd Qu.:4.25
## Max. :1600.0 Max. :7.000 Max. :5.00
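Since RATING is ordinal, one candidate is an ordinal (proportional-odds) regression in which SIZE is included as a control before BINRANK is tested. The sketch below uses polr from the MASS package (a recommended package normally bundled with R) on simulated ratings, not the real data; sim, latent, m_size, and m_ord are illustrative names.

```r
library(MASS)  # provides polr() for proportional-odds models

# Simulated stand-in (NOT the real data): ratings derived from a latent score
set.seed(4)
n <- 400
sim <- data.frame(
  SIZE   =sample(2:5, n, replace=TRUE),
  BINRANK=factor(sample(c("early", "late"), n, replace=TRUE)))
latent <- 0.8 * (sim$BINRANK == "early") - 0.3 * sim$SIZE + rlogis(n)

# Cut the latent score into 7 ordered rating categories
sim$RATING <- cut(latent,
  breaks=quantile(latent, probs=seq(0, 1, length.out=8)),
  labels=1:7, include.lowest=TRUE, ordered_result=TRUE)

# Control-only model vs. model that adds the predictor of interest
m_size <- polr(RATING ~ SIZE,           data=sim, Hess=TRUE)
m_ord  <- polr(RATING ~ SIZE + BINRANK, data=sim, Hess=TRUE)
anova(m_size, m_ord)   # likelihood ratio test for BINRANK given SIZE
summary(m_ord)         # coefficients and the 6 thresholds between rating levels
```

With the real data, RATING would first need to be converted to an ordered factor, e.g. with factor(d$RATING, ordered=TRUE).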
Central question: Are results on subordinate clause ordering from the studies of Hampe and Diessel comparable/compatible? Here are the data:
CASE: the usual numbering column;
STUDY: a column indicating to which study a data point in a row belongs: diessel vs. hampe;
ORDER: the response variable in each study, the order of main and subordinate clause (and you know this response from another study in the book);
CONJ: the predictor in each study, the subordinate conjunction used in the subordinate clause.
rm(list=ls(all.names=TRUE))
d <- data.frame(
STUDY=rep(c("diessel", "hampe"), 8),
ORDER=rep(c("sc-mc", "mc-sc"), each=8),
CONJ =rep(rep(c("after", "before", "once", "until"), each=2), 2),
FREQ =c(27, 82, 6, 105, 77, 236, 5, 41, 70, 200, 81, 425, 21, 74, 94, 346))
d <- data.frame(lapply(d[, -4], \(af) { rep(af, d$FREQ) }))
d <- data.frame(lapply(d, as.factor))
summary(d <- cbind(CASE=seq(nrow(d)), d))
## CASE STUDY ORDER CONJ
## Min. : 1.0 diessel: 381 mc-sc:1311 after :379
## 1st Qu.: 473.2 hampe :1509 sc-mc: 579 before:617
## Median : 945.5 once :408
## Mean : 945.5 until :486
## 3rd Qu.:1417.8
## Max. :1890.0
Are Hampe’s and Diessel’s findings ‘the same’? Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]
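One way to frame the comparison: if the effect of CONJ on ORDER differs between the two studies, that should show up as a CONJ:STUDY interaction in a binary logistic regression. The sketch below rebuilds the frequency table given above under the hypothetical name hd (so that it does not overwrite d) and tests that interaction; treating this as the right operationalization of ‘the same’ is an assumption.

```r
# Frequencies exactly as given in the exercise, stored under a separate name
hd <- data.frame(
  STUDY=rep(c("diessel", "hampe"), 8),
  ORDER=rep(c("sc-mc", "mc-sc"), each=8),
  CONJ =rep(rep(c("after", "before", "once", "until"), each=2), 2),
  FREQ =c(27, 82, 6, 105, 77, 236, 5, 41, 70, 200, 81, 425, 21, 74, 94, 346),
  stringsAsFactors=TRUE)

# Expand to one row per observation (1890 rows in total)
hd_long <- hd[rep(seq_len(nrow(hd)), hd$FREQ), c("STUDY", "ORDER", "CONJ")]

# glm models the probability of the second level of ORDER, here "sc-mc"
m_int <- glm(ORDER ~ CONJ * STUDY, family=binomial, data=hd_long)

# Likelihood ratio test for dropping the CONJ:STUDY interaction
lrt <- drop1(m_int, test="Chisq")
lrt
```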
Central question: What determines how speakers rate the acceptability (the 7-level response variable RATING) of to- vs. -ing complementation (as in I like to swim vs. I like swimming) in an experiment?
CX_NOW: whether the current experimental stimulus is a to or an -ing construction;
VNOW_PREF: whether the verb in the current experimental stimulus generally prefers to appear in a to or an -ing construction;
CX_PRV: whether the previous experimental stimulus was a to or an -ing construction.
Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]
rm(list=ls(all.names=TRUE))
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_11h_try-exp.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## CASE RATING CX_NOW VNOW_PREF CX_PRV
## Min. : 1.0 Min. :-3.0000 ing:270 ing:280 ing:278
## 1st Qu.:139.8 1st Qu.:-1.0000 to :286 to :276 to :278
## Median :278.5 Median : 0.0000
## Mean :278.5 Mean : 0.3705
## 3rd Qu.:417.2 3rd Qu.: 2.0000
## Max. :556.0 Max. : 3.0000
Central question: Do children and their caretakers exhibit different correlations (measured in Cramer’s V values) between tense (past vs. non-past) and aspect (perfective vs. imperfective) such that
You have data from a corpus study and these are the variables in the data frame:
AGE: the age of the child at recording time: YEAR;MONTH.DAY;
KID: the Cramer’s V value for the child’s tense-aspect correlation in this recording;
CARETAKER: the Cramer’s V value for the caretaker’s tense-aspect correlation in this recording.
Note: Whatever graphs involving time you use, the axis representing the age of the child must of course be proportional to the age, not just to the position of an age in the vector of ages. I don’t care how you do that; if you do it in spreadsheet software, that’s fine, too.
Analyze the data properly and summarize the results (briefly). [Difficulty level: 3]
rm(list=ls(all.names=TRUE))
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_11i_russaspect.csv", # this file
stringsAsFactors=FALSE)) # don't change categorical variables into factors (!)
## CASE AGE KID CARETAKER
## Min. : 1.00 Length:80 Min. :0.01645 Min. :0.1627
## 1st Qu.:20.75 Class :character 1st Qu.:0.31861 1st Qu.:0.3004
## Median :40.50 Mode :character Median :0.44217 Median :0.3554
## Mean :40.50 Mean :0.45170 Mean :0.3640
## 3rd Qu.:60.25 3rd Qu.:0.57247 3rd Qu.:0.4355
## Max. :80.00 Max. :1.00000 Max. :0.5586
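To make a time axis proportional to age, the AGE strings need to be converted to numbers. The sketch below shows one way to do that in R, assuming the strings look like "1;6.10" (year;month.day) and approximating a month as 30 days; the function name age_to_months and the example strings are hypothetical, not taken from the data file.

```r
# Convert ages of the form "YEAR;MONTH.DAY" into (approximate) age in months.
# Assumption: a month is treated as 30 days, which is only an approximation.
age_to_months <- function(age) {
  parts <- strsplit(age, "[;.]")           # split on ";" and "."
  vapply(parts, function(p) {
    p <- as.numeric(p)                     # year, month, day
    p[1] * 12 + p[2] + p[3] / 30
  }, numeric(1))
}

# Hypothetical example ages (NOT taken from the data file)
ages <- c("1;6.10", "2;0.0", "2;11.15")
age_to_months(ages)
```

The resulting numeric vector can then serve as the x-axis of a plot, e.g. plot(age_to_months(d$AGE), d$KID), so that distances between recordings reflect the actual time elapsed.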
Central question: what factors co-determine how English changed from a 3rd-person singular -th (e.g., He giveth) to the current 3rd-person singular -s (e.g., He gives)? You have data from a corpus study on how the third person singular form in English changed across five time periods (from P1 at about 1480 to P5 at about 1700). This data set contains annotation for third person singular verbs (extracted from letters) with regard to the following variables:
VARIANT: the response variable: the third person singular form as found in the corpus file: es vs. eth;
TIME5: the time period: P1 vs. P2 vs. P3 vs. P4 vs. P5;
SENGEND: the sex of the sender of the letter: female vs. male;
RECGEND: the sex of the recipient of the letter: female vs. male;
CLOSEFAM: whether the recipient of the letter is a close family member of the sender or not: no vs. yes;
FINSYB: whether the verb stem ends in a sibilant: no (e.g., see) vs. yes (e.g., seize);
FOLFRIC: what the word following the third person singular form begins with: s (e.g., he sees seagulls) vs. th (e.g., he sees the seagulls) vs. other (e.g., he sees many seagulls);
GRAM: whether the verb in question is used as a grammatical or a lexical verb: yes (grammatical, i.e. be, do and aux. have) vs. no (lexical/other).
rm(list=ls(all.names=TRUE))
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_11j_3rdperson.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## CASE VARIANT TIME5 SENGEND RECGEND CLOSEFAM
## Min. : 1 es :1524 P1: 505 female: 784 female: 734 no :1917
## 1st Qu.:1036 eth:2619 P2: 99 male :3359 male :3409 yes:2226
## Median :2072 P3:1508
## Mean :2072 P4:1096
## 3rd Qu.:3108 P5: 935
## Max. :4143
## FINSYB FOLFRIC GRAM
## no :3953 other:3666 no :2867
## yes: 190 s : 189 yes:1276
## th : 288
##
##
##
You want to characterize how the predictors and their pairwise
interactions with TIME are correlated with the change from
-(e)th to -(e)s. Analyze the data properly and
summarize the results (briefly). Note: you must conflate the 3 early
time periods into one, but once you’re done with everything, you should
figure out why. [Difficulty level: 4]
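Conflating factor levels, as the note requires for the three early periods, can be done by assigning a named list to levels(). The sketch below demonstrates this on a small toy factor (not the real data); the new variable name TIME and the level label P123 are illustrative assumptions.

```r
# Toy factor with the five period labels (NOT the real data)
TIME5 <- factor(c("P1", "P2", "P3", "P4", "P5", "P3", "P1"))

# Copy the factor, then merge P1-P3 into one level by assigning a named list
TIME <- TIME5
levels(TIME) <- list(P123=c("P1", "P2", "P3"), P4="P4", P5="P5")

# Cross-tabulate old against new levels to verify the recoding
table(TIME5, TIME)
```

On the real data frame, the same assignment would be applied to a copy of d$TIME5.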