# 1 Assignment 01

Central question: How many X does the phrase some X next to a Y refer to? Your predictors are

• OBJECT: the sizes of the objects X: large vs. small;
• REFPOINT: the sizes of the reference points Y: large vs. small.

Analyze the data properly with a regression model and summarize the results (briefly). [Difficulty level: 1]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(    # summarize d, the result of loading
file="inputfiles/202_11a_some.csv", # this file
stringsAsFactors=TRUE))  # change categorical variables into factors
##       CASE         OBJECT   REFPOINT    ESTIMATE
##  Min.   : 1.00   large:8   large:8   Min.   : 2.0
##  1st Qu.: 4.75   small:8   small:8   1st Qu.:38.5
##  Median : 8.50                       Median :44.0
##  Mean   : 8.50                       Mean   :51.5
##  3rd Qu.:12.25                       3rd Qu.:73.0
##  Max.   :16.00                       Max.   :91.0

# 2 Assignment 02

Central question: What determines the number of praises in child-caretaker interaction? The data come from recordings of different children and contain the following variables:

• PRAISES: the response variable, the number of times the children are praised by their caretakers;
• CHILD: the name of each child;
• SEX: the sex of each child;
• CAN: the number of verb phrases where the caretakers use can when speaking about actions of the child;
• WANT: the number of verb phrases where the caretakers use want when speaking about actions of the child;
• SHOULD_SHALL: the number of verb phrases where the caretakers use should/shall when speaking about actions of the child;
• DIRECTIVE: the number of verb phrases where the caretakers use a directive when speaking about actions of the child;
• SUCCESS: the number of times the child does something as intended;
• FAILURE: the number of times the child does something not as intended.

You now want to determine to what degree the number of praises is a function of

• all predictors as main effects
• and the interaction of each predictor with SEX.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 3]
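As a quick check of the model structure described above, the following sketch lists the terms that one possible formula expands to. It assumes that every numeric predictor is meant to interact with SEX; the object names `f` and `labs` are only illustrative, and no data are needed to run it.

```r
# One possible formula: all predictors as main effects plus each predictor's
# interaction with SEX (an assumption about what the task intends)
f <- PRAISES ~ (CAN + WANT + SHOULD_SHALL + DIRECTIVE + SUCCESS + FAILURE) * SEX

# terms() expands the formula without needing any data
labs <- attr(terms(f), "term.labels")
labs          # 7 main effects followed by 6 interactions with SEX
length(labs)
```

CHILD is left out of this formula because every child contributes exactly one row.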

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
file="inputfiles/202_11b_praises.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
##      CHILD    SEX       PRAISES          CAN              WANT
##  aRetha : 1   f:15   Min.   : 0.0   Min.   : 0.000   Min.   : 0.00
##  aRnold : 1   m:13   1st Qu.: 2.0   1st Qu.: 1.000   1st Qu.: 0.75
##  baRbara: 1          Median : 5.0   Median : 4.000   Median : 2.00
##  beRnard: 1          Mean   : 5.5   Mean   : 4.321   Mean   : 3.25
##  chRis  : 1          3rd Qu.: 7.5   3rd Qu.: 5.250   3rd Qu.: 6.00
##  chRissy: 1          Max.   :13.0   Max.   :18.000   Max.   :10.00
##  (Other):22
##   SHOULD_SHALL      DIRECTIVE        SUCCESS          FAILURE
##  Min.   :0.0000   Min.   : 0.00   Min.   : 0.000   Min.   :0.000
##  1st Qu.:0.0000   1st Qu.: 9.00   1st Qu.: 4.000   1st Qu.:1.000
##  Median :0.0000   Median :12.00   Median : 6.500   Median :3.000
##  Mean   :0.8929   Mean   :15.61   Mean   : 7.679   Mean   :3.286
##  3rd Qu.:1.2500   3rd Qu.:19.50   3rd Qu.:10.000   3rd Qu.:5.250
##  Max.   :6.0000   Max.   :46.00   Max.   :18.000   Max.   :8.000
## 

# 3 Assignment 03

Central question: is the choice of of- vs. s-genitives (the car of my father vs. my father’s car) dependent in some way on the animacy of the possessor (my father) and/or the possessed (the car)? Your predictors are

• POSSESSOR: the animacy of the possessor: abstract vs. animate vs. concrete;
• POSSESSED: the animacy of the possessed: abstract vs. animate vs. concrete.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(         # summarize d, the result of loading
file="inputfiles/202_11c_genitives.csv", # this file
stringsAsFactors=TRUE))       # change categorical variables into factors
##       CASE           POSSESSOR      POSSESSED   GENITIVE
##  Min.   :  1.00   abstract:139   abstract:206   of:150
##  1st Qu.: 75.75   animate :118   animate : 20   s :150
##  Median :150.50   concrete: 43   concrete: 74
##  Mean   :150.50
##  3rd Qu.:225.25
##  Max.   :300.00

# 4 Assignment 04

Central question: is the choice of try to- vs. try and-constructions (I’m gonna try to fix this problem vs. I’m gonna try and fix this problem, which is in the column TRY) dependent in some way on the following 3 predictors and all their interactions:

• MODE: whether the data represent spoken (spk) or written (wrt) English;
• VARIETY: whether the data represent American (amer) or British English (brit);
• CLAUSE: does the clause in which try is used with to or and already involve another to (as in we’re going -> to <- try and beat this thing) or not (other)?

(Source: Hommerberg, Charlotte & Gunnel Tottie. 2007. Try to or Try and? Verb complementation in British and American English. ICAME Journal 31. 45-64.)

Analyze the data like we discussed and summarize the results (briefly). [Difficulty level: 1]
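When you interpret a binary logistic regression on TRY, it helps to know which level R treats as the predicted outcome. The sketch below uses a small hypothetical vector (not taken from the data file) to show the default level order and one way to change it; `try_toy` and `try_to_first` are illustrative names.

```r
# Hypothetical values of the response, only to illustrate level ordering
try_toy <- factor(c("to", "and", "and", "to"))

# Factor levels are sorted alphabetically by default; glm(..., family=binomial)
# treats the first level as 'failure' and models the probability of the other
levels(try_toy)

# relevel() moves a chosen level to the front if the other direction is wanted
try_to_first <- relevel(try_toy, ref="to")
levels(try_to_first)
```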

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(        # summarize d, the result of loading
file="inputfiles/202_11d_try-corp.csv", # this file
stringsAsFactors=TRUE))      # change categorical variables into factors
##       CASE       TRY       VARIETY      MODE        CLAUSE
##  Min.   :   1   and:1631   amer:1187   spk:2257   other:1662
##  1st Qu.: 808   to :1598   brit:2042   wrt: 972   to   :1567
##  Median :1615
##  Mean   :1615
##  3rd Qu.:2422
##  Max.   :3229

# 5 Assignment 05

Central question: is the choice of I vs. you, which is represented in the column MATCH, dependent in some way on the following 3 predictors and all their pairwise interactions:

• SEX: whether the speaker is female or male;
• SENTENCE: where in the file I or you was used on a scale from 0 (first sentence) to 1 (last sentence);
• DISTANCE: where in the sentence I or you was used on a scale from 0 (first character) to ≈1 (last character).
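To see why the upper end of the DISTANCE scale is ≈1 rather than exactly 1, consider a toy example. The three strings below are hypothetical and not taken from the data file; they only mimic a sentence split into the material before the match, the match itself, and the material after it.

```r
# Hypothetical sentence, split into three parts
preceding  <- "Well , "
match      <- "I"
subsequent <- " think so"

# Sentence length in characters and the relative position of the match
sentlength <- nchar(preceding) + nchar(match) + nchar(subsequent)
distance   <- nchar(preceding) / sentlength
distance

# Even a sentence-final match has at least one character of its own,
# so the ratio stays below 1
nchar("Not ") / (nchar("Not ") + nchar("you") + nchar(""))
```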

The following loads the data and prepares the variable DISTANCE:

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(      # summarize d, the result of loading
file="inputfiles/202_11e_IvsYou.csv", # this file
stringsAsFactors=FALSE))   # don't change categorical variables into factors (!)
##       CASE           FILE             SPEAKER              SEX
##  Min.   :    1   Length:21102       Length:21102       Length:21102
##  1st Qu.: 5276   Class :character   Class :character   Class :character
##  Median :10552   Mode  :character   Mode  :character   Mode  :character
##  Mean   :10552
##  3rd Qu.:15827
##  Max.   :21102
##     SENTENCE       PRECEDING            MATCH            SUBSEQUENT
##  Min.   :0.0000   Length:21102       Length:21102       Length:21102
##  1st Qu.:0.2394   Class :character   Class :character   Class :character
##  Median :0.5147   Mode  :character   Mode  :character   Mode  :character
##  Mean   :0.5014
##  3rd Qu.:0.7573
##  Max.   :1.0000
d$SENTLENGTH <- nchar(d$PRECEDING)  +
nchar(d$MATCH) + nchar(d$SUBSEQUENT)
d$DISTANCE <- nchar(d$PRECEDING)/d$SENTLENGTH
d <- d[,c(1:3,7,4:5,9:10)]; d[,2:5] <- lapply(d[,2:5], as.factor)
summary(d)
##       CASE            FILE          SPEAKER       MATCH        SEX
##  Min.   :    1   KRL    :4610   PS5VN  : 1248   i  :    2    : 1043
##  1st Qu.: 5276   KRH    :3590   PS62L  :  852   I  :11637   f: 6676
##  Median :10552   KRT    :3093   PS63K  :  785   you: 8619   m:12480
##  Mean   :10552   KRP    :1997   PS5T8  :  655   You:  844   u:  903
##  3rd Qu.:15827   KR0    :1445   PS5VL  :  647
##  Max.   :21102   KRG    :1385   PS59B  :  632
##                  (Other):4982   (Other):16283
##     SENTENCE        SENTLENGTH        DISTANCE
##  Min.   :0.0000   Min.   :   1.0   Min.   :0.0000
##  1st Qu.:0.2394   1st Qu.:  65.0   1st Qu.:0.0351
##  Median :0.5147   Median : 141.0   Median :0.2453
##  Mean   :0.5014   Mean   : 181.3   Mean   :0.3197
##  3rd Qu.:0.7573   3rd Qu.: 250.0   3rd Qu.:0.5600
##  Max.   :1.0000   Max.   :1353.0   Max.   :0.9978
##

Analyze the data properly and summarize the results (briefly). [Difficulty level: 4]

# 6 Assignment 06

Central question: Do n-grams returned early by an algorithm (BINRANK: early) get rated better (ordinal response: RATING) than those returned late by that algorithm (BINRANK: late) if one controls for the length of the n-gram (SIZE)? The data frame contains the following variables:

• RATING: the response variable, integers from 1 to 7;
• SIZE: the number of parts of each n-gram;
• BINRANK: the main predictor as per the above.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(   # summarize d, the result of loading
file="inputfiles/202_11f_MERGE.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
##       CASE            RATING           SIZE       BINRANK
##  Min.   :   1.0   Min.   :1.000   Min.   :2.00   early:800
##  1st Qu.: 400.8   1st Qu.:1.000   1st Qu.:2.75   late :800
##  Median : 800.5   Median :3.000   Median :3.50
##  Mean   : 800.5   Mean   :3.758   Mean   :3.50
##  3rd Qu.:1200.2   3rd Qu.:7.000   3rd Qu.:4.25
##  Max.   :1600.0   Max.   :7.000   Max.   :5.00

# 7 Assignment 07

Central question: Are results on subordinate clause ordering from the studies of Hampe and Diessel comparable/compatible? Here are the data:

• CASE: the usual numbering column;
• STUDY: a column indicating to which study a data point in a row belongs: diessel vs. hampe;
• ORDER: the response variable in each study, the order of main and subordinate clause (and you know this response from another study in the book);
• CONJ: the predictor in each study, the subordinate conjunction used in the subordinate clause:

rm(list=ls(all.names=TRUE))
d <- data.frame(
STUDY=rep(c("diessel", "hampe"), 8),
ORDER=rep(c("sc-mc", "mc-sc"), each=8),
CONJ =rep(rep(c("after", "before", "once", "until"), each=2), 2),
FREQ =c(27, 82, 6, 105, 77, 236, 5, 41, 70, 200, 81, 425, 21, 74, 94, 346))
d <- data.frame(lapply(d[, -4], \(af) { rep(af, d$FREQ) }))
d <- data.frame(lapply(d, as.factor))
summary(d <- cbind(CASE=seq(nrow(d)), d))
##       CASE            STUDY        ORDER          CONJ
##  Min.   :   1.0   diessel: 381   mc-sc:1311   after :379
##  1st Qu.: 473.2   hampe  :1509   sc-mc: 579   before:617
##  Median : 945.5                               once  :408
##  Mean   : 945.5                               until :486
##  3rd Qu.:1417.8
##  Max.   :1890.0

Are Hampe’s and Diessel’s findings ‘the same’? Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]
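One possible way to approach the question, offered as a sketch rather than as the intended solution, is to ask whether the effect of CONJ on ORDER differs between the two studies, i.e. whether a CONJ:STUDY interaction improves a binary logistic regression. The code rebuilds the data from the frequencies given above; the object names `agg`, `long`, `m_main`, and `m_int` are illustrative.

```r
# Frequencies exactly as given in the assignment
agg <- data.frame(
   STUDY=rep(c("diessel", "hampe"), 8),
   ORDER=rep(c("sc-mc", "mc-sc"), each=8),
   CONJ =rep(rep(c("after", "before", "once", "until"), each=2), 2),
   FREQ =c(27, 82, 6, 105, 77, 236, 5, 41, 70, 200, 81, 425, 21, 74, 94, 346))

# Expand to one row per observation and turn the columns into factors
long <- agg[rep(seq_len(nrow(agg)), agg$FREQ), c("STUDY", "ORDER", "CONJ")]
long[] <- lapply(long, as.factor)

# Model without and with the interaction; the second level of ORDER (sc-mc)
# is the predicted outcome
m_main <- glm(ORDER ~ CONJ + STUDY, data=long, family=binomial)
m_int  <- glm(ORDER ~ CONJ * STUDY, data=long, family=binomial)

# Likelihood-ratio test for the interaction
anova(m_main, m_int, test="Chisq")
```

If the interaction turns out not to matter, that would suggest the two studies found similar conjunction effects; the comparison still needs to be interpreted together with the predicted probabilities.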

# 8 Assignment 08

Central question: What determines how speakers rate the acceptability (the 7-level response variable RATING) of to- vs. -ing complementation (as in I like to swim vs. I like swimming) in an experiment?

• CX_NOW: whether the current experimental stimulus is a to or an -ing construction?
• VNOW_PREF: whether the verb in the current experimental stimulus generally prefers to appear with a to or an -ing construction?
• CX_PRV: whether the previous experimental stimulus was a to or an -ing construction?
• any interactions of these predictors?

Analyze the data properly and summarize the results (briefly). [Difficulty level: 2]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(       # summarize d, the result of loading
file="inputfiles/202_11h_try-exp.csv", # this file
stringsAsFactors=TRUE))     # change categorical variables into factors
##       CASE           RATING        CX_NOW    VNOW_PREF CX_PRV
##  Min.   :  1.0   Min.   :-3.0000   ing:270   ing:280   ing:278
##  1st Qu.:139.8   1st Qu.:-1.0000   to :286   to :276   to :278
##  Median :278.5   Median : 0.0000
##  Mean   :278.5   Mean   : 0.3705
##  3rd Qu.:417.2   3rd Qu.: 2.0000
##  Max.   :556.0   Max.   : 3.0000
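RATING is loaded as a number, but if you want to treat it as an ordinal response, it needs to be an ordered factor first; functions for ordinal regression such as `polr` from the MASS package expect the response in that form. A minimal sketch with hypothetical ratings (not taken from the data file):

```r
# Hypothetical ratings on the -3 to 3 scale
rating <- c(-3, 0, 2, 3, -1, 0, 1, -2)

# ordered=TRUE marks the levels as ranked; levels=-3:3 keeps all 7 scale
# points even if some of them do not occur in a sample
rating_ord <- factor(rating, levels=-3:3, ordered=TRUE)

is.ordered(rating_ord)
table(rating_ord)
```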

# 9 Assignment 09

Central question: Do children and their caretakers exhibit different correlations (measured in Cramer’s V values) between tense (past vs. non-past) and aspect (perfective vs. imperfective) such that

• adults’ correlation values don’t change over time anymore;
• children’s correlation values change over time and approximate the adults’ value(s).

You have data from a corpus study and these are the variables in the data frame:

• AGE: the age of the child at recording time: YEAR;MONTH.DAY;
• KID: the Cramer’s V value for the child’s tense-aspect correlation in this recording;
• CARETAKER: the Cramer’s V value for the caretaker’s tense-aspect correlation in this recording.

Note: Whatever graphs involving time you use, the axis representing the age of the child must of course be proportional to the age, not just to the position of an age in the vector of ages. I don’t care how you do that; if you do it in spreadsheet software, that’s fine, too.

Analyze the data properly and summarize the results (briefly). [Difficulty level: 3]

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(    # summarize d, the result of loading
file="inputfiles/202_11i_russaspect.csv", # this file
stringsAsFactors=FALSE)) # don't change categorical variables into factors (!)
##       CASE           AGE                 KID            CARETAKER
##  Min.   : 1.00   Length:80          Min.   :0.01645   Min.   :0.1627
##  1st Qu.:20.75   Class :character   1st Qu.:0.31861   1st Qu.:0.3004
##  Median :40.50   Mode  :character   Median :0.44217   Median :0.3554
##  Mean   :40.50                      Mean   :0.45170   Mean   :0.3640
##  3rd Qu.:60.25                      3rd Qu.:0.57247   3rd Qu.:0.4355
##  Max.   :80.00                      Max.   :1.00000   Max.   :0.5586
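Since AGE is loaded as a character vector, it has to be converted to a number before it can serve as a proportional time axis. The sketch below shows one way to do that in R. The age strings are hypothetical examples of the YEAR;MONTH.DAY format, and treating a month as 30.4375 days is an approximation chosen here, not something the data prescribe.

```r
# Hypothetical age strings in YEAR;MONTH.DAY format
ages <- c("1;06.15", "2;00.00", "2;11.03")

# Split at ';' and '.', convert each piece to a number:
# the result is a matrix with one column per age (rows: year, month, day)
parts <- vapply(strsplit(ages, "[;.]"), as.numeric, numeric(3))

# Age in months; days are converted with an average month length
age_months <- parts[1, ] * 12 + parts[2, ] + parts[3, ] / 30.4375
age_months
```

A vector like `age_months` can then be used as the x-axis of a plot, so that the distances between recordings reflect actual differences in age.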

# 10 Assignment 10

Central question: what factors co-determine how English changed from a 3rd-person singular -th (e.g., He giveth) to the current 3rd-person singular -s (e.g., He gives)? You have data from a corpus study on how the third person singular form in English changed across five time periods (from P1 at about 1480 to P5 at about 1700). This data set contains annotation for third person singular verbs (extracted from letters) with regard to the following variables:

• VARIANT: the response variable: the third person singular form as found in the corpus file: es vs. eth;
• TIME5: the time period: P1 vs. P2 vs. P3 vs. P4 vs. P5;
• SENGEND: the sex of the sender of the letter: female vs. male;
• RECGEND: the sex of the recipient of the letter: female vs. male;
• CLOSEFAM: whether the recipient of the letter is a close family member of the sender or not: no vs. yes;
• FINSYB: whether the verb stem ends in a sibilant: no (e.g., see) vs. yes (e.g., seize);
• FOLFRIC: what the word following the third person singular form begins with: s (e.g., he sees seagulls) vs. th (e.g., he sees the seagulls) vs. other (e.g., he sees many seagulls);
• GRAM: whether the verb in question is used as a grammatical or a lexical verb: yes (grammatical, i.e. be, do and aux. have) vs. no (lexical/other).

rm(list=ls(all.names=TRUE))
summary(d <- read.delim(    # summarize d, the result of loading
file="inputfiles/202_11j_3rdperson.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
##       CASE      VARIANT    TIME5       SENGEND       RECGEND     CLOSEFAM
##  Min.   :   1   es :1524   P1: 505   female: 784   female: 734   no :1917
##  1st Qu.:1036   eth:2619   P2:  99   male  :3359   male  :3409   yes:2226
##  Median :2072              P3:1508
##  Mean   :2072              P4:1096
##  3rd Qu.:3108              P5: 935
##  Max.   :4143
##  FINSYB      FOLFRIC      GRAM
##  no :3953   other:3666   no :2867
##  yes: 190   s    : 189   yes:1276
##             th   : 288
##
##
## 

You want to characterize how the predictors and their pairwise interactions with TIME are correlated with the change from -(e)th to -(e)s. Analyze the data properly and summarize the results (briefly). Note: you must conflate the 3 early time periods into one, but once you’re done with everything, you should figure out why. [Difficulty level: 4]
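Conflating factor levels can be done by assigning a list to `levels()`. The sketch below uses a small hypothetical vector rather than the data file, and the names `time3` and `P1to3` are only illustrative.

```r
# Hypothetical values of the time-period variable
time5 <- factor(c("P1", "P2", "P3", "P3", "P4", "P5"))

# Copy the factor and merge the three early periods into one level;
# each list element names a new level and the old levels it absorbs
time3 <- time5
levels(time3) <- list(P1to3=c("P1", "P2", "P3"), P4="P4", P5="P5")

# Cross-tabulate old against new to verify the recoding
table(time5, time3)
```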