Biology 483G Assignments

Homework/Exam Data Sets (click to download)

transfrm.syd, transform.xls -  Assorted data for normality testing and (if necessary) transformation
concevol.syd, concevol.xls - Hypothetical data on concerted evolution in classes of repeat sequences
pcdfa.syd - Size-corrected morphological data (56 characters) for 6 populations of Gila cypha
cluster.syd - Data on burial artifacts from graves in northeast Thailand (taken from Higham; Manly 1994)
protein.syd - Data on protein consumption and employment categories for 24 European countries (Manly 1994)
currency.syd - Data on genuine vs. forged bank notes (modified from Flury and Riedwyl 1993)

 

 

 

 

Second Exam (Due by 12:00 noon, Friday, May 9th)

Complete all questions.  Limit your answers to 2 single-spaced pages per question (i.e. 1A).  Construct your answers in manuscript format, as Materials and Methods and/or Results section(s).  For those questions requiring statistical analysis, support your conclusions with appropriate statistics incorporated in the text or summarized in Table or Figure form.  DO NOT include raw SYSTAT/SPSS output.  Be sure to consider and address all necessary assumptions of the tests you wish to apply.  Avoid including information not directly related to the question; you will lose points both for failing to include pertinent information as well as for providing extraneous information.  All answers must be word-processed; handwritten exams will receive a grade of zero.  All exams should be completed individually.  Do not discuss your answers or efforts with others in the class - ask me if you are unsure about the meaning or intent of a question.   NOTE that thoroughly answering a particular subquestion may sometimes require you to conduct more than one type of statistical analysis.  On the other hand, different subquestions will (generally) necessitate the use of different techniques, or focus on different aspects or ways of applying a particular technique.  Only techniques from the second half of the course are valid for use in answering the questions.

1.  Prior to the 1980s, most states required couples filing for divorce to identify specific reasons for terminating the marriage contract.  Each state maintained a list of acceptable grounds for divorce.  In some cases these were quite obscure or bizarre.  We might expect there to be related sets of causes that might co-occur in groups of states; that is, if a state accepted desertion as grounds for divorce, it might also tend to accept lack of support.  In addition, we might imagine (or hypothesize) that the set of acceptable grounds for divorce might show some sort of regional pattern, reflecting the unique cultural or demographic characteristics of an area.  The data set divorce.xls identifies the acceptable grounds for divorce in each state from the year 1971.  In this data set, ‘1’ indicates that the state recognized a particular cause as acceptable grounds for divorce, while ‘0’ indicates that it did not.  Use these data to address the following issues in a statistically appropriate manner:

a.  Is there any evidence to suggest that there are related sets of grounds for divorce that show patterns of co-occurrence among states?  That is, do certain types of grounds for divorce tend to show the same distribution among states, and is there any consistent relationship between them (e.g., all are moral causes vs. financial causes)?  What is the significance and/or strength of your conclusion?  Justify your conclusions with appropriate numerical or graphical results.  (25 points)

b.  Is there any evidence to suggest that there are regional or demographic patterns in acceptable grounds for divorce?  That is, are there sets of states that have similar patterns of acceptable causes, and is there any consistent relationship between them (e.g., southern vs. northern, urban vs. rural)?  What is the significance and/or strength of your conclusion?  Justify your conclusions with appropriate numerical or graphical results.  (25 points)

 

2.  While it is often a financial necessity, parents and child development experts have long debated the advantages and disadvantages of placing children in a day care setting.  Some might argue that the socialization aspect encourages development of interpersonal skills in children, while others might suggest that those same interactions might foster aggression and/or out-of-control behavior.  We could potentially address this question by comparing the behavior of children exposed to different childcare settings along certain developmental vectors.  The data set daycare.xls provides such a study.  Here, a set of children was characterized on the basis of 4 variables: (1) whether they were cared for by parents, a private sitter, or in a day care setting; (2) the behavioral skills they expressed at the dinner table; (3) their behavioral skills upon encountering a stranger; and (4) their social problem-solving skills as measured through a cognitive test.  Use these data to address the following questions in a statistically appropriate manner:

a.  What are the patterns of variation among the behavioral variables across all groups?  That is, are there associations (positive or negative) between behavioral skills?  How well do these associations account for the variation among children within the data set?  Support your conclusions with appropriate numerical and/or graphical results.

b.  Is there any evidence that the childcare setting affects children’s behavioral skills?  That is, are there significant differences in behavioral skill characteristics among groups of children?  Describe the nature of the patterns of variation as fully as possible.  How consistent or strong are these differences?  Support your conclusions with appropriate numerical and/or graphical results.  (25 points)

 

 

First Exam (Due by 3:00, Friday, April 11th)

Complete all questions.  Limit your answers to 2 single-spaced pages per question (i.e. 1A).  Construct your answers in manuscript format, as Materials and Methods and/or Results section(s).  For those questions requiring statistical analysis, support your conclusions with appropriate statistics incorporated in the text or summarized in Table or Figure form.  DO NOT include raw SYSTAT/SPSS output.  Be sure to consider and address all necessary assumptions of the tests you wish to apply.  Avoid including information not directly related to the question; you will lose points both for failing to include pertinent information as well as for providing extraneous information.  All answers must be word-processed; handwritten exams will receive a grade of zero.  All exams should be completed individually.  Do not discuss your answers or efforts with others in the class - ask me if you are unsure about the meaning or intent of a question.   NOTE that thoroughly answering a particular subquestion may sometimes require you to conduct more than one type of statistical analysis.  On the other hand, different subquestions will necessitate the use of different techniques, or focus on different aspects of a particular technique.

 

1.  Idiopathic pulmonary fibrosis (IPF) is a disease of the respiratory system which reduces lung capacity and efficiency through the loss of functional alveoli and the resulting limitation in the transfer of oxygen from air to blood.  The causative agent of the disease is unknown, though there appears to be some involvement of myofibroblasts in the manifestation of the disease once symptoms become established.  As with many respiratory diseases, the effects of age and smoking on clinical symptoms are of potential interest.  The data contained in the file ipf.txt represent predicted residual lung volumes of patients with IPF; the patients studied differ with respect to sex (M, F), age (in years) and smoking status (N = never, F = former, C = current).  Using these data, address the following questions in a statistically appropriate manner.

a.  Is there a relationship between age and/or smoking status and predicted residual lung volume?  What is the nature and strength of that relationship?  Describe the patterns of variation as fully as possible.  Ignore the potential effect of sex.  Justify your conclusions with appropriate numerical results.  (25 points)

b.  Is there a relationship between smoking status and/or sex and predicted residual lung volume?  What is the nature and strength of that relationship?  Describe the patterns of variation as fully as possible.  Ignore the potential effect of age.  Justify your conclusions with appropriate numerical results.  (25 points)

 

2.  As a borderline (my wife might argue over this choice of adjective) obsessive-compulsive, I can relate to this question.  Imagine that you administer the Multidimensional Perfectionism Scale (MPS) to a sample of 82 adult men; the resulting score is indicative of an individual’s level of obsessive-compulsiveness (higher = more).  It is my assumption that this test is rather extensive and time-consuming to administer and score, so imagine further that you are a graduate student looking for more rapid and simpler indicators of obsessive-compulsiveness that will allow you to evaluate subjects more readily and less invasively.  To that end, you also characterize each of the 82 individuals in your study on the basis of two types of obsessive-compulsive behavior (WASHING = average number of times an individual repeatedly washes their hands that are clean, and CHECKING = average number of times an individual repeatedly checks something that has already been checked).  You also record individuals’ ages (AGE) and level of education (EDUC) as possible contributors to MPS score.  These data are available in the file ocd.xls.  Use these data to address the following questions:

a.  Is there any evidence that a combination of simple variables (WASHING, CHECKING, AGE and/or EDUC) is a useful predictor of MPS score?  What is the particular model that best describes the relationship?  What is the level of significance and strength associated with that relationship?  Justify your conclusions with appropriate numerical results.  Be sure to consider all necessary assumptions of the procedures you carry out, as well as potential confounding influences inherent in your data set.  (25 points)

b.  What are the biological relationships among the variables (WASHING, CHECKING, AGE and EDUC), regardless of significance?  What are the biological relationships between each of these variables and MPS?  If you had to rely on only one simple variable as a substitute for MPS, which would it be and why?  Justify your conclusions with appropriate numerical results.  Be sure to consider all necessary assumptions of the procedures you carry out, as well as potential confounding influences inherent in your data set.  (25 points)

 

 



  

 

Homework #4  

The data set cluster.syd provided above represents data on the presence or absence of certain artifacts in graves from a cemetery in northern Thailand.  There are 38 different types of artifacts, and the bodies in the graves are classified as either adult males (1), adult females (2), or children (3).  Carry out a cluster analysis to examine the relationship among the 47 burials.  Is there any evidence to suggest that the type of body in the grave is related to the nature of artifacts associated with that grave?  How strong is that evidence and why?  Provide BRIEF and annotated SYSTAT output to support your conclusions, as well as a summary of your results and conclusions in paragraph form suitable as a manuscript Results and Discussion section.
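As a rough illustration of the first step of such an analysis, the sketch below computes Jaccard distances between presence/absence profiles and compares average distances within and between body types.  It is only a sketch under stated assumptions: the four burials are hypothetical toy records (not taken from cluster.syd), the choice of the Jaccard coefficient is an assumption (it ignores joint absences, which is one common choice for artifact data), and it does not replace the SYSTAT clustering the assignment asks for.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - (shared presences / presences in either profile)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two empty profiles are treated as identical (an arbitrary convention).
    return 0.0 if either == 0 else 1.0 - both / either


# Hypothetical toy burials (NOT from cluster.syd): name -> (body type, artifact flags)
burials = {
    "B1": (1, [1, 1, 0, 0, 1]),
    "B2": (1, [1, 1, 0, 0, 0]),
    "B3": (2, [0, 0, 1, 1, 0]),
    "B4": (2, [0, 1, 1, 1, 0]),
}

within, between = [], []
for (_, (type_a, flags_a)), (_, (type_b, flags_b)) in combinations(burials.items(), 2):
    d = jaccard_distance(flags_a, flags_b)
    # Sort each pairwise distance by whether the two burials share a body type.
    (within if type_a == type_b else between).append(d)

mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)
print(f"mean within-type distance:  {mean_within:.3f}")
print(f"mean between-type distance: {mean_between:.3f}")
```

If burials of the same body type tend to carry similar artifacts, within-type distances should be smaller than between-type distances; a distance matrix of this kind is also the usual input to a hierarchical clustering routine.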


 

 

Homework #3  

The file 'pcdfa.syd' contains data used in an attempt to differentiate populations of the endangered cyprinid fish Gila cypha.  148 individuals from 6 isolated populations in the upper Colorado River basin were sampled for 56 morphological characteristics; size differences among individuals have been factored out, so the information remaining is thought to reflect patterns of shape variation within and among populations.  The populations are as follows: 1 - Black Rocks; 2 - Cataract Canyon; 3 - Desolation Canyon; 4 - Grand Canyon; 5 - Westwater Canyon; 6 - Yampa River.  Carry out both a principal components and discriminant function analysis on these data, and from each analysis consider the following question: is there any evidence to suggest that isolated populations are morphologically distinct?  What accounts for the very different picture that emerges from the two analyses and why?

 




 

 

Homework #2

1.  Concerted evolution is a common phenomenon among repetitive DNA sequences such as rDNA, mtDNA, and immunoglobulin gene families.  In concerted evolution, multiple copies are homogenized to some extent (presumably) through unequal crossing-over and gene conversion during DNA replication; the result is that the individual copies of the repeat sequences do not accumulate divergent mutations as quickly as one might expect.  Li (1997) has reviewed the data on concerted evolution and has proposed that the rate of sequence divergence may be affected by a number of biological factors, including but not limited to:

a.  Functional requirements - the need for a large amount of identical gene product for normal functioning of the cell vs. the need for a large amount of diversity in gene product
b.  Structure of the repeat - the number and/or size of introns or other non-coding regions within each repeat sequence
c.  Selection on the repeat - the coefficient of positive selection (favoring divergence) or negative selection (favoring homogeneity) acting on a particular repeat sequence within the gene family
d.  Arrangement of the repeats - the average map distance between adjacent copies of the repeat unit
d.  Arrangement of the repeats - the average map distance between adjacent copies of the repeat unit

Using the 'concevol.syd' data set provided above, examine the dependence of the degree of sequence divergence among repeat copies on these four factors.  What is the 'best' model of the relationship between independent and dependent variables and why?  Provide BRIEF and annotated SYSTAT output to support your conclusions, as well as a summary of your methods and conclusions in paragraph form suitable as manuscript Methods and Results sections.
 

2.  A medical researcher was investigating levels of a protein whose excess is thought to be related to onset of a particular disease syndrome more prevalent in males than in females.  This researcher examined protein concentrations in male and female mice from 3 inbred strains that differ in their susceptibility to the disease; the E strain is most susceptible, while the I strain is least so.  These data are given below.  Test the hypothesis that high protein concentration is associated with disease onset.  Provide BRIEF and annotated SYSTAT output to support your conclusions, as well as a summary of your methods and conclusions in paragraph form suitable as manuscript Methods and Results sections.
 
 

 

 

Sex        Strain E   Strain W   Strain I

Females    50.1       53.4       54.0
           52.8       55.2       49.1
           50.8       51.0       60.5
           58.8       59.3       57.8
           59.7       61.5       48.7
           49.0       61.2       57.0
           58.8       57.8       61.1
           62.2       50.1       62.8
           57.8       56.0       59.8
           61.2       56.5       60.3

Males      46.5       57.5       49.1
           44.4       59.3       51.8
           42.0       62.4       55.3
           51.1       61.1       43.6
           45.8       59.9       50.1
           46.3       55.6       51.0
           41.8       46.8       49.0
           52.0       59.2       48.8
           46.5       50.4       52.0
           39.0       47.8       43.0
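For readers who want to see how a sex-by-strain design of this kind is partitioned, the following is a minimal sketch of a balanced two-factor analysis of variance in plain Python, using the values from the table above.  It assumes a fixed-effects model with equal replication (10 mice per cell), so each F ratio uses the error mean square as its denominator.  It reports no p-values, does not test the assumptions of normality and homogeneity of variance, and is not a substitute for the annotated SYSTAT output requested; the function name two_way_anova is introduced here purely for illustration.

```python
from statistics import mean

# Protein concentrations transcribed from the table above: data[sex][strain].
data = {
    "Females": {
        "E": [50.1, 52.8, 50.8, 58.8, 59.7, 49.0, 58.8, 62.2, 57.8, 61.2],
        "W": [53.4, 55.2, 51.0, 59.3, 61.5, 61.2, 57.8, 50.1, 56.0, 56.5],
        "I": [54.0, 49.1, 60.5, 57.8, 48.7, 57.0, 61.1, 62.8, 59.8, 60.3],
    },
    "Males": {
        "E": [46.5, 44.4, 42.0, 51.1, 45.8, 46.3, 41.8, 52.0, 46.5, 39.0],
        "W": [57.5, 59.3, 62.4, 61.1, 59.9, 55.6, 46.8, 59.2, 50.4, 47.8],
        "I": [49.1, 51.8, 55.3, 43.6, 50.1, 51.0, 49.0, 48.8, 52.0, 43.0],
    },
}


def two_way_anova(data):
    """Sums of squares, df and F ratios for a balanced two-factor design."""
    sexes = list(data)
    strains = list(data[sexes[0]])
    a, b = len(sexes), len(strains)
    n = len(data[sexes[0]][strains[0]])  # replicates per cell, assumed equal

    everything = [y for s in sexes for t in strains for y in data[s][t]]
    grand = mean(everything)
    cell_means = {(s, t): mean(data[s][t]) for s in sexes for t in strains}
    sex_means = {s: mean([y for t in strains for y in data[s][t]]) for s in sexes}
    strain_means = {t: mean([y for s in sexes for y in data[s][t]]) for t in strains}

    ss = {
        "sex": b * n * sum((sex_means[s] - grand) ** 2 for s in sexes),
        "strain": a * n * sum((strain_means[t] - grand) ** 2 for t in strains),
    }
    # Interaction = variation among cell means not explained by the main effects.
    ss_cells = n * sum((m - grand) ** 2 for m in cell_means.values())
    ss["interaction"] = ss_cells - ss["sex"] - ss["strain"]
    ss["error"] = sum(
        (y - cell_means[s, t]) ** 2 for s in sexes for t in strains for y in data[s][t]
    )
    ss["total"] = sum((y - grand) ** 2 for y in everything)

    df = {
        "sex": a - 1,
        "strain": b - 1,
        "interaction": (a - 1) * (b - 1),
        "error": a * b * (n - 1),
    }
    ms_error = ss["error"] / df["error"]
    f_ratio = {k: (ss[k] / df[k]) / ms_error for k in ("sex", "strain", "interaction")}
    return {"cell_means": cell_means, "ss": ss, "df": df, "F": f_ratio}


result = two_way_anova(data)
for term, f in result["F"].items():
    print(f"{term:12s} df={result['df'][term]}  F={f:.2f}")
```

Inspecting the cell means alongside the interaction term is usually the informative step here, since the question concerns whether the pattern across strains differs between the sexes.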

 


 

 

Homework #1  

Using the file transfrm.syd above, as well as the data provided on fecundity and body size in fish, test each variable for normality, and apply data transformations to restore normality where necessary.  In the style of a Materials and Methods section, provide a written summary of your statistical approach taken for each variable (approximately 1 paragraph each).  Do not describe all the steps you took, only the protocol for getting from the original data to the appropriately-transformed data.  Given that normality testing represents preliminary (or exploratory) analysis done in advance of the tests of hypotheses central to the study, it is appropriate to provide statistical support for your conclusions within this part of a Materials and Methods section (and by extension, within the context of your answers); however, this would typically not involve use of tables or figures.
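The general idea of choosing a transformation can be sketched as follows.  This is an informal screening heuristic only: it compares sample skewness before and after two common transformations, and it is not a formal normality test (a Shapiro-Wilk or Kolmogorov-Smirnov test from a statistics package would still be needed).  The ten values are hypothetical right-skewed measurements invented for illustration and are not taken from transfrm.syd or the fish data.

```python
import math
from statistics import mean


def skewness(values):
    """Sample skewness g1 = m3 / m2**1.5 (near 0 for a symmetric sample)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed measurements (NOT from the course data sets).
raw = [12, 15, 18, 22, 25, 31, 40, 55, 90, 210]

candidates = {
    "raw": raw,
    "sqrt": [math.sqrt(v) for v in raw],
    "log10": [math.log10(v) for v in raw],
}

skews = {name: skewness(vals) for name, vals in candidates.items()}
# Pick the candidate whose skewness is closest to zero.
best = min(skews, key=lambda name: abs(skews[name]))

for name, g1 in skews.items():
    print(f"{name:6s} skewness = {g1:.2f}")
print("least skewed:", best)
```

For positive, right-skewed data such as counts or sizes, square-root and logarithmic transformations are the usual first candidates; whichever is chosen, the transformed variable should be retested for normality before it is used.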