Coursework: Data Management and Visualization: Assignment 3: Organizing variables

My hypothesis for the first assignment was that, in the US natality statistics, prenatal cigarette smoking, alcohol consumption, and gestational weight gain would each show a statistically significant correlation with lower Apgar scores. (Apgar scores are recorded at 1 minute and again at 5 minutes after delivery, and assess newborn health based on infant activity, pulse, grimace reflex, appearance, and respiration.)

Because the full data set is so large (all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008), I selected a subset that could be downloaded to a CSV file and used for the analysis in this course: all births in the state of Virginia in January 2003.

Data management decisions:

In assignment #2, I ran frequency distributions for 3 variables: Apgar scores at 5 minutes, alcohol use, and cigarette use. Apgar scores are well defined on a scale of 0-10, where 10 is the best score and 0 is the worst score. Therefore I didn't feel that any recoding was required here.
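A minimal sketch of how a frequency distribution like this can be produced with pandas. It assumes the data sits in a DataFrame with an APGAR_5MIN column, and that percentages are taken over all rows, which is what the percentage figures further down suggest. The sample values are invented for illustration and are not the Virginia data.

```python
import pandas as pd

# Hypothetical sample rows (not the real data); None stands for a missing score.
df = pd.DataFrame({"APGAR_5MIN": [9, 9, 8, 10, 9, 7, 9, None]})

# Count how often each score occurs, ordered by score rather than by frequency.
# value_counts() leaves out missing values by default.
apgar_counts = df["APGAR_5MIN"].value_counts().sort_index()

# Express each count as a percentage of all rows (assumption: rows with a
# missing score stay in the denominator).
apgar_percent = apgar_counts * 100 / len(df)

print("Apgar scores by COUNT")
print(apgar_counts)
print("Apgar scores by PERCENTAGE")
print(apgar_percent)
```

Sorting by index keeps the scores in their natural order, which is easier to read than the default ordering by frequency.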

Alcohol use and cigarette use are recorded as "True" or "False", and there is no missing data for either variable in this set of observations. Therefore no recoding, removal of missing data, or combining into secondary variables is needed.
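One way to confirm that the two columns have no missing data and contain only True/False is sketched below. This is an illustration rather than the check I actually ran, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample rows (not the real data).
df = pd.DataFrame({
    "CIGARETTE_USE": [False, True, False, False],
    "ALCOHOL_USE": [False, False, False, True],
})

columns = ["CIGARETTE_USE", "ALCOHOL_USE"]

# Number of missing entries per column; zero means there is nothing to remove.
missing = df[columns].isna().sum()

# Check that every entry is one of the two expected values.
only_bool = {col: bool(df[col].isin([True, False]).all()) for col in columns}

print(missing)
print(only_bool)
```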

Originally I had wanted to factor in the amount of weight the mother gained during pregnancy. However, values ranged anywhere from 1 to 99 lbs, so a frequency distribution for each individual pound gained didn't make sense. Instead, below is the code to group weight gain into categories.
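This kind of grouping can be sketched with pandas `pd.cut`. The 20-lb bin edges are an assumption inferred from the interval labels in the output further down, the WEIGHT_GAIN_GROUP column name is hypothetical, and the sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows (not the real data).
df = pd.DataFrame({"WEIGHT_GAIN_POUNDS": [12, 20, 25, 33, 40, 41, 58, 75]})

# Assumed bin edges: 20-lb groups. pd.cut uses right-closed intervals by
# default, so 20 falls in (0, 20] and 40 falls in (20, 40].
# Values outside the edges (for example above 80) would become NaN.
bins = [0, 20, 40, 60, 80]
df["WEIGHT_GAIN_GROUP"] = pd.cut(df["WEIGHT_GAIN_POUNDS"], bins=bins)

# Count observations per group, keeping the groups in ascending order.
group_counts = df["WEIGHT_GAIN_GROUP"].value_counts().sort_index()

print("Weight gain (lbs) by groups")
print(group_counts)
```

Fixed-width bins keep the categories easy to compare. `pd.qcut` would be an alternative if equal-sized groups were preferred over equal-width ones.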

My categorization code:

The output:

Full output of my frequency distributions is here:

Apgar scores by COUNT
---------------------
0.0 3
1.0 13
2.0 9
3.0 5
4.0 10
5.0 21
6.0 44
7.0 124
8.0 525
9.0 6958
10.0 280
Name: APGAR_5MIN, dtype: int64

Apgar scores by PERCENTAGE
--------------------------
APGAR_5MIN
0.0 0.037477
1.0 0.162399
2.0 0.112430
3.0 0.062461
4.0 0.124922
5.0 0.262336
6.0 0.549656
7.0 1.549032
8.0 6.558401
9.0 86.920675
10.0 3.497814
dtype: float64

Cigarette use by COUNT (True/False)
-----------------------------------
False 7383
True 622
Name: CIGARETTE_USE, dtype: int64

Cigarette use by PERCENTAGE (True/False)
----------------------------------------
CIGARETTE_USE
False 92.229856
True 7.770144
dtype: float64

Alcohol use by COUNT (True/False)
---------------------------------
False 7383
True 622
Name: ALCOHOL_USE, dtype: int64

Alcohol use by PERCENTAGE (True/False)
--------------------------------------
ALCOHOL_USE
False 92.229856
True 7.770144
dtype: float64

Weight gain (lbs) by groups
---------------------------
(0, 20] 1659
(20, 40] 4624
(40, 60] 1220
(60, 80] 152

Name: WEIGHT_GAIN_POUNDS, dtype: int64

Sunday, December 18, 2016
