Because of the size of this data set (all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008), I selected a subset that could be downloaded to a CSV file and used for this course analysis: all births in the state of Virginia in January 2003.
Data management decisions:
- In assignment #2, I ran frequency distributions for 3 variables: Apgar scores at 5 minutes, alcohol use, and cigarette use. Apgar scores are well defined on a scale of 0-10, where 10 is the best score and 0 is the worst score. Therefore I didn't feel that any recoding was required here.
- Alcohol use and cigarette use are recorded as "True" or "False", and there is no missing data for these observations. Therefore no recoding, removal of missing data, or combining into secondary variables was needed.
- Originally I had wanted to factor in the amount of weight gained by the mother during pregnancy. However, the recorded values ranged anywhere from 1 to 99 lbs, so a frequency distribution with one row per pound gained didn't make sense. Instead, I grouped weight gain into categories.
My categorization code:
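A minimal sketch of how such a grouping might be written with pandas `cut`, run on a few made-up rows rather than the real subset. The bin edges are inferred from the group labels in the output further down; the toy values and the names `bins`, `weight_group`, and `group_counts` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the Virginia subset (values are made up for illustration)
data = pd.DataFrame({"WEIGHT_GAIN_POUNDS": [12, 20, 25, 38, 45, 61, 99]})

# Assumed bin edges: right-closed intervals (0, 20], (20, 40], (40, 60], (60, 80]
bins = [0, 20, 40, 60, 80]

# pd.cut assigns each value to its interval; values outside the edges become NaN
weight_group = pd.cut(data["WEIGHT_GAIN_POUNDS"], bins)

# Count rows per group, listed in bin order instead of by frequency
group_counts = weight_group.value_counts().sort_index()
print(group_counts)
```

With these edges, any value above 80 (such as 99) falls outside every bin and is left out of the group counts.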
The output:
Full output of my frequency distributions is here:
Apgar scores by COUNT
---------------------
0.0 3
1.0 13
2.0 9
3.0 5
4.0 10
5.0 21
6.0 44
7.0 124
8.0 525
9.0 6958
10.0 280
Name: APGAR_5MIN, dtype: int64
Apgar scores by PERCENTAGE
--------------------------
APGAR_5MIN
0.0 0.037477
1.0 0.162399
2.0 0.112430
3.0 0.062461
4.0 0.124922
5.0 0.262336
6.0 0.549656
7.0 1.549032
8.0 6.558401
9.0 86.920675
10.0 3.497814
dtype: float64
Cigarette use by COUNT (True/False)
-----------------------------------
False 7383
True 622
Name: CIGARETTE_USE, dtype: int64
Cigarette use by PERCENTAGE (True/False)
----------------------------------------
CIGARETTE_USE
False 92.229856
True 7.770144
dtype: float64
Alcohol use by COUNT (True/False)
---------------------------------
False 7383
True 622
Name: ALCOHOL_USE, dtype: int64
Alcohol use by PERCENTAGE (True/False)
--------------------------------------
ALCOHOL_USE
False 92.229856
True 7.770144
dtype: float64
Weight gain (lbs) by groups
---------------------------
(0, 20] 1659
(20, 40] 4624
(40, 60] 1220
(60, 80] 152
Name: WEIGHT_GAIN_POUNDS, dtype: int64
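The count and percentage tables above could be produced along the following lines. This is a minimal sketch on made-up rows, not the original script; the toy values and the names `counts` and `percent` are assumptions. Dividing by the total number of rows (rather than by the number of non-missing scores) is also an assumption, chosen because the Apgar percentages above add up to slightly less than 100.

```python
import pandas as pd

# Toy stand-in for the subset; None marks a birth with no recorded Apgar score
data = pd.DataFrame({
    "APGAR_5MIN": [9.0, 9.0, 8.0, 10.0, None],
    "CIGARETTE_USE": [False, False, True, False, False],
})

# Frequency distribution by count, ordered by score (missing scores are skipped)
counts = data["APGAR_5MIN"].value_counts().sort_index()

# Frequency distribution by percentage of all rows, including rows with no score
percent = data.groupby("APGAR_5MIN").size() * 100 / len(data)

print("Apgar scores by COUNT")
print(counts)
print("Apgar scores by PERCENTAGE")
print(percent)

# The same two steps apply unchanged to a True/False column
print(data["CIGARETTE_USE"].value_counts().sort_index())
print(data.groupby("CIGARETTE_USE").size() * 100 / len(data))
```

Using `value_counts(normalize=True)` instead would express each score as a share of the non-missing scores only, so those percentages would sum to exactly 100.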