A note about my dataset, the record of all US births from 1969 - 2008 hosted on Google's BigQuery.
- It's really, really, really big. Much bigger than the datasets provided for the course. I only really thought about this when I went to create a .csv file and realized that exporting a dataset of this size wasn't practical. Oops! Therefore, for the purpose of this exercise, I selected a subset of rows: all births in January of 2003 in the state of Virginia. This yields 8005 unique records.
And here is the output from running this code:
Apgar scores by COUNT
---------------------
0.0 3
1.0 13
2.0 9
3.0 5
4.0 10
5.0 21
6.0 44
7.0 124
8.0 525
9.0 6958
10.0 280
Name: APGAR_5MIN, dtype: int64
Apgar scores by PERCENTAGE
--------------------------
APGAR_5MIN
0.0 0.037477
1.0 0.162399
2.0 0.112430
3.0 0.062461
4.0 0.124922
5.0 0.262336
6.0 0.549656
7.0 1.549032
8.0 6.558401
9.0 86.920675
10.0 3.497814
dtype: float64
Cigarette use by COUNT (True/False)
-----------------------------------
False 7383
True 622
Name: CIGARETTE_USE, dtype: int64
Cigarette use by PERCENTAGE (True/False)
----------------------------------------
CIGARETTE_USE
False 92.229856
True 7.770144
dtype: float64
Alcohol use by COUNT (True/False)
---------------------------------
False 7383
True 622
Name: ALCOHOL_USE, dtype: int64
Alcohol use by PERCENTAGE (True/False)
--------------------------------------
ALCOHOL_USE
False 92.229856
True 7.770144
dtype: float64
A few comments on these distributions:
- Right away I thought, "so weird! The number of mothers who drank during pregnancy is exactly the same as the number who smoked!" But that's a really precise coincidence, isn't it? When I went back and queried the BigQuery db, I found that for each year in which cigarette_use was recorded (starting in 2003), the *exact* same records were marked TRUE for alcohol_use. This holds for records in various months of various years from 2003-2008, for various states. In fact, in querying the entire db, I couldn't find any records that were TRUE for alcohol_use but not for cigarette_use, or vice versa. That is very odd, so I need to do some more research on these variables and how they have been recorded.
- Low Apgar scores are quite uncommon. For January births in Virginia in the year 2003, almost 87% of all newborn infants received an Apgar score of 9. That's great news for newborns! (The counts above sum to 7992, so 13 of the 8005 records do not appear in the Apgar table, which is why the percentages add up to slightly less than 100.) However, with so little variation in the scores, it makes sense that people say Apgar scores are not a clear indication of early childhood health, and I am starting to think they are not as closely related to maternal health indicators as I had previously thought.
- I did not include a distribution for weight gain. Well, I did run one, but the values were spread across every number from 1 to 99 lbs of maternal weight gain, which makes for a long and unreadable table. Therefore, I need to figure out how to perform frequency analysis on grouped (binned) values; for example, the number of mothers who gained 1-10 lbs, or 11-20 lbs, and so on.
- Finally, time permitting, I'd like to figure out how to use the BigQuery API to query the dataset directly from my Python program, so that I can perform queries against the entire dataset. If I can't figure that out for this course, then I will select 1-3 states for a particular month in a particular year, so that I have a large enough set of observations for analysis.
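
One way to double-check the cigarette/alcohol oddity from inside pandas, rather than by eyeballing query results, is a cross-tabulation of the two columns. This is only a sketch: the tiny DataFrame below is made-up stand-in data, and the column names are assumed to match the ones in the output above.

```python
import pandas as pd

# Made-up stand-in for the real subset; only the column names
# (CIGARETTE_USE, ALCOHOL_USE) are taken from the output above.
df = pd.DataFrame({
    "CIGARETTE_USE": [True, True, False, False, False],
    "ALCOHOL_USE":   [True, True, False, False, False],
})

# Rows are CIGARETTE_USE, columns are ALCOHOL_USE. Any non-zero
# off-diagonal cell would be a record where the two flags disagree.
table = pd.crosstab(df["CIGARETTE_USE"], df["ALCOHOL_USE"])
print(table)

# Direct count of rows where the two flags differ.
n_mismatch = int((df["CIGARETTE_USE"] != df["ALCOHOL_USE"]).sum())
print("rows where the flags differ:", n_mismatch)
```

If the mismatch count is zero on the real data too, the two columns carry identical information in this subset, which would point to a recording or loading issue rather than a real pattern.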
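
For the grouped weight-gain frequencies, pandas' `cut` function looks like the right tool. Here is a minimal sketch, assuming the values are whole pounds between 1 and 99; the series below is invented sample data, and `WEIGHT_GAIN_POUNDS` is a placeholder column name.

```python
import pandas as pd

# Invented sample values (pounds); the real column would come from the dataset.
weight_gain = pd.Series([5, 10, 11, 20, 25, 34, 40, 99], name="WEIGHT_GAIN_POUNDS")

# Bin edges 0, 10, ..., 100. With right=True each bin is (lo, hi],
# so (0, 10] holds 1-10 lbs, (10, 20] holds 11-20 lbs, and so on.
edges = list(range(0, 101, 10))
labels = [f"{lo + 1}-{lo + 10}" for lo in edges[:-1]]
groups = pd.cut(weight_gain, bins=edges, labels=labels, right=True)

# sort=False keeps the bins in their natural order instead of by frequency.
counts = groups.value_counts(sort=False)
percentages = counts / counts.sum() * 100

print(counts)
print(percentages)
```

The bin width of 10 lbs is just the example from the bullet above; changing the step in `range` gives coarser or finer groups.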
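
As for querying BigQuery directly from Python, the `google-cloud-bigquery` client library seems to be the usual route. The sketch below is untested against my data and rests on several assumptions: that the dataset is the public table `bigquery-public-data.samples.natality`, that its columns are named `year`, `month`, `state`, `apgar_5min`, `cigarette_use`, `alcohol_use` and `weight_gain_pounds`, and that Google Cloud credentials are already set up on the machine. The helper names `build_query` and `fetch_births` are my own.

```python
def build_query(table="bigquery-public-data.samples.natality"):
    """Return a parameterized SQL string (table and column names are assumptions)."""
    return f"""
        SELECT apgar_5min, cigarette_use, alcohol_use, weight_gain_pounds
        FROM `{table}`
        WHERE year = @year AND month = @month AND state = @state
    """


def fetch_births(year, month, state):
    """Run the query and return a pandas DataFrame.

    Needs the google-cloud-bigquery package and valid credentials;
    the import is kept inside the function so build_query() works without it.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("year", "INT64", year),
            bigquery.ScalarQueryParameter("month", "INT64", month),
            bigquery.ScalarQueryParameter("state", "STRING", state),
        ]
    )
    return client.query(build_query(), job_config=job_config).to_dataframe()


# Printing the SQL is safe without credentials.
print(build_query())
```

Something like `df = fetch_births(2003, 1, "VA")` should then return the same kind of subset I exported by hand, assuming the state is stored as a two-letter code. Dropping conditions from the WHERE clause would widen the query toward the full dataset.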