Sunday, December 11, 2016

Assignment 2: First Python Program

In this assignment, we are asked to publish our first Python program, which loads our dataset and provides frequency distributions for our chosen variables. My hypothesis for the first assignment was that, in the US natality statistics, prenatal cigarette smoking, alcohol consumption, and gestational weight gain will each show a statistically significant correlation with lower Apgar scores. (Apgar scores are obtained at 1 minute and again at 5 minutes after delivery to assess newborn health based on the infant's activity, pulse, grimace reflex, appearance, and respiration.)

A note about my dataset, the record of all US births from 1969 to 2008 hosted on Google's BigQuery:
  • It's really, really, really big. Much bigger than the datasets provided for the course. I only really thought about this when I tried to create a .csv file and realized that would be impossible for a dataset of this size. Oops! Therefore, for the purpose of this exercise, I selected a subset of rows: all births in January of 2003 in the state of Virginia. This yields 8005 unique records.

And here is the output from running this code:

Apgar scores by COUNT
---------------------
0.0        3
1.0       13
2.0        9
3.0        5
4.0       10
5.0       21
6.0       44
7.0      124
8.0      525
9.0     6958
10.0     280
Name: APGAR_5MIN, dtype: int64

Apgar scores by PERCENTAGE
--------------------------
APGAR_5MIN
0.0      0.037477
1.0      0.162399
2.0      0.112430
3.0      0.062461
4.0      0.124922
5.0      0.262336
6.0      0.549656
7.0      1.549032
8.0      6.558401
9.0     86.920675
10.0     3.497814
dtype: float64

Cigarette use by COUNT (True/False)
-----------------------------------
False    7383
True      622
Name: CIGARETTE_USE, dtype: int64

Cigarette use by PERCENTAGE (True/False)
----------------------------------------
CIGARETTE_USE
False    92.229856
True      7.770144
dtype: float64

Alcohol use by COUNT (True/False)
---------------------------------
False    7383
True      622
Name: ALCOHOL_USE, dtype: int64

Alcohol use by PERCENTAGE (True/False)
--------------------------------------
ALCOHOL_USE
False    92.229856
True      7.770144
dtype: float64
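
For anyone following along, tables like the ones above can be produced with a few lines of pandas. What follows is a minimal sketch rather than the exact program behind this output: the helper name `frequency_tables`, the tiny stand-in DataFrame, and the file name in the comment are all assumptions made for illustration. Only the column names come from the output above. The percentages are divided by the total number of rows, which is consistent with the figures above (6958 of 8005 records is about 86.92%).

```python
import pandas as pd


def frequency_tables(data, column):
    """Return (counts, percentages) for one column, ordered by value."""
    # Missing values are left out of the counts.
    counts = data[column].value_counts(sort=False, dropna=True).sort_index()
    # Percentages use every row as the denominator, so rows with a missing
    # value make the percentages add up to slightly less than 100.
    percentages = counts * 100 / len(data)
    return counts, percentages


# Tiny stand-in data. A real run would load the exported subset instead,
# e.g. pd.read_csv("natality_va_2003_01.csv") (hypothetical file name).
sample = pd.DataFrame({
    "APGAR_5MIN": [9.0, 9.0, 8.0, 10.0, None],
    "CIGARETTE_USE": [False, True, False, False, False],
})

for column in ["APGAR_5MIN", "CIGARETTE_USE"]:
    counts, percentages = frequency_tables(sample, column)
    print(column, "by COUNT")
    print(counts)
    print(column, "by PERCENTAGE")
    print(percentages)
```

Sorting by index keeps the Apgar scores in order from 0 to 10 instead of ordering them by frequency, which makes the distribution easier to read.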

A few comments on these distributions:

  • Right away I thought, "So weird! Exactly the same number of people who drank during pregnancy also smoked!" But that's a really precise coincidence, isn't it? When I went back and queried the BigQuery db, I found that for each year that cigarette_use was recorded (starting in 2003), the *exact* same records were recorded TRUE for alcohol_use. This holds for records in various months of various years from 2003-2008 and for various states. In fact, in querying the entire db, I couldn't find any records that were TRUE for alcohol_use but not for cigarette_use, or vice versa. This is very odd, so I need to do some more research on these variables and how they have been recorded.
  • Low Apgar scores are quite uncommon. For January births in Virginia in the year 2003, almost 87% of all newborn infants received a 5-minute Apgar score of 9. That's great news for newborns! However, it also makes sense of why people say that Apgar scores are not a clear indication of early childhood health, and I am starting to think that they are not as closely related to maternal health indicators as I had previously thought.
  • I did not run a distribution for weight gain. Well, I did, but the values were spread across 1 to 99 lbs of maternal weight gain. Therefore, I need to figure out how to perform frequency analysis on grouped values; for example, the number of mothers who gained 1-10 lbs, or 11-20 lbs, and so on.
  • Finally, time permitting, I'd like to figure out how to use the BigQuery API to query the dataset directly from my Python program, so that I can perform queries against the entire dataset. If I can't figure that out for this course, then I will select 1-3 states for a particular month in a particular year to get a set of observations large enough for analysis.
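
Two of the follow-ups above, checking whether the smoking and alcohol flags are really identical and grouping weight gain into 10 lb ranges, could be sketched in pandas roughly as below. This is only a sketch under stated assumptions: the rows are made-up stand-ins, and `WEIGHT_GAIN_POUNDS` is an assumed column name that should be replaced with whatever the exported file actually uses.

```python
import pandas as pd

# Made-up stand-in rows; WEIGHT_GAIN_POUNDS is an assumed column name.
sample = pd.DataFrame({
    "CIGARETTE_USE": [True, False, False, True, False, False],
    "ALCOHOL_USE": [True, False, False, True, False, False],
    "WEIGHT_GAIN_POUNDS": [5, 10, 11, 20, 35, 99],
})

# 1. Compare the two flags row by row. If the columns are exact copies,
#    the cross-tabulation has zeros everywhere off the diagonal.
flag_table = pd.crosstab(sample["CIGARETTE_USE"], sample["ALCOHOL_USE"])
mismatches = int((sample["CIGARETTE_USE"] != sample["ALCOHOL_USE"]).sum())
flags_identical = sample["CIGARETTE_USE"].equals(sample["ALCOHOL_USE"])
print(flag_table)
print("Rows where the flags differ:", mismatches)

# 2. Group weight gain into 10 lb ranges. pd.cut uses right-closed
#    intervals by default, so the bins are (0, 10], (10, 20], ..., (90, 100].
bin_edges = list(range(0, 101, 10))
labels = ["{}-{}".format(low + 1, low + 10) for low in bin_edges[:-1]]
weight_groups = pd.cut(sample["WEIGHT_GAIN_POUNDS"], bins=bin_edges, labels=labels)
group_counts = weight_groups.value_counts(sort=False).sort_index()
print(group_counts)
```

With right-closed bins, a gain of exactly 10 lbs is counted in the 1-10 group and 11 lbs in the 11-20 group, which matches the ranges described above.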