Coursework: Data Management and Visualization

Sunday, December 18, 2016

Assignment 3: Organizing variables

My hypothesis for the first assignment was that for the US data available from the natality statistics, there will be a statistically significant correlation between prenatal cigarette smoking, alcohol consumption, gestational weight gain, and lower Apgar scores. (Apgar scores are obtained at 1 minute past delivery time and again at 5 minutes past delivery time, to provide an assessment of newborn health based upon infant activity, pulse, grimace reflex, appearance, and respiration.)

Due to the size of this data set (all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008) I selected a subset which could be downloaded to a csv file and used in this course analysis: all births in January of 2003 in the state of Virginia.

Data management decisions:

In assignment #2, I ran frequency distributions for 3 variables: Apgar scores at 5 minutes, alcohol use, and cigarette use. Apgar scores are well defined on a scale of 0-10, where 10 is the best score and 0 is the worst score. Therefore I didn't feel that any recoding was required here.

Alcohol use and cigarette use are recorded as "True" or "False", and there is no missing data for this list of observations. Therefore no recoding, removal of missing data, or combining into secondary variables is needed.

Originally I had wanted to factor in the amount of weight gained by the mother during pregnancy. However, values ranged anywhere from 1 to 99 lbs, so a frequency distribution for each individual pound gained didn't make sense. Instead, I grouped weight gain into categories.
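A minimal sketch of this kind of grouping, assuming pandas and its `pd.cut` function. The bin edges are inferred from the interval labels in the output further down, and the sample values are hypothetical, not taken from the real data.

```python
import pandas as pd

# Hypothetical sample of weight gain values in pounds (not the real data).
weight_gain = pd.Series([5, 20, 21, 35, 40, 58, 75], name="WEIGHT_GAIN_POUNDS")

# Bin edges assumed from the interval labels (0, 20], (20, 40], and so on.
bins = [0, 20, 40, 60, 80]

# pd.cut assigns each value to a right-closed interval; values that fall
# outside the bins (for example above 80) become NaN and are not counted.
groups = pd.cut(weight_gain, bins=bins)

# Count the members of each group and list the groups in bin order.
group_counts = groups.value_counts().sort_index()
print(group_counts)
```

Because any value outside the bin edges is dropped from the counts, the group totals can add up to fewer rows than the data set contains.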

My categorization code:

The output:

Full output of my frequency distributions is here:

Apgar scores by COUNT
---------------------
0.0 3
1.0 13
2.0 9
3.0 5
4.0 10
5.0 21
6.0 44
7.0 124
8.0 525
9.0 6958
10.0 280
Name: APGAR_5MIN, dtype: int64

Apgar scores by PERCENTAGE
--------------------------
APGAR_5MIN
0.0 0.037477
1.0 0.162399
2.0 0.112430
3.0 0.062461
4.0 0.124922
5.0 0.262336
6.0 0.549656
7.0 1.549032
8.0 6.558401
9.0 86.920675
10.0 3.497814
dtype: float64

Cigarette use by COUNT (True/False)
-----------------------------------
False 7383
True 622
Name: CIGARETTE_USE, dtype: int64

Cigarette use by PERCENTAGE (True/False)
----------------------------------------
CIGARETTE_USE
False 92.229856
True 7.770144
dtype: float64

Alcohol use by COUNT (True/False)
---------------------------------
False 7383
True 622
Name: ALCOHOL_USE, dtype: int64

Alcohol use by PERCENTAGE (True/False)
--------------------------------------
ALCOHOL_USE
False 92.229856
True 7.770144
dtype: float64

Weight gain (lbs) by groups
---------------------------
(0, 20] 1659
(20, 40] 4624
(40, 60] 1220
(60, 80] 152
Name: WEIGHT_GAIN_POUNDS, dtype: int64

Sunday, December 11, 2016

Assignment 2: First Python Program

In this assignment, we are asked to publish our first python program that loads our dataset and provides frequency distributions for our chosen variables. My hypothesis for the first assignment was that for the US data available from the natality statistics, there will be a statistically significant correlation between prenatal cigarette smoking, alcohol consumption, gestational weight gain, and lower Apgar scores. (Apgar scores are obtained at 1 minute past delivery time and again at 5 minutes past delivery time, to provide an assessment of newborn health based upon infant activity, pulse, grimace reflex, appearance, and respiration.)

A note about my dataset: it is the record of all US births from 1969 - 2008, hosted on Google's BigQuery.

It's really, really, really big. Much bigger than the datasets provided for the course. I only really thought about this when I went to go create a .csv file and realized it would be impossible for a dataset of this size. Oops! Therefore, for the purpose of this exercise, I selected a subset of rows: all births in January of 2003 in the state of Virginia. This yields 8005 unique records.
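A minimal sketch of how frequency distributions like the ones below can be produced with pandas. The column names are taken from the output; the inline CSV text is a hypothetical stand-in for the exported file, and reading it through `io.StringIO` is only there to keep the example self-contained.

```python
import io

import pandas as pd

# Hypothetical stand-in for the exported CSV; the real file has 8005 rows.
csv_text = """APGAR_5MIN,CIGARETTE_USE,ALCOHOL_USE
9,False,False
9,False,False
8,True,True
10,False,False
"""
data = pd.read_csv(io.StringIO(csv_text))

# Frequency of each score as a raw count, ordered by score.
apgar_counts = data["APGAR_5MIN"].value_counts().sort_index()

# The same distribution expressed as a percentage of all rows.
apgar_percent = data["APGAR_5MIN"].value_counts(normalize=True).sort_index() * 100

print("Apgar scores by COUNT")
print(apgar_counts)
print("Apgar scores by PERCENTAGE")
print(apgar_percent)
```

With a real export, `pd.read_csv` would take the path to the .csv file instead of the in-memory text.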

Here is a link to my code.

And here is the output from running this code:

Apgar scores by COUNT
---------------------
0.0 3
1.0 13
2.0 9
3.0 5
4.0 10
5.0 21
6.0 44
7.0 124
8.0 525
9.0 6958
10.0 280
Name: APGAR_5MIN, dtype: int64

Apgar scores by PERCENTAGE
--------------------------
APGAR_5MIN
0.0 0.037477
1.0 0.162399
2.0 0.112430
3.0 0.062461
4.0 0.124922
5.0 0.262336
6.0 0.549656
7.0 1.549032
8.0 6.558401
9.0 86.920675
10.0 3.497814
dtype: float64

Cigarette use by COUNT (True/False)
-----------------------------------
False 7383
True 622
Name: CIGARETTE_USE, dtype: int64

Cigarette use by PERCENTAGE (True/False)
----------------------------------------
CIGARETTE_USE
False 92.229856
True 7.770144
dtype: float64

Alcohol use by COUNT (True/False)
---------------------------------
False 7383
True 622
Name: ALCOHOL_USE, dtype: int64

Alcohol use by PERCENTAGE (True/False)
--------------------------------------
ALCOHOL_USE
False 92.229856
True 7.770144
dtype: float64

A few comments on these distributions:

Right away I thought, "so weird! exactly the same number of people who drank during pregnancy also smoked!". But that's a really precise coincidence, isn't it? When I went back and queried the BigQuery db here, I found that for each year that cigarette_use was recorded (starting in 2003) the *exact* same records were recorded TRUE for alcohol_use. This holds for records in various months of various years from 2003-2008 for various states. In fact, in querying the entire db, I couldn't find any records that were TRUE for alcohol_use but not for cigarette_use, or vice versa. Very odd. I therefore need to do some more research on the variables and how they have been recorded.
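One way to run this kind of check on a downloaded subset is sketched below, assuming pandas. The rows are hypothetical; only the column names come from the output above.

```python
import pandas as pd

# Hypothetical records; in the real subset both columns had 622 True values.
data = pd.DataFrame({
    "CIGARETTE_USE": [True, False, False, True, False],
    "ALCOHOL_USE": [True, False, False, True, False],
})

# Cross-tabulate the two flags: the off-diagonal cells count the rows
# where one flag is True and the other is False.
table = pd.crosstab(data["CIGARETTE_USE"], data["ALCOHOL_USE"])
print(table)

# Count rows where the flags differ; zero means the columns are identical.
mismatches = int((data["CIGARETTE_USE"] != data["ALCOHOL_USE"]).sum())
print("Rows where the flags disagree:", mismatches)
```

If the off-diagonal cells are all zero, the two columns carry the same information and cannot be analyzed as separate explanatory variables.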

Low Apgar scores are quite uncommon. For January births in Virginia in the year 2003, almost 87% of all newborn infants received an Apgar score of 9. That's great news for newborns! However, it then makes sense why people say that Apgar scores are not a clear indication of early childhood health, and I am starting to think they are therefore not as closely related to maternal health indicators as I had previously thought.

I did not run a distribution for weight gain. Well, I did, but it was spread among 1 to 99lbs of maternal weight gain. Therefore, I need to figure out how to perform frequency analysis on grouped values; for example, the number of mothers who gained 1-10lbs, 11-20lbs, and so on.

Finally, time permitting, I'd like to figure out how to use the BigQuery API to query the dataset directly from my python program, so that I can perform queries against the entire dataset. If I can't figure that out for this course, then I will select 1-3 states for a particular month in a particular year to create a large enough set of observations for analysis.
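A rough sketch of what querying BigQuery from Python might look like, assuming the `google-cloud-bigquery` client library, configured Google Cloud credentials, and the public table name `bigquery-public-data.samples.natality`. The table name and the helper functions `build_query` and `fetch_births` are assumptions for illustration, not part of the original program.

```python
def build_query(state="VA", year=2003, month=1):
    """Return SQL selecting the variables used in these posts for one state and month."""
    # Table and column names are assumptions about the public natality sample.
    # Plain string formatting is fine for a fixed sketch like this one, but it
    # should not be used with untrusted input.
    return (
        "SELECT apgar_5min, cigarette_use, alcohol_use, weight_gain_pounds "
        "FROM `bigquery-public-data.samples.natality` "
        f"WHERE state = '{state}' AND year = {year} AND month = {month}"
    )


def fetch_births(sql):
    """Run the query and return a pandas DataFrame (needs Google Cloud credentials)."""
    # Imported lazily so the query can be built without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    return client.query(sql).to_dataframe()


sql = build_query()
print(sql)
# births = fetch_births(sql)  # uncomment once credentials are configured
```

Filtering in SQL keeps the download small, so the same program could switch states, months, or years without exporting a new .csv file each time.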

Saturday, December 3, 2016

Assignment 1: Getting Started

Hi Everyone! In this first assignment we are asked to select a data set, develop a research question, perform a review of relevant publications, and finally create a hypothesis.

The Data

I have decided to select a data set available outside of the course: the "natality" data set made available on Google's BigQuery sample tables.

The natality data set describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008. This data is obtained from the CDC's Division of Vital Statistics which actually provides data sets from 1968 through 2015 via this page. (CDC Wonder data access portals also provide access to this data on this page for years 1995 - 2014).

The Question

So, when I initially saw the variables of this data set on the BigQuery schema page, I thought to myself, "interesting, they collect Apgar scores!". Having just had a baby last December, I knew that Apgar scores were obtained at 1 minute past delivery time and again at 5 minutes past delivery time, to provide an assessment of newborn health based upon infant activity, pulse, grimace reflex, appearance, and respiration.

Though Apgar scores are only recorded from 1978 - 2002 in the Vital Statistics database, I am interested in determining whether there is an association between Apgar scores and the variables that pertain to maternal health and well-being, including cigarette smoking, drinking, and weight gain during pregnancy.

Literature Review

Honestly, I would have "guessed" that there would be some type of correlation between maternal health and infant Apgar scores. However, when I searched for articles comparing Apgar scores with cigarette smoking, alcohol consumption, and/or maternal weight gain, I found that the following articles offer potentially conflicting findings on whether these factors correlate with Apgar scores:

Maternal cigarette smoking, psychoactive substance use, and infant Apgar scores (1982)

"A study of 1,709 mother/child pairs at Boston City Hospital examined whether maternal cigarette smoking, drinking, or the use of other psychoactive substances was associated with low infant Apgar scores"...."None of the substance use variables was significantly associated with low infant Apgar scores at 1 and 5 minutes. Other labor and delivery risks, such as short length of gestation, abnormal delivery presentation, placental abnormalities, nuchal cord, and exposure to general anesthesia during delivery, were associated with low Apgar scores."

Substance Use of Pregnant Women and Early Neonatal Morbidity: Where to Focus Intervention? (2010)

This article is based in Canada and examines "the relation between maternal smoking, alcohol consumption and drug dependence during pregnancy and early neonatal morbidity."
"Markers of neonatal morbidity were Apgar scores (<7 at 5 minutes postpartum) and resuscitation measures (2001-2005, N=191,686), and neonatal intensive care unit (NICU) admissions (2002-2005, N=154,924)."
"The main findings of this analysis are that smoking, daily or high alcohol consumption and drug dependency during pregnancy contribute to early neonatal morbidity and that eliminating maternal smoking would prevent 10-15% of each of the three markers of neonatal morbidity. "

A Prospective Study of Smoking and Pregnancy (1970)

"A 50% increase of prematurity rate was registered among smoking women compared with non-smoking women" however "No effect of smoking on the mean Apgar score of surviving, non-malformed children was seen."

The Seattle longitudinal prospective study on alcohol and pregnancy. (1981)

"An unselected sample of 1529 women (predominantly white, married, and middle-class) were interviewed during pregnancy regarding their use of alcohol, nicotine, caffeine, drugs, and other variables. Subsets of offspring were examined to assess the relationship of self-reported maternal alcohol use to infant health and development."
"The following are among those outcomes significantly related to increase maternal alcohol use after adjusting for smoking and other variables: smaller infant size (birth weight, length and head circumference); lower Apgar scores; poorer neonatal habituation; decreased sucking pressure; increased tremulousness and head-turns-to-left; decreased vigorous activity; and a higher frequency of minor dysmorphic characteristics combined with low birth weight and microcephaly."

Combined associations of prepregnancy body mass index and gestational weight gain with the outcome of pregnancy (2008)

"High and very high GWG [gestational weight gain] added to the associations of high prepregnancy BMI with cesarean delivery and were strongly associated with high postpartum weight retention. Moreover, greater weight gains and high maternal BMI decreased the risk of growth restriction and increased the risk of the infant's being born large-for-gestational-age or with a low Apgar score."

My Hypothesis

I propose that for the US data available from the natality statistics, there will be a statistically significant correlation between prenatal cigarette smoking, alcohol consumption, gestational weight gain, and lower Apgar scores.

Disclaimers: My posting/thoughts/opinions/analysis are my own, and do not reflect any active project at Google. I'm not a doctor, social scientist, or any type of "health expert" - my analysis here is purely for the opportunity to play with data sets, Python, and statistical analysis. I don't expect anyone but my class peers to read this blog but if you stumble upon it, please treat this for what it is: an academic exercise for personal development. Thanks!