Kate Willyard

Academic Blog

"If we knew what we were doing, it would not be called research, would it?" -Albert Einstein

Comparing Census Population Data, Part Three: Downloading Bulk Data from Census FTP using Python Programs

6/29/2018


 
This post describes the second step of my research to compare Census population data over time: downloading bulk data from the Census FTP site using Python programs. See Comparing Census Population Data, Part Two for a description of Step One (studying documentation to determine the feasibility of variable comparison over time), the work completed before this step. See Comparing Census Population Data, Part One for an introduction to my research project.

The Census Bureau makes its data available through its FTP site. Rather than pointing and clicking to reach each file, downloading it, and unzipping it, I wrote Python programs to automate the process. I only grabbed the data segments that I needed, which were determined in Step One.
 
The first Python program I wrote downloads the national Decennial Census 2000 Summary File 1 data, the national Decennial Census 2000 Summary File 3 data, the state Decennial Census 2010 Summary File 1 data, and the state American Community Survey (2008-2012) data. We chose to use the American Community Survey five-year estimates in order to increase the sample size. In the code, the user needs to set the paths where they want to download the data (dc00Path, dc10Path, and acs12Path), the three-digit codes representing the segments to be downloaded from the Decennial Census 2000 Summary File 1 (dc00sf1List), the Decennial Census 2000 Summary File 3 (dc00sf3List), and the Decennial Census 2010 Summary File 1 (dc10sf1List), and the four-digit codes representing the segments to be downloaded from the American Community Survey (acsList). This Python program (01-downloadCensusData_20180424_p1.py) is available by clicking here.
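The download-and-unzip loop described above can be sketched roughly as follows. This is a minimal illustration, not the program linked above: the base URL is a placeholder, the file-name pattern (prefix, segment code, suffix) is an assumption, and the helper names are hypothetical. Only the idea of looping over a list of segment codes comes from the post.

```python
from __future__ import annotations

import io
import urllib.request
import zipfile
from pathlib import Path

# Placeholder only: replace with the real Census folder you are downloading from.
BASE_URL = "https://example.org/census/dc2000/sf3"


def segment_filename(prefix: str, segment: str, suffix: str) -> str:
    """Build a zip file name from a segment code (the pattern is an assumption)."""
    return f"{prefix}{segment}{suffix}.zip"


def extract_zip_bytes(payload: bytes, dest: Path) -> list[str]:
    """Unzip an in-memory archive into dest and return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)
        return archive.namelist()


def download_segments(segments: list[str], dest: Path,
                      prefix: str = "us", suffix: str = "_uf3") -> None:
    """Download and unzip each requested segment (requires network access)."""
    for segment in segments:
        url = f"{BASE_URL}/{segment_filename(prefix, segment, suffix)}"
        with urllib.request.urlopen(url) as response:
            extract_zip_bytes(response.read(), dest / segment)
```

With a real base URL in place, a call such as `download_segments(dc00sf3List, Path(dc00Path))` would fetch each listed segment into its own subfolder. Keeping the unzip step separate from the network call makes it possible to check the extraction logic without touching the server.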
 
The second Python file I wrote is really more of a documentation file. It describes several templates I downloaded from Census websites. I downloaded and used these templates so I didn't have to write out the dictionaries for each of the segments in order to read the data into Stata. This text file (01-downloadCensusData_20180524_p2.py) is available by clicking here.
 
Unfortunately, the national Decennial Census 2000 data is for larger levels of geography and does not include block and block group estimates. As such, I had to write a Python program that loops through the state folders to download the Decennial Census 2000 Summary File 1 and 3 data. In the code, the user needs to set the path where they want to download the data (dcPath) and the three-digit codes representing the segments to be downloaded from the Decennial Census 2000 Summary File 1 and 3 (dc00sf3List). This Python program (01-downloadCensusData_20180528_p3.py) is available by clicking here.
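Looping through state folders amounts to pairing every state with every requested segment. The sketch below shows one way to build that list of jobs while skipping files that are already on disk, which helps when a long download is interrupted. The folder layout and the `build_jobs` helper are assumptions made for illustration, not the structure of the linked program.

```python
from itertools import product
from pathlib import Path


def build_jobs(states, segments, dest):
    """Return (state, segment, target_path) for every file not yet downloaded.

    Assumes one subfolder per state and one zip file per segment; both are
    illustrative choices, not the layout of the Census server.
    """
    jobs = []
    for state, segment in product(states, segments):
        target = Path(dest) / state / f"{segment}.zip"
        if not target.exists():
            jobs.append((state, segment, target))
    return jobs
```

For example, `build_jobs(["Alabama", "Alaska"], ["001", "054"], dcPath)` yields four jobs on a fresh run, and fewer on a rerun if some of the zip files already exist.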
 
The fourth Python code I wrote downloads the Census Block 2000 to 2010 Crosswalk for every state and territory in the United States. In the code, the user needs to set/revise the path where they want to download the data (dc00Path; obviously, if I were doing this again for the pure purpose of sharing this information, I would have named this differently, something like geoPath, but it works). This Python code (01-downloadCensusData_20180604_p4.py) is available by clicking here.  

The fifth Python program I wrote downloads the American Community Survey (2008-2012) data segment 76. I had forgotten to list that file segment in the first Python program I wrote, so I downloaded it using the Python program (01-downloadCensusData_20180607_p5.py) available by clicking here.

The sixth Python program I wrote downloads the Decennial Census 2000 Summary File 3 data segment 54. I had forgotten to list that file segment in the third Python program I wrote, so I downloaded it using the Python program (01-downloadCensusData_20180608_p6.py) available by clicking here.

These six programs were run to download and document all of the data our team thinks it needs to develop the contextual variables required for our analysis.

I prefer downloading data from FTP servers over using cleaned second-hand data because I am a control freak and like knowing everything that has been done to the data before analyzing it.

Speaking of my OCD issues, you will notice that all my programs start with 01 and end with an eight-digit code representing the date the program was run, followed by the part number. I tend to use this format for the first step in all my projects (except when I use SAS, which won't allow me to save a program that starts with a number, so instead I start it with s01). The first two digits mark the research step (01 is usually several .py files that complete the first step of my research, which is typically downloading datasets), the eight-digit code represents the date the program was run (formatted as the four-digit year followed by the two-digit month and two-digit day), and the final code (p1, p2, and so on) represents the part of the step. I do not typically separate parts of the same step into different files. This is something I do only for downloading original data, because I rarely download everything I need on the first shot. These techniques are just a small part of my overall data management strategy, which was introduced to me by Dr. Scott Long in a data management workshop hosted by the Texas Research Data Center. I really like this organizational strategy, as it makes it easy to recall everything I have done, years after the research is complete.
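One payoff of a strict naming scheme is that file names become machine-readable. The sketch below parses names like 01-downloadCensusData_20180424_p1.py into step, date, and part. The regular expression and the `parse_program_name` helper are my own illustration of the convention described above, including the optional leading "s" used for SAS programs.

```python
import re
from datetime import date, datetime
from typing import NamedTuple

# step (two digits, optional leading "s"), description, YYYYMMDD run date, part number
PATTERN = re.compile(
    r"^s?(?P<step>\d{2})-(?P<name>[A-Za-z]+)_(?P<date>\d{8})_p(?P<part>\d+)\.\w+$"
)


class ProgramName(NamedTuple):
    step: int
    name: str
    run_date: date
    part: int


def parse_program_name(filename: str) -> ProgramName:
    """Split a program file name into research step, description, run date, and part."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {filename}")
    run_date = datetime.strptime(match["date"], "%Y%m%d").date()
    return ProgramName(int(match["step"]), match["name"], run_date, int(match["part"]))
```

Because the parsed result is a tuple of step, name, date, and part, sorting a folder listing with `sorted(map(parse_program_name, names))` puts the programs back in the order they were run within each step.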

Comparing Census Population Data, Part Two: Studying Documentation to Determine Feasibility of Variable Comparison

6/21/2018


 
This post describes the first step of comparing Census population data over time: studying documentation to determine the feasibility of variable comparison over time. See Comparing Census Population Data, Part One to get introduced to this project.

As described in the previous post, not all of the population tabulations that are available in the 2000 Decennial Census are available in the 2010 Decennial Census and American Community Survey. Additionally, questions can be asked in different ways, which can make them incomparable. 

In addition to determining data comparability, documentation must also be studied to determine variable names, table names, and segment identifiers. In order to accomplish this task, I created a table with the following fields:

- Main Fields: American Community Survey Variable Name, Other Variable Name (If Different)
- Decennial Census 2000, Summary File 1: Table Number, Table Name, Table ID, Segment, Max Size, Universe, Lowest Level of Geography
- Decennial Census 2000, Summary File 3: Table Number, Table Name, Table ID, Segment, Max Size, Universe, Lowest Level of Geography
- Decennial Census 2010, Summary File 1: Table Number, Table Name, Table ID, Segment, Max Size, Universe, Lowest Level of Geography
- American Community Survey, 2010-2014: Table Number, Table Name, Table ID, Segment, Max Size, Universe, Lowest Level of Geography
- Notes: When Comparing 2012 ACS to 2000 DC, When Comparing 2012 ACS to 2010 DC, When Comparing 2012 ACS to 2011 ACS
Click here to access a template I created.

The next step was identifying each variable and writing down the information in the template. The table number, table contents, data dictionary reference name, segment, max size, and smallest summary file level are available in the technical documentation, as described below:

- Decennial Census 2000 Summary File 1: Starts on page 227 (click here to access the document)
- Decennial Census 2000 Summary File 3: Starts on page 422 (click here to access the document)
- Decennial Census 2010 Summary File 1: Starts on page 183 (click here to access the document)
- American Community Survey 2008-2012 Summary File: Starts on page 46 (click here to access the document)
After checking the data documentation/code books to find the variable/table/segment details, I looked up the variables using the following three different Census tools and used the information to fill out the comparability notes in the table:

- https://www.census.gov/programs-surveys/acs/guidance/comparing-acs-data/2012.html
- https://www.census.gov/acs/www/guidance/comparing-acs-data/acscensus-table-lookup/index.php
- https://www.census.gov/geo/maps-data/data/relationship.html

After looking up all the information, I color-coded the comparability notes in the table: green for variables with no comparability concerns, orange for variables with some comparability concerns, and red for variables that are not comparable. Click here to access the completed table.

To make it easier to see which variables can be used and at what level, I made a simplified table listing our variables of interest, whether each can be broken down by race, whether it can be measured using full population count data or sample data, whether it is comparable, the smallest level of geography available, and associated notes. You can find this by clicking here.
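If the simplified table is ever needed inside a program rather than a spreadsheet, the green/orange/red coding maps naturally onto a small data structure. The sketch below is one hypothetical way to represent it. The class and field names are my own, and any rows passed in would have to come from the completed table.

```python
from dataclasses import dataclass
from enum import Enum


class Comparability(Enum):
    GREEN = "no comparability concerns"
    ORANGE = "some comparability concerns"
    RED = "not comparable"


@dataclass(frozen=True)
class VariableNote:
    name: str
    comparability: Comparability
    lowest_geography: str  # e.g. "block", "block group", or "tract"
    by_race: bool          # whether the variable can be broken down by race


def usable(notes, allow_orange=False):
    """Return the names of variables that pass the chosen comparability threshold."""
    allowed = {Comparability.GREEN}
    if allow_orange:
        allowed.add(Comparability.ORANGE)
    return [note.name for note in notes if note.comparability in allowed]
```

Calling `usable(notes)` keeps only the green variables, while `usable(notes, allow_orange=True)` also admits those with some comparability concerns, so the stricter choice is the default.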

This completed my first step of checking the documentation to determine feasibility of comparing variables over time. 


Comparing Census Population Data, Part One: Project Introduction

6/4/2018


 
I am currently working on a project that compares Census population data from 2000 to 2010 for the entire United States at the lowest geographic level available, whether that is the Census block, the Census block group, or the Census tract (listed smallest to largest). We also want to compare Census population data from 2000 and 2010 at the level of the county and the Core Based Statistical Area (CBSA), a group of counties encompassing a metropolitan area. Some of our variables, such as the race of individuals and single-mother households, can be obtained from the 2000 and 2010 Decennial Census short form, which is required for 100% of the population and is available at the block level. Others, such as household income and poverty levels, come from the 2000 Decennial Census long form (about a 17% sample of the total population) and the 2008-2012 American Community Survey (about a 12.5% sample of the total population). Tabulations from both of these sample-based sources are only available at the level of the block group or tract.

​As we attempt to develop comparable population estimates across space, we must be aware that pulling data from multiple data sets over different periods of time can create some issues with consistency. I discuss some of these issues below.

Common Issues When Comparing Census Population Data Over Time

There are three common concerns when comparing Census population data at a low level of geography:

(1) Changes in Questions Asked- A question that is asked in one decade might not be asked in the same way in another, or it might not even be asked at all. Furthermore, the categories, coding scheme, etc., for the answers to the questions might not be the same. These differences create challenges when comparing data over time.

(2) Changes in Tabulations that are Publicly Available- To maintain confidentiality, individual micro-data is only available to approved researchers for approved projects within a restricted Federal Statistical Research Data Center (FSRDC). For the general public, data is available as a series of tabulations (summed totals or averages) by block, block group, tract, or another level of geography. Census might release a table for a particular variable in one decade (such as the number of households in a block group that are Hispanic and in poverty), but that table is not guaranteed to be available in another decade.

(3) Changes in the Geographic Space included in a Block/Block Group/Tract- The United States Census breaks up the geographic space of the United States into a hierarchy of polygons, each representing a specific geographic space. Blocks are small geographic areas bounded by visible features, such as streets and railroad tracks. Block groups are clusters of blocks representing anywhere from 600 to 3,000 people (in 2010). Tracts are clusters of block groups representing anywhere from 1,200 to 8,000 people (in 2010). Counties are clusters of tracts. States are clusters of counties. However, since blocks, block groups, and tracts are established based on the number of people within the area, their boundaries change significantly in areas where there are large population changes. For example, say you are looking at an area in rural Arizona in 2000 that is relatively remote. This space might be made up of one tract and a few block groups and blocks. Then you might look at this same area in 2010, after a bunch of retirement homes were built there and it became a populated suburban area. This space might now be made up of four tracts and numerous block groups and blocks. Since Census blocks, block groups, and tracts do not consistently represent the same geographic space, we face further challenges when comparing community data.
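The nesting of blocks within block groups, tracts, counties, and states is mirrored in the standard 15-digit Census block identifier: two digits for the state, three for the county, six for the tract, and four for the block, with the first block digit identifying the block group. The sketch below splits such an identifier into its parts. The helper name and the example identifier are only illustrations and are not taken from the project files.

```python
def split_block_geoid(geoid: str) -> dict:
    """Split a 15-digit block identifier into its nested geographic codes."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError("expected a 15-digit block identifier")
    return {
        "state": geoid[:2],        # 2-digit state FIPS code
        "county": geoid[2:5],      # 3-digit county code within the state
        "tract": geoid[5:11],      # 6-digit tract code within the county
        "block_group": geoid[11],  # first digit of the block code
        "block": geoid[11:15],     # 4-digit block code within the tract
    }
```

Because each level is a prefix of the next, truncating the identifier to its first 12, 11, or 5 digits gives the enclosing block group, tract, or county, which is what makes it straightforward to sum block counts up to larger areas within a single Census year. It does not solve the boundary changes between 2000 and 2010, which is why a crosswalk is still needed.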

In order to deal with these concerns, I followed the processes described below.

The Processes of Comparing Census Data Over Time

In order to compare Census data over time, several different steps were taken.

  1. Studying Technical Documentation to Determine Feasibility of Variable Comparison
  2. Downloading Bulk Data from Census FTP using Python
  3. Reading Census Data to Stata
  4. Preparing Data for Census Crosswalk
  5. Completing the Crosswalk
  6. Cleaning the Data
  7. Assessing the Crosswalk
  8. Adding CBSA Data

​I will describe these steps in more detail, provide the documentation necessary to replicate my research, and describe things I would have done differently in future posts.

    Author

    Kate Willyard is a political and economic sociologist interested in human organization and the environment.
