Kate Willyard

Academic Blog

"If we knew what we were doing, it would not be called research, would it?" -Albert Einstein

Comparing Census Population Data, Part Five: Preparing Census Data for the Census 2000 to 2010 Geography Crosswalk

10/10/2018

This post describes the fourth step of my research to compare Census population data over time: Preparing Census Data for the Census Crosswalk. See Step Three: Reading Census Data into Stata, Step Two: Downloading Bulk Data from Census FTP using Python Programs, and Step One: Studying Documentation to Determine the Feasibility of Variable Comparison Over Time for a description of the work completed before this step. See Comparing Census Population Data, Part One for an introduction to the research project.

I started by importing the Census 2000 to Census 2010 Block Crosswalks into Stata, my preferred statistical program. This information came as a series of .txt files by state (downloaded from the Census FTP site during Step Two). Using Stata, I wrote a loop that imported and saved each state file. I decided to keep them as state files in order to minimize memory problems when running the crosswalk. You can find the code I used to manage the crosswalk (06-managecrosswalk_20180801.do) by clicking here.

Still within the loop, after I imported each file, I labeled each variable and ensured each unique identifier was in the correct format: state codes are two digits, county codes are three digits, tract codes are six digits, block group codes are one digit, and block codes are three digits, all stored as characters. However, I made a mistake and created block codes as four-digit codes that included the block group number.

Next, I created unique block, block group, tract, and county identifiers for 2000 and 2010. Unique Census block identifiers should be fifteen digits, but I made mine sixteen digits by doubling up on the block group number. In other words, my unique Census block identifier was created using the two-digit state code, followed by the three-digit county code, followed by the six-digit tract code, followed by the one-digit block group code, followed by the one-digit block group code again and the three-digit block code. Twelve-digit block group identifiers were created using the two-digit state code, followed by the three-digit county code, followed by the six-digit tract code, followed by the one-digit block group code. Eleven-digit tract identifiers were created using the two-digit state code, followed by the three-digit county code, followed by the six-digit tract code. Finally, five-digit county identifiers were created using the two-digit state code, followed by the three-digit county code.
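
To make the identifier structure concrete, here is a minimal Python sketch of the fifteen-digit version (not my Stata code). The function names and the example codes are made up for illustration. It assumes the block code is the three-digit code without the block group digit, which is what avoids the sixteen-digit mistake described above.

```python
def zero_pad(value, width):
    """Return a code as a zero-padded string of exactly `width` digits."""
    text = str(value).strip()
    if not text.isdigit() or len(text) > width:
        raise ValueError(f"cannot format {value!r} as a {width}-digit code")
    return text.zfill(width)


def build_ids(state, county, tract, block_group, block):
    """Build county, tract, block group, and block identifiers from their parts."""
    county_id = zero_pad(state, 2) + zero_pad(county, 3)   # 5 digits
    tract_id = county_id + zero_pad(tract, 6)              # 11 digits
    block_group_id = tract_id + zero_pad(block_group, 1)   # 12 digits
    block_id = block_group_id + zero_pad(block, 3)         # 15 digits
    return {
        "county": county_id,
        "tract": tract_id,
        "block_group": block_group_id,
        "block": block_id,
    }


# Illustrative codes only; they are not taken from the crosswalk files.
ids = build_ids(state=48, county=41, tract=201, block_group=1, block=5)
for level, identifier in ids.items():
    print(level, identifier, len(identifier))
```

Storing the codes as zero-padded strings rather than numbers is what keeps leading zeros from being dropped when the pieces are concatenated.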

After that, I made Census 2000 to Census 2010 block-level geographic weights by dividing the area of land shared by the two blocks by the area of the block in 2000. Then I made Census 2010 to Census 2000 block-level weights by dividing the same shared land area by the area of the block in 2010. Then, for each block, I counted the number of times it appeared in the crosswalk in 2000 and in 2010. Using that information, I identified the relationship type between each 2000 block and 2010 block: no relationship, one to one, many to one, one to many, or many to many. Finally, I saved the file for each state and then created a complete, national crosswalk.
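
The weight and relationship logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a translation of the .do file: the rows, field names, and areas are invented, and I am assuming that an unmatched block shows up as a row with a missing partner. Here "one to many" means one 2000 block that touches several 2010 blocks.

```python
from collections import Counter

# Hypothetical crosswalk rows: each row pairs one 2000 block with one 2010 block.
rows = [
    {"blk00": "A", "blk10": "X", "area_int": 10.0, "area00": 10.0, "area10": 10.0},
    {"blk00": "B", "blk10": "Y", "area_int": 6.0, "area00": 10.0, "area10": 6.0},
    {"blk00": "B", "blk10": "Z", "area_int": 4.0, "area00": 10.0, "area10": 12.0},
    {"blk00": "C", "blk10": "Z", "area_int": 8.0, "area00": 8.0, "area10": 12.0},
    # Assumed representation of a 2000 block with no 2010 partner.
    {"blk00": "D", "blk10": None, "area_int": 0.0, "area00": 5.0, "area10": 0.0},
]


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def relationship_type(n00, n10):
    """Classify a row from how often its 2000 and 2010 blocks appear."""
    if n00 == 0 or n10 == 0:
        return "no relationship"
    if n00 == 1 and n10 == 1:
        return "one to one"
    if n00 > 1 and n10 == 1:
        return "one to many"
    if n00 == 1 and n10 > 1:
        return "many to one"
    return "many to many"


# Count how many times each block appears in the crosswalk, by year.
count00 = Counter(r["blk00"] for r in rows if r["blk00"])
count10 = Counter(r["blk10"] for r in rows if r["blk10"])

for r in rows:
    r["wt_00_to_10"] = safe_ratio(r["area_int"], r["area00"])
    r["wt_10_to_00"] = safe_ratio(r["area_int"], r["area10"])
    r["relationship"] = relationship_type(count00[r["blk00"]], count10[r["blk10"]])
    print(r["blk00"], r["blk10"], r["wt_00_to_10"], r["wt_10_to_00"], r["relationship"])
```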

I replicated this process to create weights at the block group, tract, and county levels. However, when reading the code, you will notice one difference. To create the crosswalks for geographies larger than the block, I first created a dataset of information that was constant across time. Next I created a dataset of information for 2000 and a dataset of information for 2010. In these I kept just the unique Census identifier for the level of analysis, the area of the block, and the area of land that intersected between the two block areas, and then deleted duplicates. Then, by each unique Census identifier, I summed the areas to get correct areas for the unit of analysis, and connected the three datasets (constant, Census 2000, and Census 2010) by the unique Census identifier before creating the geographic weights.
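
Here is a simplified Python sketch of that aggregation, assuming fifteen-digit block identifiers so that the first eleven characters give the tract. The rows and areas are invented, and the function name is hypothetical. The point it illustrates is why duplicates have to be dropped first: a block's own area is repeated on every crosswalk row the block appears in, so summing the raw rows would overstate the area of the larger unit.

```python
from collections import defaultdict

# Hypothetical block-level crosswalk rows (identifiers and areas are made up).
rows = [
    {"blk00": "480410001001001", "blk10": "480410001011001",
     "area_int": 10.0, "area00": 10.0, "area10": 10.0},
    {"blk00": "480410001001002", "blk10": "480410001011002",
     "area_int": 6.0, "area00": 10.0, "area10": 6.0},
    {"blk00": "480410001001002", "blk10": "480410001021001",
     "area_int": 4.0, "area00": 10.0, "area10": 4.0},
]


def aggregate_weights(rows, prefix_len):
    """Build weights for a larger geography identified by an ID prefix."""
    # One area per block and year, so repeated blocks are counted once.
    area00 = {r["blk00"]: r["area00"] for r in rows}
    area10 = {r["blk10"]: r["area10"] for r in rows}

    # Sum block areas up to the larger unit for each year.
    unit_area00 = defaultdict(float)
    for block, area in area00.items():
        unit_area00[block[:prefix_len]] += area
    unit_area10 = defaultdict(float)
    for block, area in area10.items():
        unit_area10[block[:prefix_len]] += area

    # Sum intersecting areas for each pair of larger units.
    unit_int = defaultdict(float)
    for r in rows:
        unit_int[(r["blk00"][:prefix_len], r["blk10"][:prefix_len])] += r["area_int"]

    weights = {}
    for (unit00, unit10), area in unit_int.items():
        a00, a10 = unit_area00[unit00], unit_area10[unit10]
        weights[(unit00, unit10)] = {
            "wt_00_to_10": area / a00 if a00 else None,
            "wt_10_to_00": area / a10 if a10 else None,
        }
    return weights


tract_weights = aggregate_weights(rows, prefix_len=11)  # 11 digits = tract
for pair, weight in tract_weights.items():
    print(pair, weight)
```

Passing prefix_len=12 or prefix_len=5 would give block group or county weights in the same way.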

Finally, I prepped the Census data that was extracted in Step Three. You can find the code I used to finish preparing Census data for the crosswalk (07-prepcensusdata_20180807.do) by clicking here. In short, for each geographic level (county, tract, block group, and block) and each dataset (Census 2000 Summary File 1, Census 2000 Summary File 3, Census 2010 Summary File 1, and American Community Survey 2008-2012), I kept the variables I plan to use and labeled them accordingly. Then, I made files for each geographic level for the 2000 Census data by merging the Census 2000 Summary File 1 and Summary File 3 datasets by the unique Census identifier, and created state-level and national-level datasets. State-level datasets were saved using their unique Census state FIPS code. I finished by making files for each geographic level for the 2010 Census data by merging the Census 2010 Summary File 1 and American Community Survey 2008-2012 datasets by the unique Census identifier, and creating state-level and national-level datasets.
​

Comparing Census Population Data, Part Four: Reading Census Data into Stata

7/11/2018

This post describes the third step of my research to compare Census population data over time: Reading Census Data into Stata. See Step Two: Downloading Bulk Data from Census FTP using Python Programs and Step One: Studying Documentation to Determine the Feasibility of Variable Comparison Over Time for a description of the work completed before this step. See Comparing Census Population Data, Part One for an introduction to my research project.

The data downloaded from Census came as a series of .xls, .accdb, .dbf, .txt, and .csv files. To get this data into a manageable format, I used Stata, my preferred statistical/mathematical software program, and wrote Stata code to read in each dataset.

First, I wrote Stata code to read in the American Community Survey (ACS) data. Census provides several .csv files that include variable labels and descriptions for each of the segments and the geography files. The code takes several steps: reading the geography files for each state and using the .csv template to label the data; reading the segment .txt file for each state and using the template to label the data; linking segment files to geography files using the LOGRECNO variable; creating national county, block group, and tract segment files; and linking segments to create complete county, block group, and tract estimates. In the code, the user needs to set/revise the path where they are storing log files and revised data. This Stata code (02-manageACS12_20180731.do) is available by clicking here.
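
The linking step is easiest to see with a tiny example. The Python sketch below is not my Stata code; it only illustrates the idea of naming the columns of a headerless file from a template and then joining segment rows to geography rows on LOGRECNO. Apart from LOGRECNO, the column names and all of the values are made up.

```python
import csv
import io

# Column names as a template would supply them (hypothetical except LOGRECNO).
GEO_NAMES = ["LOGRECNO", "GEOID", "NAME"]
SEG_NAMES = ["LOGRECNO", "total_population"]

# Made-up, headerless file contents standing in for one state's files.
geo_text = "0000001,GEO_A,Example County A\n0000002,GEO_B,Example County B\n"
seg_text = "0000001,1500\n0000002,2750\n"


def read_with_template(text, names):
    """Read headerless delimited text and label each column from the template."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(names, row)) for row in reader]


def link_by_logrecno(geo_rows, seg_rows):
    """Attach geography fields to each segment row that has a matching LOGRECNO."""
    geo_by_key = {row["LOGRECNO"]: row for row in geo_rows}
    linked = []
    for seg in seg_rows:
        geo = geo_by_key.get(seg["LOGRECNO"])
        if geo is not None:
            linked.append({**geo, **seg})
    return linked


linked = link_by_logrecno(
    read_with_template(geo_text, GEO_NAMES),
    read_with_template(seg_text, SEG_NAMES),
)
for row in linked:
    print(row)
```

LOGRECNO is kept as a string here so that its leading zeros survive the join.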

Next, I wrote Stata code to read in the Decennial Census 2010 Summary File One data. Census does not provide variable labels and descriptions for each of the segments and the geography files. Instead, it provides an Access 1999 file with templates for each of the segments. These templates are accessible in .csv format by clicking here. For the Stata code to work, these templates must be saved in the same location as the original data downloaded in the previous step. Census also provides a template for the geography file, but it does not work because the state geography files do not have common separators such as commas. As a result, I had to write dictionaries for each state's geography file. For the Stata code to work, the dictionary files must also be saved in the same location as the original data downloaded in the previous step. These dictionaries are accessible by clicking here. I then wrote the Stata code, which goes through several steps: reading the geography files for each state; reading the segment text file for each state and using the template to label the data; linking segment files to geography files; creating national block group, tract, and county segment files; and linking segments to create complete block, block group, tract, and county estimates. In the code, the user needs to set/revise the path where they are storing log files and revised data. This Stata code (03-manageDC00SF1_20180720.do) is available by clicking here.
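
For readers who have not worked with files that lack separators, a dictionary is just a list of where each field starts and how wide it is. The Python sketch below shows the same idea as a Stata dictionary, but it is only an analogy: the field positions and the sample line are hypothetical and are not the actual Census geography file layout.

```python
# Hypothetical layout: (field name, 1-based start column, width).
# These positions are for illustration only, not the real Census layout.
GEO_LAYOUT = [
    ("STUSAB", 1, 2),
    ("SUMLEV", 3, 3),
    ("LOGRECNO", 6, 7),
    ("NAME", 13, 20),
]


def parse_fixed_width(line, layout):
    """Slice one fixed-width line into named fields using the layout."""
    record = {}
    for name, start, width in layout:
        begin = start - 1  # convert the 1-based column to a 0-based index
        record[name] = line[begin:begin + width].strip()
    return record


# A made-up line that follows the hypothetical layout above.
sample_line = "TX" + "050" + "0000012" + "Example County".ljust(20)
record = parse_fixed_width(sample_line, GEO_LAYOUT)
print(record)
```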

Then, I wrote Stata code to read in the Decennial Census 2000 Summary File One data. Census does not provide variable labels and descriptions for each of the segments and the geography files. Instead, it provides an Access 2007 file with templates for each of the segments. These templates are accessible by clicking here. For the Stata code to work, these templates must be saved in the same location as the original data downloaded in the previous step. Census also provides a template for the geography file, but it does not work because the state geography files do not have common separators such as commas. As a result, I had to write dictionaries for each state's geography file. For the Stata code to work, the dictionary files must also be saved in the same location as the original data downloaded in the previous step. These dictionaries are accessible by clicking here. I then wrote the Stata code, which goes through several steps: reading the geography files for each state; reading the segment text file for each state and using the template to label the data; linking segment files to geography files; creating national block group and tract segment files; and linking segments to create complete block, block group, and tract estimates. In the code, the user needs to set/revise the path where they are storing log files and revised data. This Stata code (04-manageDC00SF1_20180720.do) is available by clicking here.

Finally, I went through similar processes as described above for the Decennial Census 2000 Summary File Three data. Click here for the templates for each of the segments. Click here for the geography dictionary files. Click here to access the Stata code (05-manageDC00SF3_20180723.do).

Comparing Census Population Data, Part Three: Downloading Bulk Data from Census FTP using Python Programs

6/29/2018

This post describes the second step of my research to compare Census population data over time: Downloading Bulk Data from the Census FTP Site using Python Programs. See Step One: Studying Documentation to Determine the Feasibility of Variable Comparison Over Time for a description of the work completed before this step. See Comparing Census Population Data, Part One for an introduction to my research project.

Census makes its data available through its FTP site. Rather than point and click to get to each file, download it, and unzip it, I wrote Python code to automate the process. I only grabbed the data segments that I needed. These segments were determined in Step One.
 
The first Python program I wrote downloads the national Decennial Census 2000 Summary File 1 data, the national Decennial Census 2000 Summary File 3 data, the state Decennial Census 2010 Summary File 1 data, and the state American Community Survey (2008-2012) data. We chose to use the American Community Survey five-year estimates in order to increase the sample size. In the code, the user needs to set the paths where they want to download the data (dc00Path, dc10Path, and acs12Path), the three-digit codes representing the segments to be downloaded from the Decennial Census 2000 Summary File 1 (dc00sf1List), the Decennial Census 2000 Summary File 3 (dc00sf3List), and the Decennial Census 2010 Summary File 1 (dc10sf1List), and the four-digit codes representing the segments to be downloaded from the American Community Survey (acsList). This Python code (01-downloadCensusData_20180424_p1.py) is available by clicking here.
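
For anyone who wants the general shape of this kind of download script without opening the file, here is a minimal sketch using Python's standard ftplib and zipfile modules. It is not the program linked above. The file naming pattern, the helper names, and the host and directory in the commented-out call are placeholders I made up; only the variable names dc00Path and dc00sf1List come from my code.

```python
import os
import zipfile
from ftplib import FTP


def segment_file_names(prefix, segments, suffix, width=3):
    """Build one file name per segment code (hypothetical naming pattern)."""
    return [prefix + str(seg).zfill(width) + suffix for seg in segments]


def download_and_unzip(host, remote_dir, file_names, local_dir):
    """Download each named file over anonymous FTP, then unzip it in place."""
    os.makedirs(local_dir, exist_ok=True)
    with FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(remote_dir)
        for name in file_names:
            local_path = os.path.join(local_dir, name)
            with open(local_path, "wb") as handle:
                ftp.retrbinary("RETR " + name, handle.write)
            with zipfile.ZipFile(local_path) as archive:
                archive.extractall(local_dir)


# User settings, named after the variables described above.
dc00Path = "data/dc00"
dc00sf1List = [1, 2, 17]  # illustrative segment codes

names = segment_file_names("seg", dc00sf1List, ".zip")
print(names)

# Placeholder host and directory; replace with the real server details.
# download_and_unzip("ftp.example.org", "/path/on/server", names, dc00Path)
```

Keeping the name-building step separate from the download step makes it easy to print and check the list of files before anything is requested from the server.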
 
The second Python file I wrote is really more documentation than program. It describes several templates I downloaded from Census websites. I downloaded and used these templates so I didn't have to write out the dictionaries for each of the segments in order to read the data into Stata. This text file (01-downloadCensusData_20180524_p2.py) is available by clicking here.
 
Unfortunately, the national Decennial Census 2000 data are for larger levels of geography and do not include block and block group estimates. As such, I had to write Python code that loops through the state folders to download the Decennial Census 2000 Summary File 1 and 3 data. In the code, the user needs to set the path where they want to download the data (dcPath) and the three-digit codes representing the segments to be downloaded from the Decennial Census 2000 Summary File 1 and 3 (dc00sf3List). This Python code (01-downloadCensusData_20180528_p3.py) is available by clicking here.
 
The fourth Python code I wrote downloads the Census Block 2000 to 2010 Crosswalk for every state and territory in the United States. In the code, the user needs to set/revise the path where they want to download the data (dc00Path; obviously, if I were doing this again for the pure purpose of sharing this information, I would have named this differently, something like geoPath, but it works). This Python code (01-downloadCensusData_20180604_p4.py) is available by clicking here.  

The fifth Python program I wrote downloads the American Community Survey (2008-2012) data segment 76. I had forgotten to list that file segment in the first Python program I wrote, so I downloaded it using the Python code (01-downloadCensusData_20180607_p5.py) available by clicking here.

The sixth Python program I wrote downloads the Decennial Census 2000 Summary File 3 data segment 54. I had forgotten to list that file segment in the third Python program I wrote, so I downloaded it using the Python code (01-downloadCensusData_20180608_p6.py) available by clicking here.

These six programs were run to download and document all of the data our team thinks it needs to develop the contextual analysis variables required for our analysis. I also prefer downloading data from FTP servers over using cleaned secondhand data because I am a control freak and like knowing everything that is done to the data before analyzing it.

Speaking of my OCD issues, you will notice that all my programs start with 01 and end with an eight-digit code representing the date they were run and then the part number. I tend to use this format for the first step in all my files (except when I use SAS, which won't allow me to save a program that starts with a number, so instead I start it with s01). The first two digits mark the research step (01 is usually several .py files that complete the first step of my research, which is typically downloading datasets), the eight-digit code represents the date the program was run (formatted as the four-digit year followed by the two-digit month and two-digit day), and the final two-character code (such as p1) represents the part of the step. I do not typically separate parts of the same step into different files. This is something I only do for downloading original data because rarely do I download all that I need on the first shot. These techniques are just a small part of my overall data management strategy, which was introduced to me by Dr. Scott Long in a data management workshop hosted by the Texas Research Data Center. I really like the organizational strategy, as it makes it easy to recall everything I have done, years after the research is complete.
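
One nice side effect of a strict naming convention is that file names can be read by a program. The sketch below is a hypothetical helper (I did not use it in this project) that splits a program name into step, description, date, and part with a regular expression, treating the part code as optional since most steps have only one file.

```python
import re
from datetime import datetime

# step - description _ yyyymmdd [ _p<part> ] . extension
PROGRAM_NAME = re.compile(
    r"^(?P<step>s?\d{2})-(?P<name>[A-Za-z0-9]+)_(?P<date>\d{8})"
    r"(?:_p(?P<part>\d+))?\.(?P<ext>\w+)$"
)


def parse_program_name(file_name):
    """Split a program file name into its step, name, date, part, and extension."""
    match = PROGRAM_NAME.match(file_name)
    if match is None:
        raise ValueError(f"{file_name!r} does not follow the naming convention")
    info = match.groupdict()
    info["date"] = datetime.strptime(info["date"], "%Y%m%d").date()
    info["part"] = int(info["part"]) if info["part"] else None
    return info


first = parse_program_name("01-downloadCensusData_20180424_p1.py")
crosswalk = parse_program_name("06-managecrosswalk_20180801.do")
print(first)
print(crosswalk)
```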

Comparing Census Population Data, Part Two: Studying Documentation to Determine Feasibility of Variable Comparison

6/21/2018

This post describes the first step of comparing Census population data over time: studying documentation to determine the feasibility of variable comparison over time. See Comparing Census Population Data, Part One to get introduced to this project.

As described in the previous post, not all of the population tabulations that are available in the 2000 Decennial Census are available in the 2010 Decennial Census and American Community Survey. Additionally, questions can be asked in different ways, which can make them incomparable. 

In addition to determining data comparability, documentation must also be studied to determine variable names, table names, and segment identifiers. To accomplish this task, I created a table with the following fields:

- Main Fields: American Community Survey Variable Name, Other Variable Name (If Different)
- Decennial Census 2000, Summary File 1: Table Number, Table Name, Table ID, Segment, Max Size, Universe, Lowest Level of Geography
- Decennial Census 2000, Summary File 3: Table Number, Table Name, Table ID, Segment, Max Size, Universe, Lowest Level of Geography
- Decennial Census 2010, Summary File 1: Table Number, Table Name, Table ID, Segment, Max Size, Universe, Lowest Level of Geography
- American Community Survey, 2010-2014: Table Number, Table Name, Table ID, Segment, Max Size, Universe, Lowest Level of Geography
- Notes: When Comparing 2012 ACS to 2000 DC, When Comparing 2012 ACS to 2010 DC, When Comparing 2012 ACS to 2011 ACS

Click here to access a template I created.

The next step was identifying each variable and writing down the information in the template. The table number, table contents, data dictionary reference name, segment, max size, and smallest summary file level are available in the technical documentation as described below:

- Decennial Census 2000 Summary File 1: Starts on page 227 (click here to access the document)
- Decennial Census 2000 Summary File 3: Starts on page 422 (click here to access the document)
- Decennial Census 2010 Summary File 1: Starts on page 183 (click here to access the document)
- American Community Survey 2008-2012 Summary File: Starts on page 46 (click here to access the document)

After checking the data documentation/code books to find the variable/table/segment details, I looked up the variables using the following three different Census tools and used the information to fill out the comparability notes in the table:

- https://www.census.gov/programs-surveys/acs/guidance/comparing-acs-data/2012.html
- https://www.census.gov/acs/www/guidance/comparing-acs-data/acscensus-table-lookup/index.php
- https://www.census.gov/geo/maps-data/data/relationship.html

​After looking all the information up, in the comparability notes of the table, I highlighted the variables that have no comparability concerns as green, I highlighted the variables that have some comparability concerns as orange, and I highlighted the variables that are not comparable as red. Click here to access the completed table. 

In order to make information about which variables can be used, and at what level, easy to find, I made a simplified table listing our variables of interest, whether each can be broken down by race, whether it can be measured using full population count data or sample data, whether it is comparable, the smallest level of geography available, and associated notes. You can find this by clicking here.

This completed my first step of checking the documentation to determine feasibility of comparing variables over time. 

Comparing Census Population Data, Part One: Project Introduction

6/4/2018

I am currently working on a project that compares Census population data from 2000 to 2010 for the entire United States at the lowest geographic level available, whether it be the Census block, the Census block group, or the Census tract (listed smallest to largest). We also want to compare Census population data from 2000 and 2010 at the level of the county and the Core Based Statistical Area (CBSA), which is a group of counties encompassing a metropolitan area. We are using variables, such as the race of individuals and single mother households, that can be obtained using the 2000 and 2010 Decennial Census short form, which is required for 100% of the population and is available at the block level. But we are also using variables, such as household income and poverty levels, that come from the 2000 Decennial Census long form (about a 17% sample of the total population) and the 2008-2012 American Community Survey (about a 12.5% sample of the total population). Both the 2000 Decennial Census long form and the 2008-2012 American Community Survey tabulations are only available at the level of the block group or tract.

​As we attempt to develop comparable population estimates across space, we must be aware that pulling data from multiple data sets over different periods of time can create some issues with consistency. I discuss some of these issues below.

Common Issues When Comparing Census Population Data Over Time

There are three common concerns when comparing Census population data at a low level of geography:

(1) Changes in Questions Asked- Some questions that are asked in one decade might not be asked in the same way in another, or might not even be asked at all. Furthermore, the categories, coding scheme, etc., for the answers to the questions might not be the same. These differences create challenges when comparing data over time.

(2) Changes in Tabulations that are Publicly Available- To maintain confidentiality, individual micro-data is only available to approved researchers for approved projects within a restricted Federal Statistical Research Data Center (FSRDC). For the general public, data is available as a series of tabulations (summed totals or averages) by block, block group, tract, or other level of geography. While Census might release a table for a particular variable in one decade (such as the number of households in a block group that are Hispanic and in poverty), that table is not guaranteed to be available in another decade.

(3) Changes in the Geographic Space included in a Block/Block Group/Tract- The United States Census breaks up the geographic space of the United States into a hierarchy of polygons, each representing a specific geographic space. Blocks are small geographic areas bounded by visible features, such as streets and railroad tracks. Block groups are clusters of blocks representing anywhere from 600 to 3,000 people (in 2010). Tracts are clusters of block groups representing anywhere from 1,200 to 8,000 people (in 2010). Counties are clusters of tracts. States are clusters of counties. However, since block groups and tracts are established based on the number of people within the area, boundaries change significantly in areas where there are large population changes. For example, say you are looking at an area in rural Arizona in 2000 that is relatively remote. This space might be made up of one tract and a few block groups and blocks. Then you might look at this same area in 2010, after a bunch of retirement homes were built there and it became a populated suburban area. This space might now be made up of four tracts and numerous block groups and blocks. Since Census blocks, block groups, and tracts do not consistently represent the same geographic space, we face further challenges when comparing community data.

In order to deal with these concerns, I followed the processes described below.

The Processes of Comparing Census Data Over Time

In order to compare Census data over time, several different steps were taken.

  1. Studying Technical Documentation to Determine Feasibility of Variable Comparison
  2. Downloading Bulk Data from Census FTP using Python
  3. Reading Census Data into Stata
  4. Preparing Data for Census Crosswalk
  5. Completing the Crosswalk
  6. Cleaning the Data
  7. Assessing the Crosswalk
  8. Adding CBSA Data

​I will describe these steps in more detail, provide the documentation necessary to replicate my research, and describe things I would have done differently in future posts.

GIS Day Presentation

11/14/2017

Texas A&M University hosts the largest GIS Day in Texas. I was selected as one of the five finalists for the GIS Day 2017 Paper Competition. I am very excited to obtain feedback from people in different disciplines about my research. Below you will find my paper presentation which I will present today, November 14 at 1:15 in the Evans Library Annex Room 405C. 
gisday20171114.pdf

Communities Exposed to Texas Oil and Gas Extraction Facility Venting and Flaring Practices in 2012

3/31/2017

I am getting deeper into my dissertation research and understanding the data. I found out that the Production Data Query Dump file is only half of the data the Texas Railroad Commission has on venting and flaring volumes. Apparently there is another file that will cost me over $375 regarding venting and flaring volumes at processing plants (but details on what would be included have yet to be made available). Luckily, my recent research has focused on venting and flaring volumes at the oil and gas extraction facility (i.e., the producing well). 

Although there are bumps in the road and it is taking longer than I thought, I am making progress on achieving my output goals. The first of these is a map associated with one of the papers that will emerge from my dissertation research regarding the characteristics of communities most exposed to Texas oil and gas venting and flaring volumes (at extraction facilities, not processing plants). You can visit the web application I built by clicking here or seeing the map below.

Resource Dependence Theory Guest Lecture Slides

3/20/2017

Tomorrow I am guest lecturing for Dr. Morris' organizational sociology course. You can find the slides for my lecture below.

Texas Railroad Commission Data Connections

1/27/2017

For my dissertation, I am merging together various files from the Texas Railroad Commission.

This took a lot of time and work, but I was able to finally complete the task last week. You can find a map of data connections at http://prezi.com/albf920zd5jf/?utm_campaign=share&utm_medium=copy or view it below:

My Dissertation Research in the Texas Federal Statistical Research Data Center: The Effects of Organizational Characteristics and the Characteristics of Organizational Institutional Environments on Texas Oil and Gas Venting and Flaring Practices

12/9/2016

Today at 10 at the Texas Federal Statistical Research Data Center (TXRDC), I will be making a presentation about my proposal to access Census and IRS restricted data at the TXRDC (click here for more information about the event). This presentation will be useful to graduate students interested in using RDC datasets for their dissertation research. I provide some insight on the type of dissertation research that is conducted at an RDC, understanding Census terms for business data, research timelines, and other tips.

If you cannot make it, you can find my presentation below. 
txrdcdoctoralstudentinfosession_kacw_20161209.pdf


    Author

    Kate Willyard is a political and economic sociologist interested in human organization and the environment.
