The Census Bureau makes its data available through its FTP site. Rather than pointing and clicking to reach each file, then downloading and unzipping it, I wrote Python code to automate the process. I only grabbed the data segments that I needed, which were determined in Step One.
The first Python program I wrote downloads the national Decennial Census 2000 Summary File 1 data, the national Decennial Census 2000 Summary File 3 data, the state Decennial Census 2010 Summary File 1 data, and the state American Community Survey (2008-2012) data. We chose to use the American Community Survey five year estimates in order to increase the sample size. In the code, the user needs to set the paths where the data should be downloaded (dc00Path, dc10Path, and acs12Path), the three digit codes representing the segments to be downloaded from the Decennial Census 2000 Summary File 1 (dc00sf1List), the Decennial Census 2000 Summary File 3 (dc00sf3List), and the Decennial Census 2010 Summary File 1 (dc10sf1List), and the four digit codes representing the segments to be downloaded from the American Community Survey (acsList). This Python code (01-downloadCensusData_20180424_p1.py) is available by clicking here.
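To give a sense of how the settings described above drive the download, here is a minimal sketch. It is not the actual program: the variable names dc00Path and dc00sf3List come from the description above, but their values, the base URL, the file name pattern, and the helper functions are all placeholders I made up for illustration, not the real Census layout.

```python
import os
import urllib.request
import zipfile

# Assumed settings. The names follow the description above; the values and
# BASE_URL are placeholders, not the real Census server layout.
dc00Path = os.path.join("data", "dc2000")
dc00sf3List = ["001", "002", "003"]
BASE_URL = "https://example.org/census_2000/sf3/"


def segment_filename(segment, prefix="us", suffix="_uf3.zip"):
    """Build a zip file name for one three digit segment code (hypothetical pattern)."""
    if len(segment) != 3 or not segment.isdigit():
        raise ValueError(f"expected a three digit segment code, got {segment!r}")
    return f"{prefix}{segment}{suffix}"


def build_jobs(base_url, segments, dest_dir):
    """Return one (url, local zip path) pair per requested segment."""
    jobs = []
    for segment in segments:
        name = segment_filename(segment)
        jobs.append((base_url + name, os.path.join(dest_dir, name)))
    return jobs


def download_and_unzip(url, zip_path):
    """Download one zip file and extract it next to where it was saved."""
    dest_dir = os.path.dirname(zip_path) or "."
    os.makedirs(dest_dir, exist_ok=True)
    urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)


# Usage (commented out because BASE_URL is only a placeholder):
# for url, zip_path in build_jobs(BASE_URL, dc00sf3List, dc00Path):
#     download_and_unzip(url, zip_path)
```

Keeping the segment codes in a list means a forgotten segment can be picked up later by running the same loop with a shorter list.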
The second Python file I wrote is really more documentation than code. It describes several templates I downloaded from Census websites. I used these templates so I didn't have to write out the dictionaries for each of the segments in order to read the data into Stata. This text file (01-downloadCensusData_20180524_p2.py) is available by clicking here.
Unfortunately, the national Decennial Census 2000 data are for larger levels of geography and do not include block and block group estimates. As such, I had to write a third Python program that loops through the state folders to download the Decennial Census 2000 Summary File 1 and 3 data. In the code, the user needs to set the path where they want to download the data (dcPath), and the three digit codes representing the segments to be downloaded from the Decennial Census 2000 Summary File 1 and 3 (dc00sf3List). This Python code (01-downloadCensusData_20180528_p3.py) is available by clicking here.
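The loop over state folders can be sketched roughly as follows. This is an illustration under stated assumptions, not the program itself: the folder names, the two letter prefixes, the URL pattern, and the function name are hypothetical, and only a handful of states are listed.

```python
from itertools import product

# Hypothetical subset of state folders mapped to a file name prefix.
# The real server layout may differ.
STATE_FOLDERS = {"Alabama": "al", "Alaska": "ak", "Texas": "tx"}


def state_segment_urls(base_url, state_folders, segments):
    """Yield (state folder, segment, url) for every state and segment pair."""
    for (folder, prefix), segment in product(state_folders.items(), segments):
        # Assumed pattern: <base>/<state folder>/<prefix><segment>.zip
        yield folder, segment, f"{base_url}{folder}/{prefix}{segment}.zip"


# Example: list what would be requested, without downloading anything.
for folder, segment, url in state_segment_urls(
    "https://example.org/sf3/", STATE_FOLDERS, ["001", "054"]
):
    print(folder, segment, url)
```

Generating the full list of URLs before downloading makes it easy to check the count (states times segments) and spot a missing segment before any time is spent on the transfer.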
The fourth Python program I wrote downloads the Census Block 2000 to 2010 Crosswalk for every state and territory in the United States. In the code, the user needs to set or revise the path where they want to download the data (dc00Path; obviously, if I were doing this again purely for the purpose of sharing this information, I would have named this differently, something like geoPath, but it works). This Python code (01-downloadCensusData_20180604_p4.py) is available by clicking here.
The fifth Python program I wrote downloads the American Community Survey (2008-2012) data segment 76. I had forgotten to list that file segment in the first Python program I wrote, so I downloaded it using the Python code (01-downloadCensusData_20180607_p5.py) available by clicking here.
The sixth Python program I wrote downloads the Decennial Census 2000 Summary File 3 data segment 54. I had forgotten to list that file segment in the third Python program I wrote, so I downloaded it using the Python code (01-downloadCensusData_20180608_p6.py) available by clicking here.
These six programs were run to download and document all of the data our team thinks it needs to develop the contextual variables required for our analysis. I prefer downloading data from FTP servers over using cleaned secondhand data because I am a control freak and like knowing everything that has been done to the data before analyzing it.
Speaking of my OCD issues, you will notice that all my programs start with 01 and end with an eight digit code representing the date they were run, followed by the part number. I tend to use this format for the first step in all my projects (except when I use SAS, which won't allow me to save a program that starts with a number, so instead I start it with s01). The first two digits mark the research step (01 is usually several .py files that complete the first step of my research, which is typically downloading datasets), the eight digit code represents the date the program was run (formatted as the four digit year followed by the two digit month and two digit day), and a two character code (p followed by the part number) represents the part of the step. I do not typically separate parts of the same step into different files. This is something I do only when downloading original data, because I rarely download everything I need on the first shot. These techniques are just a small part of my overall data management strategy, which was introduced to me by Dr. Scott Long in a data management workshop hosted by the Texas Research Data Center. I really like the organizational strategy, as it makes it easy to recall everything I have done, years after the research is complete.
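For anyone who wants to adopt the same naming scheme, here is a small sketch that builds and parses file names like 01-downloadCensusData_20180424_p1.py. The function names and the regular expression are my own illustration of the convention described above, not part of the original programs.

```python
import re
from datetime import date, datetime

# step (two digits) - label _ run date (YYYYMMDD) _ p<part> . extension
NAME_PATTERN = re.compile(
    r"^(?P<step>\d{2})-(?P<label>\w+?)_(?P<date>\d{8})_p(?P<part>\d+)\.(?P<ext>\w+)$"
)


def build_name(step, label, run_date, part, ext="py"):
    """Compose a file name from the step, label, run date, and part number."""
    return f"{step:02d}-{label}_{run_date:%Y%m%d}_p{part}.{ext}"


def parse_name(filename):
    """Split a file name back into its step, label, run date, part, and extension."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow the naming convention")
    return {
        "step": int(match["step"]),
        "label": match["label"],
        "run_date": datetime.strptime(match["date"], "%Y%m%d").date(),
        "part": int(match["part"]),
        "ext": match["ext"],
    }


name = build_name(1, "downloadCensusData", date(2018, 4, 24), 1)
print(name)
print(parse_name(name))
```

Because the date is written year first, sorting the file names alphabetically within a step also sorts them chronologically.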