The assemble_data Module
assemble_data.py
Core module for assembling a census plan
- class DataPlan(yaml_path, geometry, years=[2000, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019], state=None, county=None)
A class containing information on how to create a desired set of census data.
Inputs for initializing a DataPlan object from a census yaml document:
- yaml_path:
Path to a yaml file. Structure defined in census_yaml.
- geometry:
Which census geography this plan is for.
- years:
The list of years to query data from. The census_years() function can calculate which years in your timeframe of interest can be queried for the decennial and 5-year ACS data. Note that this may not apply to the ACS1 or other data. That function may be updated in the future; for now, building lists of years other than the defaults is left to the reader.
- state:
2-digit FIPS code of the state you want to limit the query to (e.g. “06” for CA).
- county:
3-digit FIPS code of the county you want to include. Requires state to be specified.
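The constraints on state and county described above can be sketched as a small check. The helper name check_fips is hypothetical and is not part of this module; whether DataPlan performs any such validation is not stated here. Note that the codes are strings, so the leading zero in “06” is preserved.

```python
from typing import Optional


def check_fips(state: Optional[str] = None, county: Optional[str] = None) -> None:
    """Hypothetical helper: raise ValueError if the FIPS arguments are inconsistent."""
    # A state code, when given, must be exactly two digits (e.g. "06").
    if state is not None and not (len(state) == 2 and state.isdigit()):
        raise ValueError(f"state must be a 2-digit FIPS string, got {state!r}")
    if county is not None:
        # A county code is only meaningful within a state.
        if state is None:
            raise ValueError("county requires state to be specified")
        # A county code must be exactly three digits (e.g. "037").
        if not (len(county) == 3 and county.isdigit()):
            raise ValueError(f"county must be a 3-digit FIPS string, got {county!r}")


check_fips(state="06", county="037")  # passes silently
```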
Members:
- geometry: which census geography this plan is for
- years: the list of years that the data should be queried for
- state: 2-digit FIPS code of the state you want to limit the query to (e.g. “06” for CA)
- county: 3-digit FIPS code of the county you want to include. Requires state to be specified
- plan: a dict with keys of years, storing lists of VariableDef objects defining the variables to be calculated for that year. Created from a yaml file. Structure defined in census_yaml
- data: a pandas data frame created based on the defined data plan. Only exists after the DataPlan.assemble_data() method is called.
- supported_out_formats = ['csv']
- assemble_data()
Create a data frame covering each geoid and each year, with each variable as defined in the data plan
- Returns:
Assembled data frame stored in self.data
- get_var_names()
Return a list containing all the variable names that are created in the data plan
- Returns:
List of strings
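As an illustration of collecting names from a plan keyed by year, here is a minimal sketch. The Var class is a hypothetical stand-in for VariableDef, and the choice to deduplicate names while keeping first-seen order is an assumption, not a documented behavior of get_var_names().

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Var:
    """Hypothetical stand-in for VariableDef; only the name matters here."""
    name: str


def var_names(plan: Dict[int, List[Var]]) -> List[str]:
    """Collect each variable name once, walking the years in ascending order."""
    seen: List[str] = []
    for year in sorted(plan):
        for var in plan[year]:
            if var.name not in seen:
                seen.append(var.name)
    return seen


plan = {
    2000: [Var("population")],
    2010: [Var("population"), Var("poverty")],
}
print(var_names(plan))  # ['population', 'poverty']
```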
- add_geoid()
Add a single column named ‘geoid’ to self.data, combining all portions of a data set's geographical identifiers
- Returns:
None
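Census geographic identifiers nest, so a combined identifier is typically the concatenation of its parts, from the largest unit down (state, then county, and so on). The sketch below shows that idea on plain dictionaries; build_geoid and the column names are hypothetical, and the actual method operates on self.data.

```python
def build_geoid(*parts: str) -> str:
    """Concatenate identifier parts, largest geography first."""
    return "".join(parts)


rows = [
    {"state": "06", "county": "037"},
    {"state": "06", "county": "001"},
]
for row in rows:
    # Keep the parts as strings so leading zeros survive.
    row["geoid"] = build_geoid(row["state"], row["county"])
```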
- create_missingness(min_year=None, max_year=None)
Create a row for all combinations of geospatial ID and year
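One way to picture this is as a Cartesian product of identifiers and years, with empty entries where nothing was observed. The sketch below uses plain dictionaries rather than a data frame; fill_missing is a hypothetical name, and treating a missing min_year or max_year as "use the range found in the data" is an assumption.

```python
from itertools import product
from typing import Dict, Optional, Tuple

Key = Tuple[str, int]  # (geoid, year)


def fill_missing(
    rows: Dict[Key, Optional[dict]],
    min_year: Optional[int] = None,
    max_year: Optional[int] = None,
) -> Dict[Key, Optional[dict]]:
    """Return an entry for every (geoid, year) pair; unobserved pairs map to None."""
    if not rows:
        return {}
    geoids = sorted({geoid for geoid, _ in rows})
    years_seen = [year for _, year in rows]
    # Assumption: fall back to the observed range when no bound is given.
    first = min(years_seen) if min_year is None else min_year
    last = max(years_seen) if max_year is None else max_year
    return {key: rows.get(key) for key in product(geoids, range(first, last + 1))}


observed = {
    ("06037", 2010): {"population": 100},
    ("06001", 2011): {"population": 50},
}
full = fill_missing(observed)
```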
- write_data(path, file_type='csv')
Write data out to a file. The default is to write out to csv; new methods can be implemented in the future.
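The pattern of checking the requested format against a list of supported formats before writing can be sketched with the standard library. The function write_rows and the list-of-dicts input are hypothetical; the actual method writes self.data.

```python
import csv
from typing import Dict, List

SUPPORTED_OUT_FORMATS = ["csv"]  # mirrors supported_out_formats above


def write_rows(rows: List[Dict[str, object]], path: str, file_type: str = "csv") -> None:
    """Write a non-empty list of records to path; only csv is handled in this sketch."""
    if file_type not in SUPPORTED_OUT_FORMATS:
        raise ValueError(f"unsupported file_type {file_type!r}")
    # Column order follows the keys of the first record.
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Supporting another format would mean adding its name to the list and a matching branch in the function.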
- calculate_densities(variables=['population'], sq_mi=True)
Divide specified variables by area
- Parameters:
variables – List of variables to calculate densities for
sq_mi – Should densities be calculated per square mile? If false, calculated per square meter
- Returns:
None
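The arithmetic behind the sq_mi switch can be sketched as follows. The sketch assumes areas are stored in square meters, which the per-square-meter fallback suggests but the text does not state outright; the density function is a hypothetical name.

```python
# One international mile is exactly 1609.344 m, so this is m^2 per mi^2.
SQ_M_PER_SQ_MI = 1609.344 ** 2


def density(value: float, area_sq_m: float, sq_mi: bool = True) -> float:
    """Divide value by area, expressed in square miles or square meters."""
    # Assumption: the stored area is in square meters.
    area = area_sq_m / SQ_M_PER_SQ_MI if sq_mi else area_sq_m
    return value / area


# 1000 people on 500 m^2 is 2 people per square meter.
print(density(1000, 500, sq_mi=False))  # 2.0
```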
- interpolate(method='ma', min_year=None, max_year=None)
Fill in values
- Parameters:
method – Interpolation method to use
min_year – Minimum year to interpolate
max_year – Maximum year to interpolate
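To make "filling in values" between survey years concrete, here is a sketch of plain linear interpolation over a year-indexed series. This is a swapped-in technique chosen for clarity: the text does not define what the 'ma' method does, and this sketch does not claim to reproduce it. The function name and the dict input are hypothetical.

```python
from typing import Dict, Optional


def interpolate_linear(
    series: Dict[int, Optional[float]],
    min_year: Optional[int] = None,
    max_year: Optional[int] = None,
) -> Dict[int, Optional[float]]:
    """Fill missing years by drawing a straight line between the nearest known years."""
    known = sorted(year for year, value in series.items() if value is not None)
    if not known:
        return dict(series)
    first = min(known) if min_year is None else min_year
    last = max(known) if max_year is None else max_year
    out = dict(series)
    for year in range(first, last + 1):
        if out.get(year) is not None:
            continue
        before = [y for y in known if y < year]
        after = [y for y in known if y > year]
        if not before or not after:
            # No extrapolation: years outside the known range stay empty.
            out[year] = None
            continue
        y0, y1 = before[-1], after[0]
        weight = (year - y0) / (y1 - y0)
        out[year] = series[y0] + weight * (series[y1] - series[y0])
    return out


filled = interpolate_linear({2000: 100.0, 2010: 200.0})
print(filled[2005])  # 150.0
```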
- quality_check(test_file: str)
Test self.data for the checks defined in the test file
- Parameters:
test_file – path to a yaml file defining tests per the quality check paradigm in dorieh.utils.qc
- Returns:
None
- write_schema(filename: Optional[str] = None, table_name: Optional[str] = None)
Write out a yaml file describing the data schema
- Parameters:
filename – path to write to
table_name – Name of the table for the schema
- Returns:
True
- class VariableDef(name: str, var_dict: dict, log: Optional[Logger] = None)
Structured way of representing what we need to know for a variable.
Members:
- dataset: a string. The data set used to calculate a variable; should be dec, acs1, acs5, or pums
- num: a list, the names of the variables that make up the numerator
- den: a list, the names of the variables that make up the denominator. Can be missing
- has_den: a boolean, indicates whether or not there is a denominator
- do_query(year, geometry, state=None, county=None)
Run the query defined by the contained variables
- Parameters:
geometry – census geometry to query
year – year of data to query
state – 2-digit FIPS code of the state to limit the query to
county – 3-digit county code to limit the query to; must be used with state
- Returns:
data frame of all census variables specified by the query
- calculate_var(year, geometry, state=None, county=None)
Query the required data from the census, then calculate the variable defined
- Parameters:
year – year of data to query
geometry – census geometry to query
state – 2-digit FIPS code of the state to limit the query to
county – 3-digit county code to limit the query to; must be used with state
- Returns:
a data frame with one column of the calculated variable and the census geography columns
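The numerator and denominator members suggest a calculation along the following lines. This is a sketch under an assumption: that the listed variables are summed and, when a denominator exists, the two sums are divided. The function name, the record layout, and the column names are all hypothetical.

```python
from typing import Dict, List, Optional


def calculate(
    record: Dict[str, float],
    num: List[str],
    den: Optional[List[str]] = None,
) -> Optional[float]:
    """Sum the numerator columns and, if a denominator is given, divide by its sum."""
    numerator = sum(record[column] for column in num)
    if not den:
        # No denominator: the variable is a plain total.
        return numerator
    denominator = sum(record[column] for column in den)
    # Avoid dividing by zero for empty geographies.
    return numerator / denominator if denominator else None


record = {"poor_male": 10, "poor_female": 30, "total": 200}
print(calculate(record, ["poor_male", "poor_female"], ["total"]))  # 0.2
```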