TPL prosperity hackathon mapping

Background

This notebook is the result of some exploration and analysis that took place at the Toronto Public Library TOProsperity hackathon.

The challenge for the day was: How can we use information from annual tax records to better understand patterns of income across neighbourhoods in Toronto?

The data used was T1FF Neighbourhood Income and Demographics Tables, by Neighbourhood.

More info can be found at the following links:

TPL challenges

T1 family file info

I chose to map the economic dependency ratio by neighbourhood since it seemed to be a tractable task for the day and could provide a good overview of earned income compared with social benefits received.

Economic Dependency Ratio (EDR):

The EDR is the sum of transfer payment dollars received as benefits in a given area for every $100 of employment income in that same area. For example, where a table shows an Employment Insurance (EI) dependency ratio of 4.69, it means that $4.69 in EI benefits were received for every $100 of employment income for the area.
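As a quick check on the definition, the arithmetic can be sketched in a few lines. The function name `economic_dependency_ratio` and the dollar amounts are hypothetical, chosen only to reproduce the 4.69 example; they are not taken from the T1FF tables.

```python
def economic_dependency_ratio(transfer_dollars, employment_income_dollars):
    """Transfer dollars received per $100 of employment income."""
    if employment_income_dollars <= 0:
        raise ValueError("employment income must be positive")
    return 100 * transfer_dollars / employment_income_dollars


# Hypothetical area: $46,900 in EI benefits against $1,000,000 of employment income.
ei_ratio = economic_dependency_ratio(46_900, 1_000_000)
print(round(ei_ratio, 2))  # 4.69
```

A ratio above 100 would mean the area received more in transfers than it earned in employment income.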

Explore and clean the CRA T1FF data

import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import pysal as ps
%matplotlib inline

df = pd.read_csv('./data/T1FF-F2010-2014.csv', low_memory=False)

subset = df[(df.Year == '2012') & (df.Table.isin(['F-7','F-8'])) & (df.Attribute.str.contains('EDR'))]

# Create two subsets of the CRA data: couple families (table F-7) and lone-parent families (table F-8)
CF_subset = df[(df.Year == '2014') & (df.Table.isin(['F-7'])) & (df.Attribute.str.contains('· Government transfers · EDR'))]
LP_subset = df[(df.Year == '2014') & (df.Table.isin(['F-8'])) & (df.Attribute.str.contains('· Government transfers · EDR'))]

CF_subset.Attribute.unique()

array(['Couple families · Government transfers · EDR',
       'Male Partners in Couple Families · Government transfers · EDR',
       'Female Partners in Couple Families · Government transfers · EDR',
       'Children in CFF · Government transfers · EDR',
       'All persons · Government transfers · EDR'], dtype=object)

LP_subset.Attribute.unique()

array(['Lone-parent families · Government transfers · EDR',
       'Parents in LPF · Government transfers · EDR',
       'Children in LPF · Government transfers · EDR',
       'Non-family persons · Government transfers · EDR',
       'All persons · Government transfers · EDR'], dtype=object)

CF_subset = CF_subset.drop(['Year', 'Table'], axis=1)
LP_subset = LP_subset.drop(['Year', 'Table'], axis=1)

long_EDR_CF = CF_subset.T
long_EDR_LP = LP_subset.T

long_EDR_CF.columns = long_EDR_CF.loc['Attribute']
long_EDR_LP.columns = long_EDR_LP.loc['Attribute']

long_EDR_CF.drop(['Attribute'], inplace=True)
long_EDR_LP.drop(['Attribute'], inplace=True)

long_EDR_CF.reset_index(inplace=True)
long_EDR_CF.rename(columns={'index': 'neighbourhood'}, inplace=True)

long_EDR_LP.reset_index(inplace=True)
long_EDR_LP.rename(columns={'index': 'neighbourhood'}, inplace=True)

long_EDR_CF.head(3)

Attribute	neighbourhood	Couple families · Government transfers · EDR	Male Partners in Couple Families · Government transfers · EDR	Female Partners in Couple Families · Government transfers · EDR	Children in CFF · Government transfers · EDR	All persons · Government transfers · EDR
0	Agincourt North	21.9	20.2	33	5.4	26.8
1	Agincourt South-Malvern West	19.6	18.5	28.4	5.2	24.1
2	Alderwood	11.5	10.7	14.1	6.2	16

total = long_EDR_CF.merge(long_EDR_LP[['neighbourhood', 'Lone-parent families · Government transfers · EDR']], on='neighbourhood')

cols = ['neighbourhood', 'Couple families · Government transfers · EDR',
        'Lone-parent families · Government transfers · EDR', 'All persons · Government transfers · EDR']
EDR_df = total[cols]

EDR_df.head(2)

Attribute	neighbourhood	Couple families · Government transfers · EDR	Lone-parent families · Government transfers · EDR	All persons · Government transfers · EDR
0	Agincourt North	21.9	35.1	26.8
1	Agincourt South-Malvern West	19.6	37	24.1

cols = {'Couple families · Government transfers · EDR': 'couple_fam',
        'Lone-parent families · Government transfers · EDR': 'lone_parent',
        'All persons · Government transfers · EDR': 'all_persons'}

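# Rename without inplace=True: EDR_df was sliced from `total`, and an in-place rename on a slice can trigger SettingWithCopyWarning
EDR_df = EDR_df.rename(columns=cols)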

EDR_df.head()

Attribute	neighbourhood	couple_fam	lone_parent	all_persons
0	Agincourt North	21.9	35.1	26.8
1	Agincourt South-Malvern West	19.6	37	24.1
2	Alderwood	11.5	26.1	16
3	Annex	5.2	13.2	7.7
4	Banbury-Don Mills	11.3	21.6	15.8

Clean and merge the EDR data with Toronto neighbourhoods geometries

gdf = gpd.read_file('./data/neighbourhoods_shp/')

gdf.head(2)

	AREA_NAME	AREA_S_CD	geometry
0	Yonge-St.Clair (97)	097	POLYGON ((-79.39119482700001 43.681081124, -79...
1	York University Heights (27)	027	POLYGON ((-79.505287916 43.759873494, -79.5048...

To merge the spatial and EDR data, first clean the neighbourhood names so that the two tables can be merged on that column.

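# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default
neighbourhood = gdf['AREA_NAME'].str.replace(r"\(.*\)", "", regex=True)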
gdf['neighbourhood'] = neighbourhood.str.strip()
gdf.drop(['AREA_NAME', 'AREA_S_CD'], axis=1, inplace=True)
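To see what the name cleaning does on plain strings, here is a minimal sketch using only the standard library. The helper `strip_area_code` is hypothetical and not part of the notebook; it applies the same pattern to the two `AREA_NAME` values shown in the output above.

```python
import re


def strip_area_code(area_name):
    """Drop a parenthesised code such as '(97)' and trim surrounding whitespace."""
    return re.sub(r"\(.*\)", "", area_name).strip()


for raw in ["Yonge-St.Clair (97)", "York University Heights (27)"]:
    print(repr(raw), "->", repr(strip_area_code(raw)))
```

The pattern is greedy, so it removes everything from the first opening parenthesis to the last closing one. That is fine here because each name carries a single trailing code.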

One inconsistency I found is that some neighbourhood names differ between the CRA data and the city's neighbourhood shapefile, so more cleaning is needed.

EDR_neigh = set(EDR_df.neighbourhood)
gdf_neigh = set(gdf.neighbourhood)

print('diff1: ', EDR_neigh - gdf_neigh)
print()
print('diff2: ', gdf_neigh - EDR_neigh)

diff1:  {'North St. James Town', 'Mimico (includes Humber Bay Shores)', 'City of Toronto', 'Cabbagetown-South St. James Town', 'Weston-Pelham Park'}

diff2:  {'Cabbagetown-South St.James Town', 'Weston-Pellam Park', 'North St.James Town', 'Mimico'}

To fix this I'll use fuzzy string matching. This is a case where it's relatively easy to fix manually, but using a fuzzy string matching process is likely more robust and could scale up if more errors were present.

from fuzzywuzzy import process

correct_neighbs  = list(gdf.neighbourhood)

def correct_neigh(neighbourhood):
    if neighbourhood in correct_neighbs:  # a set would give O(1) lookups here; a list is fine at this size
        return neighbourhood, 100

    new_name, score = process.extractOne(neighbourhood, correct_neighbs)
    if score < 90:
        return neighbourhood, score
    else:
        return new_name, score

# Apply the correction once and unpack the (name, score) pairs into two new columns
EDR_df['corrected_hood'], EDR_df['score'] = zip(*EDR_df['neighbourhood'].apply(correct_neigh))

EDR_df.drop(['neighbourhood'], axis=1, inplace=True)
EDR_df = EDR_df[EDR_df.score >= 90]

EDR_df.head()

Attribute	couple_fam	lone_parent	all_persons	corrected_hood	score
0	21.9	35.1	26.8	Agincourt North	100
1	19.6	37	24.1	Agincourt South-Malvern West	100
2	11.5	26.1	16	Alderwood	100
3	5.2	13.2	7.7	Annex	100
4	11.3	21.6	15.8	Banbury-Don Mills	100
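As a dependency-free cross-check on the matching step, the sketch below swaps in the standard library's `difflib` in place of fuzzywuzzy. This is a different scoring method, so its scores are not comparable to the 0 to 100 scores above. The helper `closest_name` and the 0.8 cutoff are assumptions for illustration; the names come from the diff output shown earlier.

```python
import difflib

# Shapefile spellings taken from the "diff2" output.
shapefile_names = [
    "Cabbagetown-South St.James Town",
    "Weston-Pellam Park",
    "North St.James Town",
    "Mimico",
]


def closest_name(name, candidates, cutoff=0.8):
    """Return the most similar candidate with ratio >= cutoff, or None."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# CRA spellings taken from the "diff1" output.
for cra_name in [
    "Weston-Pelham Park",
    "North St. James Town",
    "Mimico (includes Humber Bay Shores)",
    "City of Toronto",
]:
    print(cra_name, "->", closest_name(cra_name, shapefile_names))
```

Under this scoring the two near-identical spellings are matched and 'City of Toronto' is rejected, but the long 'Mimico (includes Humber Bay Shores)' label also falls below the cutoff because `difflib` penalises the length difference. A case like that would need a manual alias or a partial-match scorer.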

After cleaning, merge the data on the cleaned up neighbourhood columns, then drop what is redundant:

map_data = gdf.merge(EDR_df, left_on='neighbourhood', right_on='corrected_hood')
map_data.drop(['corrected_hood'], axis=1, inplace=True)

map_data.head(3)

	geometry	neighbourhood	couple_fam	lone_parent	all_persons	score
0	POLYGON ((-79.39119482700001 43.681081124, -79...	Yonge-St.Clair	5.5	8.4	7.4	100
1	POLYGON ((-79.505287916 43.759873494, -79.5048...	York University Heights	21.1	47	26.7	100
2	POLYGON ((-79.439984311 43.761557655, -79.4400...	Lansing-Westgate	6.1	19.5	7.7	100

map_data.dtypes

geometry         object
neighbourhood    object
couple_fam       object
lone_parent      object
all_persons      object
score             int64
dtype: object

# Convert the EDR values from object to numeric dtype so they can be classified and mapped
map_data[['couple_fam', 'lone_parent', 'all_persons']] = map_data[['couple_fam', 'lone_parent', 'all_persons']].apply(pd.to_numeric)

Plotting choropleths

Choropleths encode the spatial distribution of a variable as a colour scheme, and there are several ways to map values to colours.
Different classification schemes applied to the same data can produce very different maps, because of how the values are distributed and the simplification inherent in binning them into a few classes.
I plotted static choropleth maps alongside density plots of the variable in question to better understand how the classification scheme affects the final visual.
Lastly, I tested out bokeh to get an interactive map where hovering over a neighbourhood shows the EDR for that neighbourhood.
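To make the effect of the classification scheme concrete, here is a minimal standard-library sketch that computes class breaks two ways on a small skewed sample. The helper names and the toy values are hypothetical, and this is not pysal's implementation; it only illustrates the idea behind equal-interval and quantile classification.

```python
from bisect import bisect_left


def equal_interval_breaks(values, k):
    """Upper bounds of k classes of equal width between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    breaks = [lo + i * width for i in range(1, k + 1)]
    breaks[-1] = hi  # guard against floating-point drift on the last bound
    return breaks


def quantile_breaks(values, k):
    """Upper bounds of k classes holding roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # (i * n + k - 1) // k is an integer ceiling of i * n / k
    return [ordered[(i * n + k - 1) // k - 1] for i in range(1, k + 1)]


def class_counts(values, breaks):
    """Count how many values fall in each class (value <= upper bound)."""
    counts = [0] * len(breaks)
    for v in values:
        counts[bisect_left(breaks, v)] += 1
    return counts


toy = [1, 2, 3, 4, 5, 6, 7, 100]  # one extreme value skews the range
for name, make_breaks in [("equal_interval", equal_interval_breaks),
                          ("quantiles", quantile_breaks)]:
    breaks = make_breaks(toy, 4)
    print(name, breaks, class_counts(toy, breaks))
```

With one extreme value, the equal-width classes put seven of the eight values in the first class, while the quantile classes hold two values each. The same data would therefore colour a map very differently under the two schemes.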

# plt.get_cmap is used because cm.get_cmap is deprecated in recent matplotlib releases
cmap = plt.get_cmap('viridis')

def plot_scheme(scheme, var, df, figsize=(16, 8), saveto=None):
    '''
    Plot the distribution over value and geographical space of variable `var` using scheme `scheme`
    ...

    Arguments
    ---------
    scheme   : str
               Name of the classification scheme to use
    var      : str
               Variable name
    df       : GeoDataFrame
               Table with input data
    figsize  : Tuple
               [Optional. Default = (16, 8)] Size of the figure to be created.
    saveto   : None/str
               [Optional. Default = None] Path for file to save the plot.
    '''
    from pysal.esda.mapclassify import Quantiles, Equal_Interval, Fisher_Jenks

    schemes = {'equal_interval': Equal_Interval,
               'quantiles': Quantiles,
               'fisher_jenks': Fisher_Jenks}
    classi = schemes[scheme](df[var], k=7)

    f, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)

    # KDE
    sns.kdeplot(df[var], shade=True, ax=ax1)
    sns.rugplot(df[var], alpha=0.5, ax=ax1)
    for cut in classi.bins:
        ax1.axvline(cut, color='red', linewidth=0.75)
    ax1.set_title('Value distribution')

    # Map
    # geopandas takes the target axes via `ax`; the older `axes` keyword is deprecated
    p = df.plot(column=var, scheme=scheme, alpha=0.75, k=7, cmap=cmap, ax=ax2, linewidth=0.1)
    ax2.axis('equal')
    ax2.set_axis_off()
    ax2.set_title('Geographical distribution')
    f.suptitle(scheme, size=25)
    if saveto:
        plt.savefig(saveto)
    plt.show()

Choropleth of EDR for all persons

Lighter colours signify a higher EDR.

plot_scheme('fisher_jenks', 'all_persons', map_data)

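[Figure: fisher_jenks classification of all_persons, showing the value distribution (left) and the geographical distribution (right)]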

Choropleth of EDR for couple families

plot_scheme('fisher_jenks', 'couple_fam', map_data)

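[Figure: fisher_jenks classification of couple_fam, showing the value distribution (left) and the geographical distribution (right)]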

Choropleth of EDR for lone parent families

It is important to note the EDR scale on the x-axis of the value distribution: these values are significantly higher than those in the previous plot for couple families.
The colour scheme is re-scaled for each plot, so colour intensities are not comparable between the previous plot and this one.

plot_scheme('fisher_jenks', 'lone_parent', map_data)

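[Figure: fisher_jenks classification of lone_parent, showing the value distribution (left) and the geographical distribution (right)]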

Using a different classification scheme

The following map uses exactly the same subset of data as the previous one, yet produces a choropleth with distinctly different classes.
Here the neighbourhoods are classified into quantiles of equal size, as opposed to the fisher_jenks classification scheme, which attempts to minimize intra-class differences while maximizing inter-class differences.
Hopefully this provides a good example of how the choice of classification scheme can affect the final output, and thus the interpretation and understanding of the plotted information. Depending on how the values are distributed, binning can simplify or skew the original information in unintended ways, so exploring several schemes and pairing the map with a density plot helps mitigate these issues.
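The objective behind natural-breaks classification can be illustrated with a brute-force search on a tiny sample. This sketch is not the Fisher-Jenks algorithm itself (which reaches the optimum far more efficiently) and not pysal's code; it simply tries every way of cutting the sorted values into k contiguous classes and keeps the one with the smallest within-class sum of squared deviations. The function names and toy values are hypothetical.

```python
from itertools import combinations


def within_class_ssd(classes):
    """Total squared deviation of each value from its own class mean."""
    total = 0.0
    for members in classes:
        mean = sum(members) / len(members)
        total += sum((v - mean) ** 2 for v in members)
    return total


def brute_force_natural_breaks(values, k):
    """Exhaustively find the k contiguous classes with minimal within-class SSD."""
    ordered = sorted(values)
    best_score, best_classes = None, None
    # Choose k - 1 cut positions between consecutive sorted values.
    for cuts in combinations(range(1, len(ordered)), k - 1):
        edges = (0, *cuts, len(ordered))
        classes = [ordered[a:b] for a, b in zip(edges, edges[1:])]
        score = within_class_ssd(classes)
        if best_score is None or score < best_score:
            best_score, best_classes = score, classes
    return best_classes


toy = [1, 2, 3, 10, 11, 12, 50]
print(brute_force_natural_breaks(toy, 3))
```

On this sample the search groups the values by their natural clusters, leaving the outlier in a class of its own, whereas quantiles would force classes of similar size regardless of the gaps. The exhaustive search is only practical for very small inputs.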

plot_scheme('quantiles', 'lone_parent', map_data)

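[Figure: quantiles classification of lone_parent, showing the value distribution (left) and the geographical distribution (right)]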

Overall, the TPL hackathon was very interesting! If I were to keep exploring this data, I would want to figure out what factors lead to such a high EDR in the Bay Street Corridor, which is typically thought of as a neighbourhood of higher-earning individuals who would not receive so many government transfer payments.