Background

This notebook is the result of some exploration and analysis that took place at the Toronto Public Library TOProsperity hackathon.

The challenge for the day was: How can we use information from annual tax records to better understand patterns of income across neighbourhoods in Toronto?

The data used was T1FF Neighbourhood Income and Demographics Tables, by Neighbourhood.

More info can be found at the following links:

TPL challenges

T1 family file info

I chose to map out the economic dependency ratio by neighbourhood since it seemed to be a tractable task for the day and could provide a good overview of earned income compared with social benefits received.

  • Economic Dependency Ratio (EDR):

The EDR is the sum of transfer payment dollars received as benefits in a given area for every $100 of employment income in that same area. For example, where a table shows an Employment Insurance (EI) dependency ratio of 4.69, it means that $4.69 in EI benefits were received for every $100 of employment income for the area.
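
As a quick sanity check on the definition, here is a minimal sketch of the calculation. The function name and the dollar figures are hypothetical; the figures were chosen only so that the result reproduces the 4.69 example above.

```python
def economic_dependency_ratio(transfer_dollars, employment_income_dollars):
    """Return transfer dollars received per $100 of employment income."""
    if employment_income_dollars <= 0:
        raise ValueError("employment income must be positive")
    return 100 * transfer_dollars / employment_income_dollars

# Hypothetical area: $469,000 in EI benefits against $10,000,000 of employment income.
ei_edr = economic_dependency_ratio(469_000, 10_000_000)
print(round(ei_edr, 2))
```

An EDR above 100 would mean the area received more in transfers than it earned in employment income.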

Explore and clean the CRA T1FF data

import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import pysal as ps
%matplotlib inline
df = pd.read_csv('./data/T1FF-F2010-2014.csv', low_memory=False)
subset = df[(df.Year == '2012') & (df.Table.isin(['F-7','F-8'])) & (df.Attribute.str.contains('EDR'))]
# Create two subsets of the CRA data: couple families (table F-7) and lone-parent families (table F-8)
CF_subset = df[(df.Year == '2014') & (df.Table.isin(['F-7'])) & (df.Attribute.str.contains('· Government transfers · EDR'))]
LP_subset = df[(df.Year == '2014') & (df.Table.isin(['F-8'])) & (df.Attribute.str.contains('· Government transfers · EDR'))]
CF_subset.Attribute.unique()
array(['Couple families · Government transfers · EDR',
       'Male Partners in Couple Families · Government transfers · EDR',
       'Female Partners in Couple Families · Government transfers · EDR',
       'Children in CFF · Government transfers · EDR',
       'All persons · Government transfers · EDR'], dtype=object)
LP_subset.Attribute.unique()
array(['Lone-parent families · Government transfers · EDR',
       'Parents in LPF · Government transfers · EDR',
       'Children in LPF · Government transfers · EDR',
       'Non-family persons · Government transfers · EDR',
       'All persons · Government transfers · EDR'], dtype=object)
CF_subset = CF_subset.drop(['Year', 'Table'], axis=1)
LP_subset = LP_subset.drop(['Year', 'Table'], axis=1)
long_EDR_CF = CF_subset.T
long_EDR_LP = LP_subset.T
long_EDR_CF.columns = long_EDR_CF.loc['Attribute']
long_EDR_LP.columns = long_EDR_LP.loc['Attribute']
long_EDR_CF.drop(['Attribute'], inplace=True)
long_EDR_LP.drop(['Attribute'], inplace=True)
long_EDR_CF.reset_index(inplace=True)
long_EDR_CF.rename(columns={'index': 'neighbourhood'}, inplace=True)

long_EDR_LP.reset_index(inplace=True)
long_EDR_LP.rename(columns={'index': 'neighbourhood'}, inplace=True)
long_EDR_CF.head(3)
Attribute neighbourhood Couple families · Government transfers · EDR Male Partners in Couple Families · Government transfers · EDR Female Partners in Couple Families · Government transfers · EDR Children in CFF · Government transfers · EDR All persons · Government transfers · EDR
0 Agincourt North 21.9 20.2 33 5.4 26.8
1 Agincourt South-Malvern West 19.6 18.5 28.4 5.2 24.1
2 Alderwood 11.5 10.7 14.1 6.2 16
total = long_EDR_CF.merge(long_EDR_LP[['neighbourhood', 'Lone-parent families · Government transfers · EDR']], on='neighbourhood')
cols = ['neighbourhood', 'Couple families · Government transfers · EDR',
        'Lone-parent families · Government transfers · EDR', 'All persons · Government transfers · EDR']
EDR_df = total[cols].copy()  # copy so the later in-place rename acts on an independent frame, not a view of `total`
EDR_df.head(2)
Attribute neighbourhood Couple families · Government transfers · EDR Lone-parent families · Government transfers · EDR All persons · Government transfers · EDR
0 Agincourt North 21.9 35.1 26.8
1 Agincourt South-Malvern West 19.6 37 24.1
cols = {'Couple families · Government transfers · EDR': 'couple_fam',
        'Lone-parent families · Government transfers · EDR': 'lone_parent',
        'All persons · Government transfers · EDR': 'all_persons'}

EDR_df.rename(columns=cols, inplace=True)
EDR_df.head()
Attribute neighbourhood couple_fam lone_parent all_persons
0 Agincourt North 21.9 35.1 26.8
1 Agincourt South-Malvern West 19.6 37 24.1
2 Alderwood 11.5 26.1 16
3 Annex 5.2 13.2 7.7
4 Banbury-Don Mills 11.3 21.6 15.8

Clean and merge the EDR data with Toronto neighbourhoods geometries

gdf = gpd.read_file('./data/neighbourhoods_shp/')
gdf.head(2)
AREA_NAME AREA_S_CD geometry
0 Yonge-St.Clair (97) 097 POLYGON ((-79.39119482700001 43.681081124, -79...
1 York University Heights (27) 027 POLYGON ((-79.505287916 43.759873494, -79.5048...
  • To merge the spatial and EDR data, first clean the neighbourhood names so you can do a merge on that column
neighbourhood = gdf['AREA_NAME'].str.replace(r"\(.*\)", "", regex=True)  # strip the bracketed area code, e.g. "(97)"
gdf['neighbourhood'] = neighbourhood.str.strip()
gdf.drop(['AREA_NAME', 'AREA_S_CD'], axis=1, inplace=True)
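
To make the name cleaning concrete, here is a small standalone sketch using only the standard library. The helper name `clean_area_name` is hypothetical, and the pattern is deliberately narrower than the one used above: it only removes a trailing numeric code in parentheses, which is an assumption about how the shapefile labels are formatted.

```python
import re

# Matches a trailing numeric code in parentheses, e.g. " (97)".
AREA_CODE_SUFFIX = re.compile(r"\s*\(\d+\)\s*$")

def clean_area_name(area_name):
    """Drop the trailing area code and any surrounding whitespace."""
    return AREA_CODE_SUFFIX.sub("", area_name).strip()

# Labels copied from the gdf.head(2) output above.
for raw in ["Yonge-St.Clair (97)", "York University Heights (27)"]:
    print(clean_area_name(raw))
```

Because the pattern requires digits inside the parentheses, a label with a descriptive bracket is left untouched rather than silently truncated.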
  • Not all neighbourhood names match between the CRA data and the city's neighbourhood shapefile, so more cleaning is needed.
EDR_neigh = set(EDR_df.neighbourhood)
gdf_neigh = set(gdf.neighbourhood)

print('diff1: ', EDR_neigh - gdf_neigh)
print()
print('diff2: ', gdf_neigh - EDR_neigh)
diff1:  {'North St. James Town', 'Mimico (includes Humber Bay Shores)', 'City of Toronto', 'Cabbagetown-South St. James Town', 'Weston-Pelham Park'}

diff2:  {'Cabbagetown-South St.James Town', 'Weston-Pellam Park', 'North St.James Town', 'Mimico'}
  • To fix this I'll use fuzzy string matching. This is a case where it's relatively easy to fix manually, but using a fuzzy string matching process is likely more robust and could scale up if more errors were present.
from fuzzywuzzy import process
correct_neighbs  = list(gdf.neighbourhood)

def correct_neigh(neighbourhood):
    if neighbourhood in correct_neighbs:  # a set would give O(1) membership checks
        return neighbourhood, 100

    new_name, score = process.extractOne(neighbourhood, correct_neighbs)
    if score < 90:
        return neighbourhood, score
    else:
        return new_name, score
EDR_df['corrected_hood'], EDR_df['score'] = zip(*EDR_df['neighbourhood'].apply(correct_neigh))
EDR_df.drop(['neighbourhood'], axis=1, inplace=True)
EDR_df = EDR_df[EDR_df.score >= 90]
EDR_df.head()
Attribute couple_fam lone_parent all_persons corrected_hood score
0 21.9 35.1 26.8 Agincourt North 100
1 19.6 37 24.1 Agincourt South-Malvern West 100
2 11.5 26.1 16 Alderwood 100
3 5.2 13.2 7.7 Annex 100
4 11.3 21.6 15.8 Banbury-Don Mills 100
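
For comparison, here is a rough sketch of the same idea using `difflib` from the standard library instead of fuzzywuzzy. This is a swapped-in technique, not what the notebook ran: `difflib` scores similarity differently, so the 0.9 cutoff below is an assumption and is not equivalent to the threshold of 90 used above. The helper name `closest_name` is hypothetical.

```python
from difflib import get_close_matches

# Shapefile names taken from the diff2 output above.
shapefile_names = ['Cabbagetown-South St.James Town', 'Weston-Pellam Park',
                   'North St.James Town', 'Mimico']

def closest_name(name, candidates, cutoff=0.9):
    """Return the best candidate at or above `cutoff` similarity, else None."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# CRA names taken from the diff1 output above.
for cra_name in ['North St. James Town', 'Weston-Pelham Park',
                 'Mimico (includes Humber Bay Shores)', 'City of Toronto']:
    print(cra_name, '->', closest_name(cra_name, shapefile_names))
```

With this measure, names that differ by a space or a single letter are matched, while a label carrying a long extra suffix falls below the cutoff and would need a manual mapping.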
  • After cleaning, merge the data on the cleaned up neighbourhood columns, then drop what is redundant:
map_data = gdf.merge(EDR_df, left_on='neighbourhood', right_on='corrected_hood')
map_data.drop(['corrected_hood'], axis=1, inplace=True)
map_data.head(3)
geometry neighbourhood couple_fam lone_parent all_persons score
0 POLYGON ((-79.39119482700001 43.681081124, -79... Yonge-St.Clair 5.5 8.4 7.4 100
1 POLYGON ((-79.505287916 43.759873494, -79.5048... York University Heights 21.1 47 26.7 100
2 POLYGON ((-79.439984311 43.761557655, -79.4400... Lansing-Westgate 6.1 19.5 7.7 100
map_data.dtypes
geometry         object
neighbourhood    object
couple_fam       object
lone_parent      object
all_persons      object
score             int64
dtype: object
#convert the EDR scores from objects to numeric type so they can be processed properly for mapping
map_data[['couple_fam', 'lone_parent', 'all_persons']] = map_data[['couple_fam', 'lone_parent', 'all_persons']].apply(pd.to_numeric)

Plotting choropleths

  • Choropleths encode the spatial distribution of a variable in a color scheme. There are a number of ways to convert values to a specific color.
  • It's important to note that different classification schemes of the same data can produce very different maps due to distribution of values and the simplifications inherent in building a choropleth.
  • I plotted static choropleth maps alongside the density plots of the variable in question to get a better understanding of how classification schemes can affect the final visual.
  • Lastly, I tested out bokeh to get an interactive map where hovering over a neighbourhood would highlight the specific EDR in that neighbourhood.
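
To illustrate how two classification schemes can split the same values differently, here is a minimal standard-library sketch. The helper names and the sample values are made up for illustration, and the break rules are simplified versions of equal-interval and quantile classification, not pysal's implementations.

```python
from bisect import bisect_left
from statistics import quantiles

def equal_interval_breaks(values, k):
    """Upper bounds of the first k-1 classes when the value range is split evenly."""
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * i / k for i in range(1, k)]

def quantile_breaks(values, k):
    """Upper bounds of the first k-1 classes when classes hold roughly equal counts."""
    return quantiles(values, n=k, method="inclusive")

def class_counts(values, breaks):
    """Count the values in each class; class i holds values <= breaks[i]."""
    counts = [0] * (len(breaks) + 1)
    for v in values:
        counts[bisect_left(breaks, v)] += 1
    return counts

# Made-up, right-skewed values: nine small numbers and one large outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(class_counts(values, equal_interval_breaks(values, 3)))
print(class_counts(values, quantile_breaks(values, 3)))
```

With a single large outlier, equal intervals put almost every value in the lowest class, while quantiles spread the values across classes regardless of how far apart they are.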
from matplotlib import cm
cmap = cm.get_cmap('viridis')
def plot_scheme(scheme, var, df, figsize=(16, 8), saveto=None):
    '''
    Plot the distribution over value and geographical space of variable `var` using scheme `scheme`
    ...

    Arguments
    ---------
    scheme   : str
               Name of the classification scheme to use
    var      : str
               Variable name
    df       : GeoDataFrame
               Table with input data
    figsize  : Tuple
               [Optional. Default = (16, 8)] Size of the figure to be created.
    saveto   : None/str
               [Optional. Default = None] Path for file to save the plot.
    '''
    from pysal.esda.mapclassify import Quantiles, Equal_Interval, Fisher_Jenks

    schemes = {'equal_interval': Equal_Interval, \
               'quantiles': Quantiles, \
               'fisher_jenks': Fisher_Jenks}
    classi = schemes[scheme](df[var], k=7)

    f, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)

    # KDE
    sns.kdeplot(df[var], shade=True, ax=ax1)
    sns.rugplot(df[var], alpha=0.5, ax=ax1)
    for cut in classi.bins:
        ax1.axvline(cut, color='red', linewidth=0.75)
    ax1.set_title('Value distribution')

    # Map
    p = df.plot(column=var, scheme=scheme, alpha=0.75, k=7, cmap=cmap, axes=ax2, linewidth=0.1)
    ax2.axis('equal')
    ax2.set_axis_off()
    ax2.set_title('Geographical distribution')
    f.suptitle(scheme, size=25)
    if saveto:
        plt.savefig(saveto)
    plt.show()

Choropleth of EDR for all persons

  • Lighter colours signify a higher EDR.
plot_scheme('fisher_jenks', 'all_persons', map_data)

Choropleth of EDR for couple families

plot_scheme('fisher_jenks', 'couple_fam', map_data)

Choropleth of EDR for lone parent families

  • Note the EDR scale on the x-axis of the value distribution: these values are significantly higher than those seen in the previous plot for couple families.
  • The colour scheme is re-adjusted for each plot, so the colour intensities are not comparable between the previous plot and this one.
plot_scheme('fisher_jenks', 'lone_parent', map_data)
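
One way to make two maps comparable would be to compute a single set of class breaks from the pooled values and classify both variables with it. The sketch below shows the idea with the standard library only. The helper names are hypothetical, quantile breaks are used purely for simplicity, and the sample values are the first five rows of the EDR_df.head() output above rather than the full data.

```python
from bisect import bisect_left
from statistics import quantiles

def shared_breaks(columns, k=4):
    """Quantile breaks computed once over the pooled values of all columns."""
    pooled = [v for col in columns for v in col]
    return quantiles(pooled, n=k, method="inclusive")

def classify(values, breaks):
    """Class index for each value; class i holds values <= breaks[i]."""
    return [bisect_left(breaks, v) for v in values]

# First five neighbourhoods from the EDR_df.head() output above.
couple_fam = [21.9, 19.6, 11.5, 5.2, 11.3]
lone_parent = [35.1, 37.0, 26.1, 13.2, 21.6]

breaks = shared_breaks([couple_fam, lone_parent], k=4)
couple_classes = classify(couple_fam, breaks)
lone_classes = classify(lone_parent, breaks)
print(breaks)
print(couple_classes)
print(lone_classes)
```

Since both lists are classified against the same breaks, a given class index means the same EDR range for couple families and for lone-parent families.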

Using a different classification scheme

  • The following map uses exactly the same subset of data as the previous one, yet produces a choropleth with distinctly different classes.
  • Here the neighbourhoods are classified into quantiles, so each class holds roughly the same number of neighbourhoods, whereas the fisher_jenks scheme attempts to minimize intra-class differences while maximizing inter-class differences.
  • Hopefully this is a good example of how the choice of classification scheme can affect the final output, and thus the interpretation of the plotted information. Depending on how the values are distributed, a scheme can simplify or skew the original information in unintended ways, so exploring several schemes alongside a density plot helps mitigate these issues.
plot_scheme('quantiles', 'lone_parent', map_data)
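
To show what "minimize intra-class differences" means in practice, here is a brute-force sketch that tries every way of cutting the sorted values into k contiguous classes and keeps the split with the smallest within-class sum of squared deviations. It is only an illustration of the objective on made-up values: it is far too slow for real data, and it is not how a library such as pysal would implement Fisher-Jenks. The function names are hypothetical.

```python
from itertools import combinations

def within_class_ssd(groups):
    """Sum of squared deviations of each value from its own class mean."""
    total = 0.0
    for group in groups:
        mean = sum(group) / len(group)
        total += sum((v - mean) ** 2 for v in group)
    return total

def natural_breaks_bruteforce(values, k):
    """Return the k contiguous classes of the sorted values with minimal within-class SSD."""
    data = sorted(values)
    best_groups, best_ssd = None, float("inf")
    # Each combination is a set of k-1 cut positions between sorted values.
    for cuts in combinations(range(1, len(data)), k - 1):
        edges = (0, *cuts, len(data))
        groups = [data[a:b] for a, b in zip(edges, edges[1:])]
        ssd = within_class_ssd(groups)
        if ssd < best_ssd:
            best_groups, best_ssd = groups, ssd
    return best_groups

# Made-up values with three visible clusters.
values = [1, 2, 3, 10, 11, 12, 50, 51]
print(natural_breaks_bruteforce(values, 3))
```

Unlike quantiles, this objective does not care how many values end up in each class, which is why the two schemes can draw very different maps from the same data.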

Overall the TPL hackathon was very interesting! If I were to keep exploring this data, I would want to figure out what factors lead to such a high EDR in the Bay Street Corridor, which is typically thought of as a neighbourhood of higher-earning individuals who would not receive so many government transfer payments.