This is a notebook from a presentation I recently gave to the UofT scientific coding group. There was a good turnout and it gave me good experience for this sort of coding walkthrough as a teaching experience.

You can check out the screencast on youtube.

Objectives

  • Read tabular data into an IPython notebook
  • Access columns of the data
  • Isolate subsets of the data
  • Generate plots based on subsetted data

Resources

Pandas has lots of great documentation, tutorials and walkthroughs.

This tutorial was based largely off of a SWC inspired lesson by Nancy Soontiens.

I adapted other parts from a great tutorial by Greg Reda.

More can also be found in the pandas documentation.

A great youtube walkthrough from PyCon 2015:

I've also found a recent set of helpful blogposts for intermediate and advanced users.

Working with dataframes

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pandas introduces two new data structures to Python - Series and DataFrame, both of which are built on top of NumPy.

We can load in a tabular data set as a dataframe in a number of different ways.

df = pd.read_table('./gapminderDataFiveYear.txt')
df
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
1 Afghanistan 1957 9240934 Asia 30.332 820.853030
2 Afghanistan 1962 10267083 Asia 31.997 853.100710
3 Afghanistan 1967 11537966 Asia 34.020 836.197138
4 Afghanistan 1972 13079460 Asia 36.088 739.981106
5 Afghanistan 1977 14880372 Asia 38.438 786.113360
6 Afghanistan 1982 12881816 Asia 39.854 978.011439
7 Afghanistan 1987 13867957 Asia 40.822 852.395945
8 Afghanistan 1992 16317921 Asia 41.674 649.341395
9 Afghanistan 1997 22227415 Asia 41.763 635.341351
10 Afghanistan 2002 25268405 Asia 42.129 726.734055
11 Afghanistan 2007 31889923 Asia 43.828 974.580338
12 Albania 1952 1282697 Europe 55.230 1601.056136
13 Albania 1957 1476505 Europe 59.280 1942.284244
14 Albania 1962 1728137 Europe 64.820 2312.888958
15 Albania 1967 1984060 Europe 66.220 2760.196931
16 Albania 1972 2263554 Europe 67.690 3313.422188
17 Albania 1977 2509048 Europe 68.930 3533.003910
18 Albania 1982 2780097 Europe 70.420 3630.880722
19 Albania 1987 3075321 Europe 72.000 3738.932735
20 Albania 1992 3326498 Europe 71.581 2497.437901
21 Albania 1997 3428038 Europe 72.950 3193.054604
22 Albania 2002 3508512 Europe 75.651 4604.211737
23 Albania 2007 3600523 Europe 76.423 5937.029526
24 Algeria 1952 9279525 Africa 43.077 2449.008185
25 Algeria 1957 10270856 Africa 45.685 3013.976023
26 Algeria 1962 11000948 Africa 48.303 2550.816880
27 Algeria 1967 12760499 Africa 51.407 3246.991771
28 Algeria 1972 14760787 Africa 54.518 4182.663766
29 Algeria 1977 17152804 Africa 58.014 4910.416756
... ... ... ... ... ... ...
1674 Yemen, Rep. 1982 9657618 Asia 49.113 1977.557010
1675 Yemen, Rep. 1987 11219340 Asia 52.922 1971.741538
1676 Yemen, Rep. 1992 13367997 Asia 55.599 1879.496673
1677 Yemen, Rep. 1997 15826497 Asia 58.020 2117.484526
1678 Yemen, Rep. 2002 18701257 Asia 60.308 2234.820827
1679 Yemen, Rep. 2007 22211743 Asia 62.698 2280.769906
1680 Zambia 1952 2672000 Africa 42.038 1147.388831
1681 Zambia 1957 3016000 Africa 44.077 1311.956766
1682 Zambia 1962 3421000 Africa 46.023 1452.725766
1683 Zambia 1967 3900000 Africa 47.768 1777.077318
1684 Zambia 1972 4506497 Africa 50.107 1773.498265
1685 Zambia 1977 5216550 Africa 51.386 1588.688299
1686 Zambia 1982 6100407 Africa 51.821 1408.678565
1687 Zambia 1987 7272406 Africa 50.821 1213.315116
1688 Zambia 1992 8381163 Africa 46.100 1210.884633
1689 Zambia 1997 9417789 Africa 40.238 1071.353818
1690 Zambia 2002 10595811 Africa 39.193 1071.613938
1691 Zambia 2007 11746035 Africa 42.384 1271.211593
1692 Zimbabwe 1952 3080907 Africa 48.451 406.884115
1693 Zimbabwe 1957 3646340 Africa 50.469 518.764268
1694 Zimbabwe 1962 4277736 Africa 52.358 527.272182
1695 Zimbabwe 1967 4995432 Africa 53.995 569.795071
1696 Zimbabwe 1972 5861135 Africa 55.635 799.362176
1697 Zimbabwe 1977 6642107 Africa 57.674 685.587682
1698 Zimbabwe 1982 7636524 Africa 60.363 788.855041
1699 Zimbabwe 1987 9216418 Africa 62.351 706.157306
1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786
1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960
1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623
1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298

1704 rows × 6 columns

type(df)
pandas.core.frame.DataFrame
df.shape
(1704, 6)
df.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
df.head()
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
1 Afghanistan 1957 9240934 Asia 30.332 820.853030
2 Afghanistan 1962 10267083 Asia 31.997 853.100710
3 Afghanistan 1967 11537966 Asia 34.020 836.197138
4 Afghanistan 1972 13079460 Asia 36.088 739.981106
df.head(6)
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
1 Afghanistan 1957 9240934 Asia 30.332 820.853030
2 Afghanistan 1962 10267083 Asia 31.997 853.100710
3 Afghanistan 1967 11537966 Asia 34.020 836.197138
4 Afghanistan 1972 13079460 Asia 36.088 739.981106
5 Afghanistan 1977 14880372 Asia 38.438 786.113360
df.tail()
country year pop continent lifeExp gdpPercap
1699 Zimbabwe 1987 9216418 Africa 62.351 706.157306
1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786
1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960
1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623
1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
year         1704 non-null int64
pop          1704 non-null float64
continent    1704 non-null object
lifeExp      1704 non-null float64
gdpPercap    1704 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 93.2+ KB
df.dtypes
country       object
year           int64
pop          float64
continent     object
lifeExp      float64
gdpPercap    float64
dtype: object

Get summary statistics for the numeric columns with the describe() method

df.describe()
year pop lifeExp gdpPercap
count 1704.00000 1.704000e+03 1704.000000 1704.000000
mean 1979.50000 2.960121e+07 59.474439 7215.327081
std 17.26533 1.061579e+08 12.917107 9857.454543
min 1952.00000 6.001100e+04 23.599000 241.165877
25% 1965.75000 2.793664e+06 48.198000 1202.060309
50% 1979.50000 7.023596e+06 60.712500 3531.846989
75% 1993.25000 1.958522e+07 70.845500 9325.462346
max 2007.00000 1.318683e+09 82.603000 113523.132900

Data selection

Sometimes we need to look at only parts of the data. For example, we might want to look at the data for a particular country or in a particular year.

Selecting columns

#select multiple columns with a list of column names
col_list = ['year','lifeExp', 'country']
df[col_list]
year lifeExp country
0 1952 28.801 Afghanistan
1 1957 30.332 Afghanistan
2 1962 31.997 Afghanistan
3 1967 34.020 Afghanistan
4 1972 36.088 Afghanistan
5 1977 38.438 Afghanistan
6 1982 39.854 Afghanistan
7 1987 40.822 Afghanistan
8 1992 41.674 Afghanistan
9 1997 41.763 Afghanistan
10 2002 42.129 Afghanistan
11 2007 43.828 Afghanistan
12 1952 55.230 Albania
13 1957 59.280 Albania
14 1962 64.820 Albania
15 1967 66.220 Albania
16 1972 67.690 Albania
17 1977 68.930 Albania
18 1982 70.420 Albania
19 1987 72.000 Albania
20 1992 71.581 Albania
21 1997 72.950 Albania
22 2002 75.651 Albania
23 2007 76.423 Albania
24 1952 43.077 Algeria
25 1957 45.685 Algeria
26 1962 48.303 Algeria
27 1967 51.407 Algeria
28 1972 54.518 Algeria
29 1977 58.014 Algeria
... ... ... ...
1674 1982 49.113 Yemen, Rep.
1675 1987 52.922 Yemen, Rep.
1676 1992 55.599 Yemen, Rep.
1677 1997 58.020 Yemen, Rep.
1678 2002 60.308 Yemen, Rep.
1679 2007 62.698 Yemen, Rep.
1680 1952 42.038 Zambia
1681 1957 44.077 Zambia
1682 1962 46.023 Zambia
1683 1967 47.768 Zambia
1684 1972 50.107 Zambia
1685 1977 51.386 Zambia
1686 1982 51.821 Zambia
1687 1987 50.821 Zambia
1688 1992 46.100 Zambia
1689 1997 40.238 Zambia
1690 2002 39.193 Zambia
1691 2007 42.384 Zambia
1692 1952 48.451 Zimbabwe
1693 1957 50.469 Zimbabwe
1694 1962 52.358 Zimbabwe
1695 1967 53.995 Zimbabwe
1696 1972 55.635 Zimbabwe
1697 1977 57.674 Zimbabwe
1698 1982 60.363 Zimbabwe
1699 1987 62.351 Zimbabwe
1700 1992 60.377 Zimbabwe
1701 1997 46.809 Zimbabwe
1702 2002 39.989 Zimbabwe
1703 2007 43.487 Zimbabwe

1704 rows × 3 columns

#alternative selection with dot notation won't work if column names have spaces, uncommon characters or leading numbers
df.lifeExp
0       28.801
1       30.332
2       31.997
3       34.020
4       36.088
5       38.438
6       39.854
7       40.822
8       41.674
9       41.763
10      42.129
11      43.828
12      55.230
13      59.280
14      64.820
15      66.220
16      67.690
17      68.930
18      70.420
19      72.000
20      71.581
21      72.950
22      75.651
23      76.423
24      43.077
25      45.685
26      48.303
27      51.407
28      54.518
29      58.014
         ...
1674    49.113
1675    52.922
1676    55.599
1677    58.020
1678    60.308
1679    62.698
1680    42.038
1681    44.077
1682    46.023
1683    47.768
1684    50.107
1685    51.386
1686    51.821
1687    50.821
1688    46.100
1689    40.238
1690    39.193
1691    42.384
1692    48.451
1693    50.469
1694    52.358
1695    53.995
1696    55.635
1697    57.674
1698    60.363
1699    62.351
1700    60.377
1701    46.809
1702    39.989
1703    43.487
Name: lifeExp, dtype: float64
#using describe() on a categorical column
df[['country', 'continent']].describe()
country continent
count 1704 1704
unique 142 5
top Guatemala Africa
freq 12 624

Selecting rows

#by index location
df.iloc[[1000]]
country year pop continent lifeExp gdpPercap
1000 Mongolia 1972 1320500 Asia 53.754 1421.741975
#you can provide a list of index values to select
df.iloc[[0,5,10]]
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
5 Afghanistan 1977 14880372 Asia 38.438 786.113360
10 Afghanistan 2002 25268405 Asia 42.129 726.734055
#or select with the slice notation
df[0:5]
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
1 Afghanistan 1957 9240934 Asia 30.332 820.853030
2 Afghanistan 1962 10267083 Asia 31.997 853.100710
3 Afghanistan 1967 11537966 Asia 34.020 836.197138
4 Afghanistan 1972 13079460 Asia 36.088 739.981106
#select by index label
#would require named index
country_index = df.set_index(['continent','country'])
country_index
year pop lifeExp gdpPercap
continent country
Asia Afghanistan 1952 8425333 28.801 779.445314
Afghanistan 1957 9240934 30.332 820.853030
Afghanistan 1962 10267083 31.997 853.100710
Afghanistan 1967 11537966 34.020 836.197138
Afghanistan 1972 13079460 36.088 739.981106
Afghanistan 1977 14880372 38.438 786.113360
Afghanistan 1982 12881816 39.854 978.011439
Afghanistan 1987 13867957 40.822 852.395945
Afghanistan 1992 16317921 41.674 649.341395
Afghanistan 1997 22227415 41.763 635.341351
Afghanistan 2002 25268405 42.129 726.734055
Afghanistan 2007 31889923 43.828 974.580338
Europe Albania 1952 1282697 55.230 1601.056136
Albania 1957 1476505 59.280 1942.284244
Albania 1962 1728137 64.820 2312.888958
Albania 1967 1984060 66.220 2760.196931
Albania 1972 2263554 67.690 3313.422188
Albania 1977 2509048 68.930 3533.003910
Albania 1982 2780097 70.420 3630.880722
Albania 1987 3075321 72.000 3738.932735
Albania 1992 3326498 71.581 2497.437901
Albania 1997 3428038 72.950 3193.054604
Albania 2002 3508512 75.651 4604.211737
Albania 2007 3600523 76.423 5937.029526
Africa Algeria 1952 9279525 43.077 2449.008185
Algeria 1957 10270856 45.685 3013.976023
Algeria 1962 11000948 48.303 2550.816880
Algeria 1967 12760499 51.407 3246.991771
Algeria 1972 14760787 54.518 4182.663766
Algeria 1977 17152804 58.014 4910.416756
... ... ... ... ... ...
Asia Yemen, Rep. 1982 9657618 49.113 1977.557010
Yemen, Rep. 1987 11219340 52.922 1971.741538
Yemen, Rep. 1992 13367997 55.599 1879.496673
Yemen, Rep. 1997 15826497 58.020 2117.484526
Yemen, Rep. 2002 18701257 60.308 2234.820827
Yemen, Rep. 2007 22211743 62.698 2280.769906
Africa Zambia 1952 2672000 42.038 1147.388831
Zambia 1957 3016000 44.077 1311.956766
Zambia 1962 3421000 46.023 1452.725766
Zambia 1967 3900000 47.768 1777.077318
Zambia 1972 4506497 50.107 1773.498265
Zambia 1977 5216550 51.386 1588.688299
Zambia 1982 6100407 51.821 1408.678565
Zambia 1987 7272406 50.821 1213.315116
Zambia 1992 8381163 46.100 1210.884633
Zambia 1997 9417789 40.238 1071.353818
Zambia 2002 10595811 39.193 1071.613938
Zambia 2007 11746035 42.384 1271.211593
Zimbabwe 1952 3080907 48.451 406.884115
Zimbabwe 1957 3646340 50.469 518.764268
Zimbabwe 1962 4277736 52.358 527.272182
Zimbabwe 1967 4995432 53.995 569.795071
Zimbabwe 1972 5861135 55.635 799.362176
Zimbabwe 1977 6642107 57.674 685.587682
Zimbabwe 1982 7636524 60.363 788.855041
Zimbabwe 1987 9216418 62.351 706.157306
Zimbabwe 1992 10704340 60.377 693.420786
Zimbabwe 1997 11404948 46.809 792.449960
Zimbabwe 2002 11926563 39.989 672.038623
Zimbabwe 2007 12311143 43.487 469.709298

1704 rows × 4 columns

country_index.loc['Americas','Canada']
/home/derek/anaconda2/envs/py3/lib/python3.5/site-packages/ipykernel/__main__.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  if __name__ == '__main__':
year pop lifeExp gdpPercap
continent country
Americas Canada 1952 14785584 68.750 11367.16112
Canada 1957 17010154 69.960 12489.95006
Canada 1962 18985849 71.300 13462.48555
Canada 1967 20819767 72.130 16076.58803
Canada 1972 22284500 72.880 18970.57086
Canada 1977 23796400 74.210 22090.88306
Canada 1982 25201900 75.760 22898.79214
Canada 1987 26549700 76.860 26626.51503
Canada 1992 28523502 77.950 26342.88426
Canada 1997 30305843 78.610 28954.92589
Canada 2002 31902268 79.770 33328.96507
Canada 2007 33390141 80.653 36319.23501
#boolean indexing
large_pop = df[df['pop'] > 300000000]
large_pop
country year pop continent lifeExp gdpPercap
288 China 1952 5.562635e+08 Asia 44.00000 400.448611
289 China 1957 6.374080e+08 Asia 50.54896 575.987001
290 China 1962 6.657700e+08 Asia 44.50136 487.674018
291 China 1967 7.545500e+08 Asia 58.38112 612.705693
292 China 1972 8.620300e+08 Asia 63.11888 676.900092
293 China 1977 9.434550e+08 Asia 63.96736 741.237470
294 China 1982 1.000281e+09 Asia 65.52500 962.421380
295 China 1987 1.084035e+09 Asia 67.27400 1378.904018
296 China 1992 1.164970e+09 Asia 68.69000 1655.784158
297 China 1997 1.230075e+09 Asia 70.42600 2289.234136
298 China 2002 1.280400e+09 Asia 72.02800 3119.280896
299 China 2007 1.318683e+09 Asia 72.96100 4959.114854
696 India 1952 3.720000e+08 Asia 37.37300 546.565749
697 India 1957 4.090000e+08 Asia 40.24900 590.061996
698 India 1962 4.540000e+08 Asia 43.60500 658.347151
699 India 1967 5.060000e+08 Asia 47.19300 700.770611
700 India 1972 5.670000e+08 Asia 50.65100 724.032527
701 India 1977 6.340000e+08 Asia 54.20800 813.337323
702 India 1982 7.080000e+08 Asia 56.59600 855.723538
703 India 1987 7.880000e+08 Asia 58.55300 976.512676
704 India 1992 8.720000e+08 Asia 60.22300 1164.406809
705 India 1997 9.590000e+08 Asia 61.76500 1458.817442
706 India 2002 1.034173e+09 Asia 62.87900 1746.769454
707 India 2007 1.110396e+09 Asia 64.69800 2452.210407
1619 United States 2007 3.011399e+08 Americas 78.24200 42951.653090
large_pop['country'].unique()
array(['China', 'India', 'United States'], dtype=object)

You can also chain together multiple criteria for boolean indexing:

multi_criteria = df[(df['country']=='Canada') | (df['year'] > 1990)]
multi_criteria
country year pop continent lifeExp gdpPercap
8 Afghanistan 1992 16317921 Asia 41.674 649.341395
9 Afghanistan 1997 22227415 Asia 41.763 635.341351
10 Afghanistan 2002 25268405 Asia 42.129 726.734055
11 Afghanistan 2007 31889923 Asia 43.828 974.580338
20 Albania 1992 3326498 Europe 71.581 2497.437901
21 Albania 1997 3428038 Europe 72.950 3193.054604
22 Albania 2002 3508512 Europe 75.651 4604.211737
23 Albania 2007 3600523 Europe 76.423 5937.029526
32 Algeria 1992 26298373 Africa 67.744 5023.216647
33 Algeria 1997 29072015 Africa 69.152 4797.295051
34 Algeria 2002 31287142 Africa 70.994 5288.040382
35 Algeria 2007 33333216 Africa 72.301 6223.367465
44 Angola 1992 8735988 Africa 40.647 2627.845685
45 Angola 1997 9875024 Africa 40.963 2277.140884
46 Angola 2002 10866106 Africa 41.003 2773.287312
47 Angola 2007 12420476 Africa 42.731 4797.231267
56 Argentina 1992 33958947 Americas 71.868 9308.418710
57 Argentina 1997 36203463 Americas 73.275 10967.281950
58 Argentina 2002 38331121 Americas 74.340 8797.640716
59 Argentina 2007 40301927 Americas 75.320 12779.379640
68 Australia 1992 17481977 Oceania 77.560 23424.766830
69 Australia 1997 18565243 Oceania 78.830 26997.936570
70 Australia 2002 19546792 Oceania 80.370 30687.754730
71 Australia 2007 20434176 Oceania 81.235 34435.367440
80 Austria 1992 7914969 Europe 76.040 27042.018680
81 Austria 1997 8069876 Europe 77.510 29095.920660
82 Austria 2002 8148312 Europe 78.980 32417.607690
83 Austria 2007 8199783 Europe 79.829 36126.492700
92 Bahrain 1992 529491 Asia 72.601 19035.579170
93 Bahrain 1997 598561 Asia 73.925 20292.016790
... ... ... ... ... ... ...
1618 United States 2002 287675526 Americas 77.310 39097.099550
1619 United States 2007 301139947 Americas 78.242 42951.653090
1628 Uruguay 1992 3149262 Americas 72.752 8137.004775
1629 Uruguay 1997 3262838 Americas 74.223 9230.240708
1630 Uruguay 2002 3363085 Americas 75.307 7727.002004
1631 Uruguay 2007 3447496 Americas 76.384 10611.462990
1640 Venezuela 1992 20265563 Americas 71.150 10733.926310
1641 Venezuela 1997 22374398 Americas 72.146 10165.495180
1642 Venezuela 2002 24287670 Americas 72.766 8605.047831
1643 Venezuela 2007 26084662 Americas 73.747 11415.805690
1652 Vietnam 1992 69940728 Asia 67.662 989.023149
1653 Vietnam 1997 76048996 Asia 70.672 1385.896769
1654 Vietnam 2002 80908147 Asia 73.017 1764.456677
1655 Vietnam 2007 85262356 Asia 74.249 2441.576404
1664 West Bank and Gaza 1992 2104779 Asia 69.718 6017.654756
1665 West Bank and Gaza 1997 2826046 Asia 71.096 7110.667619
1666 West Bank and Gaza 2002 3389578 Asia 72.370 4515.487575
1667 West Bank and Gaza 2007 4018332 Asia 73.422 3025.349798
1676 Yemen, Rep. 1992 13367997 Asia 55.599 1879.496673
1677 Yemen, Rep. 1997 15826497 Asia 58.020 2117.484526
1678 Yemen, Rep. 2002 18701257 Asia 60.308 2234.820827
1679 Yemen, Rep. 2007 22211743 Asia 62.698 2280.769906
1688 Zambia 1992 8381163 Africa 46.100 1210.884633
1689 Zambia 1997 9417789 Africa 40.238 1071.353818
1690 Zambia 2002 10595811 Africa 39.193 1071.613938
1691 Zambia 2007 11746035 Africa 42.384 1271.211593
1700 Zimbabwe 1992 10704340 Africa 60.377 693.420786
1701 Zimbabwe 1997 11404948 Africa 46.809 792.449960
1702 Zimbabwe 2002 11926563 Africa 39.989 672.038623
1703 Zimbabwe 2007 12311143 Africa 43.487 469.709298

576 rows × 6 columns

Q:

How many unique countries are there in our dataframe? Years?

df.country.unique()
array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium', 'Benin',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
       'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
       'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',
       'Korea, Rep.', 'Kuwait', 'Lebanon', 'Lesotho', 'Liberia', 'Libya',
       'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Mauritania',
       'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Morocco',
       'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands',
       'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman',
       'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Poland',
       'Portugal', 'Puerto Rico', 'Reunion', 'Romania', 'Rwanda',
       'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
       'Sierra Leone', 'Singapore', 'Slovak Republic', 'Slovenia',
       'Somalia', 'South Africa', 'Spain', 'Sri Lanka', 'Sudan',
       'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan', 'Tanzania',
       'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia', 'Turkey',
       'Uganda', 'United Kingdom', 'United States', 'Uruguay', 'Venezuela',
       'Vietnam', 'West Bank and Gaza', 'Yemen, Rep.', 'Zambia', 'Zimbabwe'], dtype=object)

Exercise

Write a function 'print_stats()' that will print a given country's life expectancy, population and gdp per capita in a given year. (note data is available only for every 5 years between 1952 and 2007).

def print_stats(df,country,year):
    """ Prints the life expectancy, gdp per capita and population
    of country in year. """
    spec = ['lifeExp','pop','gdpPercap']

    print('Statistics for', country, 'in', year)
    print(df[(df['country']==country) & (df['year']==year)][spec])
print_stats(df, 'Canada', 2007)
Statistics for Canada in 2007
     lifeExp       pop    gdpPercap
251   80.653  33390141  36319.23501

Groupby

We can use the groupby method to split up the data according to repeated values in each column. For example, group the data by continent. This is helpful if we want to repeat an analysis on each group of data from a continent.

continents = df.groupby('continent')
continents
<pandas.core.groupby.DataFrameGroupBy object at 0x7fbac870b128>
len(continents)
5
#helpful way to visualize the groupby object: gives first row of each group
continents.first()
country year pop lifeExp gdpPercap
continent
Africa Algeria 1952 9279525 43.077 2449.008185
Americas Argentina 1952 17876956 62.485 5911.315053
Asia Afghanistan 1952 8425333 28.801 779.445314
Europe Albania 1952 1282697 55.230 1601.056136
Oceania Australia 1952 8691212 69.120 10039.595640

Q:

List the names of the continents and the number of data points in each.

continents.size()
continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
dtype: int64

Q:

How many unique countries are there grouped together in the Americas continent?

len(continents.get_group('Americas')['country'].unique())
25

You can use an aggregate function to get the mean life expectancy in the different continents

continents['lifeExp'].mean()
continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64

The previous cell showed mean life expectancy values aggregated over all the years.

Alternatively, we can groupby multiple columns and use an aggregate function to get the mean life expectancy/population/gdpPercap in a specific continent in a specific year of interest:

df.groupby(['continent', 'year']).agg(np.mean)
pop lifeExp gdpPercap
continent year
Africa 1952 4.570010e+06 39.135500 1252.572466
1957 5.093033e+06 41.266346 1385.236062
1962 5.702247e+06 43.319442 1598.078825
1967 6.447875e+06 45.334538 2050.363801
1972 7.305376e+06 47.450942 2339.615674
1977 8.328097e+06 49.580423 2585.938508
1982 9.602857e+06 51.592865 2481.592960
1987 1.105450e+07 53.344788 2282.668991
1992 1.267464e+07 53.629577 2281.810333
1997 1.430448e+07 53.598269 2378.759555
2002 1.603315e+07 53.325231 2599.385159
2007 1.787576e+07 54.806038 3089.032605
Americas 1952 1.380610e+07 53.279840 4079.062552
1957 1.547816e+07 55.960280 4616.043733
1962 1.733081e+07 58.398760 4901.541870
1967 1.922986e+07 60.410920 5668.253496
1972 2.117537e+07 62.394920 6491.334139
1977 2.312271e+07 64.391560 7352.007126
1982 2.521164e+07 66.228840 7506.737088
1987 2.731016e+07 68.090720 7793.400261
1992 2.957096e+07 69.568360 8044.934406
1997 3.187602e+07 71.150480 8889.300863
2002 3.399091e+07 72.422040 9287.677107
2007 3.595485e+07 73.608120 11003.031625
Asia 1952 4.228356e+07 46.314394 5195.484004
1957 4.735699e+07 49.318544 5787.732940
1962 5.140476e+07 51.563223 5729.369625
1967 5.774736e+07 54.663640 5971.173374
1972 6.518098e+07 57.319269 8187.468699
1977 7.225799e+07 59.610556 7791.314020
1982 7.909502e+07 62.617939 7434.135157
1987 8.700669e+07 64.851182 7608.226508
1992 9.494825e+07 66.537212 8639.690248
1997 1.025238e+08 68.020515 9834.093295
2002 1.091455e+08 69.233879 10174.090397
2007 1.155138e+08 70.728485 12473.026870
Europe 1952 1.393736e+07 64.408500 5661.057435
1957 1.459635e+07 66.703067 6963.012816
1962 1.534517e+07 68.539233 8365.486814
1967 1.603930e+07 69.737600 10143.823757
1972 1.668784e+07 70.775033 12479.575246
1977 1.723882e+07 71.937767 14283.979110
1982 1.770890e+07 72.806400 15617.896551
1987 1.810314e+07 73.642167 17214.310727
1992 1.860476e+07 74.440100 17061.568084
1997 1.896480e+07 75.505167 19076.781802
2002 1.927413e+07 76.700600 21711.732422
2007 1.953662e+07 77.648600 25054.481636
Oceania 1952 5.343003e+06 69.255000 10298.085650
1957 5.970988e+06 70.295000 11598.522455
1962 6.641759e+06 71.085000 12696.452430
1967 7.300207e+06 71.310000 14495.021790
1972 8.053050e+06 71.910000 16417.333380
1977 8.619500e+06 72.855000 17283.957605
1982 9.197425e+06 74.290000 18554.709840
1987 9.787208e+06 75.320000 20448.040160
1992 1.045983e+07 76.945000 20894.045885
1997 1.112072e+07 78.190000 24024.175170
2002 1.172741e+07 79.740000 26938.778040
2007 1.227497e+07 80.719500 29810.188275

You can also retrieve a particular group with the get_group() command.

continents.get_group('Africa').describe()
year pop lifeExp gdpPercap
count 624.00000 6.240000e+02 624.00000 624.000000
mean 1979.50000 9.916003e+06 48.86533 2193.754578
std 17.27411 1.549092e+07 9.15021 2827.929863
min 1952.00000 6.001100e+04 23.59900 241.165877
25% 1965.75000 1.342075e+06 42.37250 761.247010
50% 1979.50000 4.579311e+06 47.79200 1192.138217
75% 1993.25000 1.080149e+07 54.41150 2377.417422
max 2007.00000 1.350312e+08 76.44200 21951.211760

Q:

What is the maximum life expectancy for a country in Asia?

continents.get_group('Asia').lifeExp.max()
82.602999999999994

What country is this? When was the measurement taken? We can figure this out in a few different ways:

continents.get_group('Asia').lifeExp.idxmax()
803
#idxmax convenience function will return the index with max value
df[df['continent']=='Asia']['lifeExp'].idxmax()
803
df.loc[803]
country            Japan
year                2007
pop          1.27468e+08
continent           Asia
lifeExp           82.603
gdpPercap        31656.1
Name: 803, dtype: object

How can we rank each country based on their lifeExp?

Let's create a new column 'lifeExp_rank' that creates an ordered ranking based on the longest life expectancy.

sorted_by_lifeExp = df.sort_values('lifeExp', ascending=False)
sorted_by_lifeExp['lifeExp_rank'] = np.arange(len(sorted_by_lifeExp)) + 1
#lists all rows in order of lifeExp
sorted_by_lifeExp.head()
country year pop continent lifeExp gdpPercap lifeExp_rank
803 Japan 2007 127467972 Asia 82.603 31656.06806 1
671 Hong Kong, China 2007 6980412 Asia 82.208 39724.97867 2
802 Japan 2002 127065841 Asia 82.000 28604.59190 3
695 Iceland 2007 301931 Europe 81.757 36180.78919 4
1487 Switzerland 2007 7554661 Europe 81.701 37506.41907 5

split,apply and combine: the power of groupby

What if we want to rank each country by max life expectancy for each year that data was collected?

Applying a function on grouped selections can simplify this process:

def ranker(df):
    """Assigns a rank to each country based on lifeExp, with 1 having the highest lifeExp.
    Assumes the data is DESC sorted by lifeExp."""
    df['lifeExp_rank'] = np.arange(len(df)) + 1
    return df
#apply the ranking function on a per year basis:
sorted_by_lifeExp = sorted_by_lifeExp.groupby('year').apply(ranker)

We can now subset my new dataframe by year to view the lifeExp ranks for each year

sorted_by_lifeExp[sorted_by_lifeExp.year == 2007].head()
country year pop continent lifeExp gdpPercap lifeExp_rank
803 Japan 2007 127467972 Asia 82.603 31656.06806 1
671 Hong Kong, China 2007 6980412 Asia 82.208 39724.97867 2
695 Iceland 2007 301931 Europe 81.757 36180.78919 3
1487 Switzerland 2007 7554661 Europe 81.701 37506.41907 4
71 Australia 2007 20434176 Oceania 81.235 34435.36744 5

We can also subset by country=='Canada' to see how Canada's ranking has changed over the years:

sorted_by_lifeExp[(sorted_by_lifeExp['country']=='Canada')]
country year pop continent lifeExp gdpPercap lifeExp_rank
251 Canada 2007 33390141 Americas 80.653 36319.23501 10
250 Canada 2002 31902268 Americas 79.770 33328.96507 9
249 Canada 1997 30305843 Americas 78.610 28954.92589 10
248 Canada 1992 28523502 Americas 77.950 26342.88426 5
247 Canada 1987 26549700 Americas 76.860 26626.51503 6
246 Canada 1982 25201900 Americas 75.760 22898.79214 8
245 Canada 1977 23796400 Americas 74.210 22090.88306 9
244 Canada 1972 22284500 Americas 72.880 18970.57086 9
243 Canada 1967 20819767 Americas 72.130 16076.58803 7
242 Canada 1962 18985849 Americas 71.300 13462.48555 7
241 Canada 1957 17010154 Americas 69.960 12489.95006 10
240 Canada 1952 14785584 Americas 68.750 11367.16112 10

Visualization

Make sure you use the following %magic command to allow for inline plotting

%matplotlib inline

We can specify the type of plot with the kind argument. Also, choose the independent and dependent variables with x and y arguments.

  • Plot year vs life expectancy in a scatter plot.
df.plot(x='year',y='lifeExp',kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x7fbac8710b00>

png

  • Plot gdp per capita vs life expectancy in a scatter plot
df.plot(x='gdpPercap',y='lifeExp',kind='scatter', alpha = 0.2, s=50, marker='o')
<matplotlib.axes._subplots.AxesSubplot at 0x7fbac86e87f0>

png

What's going on with those points on the right?

High gdp per capita, yet not particularly high lifeExp. We can use boolean selection to rapidly subset and check them out.

df[df['gdpPercap'] > 55000]
country year pop continent lifeExp gdpPercap
852 Kuwait 1952 160000 Asia 55.565 108382.35290
853 Kuwait 1957 212846 Asia 58.033 113523.13290
854 Kuwait 1962 358266 Asia 60.470 95458.11176
855 Kuwait 1967 575003 Asia 64.624 80894.88326
856 Kuwait 1972 841934 Asia 67.712 109347.86700
857 Kuwait 1977 1140357 Asia 69.343 59265.47714
df.hist(column='lifeExp')
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fbac5fb3b00>]], dtype=object)

png

df.lifeExp.plot.hist(bins=200)
<matplotlib.axes._subplots.AxesSubplot at 0x7fbac5fa1a90>

png

df['lifeExp'].plot(kind='kde')
<matplotlib.axes._subplots.AxesSubplot at 0x7fbac5e5a588>

png

Exercise

Write a function that will take two countries as an argument and plot the life expectancy vs year for each country on the same axis.

def compare_lifeExp(country1, country2):
    """Plot life expectancy vs year for country1 and country2"""
    ax = plt.subplot()
    for c in [country1,country2]:
        df[df['country']==c].plot(x='year',y='lifeExp', ax=ax)
    plt.legend((country1,country2))
compare_lifeExp('Canada', 'Mexico')

png

Exercises

Suzy wrote some code to determine which country had the lowest life expectancy in 1982.

What is wrong with her solution?

spec=['country','lifeExp']
df[df['year']==1982][spec].min()
country    Afghanistan
lifeExp         38.445
dtype: object

We can do a quick check to look up Afghanistan's life expectancy in 1982.

df[(df['year']==1982) & (df['country']=='Afghanistan')]
country year pop continent lifeExp gdpPercap
6 Afghanistan 1982 12881816 Asia 39.854 978.011439

This doesnt match with the answer above because the min() function was applied to each column (country and lifeExp).

She should have done this:

df.loc[df[df['year']==1982]['lifeExp'].idxmin()]['country']
'Sierra Leone'

Putting it together:

We can use all of these ideas to generate a plot that looks at a subset of the data.

  • Plot GDP per capita vs life expectancy in 2007 for each continent.
continents = df.groupby(['continent'])
for continent in continents.groups:
    group = continents.get_group(continent)
    group[group['year']==2007].plot(kind='scatter', x='gdpPercap',
                                    y='lifeExp', title=continent)
    plt.axis([-10000,60000,30,90])

png

png

png

png

png

#Example
fig,ax = plt.subplots(1,1)
colours = ['m','b','r','g','y']
for continent, colour in zip(continents.groups, colours):
    group = continents.get_group(continent)
    group[group['year']==2007].plot(kind='scatter',x='gdpPercap',y='lifeExp',label=continent,ax=ax,color=colour,alpha=0.5)
ax.set_title(2007)
plt.legend(loc='lower right')
<matplotlib.legend.Legend at 0x7f5791e0db00>

png

Exercise

Write a function the takes a country as an argument and plots the life expectancy against GDP per capita for all years in a scatter plot. Also print the year of the minimum/maximum lifeExp and the year of the miniimim/maximum GDP per capita.

def compare_gdp_lifeExp(df,country):
    """ plot GDP per capita against life expectancy for a given country.
    print year of min/max gdp per capita and life expectancy
    """

    sub = df[df['country']==country]
    sub.plot(x='gdpPercap',y='lifeExp',kind='scatter',title=country)

    print('Year of Min/Max GDP per capita')
    print(df.iloc[[sub['gdpPercap'].idxmin(),sub['gdpPercap'].idxmax()]]['year'])
    print('Year of Min/Max life expectancy')
    print(df.iloc[[sub['lifeExp'].idxmin(),sub['lifeExp'].idxmax()]]['year'])
compare_gdp_lifeExp(df,'Zimbabwe')
Year of Min/Max GDP per capita
1692    1952
1696    1972
Name: year, dtype: int64
Year of Min/Max life expectancy
1702    2002
1699    1987
Name: year, dtype: int64

png

compare_gdp_lifeExp(df,'Canada')
Year of Min/Max GDP per capita
240    1952
251    2007
Name: year, dtype: int64
Year of Min/Max life expectancy
240    1952
251    2007
Name: year, dtype: int64

png

Rapid plotting with seaborn

import seaborn as sns
/home/derek/anaconda2/envs/py3/lib/python3.5/site-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)
df.head()
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333 Asia 28.801 779.445314
1 Afghanistan 1957 9240934 Asia 30.332 820.853030
2 Afghanistan 1962 10267083 Asia 31.997 853.100710
3 Afghanistan 1967 11537966 Asia 34.020 836.197138
4 Afghanistan 1972 13079460 Asia 36.088 739.981106
sns.set_context("talk")
sns.factorplot(data=df, x='year', y='lifeExp', hue='continent', size=8)
<seaborn.axisgrid.FacetGrid at 0x7f57901cfa20>

png

sns.regplot(data=df, x='year', y='gdpPercap', fit_reg=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7fbabf5472b0>

png

sns.lmplot(data=df, x='year', y='gdpPercap', hue='continent')
<seaborn.axisgrid.FacetGrid at 0x7f579197f710>

png

sns.lmplot(data=df, x='year', y='gdpPercap', row='continent')
<seaborn.axisgrid.FacetGrid at 0x7f578da9eb38>

png

sns.factorplot(data=df, x='continent', y='gdpPercap', kind='bar')
<seaborn.axisgrid.FacetGrid at 0x7f5791ac21d0>

png

g = sns.FacetGrid(df, col='continent', row='year')
g.map(plt.hist, 'lifeExp')
<seaborn.axisgrid.FacetGrid at 0x7f578d928eb8>

png