In this analysis, I use Kenya’s 2019 census data to visualize the distribution of household sizes and population density in Python. The data is published on the Kenya National Bureau of Statistics’ website.
The data is shared as a PDF report, so we need to apply data mining techniques in order to use it. After careful searching, I settled on the tika package to parse the document. Tika preserves the file structure when we parse it.
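from tika import parser
import numpy as np

file_name = 'VOLUME 1 KPHC 2019.pdf'
# parse the PDF; the extracted text is stored under the 'content' key
raw = parser.from_file(file_name)
raw_content = raw['content']
# text sample: print 1,000 characters starting at a random position
sample_index = np.random.randint(0, len(raw_content))
raw_content[sample_index:sample_index+1000]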
From the PDF document, we can see that our target data are in Tables 2.6 and 2.7. We can then use text indexing to find these tables, as shown below:
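# str.find returns the position of the first match, or -1 if there is none
household_table_index = raw_content.find('Table 2.6:')
pop_density_table_index = raw_content.find('Table 2.7:')
# step back 400 characters to drop the extra text at the end of the document
last_index = raw_content.find("@KNBStats")-400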
We know that the households table ends where the population density table begins. Similarly, the population density table ends at the last index, which excludes any extra text at the end of the document.
household_table_text = raw_content[
household_table_index:pop_density_table_index].split('\n')
population_density_table_text = raw_content[
pop_density_table_index:last_index].split('\n')
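The slicing above silently produces the wrong text if a marker is missing, because `str.find` returns -1 instead of raising an error. A minimal sketch of a guarded version follows; the helper name `slice_between` and the toy text are made up for illustration and are not part of the original analysis.

```python
def slice_between(text, start_marker, end_marker):
    """Return the lines from start_marker up to, but not including, end_marker."""
    start = text.find(start_marker)
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start + 1)
    if start == -1 or end == -1:
        raise ValueError("marker not found in text")
    return text[start:end].split("\n")


# Toy stand-in for the parsed PDF text (made-up values)
toy_content = (
    "Intro text\n"
    "Table 2.6: Households\nNairobi 100\n"
    "Table 2.7: Density\nNairobi 5\n"
)
household_lines = slice_between(toy_content, "Table 2.6:", "Table 2.7:")
```

Raising early makes it obvious when the report layout changes and a table heading can no longer be found.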
Once we have extracted the tables as text, we can start cleaning them. Luckily, the extracted data follows a similar structure in both tables, so we can apply some tricks such as:
The resulting data is a pandas dataframe that we can merge with the counties shapefile.
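As an illustration only, here is one way the cleaning step might look. It is an assumed approach, not necessarily the one used in this analysis: it supposes each data row is a name followed by numeric columns with commas as thousands separators, and the sample rows are made up. The names `NUMBER`, `ROW_PATTERN` and `parse_row` are hypothetical.

```python
import re

# Assumed row layout: "<name> <number> <number> ...", e.g. "Tana River 2,500 500 5.0"
NUMBER = r"\d[\d,]*(?:\.\d+)?"
ROW_PATTERN = re.compile(
    rf"^([A-Za-z][A-Za-z' /-]*?)\s+({NUMBER}(?:\s+{NUMBER})*)$"
)


def parse_row(line):
    """Split one text line into [name, value, value, ...], or None if it is not a data row."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    name, numbers = match.groups()
    # Drop thousands separators before converting to float
    values = [float(token.replace(",", "")) for token in numbers.split()]
    return [name] + values


# Made-up lines standing in for the extracted table text
sample_lines = [
    "Table 2.6: Made-up rows for illustration",
    "",
    "Mombasa 1,200 400 3.0",
    "Tana River 2,500 500 5.0",
]
rows = [row for row in map(parse_row, sample_lines) if row is not None]
```

The resulting list of rows can be passed to `pd.DataFrame` together with a list of column names.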
The shapefile that I found here is for Kenya’s sub-counties. GeoPandas has a handy function, dissolve, that we can use to combine sub-counties into counties.
import geopandas as gpd
sub_county_shapefile = gpd.read_file(
'../sub-counties/ken_admbnda_adm2_iebc_20191031.shp')
# ADM1_EN is the county name column; dissolve merges geometries sharing a name
counties_shapefile = sub_county_shapefile.dissolve('ADM1_EN')
As shown here, the resulting shapefile looks good and we can now go ahead and merge it with our data.
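A merge on county names only works if both sources spell the names the same way, so it can help to normalise the names first and check for unmatched rows. The sketch below uses plain pandas DataFrames with made-up values as stand-ins; the `county` and `key` columns and the `normalise` helper are assumptions for illustration. Note that after `dissolve('ADM1_EN')` the county name becomes the index, so `reset_index()` would be needed before merging on it as a column.

```python
import pandas as pd

# Stand-in for the dissolved shapefile after reset_index() (geometry omitted)
shapes = pd.DataFrame({"ADM1_EN": ["Nairobi", "Tana River", "Elgeyo-Marakwet"]})
# Stand-in for the parsed census table; names and values are made up
census = pd.DataFrame({
    "county": ["NAIROBI", "TANA RIVER ", "ELGEYO/MARAKWET"],
    "population": [100, 200, 300],
})


def normalise(names):
    """Upper-case, trim, and collapse any non-letter run into a single space."""
    return names.str.upper().str.strip().str.replace(r"[^A-Z]+", " ", regex=True)


shapes["key"] = normalise(shapes["ADM1_EN"])
census["key"] = normalise(census["county"])

# indicator=True adds a _merge column showing which rows found a partner
merged = shapes.merge(census, on="key", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] != "both", "ADM1_EN"].tolist()
```

An empty `unmatched` list means every county in the shapefile found its census row.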
The first step of our analysis is to understand how the data is distributed. To do this, we use the describe function in pandas to generate summary statistics. The table below shows the count, mean, standard deviation, percentiles, min, max, kurtosis and skewness values for our dataset.
The graphs below show the population and population density distributions.
We can visualize this distribution on a map to get a better idea of how the population is distributed across Kenya, as shown in the population density map below.
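`describe` does not report skewness or kurtosis on its own, so those two rows have to be appended. A minimal sketch with made-up numbers follows; the column names are assumptions, not the ones in the actual dataset.

```python
import pandas as pd

# Made-up values standing in for the county-level data
data = pd.DataFrame({
    "population": [100, 200, 300, 400, 1000],
    "households": [20, 50, 60, 100, 270],
})

# count, mean, std, min, quartiles and max
summary = data.describe()
# Append the two extra rows; pandas reports excess kurtosis (normal = 0)
summary.loc["skewness"] = data.skew()
summary.loc["kurtosis"] = data.kurt()
```

A positive skewness indicates a long right tail, which is what a few very populous counties would produce.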
Here are counties with the highest and the lowest population sizes respectively.
Let’s look at the distribution of total households and household sizes across counties. Do counties with the highest population sizes have bigger household sizes on average?
From the data we can observe that:
Below we can see that the average household size increases as you move from the central to the northern regions. Despite being sparsely populated, pastoralist counties have bigger households than counties in the central and western regions. The central region has the smallest household sizes in the country.
We also look at the correlations between variables in our dataset. For instance, is there any relationship between land area and population size? The table below shows correlation statistics for land area, population, population density, households and household size.
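A correlation table of this kind can be produced with `DataFrame.corr`. The sketch below assumes Pearson correlation (the pandas default) and uses made-up values and column names. It is meant to show the mechanics, not to reproduce the results.

```python
import pandas as pd

# Made-up values; column names are assumptions
counties = pd.DataFrame({
    "land_area": [700, 40000, 2500, 70000],
    "population": [4400, 300, 1900, 900],
    "households": [1500, 60, 500, 150],
})
# Derived columns: people per unit of land and people per household
counties["population_density"] = counties["population"] / counties["land_area"]
counties["household_size"] = counties["population"] / counties["households"]

# Pairwise Pearson correlation between all numeric columns
correlations = counties.corr(method="pearson")
```

Each cell lies between -1 and 1; passing `method="spearman"` instead gives rank correlations, which are less sensitive to outliers.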
We observe that:
In the following map, we classify population density and household sizes into groups. First, we split population density and household sizes into four categories: low, medium, high and very high. We then create qualitative pairs for all the counties. Northern counties have low population density and very high household sizes.
Central counties are further classified into smaller groups where high population density and low to medium household sizes are the most common combinations.
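The four-category classification described above can be sketched as follows. The text does not say how the cut points were chosen, so this example assumes quartiles via `pd.qcut`; the values and column names are made up for illustration.

```python
import pandas as pd

labels = ["low", "medium", "high", "very high"]

# Made-up values for eight hypothetical counties
counties = pd.DataFrame({
    "county": list("ABCDEFGH"),
    "population_density": [5, 12, 40, 90, 150, 300, 600, 4500],
    "household_size": [6.9, 6.1, 5.5, 4.9, 4.4, 3.9, 3.4, 2.9],
})

# Assumption: quartile-based bins, so each class holds a quarter of the counties
counties["density_class"] = pd.qcut(counties["population_density"], q=4, labels=labels)
counties["size_class"] = pd.qcut(counties["household_size"], q=4, labels=labels)

# Qualitative pair, e.g. "low / very high" = low density, very high household size
counties["pair"] = (
    counties["density_class"].astype(str) + " / " + counties["size_class"].astype(str)
)
```

With fixed thresholds instead of quartiles, `pd.cut` with explicit bin edges would play the same role.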