Hands-on Tutorials

Geospatial Machine Learning in Public Policy

Forecasting gentrification in Philadelphia to inform affordable housing policy

Prateek Agarwal
Towards Data Science
12 min read · Nov 2, 2020


Photo by Ethan Hoover on Unsplash

Introduction

Over the last eight years, the Philadelphia housing market has turned around from recession and is primed to accelerate. At the same time, thousands of impoverished tenants struggle to find and maintain reasonably priced housing. Affordable housing initiatives have not come without criticisms regarding the placement of new housing projects, particularly with the concentration of new developments in already low-income areas. While one can argue that locating affordable housing projects in these areas keeps tenants close to their existing communities, it also concentrates them away from possible economic growth and social mobility.

Gentrification is a major source of neighborhood change and is key to understanding the shift in housing price over time. In the interest of drawing in higher income residents, the public sector works to provide amenities and leisure at higher standards. Focus on gentrifying areas pulls funds away from other generally lower income areas. These neighborhoods then go through periods of disinvestment, or neighborhood decline. As the urban center expands and surrounding land value increases, disinvested regions become primed for future cycles of gentrification. This process is evident in Philadelphia, where 13,000 lower-cost units were lost while 6,000 high end units were added from 2008 to 2016 alone.

Evidently, as the gentrification positive feedback loop continues, more people get “priced out” of their rental agreements and are forced to relocate. This effect cascades through income classes, necessitating more affordable housing.

Affordable housing projects are an important factor in maintaining a sense of equity in the economic landscape, and the placement of these projects is critical in the shaping of that space. Placing a development in an area that is currently predominantly low-income or disinvested assures that those in need will have access to rent-controlled housing that is relatively inexpensive for the developer. Additionally, if the area gentrifies, rent prices stay fixed while the amenities and opportunities associated with gentrification bring economic opportunity to the area. Thus, affordable housing placement would be well-guided by a predictive measure of gentrification.

Data

Philadelphia has a robust open data platform, OpenDataPhilly, with over 350 public datasets from governmental organizations to comb through. I chose datasets largely to enumerate the amenities across the city, including parks, schools, hospitals, higher education, and public transit stops. Additionally, I gathered crime point location data to represent a common disamenity taken into consideration by prospective homeowners. Market valuations for each parcel (building) were gathered as well, all from 2014 to 2020.

As is always the case, the data are not perfect. For example, if there were an accessible log of the expansion of transit stops over the last couple of years, the model could account for increased access to public transportation as the reason for certain shifts in pricing in areas further from Center City. As another example, the school quality and catchment data do not include charter schools, which are a growing part of the Philadelphia school landscape; including them would give a more accurate representation of school quality. Nevertheless, with some cleaning, merging these datasets created a rich spatial and temporal dataset for featurization.

Gathering and cleaning the data was nontrivial. Here are some examples of the data engineering work that went into the project before featurization:

  • I used the Southeastern Pennsylvania Transportation Authority (SEPTA) API to extract transit accessibility across the city. However, the API only returns the closest 50 transit stops to a given latitude and longitude coordinate. To create a point-level transit stop dataset across the city, I iteratively queried coordinates on a grid over the entire city, with 0.1 miles between each coordinate. This produced many duplicates, but thoroughly covered all transit stops in Philadelphia. I removed duplicate points and trimmed the resulting locations to the boundaries of the city.
  • The Property Assessments dataset, which details parcel values and internal characteristics over the target time period, has dozens of rows with inconsistent values. For example, in many cases num_rooms is 0 while num_bedrooms is 2. On this basis, I removed num_rooms and similarly inconsistent columns from consideration.
  • I replaced some inconsistent columns with values scraped from descriptive data of the property. Every residential parcel in the dataset has a descriptive text field that notes the presence of a garage and the number of stories in the building. Using regular expressions, I extracted this information over the dataset to add two columns: the presence of a garage and the number of stories.
  • In the Property Assessments dataset, the distribution of parcel values is heavily right-tailed:
Histogram of Market Value of Parcels in Philadelphia [Image by Author]

To narrow down the scope of prediction, I removed parcels with a valuation higher than $1.5 million. Parcels with a valuation of $1 or $0 were also removed. 98.5% of the parcels remained in the dataset for consideration.
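The cleaning steps above can be sketched roughly as follows. This is a minimal illustration and not the code used in the project: the bounding box, the grid spacing in degrees (which would need to correspond to about 0.1 miles), the `stop_id` field, and the wording patterns matched by the regular expressions are all assumptions, and the SEPTA API call itself is left out.

```python
import math
import re

def grid_points(south, west, north, east, step_deg):
    """Yield (lat, lon) query points on a regular grid covering a bounding box."""
    n_rows = math.ceil((north - south) / step_deg - 1e-9) + 1
    n_cols = math.ceil((east - west) / step_deg - 1e-9) + 1
    for i in range(n_rows):
        for j in range(n_cols):
            yield (round(south + i * step_deg, 6), round(west + j * step_deg, 6))

def dedupe_stops(stops):
    """Drop stops returned by more than one query, keeping the first copy.

    Assumes each stop is a dict with a unique "stop_id" key (hypothetical name).
    """
    unique = {}
    for stop in stops:
        unique.setdefault(stop["stop_id"], stop)
    return list(unique.values())

# Hypothetical patterns; the real description field may be worded differently.
GARAGE_RE = re.compile(r"\bGAR(?:AGE)?\b", re.IGNORECASE)
STORIES_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:STY|STORY|STORIES)\b", re.IGNORECASE)

def parse_description(text):
    """Return (has_garage, stories) parsed from a free-text description."""
    match = STORIES_RE.search(text)
    stories = float(match.group(1)) if match else None
    return bool(GARAGE_RE.search(text)), stories

def keep_parcel(market_value, upper=1_500_000):
    """Keep parcels valued above $1 and no higher than the upper cap."""
    return 1 < market_value <= upper
```

Each grid point would be sent to the API, the responses concatenated and passed through `dedupe_stops`, and the result clipped to the city boundary with a spatial library.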

A quick view of the average market valuations per neighborhood gives us a good idea of the distribution across the city:

Average Price Per Neighborhood [Image by Author]

The image shows a dense, high-priced urban core with lower-priced outskirts. Suburbs of Northwest and Northeast Philly tend to have higher values as well. The areas surrounding the airport and industrial areas in South and Southwest Philly are lower in value, while the Center City waterfront on the east side has a higher average value.

Percent Change 2014–2020 Per Neighborhood [Image by Author]

This visualization digs deeper into the evident gentrification process surrounding the urban core. Just north and south of the high-priced center city are pockets of neighborhoods with rapidly increasing average value. The goal of this project is to predict the specific locations of the expansion of this process.

Feature Engineering

Features are categorized as either static or dynamic and either exogenous or endogenous.

  • Static features are assumed not to change over the given time period (e.g., locations of parks, hospitals, and public transit stops)
  • Dynamic features do change from year to year (e.g., crime incidents, school quality)
  • Endogenous features are internal to each parcel (e.g., total livable area, number of bedrooms, number of bathrooms)
  • Exogenous features are external to the parcel, caused by the surrounding environment (e.g., distance to different amenities, average neighborhood price)

With these classifications in mind, I constructed the following features:

Full Feature List [Image by Author]

Note the numbers after some of the features represent the k in average k-nearest neighbors. I discuss this version of the k-nearest neighbors algorithm more in depth here. In short, for each parcel, I calculated the k closest locations of the given feature and averaged the feature value over those closest points.

I tried different values of k as a different measure of the average. With the closest crimes, for example, if there were a crime close to a house for just one year, the feature would be very skewed in that time step. Each feature had different values for k, and the most informative k for each feature was selected in the model building process.
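A minimal sketch of the average k-nearest-neighbors feature described above, assuming coordinates are already in a projected system so that straight-line distance is meaningful. The function name and the `(x, y, value)` tuple layout are illustrative and not taken from the project.

```python
import math

def avg_knn_feature(parcel_xy, points, k):
    """Average the feature value over the k points closest to a parcel.

    parcel_xy: (x, y) of the parcel in a projected coordinate system.
    points: iterable of (x, y, value) tuples for one feature.
    If fewer than k points exist, all of them are used.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    px, py = parcel_xy
    by_distance = sorted(points, key=lambda p: math.hypot(p[0] - px, p[1] - py))
    nearest = by_distance[:k]
    if not nearest:
        raise ValueError("no points supplied")
    return sum(p[2] for p in nearest) / len(nearest)

# With k=1 a single nearby point decides the feature; a larger k smooths it.
points = [(0, 1, 10.0), (0, 2, 20.0), (0, 10, 90.0)]
k1 = avg_knn_feature((0, 0), points, k=1)
k3 = avg_knn_feature((0, 0), points, k=3)
```

Sorting every point for every parcel is fine for a sketch, but a spatial index such as a k-d tree would be the usual choice at city scale.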

Notice that the average neighborhood price feature is generated over predefined Philadelphia neighborhoods. This can cause issues when the given boundaries do not properly delineate between clusters of housing type and price. This is known as the Modifiable Areal Unit Problem. In other house price analyses, different forms of clustering have been used to group geographic areas based on specific characteristics rather than using neighborhood boundaries. I go further into my efforts with geographic clustering in this article, but in summary, the custom clusters I created were not significantly more homogeneous than the predefined neighborhoods, so I used the original neighborhoods instead.

Modeling

With dozens of features at hand, I compared linear, non-linear, and tree-based methods to predict housing prices across the city.

Ordinary Least Squares, or OLS, has been used frequently in hedonic modeling of house prices, but the presence of spatial autocorrelation undermines the statistical assumptions of the model. Spatial autocorrelation means that housing price observations are not spatially independent, as they are affected by many of the same spatial factors. Rather than optimizing for statistical inference, this project aims to make accurate predictions, so accuracy and generalizability matter more. OLS predictions served as a valuable baseline throughout the feature engineering process to determine the quality of new features.

I also used elastic net, a penalized regression that combines the LASSO and ridge penalties, as a baseline estimate. LASSO shrinks small coefficients to exactly zero, thus providing a means of pseudo-feature selection, which is especially important given the large number of features available in the regression equation. Ridge regression shrinks coefficients as well, helping particularly when there are multiple correlated predictors, as is the case in spatial models due to autocorrelation. The combined version, with a mixing factor that balances the two penalties, suits the current model because it handles correlated variables and performs feature selection.
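For reference, one common way to write the elastic net objective is shown below. The symbols are not from the original analysis: $\lambda \ge 0$ is the overall penalty strength and $\alpha \in [0, 1]$ is the mixing factor. This is the parameterization used by glmnet and scikit-learn; the specific library and values used in this project are not stated here.

```latex
\hat{\beta} = \arg\min_{\beta} \;
  \frac{1}{2n} \lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)
```

Setting $\alpha = 1$ recovers LASSO and $\alpha = 0$ recovers ridge regression.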

Principal Component Regression, or PCR, reduces the dimensionality of the dataset and was also compared to the other models. PCR first uses Principal Component Analysis to extract the principal components, the directions in feature space that explain the most variance. Only the leading components are kept, so the new space has fewer dimensions than there are original features. Coefficients are then fit in this new space, which accounts for most of the variance in the original data while using fewer inputs. This dimensionality reduction could be important in predicting house price, as there are many potentially correlated features to interpret.

Other than regressions, I also used Random Forests and Gradient Boosting Machines. Random Forests are an ensemble of many decision trees, each trained on a different bootstrapped sample of the original data. Since the predictions of many randomized decision trees are averaged, this model is relatively robust to outliers. It also tends to favor the more relevant features during training, because each split is chosen from the candidate features that best reduce error.

Finally, I tested Gradient Boosting Machines. The algorithm involves successively computing decision trees to fit to the residuals of the previous model, scaled by a learning rate. As the number of successive trees increases, the model becomes more fit to the training data as the residuals are accounted for. This way, GBMs are able to fit to known data well, though they are sensitive to outliers. A representative training sample is key for this technique to work effectively.
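The residual-fitting loop described above can be illustrated with a toy implementation. This is a sketch under simplifying assumptions (a single numeric feature, one-split "stump" trees, squared-error loss), not the library or configuration used for the results in this article; the function names are made up for the example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the mean residual
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_gbm(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    trees = []
    preds = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Each tree's contribution is scaled by the learning rate.
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)
```

Because every new tree chases whatever error remains, a single extreme observation keeps producing large residuals and can attract many splits, which is one way to see the sensitivity to outliers.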

To evaluate and compare the models, I used mean absolute percent error (MAPE) and mean absolute error (MAE) for 10-fold cross validation and spatial cross validation. 10-fold cross validation simply randomly partitions the training data into 10 folds. For 10 iterations, one of the folds is held out as the test set and the rest are used to train the model. Spatial cross validation works in a similar way, but the folds are randomly selected neighborhoods in Philadelphia. Investigating errors in this way would expose spatial autocorrelation on the neighborhood level that has not been accounted for in the model.
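The two error metrics and the neighborhood-based folds can be sketched as follows. This is an illustrative version only: the helper names are hypothetical, MAPE is returned as a fraction, and the fold assignment (shuffle the neighborhood names, then deal them into folds) is one assumption about how "randomly selected neighborhoods" could be implemented.

```python
import random

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percent error as a fraction; actual values must be non-zero."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def spatial_folds(neighborhoods, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs that hold out whole neighborhoods.

    neighborhoods: one neighborhood label per parcel, in row order.
    """
    names = sorted(set(neighborhoods))
    random.Random(seed).shuffle(names)
    for fold in range(n_folds):
        held_out = set(names[fold::n_folds])
        test_idx = [i for i, n in enumerate(neighborhoods) if n in held_out]
        train_idx = [i for i, n in enumerate(neighborhoods) if n not in held_out]
        yield train_idx, test_idx
```

Since no neighborhood appears on both sides of a split, a model that leans on neighborhood-level information it has memorized will score worse here than under random folds. scikit-learn's `GroupKFold` provides similar group-wise splitting.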

Model Results [Image by Author]

Predictions

Of the models used, Random Forests performed the best by far. This may be due to the inherent variable selection that occurs when drawing an averaged result from many random trees: uninformative predictors are rarely chosen for splits, so they contribute little to the averaged prediction. Note that for all models, error increases under spatial cross validation in comparison to random cross validation. This shows that there is still some level of spatial autocorrelation in the residuals that exhibits itself on the neighborhood level.

Having determined the most accurate model for housing price, I used the Random Forest model to predict the market value of future years. These predictions are made by using training data from certain years to predict the target market value of future years. Since the dataset can at most train on data describing the state of the city in 2014 and predict known values in 2020, the longest forecast horizon that can be validated is six years. Once trained, these models can be used to extrapolate to years beyond 2020.
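One way to set up this kind of horizon-based training data is sketched below. How the training years were actually combined is not specified here, so the pairing scheme, the function names, and the nested-dictionary layout are assumptions made for illustration.

```python
def horizon_pairs(first_year, last_year, horizon):
    """List the (feature_year, target_year) pairs available for a horizon."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1 year")
    return [(y, y + horizon) for y in range(first_year, last_year - horizon + 1)]

def build_training_rows(features_by_year, value_by_year, horizon):
    """Pair each parcel's features in year t with its value in year t + horizon.

    features_by_year: {year: {parcel_id: feature_vector}}
    value_by_year:    {year: {parcel_id: market_value}}
    """
    rows = []
    for feature_year, parcels in sorted(features_by_year.items()):
        targets = value_by_year.get(feature_year + horizon)
        if targets is None:
            continue  # no known values that far ahead of this year
        for parcel_id, features in parcels.items():
            if parcel_id in targets:
                rows.append((features, targets[parcel_id]))
    return rows

# With data from 2014 to 2020, a six-year horizon leaves a single pair.
six_year = horizon_pairs(2014, 2020, 6)   # [(2014, 2020)]
five_year = horizon_pairs(2014, 2020, 5)  # [(2014, 2019), (2015, 2020)]
```

A model trained on five-year pairs could then be applied to 2020 features to produce a 2025 estimate, with the caveat that nothing in the data can validate that extrapolation yet.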

Heatmaps of Neighborhood Change Predictions [Image by Author]

Above, heatmaps display the model prediction for 2025. The left map displays regions with higher densities of parcels that are projected to decline in value over the next five years, and the right map displays those projected to rise. Additionally, both maps have the locations of current affordable housing projects displayed, sizes based on the number of units available. Overall, the city’s average single family house price is expected to rise from $136,860.90 in 2020 to $138,848.80 in 2025, which explains more frequent, denser regions with rising house prices. In fact, of the 275,347 parcels predicted, 121,342 are expected to appreciate more than 10% in value, while 36,231 are expected to decline by the same amount.

Recommendations

In the context of affordable housing, the heatmaps give an indicator for future placements. The pockets in North and South Philadelphia surrounding Center City have high densities of rapidly appreciating parcels, which primes these areas for affordable housing needs. These results align with the intuition that regions of fast economic growth are on the outskirts of the center of the city, which could lead to displacement and an increased need for housing in these areas.

The city of Philadelphia has recognized this need in the creation of the Housing For Equity plan, which proposes 25,000 new households over the course of the next ten years. In order to evaluate the location of past housing units, I calculated both the current and predicted future average value of parcels within a quarter-mile radius of each development. The growth rate in parcel value is then defined as the ratio of the average parcel value at the end of the timespan to the average at the beginning, so a growth rate of 2 means the surrounding parcels doubled in value.
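A minimal sketch of this growth-rate calculation, assuming latitude and longitude coordinates and great-circle distance. The function names and the tuple layout are illustrative, and the ratio is taken as end value over start value, which matches the later description of neighborhoods that "more than doubled".

```python
import math

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def neighborhood_growth(development, parcels, radius_miles=0.25):
    """Ratio of end to start average parcel value around a development.

    development: (lat, lon)
    parcels: iterable of (lat, lon, start_value, end_value)
    Returns None when no parcel lies within the radius.
    """
    lat, lon = development
    nearby = [p for p in parcels
              if haversine_miles(lat, lon, p[0], p[1]) <= radius_miles]
    if not nearby:
        return None
    start_avg = sum(p[2] for p in nearby) / len(nearby)
    end_avg = sum(p[3] for p in nearby) / len(nearby)
    return end_avg / start_avg
```

For the 2020 to 2025 period, `end_value` would be the model's predicted value rather than an observed one.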

Affordable Housing Developments and Neighborhood Growth Rates [Image by Author]

As seen in the maps above, the affordable housing projects with the most surrounding neighborhood growth are situated in Center City, particularly in clusters in the northwest, northeast, and south. Those placed further away from Center City have experienced less growth in neighborhood prices.

Affordable Housing Units in High Growth Areas [Image by Author]

Zooming in on the clusters in high growth areas, the projects identified are in neighborhoods that more than doubled in average value over each of the time spans. According to the model, neighborhood growth is expected to spread to the north over the upcoming five years. The contrast is stark: only one current project is in an area that is expected to double in value in both of the displayed time periods. Though there has been significant growth in value in South Philly, the same growth is not expected in the upcoming years.

Of the 20 projects in areas predicted to at least double in average neighborhood value in the future, only four were constructed in the 2010s. The lack of recent affordable units in fast growing areas makes the future of affordable housing in these areas unstable. The Low Income Housing Tax Credit (LIHTC) that funds these developments allows developers to repurpose affordable housing developments into market rate units after a period of 15 to 30 years. Affordable housing projects that are developed and owned by for-profit companies are more likely to repurpose their projects into market-rate units at the end of their term, especially in high growth areas like those shown above.

Of the 27 projects in high growth areas from 2015–2020, 10 are owned by for-profit developers. Of the 20 projects in projected high growth areas from 2020–2025, half are owned by for-profit developers. With a 15 year contract for providing rent-subsidized housing, the affordable housing stock in these key areas is projected to decline drastically, even as need increases.

Thus, especially in gentrifying areas, the number of available rent-controlled units is expected to decrease as privately owned affordable housing units are converted to market-rate units. This lack of housing could result in displacement of long-term renters. As proposed earlier by Councilmember Green, one way to anticipate these changes is to announce changes in low-income rent status up to two years in advance, giving both city planners and residents time to prepare for relocation.

Given the forecast of parcel values, suggestions for the future of affordable housing in Philly should take into account the areas where the highest growth is expected. The current projects in these areas are listed below:

Affordable Housing Projects in Predicted High Growth Areas [Image by Author]

Unless the affordable housing period is renewed, the for-profit developers in these high growth areas are likely to evict low-income tenants when their 15-year contracts expire, starting with Nellie Reynolds Gardens next year. With the predictions for high growth areas in mind, three approaches are recommended to help curb the effects of displacement:

  • Require changes from subsidized to market-rate rent to be announced two years in advance
  • Extend the period for rent-controlled housing in Station House Apartments, Raymond Rosen Apartments, or Nellie Reynolds Gardens
  • Focus housing vouchers on tenants in these areas who are likely to see rent hikes in the upcoming year

Awareness and area-based strategies like these are important to consider in planning the future of affordable housing investment in Philadelphia.
