In the last week of the GSOC work, the focus is on wrapping everything up, which means finalizing code and documentation, providing useful examples for educational use, and reflecting on the entire project to provide a plan for continuing to grow the project beyond GSOC 2016.
The finalized code and documentation are reflected in the project itself, whereas the educational materials take the form of a Jupyter notebook that demonstrates various features of the project on a real-life dataset (NYC CITI Bike share trips). The notebook can be found here. Other experiments and prototyping notebooks can be found in this directory.
In order to systematically reflect on the progress made throughout the project, I will now review the primary features that were developed, linking back to pertinent blog posts where possible.
API Design and The SpInt Framework
The primary API consists of four user-exposed classes: Gravity, Production, Attraction, and Doubly, which all inherit from a base class called BaseGravity. All of these classes can be found in the gravity script. The user classes accept the appropriate inputs for each of the four types of gravity-based spatial interaction model: the basic gravity model, the production-constrained (origin-constrained) model, the attraction-constrained (destination-constrained) model, and the doubly-constrained model. The BaseGravity class does most of the heavy lifting in terms of preparing the appropriate design matrix. For now, BaseGravity inherits from CountModel, which is designed to be a flexible generalized linear model class that can accommodate several types of count models (i.e., Poisson, negative binomial, etc.) and several different types of parameter estimation (i.e., iteratively weighted least squares, gradient optimization, etc.). In reality, CountModel currently only supports Poisson GLMs (based on a customized implementation of the statsmodels GLM) and iteratively weighted least squares estimation, which will be discussed further later in this review. In addition, the user may have a continuous dependent variable, say trade flows in dollars between countries, and therefore might want to use a non-count model, like Gaussian ordinary least squares. Hence, it may make more sense in the future to move away from CountModel and have the BaseGravity class do the necessary dispatching to the appropriate probability model and estimation technique.
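As a rough illustration, the user-facing classes can be called along the following lines. This is a minimal sketch: the import path and the exact argument names (flows, o_vars, d_vars, cost, cost function) are illustrative assumptions rather than the definitive signatures.

```python
# A hedged sketch of the user-exposed API; import path and argument names are
# illustrative assumptions, not the definitive interface.
import numpy as np
from spint.gravity import Gravity, Production, Attraction, Doubly  # assumed import path

flows = np.random.randint(1, 100, 25)          # observed OD flows
origins = np.repeat(np.arange(5), 5)           # origin labels for each flow
destinations = np.tile(np.arange(5), 5)        # destination labels for each flow
o_vars = np.random.rand(25, 1)                 # origin attributes (e.g., population)
d_vars = np.random.rand(25, 1)                 # destination attributes (e.g., jobs)
cost = np.random.rand(25) * 10                 # separation (distance/time) between OD pairs

basic = Gravity(flows, o_vars, d_vars, cost, 'exp')            # unconstrained gravity model
prod = Production(flows, origins, d_vars, cost, 'exp')         # origin-constrained
attr = Attraction(flows, destinations, o_vars, cost, 'exp')    # destination-constrained
doubly = Doubly(flows, origins, destinations, cost, 'exp')     # doubly constrained
print(prod.params)                                             # fitted coefficients
```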
Sparse Compatibility
Because the constrained varieties of gravity models (Production, Attraction, Doubly) require either N or 2N categorical variables (fixed effects), where N is the number of locations that flows may move between, a very large sparse design matrix is necessary for any non-trivial dataset. Therefore, a large amount of effort was put into efficiently building the sparse design matrix, specifically the sparse categorical variables. After much testing and benchmarking, a function was developed that can construct the sparse categorical variable portion of the design matrix relatively quickly. This function is particularly fast if the set of locations, N, is indexed using integers, though it is still efficient if unique locations are labeled using string identifiers. The sparse design matrix allows more efficient model calibration. For example, a spatial system akin to all of the counties in the US (~3k locations or ~9 million observations) requires less than a minute to calibrate a production-constrained model (3k fixed effects) and about two minutes for a doubly-constrained model (6k fixed effects) on my notebook. It was decided to use normal dense arrays for the basic Gravity model since it does not include fixed effects by default, so its design matrix is dense and sparse matrices would actually become inefficient as the number of observations grows.
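The idea behind the sparse categorical construction can be sketched with scipy.sparse; the helper below illustrates the approach described above and is not the project's exact function.

```python
# A rough sketch of building a sparse fixed-effects (dummy variable) block with
# scipy.sparse; illustrative of the idea, not the project's implementation.
import numpy as np
import scipy.sparse as sp

def sparse_categorical(labels):
    """Return an (n x k) sparse dummy-variable matrix for a vector of labels."""
    labels = np.asarray(labels)
    uniques, codes = np.unique(labels, return_inverse=True)  # integer code per label
    n, k = labels.shape[0], uniques.shape[0]
    data = np.ones(n)
    rows = np.arange(n)
    return sp.csr_matrix((data, (rows, codes)), shape=(n, k))

# ~9 million OD observations over 3,000 origins, as in the US counties example
origins = np.repeat(np.arange(3000), 3000)
X_fixed = sparse_categorical(origins)          # 9,000,000 x 3,000 sparse block
```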
Testing and Accounting for Overdispersion
Since the number of potential origin-destination flow observations grows with the square of the number of locations, and because there are often no trips occurring between many pairs of locations, spatial interaction datasets can often be overdispersed. That is, there is more variation in the data than would be expected given the mean of the observations. Therefore, several well-known tests for overdispersion in the context of count (i.e., Poisson) models were implemented (Cameron & Trivedi, 2012). In addition, a QuasiPoisson family was added that can be activated using the Quasi=True parameterization. If it is decided to accommodate probability models other than Poisson, as previously discussed, then the Quasi flag would be replaced with a family parameter that could be set to Gaussian, Poisson, or QuasiPoisson. The purpose of the QuasiPoisson formulation is to calibrate a Poisson GLM that allows the variance to differ from the mean, equality of which is a key assumption of the default Poisson model.
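As an illustration of what such a test looks like, the sketch below runs one Cameron & Trivedi style auxiliary regression for overdispersion; the helper is a hedged, generic version rather than the exact test function added to the project.

```python
# A hedged sketch of a Cameron & Trivedi style auxiliary-regression test for
# overdispersion in a Poisson model; illustrative, not the project's code.
import numpy as np
import statsmodels.api as sm

def overdispersion_test(y, mu):
    """Regress ((y - mu)^2 - y)/mu on mu (no intercept); a significantly
    positive coefficient suggests variance > mean, i.e., overdispersion."""
    z = ((y - mu) ** 2 - y) / mu
    aux = sm.OLS(z, mu).fit()
    return aux.params[0], aux.tvalues[0]

y = np.random.poisson(5, 500).astype(float)     # toy counts
mu = np.full(500, y.mean())                     # fitted means from a null Poisson model
alpha, tval = overdispersion_test(y, mu)
print(alpha, tval)
```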
Related blog post(s): post six
Exploratory Spatial Data Analysis for Spatial Interaction
To explore the spatial clustering of raw spatial interaction data, an implementation of Moran's I spatial autocorrelation statistic for vectors was completed. Then several experiments were carried out to test the different randomization techniques that could be used for hypothesis testing of the computed statistic. This analytic is useful for exploring your data before you calibrate a model, to see if there are spatial associations above and beyond what you might expect from otherwise random data. More will be said about the potential to expand this statistic at the end of this review in the 'Unexpected Discoveries' section.
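The permutation-based inference works roughly as sketched below. This is only a schematic: the statistic shown follows the familiar Moran-style cross-product form applied to the vector components (u, v), which is an assumption made for illustration and not necessarily the exact form implemented in the project.

```python
# Schematic permutation test for a vector-based Moran's I; the statistic's form
# here is an illustrative assumption, not the project's exact implementation.
import numpy as np

def vector_moran(u, v, W):
    """Moran-like statistic for vectors with components (u, v) and weights W."""
    u = u - u.mean()
    v = v - v.mean()
    num = (W * (np.outer(u, u) + np.outer(v, v))).sum()
    den = (u ** 2 + v ** 2).sum()
    return (u.shape[0] / W.sum()) * (num / den)

rng = np.random.default_rng(0)
u, v = rng.normal(size=50), rng.normal(size=50)      # flow vector components
W = rng.random((50, 50)); np.fill_diagonal(W, 0)     # toy spatial weights

observed = vector_moran(u, v, W)
perms = [vector_moran(*rng.permutation(np.column_stack((u, v))).T, W)
         for _ in range(999)]                        # shuffle vectors across locations
p = (np.sum(np.abs(perms) >= abs(observed)) + 1) / 1000
print(observed, p)
```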
Related blog post(s): post seven
Investigating Non-Stationarity in Spatial Interaction
A method called local() was added to the Gravity, Production, and Attraction classes that allows the models to be calibrated separately for each location, such that a set of parameter estimates and associated diagnostics is acquired for each individual subset of the data. These results can then be mapped, either using python or conventional GIS software, in order to explore how relationships change over space.
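A hedged sketch of how this might be used is given below; the import path, argument names, and the assumption that local() returns a collection of per-location estimates and diagnostics are illustrative rather than the definitive API.

```python
# Illustrative use of local(); names and return structure are assumptions.
import numpy as np
from spint.gravity import Production  # assumed import path

flows = np.random.randint(1, 100, 25)
origins = np.repeat(np.arange(5), 5)
d_vars = np.random.rand(25, 1)
cost = np.random.rand(25) * 10

prod = Production(flows, origins, d_vars, cost, 'exp')
local_results = prod.local()   # one calibration (and set of diagnostics) per origin
# The location-specific estimates (e.g., distance decay) can then be joined to
# origin geometries and mapped to explore spatial non-stationarity.
```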
Related blog post(s): post seven
Spatial Weights for Spatial Interaction
In order to carry out the vector-based spatial autocorrelation statistics, as well as various types of spatial autoregressive model specifications, it is necessary to define spatial associations between flows using a spatial weight matrix. To this end, three types of spatial weights were implemented, which can be found in the spintW script in the weights module. The first is an origin-destination contiguity-based weight that encodes two flows as neighbors if they share either an origin or a destination. The second is based on a 4-dimensional distance (origin x, origin y, destination x, destination y), where the strength of the association decays with increasing distance. The third is a set of network-based weights that use different types of adjacency between flows represented on an abstract or physical network.
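To make the first weight type concrete, the sketch below builds a shared-origin/shared-destination contiguity weight from scratch; it illustrates the idea only and may differ in details from the spintW implementation.

```python
# From-scratch illustration: two flows are neighbors if they share an origin
# or a destination. Illustrative only, not the spintW implementation.
from libpysal.weights import W  # pysal.weights.W in the PySAL release of the time

od_pairs = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 1)]   # toy flows as (origin, destination) ids
neighbors = {i: [j for j, (oj, dj) in enumerate(od_pairs)
                 if j != i and (oi == oj or di == dj)]
             for i, (oi, di) in enumerate(od_pairs)}
w_flows = W(neighbors)
print(w_flows.neighbors)
```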
As part of this work, I also had the opportunity to contribute some speed-ups to the DistanceBand class in the Distance script of the weights module, so that it avoids a slow loop and can leverage both dense and sparse matrices. In the case of the 4-dimensional distance-based spatial weight, the resulting weight matrix is not sparse, and so the existing code could become quite slow. It is now possible to set the boolean parameter build_sp=False, which is more efficient when the distance-weighted entries of the weight matrix are increasingly non-zero.
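For example, a dense 4-dimensional distance-based weight could be built along these lines; the toy points, threshold, and decay parameter are illustrative values.

```python
# Using DistanceBand with build_sp=False, as described above, when the weight
# matrix is expected to be largely non-zero. Toy data and parameters only.
import numpy as np
from libpysal.weights import DistanceBand  # pysal.weights.DistanceBand at the time of writing

rng = np.random.default_rng(1)
od_points = rng.random((200, 4))            # (origin x, origin y, destination x, destination y)
w_vec = DistanceBand(od_points, threshold=9999, alpha=-1.5, binary=False,
                     build_sp=False)        # dense construction: faster when few zero entries
```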
Related blog post(s): post eight
Accounting for Spatial Association in Spatial Interaction Models
It has recently been proposed that, due to spatial autocorrelation in spatial interaction data, it is necessary to account for this spatial association, otherwise the estimated parameters may be biased. The proposed solution was a variation of the classic spatial autoregressive (i.e., spatial lag) model for flows, which can estimate additional parameters to capture spatial autocorrelation in the origins, in the destinations, and/or in a combination of the origins and destinations (LeSage and Pace, 2008). Unfortunately, no code was released to support this model specification, so I attempted to implement this new model. I was not able to completely replicate the model, but I was able to extend the existing pysal ML_Lag model to estimate all three autocorrelation parameters, rather than just one. I have also attempted to re-derive the appropriate variance-covariance matrix, though this will take some more work before it is completed. More on this in the 'Moving Forward' section found below.
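For orientation, the three lag operators in this style of specification can be built from a site-level weight matrix with Kronecker products, as sketched below. Which product corresponds to origin- versus destination-based dependence depends on how the flow vector is stacked, so the labels here assume an origin-major ordering and are illustrative.

```python
# Sketch of the three flow-lag operators (LeSage & Pace, 2008 style), built via
# Kronecker products; labels assume an origin-major stacking of the n*n flows.
import numpy as np
import scipy.sparse as sp

n = 5                                             # number of locations
rng = np.random.default_rng(0)
W_dense = rng.random((n, n)) * (1 - np.eye(n))    # toy site-level weights, zero diagonal
W = sp.csr_matrix(W_dense / W_dense.sum(axis=1, keepdims=True))  # row-standardized

I = sp.identity(n, format='csr')
Wd = sp.kron(I, W)    # destination-based dependence among the n*n flows
Wo = sp.kron(W, I)    # origin-based dependence
Ww = sp.kron(W, W)    # combined origin-destination dependence
# The flow model then adds one autoregressive parameter per operator:
# y = rho_d*Wd*y + rho_o*Wo*y + rho_w*Ww*y + X*beta + e
```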
Related blog post(s): post nine
Assessing Model Fit for Spatial Interaction Models
Several metrics were added for assessing model fit. These include McFadden's pseudo r-squared (based on likelihood ratios), the adjusted McFadden's pseudo r-squared to account for model complexity, D-squared or percent deviance (based on deviance ratios) and its adjusted counterpart, the standardized root mean square error (SRMSE), and the Sorensen similarity index (SSI). The D-squared statistics and the pseudo r-squared statistics are properties of the GLM class, while the SRMSE and SSI metrics have been added as properties of the BaseGravity class. However, the functions to compute the SSI and SRMSE are stored in the utils script, since they may also be useful for deterministic non-gravity-type models that could be implemented in the future.
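Two of these statistics are simple enough to sketch directly; the definitions below follow the standard textbook forms and are illustrative rather than the project's exact utility functions.

```python
# Hedged sketches of SRMSE and McFadden's pseudo r-squared (standard definitions).
import numpy as np

def srmse(observed, predicted):
    """Standardized root mean square error: RMSE divided by the mean observed flow."""
    n = observed.shape[0]
    return np.sqrt(((observed - predicted) ** 2).sum() / n) / observed.mean()

def mcfadden_pseudo_r2(llf, llnull, k=0):
    """McFadden's pseudo r-squared; pass k > 0 (number of parameters) for the adjusted version."""
    return 1.0 - (llf - k) / llnull

obs = np.array([10., 20., 30., 40.])
pred = np.array([12., 18., 33., 37.])
print(srmse(obs, pred))
```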
Related blog post(s): post ten
Unexpected Discoveries
While implementing the vector-based spatial autocorrelation statistic, it was noticed that one of the randomization techniques is not exactly random, depending on the hypothesis that one would like to test. In effect, when you produce many permutations and compare your test statistic to the resulting distribution of values, you find that you always reject the null hypothesis. Therefore, there is additional work to be done here to further define the different possible hypotheses and the appropriate randomization techniques.
Leftovers and Moving Forward
Future work will consist of completing the spatial interaction SAR model specification. It will also include adding gradient-based optimization of likelihood functions, rather than relying solely on iteratively weighted least squares. This will allow the implementation of other model extensions, such as the zero-inflated Poisson model. I was not able to implement these features because I was running short on time and decided to work on the SAR model, which turned out to be more complicated than I originally expected. Finally, future work will also incorporate deterministic models, spatial effects such as competing destinations and eigenvector spatial filters, and neural network spatial interaction models.