Tue 16 August 2016
By Taylor Oshan
In misc.
In the last week of the GSOC work, the focus is on wrapping everything up, which
means finalizing code and documentation, providing useful examples for
educational use, and reflecting on the entire project to provide a plan on how
to continue to grow the project beyond GSOC 2016.
The finalized code and documentation will be reflected in the project itself,
whereas the educational materials take the form of a Jupyter notebook
that demonstrates various features of the project on a real-life dataset (NYC
CITI bike share trips). The notebook can be found here. Other experiments and
prototyping notebooks can be found in this directory.
In order to systematically reflect on the progress made throughout the project,
I will now review the primary features that were developed, linking back to
pertinent blog posts where possible.
API Design and The SpInt Framework
The primary API consists of four user-exposed classes: Gravity, Production, Attraction, and
Doubly, which all inherit from a base class called BaseGravity. All of these
classes can be found in the gravity script. The user classes
accept the appropriate inputs for each of the four types of gravity-based
spatial interaction model: the basic gravity model, production-constrained
(origin-constrained), attraction-constrained (destination-constrained), and
doubly constrained. The BaseGravity class does most of the heavy lifting in
terms of preparing the appropriate design matrix. For now, BaseGravity inherits
from CountModel, which is designed to be a flexible generalized linear model
class that can accommodate several types of count models (i.e., Poisson,
negative binomial, etc.) and several different types of parameter estimation
(i.e., iteratively weighted least squares, gradient optimization, etc.). In
reality, CountModel currently only supports Poisson GLMs (based on a
customized implementation of the statsmodels GLM) and iteratively
weighted least squares estimation, which will be discussed further later in this
review. In addition, the user may have a continuous
dependent variable, say trade flows in dollars between countries, and therefore
might want to use a non-count model, like Gaussian ordinary least squares.
Hence, it may make more sense in the future to move away from CountModel
and have the BaseGravity class dispatch to the appropriate probability
model and estimation technique.
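The iteratively weighted least squares routine mentioned above can be illustrated with a minimal, self-contained sketch of Poisson GLM estimation. This is an illustration of the general technique, not the project's or statsmodels' actual implementation; the function name and defaults are mine:

```python
# Minimal sketch of Poisson-GLM calibration via iteratively weighted
# least squares (IWLS) with a log link.
import numpy as np

def poisson_iwls(X, y, n_iter=25, tol=1e-8):
    """Fit log-link Poisson regression coefficients by IWLS."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)             # current mean estimates
        z = X @ beta + (y - mu) / mu      # working response
        W = mu                            # Poisson IWLS weights (var = mean)
        XtW = X.T * W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

Each iteration solves a weighted least-squares problem against a linearized "working" response, which is why a single flexible estimation loop can serve several GLM families.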
Related blog post(s): post one ; post two
Sparse Compatibility
Because the constrained varieties of gravity models (Production, Attraction,
Doubly) require either N or 2N categorical variables (fixed effects), where N is the number of locations
that flows may move between, a very large sparse design matrix is necessary for
any non-trivial dataset. Therefore, a large amount of effort was put into
efficiently building the sparse design matrix, specifically the sparse
categorical variables. After much testing and benchmarking, a function was
developed that can construct the sparse categorical variable portion of the
design matrix relatively quickly. This function is particularly fast if the set of
locations, N, is indexed using integers, though it is still efficient if unique
locations are labeled using string identifiers. The sparse design matrix allows
more efficient model calibration. For example, for a spatial system akin to all of
the counties in the US (~3k locations or ~9 million observations), it takes less
than a minute to calibrate a production-constrained model (3k fixed effects) on my notebook, and about
two minutes for a doubly-constrained model (6k fixed effects).
It was decided to use normal (dense) arrays for the basic Gravity model, since it
has no fixed effects by default and its design matrix is dense, so using sparse
matrices would become inefficient as the number of observations grows.
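The core idea of the sparse categorical construction can be sketched with scipy.sparse; the function name and details here are illustrative, not the project's actual code:

```python
# Sketch: one-hot encode location labels as a sparse CSR matrix, so the
# fixed-effects block of the design matrix never materializes densely.
import numpy as np
from scipy import sparse

def sparse_categorical(labels):
    """One-hot encode a sequence of location labels as a CSR matrix."""
    labels = np.asarray(labels)
    cats, col = np.unique(labels, return_inverse=True)  # integer codes per label
    row = np.arange(labels.size)
    data = np.ones(labels.size)
    return sparse.csr_matrix((data, (row, col)),
                             shape=(labels.size, cats.size))
```

Each row has exactly one non-zero entry, so storage grows with the number of observations rather than observations times locations.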
Related blog post(s): post three ; post four ; post five
Testing and Accounting for Overdispersion
Since the number of potential origin-destination flow observations grows quadratically as the number of locations
increases, and because there are often no trips occurring between many locations, spatial interaction datasets can often be overdispersed. That is, there is more variation in the data than would be expected given the mean of the observations. Therefore, several well-known tests for overdispersion (Cameron & Trivedi, 2012) in the context of count (i.e., Poisson) models were implemented. In addition, a QuasiPoisson family was added that can be activated using the Quasi=True parameterization. If it is decided to accommodate probability models other than Poisson, as previously discussed, then the Quasi flag would be replaced with a family parameter that could be set to Gaussian, Poisson, or QuasiPoisson. The purpose of the QuasiPoisson formulation is to calibrate a Poisson GLM that allows the variance to differ from the mean, equality of which is a key assumption of the default Poisson model.
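One of the Cameron & Trivedi auxiliary-regression tests can be sketched as follows. This illustrates the technique only, assuming fitted means mu are already available; it is not the project's implementation:

```python
# Sketch of an auxiliary-regression overdispersion test: regress
# ((y - mu)^2 - y) / mu on mu with no intercept; a significantly
# positive slope signals overdispersion (variance exceeding the mean).
import numpy as np

def dispersion_test(y, mu):
    z = ((y - mu) ** 2 - y) / mu          # auxiliary response
    alpha = (z @ mu) / (mu @ mu)          # OLS slope through the origin
    resid = z - alpha * mu
    se = np.sqrt((resid @ resid) / (len(y) - 1) / (mu @ mu))
    return alpha, alpha / se              # slope and its t-statistic
```

For equidispersed data the slope hovers around zero; for data whose variance greatly exceeds the fitted mean, it is positive.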
Related blog post(s): post six
Exploratory Spatial Data Analysis for Spatial Interaction
To explore the spatial clustering nature of raw spatial interaction data, an
implementation of Moran's I spatial autocorrelation statistic for vectors was
completed. Then, several experiments were carried out to test the different
randomization techniques that could be used for hypothesis testing of the
computed statistic. This analytic is useful for exploring your data before you
calibrate a model, to see if there are spatial associations above and beyond what
you might expect from otherwise random data. More will be said about the
potential to expand this statistic at the end of this review in the 'Unexpected
Discoveries' section.
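A rough sketch of one possible vector formulation of the statistic is below; the exact formulation is an assumption on my part (the scalar statistic applied jointly to mean-centered flow components), not necessarily the implemented one:

```python
# Sketch: Moran's I for vectors, treating each flow as a 2-D vector
# (u, v) = (dx, dy) and weighting cross-products of both mean-centered
# components by a flow-level spatial weight matrix W.
import numpy as np

def vector_morans_i(u, v, W):
    u = u - u.mean()                       # center both components
    v = v - v.mean()
    n = len(u)
    s0 = W.sum()
    num = u @ W @ u + v @ W @ v            # weighted cross-products
    den = u @ u + v @ v
    return (n / s0) * num / den
```

As with scalar Moran's I, positive values indicate that nearby flows point in similar directions, negative values the opposite.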
Related blog post(s): post seven
Investigating Non-Stationarity in Spatial Interaction
A method called local() was added to the Gravity, Production, and Attraction classes that
allows the model to be calibrated separately for each location, such that a
set of parameter estimates and associated diagnostics is acquired for each individual
subset of the data. These results can then be mapped, either using Python or other
conventional GIS software, in order to explore how relationships change over
space.
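The idea behind local() can be sketched as a loop that refits the model once per origin and collects the estimates. This is a hypothetical illustration in which a simple log-linear least-squares fit stands in for the full GLM calibration:

```python
# Sketch: "local" calibration by refitting per origin, yielding one
# coefficient vector (and, in the real method, diagnostics) per subset.
import numpy as np

def local_fits(origins, X, y):
    """Return {origin: coefficient vector} fit on that origin's flows."""
    results = {}
    for o in np.unique(origins):
        mask = origins == o                       # rows for this origin
        coef, *_ = np.linalg.lstsq(X[mask], np.log(y[mask]), rcond=None)
        results[o] = coef
    return results
```

Mapping each origin's coefficients then reveals how the estimated relationships vary over space.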
Related blog post(s): post seven
Spatial Weights for Spatial Interaction
In order to carry out the vector-based spatial autocorrelation statistics, as
well as various types of spatial autoregressive model specifications, it is
necessary to define spatial associations between flows using a spatial weight
matrix. To this end, three types of spatial weights were implemented, which can
be found in the spintW script in the weights module. The first is an
origin-destination contiguity-based weight that encodes two flows as neighbors
if they share either an origin or a destination. The second weight is based on a
4-dimensional distance (origin x, origin y, destination x, destination y), where
the strength of the association decays with increasing distance. Finally,
network-based weights use different types of adjacency between flows represented
as an abstract or physical network.
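The first weight type can be sketched in a few lines; the function is illustrative, not the spintW implementation:

```python
# Sketch: origin-destination contiguity weights. Two flows are neighbors
# if they share an origin or a destination.
def od_contiguity(flows):
    """flows: list of (origin, destination) pairs -> {index: neighbor indices}."""
    nbrs = {i: [] for i in range(len(flows))}
    for i, (oi, di) in enumerate(flows):
        for j, (oj, dj) in enumerate(flows):
            if i != j and (oi == oj or di == dj):
                nbrs[i].append(j)
    return nbrs
```

The quadratic double loop is fine for illustration; a production version would index flows by origin and destination to avoid comparing every pair.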
As part of this work I also had the opportunity to contribute some speed-ups to
the DistanceBand class in the Distance script of the weights module, so that it
avoids a slow loop and can leverage both dense and sparse matrices. In the
case of the 4-dimensional distance-based spatial weight, the resulting spatial
weight is not sparse, and so the existing code could become quite slow. Now it
is possible to set the boolean parameter build_sp=False, which is more
efficient when the distance-weighted entries of the weight matrix are
increasingly non-zero.
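The 4-dimensional distance-decay idea can be sketched as follows; the decay function and its parameter are illustrative choices, not the implemented defaults:

```python
# Sketch: dense 4-D distance-decay weights. Each flow is a point
# (ox, oy, dx, dy) and w_ij decays with the Euclidean distance between
# flows i and j; the diagonal is zeroed (no self-neighbors).
import numpy as np

def flow_distance_weights(P, alpha=1.0):
    """P: (n, 4) array of (ox, oy, dx, dy); returns an (n, n) weight matrix."""
    diff = P[:, None, :] - P[None, :, :]       # pairwise coordinate differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # 4-D Euclidean distances
    W = np.exp(-alpha * dist)                  # distance-decay weights
    np.fill_diagonal(W, 0.0)
    return W
```

Because exp(-alpha * d) is non-zero for every finite distance, the result is fully dense, which is exactly why the dense (build_sp=False) code path matters here.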
Related blog post(s): post eight
Accounting for Spatial Association in Spatial Interaction Models
It has recently been proposed that, due to spatial autocorrelation in spatial
interaction data, it is necessary to account for spatial association,
otherwise the estimated parameters could be biased. The proposed solution was a variation
of the classic spatial autoregressive (i.e., spatial lag) model for flows, which can
estimate additional parameters to capture spatial autocorrelation in the
origins, in the destinations, and/or in a combination of the origins and
destinations (LeSage and Pace, 2008). Unfortunately, no code was released to
support this model specification, so I attempted to implement this new model. I
was not able to completely replicate the model, but I was able to extend the
existing pysal ML_Lag model to estimate all three autocorrelation parameters,
rather than just one. I have also attempted to re-derive the appropriate
variance-covariance matrix, though this will take some more work before it is
completed. More on this in the 'Moving Forward' section found below.
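Following LeSage and Pace (2008), the origin-, destination-, and combined lag operators can be built as Kronecker products of a location-level weight matrix; this sketch assumes one common convention for stacking the flow vector:

```python
# Sketch: with flows stacked as a length n*n vector and W an n x n
# weight matrix over locations, the three flow-level lag operators
# are Kronecker products of W with the identity (and with itself).
import numpy as np

def flow_lag_operators(W):
    n = W.shape[0]
    I = np.eye(n)
    Wd = np.kron(I, W)   # destination-based lag
    Wo = np.kron(W, I)   # origin-based lag
    Ww = np.kron(W, W)   # combined origin-destination lag
    return Wo, Wd, Ww
```

By the Kronecker mixed-product property, the combined operator equals the product of the other two, which is what lets the model carry one autocorrelation parameter for each.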
Related blog post(s): post nine
Assessing Model Fit for Spatial Interaction Models
Several metrics were added for assessing model fit. These include McFadden's
pseudo R-squared (based on likelihood ratios), the adjusted McFadden's pseudo
R-squared to account for model complexity, D-squared or percent deviance (based
on deviance ratios) and its adjusted counterpart, the standardized root mean
square error (SRMSE), and the Sørensen similarity index (SSI). The D-squared
statistics and the pseudo R-squared statistics are properties of the GLM class,
while the SRMSE and SSI metrics have been added as properties of the BaseGravity
class. However, the functions to compute the SSI and SRMSE are stored in the
utils script, since they may also be useful for deterministic, non-gravity-type
models that could be implemented in the future.
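The SRMSE and SSI can be sketched directly; the formulas below are as commonly given in the spatial interaction literature and are not necessarily the exact utils implementations:

```python
# Sketch: two fit metrics for observed flows y vs. predicted flows yhat.
import numpy as np

def srmse(y, yhat):
    """Root-mean-square error standardized by the mean observed flow."""
    return np.sqrt(np.mean((y - yhat) ** 2)) / np.mean(y)

def ssi(y, yhat):
    """Mean Sorensen similarity of each observed/predicted pair (1 = perfect)."""
    return np.mean(2 * np.minimum(y, yhat) / (y + yhat))
```

A perfect prediction gives SRMSE of 0 and SSI of 1, which makes the two easy to read side by side.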
Related blog post(s): post ten
Unexpected Discoveries
While implementing the vector-based spatial autocorrelation statistic, it was
noticed that one of the randomization techniques is not exactly random,
depending on the hypothesis that one would like to test. In effect, when you
produce many permutations and compare your test statistic to the resulting
distribution of values, you find that you always reject your statistic. Therefore,
there is additional work to be done here to further define the different possible
hypotheses and the appropriate randomization techniques.
Leftovers and Moving Forward
Future work will consist of completing the spatial interaction SAR model
specification. It will also include adding gradient-based optimization of
likelihood functions, rather than solely iteratively weighted least squares.
This will allow the implementation of other model extensions, such as the
zero-inflated Poisson model. I was not able to implement these features because
I was running short on time and decided to work on the SAR model, which turned
out to be more complicated than I originally expected. Finally, future work
will also incorporate deterministic models, spatial effects such as competing
destinations and eigenvector spatial filters, and neural network spatial
interaction models.