Austin Stat. 2014;1(1): 9.
Abstract
This paper proposes a novel two-sample two-dimensional Kolmogorov-Smirnov type test for the proportionality of intensity functions in case-control spatial point processes. The proposed test statistic is based on the absolute maximum deviation of proportions of points observed in a selected π-system. We show that the asymptotic null distribution of the test statistic is determined by a two-dimensional functional pinned Brownian sheet, which depends on both the true intensity functions and the shape of the region. However, by carefully selecting the π-system, the asymptotic null distribution may be reduced to that of the standard Brownian bridge. Simulation studies show that the proposed test is effective in testing the proportionality in case-control spatial point processes. In an application to a West Nile virus study in Nebraska, USA, the method shows that the proportionality between the case (i.e. positive to the virus) and the control (i.e. negative to the virus) is violated.
Keywords: Case-control spatial point processes; Two-dimensional functional pinned Brownian sheet; Kolmogorov-Smirnov test; Proportionality
Introduction
A fundamental problem in spatial epidemiology is understanding the relationship between risks experienced by humans or animals. A widely used method to address such a problem is to consider a case-control study for a certain risk [1]. In this research, we focus on problems of case-control studies for spatial point processes. In a case-control spatial point process model, data often consist of locations in a specific geographical region which can be classified into two categories: observations from the case process (composed of incidence locations of a particular disease) and observations from the control process (composed of incidence locations of other diseases). Typically, each observation in the case process presents a positive result to a certain medical test while each observation in the control process presents a negative one. A common task in the analysis of case-control spatial point patterns is to compare their spatial distributions. For instance, in our data example of Section 5, we are interested in comparing the spatial distributions of dead birds according to whether or not they were infected by West Nile virus. The main interest is to discover whether the spatial distribution of the case process (i.e. positive incidences) and the spatial distribution of the control process (i.e. negative incidences) are the same. The results of the analysis can provide potentially useful information on how the behavior of birds is affected by infection with the West Nile virus.
To make the comparison, we study the relationship between the two spatial point processes by testing whether their intensity functions are proportional. A useful method to facilitate the comparison is to assume that both the case and control processes are inhomogeneous spatial Poisson processes. To compare their distributions, it is sufficient to simply analyze their intensity functions. If the distributions of the case and control processes are the same, then their intensity functions are proportional. This is called the proportionality of a case-control study for spatial point processes [2]. A proportional intensity model is derived if the proportionality holds.
In the literature on spatial epidemiology, the proportional intensity model is often used as a baseline assumption for model development. For instance, the spatial distribution of larynx cancer was compared to the spatial distribution of lung cancer around a prespecified location in the Chorley-Ribble area. The proportional intensity model was derived if the spatial distributions of the two cancers around the location were similar [3]. The proportional intensity model has been modified to include explanatory variables for the relationship between the two intensity functions [2]. In addition, a statistical model with clustering effects in the case process has also been extended from the proportional intensity model [4]. A second-order analysis approach to the proportional intensity model has also been considered [5].
Although it is a useful assumption, the proportionality in case-control spatial point processes may be questionable in real applications. We note that the comparison between two cumulative distribution functions based on the two-sample two-dimensional Kolmogorov-Smirnov (KS) test has been extensively studied in the statistical literature [6]. However, little has been done for spatial point processes. In this article, we develop a spatial point process version of the popular two-sample two-dimensional KS test. Our test statistic is constructed in terms of the absolute maximum difference between the observed point proportions from the two processes. A nice property of the proposed method is that the asymptotic null distribution of the test statistic can be derived under only a few weak assumptions. To the best of our knowledge, this is the first formal test that compares the intensity functions between two spatial point processes.
The remainder of the article is organized as follows. In Section 2, we review the necessary background on the two-sample two-dimensional KS test for cumulative distribution functions (CDFs). In Section 3, we propose our test for case-control spatial point processes. In Section 4, we present a simulation study to evaluate our testing method. In Section 5, we apply our testing method to the Nebraska West Nile data. In Section 6, we conclude this article with a discussion.
Two-Sample Two-dimensional KS Test for CDFs
The KS test was originally proposed for one-sample one-dimensional continuous data [7] and later extended to one-sample multi-dimensional continuous data [8]. The aim of the one-sample KS test is to determine the distribution family of the observed data. Since it is often necessary to compare two distributions, the two-sample KS test was proposed [9]. This method was later extended to a multiple-sample KS test for the comparison of multiple one-dimensional distributions [10]. The idea of the KS test for one-dimensional distributions was later extended to multi-dimensional cases for the study of astronomical data, which includes the two-sample two-dimensional KS test [11] as well as the two-sample multi-dimensional KS test [12].
Since the focus of this article is to develop a two-sample KS test for spatial point process data, we review only the two-sample two-dimensional KS test. Although the KS test is one of the most important goodness-of-fit tests based on the empirical distribution functions of random samples, it has not yet been well extended to the multivariate case [13]. The problem is that, unlike in the univariate case, the asymptotic null distribution of the test statistic is not distribution-free. Although a method using a simple transformation to make the asymptotic null distribution distribution-free has been proposed [8], this method cannot be used in the two-sample two-dimensional KS test since it involves the true distribution of the observed data, which is unknown in the two-sample problem.
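Let $X$ and $Y$ be two independent random vectors with CDFs $F_1$ and $F_2$ on $\mathbb{R}^2$, respectively, where $F_1$ and $F_2$ are unknown. A classical two-sample nonparametric testing problem considers the null hypothesis
$$H_0: F_1 = F_2$$
against the alternative hypothesis
$$H_1: F_1 \neq F_2.$$
This kind of problem arises when we are given an observed sample $X_1,\dots,X_{n_1}$ of $X$ and an observed sample $Y_1,\dots,Y_{n_2}$ of $Y$, and it must be determined whether the two distributions are equal. The idea of the KS test is to compare the maximum difference between the empirical distributions of the sampled data, where a significant difference is concluded if its value is large. Let the empirical distribution of $X_1,\dots,X_{n_1}$ be
$$F_{1,n_1}(x) = \frac{1}{n_1}\sum_{i=1}^{n_1} I(X_i \le x)$$
and the empirical distribution of $Y_1,\dots,Y_{n_2}$ be
$$F_{2,n_2}(x) = \frac{1}{n_2}\sum_{j=1}^{n_2} I(Y_j \le x),$$
where, with $x=(x_1,x_2)$ and $X_i=(X_{i1},X_{i2})$ for $i=1,\dots,n_1$, $I(X_i\le x)$ is the indicator function on $\mathbb{R}^2$ which equals one if $X_{i1}\le x_1$ and $X_{i2}\le x_2$ and zero otherwise. The two-sample two-dimensional KS statistic, denoted by $D_{n_1,n_2}$, for testing $H_0$ against $H_1$ is defined as
$$D_{n_1,n_2} = \sqrt{\frac{n_1 n_2}{n_1+n_2}}\,\sup_{x\in\mathbb{R}^2}\left|F_{1,n_1}(x)-F_{2,n_2}(x)\right|, \qquad (1)$$
where $\sqrt{n_1 n_2/(n_1+n_2)}$ is the standard term used to ensure that the statistic converges to a common limiting distribution. The asymptotic null distribution of $D_{n_1,n_2}$ is provided in the following proposition.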
Proposition 1
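Let $X_1,\dots,X_{n_1}$ and $Y_1,\dots,Y_{n_2}$ be independently observed from $F_1$ and $F_2$, respectively. Then under $H_0$, with $F=F_1=F_2$,
$$D_{n_1,n_2} \xrightarrow{d} \sup_{x\in\mathbb{R}^2}\left|W_F(x)\right| \qquad (2)$$
as $\min(n_1,n_2)\to\infty$, where $W_F$ is the F-functional two-dimensional pinned Brownian sheet, which is a mean-zero Gaussian process on $\mathbb{R}^2$ with the covariance function given by
$$\mathrm{Cov}\{W_F(x),W_F(y)\} = F(x\wedge y) - F(x)F(y), \qquad x\wedge y = (\min(x_1,y_1),\min(x_2,y_2)).$$
Therefore, the covariance function of $W_F(x)$ is the same as the covariance function of $I(X\le x)$ for $X\sim F$. To show the asymptotic distribution of $D_{n_1,n_2}$ given by (2), we need to use the basic theory of the empirical distribution, which includes Theorem 19.4, Theorem 19.5, and the method for the Donsker condition given by Example 19.6 in [14]. First, we consider $\sqrt{n_1}\{F_{1,n_1}(x)-F(x)\}$ as $n_1\to\infty$. According to the theory of the empirical distribution, it weakly converges to $W_F(x)$ as $n_1\to\infty$. A similar conclusion also holds for $\sqrt{n_2}\{F_{2,n_2}(x)-F(x)\}$ as $n_2\to\infty$. Note that these two expressions are independent. We conclude that $\sqrt{n_1 n_2/(n_1+n_2)}\,\{F_{1,n_1}(x)-F_{2,n_2}(x)\}$ weakly converges to $W_F(x)$ as $\min(n_1,n_2)\to\infty$. With the method given by Example 19.6 of [14], we can show that the Donsker condition holds in this case. Then, the conclusion given by (2) is drawn using Theorem 19.5 of [14].
In Proposition 1, if F is the CDF of the uniform distribution on $[0,1]^2$, then $W_F$ is called the two-dimensional standard pinned Brownian sheet, which is denoted by W(x) for $x\in[0,1]^2$. It is clear that the covariance function of W(x) is given by $\mathrm{Cov}\{W(x),W(y)\} = \min(x_1,y_1)\min(x_2,y_2) - x_1x_2y_1y_2$, $x,y\in[0,1]^2$.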
To conduct the test, one can first directly compute the value of $D_{n_1,n_2}$ defined in (1) and then compare it with the upper-tail critical values obtained from its limiting distribution given by (2). As neither the exact nor the approximate distribution of $D_{n_1,n_2}$ is known, a Monte Carlo method is often used. Since the limiting distribution may depend on F, it is not easy to provide a general list of critical values for the significance of $D_{n_1,n_2}$. This issue will be discussed later in Section 4. Using the simulation method, it can be shown that if F is the CDF of the uniform distribution on $[0,1]^2$, then the critical values at the 1%, 5%, and 10% levels are approximately equal to 1.8656, 1.6522, and 1.4937, respectively. However, this is not enough to carry out a general two-sample KS test for two-dimensional CDFs.
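The computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`ecdf_on_grid`, `ks2d_statistic`, `mc_critical_values`) are hypothetical, the statistic is taken to be the normalized maximum absolute difference between the two bivariate empirical CDFs, and the supremum is evaluated on the grid formed by the pooled coordinates, which is sufficient because both empirical CDFs are step functions that only change at those coordinates. The Monte Carlo routine assumes F is uniform on the unit square, as in the special case discussed above; for finite samples it gives only a rough approximation to the asymptotic critical values.

```python
import numpy as np

def ecdf_on_grid(sample, gx, gy):
    """Bivariate empirical CDF of `sample` at every grid point (gx[i], gy[j])."""
    # Each coordinate of `sample` is assumed to appear in the sorted grids.
    ix = np.searchsorted(gx, sample[:, 0], side="left")
    iy = np.searchsorted(gy, sample[:, 1], side="left")
    counts = np.zeros((gx.size, gy.size))
    np.add.at(counts, (ix, iy), 1.0)  # handles repeated index pairs
    # Double cumulative sum turns cell counts into lower-left quadrant counts.
    return counts.cumsum(axis=0).cumsum(axis=1) / sample.shape[0]

def ks2d_statistic(x, y):
    """Normalized sup-distance between the empirical CDFs of two 2-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    pooled = np.vstack([x, y])
    gx = np.unique(pooled[:, 0])  # sorted unique first coordinates
    gy = np.unique(pooled[:, 1])  # sorted unique second coordinates
    diff = ecdf_on_grid(x, gx, gy) - ecdf_on_grid(y, gx, gy)
    return float(np.sqrt(n1 * n2 / (n1 + n2)) * np.abs(diff).max())

def mc_critical_values(n1, n2, reps=500, levels=(0.10, 0.05, 0.01), seed=0):
    """Monte Carlo upper-tail critical values when F is uniform on [0,1]^2."""
    rng = np.random.default_rng(seed)
    stats = np.array([
        ks2d_statistic(rng.random((n1, 2)), rng.random((n2, 2)))
        for _ in range(reps)
    ])
    return {a: float(np.quantile(stats, 1.0 - a)) for a in levels}
```

For example, `mc_critical_values(200, 200)` returns a dictionary of simulated critical values keyed by significance level. The grid has one cell per pair of pooled coordinates, so memory grows quadratically with the pooled sample size.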
Method
Although much has been done for the comparison between two CDFs, little work has been done for spatial point processes. In this section, we propose our method, which is modified from the two-sample KS test, for the comparison between two independent spatial point processes. Since the most important characteristic of a spatial point process is its (first-order) intensity function, we focus on the comparison between the intensity functions of two spatial point processes. If their intensity functions are proportional, then the two spatial point processes will have similar features, which indicates that their distributions are affected by the same spatially varying factors. In this article, we consider the simplest case of such a problem: the comparison of intensity functions between the case point process and the control point process in a case-control study, where both case and control point processes can be modeled by inhomogeneous Poisson processes with unknown intensity functions [2].
Spatial point processes
The theory and concepts of spatial point processes are well established and are available in many textbooks [15-17]. Overall, a spatial point process is defined on a measurable subset of a complete separable metric space. Let the complete separable metric space be $\mathbb{R}^2$ and the measurable subset be S. Then, a spatial point process N is composed of points observed in S. Denote B(S) as the collection of all Borel sets of S. Let N(A) be the number of points in A∈B(S). Then, N(A) is finite if A is bounded. If N(A) and N(A') are independent for any disjoint A and A' in B(S), then N is called a spatial Poisson process. If N is a spatial Poisson process, then its distribution can be uniquely determined by its intensity function λ(s), which is defined by
$$\lambda(s) = \lim_{\delta(U_s)\to 0}\frac{E[N(U_s)]}{|U_s|},$$
where $U_s$ is a neighborhood of s∈S, $|U_s|$ is its Lebesgue measure, and $\delta(U_s)$ represents the diameter of $U_s$: $\delta(U_s)=\sup\{d(s',s''): s',s''\in U_s\}$ for a distance function d. If N is a spatial Poisson process, then N(A) follows a Poisson distribution with mean $\mu(A)=\int_A\lambda(s)\,ds$. In the remainder of this section, we propose our methods based on a case-control spatial Poisson process, where both the case and the control processes are modeled by spatial Poisson processes.
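To make the role of the intensity function concrete, the following Python sketch simulates an inhomogeneous spatial Poisson process on a rectangle by thinning: a homogeneous process with a constant dominating rate is generated first, and each point is then retained with probability proportional to the target intensity at its location. The function name `simulate_poisson_process`, the rectangular window, and the bound `lam_max` are assumptions made for this illustration and are not taken from the article; the sketch is valid only if the supplied intensity never exceeds `lam_max` on the window.

```python
import numpy as np

def simulate_poisson_process(intensity, lam_max, a=1.0, b=1.0, seed=None):
    """Simulate a Poisson process on [0,a] x [0,b] by thinning.

    intensity : vectorized function (x, y) -> nonnegative intensity values
    lam_max   : assumed upper bound of `intensity` on the window
    """
    rng = np.random.default_rng(seed)
    # Homogeneous process with rate lam_max: Poisson count, uniform locations.
    n = rng.poisson(lam_max * a * b)
    pts = rng.random((n, 2)) * np.array([a, b])
    # Keep each point independently with probability intensity / lam_max.
    keep = rng.random(n) * lam_max < intensity(pts[:, 0], pts[:, 1])
    return pts[keep]
```

For instance, `simulate_poisson_process(lambda x, y: 100.0 * x, 100.0, seed=1)` draws a pattern on the unit square whose expected number of points is the integral of the intensity over the window, here 50, with points becoming denser toward the right edge.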
The test statistic
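Let $N_1$ and $N_2$ be two independent spatial Poisson processes on S with intensity functions $\lambda_1(s)$ and $\lambda_2(s)$, respectively. Then, for any A∈B(S), $N_1(A)$ and $N_2(A)$ are independent Poisson random variables with mean functions $\mu_1(A)=\int_A\lambda_1(s)\,ds$ and $\mu_2(A)=\int_A\lambda_2(s)\,ds$, respectively, where both $\lambda_1$ and $\lambda_2$ are positive and continuous. In this article, we focus on testing the null hypothesis of
$$H_0: \lambda_1(s) = \omega\lambda_2(s) \ \text{for all } s\in S$$
for some ω>0 against the alternative hypothesis of
$$H_1: \lambda_1(s) \neq \omega\lambda_2(s) \ \text{for some } s\in S$$
for any ω>0, where $H_0$ implies that $\lambda_1$ and $\lambda_2$ are proportional and $H_1$ implies that $\lambda_1$ and $\lambda_2$ are not proportional.
Note that $H_0$ can be interpreted as: there exists an ω>0 such that $\mu_1(A)=\omega\mu_2(A)$ for any A∈B(S), and $H_1$ can be interpreted as: for any ω>0 there exists an A∈B(S) such that $\mu_1(A)\neq\omega\mu_2(A)$. Let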
Then, for a given A, is a function of ω. The null hypothesis is equivalent to that there exists an ω>0 such that
and the alternative hypothesis is equivalent to that for any ω>0
Our test statistic is developed by considering the behavior of for all A∈B(S). The basic idea of our approach is formulated in the following theorem.
Theorem 1 Let $\mathcal{S}\subseteq B(S)$ be a collection of Borel sets in S. If $\mathcal{S}$ is a π-system, i.e. $\mathcal{S}$ satisfies $A\cap A'\in\mathcal{S}$ if $A,A'\in\mathcal{S}$, then a necessary condition for Equation (7) to hold for all A∈B(S) is that there is an ω>0 such that Equation (7) holds for any $A\in\mathcal{S}$. In addition, if B(S) can be generated by $\mathcal{S}$, then the condition is also sufficient.
Proof: The necessity can be directly implied by Equation (7). We only need to show the sufficiency. Let . Then, is a signed measure for any ω>0. According to the Hahn Decomposition ([18], P 420), we can find and with and such that can be almost surely uniquely decomposed into with
for any A∈B(S), where and are two nonnegative σ-finite measures on S. Let be the true value of ω such that for all A∈S. Then, and agree on S. According to the theorem of the π-λ system which says that if two measures agree on a π-system then they also agree on the σ-algebra of the π-system (e.g. Theorem 3.3 in [18], P 42), we conclude that and agree on σ(S)=B(S). This is enough to conclude the sufficiency.
It is clear from Theorem 1 that Equation (7) needs to be considered only on a special π-system $\mathcal{S}\subseteq B(S)$, which implies that we only need to consider
It can be seen from Theorem 1 that $H_0$ is rejected if Equation (10) is violated. However, if Equation (10) holds, then $H_0$ is accepted only when B(S) can be generated by $\mathcal{S}$. Note that under $H_0$ a straightforward estimator of ω is
Then,
Using the above in (10), we derive our test statistic as
$$T = \sqrt{\frac{N_1(S)N_2(S)}{N_1(S)+N_2(S)}}\,\sup_{A\in\mathcal{S}}\left|\frac{N_1(A)}{N_1(S)}-\frac{N_2(A)}{N_2(S)}\right|,$$
where $\mathcal{S}$ is a π-system in B(S), and $H_0$ is rejected if T is large. The p-value of T can be derived from the distribution of the maximum of an F-functional pinned Brownian sheet, where the function F can be defined using the following theorem. The conditions in the asymptotics considered in the theorem can be interpreted as follows: the expected numbers of points in the two processes, i.e. $\mu_1(S)$ and $\mu_2(S)$, approach infinity, but the proportions of expected points in subregions of S do not vary.
Theorem 2 Let the π-system
be generated by a measurable function G from S to . Denote T as if the π-system in T is given by . Define and . Assume for any the two functions and do not vary as . If there exists a positive ω such that
(which is also ), then as
.
Proof: Note that the spatial Poisson processes and can be interpreted as points being independently observed from S with CDFs and , respectively. Then, and . Denote ,
and A as the complementary set of A. For any , define
and
Then, Z is an eight-dimensional independent Poisson random vector with the mean vector given by ν. According to the central limit theorem for independent Poisson random variables, we have
respectively. Then,
and
If and are proportional, then and
Using the Delta Method with the expressions of the gradients of and given above, we have
as . Since the above holds for any pair of , under the covariance function of is the same as the covariance function of , which implies the conclusion of the theorem by the theory of empirical distributions that was already used in the proof of Proposition 1.
According to Theorem 2, the p-value of T can be approximated using the distribution of the absolute maximum of the F-functional pinned Brownian sheet. Since there is no closed-form formula for such a distribution, a Monte Carlo method is used. This issue will be discussed in our simulation study in Section 4.
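As an illustration of how such a statistic could be evaluated on observed case and control patterns, the sketch below computes the normalized maximum absolute difference between the proportions of case and control points falling in the sets of a π-system generated by a map G. Several ingredients are assumptions of this sketch rather than statements from the article: the π-system is taken to consist of the sets {s : G(s) ≤ x componentwise} for x in the plane, the normalization is the usual two-sample factor with the observed point counts in place of sample sizes, and the names `proportions_on_pi_system` and `proportionality_statistic` are hypothetical. The default G is the identity, which corresponds to lower-left quadrants.

```python
import numpy as np

def proportions_on_pi_system(u, gx, gy):
    """Proportion of rows of `u` with u[:,0] <= gx[i] and u[:,1] <= gy[j]."""
    below_x = (u[:, [0]] <= gx[None, :]).astype(float)  # shape (n, len(gx))
    below_y = (u[:, [1]] <= gy[None, :]).astype(float)  # shape (n, len(gy))
    # Entry (i, j) counts the points that satisfy both inequalities.
    return below_x.T @ below_y / u.shape[0]

def proportionality_statistic(case, control, G=lambda s: s):
    """Sup-type statistic comparing point proportions over {s : G(s) <= x}.

    case, control : arrays of shape (n, 2) with point locations
    G             : vectorized map from locations to the plane (assumed form)
    """
    u1 = np.asarray(G(np.asarray(case, dtype=float)), dtype=float)
    u2 = np.asarray(G(np.asarray(control, dtype=float)), dtype=float)
    n1, n2 = len(u1), len(u2)
    pooled = np.vstack([u1, u2])
    # The proportions only change at pooled coordinates, so this grid suffices.
    gx = np.unique(pooled[:, 0])
    gy = np.unique(pooled[:, 1])
    diff = (proportions_on_pi_system(u1, gx, gy)
            - proportions_on_pi_system(u2, gx, gy))
    return float(np.sqrt(n1 * n2 / (n1 + n2)) * np.abs(diff).max())
```

A call such as `proportionality_statistic(case_xy, control_xy)` returns a single number to be compared with Monte Carlo critical values. Because only the ordering of the transformed coordinates matters, any G that is increasing in each coordinate separately gives the same value as the identity.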
Practical guidelines
To calculate the test statistic T, it is important to choose the pre-selected function G from S to to determine the π-system . Generally, G is defined via a continuous bivariate function. Assume that proportionality holds and denote . The basic idea can be derived by considering the special case in which S is a rectangular region given by S=[0,a]x[0,b] for a,b>0. In this case, a natural choice of G is
for . The corresponding F-functional pinned Brownian sheet on is derived if we choose F as
which can be used to compute the p-value of using the distribution of . For an arbitrary region S in , we can define
where is a pre-selected point in S. The corresponding F-functional pinned Brownian sheet on is derived if we choose F as
Since the distribution of with in (15) or in (16) depends on F, it is generally impossible to provide a general numerical table for or , which implies that the critical values of the test should be provided by a Monte Carlo method in every application. In order to avoid this difficulty, we consider a simplified choice of G as
It can be seen that such an F makes the limiting process the standard Brownian bridge on [0,1]. The series expansion of the distribution of the absolute maximum of the standard Brownian bridge on [0,1] is well known and available in many textbooks [14]. According to this expansion, we can approximately compute the p-value of an observed value $t$ of the test statistic by
$$p \approx 2\sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^2t^2}. \qquad (19)$$
If (19) is used, the critical values at the 10%, 5%, and 1% levels are approximately equal to 1.2239, 1.3581, and 1.6277, respectively (Figure 1).
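The p-value approximation based on the standard Brownian bridge can be checked numerically. The short Python sketch below evaluates the classical alternating series for the tail probability of the absolute maximum of a standard Brownian bridge, 2 Σ (-1)^(k-1) exp(-2 k² t²), truncated once the terms become negligible. The function name `brownian_bridge_pvalue` and the truncation parameters are assumptions of this illustration.

```python
import math

def brownian_bridge_pvalue(t, tol=1e-12, max_terms=10_000):
    """Tail probability P(sup |B| > t) for a standard Brownian bridge B."""
    if t <= 0:
        return 1.0
    total = 0.0
    for k in range(1, max_terms + 1):
        term = math.exp(-2.0 * (k * t) ** 2)
        total += term if k % 2 == 1 else -term  # alternating signs
        if term < tol:  # remaining terms are smaller still
            break
    # Clip to [0, 1] to guard against truncation error for very small t.
    return min(1.0, max(0.0, 2.0 * total))
```

Evaluating the function at 1.2239, 1.3581, and 1.6277 gives values close to 0.10, 0.05, and 0.01, in agreement with the critical values quoted above.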