Tuesday, Aug 5: 2:00 PM - 3:50 PM
0785
Topic-Contributed Paper Session
Music City Center
Room: CC-202C
Applied
Yes
Main Sponsor
International Statistical Institute
Co Sponsors
Government Statistics Section
Survey Research Methods Section
Presentations
The popular Fay-Herriot small area estimation model relates area-level direct estimates to a set of area-level auxiliary covariates. Modifications, such as spatial Fay-Herriot models, allow for spatial correlation, which is captured by introducing a spatial random effect term or by imposing a spatial correlation structure on the regression errors. The spatial variation assumed in these cases is based on a traditional nearest-neighbor approach. However, such methods may not fully capture the complexity of spatial dependencies. We propose an extension of the spatial Fay-Herriot model that introduces a varying coefficient structure, allowing the regression coefficients to vary systematically and smoothly for some of the area-level covariates. Instead of relying on geographical proximity to define network edges, we leverage node covariates in a latent socio-demographic space to infer the dependency network among auxiliary covariates. Unlike traditional approaches that impose predefined neighborhood structures, our model learns neighborhoods directly from the data and averages over all possible neighborhood structures. We showcase the flexibility and advantages of the proposed approach through simulation results and applications to county-level American Community Survey data on median household income in the U.S. states of North Carolina and South Carolina.
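For orientation, the standard area-level Fay-Herriot model that the abstract above builds on can be written as follows. The first three lines are the textbook formulation; the last line is only one plausible way to write a varying-coefficient version, and the area-specific coefficient \beta_i is an assumption made here for illustration, not the authors' stated specification.

```latex
% Textbook Fay-Herriot model for areas i = 1, ..., m
\begin{align*}
\hat{\theta}_i &= \theta_i + e_i, & e_i &\sim N(0, D_i), \quad D_i \text{ treated as known},\\
\theta_i &= x_i^{\top}\beta + u_i, & u_i &\sim N(0, \sigma_u^2),\\
\tilde{\theta}_i &= \gamma_i \hat{\theta}_i + (1-\gamma_i)\, x_i^{\top}\beta, & \gamma_i &= \frac{\sigma_u^2}{\sigma_u^2 + D_i}.
\end{align*}
% Illustrative varying-coefficient variant (assumed notation):
% the coefficient vector is allowed to change across areas.
\begin{equation*}
\theta_i = x_i^{\top}\beta_i + u_i .
\end{equation*}
```

Here \hat{\theta}_i is the direct estimate, D_i its sampling variance, and \tilde{\theta}_i the best predictor under known \beta and \sigma_u^2, which shrinks the direct estimate toward the regression prediction.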
In small-area estimation (SAE) models for sample survey data, a typical approach is to use auxiliary information to enhance the precision of estimates obtained from direct total or mean estimators. Direct estimates and auxiliary information are integrated through regression models that can include both fixed and random effects, thus allowing for the development of mixed-effects models with potentially spatially and temporally structured components (Morales et al., 2021). These models typically assume that the regression coefficients remain constant over time or space. However, as observed by Wang et al. (2023), this strategy may prove inadequate because the relationships between variables may vary in space, giving rise to spatial heterogeneity (Zhu & Turner, 2022). The objective of this paper is to present an innovative approach that simultaneously addresses heterogeneity and spatial dependence in SAE models. This strategy aims to enhance the predictive capabilities of SAE models and provide more accurate estimates of the sample variables of interest. The proposed methodology integrates three existing approaches: (1) the spatially-clustered regression models of Sugasawa and Murakami (2021), in which regression coefficients can vary according to a spatial cluster structure determined endogenously through penalized likelihood; (2) the approach of Wang et al. (2023), in which location-specific weights are employed within spatially penalized least squares to estimate local regression coefficients and cluster membership; (3) the extension by Cerqueti et al. (2024) of the spatially-clustered linear regression model to encompass the leading spatial econometric models (e.g., SAR and Durbin models).
In particular, the proposal entails the estimation of linear mixed effects models belonging to the Fay-Herriot family with clusterwise spatially-varying coefficients, wherein areas are merged through a spatially-penalized likelihood. The proposed methodology is applied to data on Italian farms provided by the Farm Accountancy Data Network (FADN) survey of the European Union (Baldoni et al., 2017). The dataset consists of a sample of thousands of farms across the country, for which economic, production, technological, energy, and environmental impact information is collected annually. The application involves estimating the carbon footprint of farms in the Po Valley in recent years (Carillo et al., 2024), supported by auxiliary information from the 2020 national agricultural census.
References
Baldoni, E., Coderoni, S., & Esposti, R. (2017). The productivity and environment nexus with farm-level data. The Case of Carbon Footprint in Lombardy FADN farms. Bio-based and Applied Economics, 6(2), 119-137.
Carillo, F., Maranzano, P., Marcis, L., Pagliarella, M. C., & Salvatore, R. (2024). The spatio-temporal Fay-Herriot model using the state-space method: an application to Italian Lombard agrarian sub-regions. In Book of Short Papers - 2nd Italian Conference on Economic Statistics (ICES 2024) - Statistical Analysis of Complex Economic Data: Recent Developments and Applications (pp. 66-69). Casa Editrice Bonechi.
Cerqueti, R., Maranzano, P., & Mattera, R. (2024). Spatially-Clustered Spatial Autoregressive Models with Application to Agricultural Market Concentration in Europe. Journal of Agricultural, Biological and Environmental Statistics. https://doi.org/10.1007/s13253-024-00672-4
Morales, D., Esteban, M. D., Pérez, A., & Hobza, T. (2021). A course on small area estimation and mixed models. Methods, theory and applications in R.
Sugasawa, S., & Murakami, D. (2021). Spatially clustered regression. Spatial Statistics, 44, 100525. https://doi.org/10.1016/j.spasta.2021.100525
Wang, X., Zhu, Z., & Zhang, H. H. (2023). Spatial heterogeneity automatic detection and estimation. Computational Statistics & Data Analysis, 180, 107667. https://doi.org/10.1016/j.csda.2022.107667
Zhu, A. X., & Turner, M. (2022). How is the Third Law of Geography different? Annals of GIS, 28(1), 57-67. https://doi.org/10.1080/19475683.2022.2026467
We propose a unit-level modeling framework for dependent multi-type survey data. Our model combines highly correlated binomial and Gaussian responses through a pseudo-likelihood approach that incorporates varying unit weights and adapts to complex survey designs. By integrating different response types within a single model, the inherent correlations are leveraged to enhance predictive performance. For computational efficiency, Polya-Gamma data augmentation is used to introduce latent variables that facilitate Gibbs sampling during model estimation. This strategy simplifies computation while retaining the flexibility needed to capture the nuances of both response types. Comparisons with traditional univariate methods, based on empirical simulation studies and real data analysis, indicate promising improvements in prediction accuracy.
Keywords
Unit-level modeling
Highly dependent data
Multi-type model
Pseudo-likelihood approach
Polya-Gamma data augmentation
Latent variables
Gibbs sampling
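As background for the Polya-Gamma step mentioned in the abstract above, the standard augmentation identity for a binomial-type likelihood term is sketched below. The identity and the resulting conditionals are the well-known general results; the mapping a = w_i y_i, b = w_i n_i for a survey-weighted pseudo-likelihood term is an assumption made here for illustration, not necessarily the authors' construction, and the Gaussian prior N(b_0, B_0) on \beta is likewise assumed notation.

```latex
% Polya-Gamma identity, with \kappa = a - b/2 and \omega \sim \mathrm{PG}(b, 0)
\begin{equation*}
\frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}}
  = 2^{-b}\, e^{\kappa \psi} \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega)\, d\omega .
\end{equation*}
% A weighted binomial term keeps the same form (assumed weighting scheme):
\begin{equation*}
\left[ \frac{(e^{\psi_i})^{y_i}}{(1 + e^{\psi_i})^{n_i}} \right]^{w_i}
  = \frac{(e^{\psi_i})^{w_i y_i}}{(1 + e^{\psi_i})^{w_i n_i}},
  \qquad \psi_i = x_i^{\top}\beta .
\end{equation*}
% Resulting Gibbs conditionals under a N(b_0, B_0) prior on \beta:
\begin{align*}
\omega_i \mid \beta &\sim \mathrm{PG}(w_i n_i,\; x_i^{\top}\beta),\\
\beta \mid \omega, y &\sim N(m, V), \quad
V = (X^{\top}\Omega X + B_0^{-1})^{-1}, \quad
m = V\,(X^{\top}\kappa + B_0^{-1} b_0),
\end{align*}
% with \Omega = \mathrm{diag}(\omega_i) and \kappa_i = w_i (y_i - n_i/2).
```

Conditional on the latent \omega_i, the binomial part is Gaussian in \beta, which is what makes it convenient to combine with a Gaussian response inside one Gibbs sampler.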
Model validation and comparison are a challenge in Small Area Estimation. The primary gauge of a good small area model is the accuracy of its predictors. In many sub-fields where accuracy is the focus, a common practice is sample splitting: dividing the dataset into training and validation subsets. This is not possible in Small Area Estimation, since replicate surveys do not exist. However, we show that data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, allows small area models to be validated with relative ease, in much the same way as sample splitting. We present several example applications to validating area-level models.
Keywords
Cross validation
Small Area Estimation
Model Comparison
Data Thinning
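The splitting idea in the data thinning abstract above can be sketched for Gaussian direct estimates with known sampling variances, the usual area-level setting. This is a minimal illustration of the general Gaussian thinning recipe, not the presenters' code; the function name `thin_gaussian`, the split fraction `eps`, and the toy numbers are all hypothetical.

```python
import numpy as np

def thin_gaussian(y, var, eps, rng):
    """Split y ~ N(mu, var) into two independent parts that sum to y.

    The first part is N(eps * mu, eps * var) and the second is
    N((1 - eps) * mu, (1 - eps) * var). `var` must be the known variance.
    """
    y = np.asarray(y, dtype=float)
    scale = np.sqrt(eps * (1.0 - eps) * np.asarray(var, dtype=float))
    # Extra noise with variance eps * (1 - eps) * var makes the parts independent.
    noise = rng.normal(0.0, scale, size=y.shape)
    y_train = eps * y + noise
    y_test = y - y_train
    return y_train, y_test

# Toy direct estimates and sampling variances for five areas (made-up values).
rng = np.random.default_rng(42)
direct = np.array([51.2, 47.9, 60.3, 55.0, 49.1])
samp_var = np.array([4.0, 9.0, 2.5, 6.0, 3.0])
eps = 0.5

train_part, test_part = thin_gaussian(direct, samp_var, eps, rng)

# Rescale each part so it estimates the same area mean as the original.
train_est = train_part / eps           # variance samp_var / eps
test_est = test_part / (1.0 - eps)     # variance samp_var / (1 - eps)
print(train_est, test_est)
```

A model could then be fit to `train_est` (with the inflated variances) and scored against `test_est`, mimicking a training and validation split without a replicate survey.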
Poisson autoregressive count models have evolved into a time series staple for correlated count data. This paper proposes an alternative to Poisson autoregressions: echo state networks. Echo state networks can be analyzed in a frequentist manner by optimizing penalized likelihoods, or in a Bayesian manner via MCMC sampling and the conjugacy properties of multivariate log-Gamma priors. This paper develops Poisson echo state techniques for count data and applies them to a massive count data set containing the number of graduate students at 1,758 United States universities during the years 1972-2021 inclusive. Negative binomial models are also implemented to better handle overdispersion in the counts. The proposed models are compared by their forecasting performance, as judged by several methods. In the end, a hierarchical negative binomial echo state network is judged to be the superior model.
Co-Author
Qi Wang, University of California, Santa Cruz
Speaker
Qi Wang, University of California, Santa Cruz
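To make the echo state idea in the count-data abstract above concrete, the sketch below builds a fixed random reservoir and fits only a Poisson readout by maximizing a ridge-penalized log-likelihood. It is a generic illustration under stated assumptions, not the presenters' model: the function names, the log1p input transform, the reservoir size, and the simulated series are all hypothetical, and the hierarchical and negative binomial extensions are not shown.

```python
import numpy as np

def reservoir_states(y, n_hidden=20, spectral_radius=0.9, seed=0):
    """Run a fixed random reservoir over a count series (weights are never trained)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_hidden, n_hidden))
    # Rescale so the largest eigenvalue modulus equals spectral_radius (< 1).
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    U = rng.uniform(-0.5, 0.5, size=n_hidden)
    x = np.log1p(np.asarray(y, dtype=float))  # assumed input transform
    h = np.zeros(n_hidden)
    H = np.empty((len(x) - 1, n_hidden))
    for t in range(len(x) - 1):
        h = np.tanh(W @ h + U * x[t])
        H[t] = h  # state after seeing y[t], used to predict y[t + 1]
    return H

def penalized_loglik(beta, X, y, ridge):
    """Poisson log-likelihood (up to a constant) minus a ridge penalty on non-intercept terms."""
    eta = X @ beta
    return np.sum(y * eta - np.exp(eta)) - 0.5 * ridge * np.sum(beta[1:] ** 2)

def fit_poisson_readout(H, y_next, ridge=1.0, n_iter=25):
    """Fit the log-linear readout by Newton steps with step halving."""
    y_next = np.asarray(y_next, dtype=float)
    X = np.column_stack([np.ones(len(H)), H])
    penalty = ridge * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # leave the intercept unpenalized
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(np.mean(y_next) + 1e-8)
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y_next - mu) - penalty @ beta
        hess = X.T @ (X * mu[:, None]) + penalty
        step = np.linalg.solve(hess, grad)
        size = 1.0
        current = penalized_loglik(beta, X, y_next, ridge)
        # Halve the step until the penalized log-likelihood does not decrease.
        while size > 1e-8 and penalized_loglik(beta + size * step, X, y_next, ridge) < current:
            size /= 2.0
        if size <= 1e-8:
            break
        beta = beta + size * step
    return beta

# Simulated counts stand in for a real series (made-up data).
counts = np.random.default_rng(1).poisson(10, size=200)
H = reservoir_states(counts)
beta_hat = fit_poisson_readout(H, counts[1:])
one_step_means = np.exp(np.column_stack([np.ones(len(H)), H]) @ beta_hat)
```

Because the reservoir weights stay fixed, estimation reduces to a penalized Poisson regression on the reservoir states, which is why both penalized-likelihood and conjugate Bayesian treatments of the readout are feasible.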