06/18/2024: 1:05 PM - 2:45 PM BST
Invited Paper Session
Room: Annie Lennox Building W110
Official statistics are increasingly the result of production processes in which different data sources (traditional survey data and non-traditional data such as administrative data and big data) are combined to ensure that "high quality" statistical information is provided to end users. Defining and measuring the quality of multi-source statistical processes and their products is a new challenge for National Statistical Offices, due to elements such as the nature of the integrated sources, the integration approach, and the specific statistical purposes. The aim of this session is to discuss how quality can be evaluated when statistics are produced jointly from different data sources.
Abstract 1 Automatic detection of misclassifications in registered NACE codes using textual data; Naomi Schalken, Arnout van Delden, Sander Scholtus, Dick Windmeijer, CBS Netherlands.
Economic activity codes of legal units (NACE codes) are centrally stored in a statistical business register (SBR). Unfortunately, a considerable portion of them are misclassified, leading to biased business statistics. We developed a method to detect which units are misclassified. It consists of two parts: a logistic model and a text mining model. The logistic model estimates the probability that a unit is misclassified depending on certain background variables. The text mining model uses scraped texts of dedicated business websites and additional register data. These elements are combined in a mixture model estimating the misclassification probability. In the presentation we will briefly explain the model and give results of a test with 25 NACE codes and of an evaluation of how the method works in practice.
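Schematically, one plausible way to combine the two parts (the abstract does not give the exact specification, so the notation and the naive Bayes-style odds update below are our assumptions, not the authors' published model) is as follows. Let $\pi_i$ be the logistic model's misclassification probability for unit $i$, $r_i$ its registered code, and $q_i(r_i)$ the text model's probability that $r_i$ is the true code; then

    \[
      P(M_i \mid \mathbf{x}_i, t_i)
        = \frac{\pi_i \,\bigl(1 - q_i(r_i)\bigr)}
               {\pi_i \,\bigl(1 - q_i(r_i)\bigr) + (1 - \pi_i)\, q_i(r_i)},
    \]

where $M_i$ denotes the event that unit $i$'s registered NACE code is wrong, $\mathbf{x}_i$ are the background variables, and $t_i$ is the scraped website text.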
Abstract 2 Process and output quality evaluation measures for Istat Integrated System of Statistical Registers; Marco Di Zio, Stefano Falorsi, Fabiana Rocci, Giorgia Simeoni, Istat
In Istat, the creation of the Integrated System of Statistical Registers (ISSR) is the main pillar of the overall re-engineering of statistical production. It comprises several statistical registers aimed at covering the various statistical domains and is fed by both administrative and survey data.
Istat has developed a framework for documenting, monitoring and assessing the quality of the ISSR and its potential outputs, organised according to the GSBPM sub-processes. After several tests, the final framework was released; the next step involves its gradual application and implementation. Istat is also developing a new measure of the estimation error, the Global Mean Squared Error (GMSE), which accounts for two types of uncertainty: sampling and modelling uncertainty. The paper will show the main elements of the framework, with particular focus on the definition and implementation of the quality measures.
Abstract 3 Methodologies for integrating different sources in Official Statistics; Pedro Campos, Statistics Portugal
For several years we have been testing the use of alternative sources to improve the quality of official statistics. We use Small Area Estimation (SAE) techniques where the sample size does not support reliable results. Model-based SAE estimates are obtained by fitting a model to the data, using covariates as auxiliary information. In SAE, mixed models can be used to combine different sources of information and to explain different sources of error.
There are several examples where this methodology has been tested, such as the Labour Force Survey (in the social domain) and the production of Land use/Land Cover (LCLU) statistics (in the business domain). SAE for LCLU requires new auxiliary sources, namely the Annual Crop Statistics at NUTS 3 level and the Farm Structure Survey. This auxiliary information comprises several sources already available at NUTS 3 level (administrative data, national surveys, etc.). The procedure usually works well even with small samples because estimation is based on the regression relationships between the variables underlying the model.
Organizer
Arnout van Delden, Statistics Netherlands
Chair
Orietta Luzi, ISTAT
Discussant
Wesley Yung, Statistics Canada
Presentations
Economic activity codes of legal units (NACE codes) are centrally stored in the statistical business register (SBR). They form the basis of the NACE codes of the statistical units derived from them. Unfortunately, some of these codes are incorrect, due to registration errors, administrative delays, and entrepreneurs who do not report changes in their activities to the chamber of commerce, which in turn delivers these codes to Statistics Netherlands. A considerable portion of the small and medium-sized enterprises in the SBR have wrong NACE codes, leading to biased enterprise statistics. We have developed a method to detect which legal units have a misclassified NACE code. It consists of two parts: a logistic model and a text mining model. The logistic model estimates the probability that a legal unit is misclassified depending on certain background variables. The text mining model uses scraped texts of dedicated business websites and additional register data; it estimates the probability for a range of NACE codes. These elements are combined in a mixture model estimating the misclassification probability.
We first tested this model on a set of 25 five-digit NACE codes covering approximately 45 thousand legal units. We created a set that was nearly free of errors, purposely introduced different levels and kinds of misclassification, and evaluated the model's performance. Next, we tested how the model performed in a real situation: misclassifications in NACE section R. For a large set of legal units, the NACE codes in section R were manually evaluated, and with the model we predicted, using scraped texts and additional register data, which of the legal units we expected to be misclassified. In the presentation we will briefly explain the model and give the results of both tests.
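As an illustrative sketch of the two-part approach (the combination rule, the feature and variable names, and the modelling choices below are our assumptions, not the authors' published specification), the pipeline could look roughly as follows in Python with scikit-learn:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Part 1: logistic model -- prior probability of misclassification
    # from background variables (hypothetical features; the actual ones
    # used by CBS are not given in the abstract).
    def fit_background_model(X_bg, y_mis):
        model = LogisticRegression(max_iter=1000)
        model.fit(X_bg, y_mis)   # y_mis: 1 if the audited code was wrong
        return model             # predict_proba(X)[:, 1] -> prior P(mis)

    # Part 2: text model -- probability of each NACE code given the
    # scraped website text of the legal unit.
    def fit_text_model(texts, y_code):
        model = make_pipeline(TfidfVectorizer(min_df=5),
                              LogisticRegression(max_iter=1000))
        model.fit(texts, y_code)  # y_code: audited (true) NACE codes
        return model              # predict_proba -> P(code c | text)

    # Combination: a naive Bayes-style odds update, one plausible form
    # of the mixture; the authors' actual mixture model may differ.
    def misclassification_prob(prior_mis, q_registered):
        # q_registered: text model's probability for the registered code.
        eps = 1e-9
        odds = (prior_mis / (1.0 - prior_mis + eps)) * \
               ((1.0 - q_registered) / (q_registered + eps))
        return odds / (1.0 + odds)

Units whose combined probability exceeds a chosen threshold would then be flagged for manual review of their registered code.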
Speaker
Arnout van Delden, Statistics Netherlands
CoAuthor
Naomi Schalken, Statistics Netherlands (CBS)
In Istat, the creation of the Integrated System of Statistical Registers (ISSR) is the main pillar of the overall re-engineering of statistical production. The ISSR comprises several statistical registers aimed at covering the various statistical domains. It is fed by both administrative and survey data and, in the long term, will be the main source of the statistics produced by Istat.
To meet the need for a system for documenting, monitoring and assessing the quality of the ISSR and its potential outputs, a quality framework has been developed. It is organised according to the sub-processes of the Generic Statistical Business Process Model (GSBPM 5.1) identified as most relevant for the processes of the statistical registers. For each sub-process, a set of metadata elements to be documented is defined according to the Generic Statistical Information Model (GSIM 1.2). For each sub-process, a set of quality indicators is also proposed for monitoring and assessment purposes.
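A minimal sketch of how such a framework could be represented in code (the GSBPM 5.1 sub-process name is real, but the field names and the example indicator are hypothetical, not Istat's actual metadata model):

    from dataclasses import dataclass, field

    @dataclass
    class QualityIndicator:
        name: str
        definition: str
        unit: str                 # e.g. "rate" or "count"

    @dataclass
    class SubProcessEntry:
        gsbpm_id: str             # GSBPM 5.1 sub-process id, e.g. "5.1"
        gsbpm_name: str
        metadata_elements: list[str] = field(default_factory=list)  # GSIM 1.2 objects
        indicators: list[QualityIndicator] = field(default_factory=list)

    # Example entry: a linkage-quality indicator attached to "Integrate data".
    integrate = SubProcessEntry(
        gsbpm_id="5.1",
        gsbpm_name="Integrate data",
        metadata_elements=["Input data set", "Process step", "Process output"],
        indicators=[QualityIndicator(
            name="linkage_rate",
            definition="Share of administrative records linked to the register",
            unit="rate")],
    )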
After several tests, the final framework was released. The next step involves its gradual application to the different statistical registers and its implementation in the Istat monitoring and metadata systems.
Methods for evaluating the quality of estimates produced from the statistical registers are also being assessed. These data are subject to different sources of uncertainty, and Istat is conducting a project aimed at assessing the empirical and theoretical properties of a new global measure of the estimation error: the Global Mean Squared Error (GMSE), which accounts for two types of uncertainty, sampling and modelling uncertainty.
The paper will present the main theoretical elements of the quality framework, with a particular focus on the output GMSE indicator.
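Schematically (our notation, ignoring cross-terms; the formal definition in the paper may differ), the global measure combines both error sources in a single mean squared error for an estimate $\hat\theta$ of a target $\theta$:

    \[
      \mathrm{GMSE}(\hat\theta)
        = \mathbb{E}\bigl[(\hat\theta - \theta)^2\bigr]
        = \underbrace{V_{\mathrm{samp}}(\hat\theta)}_{\text{sampling uncertainty}}
        + \underbrace{V_{\mathrm{mod}}(\hat\theta)}_{\text{modelling uncertainty}}
        + \mathrm{Bias}^2(\hat\theta),
    \]

with the expectation taken jointly over the sampling design and the model generating the register data.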
Speaker
Diego Chianella, ISTAT
CoAuthor(s)
Nina Deliu, La Sapienza
Marco Di Zio, ISTAT
Stefano Falorsi, ISTAT
Piero Falorsi, La Sapienza
Fabiana Rocci, Istat
Giorgia Simeoni, Istat
For several years we have been testing the use of alternative sources to improve the quality of official statistics. In this work we use Small Area Estimation (SAE) techniques as an alternative to design-based approaches for domains where the sample size does not support reliable results. Model-based SAE estimates are obtained by fitting a model (often a regression model) to the data, using covariates as auxiliary information. In SAE, mixed models can be used to combine different sources of information and to explain different sources of error.
There are several examples where this methodology has been tested, such as the Labour Force Survey (in the social domain) and the production of Land use/Land Cover (LCLU) statistics (in the business domain). LCLU statistics based on SAE require new auxiliary sources, namely the Annual Crop Statistics (ACS) at NUTS 3 level (available for internal use) and the Farm Structure Survey (FSS). This auxiliary information comprises several sources already available at NUTS 3 level (administrative data, national surveys, etc.). The procedure usually works well even with small samples because estimation is based on the regression relationships between the variables underlying the model.
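As a minimal illustration of the area-level mixed-model idea (a Fay-Herriot-type EBLUP; the abstract does not fix the exact model, so the moment estimator and variable names below are our assumptions, not Statistics Portugal's production implementation):

    import numpy as np

    def fay_herriot_eblup(theta_direct, X, var_direct):
        """Area-level EBLUP: shrink each direct survey estimate toward a
        regression synthetic estimate, with more shrinkage where the direct
        estimate is noisier (larger design variance).
        theta_direct: direct estimates per area, shape (D,)
        X: auxiliary covariates per area, shape (D, p) -- e.g. crop
           statistics and FSS variables at NUTS 3 level
        var_direct: known design variances of the direct estimates, (D,)"""
        D, p = X.shape
        # Step 1: OLS residuals give a simple moment estimate of the
        # between-area model variance (truncated at zero).
        beta_ols, *_ = np.linalg.lstsq(X, theta_direct, rcond=None)
        resid = theta_direct - X @ beta_ols
        sigma2_u = max(np.sum(resid**2) / (D - p) - np.mean(var_direct), 0.0)
        # Step 2: GLS fit of the regression coefficients.
        w = 1.0 / (sigma2_u + var_direct)
        beta = np.linalg.solve((X.T * w) @ X, (X.T * w) @ theta_direct)
        # Step 3: shrinkage weights gamma_d and the EBLUP itself.
        gamma = sigma2_u / (sigma2_u + var_direct)
        return gamma * theta_direct + (1.0 - gamma) * (X @ beta)

Each NUTS 3 estimate is thus a weighted compromise: areas with precise direct estimates keep them largely unchanged, while areas with small samples lean on the regression on the auxiliary sources.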
Speaker
Pedro Campos, Statistics Portugal (INE)