Automatic detection of NACE misclassifications with multiple sources
Conference: ICES VII
06/18/2024: 1:05 PM - 1:30 PM BST
Invited Paper Session
Economic activity codes (NACE codes) of legal units are stored centrally in the statistical business register (SBR). They form the basis of the NACE codes of the statistical units derived from those legal units. Unfortunately, some of these codes are incorrect because of registration errors, administrative delays, and entrepreneurs who do not report changes in their activities to the chamber of commerce, which in turn delivers these codes to Statistics Netherlands. A considerable portion of the small and medium-sized enterprises in the SBR have wrong NACE codes, leading to biased enterprise statistics. We have developed a method to detect which legal units have a misclassified NACE code. It consists of two parts: a logistic model and a text mining model. The logistic model estimates the probability that a legal unit is misclassified, depending on certain background variables. The text mining model uses scraped texts from dedicated business websites and additional register data. It estimates the probability of each of a range of NACE codes. These two elements are combined in a mixture model that estimates the misclassification probability.
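To make the idea of combining the two parts concrete, the following is a minimal sketch and not the authors' actual mixture model, which the abstract does not specify. It assumes that the logistic model supplies a prior probability that the registered code is wrong, that a wrong code is a priori equally likely to be any of the other candidate codes, and that the text model's class probabilities can be treated as evidence proportional to a likelihood. The function name and all parameter names are hypothetical.

```python
from typing import Sequence


def misclassification_probability(
    prior_misclassified: float,
    text_probs: Sequence[float],
    registered_index: int,
) -> float:
    """Posterior probability that the registered code is wrong.

    Assumptions (illustrative only, not taken from the abstract):
    - prior_misclassified comes from a logistic model on background variables;
    - if the code is wrong, every other candidate code is equally likely a priori;
    - text_probs[i] is the text model's probability for candidate code i and is
      used as evidence proportional to a likelihood.
    """
    n_codes = len(text_probs)
    if n_codes < 2:
        raise ValueError("need at least two candidate codes")
    if not 0.0 <= prior_misclassified <= 1.0:
        raise ValueError("prior_misclassified must lie in [0, 1]")
    if not 0 <= registered_index < n_codes:
        raise ValueError("registered_index out of range")

    # Prior over the true code: mass (1 - prior) on the registered code,
    # the remaining mass spread evenly over the other candidates.
    weights = []
    for i, text_prob in enumerate(text_probs):
        if i == registered_index:
            prior = 1.0 - prior_misclassified
        else:
            prior = prior_misclassified / (n_codes - 1)
        weights.append(prior * text_prob)

    total = sum(weights)
    if total == 0.0:
        raise ValueError("prior and text evidence are incompatible")

    # Bayes' rule: posterior mass on the registered code, then its complement.
    return 1.0 - weights[registered_index] / total
```

Under these assumptions, a unit with a modest prior can still receive a high misclassification probability when the text evidence strongly favours another code, and a unit with a prior of zero is never flagged.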
We first tested this model on a set of approximately 45 thousand legal units covering 25 five-digit NACE codes. We created a set that was nearly free of errors, then deliberately introduced different levels and kinds of misclassifications and evaluated the model performance. Next, we tested how the model performed in a real situation: misclassifications in NACE section R. For a large set of legal units, the NACE codes in section R were manually evaluated, and with the model we predicted which of the legal units we expected to be misclassified, using scraped texts and additional register data. In the presentation we will briefly explain the model and give the results of both tests.
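The first test, introducing known errors into a nearly clean set and checking how many are recovered, can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the abstract does not say how errors were introduced or which performance measures were used. Here errors are injected uniformly at random at a chosen rate, and detection is scored with precision and recall. All function names, parameters, and code labels are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def inject_misclassifications(
    codes: Sequence[str],
    code_list: Sequence[str],
    rate: float,
    seed: int = 0,
) -> Tuple[List[str], List[bool]]:
    """Replace a random share of codes with a different code.

    Returns the perturbed codes and a flag per unit that is True when the
    code was changed. Uniform random replacement is an assumption made for
    this sketch only.
    """
    rng = random.Random(seed)
    noisy: List[str] = []
    changed: List[bool] = []
    for code in codes:
        if rng.random() < rate:
            alternatives = [c for c in code_list if c != code]
            if not alternatives:
                raise ValueError("code_list needs at least one other code")
            noisy.append(rng.choice(alternatives))
            changed.append(True)
        else:
            noisy.append(code)
            changed.append(False)
    return noisy, changed


def precision_recall(
    truly_wrong: Sequence[bool],
    flagged: Sequence[bool],
) -> Tuple[float, float]:
    """Precision and recall of flagged units against the injected errors."""
    tp = sum(1 for t, f in zip(truly_wrong, flagged) if t and f)
    fp = sum(1 for t, f in zip(truly_wrong, flagged) if not t and f)
    fn = sum(1 for t, f in zip(truly_wrong, flagged) if t and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Repeating the injection at several rates, and with different rules for choosing the replacement code, would give the kind of "different levels and kinds" comparison described above.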
NACE misclassifications
machine learning
statistical business register
editing
Speaker
Arnout Van Delden, Statistics Netherlands
CoAuthor
Naomi Schalken, Statistics Netherlands (CBS)