Exploiting the Web Presence of Enterprises to Improve NACE Code Classification

Conference: ICES VII
06/18/2024: 4:00 PM - 4:20 PM BST
Topic Contributed Session 

Description

NACE is the European standard hierarchical classification used to classify enterprises according to their economic activity, and as such it forms the foundation of various business statistics and indicators. It is therefore imperative to minimize NACE code misclassifications in order to avoid biased statistical outputs. National statistical institutes consequently classify and edit NACE codes carefully and continuously, which consumes a significant amount of time. To assist and expedite the manual editing and classification processes, we propose to exploit the increasing web presence of enterprises and predict their NACE codes from their scraped webpages. In this paper we present the current state of 1) an automated flat classification procedure that predicts the economic activity at a fixed NACE level (one of levels 2-4) and 2) an automated hierarchical classification procedure that predicts the economic activity across all NACE levels 1-5. Whether a proposed classification model can support the manual editing processes depends on its quality. It is therefore essential to use evaluation measures that take the structure of the classification models into account. While there is a general consensus on the quality measures used to assess flat classification models, no such consensus exists for hierarchical classification models. Hence, in this paper we also present evaluation measures, including a novel customized performance measure, that are better suited to assessing the quality of hierarchical models than the standard evaluation metrics.
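To illustrate what a hierarchy-aware evaluation measure can look like, the following minimal sketch computes hierarchical precision, recall and F1, a measure commonly discussed in the hierarchical classification literature. It is not the novel customized measure mentioned above, which the abstract does not specify. The function names `ancestors` and `hierarchical_prf` are hypothetical, the example codes are made up for illustration, and the sketch assumes codes written as `DD.GC` (division, group, class), ignoring the section letter at level 1 and any national level 5.

```python
def ancestors(code):
    """Return a code together with its numeric ancestors.

    Assumption: codes look like '62', '62.0' or '62.01' (division, group,
    class). The section letter (level 1) is ignored in this sketch.
    """
    digits = code.replace(".", "")
    result = set()
    for n in range(2, len(digits) + 1):
        prefix = digits[:n]
        # Re-insert the dot after the two-digit division part.
        result.add(prefix if n == 2 else prefix[:2] + "." + prefix[2:])
    return result


def hierarchical_prf(true_codes, pred_codes):
    """Hierarchical precision, recall and F1 over paired true/predicted codes.

    Each code is expanded to its ancestor set, so a prediction in the right
    division but the wrong class still earns partial credit.
    """
    overlap = pred_total = true_total = 0
    for true_code, pred_code in zip(true_codes, pred_codes):
        true_set = ancestors(true_code)
        pred_set = ancestors(pred_code)
        overlap += len(true_set & pred_set)
        pred_total += len(pred_set)
        true_total += len(true_set)
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / true_total if true_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative codes only: the first prediction misses the class but hits the
# group, the second misses already at the division level.
true_codes = ["62.01", "47.11"]
pred_codes = ["62.02", "46.11"]
hp, hr, hf = hierarchical_prf(true_codes, pred_codes)
print(hp, hr, hf)
```

Under flat accuracy at class level both predictions count as equally wrong, whereas the hierarchical variant distinguishes the near miss from the miss at division level.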

Keywords

Economic Activity

Statistical Classification

Web Scraping

Machine Learning 

Speaker

Johannes Gussenbauer, Statistics Austria