CS016 Using online data for business register enhancement

Conference: ICES VII
06/18/2024: 3:20 PM - 5:00 PM BST
Topic Contributed Session 
Room: Annie Lennox Building W010A 

Description

Having up-to-date and complete information in business registers is vital for accurate business statistics. These days most enterprises leave online data, which is a valuable source for completing and updating the enterprise characteristics stored in statistical business registers. Examples of such information are contact information (names, addresses), economic activity codes, and the structure of enterprises. The aim of this session is to present multiple ways to exploit the value of web data for business register improvement. It presents results obtained in the ESSnet Web Intelligence Network.
Abstract 1 In today's digital society, enterprises typically leave digital traces that may provide useful information to improve and/or extend statistical business registers. However, these innovative data sources are of a different nature than their traditional administrative counterparts and thus should be explored with new methods. This presentation gives an overview of the various activities performed in the ESSnet-WIN project on the subject of business register enhancements. In addition, one subject is covered in more detail: the experience of using third-party web-scraped data in the Netherlands.
Abstract 2 Traditional data sources in official statistics, such as surveys, are being complemented or replaced by new alternative sources such as scanner data, web-scraped data, and registers. These alternative sources often require linking to the more traditional sources, for which unique identifiers are essential. The absence of such identifiers can lead to issues that can be resolved through probabilistic record linkage methods. One practical application of these methods is the consolidation of enterprises from the Business Register and their respective webpage domain names obtained from the Finnish Transport and Communications Agency (Traficom). This presents many challenges, including finding the company that uses the domain, instead of the owner registered in the Traficom data.
Abstract 3 NACE is the European standard hierarchical classification of the economic activity of enterprises. It is imperative to mitigate NACE misclassifications to avoid biased statistical outputs. Hence, national statistical institutes carefully classify and edit NACE codes manually, which consumes significant time and resources. To reduce this manual effort, we propose to exploit enterprises' webpages to predict their NACE codes. We present an automated flat classification procedure to predict a fixed NACE level (levels 2-4) and a hierarchical procedure to predict all NACE levels 1-5. We evaluate the quality of the proposed models by taking the structure of the models into account. We present evaluation measures, including a novel customized performance measure, which are more suitable for assessing the quality of hierarchical models than the standard evaluation metrics.
Abstract 4 Like the previous contribution, we are interested in predicting the NACE codes of enterprises using enterprise website texts and some additional data. We aim to use automatic NACE prediction to reduce manual editing effort and to support the coming NACE revision. The choice of features can play an important role in obtaining good prediction performance. In official statistics it is important that these features are not only correlated but also content-related with the targeted classification, because this leads to more explainable and reliable predictions over time. In this presentation we compare the performance of a large set of different ways to derive such content-related (knowledge-based) features. We apply them to NACE section R and related NACE codes (levels 4 and 5).

Organizer

Arnout Van Delden, Statistics Netherlands

Chair

Gary Brown, Office for National Statistics

Discussant

Markus Sova, Office for National Statistics (ONS)

Presentations

Business register enhancements from online data: an overview

This presentation provides an overview of efforts to enhance business registers using online data within the European WIN (Web Intelligence Network) project. The WIN project aims to explore and, where feasible, integrate new online data sources into national statistical production systems. It encompasses various use cases, including those focusing on online data sources such as job advertisements, business characteristics, real estate data, construction activities, household appliance prices, and tourism. This session specifically focuses on the use of online data for business register enhancements. Statistical offices from Austria, Finland, Hessen (Germany), the Netherlands (lead), and Sweden are collaborating closely to experiment with techniques for validating and improving business register information. This presentation outlines the common approach and introduces the other presentations that will delve into specific topics.

National Statistical Institutes (NSIs) worldwide maintain statistical business registers (SBRs), which are comprehensive databases of information on statistical units, such as enterprises, within their respective jurisdictions. These registers contain detailed information about each enterprise, including its size, location, economic activity, and administrative details. They also capture the relationship with the legal unit and other important relationships among enterprises. The SBRs serve as essential sampling frames for statistical surveys and are indispensable for producing official economic statistics. The WIN use case aims to leverage online data to enhance SBRs by incorporating better, more detailed, or new information that may be difficult or impractical to acquire through traditional methods.

Enterprises leave a trail of digital footprints across the internet, providing information that can enrich and enhance traditional business registers. These digital traces, such as websites, media advertisements, product listings, customer care interactions, and job postings, may offer valuable insights into enterprise operations and characteristics. Moreover, digital traces need not be created by the enterprise itself; descriptive information maintained in community media such as Wikipedia, or in articles about the statistical unit, may also be valuable. The overall challenge for all traces is to find them and to interpret them as well as possible in a statistical context, keeping in mind that these data sources are of a different nature than their traditional administrative counterparts and thus should be explored with new methods.

Although all digital traces are potentially interesting and could be explored further, for practical reasons we initially focus on one obvious starting point: the website(s) of the statistical unit itself. Unfortunately, the URL of this website is not always known in the SBR. If URLs are missing or not reliable enough, they must be found in the so-called URL finding phase (A). In the next phase (B), statistical variables can be derived or improved from the data found. Online data can be used in both phases. We explain the two phases in more detail below.

A. URL finding phase

In the URL finding phase (A), multiple online data sources or services can be used to locate URLs for statistical units or to verify known URLs. One way is to use search engines to find or verify URLs. This can be done by executing a search query containing the name of the unit, optionally supplemented with contact information such as an address or a chamber of commerce (COC) or tax identification number. Executing multiple search queries per statistical unit with different compositions, possibly on multiple search engines, can improve the results, but this should be weighed against the costs and resources involved. Search engines should be used with caution, as identifying information contained in the query could end up in the search provider's server logs (search engine leakage). This risk can be reduced by carefully designing the queries, spreading them across different search engines, or entering into a non-disclosure agreement with a search provider. Hence search engine leakage is a serious but manageable concern.

To automate the search process, it is recommended to use a (paid) search application programming interface (API) rather than interpreting human-readable search result pages, although both approaches are possible. In any case, search results must be interpreted to select the best match. This can be done using the snippet (a short textual extract of the results page) or by scraping the URLs returned by the search engine. However, the latter approach involves an extra step, which can increase costs and slow down the process. In countries where enterprises are required to include identification numbers on their websites, these can be used for direct and exact linkage to the business register. In the absence of such legislation, machine learning techniques have been shown to be useful for selecting relevant search results. Based on a labelled training set of valid and invalid search hits, a model can be trained to capture the behaviour of a particular search engine. The set of legal units in the business register with known URLs can serve as a training set. Since search engines evolve over time, the model must be retrained periodically.
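As an illustration of this result-selection step, the sketch below scores hypothetical search hits against a unit name from the register and keeps the best match only if it clears a similarity threshold. The field names and example data are invented; in practice a trained model would replace the simple string-similarity score.

```python
from difflib import SequenceMatcher

def best_hit(unit_name, hits, threshold=0.6):
    """Pick the search hit whose title best matches the register name.

    `hits` mimic what a paid search API might return; the field names
    ('url', 'title', 'snippet') are illustrative, not a real API schema."""
    def score(hit):
        return SequenceMatcher(None, unit_name.lower(), hit["title"].lower()).ratio()
    best = max(hits, key=score)
    return best if score(best) >= threshold else None

hits = [
    {"url": "https://bakkerij-jansen.example", "title": "Bakkerij Jansen",
     "snippet": "Ambachtelijke bakkerij in Utrecht"},
    {"url": "https://directory.example/nl", "title": "Dutch business directory",
     "snippet": "Find any company in the Netherlands"},
]
print(best_hit("Bakkerij Jansen B.V.", hits)["url"])
```

A labelled set of valid and invalid hits would let a classifier learn a richer decision rule than this single threshold, for instance also using the snippet and the returned URL.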

Another source of URLs is web data collected by third parties. Reusing such data can save resources. However, a (paid) agreement must be made with the third party, and the dependence on the third party must be managed. This is only feasible if the added value of the data is considerable. One example of a third party that collects web data is the company DataProvider (DP). For over two years, DP has provided Statistics Netherlands with a monthly dataset of URLs of Dutch businesses and additional variables. This data has been linked to the business register using contact variables such as COC number, domain, email, zip code, and phone numbers. Note that the link between DP data and legal units can be many-to-many. A business may operate multiple websites depending on its business activities, or (usually smaller) businesses might use a business portal for advertising their activities, which hosts many enterprises. This can significantly complicate the linking process. Overall, we have concluded that using third-party data for URL finding is valuable, but it should be monitored over time, as the third party might change its scraping and processing methodologies based on its own business strategy. Hence a longitudinal analysis of the linking results is ongoing.
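A minimal sketch of such linkage on contact variables is shown below: a strong key (COC number) is tried before a weak fallback (zip code), and URLs that link to multiple legal units are flagged as potential many-to-many cases. All records here are invented for illustration.

```python
from collections import Counter

# hypothetical third-party web data records and SBR legal units;
# all values are invented for illustration
third_party = [
    {"url": "shop-a.example", "coc": "12345678", "zip": "1011AB"},
    {"url": "portal.example", "coc": None, "zip": "2500XY"},
    {"url": "portal.example", "coc": None, "zip": "3000ZZ"},
]
sbr = [
    {"legal_unit": "LU-1", "coc": "12345678", "zip": "1011AB"},
    {"legal_unit": "LU-2", "coc": "87654321", "zip": "2500XY"},
    {"legal_unit": "LU-3", "coc": "11112222", "zip": "3000ZZ"},
]

def link(third_party, sbr):
    """Try the strong key (COC number) first, then fall back to a weak key
    (zip code); return (url, legal_unit, matching_rule) triples."""
    pairs = []
    for t in third_party:
        for s in sbr:
            if t["coc"] is not None and t["coc"] == s["coc"]:
                pairs.append((t["url"], s["legal_unit"], "coc"))
            elif t["zip"] == s["zip"]:
                pairs.append((t["url"], s["legal_unit"], "zip"))
    return pairs

pairs = link(third_party, sbr)
# a URL linked to several legal units signals a many-to-many case,
# e.g. a business portal hosting many enterprises
counts = Counter(url for url, _, _ in pairs)
multi = [url for url, n in counts.items() if n > 1]
```

Recording which rule produced each link gives a crude quality indicator per pair: strong-key matches can be accepted, while weak-key and many-to-many matches warrant review.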

Another online source for URL finding is domain registry data: registers of domain names and IP addresses that exist in every country and for cross-country top-level domains such as .com. This data can be useful for deducing domain ownership, but the degree of openness of this data varies per country and domain. This session features a dedicated presentation from Statistics Finland that goes into more detail on this topic, targeted at the .fi domain. For all techniques in the URL finding phase, it is important to have a measure of the quality of the relation between the URL and the legal unit.

B. Deriving statistical variables

The second phase (B) is to derive statistical variables from the online data. This is typically done by scraping text from the URLs found, thereby establishing the relationship with the unit in the SBR. Deriving economic activity (NACE) is a common use case. This requires interpretation of raw texts and usually involves natural language processing (NLP) and machine learning techniques. Other statistical variables that have been derived from web data are degree of innovativeness, degree of sustainability, whether the enterprise operates a web shop, and whether it belongs to the platform economy. Yet another use case is to discover or verify administrative information of SBR units, such as email addresses and telephone numbers. This use case is explored in more depth in the WIN project by Statistics Hessen (Germany).
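To illustrate the kind of NLP pipeline involved, the sketch below trains a tiny multinomial naive Bayes classifier from scratch on invented website texts labelled with NACE sections. It is a toy under obvious assumptions; a production system would use richer features, proper tokenization, and far more training data.

```python
import math
from collections import Counter, defaultdict

# toy training set of (website text, NACE section) pairs;
# texts and labels are invented for illustration
train = [
    ("fresh bread cakes bakery pastries", "C"),
    ("bread rolls bakery ovens dough", "C"),
    ("haircut salon styling hairdresser", "S"),
    ("hair styling salon beauty", "S"),
]

def fit(train):
    """Count words per class for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in train:
        words = text.split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label

model = fit(train)
print(predict(model, "artisan bakery bread"))  # prints "C" on this toy data
```

The same bag-of-words framing underlies more sophisticated models; what mainly changes in practice is the feature representation and the amount of labelled data.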

Because enterprise websites can vary significantly in almost all aspects, scraping is typically done using a generic scraping approach, which, unlike specific scraping, does not require prior knowledge of the site's structure. Generic scraping typically begins at the home page and recursively visits deeper pages up to a certain maximum depth. Decisions must be made about whether to store the entire website, only the text, the derived data, or all of the above. It might be valuable to use a focused scraper. Such a scraper does not follow all links but prioritizes those that are expected to contain the most valuable information for the task at hand. For example, a focused scraper might prioritize the "about us" page to identify economic activity. Both approaches have their pros and cons: a generic scraper is simpler and thus requires less maintenance but results in larger data volumes, whereas a focused scraper can produce better results but may need more configuration.
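The crawling logic can be sketched as follows. The in-memory SITE dictionary stands in for real HTTP fetching and link extraction (which a production scraper would do), and the optional priority predicate shows how a generic crawl becomes a focused one; all page contents are invented.

```python
from collections import deque

# a tiny in-memory "website": page -> (text, linked pages); this replaces
# real HTTP fetching and link extraction for the sake of the sketch
SITE = {
    "/":         ("Welcome to ACME Tools", ["/about", "/products"]),
    "/about":    ("ACME manufactures hand tools since 1952", ["/contact"]),
    "/products": ("Hammers saws drills", ["/products/hammers"]),
    "/products/hammers": ("Our hammer range", []),
    "/contact":  ("Email info@acme.example", []),
}

def crawl(start, max_depth, priority=None):
    """Breadth-first crawl up to max_depth; an optional priority predicate
    turns the generic scraper into a focused one."""
    seen, texts = {start}, []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        text, links = SITE[page]
        texts.append(text)
        if depth < max_depth:
            for link in links:
                if link not in seen and (priority is None or priority(link)):
                    seen.add(link)
                    queue.append((link, depth + 1))
    return texts

generic = crawl("/", max_depth=2)
focused = crawl("/", max_depth=2, priority=lambda p: "about" in p)
```

On this toy site the generic crawl collects every page, while the focused crawl only follows the "about" link, trading coverage for volume exactly as described above.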

It is important to verify that the website visited matches the legal unit at hand, based on the presence of identifying information on the site, such as a chamber of commerce or tax identification number, if possible. In countries with national legislation that requires enterprises to include this information on their websites, this is straightforward. Within this project, Statistics Hessen and Statistics Austria make use of such legislation. As with the URL finding phase, special attention should be paid to many-to-many relationships between legal units and websites.

The use case of NACE detection from website texts is an important application. This session includes two more detailed presentations on deriving or verifying NACE codes of business register units using machine learning techniques. Statistics Austria discusses predicting NACE codes on multiple hierarchical NACE levels, and Statistics Netherlands presents a machine learning approach to predicting NACE codes, with special attention to feature selection and performance comparison for one dedicated NACE category.

Wrap-up

This paper presents a high-level overview of enhancing business registers using online data. There are two phases: URL finding and statistical variable derivation. URL finding is the process of discovering URLs of legal units that are not registered in the business register. Search engines can be used effectively for this task. Machine learning techniques can be applied to the textual paragraphs of search results to select valid hits. Alternatively, the discovered URLs can be scraped. Another way to find URLs is to link third-party web-scraped data to the SBR. Domain registry data can also be a valuable input.

Once URLs for legal units are known or found through URL finding, statistical variables, such as the NACE code, can be derived. This involves scraping the websites and interpreting the results using natural language processing (NLP) and machine learning techniques. Other web data sources, such as news articles and job postings, can also be used. The best approach to improving a business register using web data is likely a combination of web data sources and techniques tailored to the specific country's business register and national web.
 

Speaker

Arnout Van Delden, Statistics Netherlands

CoAuthor

Olav ten Bosch, Statistics Netherlands

Enhancing Business Register: Integrating Traditional and Alternative Data Sources through Record Linkage Methods

Traditional data sources in official statistics production, such as surveys, are being complemented or, in some cases, replaced by new alternative data sources. These alternative sources include scanner data, web-scraped data, and new registers. The use of such alternative data sources often requires linking the data to other more traditional sources. To link observations of the same unit from different data sources, unique identifiers for the given unit are essential. The absence of such identifiers can lead to issues that can be resolved through probabilistic record linkage methods.

Statistics Finland has recently faced the need to apply these methods in numerous cases, such as combining house sales advertisement data with transfer tax data and the Energy Certificate Register with the Building Register. Drawing from our experiences in these cases, we have expanded the use of these methods to new applications.

One practical application for record linkage methods is the linking of enterprises from the Business Register and their respective webpage domain names obtained from the Finnish Transport and Communications Agency (Traficom). This presents many challenges, including that these domain names are often associated with parent companies in the Traficom data, and the challenge lies in accurately pairing the appropriate domain name with the correct company within these parent entities.

The linkage was done by comparing enterprise and establishment names with the domain names using different string comparison methods. The lack of sufficient auxiliary variables that could provide more information for the linkage proved to be a major challenge. Additionally, a few problematic cases were identified very quickly: cooperatives, housing companies, and building superintendents.

Despite our efforts, the results thus far have been less than promising. However, the work is still ongoing, and different methods and data sources are being explored in order to enhance the Business Register in Finland.
 

Speaker

Ville Auno, Statistics Finland

CoAuthor

Katja Löytynoja, Tilastokeskus

Exploiting the Web Presence of Enterprises to Improve NACE Code Classification

NACE is the European standard hierarchical classification used to classify enterprises according to their economic activity and as such forms the foundation of various business statistics and indicators. Accordingly, it is imperative to mitigate NACE code misclassifications as well as possible to avoid biased statistical outputs. Hence, national statistical institutes carefully classify and edit NACE codes continuously, which consumes significant time and resources. To assist and expedite the manual editing and classification processes, we propose to exploit the increasing web presence of enterprises to predict their NACE codes on the basis of their scraped webpages. In this paper we present the current state of an automated 1) flat classification procedure to predict the economic activity for a fixed NACE level (levels 2-4) and 2) hierarchical classification procedure to predict the economic activity in terms of all NACE levels 1-5. Clearly, whether a proposed classification model can support the manual editing processes will depend on its quality. It is therefore essential to use evaluation measures which take the structure of the classification models into account. While there is general consensus regarding the quality measures used to assess flat classification models, hierarchical classification models do not enjoy the same benefit. Hence in this paper we also present evaluation measures, including a novel customized performance measure, which are more suitable for assessing the quality of hierarchical models than the standard evaluation metrics.
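As background on hierarchical evaluation (the novel customized measure mentioned in the abstract is the authors' own and is not reproduced here), a standard hierarchical F-measure can be sketched over NACE code prefixes: a prediction that is wrong at a deep level but correct at higher levels earns partial credit.

```python
def ancestors(code):
    """All prefixes of a NACE code, e.g. '4711' -> {'4', '47', '471', '4711'}."""
    return {code[:i] for i in range(1, len(code) + 1)}

def hierarchical_f1(true_codes, pred_codes):
    """Hierarchical precision/recall/F1 computed over ancestor sets:
    agreement at higher levels counts even when the deepest digit is wrong."""
    tp = fp = fn = 0
    for t, p in zip(true_codes, pred_codes):
        at, ap = ancestors(t), ancestors(p)
        tp += len(at & ap)
        fp += len(ap - at)
        fn += len(at - ap)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# wrong only at the 4th digit: correct down to level 3, so F1 = 0.75
print(hierarchical_f1(["4711"], ["4719"]))  # prints 0.75
```

A flat accuracy measure would score this prediction as simply wrong; the hierarchical measure reflects that most of the classification path was correct.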

Speaker

Johannes Gussenbauer, Statistics Austria