Business register enhancements from online data: an overview

Conference: ICES VII
06/18/2024: 3:20 PM - 3:40 PM BST
Topic Contributed Session 

Description

This presentation provides an overview of efforts to enhance business registers using online data within the European WIN (Web Intelligence Network) project. The WIN project aims to explore and, where feasible, integrate new online data sources into national statistical production systems. It encompasses various use cases, including those focusing on online data sources such as job advertisements, business characteristics, real estate data, construction activities, household appliance prices, and tourism. This session specifically focuses on the use of online data for business register enhancements. Statistical offices from Austria, Finland, Hessen (Germany), the Netherlands (lead), and Sweden are collaborating closely to experiment with techniques for validating and improving business register information. This presentation outlines the common approach and introduces the other presentations that will delve into specific topics.

National Statistical Institutes (NSIs) worldwide maintain statistical business registers (SBRs), which are comprehensive databases of information on statistical units, such as enterprises, within their respective jurisdictions. These registers contain detailed information about each enterprise, including its size, location, economic activity, and administrative details. They also capture the relationship with the legal unit and other important relationships among enterprises. The SBRs serve as essential sampling frames for statistical surveys and are indispensable for producing official economic statistics. The WIN use case aims to leverage online data to enhance SBRs by incorporating better, more detailed, or new information that may be difficult or impractical to acquire through traditional methods.

Enterprises leave a trail of digital footprints across the internet, providing information that can enrich and enhance traditional business registers. These digital traces, such as websites, media advertisements, product listings, customer care interactions, and job postings, may offer valuable insights into enterprise operations and characteristics. Moreover, digital traces need not be created by the enterprise itself; descriptive information maintained in community media such as Wikipedia or in articles about the statistical unit may also be valuable. The overall challenge for all traces is to find them and to interpret them as well as possible in a statistical context, keeping in mind that these data sources are of a different nature than their traditional administrative counterparts and thus should be explored with new methods.

Although all digital traces are potentially interesting and could be explored further, for practical reasons we initially focus on one obvious starting point: the website(s) of the statistical unit itself. Unfortunately, the URL of this website is not always known from the SBR. URLs that are missing, or not reliable enough, must be found in the so-called URL finding phase (A). In the next phase (B), statistical variables can be derived or improved from the data found. Online data can be used in both phases. We explain the two phases in more detail.

A. URL finding phase

In the URL finding phase (A), multiple online data sources or services can be used to locate URLs for statistical units or to verify known URLs. One way is to use search engines. This can be done by executing a search query containing the name of the unit, optionally supplemented with contact information such as an address or a chamber of commerce (COC) or tax identification number. Executing multiple search queries per statistical unit, with different compositions and possibly on multiple search engines, can increase the number of useful results, but this should be weighed against the costs and resources required. Search engines should be used with caution, as identifying information contained in the query could be exposed in web server logs (search engine leakage). This risk can be reduced by carefully designing the queries, spreading them across different search engines, or entering into a non-disclosure agreement with a search provider. Hence search engine leakage is a serious but manageable concern.
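The query composition described above can be illustrated with a minimal sketch. The `Unit` class, its field names, and the `build_queries` function are hypothetical and not part of the WIN project; the sketch only shows how several queries per unit might be generated while leaving the identifier out by default to limit leakage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unit:
    """Hypothetical minimal view of a statistical unit in the SBR."""
    name: str
    city: Optional[str] = None
    coc_number: Optional[str] = None


def build_queries(unit: Unit, include_identifier: bool = False) -> List[str]:
    """Return search queries for one unit, from least to most specific.

    The identifier is excluded unless explicitly requested, as one
    possible way to reduce search engine leakage.
    """
    queries = [f'"{unit.name}"']
    if unit.city:
        queries.append(f'"{unit.name}" {unit.city}')
    if include_identifier and unit.coc_number:
        queries.append(f'"{unit.name}" {unit.coc_number}')
    return queries
```

Each returned query would then be sent to one or more search engines; how many queries to issue per unit remains a trade-off against cost.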

To automate the search process, it is recommended to use a (paid) search application programming interface (API) rather than parsing human-readable search result pages, although both approaches are possible. In either case, search results must be interpreted to select the best match. This can be done using the snippet (a short textual extract shown in the search results) or by scraping the URLs returned by the search engine. However, the latter approach involves an extra step, which can increase costs and slow down the process. In countries where enterprises are required to include identification numbers on their websites, these can be used for direct and exact linkage to the business register. In the absence of such legislation, machine learning techniques have been shown to be useful for selecting relevant search results. Based on a labelled training set of valid and invalid search hits, a model can be trained to capture the behaviour of a particular search engine. The set of legal units in the business register with known URLs can serve as a training set. Since search engines evolve over time, the model must be retrained periodically.
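As an illustration of how a search hit could be turned into input for such a model, the sketch below computes a few simple features from a hit's URL and snippet. The feature set and the function names are assumptions made for this example, not the features used in the project; a real model would be trained on many labelled hits.

```python
import re
from urllib.parse import urlparse


def tokens(text):
    """Lower-case alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def hit_features(unit_name, unit_id, hit_url, hit_snippet):
    """Hypothetical features describing how well a search hit matches a unit."""
    name_tokens = tokens(unit_name)
    snippet_tokens = tokens(hit_snippet)
    host = (urlparse(hit_url).hostname or "").lower()
    # Share of name tokens that also occur in the snippet.
    overlap = len(name_tokens & snippet_tokens) / len(name_tokens) if name_tokens else 0.0
    # Whether any longer name token appears in the host name.
    name_in_host = any(t in host for t in name_tokens if len(t) > 3)
    # Whether the unit's identifier appears literally in the snippet.
    id_in_snippet = bool(unit_id) and unit_id in hit_snippet
    return {
        "name_overlap": overlap,
        "name_in_host": name_in_host,
        "id_in_snippet": id_in_snippet,
    }
```

Feature dictionaries like these, labelled as valid or invalid using units with known URLs, could serve as training rows for a classifier of choice.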

Another source of URLs is web data collected by third parties. Reusing such data can save resources. However, a (paid) agreement must be made with the third party, and the dependence on the third party must be managed. This is only feasible if the added value of the data is considerable. One example of a third party that collects web data is the company DataProvider (DP). For over two years, DP has provided Statistics Netherlands with a monthly dataset of URLs of Dutch businesses and additional variables. This data has been linked to the business register using contact variables such as COC number, domain, email, zip code, and phone numbers. Note that the link between DP data and legal units can be many-to-many. A business may operate multiple websites depending on its business activities, and (usually smaller) businesses might use a business portal that hosts many enterprises to advertise their activities. This can significantly complicate the linking process. Overall, we have concluded that using third-party data for URL finding is valuable, but it should be monitored over time, as the third party might change its scraping and processing methodologies based on its own business strategy. Hence a longitudinal analysis of the linking results is ongoing.
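A minimal sketch of linking on contact variables, and of how many-to-many links arise, is given below. The record layout, the key names, and the `link_records` function are assumptions for illustration only; they do not describe the actual DP data or the linkage procedure used at Statistics Netherlands.

```python
from collections import defaultdict


def link_records(register, third_party, keys=("coc_number", "domain", "phone")):
    """Link register units to third-party URLs that share a contact value.

    Returns a dict mapping (legal_unit_id, url) to the set of keys on
    which the pair matched. One unit may link to several URLs and one
    URL to several units.
    """
    index = defaultdict(set)  # (key, value) -> ids of legal units
    for unit in register:
        for key in keys:
            value = unit.get(key)
            if value:
                index[(key, value)].add(unit["id"])

    links = defaultdict(set)  # (unit id, url) -> keys that matched
    for rec in third_party:
        for key in keys:
            value = rec.get(key)
            if value:
                for unit_id in index.get((key, value), ()):
                    links[(unit_id, rec["url"])].add(key)
    return dict(links)
```

The set of matching keys per pair can act as a crude indication of link strength: a match on an identifier is typically more convincing than a match on a shared phone number of a portal.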

Another online source for URL finding is domain registry data, which is a register of domain names and IP addresses that exists in all countries and for cross-country domains such as .com. This data can be useful for deducing domain ownership, but the degree of openness of this data varies per country and domain. This session features a dedicated presentation from Statistics Finland that goes into more detail on this topic targeted at the .fi domain. For all techniques in the URL finding phase it is important that one has a measure for the quality of the relation between the URL and the legal unit.

B. Deriving statistical variables

The second phase (B) is to derive statistical variables from the online data. This is typically done by scraping text from the URLs found and establishing the relationship with the unit in the SBR. Deriving economic activity (NACE) is a common use case. This requires interpretation of raw texts and usually involves natural language processing (NLP) and machine learning techniques. Other statistical variables that have been derived from web data include degree of innovativeness, degree of sustainability, whether or not the enterprise operates a web shop, and whether it belongs to the platform economy. Yet another use case is to discover or verify administrative information of SBR units, such as email addresses and telephone numbers. This use case is explored in the WIN project in more depth by Statistics Hessen (Germany).
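To make the idea of deriving economic activity from website text concrete, the following is a toy sketch of a multinomial naive Bayes text classifier with add-one smoothing. This is a swapped-in textbook technique chosen for brevity, not the method used by any of the participating offices; the class name and the tiny training texts are invented, and the labels are NACE Rev. 2 class codes used purely as examples.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split a text into lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for word in tokenize(text):
                if word in self.vocab:  # words unseen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data; real applications need many labelled websites.
TRAIN_TEXTS = [
    "restaurant menu dinner lunch reservations",
    "pizza restaurant takeaway menu",
    "custom software development and programming",
    "web application programming services",
]
TRAIN_LABELS = ["56.10", "56.10", "62.01", "62.01"]
model = TinyNaiveBayes().fit(TRAIN_TEXTS, TRAIN_LABELS)
```

In practice, scraped texts are long, multilingual, and noisy, so preprocessing and feature selection matter far more than in this toy setting.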

Because enterprise websites can vary significantly in almost all aspects, scraping is typically done using a generic scraping approach, which, unlike specific scraping, does not require prior knowledge of the site's structure. Generic scraping typically begins at the home page and recursively visits deeper pages up to a certain maximum depth. Decisions must be made about whether to store the entire website, only the text, only the derived data, or all of these. It might be valuable to use a focused scraper. Such a scraper does not follow all links but prioritizes those that are expected to contain the most valuable information for the task at hand. For example, a focused scraper might prioritize the "about us" page to identify economic activity. Both approaches have their pros and cons: a generic scraper is simpler and thus requires less maintenance, but results in larger data volumes, whereas a focused scraper may yield better results but needs more configuration.
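The difference between generic and focused scraping can be sketched as a crawl loop with a depth limit and a link priority. The example below is an assumption-laden illustration: `PRIORITY_HINTS`, `link_priority`, and `focused_crawl` are hypothetical names, and link extraction is injected as a `get_links` function so that the sketch runs without network access.

```python
import heapq

# Hypothetical hints for pages likely to describe the enterprise's activity.
PRIORITY_HINTS = ("about", "company", "contact")


def link_priority(url):
    """Lower value means the link is visited earlier."""
    return 0 if any(hint in url.lower() for hint in PRIORITY_HINTS) else 1


def focused_crawl(start_url, get_links, max_depth=2, max_pages=10):
    """Visit pages from start_url, preferring prioritized links.

    get_links(url) must return the links found on a page; in a real
    scraper it would fetch and parse the page.
    """
    visited = []
    seen = {start_url}
    counter = 0  # tie-breaker: keeps discovery order among equal priorities
    queue = [(0, counter, 0, start_url)]  # (priority, order, depth, url)
    while queue and len(visited) < max_pages:
        _, _, depth, url = heapq.heappop(queue)
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the maximum depth
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                counter += 1
                heapq.heappush(queue, (link_priority(link), counter, depth + 1, link))
    return visited
```

Making `link_priority` return a constant turns this into a generic crawl that visits pages in discovery order, which shows that the focused variant mainly adds configuration in the form of the priority rule.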

It is important to verify that the website visited matches the legal unit at hand based on the presence of identifying information on the site, such as a chamber of commerce or tax identification number, if possible. In countries that have national legislation that requires enterprises to include this information on their websites this is straightforward. Within this project Statistics Hessen and Statistics Austria make use of such legislation. As with the URL finding phase, special attention should be paid to many-to-many relationships between legal units and websites.
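Such a check could look like the sketch below, which searches scraped page texts for the unit's identifier. The default pattern (a standalone run of eight digits) is only a placeholder assumption; the actual identifier format depends on national legislation and would have to be adapted per country. The function names are hypothetical.

```python
import re


def find_identifiers(page_text, pattern=r"\b\d{8}\b"):
    """Return all substrings of a page text that match the identifier pattern."""
    return set(re.findall(pattern, page_text))


def site_matches_unit(page_texts, unit_identifier, pattern=r"\b\d{8}\b"):
    """True if the unit's identifier occurs on at least one scraped page."""
    return any(
        unit_identifier in find_identifiers(text, pattern)
        for text in page_texts
    )
```

Because a portal page may list identifiers of many enterprises, a positive match confirms that the unit is mentioned on the site, not that the site belongs exclusively to that unit.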

The use case of NACE detection from website texts is an important application. This session includes two more detailed presentations on deriving or verifying NACE codes of business register units using machine learning techniques. Statistics Austria discusses predicting NACE codes on multiple hierarchical NACE levels, and Statistics Netherlands presents a machine learning approach to predicting NACE codes, with special attention to feature selection and performance comparison for one dedicated NACE category.

Wrap-up

This paper presents a high-level overview of enhancing business registers using online data. There are two phases: URL finding and statistical variable derivation. URL finding is the process of discovering URLs of legal units that are not registered in the business register. Search engines can be used effectively for this task. Machine learning techniques can be applied to the snippets of search results to select valid hits. Alternatively, the discovered URLs can be scraped. Another way to find URLs is to link third-party web-scraped data to the SBR. Domain registry data can also be a valuable input.

Once URLs for legal units are known or found through URL finding, statistical variables, such as the NACE code, can be derived. This involves scraping the websites and interpreting the results using natural language processing (NLP) and machine learning techniques. Other web data sources, such as news articles and job postings, can also be used. The best approach to improving a business register using web data is likely a combination of web data sources and techniques tailored to the specific country's business register and national web.

Keywords

online data

webscraping

search engines

machine learning

NACE 

Speaker

Arnout Van Delden, Statistics Netherlands

CoAuthor

Olav ten Bosch, Statistics Netherlands