Frequently Asked Questions

Whythawk offers integrated open data science consulting, data publishing software, and training for data science and economic development. openLocal is our quarterly-updated commercial location database, aggregating open data on vacancies, rental valuations, rates and ratepayers into an integrated time-series database of individual retail, industrial, office and leisure business units.

We hope these FAQs cover everything you need to know, but if you still have any questions please feel free to contact us.

What is the difference between openLocal and other location data services?

There are two key methods for gathering data for research: observational and analytical. These can be understood as the difference between a survey and a census. Surveys give an estimate of reality, but may suffer bias or sampling error. A census is a set of answers to standardised questions asked of an entire population. Observation requires manual sampling of data subjects; analysis is algorithmic and consolidates all data simultaneously. openLocal takes the analytical approach.

All the information consolidated in our database exists in multiple incompatible, poorly structured and inaccessible public databases. Our service is to acquire, restructure, validate and enrich these data and provide them in a standardised geospatial format for responding to complex queries.

Who uses openLocal?

We have supported the Greater London Authority (GLA); the Department for Business, Energy & Industrial Strategy (BEIS) and the Ministry of Housing, Communities & Local Government (MHCLG), both now the Department for Levelling Up, Housing & Communities (DLUHC); University College London (UCL) and the universities of Leeds, Northumbria and Warwick; and research groups such as Centre for Cities, Centre for London and the Consumer Data Research Centre (CDRC).

Our data and analysis have informed research into the COVID lockdown period, the Levelling Up economic recovery response, meanwhile use for empty shops, business energy consumption, the impact of rates on business vacancy, and business activity clustering maps.

Why should commercial location & ratepayer data be in the public domain?

Even before COVID devastated our economy, businesses were facing disruption from changes to the way we work and shop, and councils were battling to fill the gap between their spending responsibilities and what they earn from commercial rates. Business owners, city managers, business improvement districts, and investors want to know: Who is affected? What areas are struggling? How do they compare? Why is it happening? What can we do about it?

Then there is the intersection of money, opaque and discretionary tax reliefs worth billions of pounds, and organised crime. A rates relief is a benefit offered to business owners and investors. It is tax, which could have gone to the public benefit, deliberately forgone in the hope that the business benefitting will create jobs and opportunities for its community. We believe information about this benefit should be in the public domain. You should know what you got for the tax you chose not to collect.

How do you assemble the data in your database?

Where an observational study is dominated by the physical process of sending fieldworkers out to conduct surveys, our main analytical task is data wrangling: to find, acquire, import, restructure and validate all data published by public sources.

There is no nationally agreed schema for location data, yet multiple state entities require detailed information on individual business and location activity. In the absence of standards, each authority has developed its own methodology and definitions. This has resulted in a diverse and incompatible range of definitions on everything from business activity, to building use, to multiple "unique" identity and geographic systems. Adding to the complexity are scarce data skills leading to publication of incompatible data formats, or even resistance to publication.

Data are assembled via a combination of machine-learning techniques, including regression analysis, natural language processing and pattern-matching, into a single, unified geospatial database that supports complex research queries. All sources are imported and processed automatically, except for local rates data, which our data wranglers process both manually and algorithmically.

What software do you use for data wrangling?

We developed an open source data wrangling toolkit called whyqd.

whyqd provides an intuitive method for restructuring messy data to conform to a standardised metadata schema. It helps our data wranglers rapidly and continuously normalise messy spreadsheets using a simple series of steps. This is a transparent and collaborative process, permitting assumptions and methods to be interrogated, reviewed and rerun at any stage of the data extract-transform-load (ETL) process.

whyqd ensures complete audit transparency by saving every action performed to restructure our input data to a separate JSON-defined methods file. This permits others to scrutinise our approach, validate our methodology, or even use our methods to import data in production.
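As a rough illustration of the idea behind a JSON-defined methods file, and not of whyqd's actual API or file format, the sketch below keeps each restructuring step as plain data that can be saved, reviewed and replayed. The action names, function names and JSON layout are all hypothetical.

```python
import json


def rename_column(rows, old, new):
    """Rename a column in every row (rows are plain dicts)."""
    return [{(new if key == old else key): value for key, value in row.items()}
            for row in rows]


def drop_empty(rows, column):
    """Drop rows where the given column is empty or missing."""
    return [row for row in rows if row.get(column) not in (None, "")]


# Hypothetical registry of named actions; this is not whyqd's real action set.
ACTIONS = {"rename_column": rename_column, "drop_empty": drop_empty}


def apply_method(rows, method):
    """Replay a list of recorded steps, in order, against the input rows."""
    for step in method:
        rows = ACTIONS[step["action"]](rows, **step["params"])
    return rows


# The method is data rather than code, so it can be stored next to the output.
method = [
    {"action": "rename_column",
     "params": {"old": "Prop Ref", "new": "property_reference"}},
    {"action": "drop_empty", "params": {"column": "property_reference"}},
]
method_json = json.dumps(method, indent=2)

# Anyone holding the JSON can rerun exactly the same steps on the raw input.
raw = [{"Prop Ref": "A1", "Name": "Shop"}, {"Prop Ref": "", "Name": "Unknown"}]
clean = apply_method(raw, json.loads(method_json))
```

Because the recorded steps are separate from the data, a reviewer can inspect or amend one step and rerun the whole sequence without touching the source file.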

Why does national and local government need commercial location data, and who produces it?

Commercial rates are a legal requirement and must be paid on all hereditaments considered to be in a lettable/habitable state. The official definition of hereditament is somewhat self-referential: "hereditament" means property which is or may become liable to a rate, being a unit of such property which is, or would fall to be, shown as a separate item in the valuation list. In practical terms, these are individual, unique commercial units at specific addresses.

For England and Wales, these data are developed and maintained by the Valuation Office Agency (VOA). Their rates valuations are closest to the true (rather than quoted) rentals paid by each tenant. They release updates to their 2010 and 2017 ratings lists fortnightly, issuing adjustments to valuations of existing hereditaments, or removing and adding hereditaments to the list. Each is assigned a category of use (SCAT) from a list of 457 types. The next revaluation will be for 2023.

Our other sources are listed here and on the publishers history page.

How do you collect ratepayer data from local authorities?

Each quarter, we send Freedom of Information (FOI) requests to the 35% of local authorities which do not publish automatically, then download and restructure the data from all publishing and responding authorities.

Since 2016, we have made more than 2,500 Freedom of Information requests and curated almost 20 million records on individual commercial locations in England and Wales.

No matter how authorities choose to publish, we receive a list of every ratepayer change to each hereditament for the period between updates.

The data requested are:

  • Billing authority property reference code (linking the property to the VOA database reference)
  • Firm's trading name (i.e. property occupant)
  • Full property address (number, street, postal code, town)
  • Occupied / Vacant
  • Date of occupation / vacancy
  • Actual annual rates charged (in Pounds)

This establishes a history for each commercial hereditament, with the date of ratepayer change, giving us the period of occupation or vacancy, the name of the ratepayer (if a company), and any rates reliefs or exemptions.
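The way dated ratepayer changes turn into periods of occupation or vacancy can be sketched as follows. This is a minimal illustration, assuming each change is a dict with a date, a status and an optional ratepayer name; the field names and the `build_history` function are hypothetical and are not the openLocal schema.

```python
from datetime import date


def build_history(changes, as_of):
    """Turn dated ratepayer changes for one hereditament into periods.

    Each period runs from one change to the next; the last period runs
    to the `as_of` date (for example, the date of the latest update).
    """
    ordered = sorted(changes, key=lambda change: change["date"])
    periods = []
    for current, following in zip(ordered, ordered[1:] + [None]):
        end = following["date"] if following else as_of
        periods.append({
            "status": current["status"],
            "ratepayer": current.get("ratepayer"),
            "start": current["date"],
            "end": end,
            "days": (end - current["date"]).days,
        })
    return periods


# Illustrative records only; they may arrive in any order.
changes = [
    {"date": date(2021, 4, 1), "status": "vacant"},
    {"date": date(2020, 1, 1), "status": "occupied", "ratepayer": "Example Ltd"},
]
history = build_history(changes, as_of=date(2021, 7, 1))
```

Here the unit is occupied by the hypothetical "Example Ltd" from 1 January 2020 until it becomes vacant on 1 April 2021, and the vacancy is still running at the `as_of` date.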

Our extract-transform-load process restructures these messy data into a single schema. The transformed data files are then subjected to additional analysis and matched against the current VOA data.
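A simplified sketch of restructuring differently shaped authority files into one schema is shown below. The target field names, the two authorities, their column headers and the status vocabulary are all invented for illustration; real submissions vary far more widely.

```python
# Illustrative target fields; not the actual openLocal schema.
SCHEMA = ("property_reference", "ratepayer_name", "address",
          "occupation_state", "rates_charged")

# Hypothetical header mappings for two authorities that publish differently.
MAPPINGS = {
    "authority_a": {"Prop Ref": "property_reference",
                    "Account Holder": "ratepayer_name",
                    "Full Address": "address",
                    "Status": "occupation_state",
                    "Charge": "rates_charged"},
    "authority_b": {"BA Reference": "property_reference",
                    "Ratepayer": "ratepayer_name",
                    "Property Address": "address",
                    "Occupied/Vacant": "occupation_state",
                    "Annual Rates": "rates_charged"},
}


def parse_pounds(value):
    """Convert values such as '£12,500.00' to a float; None if blank."""
    if value is None:
        return None
    cleaned = str(value).replace("£", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


def normalise_state(value):
    """Map an authority's own wording onto 'occupied' or 'vacant'."""
    text = str(value).strip().lower() if value is not None else ""
    if text in {"occupied", "o"}:
        return "occupied"
    if text in {"vacant", "empty", "v"}:
        return "vacant"
    return None  # unknown wording is left for manual review


def restructure(row, authority):
    """Return one record in the shared schema for a raw input row."""
    record = dict.fromkeys(SCHEMA)
    for source, target in MAPPINGS[authority].items():
        if source in row:
            record[target] = row[source]
    record["occupation_state"] = normalise_state(record["occupation_state"])
    record["rates_charged"] = parse_pounds(record["rates_charged"])
    return record


row_a = {"Prop Ref": "100", "Account Holder": "Example Ltd",
         "Full Address": "1 High St", "Status": "Empty", "Charge": "£12,500.00"}
row_b = {"BA Reference": "200", "Property Address": "2 Low Rd",
         "Occupied/Vacant": "O", "Annual Rates": "3400"}
record_a = restructure(row_a, "authority_a")
record_b = restructure(row_b, "authority_b")
```

Both records end up with the same keys, so later matching can treat every authority's data identically.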

What algorithms and methods do you use for data matching?

All VOA addresses with postcodes are checked against the master Office for National Statistics (ONS) postcode list. Each item with a malformed or missing postcode is checked against nearby addresses with known postcodes from the master list (starting in the same street, then the same town) to find the closest approximate address. This is performed using an implementation of the Levenshtein Distance, a string metric used in natural language processing to estimate similarity between texts. Our objective is to situate a hereditament as close as possible to its accurate position.

Matching of ratepayer data is also performed according to the Levenshtein Distance of the VOA Unique Address Reference Numbers (UARN) and text address for each hereditament, producing a probability-of-match score. We pick the highest-probability match for each address.
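The two matching steps can be sketched with a plain dynamic-programming Levenshtein Distance. How the production score is actually calculated is not described here, so the normalised similarity, the equal weighting of UARN and address, and the function names below are assumptions made for illustration; the addresses, postcodes and reference numbers are made up.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_address(address, known):
    """Return the (address, postcode) pair nearest to `address` by edit distance."""
    return min(known,
               key=lambda pair: levenshtein(address.lower(), pair[0].lower()))


def similarity(a, b):
    """Scale edit distance to 0..1, where 1 means identical (an assumed scoring)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def best_match(record, candidates, uarn_weight=0.5):
    """Pick the candidate with the highest combined UARN and address score."""
    def score(candidate):
        return (uarn_weight * similarity(record["uarn"], candidate["uarn"])
                + (1 - uarn_weight) * similarity(record["address"].lower(),
                                                 candidate["address"].lower()))
    best = max(candidates, key=score)
    return best, score(best)


# Step 1: borrow a postcode from the nearest known address.
known = [("12 High Street", "AB1 2CD"), ("12 Hill Road", "AB1 3EF")]
nearest = closest_address("12 Hgh Street", known)

# Step 2: score a ratepayer record against candidate VOA entries.
record = {"uarn": "1234567", "address": "12 Hgh Street"}
candidates = [{"uarn": "1234567", "address": "12 High Street"},
              {"uarn": "7654321", "address": "12 Hill Road"}]
match, match_score = best_match(record, candidates)
```

A real pipeline would narrow the candidates first (same street, then same town) rather than comparing against every address, since edit distance is costly over large lists.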

How are hereditaments segmented?

Each hereditament is assigned Standard Industrial Classification (SIC) codes for businesses likely to occupy that site, based on the category of use (SCAT), from a list of 457 types, assigned by the VOA. For example, a Warehouse could be used for Construction or Transport, but also for Wholesale Trade. These SIC categorisations must hold, subject to deliberate change of use, irrespective of future or previous occupants. Between them, the SCATs and SICs define the building type and its long-term potential for use.
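In the simplest terms this segmentation is a one-to-many lookup from category of use to likely industries. The sketch below uses placeholder labels rather than real SCAT or SIC codes; only the Warehouse row reflects the example above, and the Shop row and the function name are assumptions.

```python
# Placeholder labels, not real SCAT or SIC codes.
SCAT_TO_SIC = {
    "warehouse": {"Construction", "Transport", "Wholesale Trade"},
    "shop": {"Retail Trade"},  # assumed for illustration
}


def candidate_sites(hereditaments, industry):
    """Return ids of hereditaments whose category of use suits the industry."""
    return [unit["id"] for unit in hereditaments
            if industry in SCAT_TO_SIC.get(unit["scat"], set())]


hereditaments = [
    {"id": "H1", "scat": "warehouse"},
    {"id": "H2", "scat": "shop"},
    {"id": "H3", "scat": "warehouse"},
]
wholesale_sites = candidate_sites(hereditaments, "Wholesale Trade")
```

Because the lookup depends on the building's category and not on its occupant, the result stays the same when a tenant leaves or a new one arrives.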

Do you perform custom queries?

Underneath, openLocal is a geospatial data engine, matching and managing diverse data, and permitting complex analytical and geospatial queries. This permits us to aggregate location data by any regional definition, with the flexibility to import a wide range of sources and match them together to build regional and hereditament-level integrated data and research.
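Aggregating by an arbitrary regional definition comes down to testing which boundary each unit falls inside. The sketch below is a toy version, assuming units carry simple x/y coordinates and regions are plain polygons; it uses a standard ray-casting test and is not the openLocal engine, which would rely on a proper geospatial database. The region names and data are invented.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices in order around the boundary.
    """
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def aggregate(units, regions):
    """Count units and vacant units falling inside each named region."""
    totals = {name: {"units": 0, "vacant": 0} for name in regions}
    for unit in units:
        for name, polygon in regions.items():
            if point_in_polygon(unit["x"], unit["y"], polygon):
                totals[name]["units"] += 1
                totals[name]["vacant"] += int(unit["vacant"])
                break  # assign each unit to at most one region
    return totals


# Two adjacent square regions and four illustrative units.
regions = {
    "centre": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "edge": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
units = [
    {"x": 5, "y": 5, "vacant": True},
    {"x": 2, "y": 8, "vacant": False},
    {"x": 15, "y": 5, "vacant": True},
    {"x": 30, "y": 30, "vacant": False},  # outside every region
]
totals = aggregate(units, regions)
```

Swapping in a different set of boundaries changes the aggregation without any change to the underlying unit-level data.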

Examples of new data sources and aggregations:

  • High street and city centre shapefiles: University of Liverpool has developed boundaries for retail centres across the UK, and London School of Economics has developed boundaries for city centres. These can be imported and used to develop custom reports required as part of this service.
  • Workplace zones: ONS and University of Southampton developed a set of geodemographic workplace zone classifications (COWZ) based on the 2011 census. University of Leeds and Consumer Data Research Centre are collaborating with Whythawk to develop updated and revised COWZ using the openLocal data.
  • Ownership and ownership networks: OpenOwnership publishes a database of beneficial company owners which we have restructured for use in openLocal. This is not currently part of our core data, but it permits us to link disparate special purpose vehicles and networks of company ownership into single groups for aggregate analysis (e.g. BetFred owns most of their leases through Done Bros or Lightcatch).
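Linking separate companies into a single ownership group, as in the last example, can be sketched with a union-find structure over owner-to-owned links. The company names and the `ownership_groups` function are hypothetical, and this is one possible technique rather than a description of how openLocal does it.

```python
def ownership_groups(links):
    """Group companies connected by any chain of ownership links.

    `links` is an iterable of (owner, owned) name pairs.
    """
    parent = {}

    def find(name):
        """Return the representative of the group containing `name`."""
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for owner, owned in links:
        parent[find(owned)] = find(owner)  # merge the two groups

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Hypothetical companies: two vehicles share one parent, one group is separate.
links = [
    ("Parent Group", "SPV One Ltd"),
    ("Parent Group", "SPV Two Ltd"),
    ("Other Holdings", "Other Shop Ltd"),
]
groups = ownership_groups(links)
```

Once each ratepayer is tagged with its group, leases held through different vehicles can be summed as one portfolio.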

We produce raw data, geospatial and industry aggregations, as well as supporting data visualisations.