Coverage and quality

Methodology

Measuring accuracy in open-web brand data has real theoretical and practical limits, so we want to be explicit about how the numbers in this document should be interpreted.

A defining property of our dataset is that it must work across the full range of domains on the open internet. Input quality varies materially, and output quality is a direct function of it. When a site is slow, broken, or content-thin, the resulting data footprint is thinner, which reduces coverage. We do not compensate for this: a thin footprint is itself a strong indicator of low brand maturity or a weak digital presence.

For practical purposes, this evaluation is sample-based. Our goal is to construct samples that are random enough to avoid gaming, but representative enough to reflect real production usage. We therefore built two datasets:

Core distribution. A baseline sample across public and private brands, primarily in the US and EU.
Long-tail distribution. An uncurated global sample of micro-businesses and local service providers.

The results should be read with three principles in mind:

First-party sources. We rely primarily on first-party surfaces (the brand’s own website and managed social profiles) and cross-check across them when possible to increase accuracy.
Long-tail coverage. We deliver strong performance on the long tail, showing that the pipeline works beyond curated cases.
Non-padding. When we return null, it reflects weak, missing, or ambiguous public signals. In open-web data, missing information often correlates with low brand maturity or weak digital presence.
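To make the non-padding principle concrete, here is a minimal sketch of how a coverage percentage can be computed when missing values stay null instead of being filled in. The record shape, the field names, and the `coverage` helper are assumptions made for illustration; they are not the production pipeline or its API.

```python
from typing import Optional

# Hypothetical records: the field names echo the tables in this document,
# but the record shape itself is an assumption for illustration.
records: list[dict[str, Optional[str]]] = [
    {"logo": "https://example.com/logo.svg", "country": "CH", "founded_year": "2011"},
    {"logo": "https://example.org/logo.png", "country": None, "founded_year": None},
    {"logo": None, "country": None, "founded_year": None},
    {"logo": "https://example.net/logo.png", "country": "US", "founded_year": None},
]


def coverage(rows: list[dict[str, Optional[str]]], field: str) -> float:
    """Percentage of rows where `field` is present and non-empty.

    Nulls are counted as misses; nothing is imputed or padded.
    """
    if not rows:
        raise ValueError("coverage is undefined for an empty sample")
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return 100.0 * filled / len(rows)


for name in ("logo", "country", "founded_year"):
    print(f"{name}: {coverage(records, name):.1f}%")
```

Under this definition a null lowers coverage for that field only, so a sparse record still contributes whatever signals it does expose.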

The attributes reported here are the primary inputs used to compute both production and evaluation signals in the Signal Catalog.

Core distribution

Coverage was measured on an expanded sample of 542 brands built from existing financial-industry customer datasets. The sample is 30–40% public and 60–70% private (startups and SMBs). It is mostly US and EU, with extra weight on SMBs, local businesses, and long-tail domains to stress test performance outside large enterprise brands.

Core attributes

Core attributes used to identify and resolve merchant entities.

Datapoint	Coverage
Logo (Any)	95%
Logo (Dark)	87.5%
Icon (Any)	83.8%
Logo (Light)	58.3%
Symbols	33.4%
Colors	97%
Banner	84%

Key takeaway: Core identity coverage is at or above 95% (Logo (Any) 95%, Colors 97%). This indicates the system is driven by algorithmic discovery and resolution. Lower coverage for symbols and specific variants provides an organic signal of the maturity of a merchant’s digital and brand operations.

Firmographic and financial data

Data density supporting KYB and compliance workflows.

Datapoint	Coverage
Description	94.1%
longDescription	95%
Social Links	86.5%
Country	77.1%
City	76%
Kind	71%
Founded year	66.8%
Employees Count	79.3%
ISIN	32.1%
Stock	31.7%

Key takeaway: Coverage generally drops as fields rely more on formal disclosure and public-company status, which is expected in a mixed enterprise and long-tail sample. Financial identifiers are limited to public companies: ISIN (32.1%) and Stock (31.7%) fall within the 30–40% public share of the sample.
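As a rough sanity check, the financial-identifier coverage can be compared with the stated public-company share of the sample. The sketch below uses only figures quoted in this section (542 brands, 30–40% public, ISIN 32.1%, Stock 31.7%). The approximate counts are derived by rounding and are not reported figures.

```python
SAMPLE_SIZE = 542                      # core distribution sample size
PUBLIC_SHARE_RANGE = (0.30, 0.40)      # stated share of public companies
REPORTED_COVERAGE_PCT = {"ISIN": 32.1, "Stock": 31.7}  # from the table above


def approx_count(pct: float, n: int) -> int:
    """Approximate number of brands behind a coverage percentage (rounded)."""
    return round(n * pct / 100)


# Approximate number of public companies implied by the stated share.
public_low = round(SAMPLE_SIZE * PUBLIC_SHARE_RANGE[0])
public_high = round(SAMPLE_SIZE * PUBLIC_SHARE_RANGE[1])

approx_counts = {
    name: approx_count(pct, SAMPLE_SIZE)
    for name, pct in REPORTED_COVERAGE_PCT.items()
}

print(f"public companies: roughly {public_low} to {public_high}")
for name, count in approx_counts.items():
    within = public_low <= count <= public_high
    print(f"{name}: about {count} brands, within public range: {within}")
```

Both identifiers land inside the implied public-company range, which is what one would expect if these fields are only populated for listed entities.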

Long-tail distribution

Coverage was measured on a global sample of 397 micro-businesses. These entities were randomly selected from Google Maps across a diverse set of cities and regions, without filtering for brand maturity, technical sophistication, or even the presence of a working domain. The sample spans 16 cities across Europe, North America, Latin America, Africa, the Middle East, and Asia (e.g., Lausanne, Rotterdam, London, Barcelona, Tashkent, Dubai, Hanoi, Fukuoka, Dallas, New York, Nagpur, Accra, Lima, Cali, Durban).

This dataset is designed to test performance at the extreme long tail: local, non-tech, offline-first businesses that may lack a formal brand, a maintained website, or any structured public footprint. It represents a lower bound on expected coverage.

Core attributes

Core attributes used to identify and resolve merchant entities.

Datapoint	Coverage
Logo (Any)	85.9%
Logo (Dark)	65.7%
Icon (Any)	58.4%
Logo (Light)	20.4%
Symbols	0.76%
Colors	79.6%
Banner	36.3%

Key takeaway: Even at the extreme long tail, most merchants still expose at least one core identity signal: Logo (Any) alone reaches 85.9%, and Colors 79.6%. When coverage is lower (for example, for symbols), this reflects a weak or immature public footprint, which is itself a meaningful signal.
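Because these figures come from a sample of 397 businesses, each percentage carries sampling uncertainty. One common way to express it is a Wilson score interval. The sketch below is an illustrative calculation on two values from the table above and is not part of the published methodology. The 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion observed on a sample of size n.

    p_hat is a fraction in [0, 1]; z = 1.96 corresponds to roughly 95%.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be a fraction between 0 and 1")
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


LONG_TAIL_SAMPLE = 397  # long-tail sample size, from the text

# Coverage values taken from the long-tail core attributes table.
for name, pct in {"Logo (Any)": 85.9, "Symbols": 0.76}.items():
    low, high = wilson_interval(pct / 100, LONG_TAIL_SAMPLE)
    print(f"{name}: {pct}% (approx. interval {100 * low:.1f}% to {100 * high:.1f}%)")
```

The Wilson form is used here instead of the simpler normal approximation because it stays well behaved for proportions close to 0%, such as Symbols.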

Firmographic and financial data

Data density supporting KYB and compliance workflows.

Datapoint	Coverage
Description	80.6%
longDescription	88.9%
Social Links	70.0%
Country	35.8%
City	34.5%
Kind	34.0%
Founded year	27.7%
Employees Count	36.5%

Key takeaway: Given that these are random micro-businesses with very little formal data, this level of coverage is strong. With descriptive and social signals at roughly 70–89% (Social Links 70.0%, Description 80.6%, longDescription 88.9%), most businesses still expose enough digital surface to allow identity resolution, even when traditional firmographics are missing.
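To see which fields hold up best outside the core sample, the two firmographic tables can be compared field by field. The sketch below uses only the coverage values reported in this document. The gap in percentage points is a derived illustration, not a published metric.

```python
# Coverage in percent, copied from the two firmographic tables in this document.
CORE = {
    "Description": 94.1, "longDescription": 95.0, "Social Links": 86.5,
    "Country": 77.1, "City": 76.0, "Kind": 71.0,
    "Founded year": 66.8, "Employees Count": 79.3,
}
LONG_TAIL = {
    "Description": 80.6, "longDescription": 88.9, "Social Links": 70.0,
    "Country": 35.8, "City": 34.5, "Kind": 34.0,
    "Founded year": 27.7, "Employees Count": 36.5,
}

# Gap in percentage points between the core and long-tail samples.
# Only fields reported for both samples are compared.
gaps = {name: round(CORE[name] - LONG_TAIL[name], 1) for name in LONG_TAIL}

# Print from the smallest gap to the largest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.1f} pp lower on the long tail")
```

Descriptive and social fields lose far fewer points than location, kind, founding year, or employee count, which matches the takeaway that identity resolution remains feasible when firmographics are sparse.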
