Data Universe Labeling


Standing rule (exec/leadership, C&P Weekly 2026-04-20): “If we can’t tell the universe of the data, we don’t validate it. The data is not valid to us.”

Every data artifact—whether a report, a chart, a dashboard, a Slack message, a briefing document, a SQL query result, a spreadsheet column, or an AI-generated finding—must declare its data universe before it is used, shared, or acted upon. Data without a declared universe is not valid and must not be incorporated into decisioning, commissioning, or editorial workflow.

Why this rule exists

The data intelligence team at McClatchy has historically sent performance data without universe labels, forcing consumers to chase down the source, scope, and caveats before the data is usable. This creates persistent ambiguity about what any given number actually represents. exec/leadership explicitly named this as the anti-pattern his team must not replicate.

Labeling the universe up front: (a) forces the producer to know what they built, (b) lets the consumer trust-but-verify in seconds rather than hours, (c) makes caveats legible so nobody is surprised, and (d) prevents the comparison of incompatible datasets.

What counts as a data universe label

A minimally acceptable universe label answers five questions:

| Field | What it answers | Example |
| --- | --- | --- |
| Source system | Where did the data come from? | Snowflake MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED |
| Scope / filter | What is included vs. excluded? | 13 National-team brands per national-portfolio.js; excludes Life & Style, Mod Moms Club |
| Window | What time range does this cover? | Publication 2026-01-01 through 2026-04-19; traffic as of 2026-04-19 model run |
| Caveats | What should the consumer know before using this? | Pre-~August 2025 L&E Amplitude data is likely wrong; MSN not in Snowflake (comes from Tarrow XLSX) |
| Run stamp | When was this artifact produced? | 2026-04-20 09:17 CDT (Monday scheduled sync) |

If any of the five cannot be answered, the data is not ready for use.
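
A minimal Python sketch of the five-question check, assuming the label arrives as a plain dictionary. The key names mirror the data_universe key used in this standard; the function name missing_universe_fields is hypothetical, and treating an explicitly empty caveats list as "answered" is an assumption.

```python
REQUIRED_FIELDS = ("source", "scope", "window", "caveats", "run_stamp")


def missing_universe_fields(label):
    """Return the required fields that are absent or blank in a universe label."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = label.get(field)
        # None or a whitespace-only string does not answer the question.
        # Assumption: an explicit empty list for caveats means "none known".
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


label = {
    "source": "Snowflake: MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED",
    "scope": "13 publications per national-portfolio.js",
    "window": "publication 2026-01-01 through 2026-04-19",
    "caveats": ["MSN data not in Snowflake; comes from Tarrow XLSX only"],
    # run_stamp deliberately omitted to show the failure case
}
not_ready = missing_universe_fields(label)  # non-empty list: data is not ready for use
```

A non-empty result means the artifact should be held back until the producer fills the gap.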

Required formats

Machine-readable (frontmatter / JSON / YAML)

Any data file or artifact that has structured metadata (Jekyll frontmatter, a JSON envelope, a config block) must include a data_universe key:

---
data_universe:
  source: "Snowflake: MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED"
  scope: "13 publications per national-portfolio.js; excludes Life & Style, Mod Moms Club"
  window: "publication 2026-01-01 through 2026-04-19; traffic as of 2026-04-19"
  caveats:
    - "Pre-Aug-2025 L&E Amplitude data may be wrong (L&E not properly integrated before then)"
    - "MSN data not in Snowflake; comes from Tarrow XLSX only"
    - "Cluster aggregates (cluster_* columns) appear on parent rows only; child rows get empty strings"
  run_stamp: "2026-04-20T09:17:00-05:00"
---
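
For the JSON-envelope case, one possible shape is sketched here in Python with the standard json module. The envelope layout (a top-level "data" key next to data_universe), the function name wrap_with_universe, and the sample payload row are all assumptions for illustration; only the data_universe keys come from this standard.

```python
import json


def wrap_with_universe(payload, universe):
    """Return a JSON envelope that carries the payload and its data_universe together."""
    if not universe:
        # Data without a declared universe is not valid, so refuse to emit it.
        raise ValueError("refusing to emit data without a data_universe block")
    return json.dumps({"data_universe": universe, "data": payload}, indent=2)


universe = {
    "source": "Snowflake: MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED",
    "scope": "13 publications per national-portfolio.js; excludes Life & Style, Mod Moms Club",
    "window": "publication 2026-01-01 through 2026-04-19; traffic as of 2026-04-19",
    "caveats": ["MSN data not in Snowflake; comes from Tarrow XLSX only"],
    "run_stamp": "2026-04-20T09:17:00-05:00",
}

# Hypothetical payload row, not real data.
envelope = wrap_with_universe([{"brand": "example-brand", "page_views": 1200}], universe)
```

Keeping the universe in the same document as the data means a consumer who parses the file always receives both.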

Human-readable (prose header on reports / messages)

Any report, Slack message, email, or document that surfaces data must open with a prose universe paragraph, flagged visually so it can’t be skipped:

Data universe: Snowflake TRACKER_ENRICHED (twice daily Mon-Fri, 10:13 + 18:13 CDT rebuild) · 13 National brands per national-portfolio.js · publication 2026-01-01 onward · traffic as of 2026-04-19. Caveats: pre-Aug-2025 L&E Amplitude may be wrong; MSN not in Snowflake; cluster aggregates on parent rows only.

One sentence is fine when the universe is simple. Multiple lines when it isn’t. The key discipline: the consumer should never have to ask “where did this come from?”
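
If the structured label already exists, the prose header can be generated from it rather than typed by hand. A minimal Python sketch, assuming the label is a dictionary with the five data_universe keys; the function name render_universe_header and the trailing "Run:" segment are assumptions, not a mandated format.

```python
def render_universe_header(label):
    """Build a one-paragraph 'Data universe:' header from a structured label."""
    parts = [label["source"], label["scope"], label["window"]]
    header = "Data universe: " + " · ".join(parts) + "."
    caveats = label.get("caveats") or []
    if caveats:
        header += " Caveats: " + "; ".join(caveats) + "."
    # Assumption: the run stamp is appended so all five questions are answered.
    header += " Run: " + label["run_stamp"] + "."
    return header


header = render_universe_header({
    "source": "Snowflake TRACKER_ENRICHED",
    "scope": "13 National brands per national-portfolio.js",
    "window": "publication 2026-01-01 onward",
    "caveats": ["MSN not in Snowflake", "cluster aggregates on parent rows only"],
    "run_stamp": "2026-04-20T09:17:00-05:00",
})
```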

Agent prompts (CSA and any Pierce-built agents)

Any agent that consumes or produces data must:

  1. Refuse to incorporate data without a declared universe. If data arrives without a universe label, the agent returns a request for the universe before proceeding.
  2. Emit the universe in its output when producing data-driven findings. The agent’s response must state which universe informed the finding.
  3. Distinguish universes when comparing. If the agent is comparing two numbers, the agent must state both universes and flag if they are incomparable.
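
The three agent obligations above can be sketched as an intake guard plus a comparison helper. This is an illustrative Python sketch, not the CSA agent's actual implementation: the Finding class, MissingUniverseError, and both function names are hypothetical, and using exact string equality as the test for comparable universes is a deliberately crude assumption.

```python
from dataclasses import dataclass
from typing import Optional


class MissingUniverseError(Exception):
    """Raised when data reaches the agent without a declared universe."""


@dataclass(frozen=True)
class Finding:
    value: float
    universe: Optional[str]  # short label covering source, scope, and window


def accept(finding):
    """Rule 1: refuse data that has no declared universe."""
    if not finding.universe or not finding.universe.strip():
        raise MissingUniverseError("Please supply the data universe before this value is used.")
    return finding


def compare(a, b):
    """Rules 2 and 3: state both universes and flag a mismatch."""
    accept(a)
    accept(b)
    comparable = a.universe == b.universe
    return {
        # No difference is reported when the universes do not match.
        "difference": a.value - b.value if comparable else None,
        "universe_a": a.universe,
        "universe_b": b.universe,
        "comparable": comparable,
    }
```

A real agent would likely compare source, scope, and window field by field; the point of the sketch is that both universes always appear in the output.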

Usage rules tied to data universes (new pattern, 2026-04-21)

Some data universes come with operating rules that must be applied by every consumer—not just source / scope / window / caveats, but actual decisioning rules authored by subject-matter experts (exec/leadership on strategy, data team on ad yield, etc.). These rules cannot just live in a meeting transcript—they need to travel with the data.

The pattern: encode operating rules in four places, each tuned to a different consumer:

  1. In the column’s provenance row in SNOWFLAKE.md §7—the preprocessing and validation cells spell out the rules. Anyone reading column docs sees them.
  2. In this canonical standard (see per-universe “Usage rules” blocks below)—agents and skills reading governance see them at the universe level.
  3. In SQL comments inside the CTE of the model that computes the column—anyone touching the code sees them inline.
  4. As a sibling caveats column shipped with every row (e.g., OUTLET_ECPM_CAVEATS text column next to OUTLET_ECPM_PROGRAMMATIC)—downstream consumers (reports, agents, dashboards) physically cannot use the number without the rules attached.

The fourth tier is what guarantees the rules survive any path through the data. If someone copies the number into a slide, runs a query that strips metadata, or feeds it to an agent that doesn’t read governance docs—the rules come along for the ride because they’re literally in the row.
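
One way a consumer-side query helper could honor the fourth tier is to always select the sibling caveats column alongside the metric. A minimal Python sketch, assuming a hand-maintained mapping: the mapping, the function names, and the BRAND column in the usage are hypothetical; only the two OUTLET_ECPM_* column names come from this standard.

```python
# Hypothetical mapping from a metric column to the caveats column that must travel with it.
CAVEAT_SIBLINGS = {
    "OUTLET_ECPM_PROGRAMMATIC": "OUTLET_ECPM_CAVEATS",
}


def with_caveat_columns(requested):
    """Return the requested columns plus any sibling caveats columns, without duplicates."""
    selected = list(requested)
    for column in requested:
        sibling = CAVEAT_SIBLINGS.get(column)
        if sibling and sibling not in selected:
            selected.append(sibling)
    return selected


def build_select(table, requested):
    """Build a SELECT that never returns a governed metric without its caveats.

    Assumes table and column names come from trusted code, not user input.
    """
    columns = ", ".join(with_caveat_columns(requested))
    return f"SELECT {columns} FROM {table}"


query = build_select(
    "MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED",
    ["BRAND", "OUTLET_ECPM_PROGRAMMATIC"],
)
```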

When to use which tier:

| If the rule is | Use tier(s) |
| --- | --- |
| A simple caveat (“this column is NULL for L&E articles”) | Tier 1 (provenance row) |
| A universe-wide caveat (“pre-Aug-2025 Amplitude data is wrong”) | Tier 1 + 2 |
| A hard analytical rule (“page views primary, eCPM tiebreaker”) | All 4 tiers |
| A multi-part operating framework (exec/leadership’s decisioning hierarchy) | All 4 tiers + a link to the canonical governance doc |

Known universes that carry mandatory caveats

These are the most common data universes in Pierce’s / the team’s work and the caveats that must accompany each:

Snowflake TRACKER_ENRICHED (primary performance universe)

Amplitude (via Snowflake or direct)

L&E brand page views appearing in content team lead’s tracker

SEMrush (via API)

Tarrow XLSX (syndication platform-side)

GA → Snowflake (legacy fallback)

Story facts + IAB + extended PV channels (DYN_STORY_FACTS_DETAIL_WITH_KPIS)

Cross-site syndication (STORY_TRAFFIC_METRICS)

Cross-syndication distortion screen—Marfeel per-medium (MARFEEL_ARTICLE_BY_MEDIUM)

Revenue and ad yield (multiple universes)

Usage rules (exec/leadership + data team, C&P Weekly 2026-04-20)—MUST be applied by every consumer of eCPM or revenue data:

  1. Programmatic baseline only; exclude direct-sold “gravy”. The stable-state programmatic number is the safe decisioning baseline. Direct-sold is variable and shouldn’t drive content decisions. When BURT access lands, use the programmatic-only slice, not the combined.
  2. Monthly volatility is ~35%. Holidays run strong (Nov-Dec), January runs soft. Use stable-state baselines (roughly last-summer averages), NOT spot monthly values. For Kansas City, the data team’s canonical stable number is ~$130.
  3. Page views are primary; eCPM is tiebreaker. Per data team: “$1 CPM × 5× page views beats $2 CPM × 1× page views.” Never flip this hierarchy.
  4. Market authority trumps eCPM for search-driven content. Tier 1 markets (Miami, KC, DFW) have stronger domain authority and outperform smaller markets (Myrtle Beach, Bradenton) in search, regardless of eCPM. Prefer larger markets for search-dependent content.
  5. Category/section eCPM variance matters. Certain sections (real estate, specific verticals) have direct-sold advertiser interest that makes them more valuable per PV than blended brand-level averages suggest. Surface high-eCPM-section signal when available.
  6. Sigma dashboards underestimate. The live STAR-Automation Sigma workbook includes GAM programmatic display only. It excludes video and Taboola. The “complete programmatic” picture (what’s in BURT_INTELLIGENCE) includes those.

These rules are tier-4-encoded (per the usage-rules-tied-to-universes pattern above) as a literal OUTLET_ECPM_CAVEATS text column on every row of TRACKER_ENRICHED. Downstream consumers cannot use the eCPM number without receiving the rules.
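
Rule 3 can be illustrated with a short worked example. The Python sketch below treats page views as a stand-in for ad impressions (an assumption, since eCPM is priced per 1,000 impressions), and the candidate names and numbers are invented for illustration. The function names are hypothetical.

```python
def estimated_revenue(page_views, ecpm):
    """Revenue in dollars, treating eCPM as dollars per 1,000 page views."""
    return page_views / 1000 * ecpm


def rank_candidates(candidates):
    """Sort by page views first; eCPM only breaks ties (rule 3)."""
    return sorted(candidates, key=lambda c: (c["page_views"], c["ecpm"]), reverse=True)


# Illustrative numbers only: 5x the page views at half the CPM.
candidates = [
    {"name": "high-cpm", "page_views": 10_000, "ecpm": 2.0},
    {"name": "high-volume", "page_views": 50_000, "ecpm": 1.0},
    {"name": "high-volume-better-cpm", "page_views": 50_000, "ecpm": 1.5},
]

ranked = [c["name"] for c in rank_candidates(candidates)]
volume_revenue = estimated_revenue(50_000, 1.0)  # 50.0
cpm_revenue = estimated_revenue(10_000, 2.0)     # 20.0
```

With these numbers the $1 CPM candidate earns $50 against $20 for the $2 CPM candidate, which is the arithmetic behind the quoted rule. The eCPM value only changes the order between the two candidates that have identical page views.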

content team lead’s tracker (Google Sheet)

Snowflake is the validated boundary

A corollary to the labeling rule, made explicit 2026-04-21 after exec/leadership’s original directive was extended to column-level provenance:

  1. MCC_PRESENTATION.CONTENT_SCALING_AGENT.* is the only sanctioned schema for production consumers. Reports, agents, skills, dashboards, and talking points read from this schema (or artifacts derived from it)—not from raw upstream sources.

  2. Every column in every CSA-schema table must have documented provenance. The canonical place for this is ops-hub/SNOWFLAKE.md §7 (TRACKER_ENRICHED) and §8 (TRACKER_WEEKLY). Each column must declare:
    • Source—which upstream table / feed / sheet the data ultimately originates from
    • Collects / aggregates—what the upstream source actually records, at what grain
    • Preprocessing—what the pipeline does to the raw value (dedup, canonicalize, filter, compute, fill)
    • Validation—how we know the value is valid (which safety gate guards it, which constraint it must satisfy)

    If any of the four cannot be stated, the column must be removed or quarantined until the gap is closed.

  3. External data cannot reach consumers directly. Tarrow XLSX, Amplitude API pulls, Google Sheets, SEMrush endpoints, GA, and any other external source must first land in Snowflake via a vetted ingest pipeline. The path is always: raw source → intake/staging → vetted model routine → CSA-schema output table → consumer. Consumers do not bypass this.

  4. Intake vetting for new external sources. Before a new source enters the pipeline, the following must be answered in writing (in SNOWFLAKE.md, PIPELINE.md, or the relevant ingest script’s header comments):
    • What does the source collect? At what grain? With what latency?
    • What are its known quirks, gaps, or integration caveats?
    • What preprocessing protects downstream consumers from deceptive outputs?
    • How do we detect when the source silently changes or goes bad?
  5. Unknown-source data must be blocked, not used. Data columns whose source is unverified (example: the L&E page view column in content team lead’s tracker as of 2026-04-21) cannot be incorporated into reports or dashboards until the source is identified and labeled.
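
Two of the checks in this list lend themselves to automation: confirming that a table sits inside the sanctioned schema, and quarantining columns whose provenance is incomplete. A minimal Python sketch, assuming provenance records are available as dictionaries keyed by column name; the function names, the dictionary keys, and the sample column names are hypothetical.

```python
SANCTIONED_PREFIX = "MCC_PRESENTATION.CONTENT_SCALING_AGENT."
PROVENANCE_FIELDS = ("source", "collects", "preprocessing", "validation")


def is_sanctioned(table_name):
    """True when a fully qualified table name sits inside the CSA schema."""
    return table_name.upper().startswith(SANCTIONED_PREFIX)


def quarantined_columns(provenance):
    """Return columns whose provenance is missing any of the four declarations."""
    flagged = []
    for column, record in provenance.items():
        # A missing key, None, or a blank string all count as "cannot be stated".
        if any(not str(record.get(field) or "").strip() for field in PROVENANCE_FIELDS):
            flagged.append(column)
    return sorted(flagged)


# Hypothetical provenance records for two columns.
provenance = {
    "DOCUMENTED_COLUMN": {
        "source": "upstream feed",
        "collects": "daily page views per article",
        "preprocessing": "dedup, canonicalize URL",
        "validation": "row-count safety gate",
    },
    "UNKNOWN_SOURCE_COLUMN": {"source": "", "collects": "page views"},
}
to_quarantine = quarantined_columns(provenance)
```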

Enforcement

  1. Agents and skills: any agent or skill that produces a data-driven artifact must emit the data universe in its output. Agents should refuse data without declared universes at the intake boundary.
  2. Reports (docx, markdown, HTML): must carry a universe header.
  3. Dashboards: must surface the universe in a persistent location (site footer, header banner, or tooltip).
  4. Slack / email to stakeholders: data-bearing messages open with the universe.
  5. Commits: pipeline code that emits data artifacts must include the universe block in generation templates.
  6. Column-provenance documentation: every new column added to a CSA-schema table must land with its full provenance row in SNOWFLAKE.md in the same commit as the code that creates it. No column ships without documentation.
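
The message-level rules (reports, Slack, email) could be backed by a simple pre-send check. A minimal Python sketch, assuming the header uses the literal prefix "Data universe:" shown in the human-readable format; the function name opens_with_universe is hypothetical, and a real check would also need to decide which messages count as data-bearing.

```python
def opens_with_universe(message):
    """True when the first non-blank line of a message starts with 'Data universe:'."""
    for line in message.splitlines():
        stripped = line.strip()
        if stripped:
            # Only the first non-blank line is inspected: the header must open the message.
            return stripped.lower().startswith("data universe:")
    return False  # empty or whitespace-only message


ok = opens_with_universe("Data universe: Snowflake TRACKER_ENRICHED · 13 National brands\nPage views rose.")
late = opens_with_universe("Page views rose.\nData universe: Snowflake TRACKER_ENRICHED")
```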

Anti-patterns

Do not: