Data Universe Labeling


Standing rule (exec/leadership, C&P Weekly 2026-04-20): “If we can’t tell the universe of the data, we don’t validate it. The data is not valid to us.”

Every data artifact—whether a report, a chart, a dashboard, a Slack message, a briefing document, a SQL query result, a spreadsheet column, or an AI-generated finding—must declare its data universe before it is used, shared, or acted upon. Data without a declared universe is not valid and must not be incorporated into decisioning, commissioning, or editorial workflow.

Why this rule exists

The data intelligence team at McClatchy has historically sent performance data without universe labels, forcing consumers to chase down the source, scope, and caveats before the data is usable. This creates persistent ambiguity about what any given number actually represents. exec/leadership explicitly named this as the anti-pattern his team must not replicate.

Labeling the universe up front: (a) forces the producer to know what they built, (b) lets the consumer trust-but-verify in seconds rather than hours, (c) makes caveats legible so nobody is surprised, and (d) prevents the comparison of incompatible datasets.

What counts as a data universe label

A minimally acceptable universe label answers five questions:

Field	What it answers	Example
Source system	Where did the data come from?	Snowflake `MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED`
Scope / filter	What is included vs. excluded?	13 National-team brands per `national-portfolio.js`; excludes Life & Style, Mod Moms Club
Window	What time range does this cover?	Publication 2026-01-01 through 2026-04-19; traffic as of 2026-04-19 model run
Caveats	What should the consumer know before using this?	Pre-~August 2025 L&E Amplitude data is likely wrong; MSN not in Snowflake (comes from Tarrow XLSX)
Run stamp	When was this artifact produced?	2026-04-20 09:17 CDT (Monday scheduled sync)

If any of the five cannot be answered, the data is not ready for use.

Required formats

Machine-readable (frontmatter / JSON / YAML)

Any data file or artifact that has structured metadata (Jekyll frontmatter, a JSON envelope, a config block) must include a data_universe key:

---
data_universe:
  source: "Snowflake: MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED"
  scope: "13 publications per national-portfolio.js; excludes Life & Style, Mod Moms Club"
  window: "publication 2026-01-01 through 2026-04-19; traffic as of 2026-04-19"
  caveats:
    - "Pre-Aug-2025 L&E Amplitude data may be wrong (L&E not properly integrated before then)"
    - "MSN data not in Snowflake; comes from Tarrow XLSX only"
    - "Cluster aggregates (cluster_* columns) appear on parent rows only; child rows get empty strings"
  run_stamp: "2026-04-20T09:17:00-05:00"
---
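A minimal sketch of how a pipeline step might check the block above before an artifact ships. It assumes the frontmatter or JSON envelope has already been parsed into a Python dict. The function names are hypothetical. Treating an empty caveats list as "not answered" and requiring an ISO 8601 run stamp are assumptions made here, not rules stated by this standard.

```python
from datetime import datetime

# The five fields the standard requires on every data_universe block.
REQUIRED_FIELDS = ("source", "scope", "window", "caveats", "run_stamp")


def _is_answered(value) -> bool:
    """A field counts as answered if it is non-empty text or a non-empty list of text."""
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, list):
        return bool(value) and all(isinstance(v, str) and v.strip() for v in value)
    return False


def missing_universe_fields(metadata: dict) -> list[str]:
    """Return the required fields that are absent, empty, or malformed."""
    universe = metadata.get("data_universe")
    if not isinstance(universe, dict):
        return list(REQUIRED_FIELDS)
    missing = [f for f in REQUIRED_FIELDS if not _is_answered(universe.get(f))]
    # Assumption: run_stamp is an ISO 8601 string, as in the frontmatter example.
    if "run_stamp" not in missing:
        try:
            datetime.fromisoformat(universe["run_stamp"])
        except ValueError:
            missing.append("run_stamp")
    return missing
```

An empty result means all five questions are answered. Any non-empty result means the artifact is not ready for use.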

Human-readable (prose header on reports / messages)

Any report, Slack message, email, or document that surfaces data must open with a prose universe paragraph, flagged visually so it can’t be skipped:

Data universe: Snowflake TRACKER_ENRICHED (twice daily Mon-Fri, 10:13 + 18:13 CDT rebuild) · 13 National brands per national-portfolio.js · publication 2026-01-01 onward · traffic as of 2026-04-19. Caveats: pre-Aug-2025 L&E Amplitude may be wrong; MSN not in Snowflake; cluster aggregates on parent rows only.

One sentence is fine when the universe is simple; use multiple lines when it isn’t. The key discipline: the consumer should never have to ask “where did this come from?”
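One way to keep the prose header in sync with the machine-readable block is to generate it. The sketch below assumes a dict with the five fields described earlier. The function name and the exact punctuation are illustrative choices, and appending the run stamp to the prose line is an assumption (the sample header above omits it).

```python
def render_universe_header(universe: dict) -> str:
    """Build a one-paragraph prose universe header from a data_universe dict."""
    # Source, scope and window are joined with a middle dot, as in the sample header.
    header = "Data universe: " + " · ".join(
        [universe["source"], universe["scope"], universe["window"]]
    ) + "."
    caveats = universe.get("caveats") or []
    if caveats:
        header += " Caveats: " + "; ".join(caveats) + "."
    # Assumption: the run stamp is appended so the prose form carries all five fields.
    header += " Run stamp: " + universe["run_stamp"] + "."
    return header
```

A KeyError on a missing field is deliberate in this sketch: a header that cannot be rendered is a signal that the universe is not fully declared.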

Agent prompts (CSA and any Pierce-built agents)

Any agent that consumes or produces data must:

Refuse to incorporate data without a declared universe. If data arrives without a universe label, the agent returns a request for the universe before proceeding.
Emit the universe in its output when producing data-driven findings. The agent’s response must state which universe informed the finding.
Distinguish universes when comparing. If the agent is comparing two numbers, the agent must state both universes and flag if they are incomparable.
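The three agent rules above can be sketched as an intake gate plus a comparison helper. Everything named here (`LabeledValue`, `MissingUniverseError`, `compare`) is hypothetical. The comparability test, strict equality on source, scope and window, is a placeholder assumption; real comparability will often need human judgment.

```python
from dataclasses import dataclass
from typing import Optional


class MissingUniverseError(ValueError):
    """Raised when data arrives without a declared universe."""


@dataclass(frozen=True)
class LabeledValue:
    value: float
    universe: Optional[dict]  # the data_universe block, or None if undeclared


def require_universe(item: LabeledValue) -> None:
    """Rule 1: refuse data that has no declared universe."""
    if not item.universe or not item.universe.get("source"):
        raise MissingUniverseError("Declare the data universe before this value is used.")


def compare(a: LabeledValue, b: LabeledValue) -> dict:
    """Rules 2 and 3: emit both universes and flag incomparable inputs."""
    for item in (a, b):
        require_universe(item)
    # Assumption: two universes are comparable only if source, scope and window match.
    comparable = all(
        a.universe.get(key) == b.universe.get(key)
        for key in ("source", "scope", "window")
    )
    return {
        "difference": a.value - b.value,
        "universes": (a.universe, b.universe),
        "comparable": comparable,
    }
```

The result always carries both universes, so a downstream report can print them even when the comparison is flagged as incomparable.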

Usage rules tied to data universes (new pattern, 2026-04-21)

Some data universes come with operating rules that must be applied by every consumer—not just source / scope / window / caveats, but actual decisioning rules authored by subject-matter experts (exec/leadership on strategy, data team on ad yield, etc.). These rules cannot just live in a meeting transcript—they need to travel with the data.

The pattern: encode operating rules in four places, each tuned to a different consumer:

1. In the column’s provenance row in SNOWFLAKE.md §7: the preprocessing and validation cells spell out the rules. Anyone reading column docs sees them.
2. In this canonical standard (see per-universe “Usage rules” blocks below): agents and skills reading governance see them at the universe level.
3. In SQL comments inside the CTE of the model that computes the column: anyone touching the code sees them inline.
4. As a sibling caveats column shipped with every row (e.g., OUTLET_ECPM_CAVEATS text column next to OUTLET_ECPM_PROGRAMMATIC): downstream consumers (reports, agents, dashboards) physically cannot use the number without the rules attached.

The fourth tier is what guarantees the rules survive any path through the data. If someone copies the number into a slide, runs a query that strips metadata, or feeds it to an agent that doesn’t read governance docs—the rules come along for the ride because they’re literally in the row.
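A rough illustration of what "the rules come along for the ride" could look like in a consumer. The mapping dict and function name are hypothetical, and rows are modeled as plain dicts. Only the two column names come from this standard.

```python
# Hypothetical mapping from a metric column to the caveats column that must travel with it.
CAVEAT_SIBLINGS = {"OUTLET_ECPM_PROGRAMMATIC": "OUTLET_ECPM_CAVEATS"}


def project_with_caveats(rows: list[dict], metric: str) -> list[dict]:
    """Project one metric out of each row, carrying its sibling caveats text along."""
    sibling = CAVEAT_SIBLINGS.get(metric)
    if sibling is None:
        # No rules are attached to this metric in the mapping above.
        return [{metric: row[metric]} for row in rows]
    projected = []
    for row in rows:
        if not row.get(sibling):
            # A number without its rules is treated as unusable, not silently passed on.
            raise ValueError(f"{metric} row is missing its {sibling} text")
        projected.append({metric: row[metric], sibling: row[sibling]})
    return projected
```

The design choice is that stripping the caveats column is an error rather than a default, so a slide or agent built on this helper cannot receive the number alone.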

When to use which tier:

If the rule is	Use tier(s)
A simple caveat (“this column is NULL for L&E articles”)	Tier 1 (provenance row)
A universe-wide caveat (“pre-Aug-2025 Amplitude data is wrong”)	Tier 1 + 2
A hard analytical rule (“page views primary, eCPM tiebreaker”)	All 4 tiers
A multi-part operating framework (exec/leadership’s decisioning hierarchy)	All 4 tiers + a link to the canonical governance doc

Known universes that carry mandatory caveats

These are the most common data universes in Pierce’s / the team’s work and the caveats that must accompany each:

Snowflake `TRACKER_ENRICHED` (primary performance universe)

Source: MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED
Refreshed: twice daily Mon-Fri, 10:13 + 18:13 CDT (snowflake-tracker-sync.yml); manual re-runs possible
Scope: 13 National brands + L&E (Us Weekly, Woman’s World only)
Caveats:
- Cluster aggregate columns (cluster_*) appear only on parent rows; children get empty strings
- DYN_CONTENT_API_LATEST was dropped 2026-04-20 then recreated late April (table-ids changed; filter ACCOUNT_USAGE.TABLES WHERE deleted IS NULL to avoid the dropped predecessors). The current live instance is actively maintained (251K rows, last_altered daily) and was re-integrated into TRACKER_ENRICHED at v2.7 (2026-05-16), supplying tags_iab (the canonical IAB-tier signal) plus tag_need, tag_sensitive, tags_other, tags_seo.
- MSN traffic is NOT represented—comes from Tarrow XLSX only
- Performance-signal columns are origin-PVs basis (post-cross-syndication-screen, ship date TBD). is_hit, article_vs_co_median, cluster_vs_co_median, author_hit_* measure the article’s home-publication performance only—cross-syndication fan-out is excluded. Reach + revenue columns (total_pvs, article_programmatic_revenue_live, author_avg_pvs) remain on total-PVs basis. Driven by exec/leadership 2026-05-06 ask: distribution picking already-strong stories for syndication “juices” apparent topic perf and must not feed the outcomes loop. Implementation: ops-hub/docs/cross-syndication-screen.md.

Amplitude (via Snowflake or direct)

Sources (verified 2026-05-27):
- MCC_PRESENTATION.AMPLITUDE.AMPLITUDE_EVENTS_PROD — canonical presentation-layer events table (10.1B rows / 5.9 TB, continuous refresh). Use this for vendor + feed-source attribution work (the individual_feed_source property lives here).
- MCC_AMPLITUDE.AMPLITUDE.EVENTS_412949 — raw per-project export (54.3B rows / 22.2 TB; same upstream Amplitude org, different consumption surface). Source for compute_article_engagement_signals.py.
- MCC_AMPLITUDE.AMPLITUDE.EVENTS_412950 — paywall funnel events (3.8M rows).
- MCC_AMPLITUDE.AMPLITUDE.EVENTS_669032 — O&O events (1.36B rows, not used in pipeline).
STALE views — do NOT use: MCC_PRESENTATION.AMPLITUDE.AMPLITUDE_EVENTS_PROD_LAST_30DAYS (not refreshed since 2025-06-04); MCC_PRESENTATION.AMPLITUDE.AMPLITUDE_LIFESTYLE_AND_ENTERTAINMENT_EVENTS_PROD (not refreshed since 2026-02-24). Both return old data silently. Use AMPLITUDE_EVENTS_PROD with explicit filters instead.
Caveats:
- L&E brands were not properly integrated into Amplitude before ~August 2025. Any L&E data from before that date is likely wrong. Historical analyses must label this caveat explicitly or exclude pre-Aug-2025 L&E.
- Placement-test freeze — no L&E content-placement tests until the L&E data pipe is repaired. With L&E Amplitude integration broken (above) and the dedicated L&E view stale since 2026-02-24, a placement test on L&E cannot be measured, so its result is uninterpretable; exclude L&E from the placement-test matrix until the pipe is fixed and the read is reliable.
- p-tagging bug (CUE vs. WordPress format mismatch) may still affect cross-platform event data. Reliability gate: PTECH-7730.
- event_time has future-dated garbage rows (max seen: 2201-01-01). Always pair date filters with both lower AND upper bounds.
- individual_feed_source is the canonical property for vendor + feed-source attribution. content_credit undercounts NYT / Tribune / Minute Media by 50-90% vs individual_feed_source; do not use content_credit as a vendor cross-check (audited 2026-05-27).
- Instrumentation blindspots: these event types fire in Amplitude but carry ZERO individual_feed_source — amp_article_view, app_eedition_article_view, app_eedition_replica_view, newsbreakapp_article_view, smartnewsapp_article_view. Combined ~15M events / 10 days are invisible to feed-source-based analysis. Engineering-side gap.
- Canonical filter for feed-source PV analysis: event_type IN ('article_view', 'eedition_article_view', 'eedition_replica_view', 'app_article_view'). Full reference in ops-hub SNOWFLAKE.md §19.
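The two query-shaping caveats above (bounded `event_time`, canonical event types) can be combined in a small query builder. This is a sketch, not the pipeline's actual SQL. The function name is hypothetical, `COUNT(*)` as the PV measure is an assumption, and `individual_feed_source` is referenced as a plain column even though this standard calls it a property, so the real table may require extracting it from an event-properties column.

```python
from datetime import date

# Canonical event types for feed-source PV analysis, per the caveat above.
FEED_SOURCE_EVENT_TYPES = (
    "article_view",
    "eedition_article_view",
    "eedition_replica_view",
    "app_article_view",
)


def feed_source_pv_query(start: date, end: date) -> str:
    """Build a feed-source PV query with both a lower and an upper event_time bound."""
    if end <= start:
        raise ValueError("end must be after start")
    event_list = ", ".join(f"'{name}'" for name in FEED_SOURCE_EVENT_TYPES)
    return (
        "SELECT individual_feed_source, COUNT(*) AS pvs\n"
        "FROM MCC_PRESENTATION.AMPLITUDE.AMPLITUDE_EVENTS_PROD\n"
        # Half-open window: the upper bound keeps future-dated garbage rows out.
        f"WHERE event_time >= '{start.isoformat()}'\n"
        f"  AND event_time < '{end.isoformat()}'\n"
        f"  AND event_type IN ({event_list})\n"
        "GROUP BY individual_feed_source"
    )
```

Requiring both dates as arguments makes it impossible to call the helper with an open-ended window, which is the failure mode the future-dated rows punish.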

L&E brand page views appearing in content team lead’s tracker

Source: UNCONFIRMED as of 2026-04-21—likely Amplitude but needs verification
Status: do not incorporate into downstream work until source is identified and labeled
Action: see ops-hub P3 nextActions—identify source + universe-label column before using

SEMrush (via API)

Source: SEMrush API, Pierce’s L&E allocation (250K credits/month; data team’s total pool 2M/month)
Scope: keyword metrics at seed level; each brief declares its seed set explicitly
Caveats: verdicts (Go Hard / Test Small / Skip) are editorial judgments overlaid on SEMrush metrics—not SEMrush’s own verdicts

Tarrow XLSX (syndication platform-side)

Source: Tarrow vendor export (XLSX); downloaded weekly via data-headlines/download_tarrow.py
Scope: Apple News native, MSN, Yahoo, SmartNews platform-side views
Caveats:
- Data is platform-side, NOT O&O click-throughs. Do not commingle with Snowflake O&O traffic.
- Syndication platforms are LTV=0 per exec/leadership framing—pure PV increment, no subscriber conversion. Do not treat as equivalent to O&O PVs for decisioning.

GA → Snowflake (legacy fallback)

Source: Google Analytics, piped into Snowflake (availability varies by brand)
Caveats:
- Pre-current-integration era may contain infinite-scroll artifacts and other recording anomalies
- Last ~18 months of GA data is “pretty good” per exec/leadership (2026-04-20) but any analysis must label that cutoff

Story facts + IAB + extended PV channels (`DYN_STORY_FACTS_DETAIL_WITH_KPIS`)

Source: MCC_PRESENTATION.TABLEAU_REPORTING.DYN_STORY_FACTS_DETAIL_WITH_KPIS—177K rows, keyed by STORY_ID
Contains: IAB taxonomy (up to 5 levels), custom + MCC-defined keywords, story topic, section names, plus extended PV channels not in STORY_TRAFFIC_MAIN: paywall hits, app views, cross-site internal recirculation, external backlink views, eEdition (print-replica) views, direct subscription conversions
Status: Primary replacement for the dropped DYN_CONTENT_API_LATEST (on the IAB + keyword side). Integrated into TRACKER_ENRICHED 2026-04-21.
Caveats: Classification quality depends on upstream editorial tagging; IAB array may be sparse for some articles.

Cross-site syndication (`STORY_TRAFFIC_METRICS`)

Source: MCC_PRESENTATION.TABLEAU_REPORTING.STORY_TRAFFIC_METRICS—205M rows, (URL × BIZ_UNIT × DATE) grain
Contains: Every McClatchy newspaper site × every article × every date it was served there. Enables cross-site syndication aggregates per article.
Window: 2023-03 to present (3 years of history—deeper than STORY_TRAFFIC_MAIN)
Caveats:
- Different universe from STORY_TRAFFIC_MAIN / _LE. Appears to be the GA-pumped-into-Snowflake source exec/leadership mentioned (2026-04-20). Do not mix with Amplitude-derived metrics without labeling.
- Closer is absent from this table (no evidence of meaningful Closer syndication on McClatchy sites at scale).
- Covers internal McClatchy newspaper syndication only (Sac Bee ↔ Miami Herald ↔ Kansas City Star, etc.). External / platform syndication (Field Level Media, MSN, Yahoo News) lives in the Marfeel-per-medium universe below; do not conflate.

Cross-syndication distortion screen—Marfeel per-medium (`MARFEEL_ARTICLE_BY_MEDIUM`)

Source: MCC_PRESENTATION.CONTENT_SCALING_AGENT.MARFEEL_ARTICLE_BY_MEDIUM—fed by Marfeel API ingest path #3 (data-engineer-built; commitment 2026-05-08, ship date TBD)
Contains: One row per (article × medium × date). The medium field identifies the syndication target—origin domain, Field Level Media, MSN, Yahoo News, Apple News partner feeds, etc.
Window: Trailing rolling window per Marfeel API limits (initial scope: lifetime per article)
Why it exists (full context): exec/leadership flagged 2026-05-06—distribution hand-picking already-strong stories for cross-syndication “juices” apparent topic performance. His canonical example was a Field-Level-Media-syndicated article spread across ~24 syndication targets whose TOTAL_PVS distorted any rollup that consumed it. The distortion is selection-on-success: syndication is downstream of strong early performance on the origin publication, not a topic-strength signal. Operators reading inflated cluster medians + topic averages over-commission on juiced topics rather than topics with native strength—the “outcomes loop” gets contaminated. The screen separates ORIGIN_PVS (home-publication only) from SYNDICATED_PVS (everything else) and tags each article with SYNDICATION_JUICE ∈ {none, light, heavy}. Performance-signal columns (is_hit, article_vs_co_median, cluster_vs_co_median, author_hit_*) switch to origin-PVs basis. Reach + revenue columns stay total-PVs basis (every PV is real revenue regardless of medium). Operationalizes the cross-syndication data-bias caveat (Strategic Framework #16, 2026-05-05 TH Team Meeting). Closes the loop with exec/leadership via a spot-check showing his 2026-05-06 example article correctly flagged heavy post-feed-land.
Caveats:
- External / platform syndication only. McClatchy-internal syndication is in STORY_TRAFFIC_METRICS (above).
- Articles outside Marfeel’s universe (or feed lag) get NULL rows—the model_tracker join is fail-open: ORIGIN_PVS falls back to TOTAL_PVS, SYNDICATION_JUICE defaults to none. Pre-feed-land state is the same as no-screen state.
- Juice tier thresholds (heavy ≥10 sites & ≥60% syndicated; light ≥3 & ≥30%) are v0; calibrate against actual distribution after 2-3 weeks of production data.
- The screen is a methodology change, not a hide-juiced policy. Heavily-syndicated articles still appear in every dashboard + still earn revenue + still count toward authors’ total reach. The change is what counts as performance signal, not what counts as reach.
Implementation: ops-hub/docs/cross-syndication-screen.md (master spec—full design, SQL diffs, ship sequence). data-headlines/dev-docs/cross-syndication-screen-ui.md (operator-facing UI—chip + filter). data-keywords/dev-docs/cross-syndication-screen-impact.md (downstream consumer pre-flight).
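For orientation, the v0 juice tiers and the fail-open fallback described in the caveats can be sketched as a single function. This is not the master spec's SQL. The function and parameter names are hypothetical, and two readings are assumed: "sites" means the count of syndication targets for the article, and "% syndicated" means syndicated PVs as a share of total PVs.

```python
from typing import Optional


def syndication_screen(
    total_pvs: int,
    origin_pvs: Optional[int],
    syndication_sites: Optional[int],
) -> dict:
    """Return the origin-PVs basis and juice tier for one article."""
    # Fail-open: with no Marfeel row, behave exactly as if the screen did not exist.
    if origin_pvs is None or syndication_sites is None:
        return {"origin_pvs": total_pvs, "juice": "none"}
    # Assumption: syndicated PVs are everything that is not origin PVs.
    syndicated_pvs = max(total_pvs - origin_pvs, 0)
    share = syndicated_pvs / total_pvs if total_pvs > 0 else 0.0
    # v0 thresholds from the caveats above; expected to be recalibrated.
    if syndication_sites >= 10 and share >= 0.60:
        juice = "heavy"
    elif syndication_sites >= 3 and share >= 0.30:
        juice = "light"
    else:
        juice = "none"
    return {"origin_pvs": origin_pvs, "juice": juice}
```

Only the performance-signal columns would consume `origin_pvs` from this result. Reach and revenue columns keep reading `total_pvs`, consistent with the methodology note above.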

Newsletter attribution (`NEWSLETTER_LINK_HEADLINES`)

Source: MCC_PRESENTATION.TABLEAU_REPORTING.NEWSLETTER_LINK_HEADLINES—79K rows, per-campaign article attribution
Contains: URL × newsletter campaign × click counts
Caveats: Coverage concentrated in newsroom brands (Sac Bee, Miami Herald); sparse for Us Weekly / Woman’s World.

Revenue and ad yield (multiple universes)

PRIMARY SOURCE (pending access, flagged 2026-04-21): MCC_RAW.TEMP.BURT_INTELLIGENCE—data team confirmed this is the canonical dataset for per-market programmatic eCPM. Daily refresh. Built by data team. Access grant pending from data team; GROWTH_AND_STRATEGY_ROLE does not currently have usage on MCC_RAW.TEMP.
Woman’s World article-level revenue (accessible today): MCC_PRESENTATION.TABLEAU_REPORTING.WOMANSWORLD_PAGEPERFORMANCE—daily refresh, per-URL EARNINGS + PAGE_RPM + CPM + VIEWABILITY. Woman’s World only; ~51K rows.
Direct-sold metadata (Naviga): MCC_RAW.SIGMA.VIEW_NAVIGA_FLASH_DAILY_*—sales campaign metadata, not article-level. Useful context, not direct enrichment.
Market-level forecasts (contextual): MCC_PRESENTATION.TABLEAU_REPORTING.KPI_DIGITAL_REVENUE_* series—aggregated market-level revenue forecasts; not per-article.

Usage rules (exec/leadership + data team, C&P Weekly 2026-04-20)—MUST be applied by every consumer of eCPM or revenue data:

Programmatic baseline only; exclude direct-sold “gravy”. The stable-state programmatic number is the safe decisioning baseline. Direct-sold is variable and shouldn’t drive content decisions. When BURT access lands, use the programmatic-only slice, not the combined.
Monthly volatility is ~35%. Holidays run strong (Nov-Dec), January runs soft. Use stable-state baselines (roughly last-summer averages), NOT spot monthly values. For Kansas City, data team’s canonical stable number is ~$130.
Page views are primary; eCPM is tiebreaker. Per data team: “$1 CPM × 5× page views beats $2 CPM × 1× page views.” Never flip this hierarchy.
Market authority trumps eCPM for search-driven content. Tier 1 markets (Miami, KC, DFW) have stronger domain authority and outperform smaller-eCPM markets (Myrtle Beach, Bradenton) in search regardless of their eCPM. Prefer larger markets for search-dependent content.
Category/section eCPM variance matters. Certain sections (real estate, specific verticals) have direct-sold advertiser interest that makes them more valuable per PV than blended brand-level averages suggest. Surface high-eCPM-section signal when available.
Sigma dashboards underestimate. The live STAR-Automation Sigma workbook includes GAM programmatic display only. It excludes video and Taboola. The “complete programmatic” picture (what’s in BURT_INTELLIGENCE) includes those.

These rules are tier-4-encoded (per the usage-rules-tied-to-universes pattern above) as a literal OUTLET_ECPM_CAVEATS text column on every row of TRACKER_ENRICHED. Downstream consumers cannot use the eCPM number without receiving the rules.
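The arithmetic behind the "page views are primary; eCPM is tiebreaker" rule is easy to check. The sketch below assumes CPM means revenue per 1,000 page views (treating one page view as one impression, a simplification) and uses a hypothetical base of 100,000 page views. The ranking helper reads "tiebreaker" literally, as applying only when page views are equal, which is also an assumption.

```python
def programmatic_revenue(page_views: int, cpm_dollars: float) -> float:
    """Revenue in dollars, assuming CPM is dollars per 1,000 page views."""
    return page_views / 1000 * cpm_dollars


def rank_candidates(candidates: list[dict]) -> list[dict]:
    """Order by page views first; eCPM only breaks ties between equal page views."""
    return sorted(candidates, key=lambda c: (c["page_views"], c["ecpm"]), reverse=True)


base_pvs = 100_000  # hypothetical volume, for illustration only
high_volume = programmatic_revenue(5 * base_pvs, 1.0)  # $1 CPM at 5x page views
high_yield = programmatic_revenue(1 * base_pvs, 2.0)   # $2 CPM at 1x page views
print(high_volume, high_yield)  # prints 500.0 200.0
```

At these numbers the high-volume option earns 2.5 times the high-yield option, which is why doubling eCPM does not compensate for a fivefold gap in page views.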

Content team lead’s tracker (Google Sheet)

Source: the content team lead’s Google Sheet → MCC_RAW.GROWTH_AND_STRATEGY.NATIONAL_CONTENT_TRACKER (via ingest_tracker.py, twice daily Mon-Fri, 10:13 + 18:13 CDT)
Scope: National team commissioned content only
Caveats:
- exec/leadership reframe (2026-04-20): this is a production operations doc, not purely an analytical tracker
- The content team lead’s team may enter data without close attention to detail, and the content team lead cleans it up manually. Integrate the data with the awareness that raw inputs are not always clean.
- Going forward (per 2026-04-20): only the human-created cluster ID (hCID) is strictly needed from this sheet—every other column is derivable from Snowflake

Snowflake is the validated boundary

A corollary to the labeling rule, made explicit 2026-04-21 after exec/leadership’s original directive was extended to column-level provenance:

MCC_PRESENTATION.CONTENT_SCALING_AGENT.* is the only sanctioned schema for production consumers. Reports, agents, skills, dashboards, and talking points read from this schema (or artifacts derived from it)—not from raw upstream sources.
Every column in every CSA-schema table must have documented provenance. The canonical place for this is ops-hub/SNOWFLAKE.md §7 (TRACKER_ENRICHED) and §8 (TRACKER_WEEKLY). Each column must declare:
- Source—which upstream table / feed / sheet the data ultimately originates from
- Collects / aggregates—what the upstream source actually records, at what grain
- Preprocessing—what the pipeline does to the raw value (dedup, canonicalize, filter, compute, fill)
- Validation—how we know the value is valid (which safety gate guards it, which constraint it must satisfy)
If any of the four cannot be stated, the column must be removed or quarantined until the gap is closed.
External data cannot reach consumers directly. Tarrow XLSX, Amplitude API pulls, Google Sheets, SEMrush endpoints, GA, and any other external source must first land in Snowflake via a vetted ingest pipeline. The path is always: raw source → intake/staging → vetted model routine → CSA-schema output table → consumer. Consumers do not bypass this.
Intake vetting for new external sources. Before a new source enters the pipeline, the following must be answered in writing (in SNOWFLAKE.md, PIPELINE.md, or the relevant ingest script’s header comments):
- What does the source collect? At what grain? With what latency?
- What are its known quirks, gaps, or integration caveats?
- What preprocessing protects downstream consumers from deceptive outputs?
- How do we detect when the source silently changes or goes bad?
Unknown-source data must be blocked, not used. Data columns whose source is unverified (example: the L&E page view column in content team lead’s tracker as of 2026-04-21) cannot be incorporated into reports or dashboards until the source is identified and labeled.
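The column-level rules above (four provenance fields, quarantine on any gap, block unverified sources) could be enforced with a check like the following. It assumes the provenance rows from SNOWFLAKE.md have been loaded into a dict keyed by column name. The function name, the field keys and the `le_page_views` column are hypothetical, and treating any source that starts with "UNCONFIRMED" as unverified is an assumption.

```python
# The four declarations every CSA-schema column must carry.
PROVENANCE_FIELDS = ("source", "collects", "preprocessing", "validation")


def columns_to_quarantine(provenance: dict[str, dict]) -> list[str]:
    """Return columns whose provenance is incomplete or whose source is unverified."""
    flagged = []
    for column, row in provenance.items():
        incomplete = any(not str(row.get(field, "")).strip() for field in PROVENANCE_FIELDS)
        # Assumption: an unverified source is recorded with an "UNCONFIRMED" prefix.
        unverified = str(row.get("source", "")).upper().startswith("UNCONFIRMED")
        if incomplete or unverified:
            flagged.append(column)
    return sorted(flagged)
```

Run as a pre-merge check, a non-empty result would stop a column from shipping until its provenance row is complete.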

Enforcement

Agents and skills: any agent or skill that produces a data-driven artifact must emit the data universe in its output. Agents should refuse data without declared universes at the intake boundary.
Reports (docx, markdown, HTML): must carry a universe header.
Dashboards: must surface the universe in a persistent location (site footer, header banner, or tooltip).
Slack / email to stakeholders: data-bearing messages open with the universe.
Commits: pipeline code that emits data artifacts must include the universe block in generation templates.
Column-provenance documentation: every new column added to a CSA-schema table must land with its full provenance row in SNOWFLAKE.md in the same commit as the code that creates it. No column ships without documentation.

Anti-patterns

Do not:

Send a chart or number without stating its source.
Compare two numbers without confirming both universes are comparable.
Use pre-Aug-2025 L&E Amplitude data without the integration-gap caveat.
Treat Tarrow platform-side data as equivalent to O&O click-through data.
Incorporate the content team lead’s L&E PV column into analysis before its source is verified and labeled.
Assume “Amplitude” is one universe—it’s a set of event tables each with different coverage by brand and time.

Editorial fact-checking lives in the Claims Validation standard (§9). Data universe labeling is about data provenance; claims validation is about claim accuracy in CSA output. Both are required, and their scopes differ.
Canonical Snowflake reference: SNOWFLAKE.md in ops-hub.
Canonical pipeline reference: PIPELINE.md in ops-hub.
National team portfolio scope: national-portfolio.js in ops-hub.