Module 19 — The Tarpit: Your Data Layer & the Single-Source-of-Truth Myth

"There is no single source of truth. There is a single source of AGREEMENT, and you build it on purpose, brick by governed brick, or you spend the rest of your goddamn career in meetings about why two numbers that should be the same are not — surrounded by people in Patagonia vests, all of whom are technically correct, and every last one of whom can go straight to hell." — an anonymous data lead, three warehouses deep, eyes like a hostage video

It is 1 a.m. and there are four dashboards open on four monitors and every single fucking one of them disagrees about how many deals closed last quarter. The CRM says 87. The BI tool says 84. The board deck — printed, bound, already FedExed — says 91. Finance, who answers to a different and considerably angrier god, says 79 and is prepared to die on that hill, die on it twice, and haunt the wreckage afterward. Nobody is lying. That's the part that will crack your skull open at this hour: every number is "right." They're just right about different questions, pulled at different times, through different filters, by people who never once wrote down what the fuck they meant. Welcome to the tarpit. It is warm. It is patient. It has swallowed entire RevOps careers without blinking, leaving behind nothing but a Patagonia vest floating on the surface and a Slack thread 240 replies deep titled "quick question about the pipeline number" that has become a slow-motion obituary for every analyst who touched it.

THE JOB

The data layer exists to do one savage, unglamorous, deeply unsexy thing: make the whole damn company agree on what a number MEANS before it starts screaming about what the number IS. That's it. The architecture, the warehouse, the pipelines, the semantic layer, the metric dictionary — all of it is infrastructure in service of killing one stupid, recurring, soul-crushing question: whose number is right? The answer, when you build this correctly, should be architecturally impossible to ask. Not mediated in a two-hour Zoom that spirals into philosophy. Impossible.

Here is the lie you must murder on day one, with your bare hands, without ceremony: "single source of truth" is not a tool you buy. No vendor sells it. No CRM is it. No warehouse is it. No $280,000-a-year Snowflake contract delivers it in a box with a bow and a certificate of data purity. The single source of truth is a layered, governed system in which data flows from operational tools into a central store, gets defined ONCE in code, and gets served back out with its definitions tattooed to it. Truth isn't a place. Truth is a discipline. The discipline is governance, and nobody wants to do it because it requires writing things down and owning them and being accountable when the definitions are wrong, which is considerably less fun than buying another dashboard and calling yourself data-driven on LinkedIn while the tarpit opens beneath your feet.

Your job is to build the plumbing AND the rulebook. The plumbing is, I shit you not, the easy half.

THE PLAYBOOK

1. Know the goddamn layers of the modern data stack. Draw them. Memorize them. If you can't diagram this from memory, you don't own your data layer — it owns you.

Revenue data flows through a stack. Learn the layers or drown in the seams between them, which are wet, dark, and smell like a crime scene:

Layer | What lives here | The point
Sources | CRM, billing, product usage, marketing, support | Where raw events are born. Operational, messy, beloved in their filth.
Ingestion (ETL/ELT) | Fivetran, Airbyte, custom pipes | Sucks data out of sources and dumps it into the warehouse. The plumbing.
Warehouse | Snowflake, BigQuery, Redshift, Databricks | The central store. All the raw shit in one place. Cheap storage. The spine.
Transformation | dbt and its disciples | Raw → clean, modeled, business-ready. The engine of meaning.
Semantic / metrics layer | dbt Semantic Layer, Cube, LookML | Where a metric is DEFINED once, centrally, in code you can audit and version-control.
Consumption | BI dashboards, notebooks, reverse ETL | Where humans, and tools, read the truth. Or what passes for it.

The warehouse is the spine. Upstream feeds it; downstream reads from it. Anything that bypasses the spine and queries the source directly — a BI tool connected straight to Salesforce, a VP who pulls his own CSV exports — is a heretic building a second source of truth with their own hands, and they will burn your governance to the ground with a confidence that is genuinely impressive to witness.

2. Get ELT vs ETL straight. They are not the same word with different letters, and saying them interchangeably marks you as someone who shouldn't be running this conversation.

  • ETL — Extract, Transform, Load. Old world. Clean the data before it lands in the warehouse. Rigid, expensive to change, tidy on arrival. Great until requirements change — which they will, approximately every nine working days, because the business is a chaotic organism that cannot agree on what "customer" means and will die without ever resolving that argument.

  • ELT — Extract, Load, Transform. Modern default. Dump the raw data into the warehouse first — storage is cheap as hell now, practically a rounding error on your monthly bill — then transform it in place with SQL and dbt. Load everything; model it later; never throw away the raw. The day you need to backfill a metric you didn't anticipate is the day you discover whether past-you was a genius or a short-sighted bastard who deleted the source columns to save eighteen dollars and now owes the company a quarter's worth of backfilled revenue attribution.

RULE No. 19: Land raw, transform in the warehouse. The day you need a metric you didn't anticipate, you'll send a silent prayer of thanks to the version of yourself that didn't clean the raw data into oblivion like a goddamn amateur chasing tidiness over utility.
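
A minimal sketch of the rule, assuming an in-memory SQLite database standing in for the warehouse and a made-up `raw_opportunities` table (neither is from any real stack). The raw rows land untouched, the first model is a view on top of them, and the metric nobody anticipated gets built later from a column the first model never used.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")

# LOAD: land every source column as-is, even the ones nobody asked for yet.
conn.execute(
    "CREATE TABLE raw_opportunities "
    "(id TEXT, stage TEXT, amount REAL, region TEXT, loaded_at TEXT)"
)
conn.executemany(
    "INSERT INTO raw_opportunities VALUES (?, ?, ?, ?, ?)",
    [  # hypothetical CRM export
        ("opp-1", "closed_won", 50000, "AMER", "2024-01-01T02:00:00Z"),
        ("opp-2", "closed_lost", 20000, "EMEA", "2024-01-01T02:00:00Z"),
        ("opp-3", "closed_won", 30000, "EMEA", "2024-01-01T02:00:00Z"),
    ],
)

# TRANSFORM: model in place. The view reads the raw table, it never replaces it.
conn.execute(
    "CREATE VIEW won_revenue AS "
    "SELECT SUM(amount) AS total FROM raw_opportunities "
    "WHERE stage = 'closed_won'"
)

# Months later: a metric nobody anticipated. Possible only because 'region'
# was never cleaned out of the raw table.
conn.execute(
    "CREATE VIEW won_revenue_by_region AS "
    "SELECT region, SUM(amount) AS total FROM raw_opportunities "
    "WHERE stage = 'closed_won' GROUP BY region"
)

total = conn.execute("SELECT total FROM won_revenue").fetchone()[0]
by_region = dict(conn.execute("SELECT region, total FROM won_revenue_by_region"))
raw_count = conn.execute("SELECT COUNT(*) FROM raw_opportunities").fetchone()[0]
```

In a real stack the views would be dbt models and the load step would belong to the ingestion tool. The shape is the point: raw stays, models stack on top.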

3. Understand reverse ETL — the move that makes the warehouse operational instead of a beautiful, expensive, useless museum.

ELT pulls data IN. Reverse ETL pushes governed data back OUT — from the warehouse into the operational tools: the CRM, the ad platforms, the sequencer, the CS health dashboard, the thing the rep actually opens in the morning. This is how a clean "account health score" that you computed in the warehouse — with real definitions, real lineage, real governance — shows up as a field in Salesforce that a human can act on without logging into four systems and doing arithmetic in their head.

Without reverse ETL, the warehouse is a correct, governed, well-tested, utterly inert monument to data nobody uses, a shrine to accuracy that produces zero behavioral change and sits quietly while reps make decisions from the same stale Salesforce fields they've always used. With it, truth becomes operational. Tools: Hightouch, Census. Budget for this shit. It is not optional if you want the warehouse to actually change behavior and not just validate your architecture slides in a room where everyone nods and then goes back to pulling CRM exports. The difference between a data warehouse and a data strategy is reverse ETL doing the last goddamn mile.
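
A rough sketch of that last mile, under loud assumptions: SQLite plays the warehouse, `FakeCRM` is a stand-in class invented here, and the field names `Health_Score__c` and `Health_Score_As_Of__c` are hypothetical. A real sync would go through Hightouch, Census, or the CRM vendor's own API, none of which are modeled below.

```python
import sqlite3

# The governed score lives in the warehouse (SQLite stands in for it here).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE account_health "
    "(account_id TEXT PRIMARY KEY, health_score INTEGER, computed_at TEXT)"
)
warehouse.executemany(
    "INSERT INTO account_health VALUES (?, ?, ?)",
    [
        ("acct-1", 82, "2024-03-01T02:00:00Z"),
        ("acct-2", 41, "2024-03-01T02:00:00Z"),
    ],
)


class FakeCRM:
    """Stand-in for the operational tool; not any vendor's real client."""

    def __init__(self):
        self.accounts = {"acct-1": {"Health_Score__c": 70}, "acct-2": {}}

    def update(self, account_id, fields):
        self.accounts.setdefault(account_id, {}).update(fields)


def reverse_etl(conn, crm):
    """Push warehouse scores into the CRM, writing only rows that changed."""
    pushed = 0
    query = (
        "SELECT account_id, health_score, computed_at "
        "FROM account_health ORDER BY account_id"
    )
    for account_id, score, computed_at in conn.execute(query):
        current = crm.accounts.get(account_id, {}).get("Health_Score__c")
        if current == score:
            continue  # already in sync, skip the write
        crm.update(
            account_id,
            {"Health_Score__c": score, "Health_Score_As_Of__c": computed_at},
        )
        pushed += 1
    return pushed


crm = FakeCRM()
first_run = reverse_etl(warehouse, crm)   # both accounts are stale
second_run = reverse_etl(warehouse, crm)  # nothing left to push
```

The score ships with its own timestamp, so the field the rep sees in the CRM carries the same "as of" the warehouse does.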

4. Build the semantic / metrics layer. This is the actual cure. Everyone skips it. Everyone suffers. The suffering is avoidable and entirely self-inflicted.

This is the part that separates the operators who've been to data hell before from the ones who are about to check in. A semantic layer — metrics layer, headless BI layer, call it what you want — is one central place, in version-controlled code, where each metric is defined once, so every downstream tool computes it identically. Not approximately. Not "pretty close." The same fucking number, every single time, whether the BI tool, the board deck template, or the embedded CRM widget is asking.

For each core metric, define and commit to code:

  • The exact formula. What rows, what aggregation, what time window. Not vibes. Actual math. If you can't write it down, you don't know what you're measuring and you should be embarrassed about that.
  • The filters baked in. Does "ARR" exclude one-time professional services fees? Does it include the renewal that closed today at 11:59 p.m.? Does it count the deal that "won" into a customer who immediately filed a chargeback and sent a cease-and-desist? WRITE IT THE HELL DOWN, in the definition, not in someone's head.
  • The grain. Per account? Per opportunity? Per fiscal day? Per fiscal quarter as defined by a CFO who changed the fiscal year in 2022 and told approximately nobody?
  • The owner. A human name. Not "RevOps." Not "Data Team." A specific, findable, reachable person whose job it is to know when this metric is wrong and why.

When the BI tool, the board deck, and the embedded widget all query "Net New ARR," they hit the same definition and return the same number. The disagreement doesn't get mediated in a two-hour meeting that leaves everyone exhausted and nothing resolved. It becomes architecturally impossible. That's the win. That's the only win that lasts.
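
Here is a toy version of the idea, not the API of dbt Semantic Layer, Cube, or LookML. The `Metric` class, the `arr_movements` table, its columns, and the owner name are all invented for illustration. What it shows is the mechanism: formula, filters, grain, and owner live in one definition, and every consumer goes through it.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    table: str
    aggregation: str   # the exact formula
    filters: tuple     # the filters baked in
    grain: str         # the column the metric is reported per
    owner: str         # a human name, not a team

    def to_sql(self) -> str:
        where = " AND ".join(self.filters) or "1=1"
        return (
            f"SELECT {self.grain}, {self.aggregation} AS value "
            f"FROM {self.table} WHERE {where} "
            f"GROUP BY {self.grain} ORDER BY {self.grain}"
        )


# Defined ONCE. Every name below is hypothetical.
METRICS = {
    "net_new_arr": Metric(
        name="Net New ARR",
        table="arr_movements",
        aggregation="SUM(arr_delta)",
        filters=("is_recurring = 1", "is_charged_back = 0"),
        grain="fiscal_quarter",
        owner="Jane Example",
    )
}


def query_metric(conn, key):
    """The only door to the number: every consumer calls this."""
    return conn.execute(METRICS[key].to_sql()).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE arr_movements (fiscal_quarter TEXT, arr_delta INTEGER, "
    "is_recurring INTEGER, is_charged_back INTEGER)"
)
conn.executemany(
    "INSERT INTO arr_movements VALUES (?, ?, ?, ?)",
    [
        ("FY24-Q1", 100000, 1, 0),  # new recurring business
        ("FY24-Q1", 25000, 0, 0),   # one-time services fee: excluded
        ("FY24-Q1", 40000, 1, 1),   # charged back: excluded
        ("FY24-Q1", -10000, 1, 0),  # churn
    ],
)

bi_tool = query_metric(conn, "net_new_arr")
board_deck = query_metric(conn, "net_new_arr")
crm_widget = query_metric(conn, "net_new_arr")
```

Three consumers, one definition, one answer. Change the filter in `METRICS` and all three move together, which is the whole trick.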

5. Write the metric dictionary. Treat it like scripture. Maintain it like a living organism. Burn anyone who presents an undefined number on a slide.

A shared, public, findable document — not a Wiki page nobody has touched since 2021, not a Google Doc titled "Metrics v3 FINAL FINAL use this one FOR REAL this time," not a Notion page that requires four clicks and a prayer to locate — where every number that appears on any dashboard has an entry:

  • Plain-English name that a new hire can parse without asking anyone to explain it to them.
  • The precise definition and formula. No ambiguity. No "it depends on context." Write the context into the definition.
  • Source tables and lineage. Where does this come from, and how many transforms has it passed through?
  • Refresh cadence. Real-time? Hourly? Daily at 2 a.m. when the Snowflake job runs? Stamp this on every single output. Anyone who presents a number without saying when it was pulled is hiding something or doesn't know and should be embarrassed — and if they're senior, should be embarrassed in public.
  • The owner. Full name. Team. A human being who signed on the dotted line and will answer when it's wrong.

If a number isn't in the dictionary, it isn't allowed on a board slide. I don't care how confident the SVP is. I don't care if the CEO asked for it at 10 p.m. No definition, no entry, no slide. This rule is the rule. The moment you make one exception, you've created a precedent that metastasizes into six undefined metrics on the next board deck, all of which disagree with each other in ways nobody can explain because nobody wrote a damn thing down.
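
The no-definition-no-slide rule is mechanical enough to enforce in code. A minimal sketch, assuming the dictionary is kept as structured data; the entries, table names, and the `slide_violations` helper are made up for illustration.

```python
# The five fields every dictionary entry must carry, per the list above.
REQUIRED_FIELDS = ("name", "definition", "source_tables", "refresh_cadence", "owner")

# Hypothetical dictionary: one complete entry, one rotten one.
METRIC_DICTIONARY = {
    "net_new_arr": {
        "name": "Net New ARR",
        "definition": "SUM(arr_delta) over recurring, non-charged-back "
                      "movements, per fiscal quarter",
        "source_tables": ["raw.billing_invoices", "analytics.arr_movements"],
        "refresh_cadence": "daily, 02:00 UTC",
        "owner": "Jane Example, RevOps",
    },
    "pipeline_coverage": {
        "name": "Pipeline Coverage",
        "definition": "",   # nobody wrote it down
        "source_tables": ["analytics.pipeline"],
        "refresh_cadence": "hourly",
        "owner": "",        # nobody signed for it
    },
}


def slide_violations(slide_metrics, dictionary):
    """Return one reason per metric that is not allowed on the slide."""
    problems = []
    for key in slide_metrics:
        entry = dictionary.get(key)
        if entry is None:
            problems.append(f"{key}: not in the dictionary")
            continue
        missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
        if missing:
            problems.append(f"{key}: missing {', '.join(missing)}")
    return problems


clean_slide = slide_violations(["net_new_arr"], METRIC_DICTIONARY)
dirty_slide = slide_violations(
    ["net_new_arr", "vibes_index", "pipeline_coverage"], METRIC_DICTIONARY
)
```

An empty list means the slide ships. Anything else goes back to whoever made it, SVP or not.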

6. Diagnose why two dashboards disagree. There are exactly five reasons. Memorize them — they end 90% of the arguments before they become political incidents and HR complaints.

When pipeline in dashboard A ≠ pipeline in dashboard B, it is always one of these. Not "might be." Always:

  1. Different definitions. "Pipeline" = all open opps here, qualified-and-beyond there. The most common cause. The most fixable cause. Fix it in the semantic layer and it's fixed everywhere, not just in the one dashboard you patched while the other one keeps lying to people.

  2. Different refresh times. A synced six hours ago; B is live. Both are right about different moments in time. Neither is lying. The fix is simple and nobody does it: stamp every goddamn dashboard with "as of [timestamp]" so nobody pretends two numbers pulled six hours apart describe the same moment.

  3. Different filters. One silently excludes closed-lost, or a rogue sales region someone named "AMER_EXCLUDE_LEGACY" and forgot about three years ago, or the self-serve segment that nobody technically "owns." Silent filters are land mines that blow up six months after the person who buried them left the company. Document every filter or accept that your dashboards are lying in ways you can't diagnose and your pipeline conversations are arguments about invisible assumptions. This is stupid and avoidable and yet here we are, every quarter, having the same damn meeting.

  4. Different lineage. Dashboard A reads the CRM directly. Dashboard B reads a warehouse table that's three dbt transforms downstream and hasn't refreshed since the Snowflake credits ran out on the 28th. They've diverged and neither one has the decency to say so.

  5. Different timezone or date logic. "Closed this quarter" in UTC vs. Pacific moves deals across the quarter boundary like ghosts. The Number does not care about your timezone preference. It will, however, use your timezone to humiliate you in front of the CFO with the impersonal precision of a machine that has no feelings about your pain.
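
The five causes are a checklist, and a checklist can be code. A sketch, assuming each dashboard publishes its own metadata; the `DashboardMeta` fields and the 15-minute tolerance on refresh skew are illustrative choices, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DashboardMeta:
    definition: str         # what the metric means on this dashboard
    refreshed_at: datetime  # the "as of" timestamp
    filters: frozenset      # every filter, including the silent ones
    lineage: tuple          # hops from source to dashboard
    tz: str                 # timezone used for date boundaries


def diagnose(a, b, max_skew=timedelta(minutes=15)):
    """Return which of the five causes apply, in the order listed above."""
    reasons = []
    if a.definition != b.definition:
        reasons.append("different definitions")
    if abs(a.refreshed_at - b.refreshed_at) > max_skew:
        reasons.append("different refresh times")
    if a.filters != b.filters:
        reasons.append("different filters")
    if a.lineage != b.lineage:
        reasons.append("different lineage")
    if a.tz != b.tz:
        reasons.append("different timezone or date logic")
    return reasons


# Hypothetical dashboards: same definition, everything else diverged.
dash_a = DashboardMeta(
    definition="all open opps",
    refreshed_at=datetime(2024, 4, 1, 6, 0, tzinfo=timezone.utc),
    filters=frozenset({"exclude closed_lost"}),
    lineage=("salesforce",),
    tz="UTC",
)
dash_b = DashboardMeta(
    definition="all open opps",
    refreshed_at=datetime(2024, 4, 1, 12, 0, tzinfo=timezone.utc),
    filters=frozenset(),
    lineage=("salesforce", "warehouse.opportunities", "dbt.pipeline"),
    tz="America/Los_Angeles",
)

reasons = diagnose(dash_a, dash_b)
same = diagnose(dash_a, dash_a)
```

This only works if the metadata exists, which is another argument for stamping definition, filters, and refresh time on every dashboard in the first place.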

7. Establish lineage and earn trust, because trust in data is not a feeling — it's a capability.

Data lineage is the documented map of where every number comes from: source → ingestion pipeline → warehouse table → dbt transformation → metric definition → dashboard cell. When someone asks "where does this come from?" and you can answer in under thirty seconds with a link to the graph, you have trust. When you can't, you have a tarpit. When the answer is "I'll have to ask Sarah," you've outsourced your institutional knowledge to a single human being who represents a catastrophic point of failure and who is, frankly, underpaid for the burden.

Use dbt's lineage graph. Use column-level lineage if your tooling supports it. Make it visible, clickable, parseable by a person who joined last week without a guide. Trust in data is the documented, auditable, no-excuses ability to trace any number back to its birth, through every transformation, with timestamps. Without that, you don't own your data. It owns you — and it is a bastard landlord with no fixed address, no listed phone number, and a policy of raising the rent every time the CFO wants to look at a metric you can't fully explain.
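
To make "trace any number back to its birth" concrete, here is a toy lineage graph with a trace function. Every node name is hypothetical, and the real thing is whatever graph your tooling (dbt's lineage graph, for instance) already generates. This only shows what answering the question looks like.

```python
# Each node maps to the nodes it reads from. All names are hypothetical.
LINEAGE = {
    "dashboard.board_deck.net_new_arr": ["metric.net_new_arr"],
    "metric.net_new_arr": ["dbt.arr_movements"],
    "dbt.arr_movements": ["warehouse.raw_invoices", "warehouse.raw_opportunities"],
    "warehouse.raw_invoices": ["ingest.billing_sync"],
    "warehouse.raw_opportunities": ["ingest.crm_sync"],
    "ingest.billing_sync": ["source.billing"],
    "ingest.crm_sync": ["source.crm"],
    "source.billing": [],
    "source.crm": [],
}


def trace(node, graph):
    """Return every path from a source system to `node`, source first."""
    parents = graph[node]
    if not parents:
        return [[node]]  # reached a source: the path starts here
    return [
        path + [node]
        for parent in parents
        for path in trace(parent, graph)
    ]


paths = trace("dashboard.board_deck.net_new_arr", LINEAGE)
for path in paths:
    print(" -> ".join(path))
```

Two paths come back, one from billing and one from the CRM, each passing through ingestion, warehouse, transformation, and metric definition before hitting the dashboard cell. If nobody can produce that for a board number, the answer is still "ask Sarah."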

8. Govern it. Not once. Forever. This is not a project with an end date. This is a practice with a death date if you abandon it.

  • Data contracts between source teams and the warehouse: formal, written agreements that if Sales renames a Salesforce field — if some well-meaning admin, drunk on housekeeping energy at 4 p.m. on a Friday, changes Close_Date to Close_Date_ACTUAL__c — the downstream pipeline does not silently shatter overnight and serve null revenue to the board dashboard the next morning. Source teams do not have unilateral field-destruction rights. That is the social contract. Write it down and make people sign it.

  • Automated tests. dbt tests, freshness checks, row-count assertions, null-rate monitors. Automated alarms that scream — loudly, to the right person, not to a Slack channel nobody watches — when a sync dies, when a value goes null in a column that should never fucking go null, when row volume drops in a way that suggests the source system stopped sending data and nobody in the building has noticed yet. Tests are how you find out about broken data before the CFO finds out about it in the board prep call. You finding out is a fixable problem. The CFO finding out first is a career event with an investigation attached.

  • Quarterly review of definitions and owners. The metric dictionary rots if nobody tends it. People leave the company and take the definitions in their heads with them. Owners change. Fiscal year changes. What "revenue" means changes after the audit. Quarterly review is the minimum viable governance — treat it like paying rent, because that's exactly what it is.
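
A sketch of the first two bullets as one check, assuming SQLite as the warehouse and a contract written as a plain dict. The contract format and the `check_contract` helper are invented here. In practice this is the job of dbt tests and source freshness checks, and a freshness assertion is left out below to keep the example short.

```python
import sqlite3

# Hypothetical contract for one source table.
CONTRACT = {
    "table": "raw_opportunities",
    "required_columns": {"id", "amount", "Close_Date"},
    "not_null": ["amount", "Close_Date"],
    "min_rows": 2,
}


def check_contract(conn, contract):
    """Return a list of failures; an empty list means the contract holds."""
    failures = []
    table = contract["table"]
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    missing = sorted(contract["required_columns"] - columns)
    if missing:
        failures.append(f"missing columns: {', '.join(missing)}")
    for col in contract["not_null"]:
        if col not in columns:
            continue  # already reported as missing
        nulls = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
        ).fetchone()[0]
        if nulls:
            failures.append(f"{col}: {nulls} null value(s)")
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if rows < contract["min_rows"]:
        failures.append(f"row count {rows} below minimum {contract['min_rows']}")
    return failures


# Thursday: the source honors the contract.
healthy = sqlite3.connect(":memory:")
healthy.execute("CREATE TABLE raw_opportunities (id TEXT, amount REAL, Close_Date TEXT)")
healthy.executemany(
    "INSERT INTO raw_opportunities VALUES (?, ?, ?)",
    [("opp-1", 50000, "2024-03-29"), ("opp-2", 30000, "2024-03-30")],
)

# Friday, 4 p.m.: the field got renamed and the sync went sideways.
broken = sqlite3.connect(":memory:")
broken.execute(
    "CREATE TABLE raw_opportunities (id TEXT, amount REAL, Close_Date_ACTUAL__c TEXT)"
)
broken.execute("INSERT INTO raw_opportunities VALUES ('opp-1', NULL, '2024-03-29')")

healthy_failures = check_contract(healthy, CONTRACT)
broken_failures = check_contract(broken, CONTRACT)
```

Wire the non-empty result to a page that reaches a named human, not to a Slack channel nobody watches.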

HOW IT GOES TO HELL

  • The RevOps Martyr becomes the human semantic layer. Every metric request routes through her brain, because the definitions live only in her skull, accumulated over three years of being the only person in the building who cared. She IS the lineage graph. She is load-bearing infrastructure wearing a person's clothes. The day she takes two weeks of PTO, the company forgets what ARR means and the board prep call becomes a goddamn séance — everyone staring at dashboards, trying to channel a spirit who is, at this moment, unreachable and unavailable and has email notifications off. This is not her fault. This is the organization's failure to institutionalize what she knows. She's been holding the building up for years and nobody noticed the building wasn't actually attached to the ground.

  • The VP of Vibes builds his own dashboard. Connects a BI tool directly to Salesforce because the official dashboard takes "too long to load" — it takes four seconds, which is apparently an eternity when you have the patience of a caffeinated terrier and the self-restraint of a golden retriever near an open garbage can. Filters by gut. Screenshots it into the board deck with the confidence of a man who has never had to explain a revenue restatement and doesn't know what one costs. It disagrees with Finance by eleven percent. He calls Finance "too conservative" with the serene certainty of someone who has never once been wrong in print and has no intention of starting. There are now two sources of truth and he built the second one in nine shit-soaked minutes and considers it an act of leadership initiative. It is an act of institutional vandalism and it will haunt the next three board decks.

  • The Founder demands "real-time everything." Pays for streaming pipelines that cost six times the batch alternative, because "latency is death" is something he heard at a conference and he doesn't do half measures. Now the dashboards refresh constantly and disagree constantly, faster. The confusion has not decreased. It has been accelerated, monetized, and optimized for delivery with impressive low latency. Velocity of confusion is still fucking confusion. He wanted real-time; he bought real-time chaos with a beautiful UI.

  • Dr. Tanya Vex (self-described "Data Clarity Architect," 200,000 LinkedIn followers, TEDx talk titled "Data-Driven Is Dead: What Comes After") sells "Single Source of Truth in 30 Days." It's a Snowflake trial, a Looker license, and zero governance. No metric dictionary. No owners. No semantic layer. No contracts between source teams. Just raw data in a fancier container with better color schemes. Ninety days later you have the same four disagreeing numbers, now in a more expensive warehouse, with a framed certificate of completion on the wall and Dr. Tanya somewhere in Tulum, charging for her next cohort, unreachable except through her assistant and her course platform.

  • Nobody owns a metric, so everyone owns it, which means nobody owns it. "Pipeline coverage" is defined four ways across four teams. Each team defends their definition in the QBR with the ferocity of people defending land their ancestors died on. Definitions without owners aren't standards. They're hostages held by competing factions, and the ransom is paid in meeting hours, analyst sanity, and whatever credibility you had with the board before they noticed the numbers change depending on who presents them.

FIELD RULES

  • RULE No. 19A: There is no single source of truth — there's a single source of AGREEMENT, governed on purpose. Stop buying the myth. Start building the discipline.
  • RULE No. 19B: Define the metric once, in the semantic layer, in version-controlled code. Disagreement should be architecturally impossible, not endlessly mediated by exhausted humans in recurring meetings.
  • RULE No. 19C: Land raw, transform in the warehouse. Never throw the source data away. Future you will be grateful; past you is a bastard if you deleted it.
  • RULE No. 19D: Every dashboard wears a definition and an "as of" timestamp, or it doesn't come into the room. No vibes. No "it's probably close enough." Close enough is how you end up explaining a restatement.
  • RULE No. 19E: If a number isn't in the metric dictionary, it isn't real. No definition, no slide. I don't give a damn how senior the person holding it is.
  • RULE No. 19F: Lineage is trust. Trace any number to its source in thirty seconds or you don't own your data stack. It owns you.

From the field: I once watched a company spend $1.4M on a warehouse migration to achieve "one source of truth," and on the day the migration went live there were still three different revenue numbers floating around, because not one soul in the entire building — not the data engineer, not the VP of Analytics, not the CFO who signed the purchase order for the warehouse contract — had ever written down what "revenue" actually, precisely, unambiguously meant. The tarpit doesn't want your architecture. It doesn't want your Snowflake credits or your dbt Cloud subscription. It wants your definitions, and it will take them slowly, silently, one un-owned metric at a time, until the only honest thing you can say in a board meeting is "it depends on how you define it" — which is the sentence that ends careers. Expense report attached. My attorney advises that "as of" timestamps are defensible and admissible and that I should put them on everything I sign, including this.