Pitfall #14: Not Reducing to Lean Event Tables
The Trap
You’ve avoided pitfall #13. You’re not extracting everything. But you’re still running your analysis directly against the raw events_* tables — or against a lightly transformed version that’s basically the same schema with some columns renamed. Every query still scans nested arrays, processes hundreds of gigabytes, and takes thirty seconds to return. Your analysts grumble about query times. Your finance team grumbles about the bill. And nobody wants to iterate on an analysis that costs $2 every time you hit Run.
Why It Happens
There’s a gap between “I know I shouldn’t extract everything” and “I’ve actually built the lean tables I need.” Pitfall #13 is about not pulling too much. This pitfall is about the positive design step that comes after — actually building purpose-built tables with only the fields your use cases require.
Many teams stop at the raw layer because building the clean layer feels like a big project. They want to “get it right” before committing to a table design. So they keep querying the raw data directly, telling themselves the clean model is coming “next sprint.” Months pass.
The Fix
Once you’ve made your design decisions — identity strategy, touchpoint definition, use cases — you know exactly what you need. The answer is almost always surprisingly small.
A typical lean event table for attribution work might look like this:
```sql
CREATE OR REPLACE TABLE `your_project.analytics_123456789.lean_events`
PARTITION BY event_date
AS
SELECT
  PARSE_DATE('%Y%m%d', event_date) AS event_date,
  event_timestamp,
  event_name,
  user_pseudo_id,
  user_id,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source') AS source,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'medium') AS medium,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'campaign') AS campaign,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location
FROM `your_project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260325';
```

That’s it: five event parameters extracted from the nested array, plus the standard identity and timestamp columns — ten columns in total. Everything is flat, with no nested arrays and no UNNEST in downstream queries. And the table is partitioned by date, so queries scan only the days they need.
The difference is dramatic:
- Cost. Queries against this table scan a fraction of the bytes — expect a 10x–50x reduction, depending on how many event parameters your raw data carries.
- Speed. Flat columns plus partition pruning mean sub-second response times for queries that used to take thirty seconds.
- Clarity. An analyst can open this table, see ten columns with clear names, and immediately understand what’s available. No digging through nested structs, no guessing which `value` subfield to use.
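To put rough numbers on the cost point: BigQuery’s on-demand model bills by bytes scanned. The sketch below assumes a rate of about $6.25 per TiB (check your region’s current pricing) and uses illustrative scan sizes — both the rate and the byte counts are placeholder assumptions, not measurements from any real project:

```python
# Rough cost model for BigQuery on-demand pricing.
# ASSUMPTION: ~$6.25 per TiB scanned; verify against your region's rate card.
PRICE_PER_TIB = 6.25
TIB = 2**40  # bytes in one tebibyte

def query_cost(bytes_scanned: int) -> float:
    """Dollar cost of a single on-demand query scanning `bytes_scanned` bytes."""
    return bytes_scanned / TIB * PRICE_PER_TIB

# Hypothetical scan sizes: a raw events_* query vs. the same query on a lean table.
raw_bytes = 320 * 10**9   # 320 GB of nested event_params across the date range
lean_bytes = 8 * 10**9    # 8 GB of flat, partition-pruned columns

print(f"raw:  ${query_cost(raw_bytes):.2f} per run")
print(f"lean: ${query_cost(lean_bytes):.2f} per run")
print(f"reduction: {raw_bytes // lean_bytes}x fewer bytes scanned")
```

At these assumed sizes, the raw query lands in the $2-per-run territory described above, while the lean version costs pennies — which is the difference between an analysis people iterate on and one they avoid.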
And here’s a benefit that’s becoming increasingly important: a lean table is what makes AI-assisted analysis possible. An LLM — whether it’s powering a chatbot, a code assistant, or an autonomous agent — can reason about 6-10 clearly named columns. It cannot usefully reason about 200 nested event parameters. If you want to use agents or natural-language interfaces on your analytics data, lean tables aren’t optional. They’re a prerequisite.
Schedule this as a daily or weekly job (depending on your freshness requirements), and your analysts never touch the raw export again for routine work. The raw tables are still there for edge cases — but those should be rare.
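One way to set up that daily job — sketched here under the assumption that the `lean_events` table above already exists and that GA4’s daily export for the previous day has landed — is a scheduled INSERT that touches only yesterday’s shard:

```sql
-- Hypothetical daily append job (run via BigQuery scheduled queries or your orchestrator).
-- Pulls only yesterday's daily export table, so each run scans one day of raw data.
INSERT INTO `your_project.analytics_123456789.lean_events`
SELECT
  PARSE_DATE('%Y%m%d', event_date) AS event_date,
  event_timestamp,
  event_name,
  user_pseudo_id,
  user_id,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source') AS source,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'medium') AS medium,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'campaign') AS campaign,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location
FROM `your_project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY));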
How to Check If You Have This Problem
Run the same analysis query twice — once against the raw table, once against a lean version — and compare the bytes processed and elapsed time.
First, check a typical query against raw:
```sql
-- Run this and note the bytes processed
SELECT
  PARSE_DATE('%Y%m%d', event_date) AS day,
  COUNT(DISTINCT user_pseudo_id) AS users,
  COUNT(DISTINCT CONCAT(
    user_pseudo_id, '.',
    CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
  )) AS sessions
FROM `your_project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260301' AND '20260325'
GROUP BY day
ORDER BY day;
```

Then run the equivalent against a lean table (after creating one):
```sql
SELECT
  event_date AS day,
  COUNT(DISTINCT user_pseudo_id) AS users,
  COUNT(DISTINCT CONCAT(user_pseudo_id, '.', CAST(ga_session_id AS STRING))) AS sessions
FROM `your_project.analytics_123456789.lean_events`
WHERE event_date BETWEEN '2026-03-01' AND '2026-03-25'
GROUP BY day
ORDER BY day;
```

Compare the bytes processed (shown in the query results panel) and the wall-clock time. If the lean-table version processes 10x fewer bytes and returns in under a second, you’ve proven the case.
Also do a quick column audit on your current working tables. If any table your team queries regularly has more than 10-15 columns, ask yourself: which of these are actually used in downstream analysis? The rest is dead weight — adding cost, slowing queries, and making the schema harder to understand than it needs to be.
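A quick way to run that column audit — assuming your working tables live in the same BigQuery dataset (the project and dataset names here are placeholders) — is the dataset-level INFORMATION_SCHEMA view:

```sql
-- Count columns per table in the dataset; anything far above ~15 deserves scrutiny.
SELECT
  table_name,
  COUNT(*) AS column_count
FROM `your_project.analytics_123456789.INFORMATION_SCHEMA.COLUMNS`
GROUP BY table_name
ORDER BY column_count DESC;
```

Tables at the top of this list are your candidates for the lean-table treatment.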