Pitfall #13: Over-Extracting Just in Case

By Timo Dechau · Last updated March 25, 2026

You’re building your GA4 data model in BigQuery. You know the raw export is messy — nested arrays, hundreds of event parameters, cryptic key names. So you decide to be thorough. You unnest everything. Every event parameter, every user property, every item field. You extract all 200+ parameter keys into their own columns and pipe the whole thing into a staging table. Because what if someone needs engagement_time_msec six months from now? What if entrances turns out to be important? Better to have it and not need it than need it and not have it. Right?

Six months later, your staging table is enormous. Queries take minutes instead of seconds. Your BigQuery bill has a line item nobody wants to explain. And the worst part — nobody is using 90% of those columns. They never did.

It’s a fear-driven design choice, and I get it. The GA4 export has a genuinely intimidating number of parameters. When you run a quick count of distinct event_params.key values across your dataset, you’ll typically see 50-200+ unique keys. Some are standard GA4 parameters, some were set by your implementation, and some are mysterious leftovers from plugins or tag configurations nobody remembers setting up.

The instinct is to preserve everything because re-processing historical data is expensive. If you don’t extract a parameter now and realize you need it later, you’ll have to go back and scan the raw tables again. That’s a real cost, and it makes “extract everything” feel like the safe choice.

But “safe” has its own costs. Every additional column in your model means more bytes stored, more bytes scanned on every query, and more cognitive load for anyone trying to understand the table. A model with 200 columns is a model nobody can reason about. Analysts open it, see a wall of fields, and either pick the wrong ones or give up and ask someone for help.

Flip the sequence. Instead of extracting everything and hoping use cases emerge, define your use cases first, then identify exactly which parameters those use cases require.

Here’s the process:

  1. Start with the question. What business questions does this model need to answer? Attribution? Funnel analysis? Churn prediction? Be specific.
  2. Map questions to fields. For each question, identify the minimum set of event parameters needed. Attribution might need source, medium, campaign, gclid, page_location, and ga_session_id. That’s six parameters. Not sixty.
  3. Extract only those. Build your model around the fields you actually need. Everything else stays in the raw events_* tables, accessible if a new use case genuinely emerges.
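As a sketch of step 3, here is what a narrow extraction for the attribution use case might look like. The dataset name, date window, and the six parameter keys are illustrative assumptions carried over from step 2 — your implementation may store these under different keys or value types:

```sql
-- Sketch: extract only the six parameters the attribution use case needs.
-- Everything else stays in the raw events_* tables.
SELECT
  event_date,
  event_name,
  user_pseudo_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source')        AS source,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'medium')        AS medium,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'campaign')      AS campaign,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'gclid')         AS gclid,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
  (SELECT value.int_value    FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id
FROM
  `your_project.analytics_123456789.events_*`
WHERE
  _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY));
```

The correlated-subquery pattern (`SELECT value.string_value FROM UNNEST(event_params) WHERE key = ...`) keeps each extracted parameter on its own line, so adding one later really is a one-line change.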

Look, if a new use case comes up in three months that needs a parameter you didn’t extract — you go back and add it. That’s a 20-minute change to your model. It’s not a disaster. The disaster is running a 200-column model for three months that nobody can understand and everyone pays for.

Run this to see how many distinct event parameter keys exist in your data:

SELECT
  params.key,
  COUNT(*) AS occurrences
FROM
  `your_project.analytics_123456789.events_*`,
  UNNEST(event_params) AS params
WHERE
  _TABLE_SUFFIX BETWEEN
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
GROUP BY params.key
ORDER BY occurrences DESC;

Now compare that list to the parameters your downstream queries actually reference. You can check this by searching your scheduled queries, dbt models, or Looker explores for the specific event_params keys they filter or extract.
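If your queries run in BigQuery, one way to audit this without leaving the warehouse is to search recent job history for a given key. This is a sketch: it assumes your jobs run in the `region-us` location and that a 30-day lookback covers your scheduled workloads — adjust both for your setup:

```sql
-- Sketch: count recent jobs whose SQL text references a given event_params key.
-- Replace 'engagement_time_msec' with the key you are auditing.
SELECT
  COUNT(*) AS jobs_referencing_key
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND query LIKE '%engagement_time_msec%';
```

A zero here for a parameter you extracted into its own column is exactly the "just in case" cost this pitfall is about.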

If you have 150 parameter keys in the raw data but your actual queries only touch 12 of them, you know the ratio. Extracting those other 138 “just in case” is costing you money and clarity for zero return.