Building Data Pipelines at Scale: Lessons from NYC
DQ Gyumin Choi (@dq_hustlecoding)
After spending nearly two years building data infrastructure at a NYC-based media startup, I wanted to share some hard-earned lessons about building data pipelines that actually scale.
The Challenge
When I joined, the data infrastructure was essentially non-existent. We needed to:
- Capture millions of user interaction events daily
- Process them in near real-time for ML models
- Build a data mart layer for business analytics
- Keep costs manageable on a startup budget
Architecture Decisions
Choosing BigQuery over Alternatives
We evaluated several options including Snowflake, Redshift, and BigQuery. For our use case, BigQuery won because:
- Serverless: No cluster management overhead
- Cost model: Pay-per-query worked well for our bursty workloads
- GCP Integration: We were already on GCP for Cloud Run
Event Schema Design
One of the most critical early decisions was the event schema. We adopted a semi-structured approach:
{
  "event_id": "uuid",
  "event_type": "video_watch",
  "user_id": "uuid",
  "timestamp": "2023-03-15T10:30:00Z",
  "properties": {
    "video_id": "abc123",
    "watch_duration_seconds": 145,
    "completion_rate": 0.72
  },
  "context": {
    "device": "mobile",
    "platform": "ios",
    "app_version": "2.1.0"
  }
}
The properties and context fields are JSON columns in BigQuery, giving us flexibility without schema migrations.
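To make that concrete, here is a minimal query sketch against the raw events table (the analytics.raw_events name is an assumption for illustration): scalar fields come out of the JSON columns with JSON_VALUE plus a cast.

-- Hypothetical ad-hoc query over the raw events table (table name assumed)
SELECT
  event_type,
  JSON_VALUE(context, '$.platform') as platform,
  -- JSON_VALUE returns a STRING, so cast before doing math on it
  SAFE_CAST(JSON_VALUE(properties, '$.watch_duration_seconds') AS INT64) as watch_duration_seconds
FROM analytics.raw_events
WHERE event_type = 'video_watch'
  AND DATE(timestamp) = '2023-03-15'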
The Pub/Sub + Cloud Run Pattern
For ingestion, we built a log proxy server:
- Client SDKs send events to our NestJS API
- Events are validated and enriched
- Published to Google Pub/Sub
- Cloud Run subscribers batch-write to BigQuery (the landing table is sketched below)
This pattern gave us:
- Decoupling: Ingestion speed independent of storage
- Buffering: Pub/Sub handles traffic spikes
- Reliability: At-least-once delivery guarantees
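For reference, the landing table the subscribers write into can be sketched roughly like this (the dataset, table name, and the exact partitioning and clustering choices are assumptions, not our production DDL):

-- Sketch of a raw events landing table in BigQuery (names and options assumed)
CREATE TABLE IF NOT EXISTS analytics.raw_events (
  event_id   STRING NOT NULL,
  event_type STRING NOT NULL,
  user_id    STRING,
  timestamp  TIMESTAMP NOT NULL,
  properties JSON,
  context    JSON
)
PARTITION BY DATE(timestamp)     -- prune scans (and cost) by day
CLUSTER BY event_type, user_id;  -- co-locate rows for the most common filters

Day-level partitioning plus clustering is the usual lever for keeping per-query scan costs down on an append-only table like this.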
dbt for the Data Mart
The raw event data isn't directly useful to business users, so we used dbt to build a proper data mart:
-- models/marts/user_engagement.sql
{{ config(materialized='incremental') }}

SELECT
  user_id,
  DATE(timestamp) as activity_date,
  COUNT(DISTINCT session_id) as sessions,
  SUM(CASE WHEN event_type = 'video_watch' THEN 1 ELSE 0 END) as videos_watched,
  -- properties is a JSON column, so extract and cast before aggregating
  AVG(SAFE_CAST(JSON_VALUE(properties, '$.watch_duration_seconds') AS FLOAT64)) as avg_watch_duration
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
-- compare DATE to DATE: timestamp is a TIMESTAMP, activity_date a DATE
WHERE DATE(timestamp) > (SELECT MAX(activity_date) FROM {{ this }})
{% endif %}
GROUP BY 1, 2
Key dbt Patterns We Adopted
- Incremental models for large tables
- Snapshots for slowly changing dimensions (sketched below)
- Tests on every model (not_null, unique, relationships)
- Documentation as code
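As one example of the snapshot pattern, a slowly changing users dimension can be captured with a dbt snapshot along these lines (the file, model, and column names here are illustrative assumptions):

-- snapshots/users_snapshot.sql (illustrative; names are assumptions)
{% snapshot users_snapshot %}
{{
  config(
    target_schema='snapshots',
    unique_key='user_id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}
SELECT * FROM {{ ref('stg_users') }}
{% endsnapshot %}

dbt adds dbt_valid_from / dbt_valid_to columns on each run, so downstream models can reconstruct how a dimension looked at any point in time.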
Lessons Learned
1. Start with the End in Mind
Before writing any pipeline code, we mapped out:
- What questions will stakeholders ask?
- What metrics drive business decisions?
- What data do ML models need?
This prevented building pipelines that nobody used.
2. Invest in Data Quality Early
We added data quality checks from day one:
- Schema validation at ingestion
- Anomaly detection on key metrics
- Automated alerts for data freshness
The cost of bad data compounds quickly.
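As a concrete example of a freshness check, a small query (run on a schedule, or dropped into dbt's tests/ directory as a singular test that fails when it returns rows) can flag stale data. A sketch, assuming the staging model from earlier and a 60-minute SLA:

-- Freshness check sketch: returns a row only when ingestion is more than 60 minutes behind
SELECT *
FROM (
  SELECT
    MAX(timestamp) as latest_event,
    TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(timestamp), MINUTE) as minutes_behind
  FROM {{ ref('stg_events') }}
)
WHERE minutes_behind > 60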
3. Make It Observable
Every pipeline should answer:
- Is it running?
- When did it last succeed?
- How long did it take?
- How much data did it process?
We used Cloud Monitoring dashboards and PagerDuty alerts.
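BigQuery itself can answer the duration and volume questions. A sketch of this kind of audit query, assuming the US multi-region (adjust the region qualifier to your project's location):

-- Recent query jobs, heaviest first (region qualifier assumed)
SELECT
  job_id,
  creation_time,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) as duration_seconds,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY total_bytes_processed DESC
LIMIT 20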
Results
After 18 months:
- Processing 10M+ events daily
- 99.9% pipeline reliability
- Query costs reduced by 60% through optimization
- ML team shipping models 3x faster with clean data
What I Would Do Differently
- Adopt dbt earlier - We spent months writing ad-hoc SQL before moving to dbt
- Better CDC patterns - Our initial approach to database replication was naive
- More aggressive data retention policies - Storage costs crept up on us (see the sketch below)
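On retention specifically, BigQuery makes this a one-liner once you pick a window. A sketch, assuming the raw events table from earlier and a 90-day window:

-- Expire raw event partitions after 90 days (table name and window are assumptions)
ALTER TABLE analytics.raw_events
SET OPTIONS (partition_expiration_days = 90);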
Building data infrastructure at a startup is challenging but incredibly rewarding. You get to make architectural decisions that would take years at larger companies.
If you are building something similar, feel free to reach out!