Building Data Pipelines at Scale: Lessons from NYC
DQ Gyumin Choi (@dq_hustlecoding)
After spending nearly two years building data infrastructure at a NYC-based media startup, I wanted to share some hard-earned lessons about building data pipelines that actually scale.
The Challenge
When I joined, the data infrastructure was essentially non-existent. We needed to:
- Capture millions of user interaction events daily
- Process them in near real-time for ML models
- Build a data mart layer for business analytics
- Keep costs manageable on a startup budget
Architecture Decisions
Choosing BigQuery over Alternatives
We evaluated several options including Snowflake, Redshift, and BigQuery. For our use case, BigQuery won because:
- Serverless: No cluster management overhead
- Cost model: Pay-per-query worked well for our bursty workloads
- GCP Integration: We were already on GCP for Cloud Run
Event Schema Design
One of the most critical early decisions was the event schema. We adopted a semi-structured approach:
{
  "event_id": "uuid",
  "event_type": "video_watch",
  "user_id": "uuid",
  "timestamp": "2023-03-15T10:30:00Z",
  "properties": {
    "video_id": "abc123",
    "watch_duration_seconds": 145,
    "completion_rate": 0.72
  },
  "context": {
    "device": "mobile",
    "platform": "ios",
    "app_version": "2.1.0"
  }
}
The properties and context fields are JSON columns in BigQuery, giving us flexibility without schema migrations.
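To make that concrete, here is a minimal query sketch against the raw events table (the analytics.raw_events name is an assumption for illustration): scalar fields come out of the JSON columns with JSON_VALUE plus a cast.

-- Hypothetical ad-hoc query over the raw events table (table name assumed)
SELECT
  event_type,
  JSON_VALUE(context, '$.platform') as platform,
  -- JSON_VALUE returns a STRING, so cast before doing math on it
  SAFE_CAST(JSON_VALUE(properties, '$.watch_duration_seconds') AS INT64) as watch_duration_seconds
FROM analytics.raw_events
WHERE event_type = 'video_watch'
  AND DATE(timestamp) = '2023-03-15'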
The Pub/Sub + Cloud Run Pattern
For ingestion, we built a log proxy server:
- Client SDKs send events to our NestJS API
- Events are validated and enriched
- Published to Google Pub/Sub
- Cloud Run subscribers batch-write to BigQuery (the landing table is sketched below)
This pattern gave us:
- Decoupling: Ingestion speed independent of storage
- Buffering: Pub/Sub handles traffic spikes
- Reliability: At-least-once delivery guarantees
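For reference, the landing table the subscribers write into can be sketched roughly like this (the dataset, table name, and the exact partitioning and clustering choices are assumptions, not our production DDL):

-- Sketch of a raw events landing table in BigQuery (names and options assumed)
CREATE TABLE IF NOT EXISTS analytics.raw_events (
  event_id   STRING NOT NULL,
  event_type STRING NOT NULL,
  user_id    STRING,
  timestamp  TIMESTAMP NOT NULL,
  properties JSON,
  context    JSON
)
PARTITION BY DATE(timestamp)     -- prune scans (and cost) by day
CLUSTER BY event_type, user_id;  -- co-locate rows for the most common filters

Day-level partitioning plus clustering is the usual lever for keeping per-query scan costs down on an append-only table like this.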
dbt for the Data Mart
The raw event data isn't directly useful to business users, so we used dbt to build a proper data mart:
-- models/marts/user_engagement.sql
{{ config(materialized='incremental') }}

SELECT
  user_id,
  DATE(timestamp) as activity_date,
  COUNT(DISTINCT session_id) as sessions,
  SUM(CASE WHEN event_type = 'video_watch' THEN 1 ELSE 0 END) as videos_watched,
  -- properties is a JSON column, so extract and cast before aggregating
  AVG(SAFE_CAST(JSON_VALUE(properties, '$.watch_duration_seconds') AS FLOAT64)) as avg_watch_duration
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
-- compare DATE to DATE: timestamp is a TIMESTAMP, activity_date a DATE
WHERE DATE(timestamp) > (SELECT MAX(activity_date) FROM {{ this }})
{% endif %}
GROUP BY 1, 2
Key dbt Patterns We Adopted
- Incremental models for large tables
- Snapshots for slowly changing dimensions (sketched below)
- Tests on every model (not_null, unique, relationships)
- Documentation as code
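As one example of the snapshot pattern, a slowly changing users dimension can be captured with a dbt snapshot along these lines (the file, model, and column names here are illustrative assumptions):

-- snapshots/users_snapshot.sql (illustrative; names are assumptions)
{% snapshot users_snapshot %}
{{
  config(
    target_schema='snapshots',
    unique_key='user_id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}
SELECT * FROM {{ ref('stg_users') }}
{% endsnapshot %}

dbt adds dbt_valid_from / dbt_valid_to columns on each run, so downstream models can reconstruct how a dimension looked at any point in time.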
Lessons Learned
1. Start with the End in Mind
Before writing any pipeline code, we mapped out:
- What questions will stakeholders ask?
- What metrics drive business decisions?
- What data do ML models need?
This prevented building pipelines that nobody used.
2. Invest in Data Quality Early
We added data quality checks from day one:
- Schema validation at ingestion
- Anomaly detection on key metrics
- Automated alerts for data freshness
The cost of bad data compounds quickly.
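As a concrete example of a freshness check, a small query (run on a schedule, or dropped into dbt's tests/ directory as a singular test that fails when it returns rows) can flag stale data. A sketch, assuming the staging model from earlier and a 60-minute SLA:

-- Freshness check sketch: returns a row only when ingestion is more than 60 minutes behind
SELECT *
FROM (
  SELECT
    MAX(timestamp) as latest_event,
    TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(timestamp), MINUTE) as minutes_behind
  FROM {{ ref('stg_events') }}
)
WHERE minutes_behind > 60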
3. Make It Observable
Every pipeline should answer:
- Is it running?
- When did it last succeed?
- How long did it take?
- How much data did it process?
We used Cloud Monitoring dashboards and PagerDuty alerts.
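BigQuery itself can answer the duration and volume questions. A sketch of this kind of audit query, assuming the US multi-region (adjust the region qualifier to your project's location):

-- Recent query jobs, heaviest first (region qualifier assumed)
SELECT
  job_id,
  creation_time,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) as duration_seconds,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY total_bytes_processed DESC
LIMIT 20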
Results
After 18 months:
- Processing 10M+ events daily
- 99.9% pipeline reliability
- Query costs reduced by 60% through optimization
- ML team shipping models 3x faster with clean data
What I Would Do Differently
- Adopt dbt earlier - We spent months writing ad-hoc SQL before moving to dbt
- Better CDC patterns - Our initial approach to database replication was naive
- More aggressive data retention policies - Storage costs crept up on us (see the sketch below)
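On retention specifically, BigQuery makes this a one-liner once you pick a window. A sketch, assuming the raw events table from earlier and a 90-day window:

-- Expire raw event partitions after 90 days (table name and window are assumptions)
ALTER TABLE analytics.raw_events
SET OPTIONS (partition_expiration_days = 90);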
Building data infrastructure at a startup is challenging but incredibly rewarding. You get to make architectural decisions that would take years at larger companies.
If you are building something similar, feel free to reach out!