🚗 CarTrade Tech · Platform ML Engineering

Agentic Feature Mart

Intelligent multi-agent system for large-scale feature discovery, engineering, and serving. It powers ML decisions at ₹5Cr+ scale across 5+ product lines.

🤖 LangGraph Multi-Agent Orchestration
❄️ Snowflake Feature Store
🔁 Argo CI/CD Pipelines
🧠 GPT-4 Location Intelligence
🐍 Python + SQL + dbt

📊 15K+ Total Features
💰 ₹5Cr+ Annual Revenue
💸 ₹1Cr/yr CRM Savings
📦 5+ Product Lines
🔁 Monthly Auto Stamping

End-to-End Architecture
System Architecture Overview
Complete data flow: raw sources → ETL → feature engineering → Snowflake → Agentic Layer → model consumers

① DATA SOURCES
📧 Marketing Events: CRM / Campaigns
🏪 Seller Data: Listings / Profiles
🖱️ Clickstream: User Behavior
📍 Geo / Maps API: POI · Geocodes
🌐 Web + LLM Data: Regional Intel

② ETL · FEATURE ENGINEERING · LLM AUGMENTATION
⚙️ Python + SQL + dbt ETL Pipelines: Batch feature computation · Aggregations · Grain alignment
🧠 LLM Augmentation (GPT-4): Location context · POI embeddings · Geo-IQ
🗺️ Geo Processing: GeoPandas · Shapely · Reverse Geocoding

③ ORCHESTRATION
🔁 Argo CI/CD Workflows: DAG-based · Monthly cron · Feature versioning & stamping · Quality gates · Docker + K8s
✅ Quality Gates: Null checks · PSI drift · Slack alerts
🔖 Feature Stamping: v{YYYY}.{MM} · Train-serve consistency

④ FEATURE STORE
❄️ Snowflake Feature Mart
Schema: FEATURE_MART | 50+ tables | Single source of truth
Features by vertical: Marketing (3.2K) · Seller (4.1K) · Clickstream (2.8K) · Geo-IQ (2.9K) · LLM Location (2K)
15K+ Features · 50+ Tables · 99.1% Non-Null

⑤ AGENTIC LAYER: LangGraph Multi-Agent System
🤖 Orchestrator: Intent parsing · Routing
🗂️ Table Recommender: Schema knowledge · Table selection
🔍 Data Fetcher: Snowflake SQL · JOINs
🧪 Feature Engineer: Validation · Enrichment

⑥ MODEL CONSUMERS
📈 Seller Prioritization: ₹5Cr+ revenue
🚗 Used Car Loan: Risk scoring
💳 Personal Loan: Credit features
👤 Buyer Model: Intent signals
📲 CRM Scoring: ₹1Cr/yr savings

Data & Pipeline Layer
Data Sources, Features & Argo CI/CD Pipeline
Five data verticals → 15,000+ ML features · Fully automated monthly stamping via Argo Workflows

📧 Marketing (3,200 features): CTR, campaigns, lead sources, WhatsApp, ROAS
🏪 Seller Engagement (4,100 features): Views, conversion, response rate, renewal, pricing
🖱️ Clickstream (2,800 features): Session depth, funnel, dwell time, search, bounce
📍 Geo-IQ (2,900 features): POIs, highway dist, density, city tier, catchment
🧠 LLM Location (2,000 features): GPT-4 quality scores, embeddings, neighbourhood class.

ARGO CI/CD PIPELINE: Monthly Automated Execution
📅 Monthly Cron (Argo CronWorkflow)
→ 🔄 Extract & Load (Python ETL Jobs)
→ ⚙️ SQL Transform (dbt Feature Models)
→ 🧠 LLM Augment (GPT-4 API Calls)
→ ✅ Quality Gate (Null · PSI · Stats)
→ 🔖 Feature Stamp (v{YYYY}.{MM} tag)
→ ❄️ Write Snowflake (FEATURE_MART)

🔖 Feature Stamping
  • Every version tagged as v{YYYY}.{MM}
  • Training reads the same stamp as serving
  • Removes train-serve skew caused by mismatched feature versions
  • Rollback to any historical stamp supported
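
A minimal sketch of how the v{YYYY}.{MM} stamp could be produced and checked, assuming the stamp is a plain string derived from the run date. The helper names (`make_stamp`, `parse_stamp`, `previous_stamp`, `check_train_serve`) are hypothetical and are not taken from the pipeline itself.

```python
import re
from datetime import date
from typing import Tuple

# Stamp format taken from the slide: v{YYYY}.{MM}
STAMP_PATTERN = re.compile(r"^v(\d{4})\.(\d{2})$")


def make_stamp(run_date: date) -> str:
    """Format a pipeline run date as a v{YYYY}.{MM} stamp."""
    return f"v{run_date.year:04d}.{run_date.month:02d}"


def parse_stamp(stamp: str) -> Tuple[int, int]:
    """Return (year, month), or raise ValueError for a malformed stamp."""
    match = STAMP_PATTERN.match(stamp)
    if match is None:
        raise ValueError(f"not a v{{YYYY}}.{{MM}} stamp: {stamp!r}")
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in stamp: {stamp!r}")
    return year, month


def previous_stamp(stamp: str) -> str:
    """Stamp of the preceding monthly run, e.g. as a rollback target."""
    year, month = parse_stamp(stamp)
    if month == 1:
        return make_stamp(date(year - 1, 12, 1))
    return make_stamp(date(year, month - 1, 1))


def check_train_serve(training_stamp: str, serving_stamp: str) -> None:
    """Hypothetical guard: refuse to serve with a stamp other than the training one."""
    parse_stamp(training_stamp)
    parse_stamp(serving_stamp)
    if training_stamp != serving_stamp:
        raise ValueError(
            f"stamp mismatch: trained on {training_stamp}, serving {serving_stamp}"
        )
```

A guard like `check_train_serve` only catches version mismatches; it does not detect drift inside a single stamp, which is what the quality gate is for.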
✅ Quality Gate Checks
  • Null rate < 2% enforced per column
  • PSI distribution drift checked against the prior month
  • Row count sanity check against baseline
  • Slack alert on failure; the pipeline halts
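
The null-rate and PSI checks above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the 2% null threshold comes from the slide, while the equal-width binning, the `eps` floor for empty bins, and the 0.25 PSI cutoff (a commonly quoted rule of thumb) are assumptions. PSI is computed as the sum over bins of (actual share − expected share) × ln(actual share / expected share).

```python
import math
from typing import List, Optional, Sequence


def null_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of missing (None) entries in a column."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index using equal-width bins over the expected range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant columns

    def shares(sample: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(sample), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def passes_gate(current: Sequence[Optional[float]], prior: Sequence[float],
                max_null_rate: float = 0.02, max_psi: float = 0.25) -> bool:
    """True when the column meets both the null-rate and the drift threshold."""
    present = [v for v in current if v is not None]
    if not present:
        return False
    return null_rate(current) < max_null_rate and psi(prior, present) <= max_psi
```

In a pipeline step, a `False` result would be the point at which the run halts and an alert is sent.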
⚙️ Argo Workflow
  • DAG-based task dependency graph
  • Parallel computation per feature category
  • Up to 3 retries with exponential backoff
  • Dockerized containers per step · K8s
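
Argo applies the retry policy itself, so no application code is needed for it. The sketch below only illustrates what "3 retries with exponential backoff" means in practice; the base delay of 1 second and the factor of 2 are assumptions, as the slide does not state them.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int = 3, base_seconds: float = 1.0,
                   factor: float = 2.0) -> List[float]:
    """Delay before each retry: base, base*factor, base*factor^2, ..."""
    return [base_seconds * factor ** attempt for attempt in range(retries)]


def run_with_retries(step: Callable[[], T], retries: int = 3,
                     base_seconds: float = 1.0, factor: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `step`; on failure wait and retry, re-raising after the last retry."""
    delays = backoff_delays(retries, base_seconds, factor)
    for attempt in range(retries + 1):  # one initial try plus `retries` retries
        try:
            return step()
        except Exception:
            if attempt == retries:
                raise
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

With the defaults, a step that keeps failing is attempted four times in total, with waits of 1, 2 and 4 seconds between attempts.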
Agentic System Design
Multi-Agent System Architecture
LangGraph-orchestrated agents enabling conversational feature discovery from Snowflake Feature Mart

👤 Data Scientist / ML Engineer: Natural language feature requests via chat interface
↓
🤖 Orchestrator Agent (LangGraph StateGraph): Intent parsing · Task decomposition · Sub-agent routing · Multi-turn memory
↓
🗂️ Table Recommender · 🔍 Data Fetcher · 🧪 Feature Engineer
↓
📦 Feature Matrix Output: Quality-checked · Versioned · S3 export · Model-ready

🛠️ LangGraph Tools used: StateGraph · ToolNode · MessagesState · SQLDatabase tool · SnowflakeConnector tool · Memory Checkpointer

🤖 Orchestrator Agent
  • LangGraph StateGraph with shared conversation memory
  • GPT-4 structured output for intent classification
  • Routes to the correct sub-agent based on query type
  • Handles multi-turn dialog with context preservation
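
The routing behaviour can be illustrated without LangGraph. The sketch below is a plain-Python stand-in, not the LangGraph API: a shared state object, a classifier, and a routing table. The keyword classifier replaces the GPT-4 structured-output call, and every label except `TABLE_RECOMMEND` and `RECOMMENDATION` (both shown in the demo state panel) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ChatState:
    """Shared state, loosely mirroring the intent / stage / turns panel."""
    messages: List[str] = field(default_factory=list)
    intent: str = ""
    stage: str = ""
    turns: int = 0


def classify_intent(message: str) -> str:
    """Keyword stand-in for the GPT-4 structured-output classifier."""
    text = message.lower()
    if "fetch" in text or "join" in text:
        return "DATA_FETCH"        # hypothetical label
    if "add " in text or "enrich" in text:
        return "FEATURE_ENGINEER"  # hypothetical label
    return "TABLE_RECOMMEND"       # label shown in the demo state


def table_recommender(state: ChatState) -> None:
    state.stage = "RECOMMENDATION"


def data_fetcher(state: ChatState) -> None:
    state.stage = "FETCH"          # hypothetical stage name


def feature_engineer(state: ChatState) -> None:
    state.stage = "ENRICHMENT"     # hypothetical stage name


ROUTES: Dict[str, Callable[[ChatState], None]] = {
    "TABLE_RECOMMEND": table_recommender,
    "DATA_FETCH": data_fetcher,
    "FEATURE_ENGINEER": feature_engineer,
}


def orchestrate(state: ChatState, message: str) -> ChatState:
    """Record the turn, classify it, and hand the state to the matching sub-agent."""
    state.messages.append(message)
    state.turns += 1
    state.intent = classify_intent(message)
    ROUTES[state.intent](state)
    return state
```

Because every sub-agent receives the same `ChatState`, earlier messages stay available across turns, which is the role the shared conversation memory plays in the real graph.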
🗂️ Table Recommender Agent
  • Indexes FEATURE_REGISTRY table (15K+ feature metadata)
  • Semantic matching: user requirement → relevant tables
  • Returns table names, join keys, grain, freshness, feature counts
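
As a rough illustration of requirement-to-table matching, the sketch below scores registry entries by keyword overlap. This is a deliberately simple stand-in for semantic matching (which would normally use embeddings); the table names, feature counts and grains come from the demo, while the `tags` sets and the `recommend` function are hypothetical.

```python
import re
from typing import Any, Dict, List

# Table names, counts and grains from the demo; the tags are illustrative assumptions.
REGISTRY: List[Dict[str, Any]] = [
    {"table": "SELLER_GEO_FEATURES", "features": 452, "grain": "seller_id x month",
     "tags": {"location", "geo", "poi", "highway"}},
    {"table": "SELLER_ENGAGEMENT_FEATURES", "features": 318, "grain": "seller_id x week",
     "tags": {"engagement", "views", "conversion", "response"}},
    {"table": "CITY_DEMOGRAPHICS_FEATURES", "features": 98, "grain": "city_id x quarter",
     "tags": {"demographic", "income", "population"}},
    {"table": "LLM_LOCATION_FEATURES", "features": 45, "grain": "seller_id x month",
     "tags": {"location", "quality", "neighbourhood"}},
]


def recommend(request: str, registry: List[Dict[str, Any]] = REGISTRY,
              top_k: int = 4) -> List[str]:
    """Rank tables by how many of their tags appear in the request."""
    words = set(re.findall(r"[a-z]+", request.lower()))
    scored = [(len(words & entry["tags"]), entry["table"]) for entry in registry]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep registry order
    return [table for _, table in hits[:top_k]]
```

For example, `recommend("location quality signals and seller engagement metrics")` ranks `LLM_LOCATION_FEATURES` first because two of its tags match, followed by the geo and engagement tables with one match each.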
🔍 Data Fetcher + 🧪 Feature Engineer
  • Fetcher generates & executes optimized Snowflake SQL with multi-table JOINs
  • Returns data quality report (null rates, join success %, row count)
  • Engineer validates features and interactively proposes enrichment groups
  • Exports final versioned matrix to S3 with stamp metadata

Live Agent Demo · Step 1 of 3
Agent Conversation: Feature Request → Table Recommendation
User describes needs → Orchestrator parses intent → Table Recommender Agent selects optimal Snowflake tables

Agent Flow: 👤 User → 🤖 Orchestrator (active) → 🗂️ Table Rec. (active) → 🔍 Data Fetcher → 🧪 Feature Eng.

📡 LangGraph State
intent: TABLE_RECOMMEND
stage: RECOMMENDATION
turns: 2 | memory: active

DS: I need to build a seller prioritization model for Tier-2 cities. I need location quality signals, seller engagement metrics, and demographic context for each seller.

🤖 Orchestrator Agent: Analyzing your feature request... Decomposed into 3 categories:

1. 📍 Location quality signals → Geo-IQ tables
2. 🏪 Seller engagement metrics → Seller engagement tables
3. 👥 Demographic context → City demographics tables

Routing to Table Recommender Agent...

🗂️ Table Recommender Agent: Based on your requirements, here are the recommended tables from the FEATURE_MART schema in Snowflake:

Table Name                 | Features | Grain             | Freshness | Match
SELLER_GEO_FEATURES        | 452      | seller_id × month | Monthly   | ★★★ High
SELLER_ENGAGEMENT_FEATURES | 318      | seller_id × week  | Weekly    | ★★★ High
CITY_DEMOGRAPHICS_FEATURES | 98       | city_id × quarter | Quarterly | ★★ Medium
LLM_LOCATION_FEATURES      | 45       | seller_id × month | Monthly   | ★★★ High

🔑 Primary join key: seller_id | 🔗 Secondary: city_id via seller_city_mapping
🎯 Recommended filter: city_tier IN (2, 3)
✅ Total available: 913 features across 4 tables. Shall I proceed with fetching?

Live Agent Demo · Step 2 of 3
Agent Conversation: Snowflake SQL Execution & Multi-Table JOIN
Data Fetcher Agent generates optimized SQL → executes on Snowflake → returns data quality report

Agent Flow: 👤 User → 🤖 Orchestrator → 🗂️ Table Rec. → 🔍 Data Fetcher (active) → 🧪 Feature Eng.

❄️ Snowflake Status
Warehouse: FEATURE_WH
Schema: FEATURE_MART
Query time: 3.2s
Rows fetched: 67,430
Base features: 913

DS: Yes, proceed with fetching all 4 recommended tables and join them.

🔍 Data Fetcher Agent: Connecting to Snowflake FEATURE_WH
Generating optimized multi-table JOIN query:

    SELECT
        g.seller_id, g.city_tier,
        g.poi_count_500m, g.poi_count_1km, g.poi_count_3km,
        g.avg_distance_to_highway_km, g.residential_density_score, g.commercial_zone_flag,
        e.avg_listing_views_30d, e.avg_response_time_hrs, e.conversion_rate_30d,
        e.active_listings_count, e.reply_rate_7d,
        d.avg_household_income_proxy, d.population_density_per_sqkm,
        l.llm_area_quality_score, l.llm_area_tier, l.llm_poi_sentiment
    FROM FEATURE_MART.SELLER_GEO_FEATURES g
    INNER JOIN FEATURE_MART.SELLER_ENGAGEMENT_FEATURES e
        ON g.seller_id = e.seller_id
        AND g.feature_month = DATE_TRUNC('MONTH', e.feature_week_start)
    LEFT JOIN FEATURE_MART.CITY_DEMOGRAPHICS_FEATURES d
        ON g.city_id = d.city_id
        AND g.feature_quarter = d.feature_quarter
    LEFT JOIN FEATURE_MART.LLM_LOCATION_FEATURES l
        ON g.seller_id = l.seller_id
        AND g.feature_month = l.feature_month
    WHERE g.city_tier IN (2, 3)
        AND g.feature_month = DATE_TRUNC('MONTH', CURRENT_DATE())
        AND g.seller_id IS NOT NULL

Executed in 3.2s · 67,430 sellers
Geo↔Engagement join: 98.7% · Demographics join: 94.2% · LLM join: 99.1%
📊 Base feature set assembled: 67,430 rows × 913 features | Overall non-null: 99.1%
Handing off to Feature Engineer Agent for validation & enrichment →
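
One point worth illustrating is grain alignment. If the engagement table really holds one row per seller per week, as its listed grain suggests, joining it to a monthly table on the truncated month can return several rows per seller. One way to keep a single row per seller per month is to aggregate the weekly rows first. The sketch below shows that idea on SQLite with toy data; it is not the query the agent generates, the table and column names are simplified, `strftime` stands in for Snowflake's `DATE_TRUNC`, and a LEFT JOIN is used so the match rate can be measured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geo (seller_id INTEGER, feature_month TEXT, poi_count_500m INTEGER);
CREATE TABLE engagement (seller_id INTEGER, feature_week_start TEXT, avg_listing_views REAL);
INSERT INTO geo VALUES (1, '2024-01-01', 12), (2, '2024-01-01', 5), (3, '2024-01-01', 9);
INSERT INTO engagement VALUES
    (1, '2024-01-01', 100.0), (1, '2024-01-08', 140.0),
    (2, '2024-01-15', 60.0);
""")

QUERY = """
WITH engagement_monthly AS (
    -- Roll weekly rows up to one row per seller per month before joining.
    SELECT seller_id,
           strftime('%Y-%m-01', feature_week_start) AS feature_month,
           AVG(avg_listing_views) AS avg_listing_views
    FROM engagement
    GROUP BY seller_id, strftime('%Y-%m-01', feature_week_start)
)
SELECT g.seller_id, g.poi_count_500m, e.avg_listing_views
FROM geo g
LEFT JOIN engagement_monthly e
    ON g.seller_id = e.seller_id AND g.feature_month = e.feature_month
ORDER BY g.seller_id
"""

rows = conn.execute(QUERY).fetchall()

# Share of geo rows that found an engagement match (a "join success %").
join_rate = sum(r[2] is not None for r in rows) / len(rows)
```

Seller 1 has two weekly rows but appears once in the result, with the two values averaged; seller 3 has no engagement rows and lowers the join rate.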
Live Agent Demo · Step 3 of 3
Agent Conversation: Feature Enrichment & Final Versioned Output
Feature Engineer Agent asks clarifying questions, adds enrichment layers, delivers stamped feature matrix

Agent Flow: 👤 User → 🤖 Orchestrator → 🗂️ Table Rec. → 🔍 Data Fetcher → 🧪 Feature Eng. (active)

📦 Final Output
Total features: 1,141
Sellers: 67,430
Version: v2024.01.SELLER_TIER2
Null rate: 0.8%
Status: ✅ Ready

🧪 Feature Engineer Agent: Base set validated ✅ 913 features ready. I can enrich with additional feature groups:

Feature Group                      | +Count | Table
📱 Clickstream Behavioral (90d)    | +127   | CLICKSTREAM_SELLER_90D
📈 Historical Trends (12m rolling) | +63    | SELLER_TREND_FEATURES
🏠 Property Market Signals         | +38    | PROPERTY_MARKET_FEATURES
🤖 Extra LLM Embeddings v2         | +45    | LLM_EMBEDDINGS_V2

Which feature groups would you like to add?

DS: Add Clickstream Behavioral, Historical Trends, and Property Market Signals.

🧪 Feature Engineer Agent: Fetching additional tables from Snowflake
CLICKSTREAM_SELLER_90D: 127 features ✅
SELLER_TREND_FEATURES: 63 features ✅
PROPERTY_MARKET_FEATURES: 38 features ✅

🎯 Final Feature Matrix: Ready for Model Training
Geo-IQ Features       452
Seller Engagement     318
LLM Location Intel     45
City Demographics      98
Clickstream (90d)     127
Historical Trends      63
Property Market        38
TOTAL FEATURES      1,141

Null rate: 0.8%
Version: v2024.01.SELLER_TIER2
Train-serve consistency: ✅ Verified
Exported: s3://feature-mart-outputs/seller_t2_v2024_01/
Argo stamp: STAMP_2024_01_15
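
The totals in the final matrix can be checked from the per-table counts: the four recommended tables contribute 913 features and the three selected enrichment groups add 228, giving 1,141. The sketch below recomputes this and assembles a small export manifest. The manifest layout is a hypothetical illustration; only the values in it come from the demo.

```python
import json

# Feature counts per table, as listed in the demo.
BASE_TABLES = {
    "SELLER_GEO_FEATURES": 452,
    "SELLER_ENGAGEMENT_FEATURES": 318,
    "CITY_DEMOGRAPHICS_FEATURES": 98,
    "LLM_LOCATION_FEATURES": 45,
}
ENRICHMENT_GROUPS = {
    "CLICKSTREAM_SELLER_90D": 127,
    "SELLER_TREND_FEATURES": 63,
    "PROPERTY_MARKET_FEATURES": 38,
}

base_total = sum(BASE_TABLES.values())
final_total = base_total + sum(ENRICHMENT_GROUPS.values())

# Hypothetical manifest structure; the values are the ones shown in the final output.
manifest = {
    "version": "v2024.01.SELLER_TIER2",
    "argo_stamp": "STAMP_2024_01_15",
    "export_path": "s3://feature-mart-outputs/seller_t2_v2024_01/",
    "rows": 67430,
    "feature_count": final_total,
    "feature_groups": {**BASE_TABLES, **ENRICHMENT_GROUPS},
}
manifest_json = json.dumps(manifest, indent=2)
```

Writing a manifest like this next to the exported matrix would let a training job confirm the stamp and feature count before reading the data.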

Technology
Complete Tech Stack
End-to-end technology decisions across data engineering, ML platform, agentic layer and infrastructure

❄️ Snowflake (Feature Store): Data warehouse + Feature Store. FEATURE_MART schema. 15K+ features across 50+ tables. Snowpipe for streaming ingest.
🤖 LangGraph (Agent Orchestration): Multi-agent orchestration framework. StateGraph, ToolNode, shared memory, conversational routing between agents.
🐍 Python (Core Language): Feature engineering, ETL pipelines, agent tools, Snowflake connector, data validation and quality checks.
🗄️ SQL + dbt (Transformations): Feature computation in Snowflake SQL. dbt models for transformation DAGs, testing, documentation and lineage.
🔁 Argo Workflows (Orchestration): CI/CD for feature pipelines. DAG execution, monthly cron trigger, 3-retry logic, Kubernetes-native task runner.
🧠 OpenAI GPT-4 (LLM Provider): LLM-augmented location features. Area quality scoring, POI descriptions, embeddings via text-embedding-3-large.
📍 Google Maps API (Geo Data): POI data, geocoding, reverse geocoding, distance matrix. Powers all Geo-IQ features in SELLER_GEO_FEATURES table.
⚡ Kafka + Spark (Streaming): Clickstream event streaming. Kafka topics → Spark Streaming → 5-min micro-batch aggregations → Snowpipe.
🗺️ GeoPandas / Shapely (Geo Processing): Geospatial feature computation. Catchment areas, density calcs, polygon intersections, coordinate clustering.
🤖 XGBoost / sklearn (ML Models): Downstream model training on Feature Mart output. Seller prioritization, loan scoring, CRM propensity models.
🐳 Docker + Kubernetes (Infrastructure): Containerized pipeline steps. Auto-scaling Argo tasks. K8s namespace: feature-pipelines. Helm charts for deployment.
☁️ AWS S3 (Storage): Artifact storage for pipeline intermediates. Final feature matrix versioned export. Prefixes per stamp version.

Business Impact
Business Impact & Results
Agentic Feature Mart powers ₹6Cr+ combined annual value across product lines and operational savings

📈 ₹5Cr+ · Seller Prioritization Model Revenue
Feature Mart powers the seller prioritization XGBoost model that ranks sellers for sales outreach. Better features → better ranking → more high-value seller conversions. Annual revenue attribution: ₹5Cr+.
XGBoost Model · Geo-IQ Features · Seller Engagement

💸 ₹1Cr/yr · CRM Cost Savings via Scoring Framework
Integrated into CRM scoring to identify sellers who do NOT need WhatsApp outreach. Reduces unnecessary messaging costs and sales team effort. Saves ₹1Cr/year in operational spend by reducing low-ROI contacts by 40%.
CRM Integration · WhatsApp Cost Reduction

📦 5+ Products · Cross-Product Feature Consumption
Single Feature Mart consumed across Buyer, Used Car Loan, Personal Loan, Seller Prioritization and CRM product lines. Eliminates duplicate feature engineering efforts across teams: one stamp, many models.
Buyer Model · Used Car Loan · Personal Loan

🔁 15K+ Features · Scale & Engineering Excellence
15,000+ features spanning 5 verticals. Fully automated monthly pipeline with 99.1% non-null rate. Training-serving consistency enforced via Argo stamps. Zero manual intervention required post-setup.
Monthly Automation · 99.1% Quality · Zero Skew

LLM-Augmented Features
Geo-IQ & LLM Location Intelligence Pipeline
GPT-4 transforms raw coordinates + POI data into rich contextual ML signals, a key differentiator in location-aware models

📥 INPUT SIGNALS
📍 Raw Geo Data
lat: 19.0760, lon: 72.8777
city: "Mumbai", pin: "400001"
state: "Maharashtra"
city_tier: 1
🗺️ POI Data (Google Maps)
poi_500m: ["SBI ATM","McD"...]
poi_type_counts: {hospital:2,school:3}
nearest_highway_km: 0.8
residential_density: "High"
📊 Census Proxy Signals
Population density, income index, literacy rate from PIN/area identifiers

TRANSFORMS VIA
→ 🤖 GPT-4o (OpenAI API, text-embedding-3)
→ Prompt Engineering · Chain-of-Thought · Structured JSON Out · 1536-dim Embeddings

📤 OUTPUT: LLM_LOCATION_FEATURES table
🌟 Location Quality Score
llm_location_quality: 7.4
llm_area_tier: "Prime Urban"
llm_growth_potential: "High"
llm_infra_score: 8.1
📐 Area Embedding (1536-dim)
llm_area_embedding: [0.023,-0.14...]
llm_area_summary: "Well-connected..."
llm_poi_sentiment: 0.81
🏘️ Neighbourhood Classification
Residential / Commercial / Industrial / Mixed-use labels with confidence scores per zone
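
Since the LLM step returns structured JSON, a validation layer before writing to LLM_LOCATION_FEATURES is a natural safeguard. The sketch below parses a response into a typed record and rejects out-of-range values. The field names come from the output panel above; the 0 to 10 range for the two scores, the -1 to 1 range for sentiment, and the `parse_location_response` helper are assumptions made for illustration. No API call is made here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LocationFeatures:
    llm_location_quality: float
    llm_area_tier: str
    llm_growth_potential: str
    llm_infra_score: float
    llm_poi_sentiment: float


def parse_location_response(raw: str) -> LocationFeatures:
    """Parse and validate one JSON response; raise ValueError if it is unusable."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    missing = [name for name in LocationFeatures.__dataclass_fields__ if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    features = LocationFeatures(
        llm_location_quality=float(data["llm_location_quality"]),
        llm_area_tier=str(data["llm_area_tier"]),
        llm_growth_potential=str(data["llm_growth_potential"]),
        llm_infra_score=float(data["llm_infra_score"]),
        llm_poi_sentiment=float(data["llm_poi_sentiment"]),
    )
    # Assumed ranges: scores on a 0-10 scale, sentiment in [-1, 1].
    for name in ("llm_location_quality", "llm_infra_score"):
        if not 0.0 <= getattr(features, name) <= 10.0:
            raise ValueError(f"{name} out of range: {getattr(features, name)}")
    if not -1.0 <= features.llm_poi_sentiment <= 1.0:
        raise ValueError(f"llm_poi_sentiment out of range: {features.llm_poi_sentiment}")
    return features
```

Rows that fail validation could be retried or written as nulls, which keeps malformed model output from reaching the feature table unnoticed.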