🏗️ Agentic Feature Mart: System Architecture

End-to-end ML Feature Platform with LangGraph Multi-Agent Orchestration · CarTrade Tech

❄️ Snowflake · 🤖 LangGraph · 🐍 Python + dbt · 🔁 Argo Workflows · 🧠 GPT-4o · 📍 Google Maps API · ⚡ Kafka + Spark · 🐳 Docker + K8s · ☁️ AWS S3
① DATA SOURCES: Raw ingestion from 5 verticals
📣 Marketing CRM
Campaign events, email opens, WhatsApp delivery, ROAS signals
MySQL CDC · REST API
🏷️ Seller Platform
Listing data, renewals, seller profile, pricing history
PostgreSQL · Batch ETL
⚡ Clickstream
Page views, session depth, funnel events, dwell time per seller page
Kafka Topics · Spark Stream
📍 Google Maps API
POI data, geocoding, distance matrix, reverse geocoding
REST API · GeoPandas
🌐 Web + Census
Regional intelligence, population density, income proxy, market data
Scrapy · CSV Ingest
Extract & Load
② ETL + FEATURE ENGINEERING + LLM AUGMENTATION: Compute, transform & enrich
🐍 Python ETL
Data extraction, type normalization, deduplication, Snowflake connector writes
pandas · snowflake-connector · pydantic
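A minimal sketch of the type normalization and deduplication step, using only the standard library rather than pandas or pydantic. The record fields (`seller_id`, `city`, `updated_at`) and the "latest row wins" rule are assumptions for illustration, not details taken from the platform.

```python
from datetime import datetime


def normalize(record: dict) -> dict:
    """Coerce raw string fields into typed values (hypothetical schema)."""
    return {
        "seller_id": int(record["seller_id"]),
        "city": record["city"].strip().title(),
        "updated_at": datetime.fromisoformat(record["updated_at"]),
    }


def deduplicate(records: list[dict]) -> list[dict]:
    """Normalize every record, then keep the most recent row per seller_id."""
    latest: dict[int, dict] = {}
    for rec in map(normalize, records):
        current = latest.get(rec["seller_id"])
        if current is None or rec["updated_at"] > current["updated_at"]:
            latest[rec["seller_id"]] = rec
    return sorted(latest.values(), key=lambda r: r["seller_id"])


# Illustrative input only; these rows are made up.
raw = [
    {"seller_id": "7", "city": " pune ", "updated_at": "2024-01-05T10:00:00"},
    {"seller_id": "7", "city": "Pune", "updated_at": "2024-02-01T09:30:00"},
    {"seller_id": "3", "city": "mumbai", "updated_at": "2024-01-20T12:00:00"},
]
clean = deduplicate(raw)
```

In a real pipeline the cleaned rows would then be written through the Snowflake connector; that write is left out here because its table and credentials are not described in the source.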
🗄️ SQL + dbt
Feature computation in Snowflake SQL. dbt models with tests, lineage DAG & docs
dbt-snowflake · dbt test · SQL macros
⚡ Kafka + Spark
Clickstream streaming. 5-minute micro-batch aggregation → Snowpipe into staging tables
Kafka · PySpark · Snowpipe
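The 5-minute micro-batch aggregation can be pictured as a tumbling window over event timestamps. The sketch below is plain Python, not PySpark, and the event shape `(seller_id, timestamp)` is an assumption.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling window, as stated in the card above


def window_start(ts: datetime) -> datetime:
    """Floor a timezone-aware timestamp to the start of its 5-minute window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


def aggregate_page_views(events: list[tuple[str, datetime]]) -> Counter:
    """Count page views per (seller_id, window_start) pair."""
    counts: Counter = Counter()
    for seller_id, ts in events:
        counts[(seller_id, window_start(ts))] += 1
    return counts
```

Each `(seller_id, window_start)` key corresponds to one aggregated row that a streaming job could land in a staging table.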
🗺️ Geo Computation
POI radius counts, polygon intersections, catchment areas, density grids
GeoPandas · Shapely · H3 Index
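A POI radius count reduces to "how many points of interest lie within r km of a seller". The sketch below uses the haversine great-circle distance from the standard library instead of GeoPandas, Shapely, or H3; the function names and the coordinate tuples are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def poi_count_within(seller: tuple[float, float],
                     pois: list[tuple[float, float]],
                     radius_km: float) -> int:
    """Number of POIs whose distance to the seller is at most radius_km."""
    return sum(
        1 for lat, lon in pois
        if haversine_km(seller[0], seller[1], lat, lon) <= radius_km
    )
```

A brute-force scan like this is quadratic across sellers and POIs; spatial indexing (for example the H3 cells listed above) is the usual way to avoid comparing every pair.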
🧠 GPT-4o LLM Layer
Area quality scoring, POI context analysis, neighbourhood classification, 1536-dim embeddings
GPT-4o API · text-embed-3 · JSON mode
Orchestrated via CI/CD
③ PIPELINE ORCHESTRATION (CI/CD): Monthly automated, zero-touch
🔁 Argo Workflows
CronWorkflow triggers monthly. DAG with 6 sequential steps. Up to 3 retries with backoff on failure. Slack alerts.
CronWorkflow · DAG steps · retry:3
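The retry policy itself lives in the Argo workflow spec, which is not shown in the source. As a language-neutral illustration of "3 retries with backoff", here is a small Python sketch; the exponential schedule (1 s, 2 s, 4 s) and the function name are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(step: Callable[[], T],
                     retries: int = 3,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `step`; on failure retry up to `retries` times with exponential backoff.

    The delay doubles after each failed attempt (assumed schedule). The final
    failure is re-raised so the caller can halt the pipeline and send an alert.
    """
    attempt = 0
    while True:
        try:
            return step()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

Passing `sleep` as a parameter keeps the function testable without real waiting.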
🐳 Docker + Kubernetes
Each pipeline step runs as an isolated K8s pod. Namespace: feature-pipelines. Auto-scaled by Argo.
Docker · K8s Pods · Helm Charts
✅ Quality Gates
Null rate <2%, PSI drift check, row count sanity, schema validation. Pipeline halts if any gate fails.
Great Expectations · PSI check
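Two of the gates above can be written down compactly. The Population Stability Index compares binned proportions of a baseline sample (e) and a current sample (a) as PSI = Σ (a_i - e_i) · ln(a_i / e_i). The sketch below is stdlib Python, not Great Expectations; the 2% null-rate limit comes from the card above, while the PSI limit of 0.2 is a common rule of thumb assumed here, and all names are hypothetical.

```python
import math


class QualityGateError(RuntimeError):
    """Raised to halt the pipeline when a gate fails."""


def null_rate(values: list) -> float:
    """Share of None entries in a column."""
    return sum(v is None for v in values) / len(values)


def psi(expected: list[float], actual: list[float],
        edges: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index over bins defined by ascending `edges`."""
    def proportions(values: list[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # index of the bin holding v
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def check_gates(column: list, baseline: list[float], edges: list[float],
                max_null_rate: float = 0.02, max_psi: float = 0.2) -> dict:
    """Return gate metrics, or raise QualityGateError on the first failure."""
    rate = null_rate(column)
    if rate >= max_null_rate:
        raise QualityGateError(f"null rate {rate:.2%} is not below {max_null_rate:.0%}")
    drift = psi(baseline, [v for v in column if v is not None], edges)
    if drift > max_psi:  # assumed threshold, not from the source
        raise QualityGateError(f"PSI {drift:.3f} exceeds {max_psi}")
    return {"null_rate": rate, "psi": drift}
```

Raising an exception mirrors the "pipeline halts if any gate fails" behaviour: the step exits non-zero and the orchestrator stops the DAG.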
🏷️ Feature Stamping
Every row tagged with v{YYYY}.{MM} stamp. Enforces train-serve consistency. Registered in FEATURE_REGISTRY table.
version stamp · FEATURE_REGISTRY
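A minimal sketch of how a `v{YYYY}.{MM}` stamp could be derived from the run date and attached to each row. The zero-padded month and the `feature_version` column name are assumptions; the source only gives the stamp pattern.

```python
from datetime import date


def version_stamp(run_date: date) -> str:
    """Format the monthly stamp, e.g. March 2024 becomes 'v2024.03'."""
    return f"v{run_date.year:04d}.{run_date.month:02d}"


def stamp_rows(rows: list[dict], run_date: date) -> list[dict]:
    """Return copies of the rows tagged with the stamp (hypothetical column name)."""
    stamp = version_stamp(run_date)
    return [{**row, "feature_version": stamp} for row in rows]
```

Training and serving code that both filter on the same stamp read exactly the same feature snapshot, which is the train-serve consistency the card refers to.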
☁️ AWS S3
Intermediate artifacts, pipeline logs, final feature matrix exports. Prefixed per stamp version.
S3 Versioned · Parquet
Write stamped features
④ SNOWFLAKE FEATURE MART: Single source of truth for all ML features

❄️ Snowflake · FEATURE_MART Schema

Warehouse: FEATURE_WH · Database: ANALYTICS_DB · Schema: FEATURE_MART

Total Features: 15,000+
Tables: 50+
Non-null Rate: 99.1%
Refresh Cycle: Monthly
Version Stamp: v{YYYY}.{MM}
SELLER_GEO_FEATURES · SELLER_ENGAGEMENT_FEATURES · CITY_DEMOGRAPHICS_FEATURES · LLM_LOCATION_FEATURES · CLICKSTREAM_SELLER_90D · SELLER_TREND_FEATURES · PROPERTY_MARKET_FEATURES · MARKETING_ENGAGEMENT_FEATURES · FEATURE_REGISTRY · + 41 more tables
Natural language query
⑤ AGENTIC LAYER: LangGraph Multi-Agent System for conversational feature discovery & retrieval
🤖 Orchestrator Agent
Entry Point · Router
• GPT-4 function calling
• Parses NL feature requests
• Decomposes into sub-tasks
• Routes to specialist agents
• StateGraph shared memory
• Maintains conversation turns
LangGraph · GPT-4 · StateGraph
→
🗂️ Table Recommender Agent
Semantic Search · Discovery
• Indexes FEATURE_REGISTRY
• Semantic match on 15K features
• Returns ranked table list
• Provides join keys & grain
• Relevance score per table
• Recommends filters
TF-IDF / Embed · FEATURE_REGISTRY
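To make the "ranked table list with a relevance score" concrete, here is a small TF-IDF plus cosine-similarity ranker in stdlib Python. It is a sketch, not the agent's actual index: the smoothed IDF formula, the function names, and the one-line table descriptions in the usage example are all assumptions (only the table names appear in the source).

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_tables(query: str, registry: dict[str, str], top_k: int = 3) -> list[tuple[str, float]]:
    """Rank registry tables by TF-IDF cosine similarity to the query."""
    docs = {name: Counter(tokenize(desc)) for name, desc in registry.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for tf in docs.values() for term in tf)
    # Smoothed inverse document frequency (an assumed variant)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}

    def vector(tf: Counter) -> dict[str, float]:
        # Terms never seen in the registry get weight 0
        return {t: f * idf.get(t, 0.0) for t, f in tf.items()}

    def cosine(u: dict[str, float], v: dict[str, float]) -> float:
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    q = vector(Counter(tokenize(query)))
    scored = [(name, cosine(q, vector(tf))) for name, tf in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical descriptions for three tables named in the source.
registry = {
    "SELLER_GEO_FEATURES": "poi counts and density around seller location",
    "CLICKSTREAM_SELLER_90D": "page views and dwell time per seller page over 90 days",
    "CITY_DEMOGRAPHICS_FEATURES": "population density and income proxy per city",
}
ranked = rank_tables("poi density near seller location", registry)
```

An embedding-based matcher would replace `vector` and `cosine` with dense vectors from an embedding model while keeping the same ranking interface.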
→
🔍 Data Fetcher Agent
SQL Gen · Snowflake Exec
• Auto-generates JOIN SQL
• Handles multi-grain alignment
• Executes on Snowflake
• Measures join success rates
• Returns quality report
• Null rate per column
SQL Gen · Snowflake · ToolNode
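A rough sketch of what generating JOIN SQL and a per-column null-rate report might look like. It only builds a string and inspects in-memory rows; it does not connect to Snowflake. The column names (`seller_id`, `feature_version`, `page_views_90d`), the LEFT JOIN strategy, and the helper names are assumptions. Identifiers are validated because they are interpolated into the SQL text.

```python
import re

SCHEMA = "ANALYTICS_DB.FEATURE_MART"  # database and schema named in the source
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_STAMP = re.compile(r"v\d{4}\.\d{2}")


def _ident(name: str) -> str:
    """Allow only plain SQL identifiers, since they are spliced into the query."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def build_join_sql(base: str, joins: dict[str, list[str]], key: str, stamp: str) -> str:
    """LEFT JOIN each table in `joins` onto `base` on `key`, for one version stamp."""
    if not _STAMP.fullmatch(stamp):
        raise ValueError(f"bad version stamp: {stamp!r}")
    key = _ident(key)
    select = [f"b.{key}"]
    clauses = []
    for i, (table, columns) in enumerate(joins.items(), start=1):
        alias = f"t{i}"
        select += [f"{alias}.{_ident(col)}" for col in columns]
        clauses.append(
            f"LEFT JOIN {SCHEMA}.{_ident(table)} AS {alias}"
            f" ON {alias}.{key} = b.{key} AND {alias}.feature_version = '{stamp}'"
        )
    return "\n".join([
        f"SELECT {', '.join(select)}",
        f"FROM {SCHEMA}.{_ident(base)} AS b",
        *clauses,
        f"WHERE b.feature_version = '{stamp}'",
    ])


def null_rates(rows: list[dict]) -> dict[str, float]:
    """Null rate per column over fetched rows (part of the quality report)."""
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in rows[0]}
```

With a LEFT JOIN, the null rate of a joined column doubles as a rough join-miss signal: rows from the base table with no match come back as NULL.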
→
🧪 Feature Engineer Agent
Enrichment · Output
• Validates base feature set
• Proposes enrichment groups
• Asks clarifying questions
• Fetches additional tables
• Applies version stamp
• Exports to S3 as Parquet
Human-in-loop · S3 Export
💬 Flow: User types a natural language feature request → Orchestrator decomposes it → Table Recommender surfaces tables → Data Fetcher writes & runs SQL on Snowflake → Feature Engineer enriches + asks follow-ups → final versioned 1,141-feature matrix exported in ~10 min
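The flow can be pictured as four node functions that each read a shared state and return a partial update. The sketch below is a plain-Python stand-in for that pattern and deliberately uses no LangGraph APIs; every node body is a stub, and the request format, lookup table, and stamp value are made up for illustration.

```python
from typing import Callable

State = dict
Node = Callable[[State], dict]


def orchestrator(state: State) -> dict:
    # Stub: a real router would use an LLM to decompose the request.
    return {"subtasks": [part.strip() for part in state["request"].split(";")]}


def table_recommender(state: State) -> dict:
    # Stub: a tiny hard-coded lookup instead of a semantic search.
    lookup = {"geo": "SELLER_GEO_FEATURES", "clickstream": "CLICKSTREAM_SELLER_90D"}
    return {"tables": [lookup[t] for t in state["subtasks"] if t in lookup]}


def data_fetcher(state: State) -> dict:
    # Stub: records which tables would be joined; nothing is executed.
    return {"sql": "-- would join: " + ", ".join(state["tables"])}


def feature_engineer(state: State) -> dict:
    # Stub: attaches a hypothetical version stamp.
    return {"stamp": "v2024.03"}


PIPELINE: list[Node] = [orchestrator, table_recommender, data_fetcher, feature_engineer]


def run(request: str) -> State:
    """Pass shared state through each node, merging its partial update."""
    state: State = {"request": request}
    for node in PIPELINE:
        state = {**state, **node(state)}
    return state
```

A graph framework adds what this linear loop lacks: conditional routing, human-in-the-loop pauses for clarifying questions, and persisted conversation turns.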
Versioned feature matrix
⑥ MODEL CONSUMERS: 5+ product lines training on Feature Mart
📈 Seller Prioritization
XGBoost ranker. Ranks sellers for sales outreach. ₹5Cr+ annual revenue attribution.
XGBoost · 1,141 features
💸 CRM Scoring
Identifies sellers who don't need WhatsApp outreach. ₹1Cr/yr savings. 40% cost reduction.
Propensity · sklearn
🛻 Used Car Loan
Loan eligibility scoring using seller geo & engagement features as credit proxies.
LightGBM · Geo features
🏦 Personal Loan
Buyer income proxies and engagement signals for personal loan pre-qualification.
LogReg · Demographics
🧑‍💼 Buyer Model
Buyer intent & conversion scoring using clickstream and location context features.
Ensemble · Clickstream
Agentic Feature Mart · CarTrade Tech · ML Platform
15,000+ Features · 50+ Snowflake Tables · ₹6Cr+ Annual Value · 5+ Product Lines · Monthly Auto-Pipeline