# B2B Intelligence Platform Development — Production-Grade AI + Data Pipeline
## Project Overview
We are building a **production-grade B2B intelligence platform** focused on large-scale public data acquisition, AI-powered document intelligence, and real-time business alerts.
The platform will crawl and process data from **50+ public-facing websites**, extract structured intelligence from multilingual PDFs (English, Hindi, Marathi), and deliver actionable insights through search, AI-generated reports, and multi-channel notifications.
This is a **long-term product engagement**, not a short-term prototype assignment.
The selected team/freelancer will work on a high-scale architecture designed for:
* Large-volume document ingestion
* AI-assisted extraction pipelines
* Search + semantic intelligence
* Risk scoring
* Real-time alerts
* Enterprise-grade observability
* Scalable AWS infrastructure
An NDA is mandatory before we share the complete architecture, workflows, source mappings, schemas, and internal business logic.
---
# Required Technical Skills
## Backend & Core Stack
* Python 3.11
* FastAPI
* asyncio
* asyncpg
* Production-grade architecture
* Typed, tested, maintainable code
## Web Crawling & Data Acquisition
* Playwright
* httpx
* curl-cffi
* JS-rendered page handling
* Session management
* Queue-based distributed crawling
* Rate limiting & retry orchestration
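To illustrate the kind of retry orchestration we have in mind, here is a minimal sketch using only the standard library. The function name `fetch_with_retry` and its parameters are hypothetical, and the semaphore is a simple concurrency cap standing in for a fuller per-source rate limiter; the real design is open to proposal.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def fetch_with_retry(
    fetch: Callable[[], Awaitable[T]],
    *,
    limiter: asyncio.Semaphore,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
) -> T:
    """Run `fetch` under a concurrency cap, retrying failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            async with limiter:  # caps the number of in-flight requests
                return await fetch()
        except Exception:  # a real crawler would catch narrower error types
            if attempt == max_attempts:
                raise
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(random.uniform(0, delay))
    raise RuntimeError("unreachable")
```

In practice `fetch` would wrap an httpx or Playwright call, with one limiter per source site so that a slow source does not starve the others.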
## Document Processing & OCR
* pdfplumber
* PyMuPDF
* Tesseract 5
* Hindi + Marathi OCR language packs
* OCR fallback pipelines
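As a rough picture of an OCR fallback pipeline, the sketch below tries a list of extraction stages in order and accepts the first one that returns enough text. The names `run_cascade`, `ExtractionResult`, and the `min_chars` threshold are assumptions for illustration; the stage callables are placeholders for wrappers around pdfplumber, PyMuPDF, and Tesseract.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Extractor = Callable[[bytes], str]


@dataclass(frozen=True)
class ExtractionResult:
    text: str
    stage: str  # name of the stage that produced the text


def run_cascade(
    pdf_bytes: bytes,
    stages: Sequence[tuple[str, Extractor]],
    min_chars: int = 50,
) -> Optional[ExtractionResult]:
    """Try each stage in order; return the first result with enough text."""
    for name, extract in stages:
        try:
            text = extract(pdf_bytes).strip()
        except Exception:
            continue  # a failing stage falls through to the next one
        if len(text) >= min_chars:
            return ExtractionResult(text=text, stage=name)
    return None  # every stage failed or produced too little text
```

Ordering the stages from cheapest (embedded text layer) to most expensive (OCR with the `eng`, `hin`, and `mar` Tesseract language data) keeps OCR cost limited to scanned documents. A character-count threshold is a crude quality check; a production pipeline would likely need something stronger.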
## AI / LLM Integration
* Anthropic Claude API
* OpenAI API
* Structured JSON extraction
* Schema validation
* Confidence scoring
* Embedding pipelines (OpenAI text-embedding-3 models, BGE-M3)
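The sketch below shows one way structured JSON extraction, schema validation, and confidence scoring could fit together, using only the standard library. The schema fields (`title`, `issuer`, `published_on`, `amount`) are hypothetical, and the confidence score is a deliberately crude placeholder; the actual schemas are shared after the NDA.

```python
import json
from dataclasses import dataclass, field

# Hypothetical schema: field name -> expected Python type.
SCHEMA: dict[str, type] = {
    "title": str,
    "issuer": str,
    "published_on": str,
    "amount": float,
}


@dataclass
class Validated:
    fields: dict[str, object] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)

    @property
    def confidence(self) -> float:
        """Share of schema fields that validated (a placeholder score)."""
        return len(self.fields) / len(SCHEMA)


def validate_extraction(raw: str) -> Validated:
    """Parse a model response and keep only fields that match SCHEMA."""
    result = Validated()
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        result.errors.append(f"invalid JSON: {exc.msg}")
        return result
    if not isinstance(payload, dict):
        result.errors.append("top-level value is not an object")
        return result
    for name, expected in SCHEMA.items():
        value = payload.get(name)
        # Accept ints where a float is expected; bools are not numbers here.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if isinstance(value, expected):
            result.fields[name] = value
        else:
            result.errors.append(f"{name}: expected {expected.__name__}")
    return result
```

Records below a chosen confidence threshold could be routed to a second model pass or a human review queue rather than being written straight to the database.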
## Data & Search Infrastructure
* PostgreSQL 14+
* JSONB
* Query optimisation
* Table partitioning
* pgvector
* OpenSearch / Elasticsearch
* Custom analyzers for multilingual search
* Redis for queues, caching, throttling
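For combining keyword results (OpenSearch / Elasticsearch) with vector results (pgvector), one common option is reciprocal rank fusion. This brief does not prescribe a fusion method, so treat the sketch below as an assumption; the function name and the constant `k = 60` are illustrative defaults, not requirements.

```python
from collections import defaultdict
from typing import Iterable, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[str]],
    k: int = 60,
) -> list[str]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Because the method uses only rank positions, it avoids having to normalise BM25 scores against cosine similarities.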
## Cloud & DevOps
* AWS ap-south-1 (Mumbai only)
* ECS Fargate
* S3
* RDS
* IAM
* Secrets Manager
* Docker
* Terraform
* GitHub Actions
---
# Preferred / Bonus Skills
* Apache Airflow / MWAA
* Indic-language NLP experience
* React + TypeScript
* WhatsApp Cloud API
* Firebase Cloud Messaging
* AWS SES
* Sentry
* OpenTelemetry
* Grafana
* LLM cost optimisation strategies
* High-scale document processing systems
* DPDP Act 2023 compliance
* Experience running pipelines that process 10,000+ documents/day
---
# Scope of Work
The selected developer/team will build the following production components:
## Core Pipeline
1. Distributed web crawler
2. Document acquisition engine
3. S3 document storage layer
4. OCR cascade pipeline
5. Section detection engine
6. Structured field extraction
7. Revision diff engine
8. Change classification layer
9. Intelligence/risk scoring engine
10. Hybrid search engine
11. AI-powered report generation
12. Multi-channel alerting engine
13. Admin dashboard
14. Operational tooling & monitoring
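As an example of what the revision diff engine (item 7) does at its simplest, the sketch below compares two text revisions line by line with the standard library's `difflib`. The function name and the returned dictionary shape are hypothetical; the production engine would work on detected sections and extracted fields rather than raw lines.

```python
import difflib


def diff_revisions(old: str, new: str) -> dict[str, list[str]]:
    """Return the lines removed from `old` and added in `new`."""
    old_lines, new_lines = old.splitlines(), new.splitlines()
    added: list[str] = []
    removed: list[str] = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(old_lines[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])
    return {"added": added, "removed": removed}
```

The added and removed lines would then feed the change classification layer (item 8), which decides whether a change is material enough to trigger an alert.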
---
# Deliverables
## Code Deliverables
* 11 Dockerised microservices
* PostgreSQL schema with migrations
* React + TypeScript admin dashboard
* Public REST APIs
* OpenAPI 3.0 documentation
## Infrastructure Deliverables
* Terraform infrastructure-as-code
* AWS deployment architecture
* ECS deployment pipelines
* CI/CD workflows
* Observability stack
* Production deployment on AWS Mumbai region
## Quality Deliverables
* ≥75% test coverage on core extraction logic
* Integration testing
* Production-scale load testing
* Technical documentation
* Runbooks
* Monitoring dashboards
* Error tracking setup
---
# Important Constraints
* AWS Mumbai region only (ap-south-1)
* Indian data residency mandatory
* No cross-border data transfer
* No anti-bot bypassing
* Only compliant/public-access acquisition flows
* Production-quality engineering required
* Observability + testing mandatory
* Timezone within ±2 hours of IST preferred
---
# Engagement Model
We are open to:
* Fixed-price engagement
* Milestone-based delivery
* Hourly engagement
* Long-term retainer
Please propose the engagement structure best suited for your team.
---
# What We Are Looking For
We prefer teams/freelancers who:
* Have built production-scale platforms
* Understand distributed systems
* Can work independently
* Write maintainable code
* Think in systems, not just tasks
* Can support long-term product evolution
Please include the following in your proposal:
* Relevant project experience
* Team composition
* Architecture approach
* Deployment strategy
* Sample production systems
* Engagement model preference
* Post-launch support capability
NDA will be executed before detailed technical discussions.