Project Description
I am an independent professional reaching out to enquire about your data extraction and Data-as-a-Service (DaaS) capabilities. I am looking for an expert partner to process approximately 48,000 specific project URLs from the MahaRERA (Maharashtra Real Estate Regulatory Authority) portal.
I do not require the scraping scripts or source code; my goal is simply to procure the final, clean dataset (Excel/CSV) and a structured folder of the processed PDF documents.
Below is a detailed overview of the project requirements:
Pipeline & Workflow Requirements:
Captcha Bypass: Each URL is protected by a simple alphanumeric captcha. You will need to solve it automatically (for example with OCR, combined with proxy and session handling) to reach the project pages.
Data Extraction: Extract the data points listed under "Required Data Points" below from the HTML tables on each project page.
Download & Merge PDFs: Under the "Promoter Documents" section, locate multi-part files labeled "Land Ownership Document" (e.g., REGISTERED EXCHANGE DEED Part 1, Part 2). Download all parts for each project, merge them into a single PDF, and name the final file [Registration_Number]_Land_Document.pdf.
AI/NLP Document Analysis (Critical): Run each merged legal document (which may be in English or Marathi) through an AI/NLP model to extract the "Consideration" or "Deal Structure" agreed between the parties.
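To make the PDF step concrete, here is a minimal sketch of how the parts could be ordered and how the merged file could be named. It only illustrates the convention described above: the function names are hypothetical, the registration number in the usage note is a made-up placeholder, and the rule for replacing characters that are unsafe in file names is an assumption. The merge itself would be done with a PDF library and is not shown.

```python
import re

# Matches the "Part N" suffix used in labels such as
# "REGISTERED EXCHANGE DEED Part 1" (pattern is an assumption).
PART_RE = re.compile(r"part\s*(\d+)", re.IGNORECASE)


def part_number(label: str) -> int:
    """Return N from a 'Part N' label, or 0 if the label has no part number."""
    match = PART_RE.search(label)
    return int(match.group(1)) if match else 0


def order_parts(labels: list[str]) -> list[str]:
    """Sort labels numerically so that 'Part 10' comes after 'Part 2'."""
    return sorted(labels, key=part_number)


def merged_filename(registration_number: str) -> str:
    """Build '[Registration_Number]_Land_Document.pdf'.

    Characters that are unsafe in file names are replaced with '_'
    (this sanitising rule is an assumption, not a stated requirement).
    """
    safe = re.sub(r'[\\/:*?"<>|\s]+', "_", registration_number.strip())
    return f"{safe}_Land_Document.pdf"
```

For example, merged_filename("P00000000000") returns "P00000000000_Land_Document.pdf", and order_parts places "Part 2" before "Part 10", which a plain alphabetical sort would not.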
Required Data Points:
Primary ID: Registration Number & Date of Registration.
Basic Details: Project Name, Project Type, and Proposed Completion Date.
Area Details: Land Area for Project Applied (Sq. Mts.), Permissible Built-up Area, Sanctioned Built-up Area, and Aggregate Area of Recreational Open Space.
Legal & Promoter Details: CC Date, Landowner Type, GSTIN, Promoter Name, and all individual names listed in the "Member Details" table.
Joint Venture Flag: A True/False column (Mark 'True' if the "Promoter Name" contains a comma or lists multiple entities).
Unit Details: Total Residential & Non-Residential Units.
AI Consideration Category: Categorize the deal structure from the PDF as: Pure Monetary, Barter, Constructed Area Share, Revenue Share, or Mixed.
AI Consideration Summary: A 1-2 sentence English summary of the commercial terms extracted from the PDF.
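The two derived columns above (Joint Venture Flag and AI Consideration Category) could be produced and validated roughly as follows. This is a sketch under stated assumptions: the comma rule comes from the description above, while treating semicolons and line breaks as additional separators is an assumption, and all function and class names are hypothetical. Names such as "A & B Developers" cannot be resolved reliably by a rule like this and would need manual review.

```python
import re
from enum import Enum


class ConsiderationCategory(Enum):
    """The five allowed deal-structure labels from the requirements."""
    PURE_MONETARY = "Pure Monetary"
    BARTER = "Barter"
    CONSTRUCTED_AREA_SHARE = "Constructed Area Share"
    REVENUE_SHARE = "Revenue Share"
    MIXED = "Mixed"


# Comma is the stated rule; semicolon and newline are assumed extra
# separators. A slash is deliberately excluded so "M/s ..." is not flagged.
_SEPARATOR_RE = re.compile(r"[,;\n]")


def is_joint_venture(promoter_name: str) -> bool:
    """Return True if the promoter name appears to list several entities."""
    return bool(_SEPARATOR_RE.search(promoter_name))


def parse_category(raw: str) -> ConsiderationCategory:
    """Map a model's raw label onto one of the five allowed categories.

    Whitespace and letter case are normalised; anything else raises
    ValueError so unexpected labels are caught instead of being written
    to the dataset.
    """
    normalized = " ".join(raw.split()).casefold()
    for category in ConsiderationCategory:
        if category.value.casefold() == normalized:
            return category
    raise ValueError(f"Unexpected consideration category: {raw!r}")
```

Rejecting unknown labels, rather than mapping them to "Mixed", keeps the category column limited to the five values requested and makes model errors visible during the proof-of-concept milestone.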
Project Deliverables:
One clean .xlsx or .csv file containing all extracted data points and AI summaries for every URL (approximately 48,000).
A .zip archive containing the correctly named, merged Land Ownership Document PDF for every project.
Proposed Milestone Structure:
To ensure quality and alignment, I propose dividing this contract into two milestones:
Milestone 1 (Proof of Concept): I will provide 100 URLs. You will deliver the CSV (including the AI extraction) and the 100 merged PDFs to prove the pipeline works accurately.
Milestone 2 (Final Delivery): Upon approval of the sample, you will process the remaining URLs and deliver the final bulk files.
Request for Proposal:
If your team is capable of handling this workflow, please reply to this email with a proposal. To ensure you have read through the requirements, please start your response with the word "MahaRERA".
Kindly include the following in your proposal:
Your estimated total cost for the project in INR.
Your expected turnaround time.
A brief explanation of the AI/NLP stack you would utilize to read and summarize the Marathi/English property deeds.
I look forward to hearing from you.