Project Description
I need a script that visits the California Contractors State License Board (CSLB) site, drills down into every contractor record, and captures every publicly visible field: license number, classifications, bonding, personnel, status history, contact info, and anything else shown on the record. The scraper must handle pagination, search-form quirks, and any soft rate limiting the site imposes, so that each run finishes without manual intervention and without captchas blocking the flow.
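To make the rate-limiting expectation concrete, here is a rough sketch of the kind of retry behavior I have in mind. It is only an illustration: the function names, the set of "slow down" status codes, and the delay values are my own assumptions, not anything documented by the CSLB site. The `fetch` argument stands in for whatever HTTP call the chosen stack uses and is assumed to return a `(status, body)` pair.

```python
import random
import time
from typing import Callable, Iterator, Tuple

# Status codes treated as "slow down" signals. This set is an assumption,
# not something the CSLB site is known to return.
RETRYABLE = {429, 503}


def backoff_delays(base: float, retries: int, cap: float = 60.0) -> Iterator[float]:
    """Yield delays base, 2*base, 4*base, ... with each one capped at `cap`."""
    for attempt in range(retries):
        yield min(cap, base * (2 ** attempt))


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    retries: int = 5,
    base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> Tuple[int, str]:
    """Call fetch(url); on a retryable status, wait and try again.

    Makes at most retries + 1 attempts and returns the last (status, body).
    """
    status, body = fetch(url)
    for delay in backoff_delays(base, retries):
        if status not in RETRYABLE:
            break
        # Random jitter keeps repeated runs from hitting the site in lockstep.
        sleep(delay + jitter())
        status, body = fetch(url)
    return status, body
```

Passing `sleep` and `jitter` as parameters is just a convenience so the retry logic can be tested without real waiting. A production version would likely also add a fixed pause between successful requests.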
Once gathered, the data should be written to a single CSV file, overwriting the previous file on each run so I always have a fresh snapshot. The entire process has to trigger automatically every 24 hours (cron, systemd timer, Cloud Scheduler—whatever you prefer) and run headless on a Linux VPS I’ll provide. I am fine with Python (requests, BeautifulSoup, Scrapy, Selenium), Node with Puppeteer, or another solid stack as long as setup is straightforward.
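For the "overwrite on each run" requirement, something along these lines would be acceptable. It is a minimal sketch that assumes the records arrive as dictionaries; the function name `write_snapshot` and the column names in the usage note are hypothetical. The point is that the old snapshot stays in place until the new file is complete, so a failed run never leaves a half-written CSV behind.

```python
import csv
import os
import tempfile
from typing import Dict, Iterable, List


def write_snapshot(rows: Iterable[Dict[str, str]], fieldnames: List[str], dest: str) -> None:
    """Write rows to a temp file next to dest, then swap it into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # The temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        # Replaces any existing file at dest in a single step.
        os.replace(tmp_path, dest)
    except BaseException:
        # On failure, remove the partial temp file and keep the previous snapshot.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A call might look like `write_snapshot(records, ["license_number", "status"], "cslb.csv")`. If cron is used for scheduling, a daily entry such as `0 3 * * * /usr/bin/python3 /opt/cslb/scrape.py >> /var/log/cslb.log 2>&1` would cover the 24-hour trigger; the time and paths there are placeholders, not requirements.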
Deliverables
• Source code with a clear README covering setup, environment variables, and scheduling steps
• One-time deployment assistance on my VPS
• Proof of a successful unattended daily run (sample CSV + log)
Acceptance criteria: a full CSV containing every current CSLB license record and all associated fields, generated automatically on three consecutive days without errors.