Project Description
I have a list of niche websites and need every relevant piece of text and its accompanying images extracted, organised, and delivered in a clean, ready-to-use format. I’ll share the URLs and the exact data points once we start, but expect a mix of article-style pages and media galleries.
Scope
• Capture both written content (headings, paragraphs, metadata) and all on-page images.
• Provide the text in CSV or JSON and store images in clearly named folders that map back to the records.
• Preserve the basic structure: each text record must include the file name or path of every image that belongs to it.
• Respect robots.txt and rate limits; the scrape must be discreet and repeatable.
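To illustrate the kind of output and politeness check I have in mind, here is a minimal Python sketch. It is only an example under assumptions, not a required design: the field names, the images/<record id>/ folder layout, the "example-scraper" user agent, and the helper names is_allowed and build_record are all placeholders, and the real data points will follow once we start.

```python
import json
import os
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "example-scraper") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so the text can be fetched once and reused.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def build_record(record_id: str, url: str, title: str,
                 paragraphs: list[str], image_urls: list[str]) -> dict:
    """Build one text record that points at its image files (hypothetical layout)."""
    images = []
    for index, image_url in enumerate(image_urls, start=1):
        # Keep the original extension when the URL has one; fall back to .bin.
        extension = os.path.splitext(urlparse(image_url).path)[1] or ".bin"
        images.append({
            "source_url": image_url,
            "path": f"images/{record_id}/{index:03d}{extension}",
        })
    return {
        "id": record_id,
        "url": url,
        "title": title,
        "paragraphs": paragraphs,
        "images": images,
    }


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    page = "https://example.com/articles/1"
    if is_allowed(robots, page):
        record = build_record(
            "article-0001", page, "Sample title", ["First paragraph."],
            ["https://example.com/media/cat.png?size=large"],
        )
        print(json.dumps(record, indent=2))
```

A JSON record shaped like this maps to a CSV row by joining the image paths into a single column; either format works as long as the path in the record matches the file on disk.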
What I’d like to see in your proposal
Please outline your end-to-end approach: preferred language or framework (e.g. Python with Scrapy/BeautifulSoup, Selenium for dynamic pages, or another stack you trust), handling of pagination/login barriers, deduplication strategy, and estimated turnaround time. A brief sample architecture diagram or code snippet showing how you handle image downloads would be a plus.
Deliverables
1. Scraper script(s) with clear setup instructions.
2. Final datasets (CSV/JSON) and corresponding image folders.
3. A short README explaining how to rerun the scrape and update the data.
I’m ready to move quickly once I see a detailed project proposal that convinces me you can gather both text and visual assets accurately and efficiently.