Project Description
I will hand you a two-column CSV of real email bodies and their spam/ham labels. From that single file I need a full production-ready pipeline:
• Text preprocessing that cleans each message, tokenises it and converts it to TF-IDF vectors.
• Training loops for Logistic Regression, Multinomial Naïve Bayes and linear-kernel SVM, with code that automatically picks the best performer.
• A FastAPI service exposing /predict so any caller can POST raw email text and receive the predicted class plus a probability score.
• A clear evaluation notebook or script that prints accuracy, F1, ROC-AUC and a confusion matrix so I can verify performance.
• Repository structured for easy hand-off, including requirements.txt and a concise README explaining setup, training and API usage.
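A minimal sketch of how the preprocessing and model-selection steps could fit together, assuming scikit-learn, integer labels (1 = spam, 0 = ham) and F1 as the selection metric. None of these choices is fixed by this brief, and the names clean_text, build_candidates and select_best are placeholders. LinearSVC does not expose predict_proba, so the sketch wraps it in CalibratedClassifierCV to obtain the probability score that /predict has to return.

```python
import re

from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def clean_text(text: str) -> str:
    """Illustrative cleaning: drop HTML tags, mask URLs, lowercase, squeeze spaces."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " url ", text)
    return re.sub(r"\s+", " ", text.lower()).strip()


def build_candidates(seed: int = 42) -> dict:
    """The three model families named in the brief (hyperparameters are assumptions)."""
    return {
        "logreg": LogisticRegression(max_iter=1000),
        "naive_bayes": MultinomialNB(),
        # Calibration adds predict_proba on top of the linear SVM.
        "linear_svm": CalibratedClassifierCV(LinearSVC(random_state=seed), cv=2),
    }


def build_pipeline(clf) -> Pipeline:
    """TF-IDF vectoriser followed by one classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(preprocessor=clean_text)),
        ("clf", clf),
    ])


def select_best(texts, labels, n_splits: int = 3, seed: int = 42):
    """Cross-validate every candidate and refit the winner on all data."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {}
    for name, clf in build_candidates(seed).items():
        fold_scores = cross_val_score(
            build_pipeline(clf), texts, labels, cv=cv, scoring="f1"
        )
        scores[name] = float(fold_scores.mean())
    best_name = max(scores, key=scores.get)
    best_model = build_pipeline(build_candidates(seed)[best_name]).fit(texts, labels)
    return best_name, best_model, scores
```

In a train.py along these lines, the fitted pipeline returned by select_best could be persisted with joblib.dump and loaded by the API. Fixing the seed is meant to help a re-run select the same model.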
I would appreciate, but do not strictly require, a small Streamlit dashboard that lets me paste an email and see the prediction in real time; feel free to propose how you would add that.
Acceptance is based on:
1. Re-running train.py on my machine reproduces the best model.
2. Calling /predict with sample emails returns the correct spam/ham label together with a confidence score.
3. The scores in metrics_report.txt (or the notebook) match or exceed the scores you quote.
4. The project installs cleanly inside a fresh virtual environment with a single pip install -r requirements.txt.
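To make criterion 3 easy to check, the evaluation script could produce a report along the following lines. This is a sketch under assumptions: scikit-learn metrics, integer labels (1 = spam, 0 = ham), a fitted model with predict and predict_proba, and a hypothetical helper named build_report. The exact layout of metrics_report.txt is not prescribed here.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    roc_auc_score,
)


def build_report(model, texts, labels) -> str:
    """Return accuracy, F1, ROC-AUC and the confusion matrix as plain text."""
    pred = model.predict(texts)
    # ROC-AUC needs a score per email, so take the spam probability column.
    spam_col = list(model.classes_).index(1)
    p_spam = model.predict_proba(texts)[:, spam_col]
    cm = confusion_matrix(labels, pred, labels=[0, 1])
    lines = [
        f"accuracy: {accuracy_score(labels, pred):.4f}",
        f"f1 (spam): {f1_score(labels, pred, pos_label=1):.4f}",
        f"roc_auc: {roc_auc_score(labels, p_spam):.4f}",
        "confusion matrix (rows = true, columns = predicted; order: ham, spam):",
        str(cm),
    ]
    return "\n".join(lines)
```

The returned string can be printed and written to metrics_report.txt. For the numbers to be comparable between machines, the texts and labels passed in should come from a held-out split created with a fixed random seed.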
If you need any clarification about the email data or its domain, just ask; otherwise please outline your approach and timeline and we can get started right away.