Project Description
I’m refining an AI-driven question-paper checker and parser, and my top priority right now is accuracy. With multiple-choice items, the system too often marks wrong answers as correct and flags correct ones as wrong, so I need a specialist who can put several models through their paces, isolate the causes of these misclassifications, and tune the pipeline until the results are consistently reliable. Speed and consistency will matter later, but first I need the numbers to be right.
You’ll have freedom to evaluate any mix of commercial LLMs (GPT-4, Claude, Bard, etc.) and open-source options (Llama 2, Falcon, custom fine-tuned transformers) so long as you keep the comparison fair and reproducible. The current stack is Python with Hugging Face and a small PostgreSQL database for answer keys; if another framework is essential, outline why and how you’ll integrate it back into this environment.
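As a rough illustration of what "fair and reproducible" could mean in practice, the sketch below freezes the prompt wording and the item order into a manifest that can be stored alongside each benchmark run. Everything here is an assumption made for illustration: the prompt template, the `build_run_manifest` helper, and the `id` field on each item are hypothetical and not part of the existing stack.

```python
import hashlib
import json
import random

# Hypothetical prompt template; every model under test would receive the same wording.
PROMPT_TEMPLATE = "Question: {question}\nOptions: {options}\nAnswer with a single letter."


def build_run_manifest(model_names, items, seed=0):
    """Freeze the evaluation order and prompt so each model sees identical input.

    `items` is assumed to be a list of dicts that each carry an "id" key.
    """
    rng = random.Random(seed)
    # Sort first so the result does not depend on how the items were loaded.
    order = sorted(item["id"] for item in items)
    rng.shuffle(order)

    # The fingerprint covers the prompt template and the item order.
    payload = json.dumps({"prompt": PROMPT_TEMPLATE, "order": order}, sort_keys=True)
    return {
        "models": sorted(model_names),
        "seed": seed,
        "item_order": order,
        "fingerprint": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }
```

Two runs that report the same fingerprint used the same prompt and the same item order, which makes it easier to tell whether a change in scores comes from the model or from the setup.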
Deliverables
• A benchmark report showing precision, recall, and F1 for each model on my labeled multiple-choice dataset
• The cleaned, documented evaluation scripts or notebooks (Python preferred)
• An updated inference module or prompt library that meets or beats 98% exact-match accuracy on the test set
• A short hand-off guide so I can run future batches the same way
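To make the benchmark metrics concrete, here is a minimal sketch of how precision, recall, F1, and accuracy might be computed for one grader. It assumes each response has a human label and a model verdict, treats "answer is correct" as the positive class, and reads accuracy as the share of verdicts that agree with the human label. The `GradedItem` type and `score_grader` function are hypothetical names, not existing code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedItem:
    """One multiple-choice response with a human label and a model verdict."""
    item_id: str
    human_says_correct: bool  # ground truth from the labeled dataset
    model_says_correct: bool  # verdict from the grader under test


def score_grader(items):
    """Return precision, recall, F1, and accuracy with 'correct' as the positive class."""
    tp = sum(1 for i in items if i.human_says_correct and i.model_says_correct)
    fp = sum(1 for i in items if not i.human_says_correct and i.model_says_correct)
    fn = sum(1 for i in items if i.human_says_correct and not i.model_says_correct)
    tn = sum(1 for i in items if not i.human_says_correct and not i.model_says_correct)

    # Guard every division so an empty or one-sided dataset does not raise.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(items) if items else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "false_accepts": fp,  # wrong answers marked as correct
        "false_rejects": fn,  # correct answers flagged as wrong
    }
```

Reporting false accepts and false rejects separately keeps the two failure modes described above visible instead of folding them into a single score.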
If further optimisation suggests extending support to short-answer or essay questions, let me know in your findings; for now, keep the focus tightly on multiple-choice grading accuracy.