Rigorous AI/LLM Study Methodology Review

Pending

💰 INR 12500–37500 👤 Unknown 🕒 9d ago status: new

Required Skills

Machine Learning (ML), AI Research

Project Description

Title: AI/LLM Evaluation Methodology Reviewer - Pilot Study Audit (NDA, fixed-price, single reviewer)

We are commissioning a hostile-but-fair methodology review of an internal pilot study before we expand it or commission independent validation. We are NOT looking for friendly feedback. We are looking for the strongest critique you can write.

THE STUDY (outlined only; full materials shared under NDA):

A pre-registered empirical pilot benchmarking nine prompting conditions on a single frontier large language model across a small stratified sample. Scoring is performed by a heuristic automated scorer plus a single LLM-as-judge validation pass on a random sample. Descriptive statistics only. Approximately 6,000 words of manuscript plus supporting methodology documentation and raw outputs.

ENGAGEMENT:

- Fixed-price, single reviewer
- 10 working days from kickoff
- Mutual NDA required before any document is shared; 3-year survival
- Deliverable: 6-10 page review document (template provided on signature)

CONFIDENTIALITY OF ENGAGEMENT:

This is an internal quality-bar review. The review document itself will not be published, and your involvement in this engagement will remain confidential unless mutually agreed otherwise in writing. The mutual NDA covers both directions: we will not name you publicly, and you will not disclose the study, our identity, or the existence of this engagement to any third party.

YOU MUST BE ABLE TO CREDIBLY DO ALL SEVEN OF THE FOLLOWING:

1. Audit an LLM-as-judge protocol for the standard contamination patterns (position bias, length bias, sycophancy, same-model-family judge dependence) per Zheng et al. 2023 (MT-Bench / "Judging LLM-as-a-Judge") and the multi-judge literature.
2. Evaluate prompt-template fairness across experimental conditions: token-budget parity, instruction-specificity asymmetry, output-format scaffolding asymmetry, and whether comparator conditions represent the strongest possible instantiation of each baseline.
3. Audit a heuristic automated scorer (marker counting / regex / embedding-based) for whether "scorer-blind to condition identity" actually implies "scorer-blind to design intent"; recognize the measurement-instrument selection bias pattern.
4. Specify the right small-sample paired non-parametric test, bootstrap CI procedure, and multiple-comparison correction for a benchmark pilot, and call out when point-estimate reporting is doing more harm than good.
5. Assess reproducibility under stochastic generation (temperature, nucleus sampling, seed control, deterministic decoding) and the downstream effect on reported effect sizes.
6. Distinguish defensible pre-registration (OSF / AsPredicted with public timestamp and calibrated effect-size thresholds) from self-asserted pre-registration in a private repository.
7. Deliver a severity-ranked critique with hostile-but-fair posture: acknowledge strengths honestly alongside major flaws, distinguish reject-worthy from revise-worthy from line edits, and benchmark the submission against NeurIPS Datasets & Benchmarks Track readiness.

REQUIRED:

- PhD, advanced graduate student, or equivalent industry research background in NLP, ML, computational linguistics, AI evaluation, or applied AI research
- Demonstrable familiarity with LLM benchmarking literature (HELM, MMLU, MT-Bench, BIG-bench, AlpacaEval, G-Eval, or equivalent)
- Demonstrable peer-review experience at named venues (ACL, EMNLP, NAACL, NeurIPS, ICML, ICLR, COLM, or equivalent; please cite)
- Willingness to sign mutual NDA before any study material is shared

WE DO NOT NEED:

- Subject-matter expertise in the application domain (we will brief you)
- Engineering or implementation review (we have that)
- Friendly validation (we have that too)

TO APPLY (generic copy-paste applications auto-rejected):

1. One paragraph naming which of the seven capabilities is your strongest, with a specific example from prior work (paper, review, or project) that demonstrates it.
2. PhD-granting institution (or current program), field, year.
3. 1-2 named venues you have reviewed for.
4. Your fixed-price quote for the engagement.
5. Earliest start date.
6. NDA willingness: yes / no.

We will respond to all shortlisted applicants within 5 business days.
