Project Description
Title: AI/LLM Evaluation Methodology Reviewer — Pilot Study Audit
(NDA, fixed-price, single reviewer)
We are commissioning a hostile-but-fair methodology review of an
internal pilot study before we expand it or commission independent
validation. We are NOT looking for friendly feedback. We are looking
for the strongest critique you can write.
THE STUDY (outlined only; full materials shared under NDA):
A pre-registered empirical pilot benchmarking nine prompting
conditions on a single frontier large language model across a small
stratified sample. Scoring is performed by a heuristic automated
scorer plus a single LLM-as-judge validation pass on a random sample.
Descriptive statistics only. Approximately 6,000 words of manuscript
plus supporting methodology documentation and raw outputs.
ENGAGEMENT:
- Fixed-price, single reviewer
- 10 working days from kickoff
- Mutual NDA required before any document is shared; 3-year survival
- Deliverable: 6-10 page review document (template provided on signature)
CONFIDENTIALITY OF ENGAGEMENT:
This is an internal quality-bar review. The review document itself
will not be published, and your involvement in this engagement will
remain confidential unless mutually agreed otherwise in writing. The
NDA binds both parties: we will not name you publicly, and you will
not disclose the study, our identity, or the existence of this
engagement to any third party.
YOU MUST BE ABLE TO CREDIBLY DO ALL SEVEN OF THE FOLLOWING:
1. Audit an LLM-as-judge protocol for the standard judge-bias
patterns (position bias, length/verbosity bias, sycophancy,
same-model-family judge dependence) per Zheng et al. 2023 ("Judging
LLM-as-a-Judge with MT-Bench and Chatbot Arena") and the multi-judge
literature.
2. Evaluate prompt-template fairness across experimental conditions —
token-budget parity, instruction-specificity asymmetry, output-format
scaffolding asymmetry, and whether comparator conditions represent
the strongest possible instantiation of each baseline.
3. Audit a heuristic automated scorer (marker counting / regex /
embedding-based) and determine whether "scorer-blind to condition
identity" actually implies "scorer-blind to design intent"; recognize
the measurement-instrument selection bias pattern.
4. Specify the right small-sample paired non-parametric test, bootstrap
CI procedure, and multiple-comparison correction for a benchmark
pilot, and call out when point-estimate reporting is doing more harm
than good.
5. Assess reproducibility under stochastic generation — temperature,
nucleus sampling, seed control, deterministic decoding — and the
downstream effect on reported effect sizes.
6. Distinguish defensible pre-registration (OSF / AsPredicted with
public timestamp and calibrated effect-size thresholds) from
self-asserted pre-registration in a private repository.
7. Deliver a severity-ranked critique with a hostile-but-fair posture:
acknowledge strengths honestly alongside major flaws, separate
reject-worthy issues from revise-worthy issues and from line edits,
and benchmark the submission against NeurIPS Datasets & Benchmarks
Track readiness.
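For orientation only, the kind of analysis capability 4 refers to can be
sketched as follows. This is a minimal illustration, not the study's
analysis plan: the function names and the toy per-item score differences
are hypothetical, the exact sign-flip permutation test is one reasonable
choice of small-sample paired non-parametric test (the Wilcoxon
signed-rank test is another), and the percentile bootstrap and Holm
correction are likewise assumed choices.

```python
import itertools
import random
import statistics


def sign_flip_p(diffs):
    """Exact two-sided paired sign-flip permutation test on the summed difference.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    observed = abs(sum(diffs))
    hits = total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
    return hits / total


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Approximate percentile bootstrap CI for the mean paired difference."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical per-item score differences (condition minus baseline).
    toy = {
        "condition_A": [0.2, 0.3, 0.1, 0.4, 0.25, 0.15, 0.35, 0.3],
        "condition_B": [0.1, -0.2, 0.05, 0.0, -0.1, 0.2, -0.05, 0.1],
    }
    raw = [sign_flip_p(d) for d in toy.values()]
    for name, p_adj in zip(toy, holm(raw)):
        print(name, "Holm-adjusted p =", round(p_adj, 4),
              "95% CI =", bootstrap_ci(toy[name]))
```

With nine conditions the number of pairwise contrasts grows quickly, so
a reviewer would be expected to say which family of comparisons the
correction applies to and to report intervals alongside any point
estimates.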
REQUIRED:
- PhD, advanced graduate standing, or equivalent industry research
background in NLP, ML, computational linguistics, AI evaluation,
or applied AI research
- Demonstrable familiarity with LLM benchmarking literature (HELM,
MMLU, MT-Bench, BIG-bench, AlpacaEval, G-Eval, or equivalent)
- Demonstrable peer-review experience at named venues (ACL, EMNLP,
NAACL, NeurIPS, ICML, ICLR, COLM, or equivalent — please cite)
- Willingness to sign mutual NDA before any study material is shared
WE DO NOT NEED:
- Subject-matter expertise in the application domain (we will brief you)
- Engineering or implementation review (we have that)
- Friendly validation (we have that too)
TO APPLY (generic copy-paste applications auto-rejected):
1. One paragraph naming which of the seven capabilities is your
strongest, with a specific example from prior work (paper, review,
or project) that demonstrates it.
2. PhD-granting institution (or current program), field, year.
3. 1-2 named venues you have reviewed for.
4. Your fixed-price quote for the engagement.
5. Earliest start date.
6. NDA willingness: yes / no.
We will respond to all shortlisted applicants within 5 business days.