Project Description
I hope this message finds you well.
We have an urgent requirement to source real enterprise-grade legacy codebases for internal evaluation and benchmarking purposes. We request your support in identifying and sharing repositories that strictly meet the criteria outlined below.
1. Minimum Eligibility (Mandatory)
Repositories must meet all of the following:
At least 100,000 Lines of Code (LOC)
At least 100 Pull Requests (PRs) with meaningful discussions
At least 50 Issues, including several with detailed problem descriptions
At least 200 commits distributed over time (no bulk or single-day commits)
Real, human-written production code (no AI-generated or synthetic projects)
Originating from a real, verifiable company
Legal rights to share or transfer the code must be available
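The numeric thresholds above can be pre-screened automatically before a repository is submitted. The following is a minimal sketch, not a prescribed tool: `RepoStats`, `eligibility_failures`, and the 25% single-day limit are illustrative assumptions, since the criteria do not define exactly what counts as a "bulk" commit history. The non-numeric criteria (human-written code, verifiable company, legal rights) still need manual review.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class RepoStats:
    """Hypothetical container for figures gathered from a candidate repository."""
    loc: int
    pull_requests: int
    issues: int
    commit_dates: list[date]  # one entry per commit


# Thresholds taken from the mandatory criteria.
MIN_LOC = 100_000
MIN_PRS = 100
MIN_ISSUES = 50
MIN_COMMITS = 200
# Assumption: no single day may hold more than this share of all commits.
MAX_SINGLE_DAY_SHARE = 0.25


def eligibility_failures(stats: RepoStats) -> list[str]:
    """Return the list of numeric criteria the repository fails (empty if none)."""
    failures = []
    if stats.loc < MIN_LOC:
        failures.append(f"LOC {stats.loc} is below {MIN_LOC}")
    if stats.pull_requests < MIN_PRS:
        failures.append(f"PR count {stats.pull_requests} is below {MIN_PRS}")
    if stats.issues < MIN_ISSUES:
        failures.append(f"Issue count {stats.issues} is below {MIN_ISSUES}")

    commits = len(stats.commit_dates)
    if commits < MIN_COMMITS:
        failures.append(f"Commit count {commits} is below {MIN_COMMITS}")
    if commits:
        # Flag histories where one day dominates (bulk import or single-day dump).
        busiest_day = max(Counter(stats.commit_dates).values())
        if busiest_day / commits > MAX_SINGLE_DAY_SHARE:
            failures.append("Commits are concentrated on a single day")
    return failures
```

An empty result means the repository passes the numeric screen only; it does not replace validation of PR quality.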
2. Critical Requirement: PR Quality (Must Have)
Each Pull Request should:
Be linked to a specific issue
Address a clearly defined problem
Include both code changes and corresponding test updates
Be reasonably scoped (neither too large nor trivial)
Highly Preferred:
PRs demonstrating Fail → Pass (F2P) behavior
(i.e., tests fail before the fix and pass after implementation)
⚠️ Note: Repositories where PRs contain only code changes without test coverage will not be considered.
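To make the F2P expectation concrete, here is a small sketch of how a PR's test results could be compared. It assumes the PR's test files are run against both the pre-fix and post-fix revisions and that results are available as a mapping from test name to pass/fail; the function name and the test names in the usage example are hypothetical.

```python
def fail_to_pass(before: dict[str, bool], after: dict[str, bool]) -> set[str]:
    """Return the tests that failed before the fix and pass after it.

    Each dict maps a test name to True (passed) or False (failed).
    Tests that were not run before the fix are not counted.
    """
    return {
        name
        for name, passed in after.items()
        if passed and before.get(name) is False
    }


# Usage with made-up test names:
before = {"test_interest_rounding": False, "test_login": True}
after = {"test_interest_rounding": True, "test_login": True}
print(fail_to_pass(before, after))  # {'test_interest_rounding'}
```

A PR with a non-empty result demonstrates F2P behavior for at least one test.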
3. Preferred Technology Stack
C#
Java
Python
PHP
.NET Framework
COBOL
Other legacy enterprise technologies
4. Preferred Industry Domains (High Priority)
Banking / Financial Services
Accounting
Insurance
Healthcare
Legal Technology
Government Systems
Enterprise SaaS (complex workflow-driven platforms)
Note: E-commerce, retail, content platforms, and frontend-heavy applications are not within scope.
5. Technical Readiness (Very Important)
Repositories should:
Build and run successfully
Include a Dockerfile (preferred) or clear setup instructions
Have proper dependency management
Follow a clean and structured project layout
Contain test suites (preferably 50+ test files)
Maintain clear PR-to-issue linkage
Ideally have each PR resolve exactly one issue
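PR-to-issue linkage can be spot-checked from PR descriptions. The sketch below assumes the repository follows GitHub-style closing keywords ("Fixes #42", "Closes #7", "Resolves #9") and references issues as `#<number>`; repositories hosted elsewhere or using a different convention would need a different pattern. The function names are illustrative.

```python
import re

# GitHub-style closing keywords followed by an issue reference such as "#42".
# Assumption: issues are referenced by number within the same repository.
CLOSING_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\b:?\s+#(\d+)",
    re.IGNORECASE,
)


def linked_issues(pr_description: str) -> set[int]:
    """Return the issue numbers that a PR description claims to resolve."""
    return {int(number) for number in CLOSING_REF.findall(pr_description)}


def resolves_single_issue(pr_description: str) -> bool:
    """True when the PR description links to exactly one issue."""
    return len(linked_issues(pr_description)) == 1
```

For example, `linked_issues("Fixes #42 and closes #43")` returns `{42, 43}`, so that PR would not count as resolving a single issue.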
6. Required Metadata (Exact Figures)
For each repository submitted, please provide:
Company name, industry, and country
Primary programming language(s)
Exact Lines of Code (LOC)
Number of files
Number of commits
Number of Pull Requests
Number of Issues
Number of contributors
Repository age (years active)
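One way to keep submissions consistent is to capture the fields above in a fixed record and share it as JSON. This is only a suggested shape: the class name, field names, and JSON format are assumptions, not a required submission format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RepoMetadata:
    """Hypothetical record mirroring the required metadata fields."""
    company_name: str
    industry: str
    country: str
    primary_languages: list[str]
    lines_of_code: int
    file_count: int
    commit_count: int
    pull_request_count: int
    issue_count: int
    contributor_count: int
    years_active: float


def to_json(meta: RepoMetadata) -> str:
    """Serialize one repository's metadata for submission."""
    return json.dumps(asdict(meta), indent=2)
```

Using integer fields for the counts makes it harder to submit rounded or approximate figures such as "100k+".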
7. Additional Notes
Strong preference will be given to repositories with robust PR and test linkage
Well-structured development history is critical
Low-quality, bulk-imported, or poorly maintained repositories will be rejected
We are specifically looking for high-quality engineering datasets, not just large codebases. Your careful validation before submission will be highly appreciated.
Please treat this request as high priority and share suitable options at the earliest opportunity.
If you have any questions or need clarification, feel free to reach out.
Thank you for your support.
Warm regards,