

Initial release of BigCode Evaluation Harness

25 May 09:37
04b3493

Release notes

These are the release notes for the initial release (v0.1.0) of the BigCode Evaluation Harness.

Goals

The framework aims to achieve the following goals:

  • Reproducibility: Making it easy to report and reproduce results.
  • Ease-of-use: Providing access to a diverse range of code benchmarks through a unified interface.
  • Efficiency: Leveraging data parallelism on multiple GPUs to generate benchmark solutions quickly.
  • Isolation: Using Docker containers for executing the generated solutions.

Release overview

The framework supports the following features & tasks:

  • Features:

    • Any autoregressive model available on the Hugging Face Hub can be used, but we recommend code generation models trained specifically on code.
    • We provide multi-GPU text generation with accelerate for multi-sample problems, as well as Dockerfiles for executing the generated solutions inside Docker containers for security and reproducibility.
  • Tasks:

    • 4 Python code generation tasks (with unit tests): HumanEval, APPS, MBPP and DS-1000, in both completion (left-to-right) and insertion (FIM) modes; a minimal generate-and-test sketch follows this list.
    • MultiPL-E evaluation suite (HumanEval translated into 18 programming languages).
    • PAL (Program-aided Language Models) evaluation for grade-school math problems: GSM8K and GSM-HARD. These problems are solved by generating reasoning chains that interleave text and code.
    • Code-to-text task from CodeXGLUE (zero-shot & fine-tuning) for 6 languages: Python, Go, Ruby, Java, JavaScript and PHP, as well as the documentation translation task from CodeXGLUE.
    • CoNaLa for Python code generation (2-shot setting and evaluation with BLEU score).
    • Concode for Java code generation (2-shot setting and evaluation with BLEU score).
    • 3 multilingual downstream classification tasks: Java complexity prediction, Java code equivalence prediction, and C code defect prediction.
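
As a rough illustration of what the execution-based tasks above involve, here is a minimal sketch of the generate-and-test loop for a single HumanEval-style problem: sample several candidate completions from a code model on the Hub and count how many pass the unit tests. The model name, prompt, and tests below are illustrative placeholders rather than harness code; the harness itself additionally handles task prompting, stop-word truncation, multi-GPU generation with accelerate, and sandboxed execution in Docker containers.

```python
# Illustrative sketch only (not the harness API): generate candidate solutions
# with a Hub model and check them against unit tests.
import subprocess
import sys

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Salesforce/codegen-350M-mono"  # any autoregressive code model works
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

inputs = tokenizer(PROMPT, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        max_new_tokens=128,
        num_return_sequences=4,  # several candidate solutions per problem
        pad_token_id=tokenizer.eos_token_id,
    )

passed = 0
for seq in outputs:
    completion = tokenizer.decode(seq, skip_special_tokens=True)[len(PROMPT):]
    # Crude stop-word truncation: keep only the first function body.
    for stop in ("\ndef ", "\nclass ", "\nif __name__"):
        completion = completion.split(stop)[0]
    program = PROMPT + completion + "\n" + TESTS
    # Run each candidate in a fresh interpreter; the harness instead executes
    # generated code inside Docker containers for isolation.
    try:
        result = subprocess.run(
            [sys.executable, "-c", program], capture_output=True, timeout=15
        )
        passed += result.returncode == 0
    except subprocess.TimeoutExpired:
        pass

print(f"{passed}/{len(outputs)} candidates passed the unit tests")
```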

More details about each task can be found in docs/README.md.

Main Contributors

Full Changelog: https://github.com/bigcode-project/bigcode-evaluation-harness/commits/v0.1.0