GitHub Releases Open Dataset To Support Multilingual AI Development

GitHub announced the release of the GitHub Multilingual Repositories Dataset, a new open dataset designed to help researchers and developers discover multilingual developer content across public GitHub repositories.

The repository-level metadata dataset is available on GitHub under the CC0-1.0 license and is intended to support developers, researchers, open source maintainers, and model builders working on multilingual AI systems.

GitHub said the dataset was created to make it easier to identify public repositories with evidence of non-English natural-language content across READMEs, issues, and pull requests. The company noted that while much developer collaboration happens in English, many open source communities also collaborate in other languages, making multilingual data increasingly important as AI becomes more central to software development.

The dataset contains more than 80 million classification rows spanning more than 40 million repositories. For each public repository, GitHub provides language classifications of the README, the most-commented issue, and the most-commented pull request, based on the first 150 characters of each text sample. Text samples shorter than 20 characters were excluded.
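
As a rough illustration of the sampling rule described above, the following Python sketch truncates a text sample to its first 150 characters and drops samples shorter than 20 characters. The function name `prepare_sample` and the constants are hypothetical, and the sketch assumes lengths are counted in Python string characters with no whitespace trimming. GitHub has not published its preprocessing in this form, so this is not the company's actual pipeline.

```python
from typing import Optional

# Values taken from the article; the names are illustrative only.
MAX_CHARS = 150  # only the first 150 characters are classified
MIN_CHARS = 20   # samples under 20 characters are excluded


def prepare_sample(text: Optional[str]) -> Optional[str]:
    """Return the snippet that would be classified, or None if excluded.

    Assumption: length is measured in Python string characters and
    the text is not stripped or normalized beforehand.
    """
    if text is None or len(text) < MIN_CHARS:
        return None
    return text[:MAX_CHARS]
```

Because the minimum is smaller than the truncation length, it makes no difference in this sketch whether the length check runs before or after truncation.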

The dataset includes classifications from fastText, gcld3, and lingua-py, each with confidence scores. GitHub said it intentionally did not merge the three classifiers into a single label, allowing researchers and developers to choose their own precision and recall thresholds depending on the use case.
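
One way a user might turn the three unmerged classifications into a working label is sketched below in Python. It is only an illustration: the dictionary layout, the classifier keys, the function names `strict_label` and `lenient_labels`, and the threshold values are all assumptions, not the dataset's actual schema or a method recommended by GitHub.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical layout: classifier name -> (predicted language, confidence).
Predictions = Dict[str, Tuple[str, float]]


def strict_label(preds: Predictions, min_conf: float = 0.9) -> Optional[str]:
    """Precision-oriented rule: every classifier must agree on one
    language, and each confidence must be at least min_conf."""
    langs = {lang for lang, _ in preds.values()}
    if len(langs) != 1:
        return None  # no predictions, or the classifiers disagree
    if min(conf for _, conf in preds.values()) < min_conf:
        return None  # at least one classifier is not confident enough
    return next(iter(langs))


def lenient_labels(preds: Predictions, min_conf: float = 0.5) -> Set[str]:
    """Recall-oriented rule: keep any language that at least one
    classifier predicts with confidence of at least min_conf."""
    return {lang for lang, conf in preds.values() if conf >= min_conf}


# Made-up example row in which the classifiers disagree.
row = {"fasttext": ("pt", 0.97), "gcld3": ("es", 0.60), "lingua": ("pt", 0.99)}
strict = strict_label(row)     # None: no unanimous agreement
lenient = lenient_labels(row)  # contains both "pt" and "es"
```

The strict rule suits building a clean evaluation set, where false positives are costly. The lenient rule suits broad discovery, where missing a repository matters more than including a doubtful one.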

The dataset also includes repository metadata such as creation timestamp, disk usage, stars, forks, primary programming language, SPDX license, issue and pull request counts, and snapshot date.

GitHub said the dataset can help researchers discover repositories likely to contain developer documentation or collaboration in specific languages, study how non-English developer communities use issues and pull requests, and build evaluation sets for AI coding tools, documentation generators, and review assistants that need to perform well across languages.

GitHub also noted that Portuguese is the most common non-English language in README files, appearing in more than 3 million repositories. Korean is the most common non-English language in issue text, though it ranks fifth in READMEs.

The company cautioned that the dataset should not be treated as a ground-truth benchmark for language identification. Instead, GitHub described it as a transparent discovery tool that enables users to inspect classifications, confidence scores, and sources before making their own research or development decisions.

GitHub said the release follows through on a 2025 commitment made as part of Microsoft’s European Digital Commitments to make multilingual data more accessible, including to open source AI developers.