What is a Software Repository?
A repository is a centralized storage location that developers use to make and manage changes to an application’s source code. Version control systems like Git track what was changed, who made the change, when, and why, enabling teams to collaborate efficiently and retrieve any previous version of their code.
For AI researchers, repositories are goldmines. Every commit, every bug report, every discussion thread is a potential data point for training models that understand and generate code.
Three Types of Repositories
Source Repositories
Bug Repositories
Communication Repositories
Key Terms
- Commit: A snapshot of changes to one or more files, with a message describing what changed and why. Each commit has a unique SHA hash.
- Branch: An independent line of development allowing parallel work on features, bug fixes, or experiments.
- Merge / PR: Integrating changes from one branch into another, often through a reviewed pull request that adds a code-review layer.
- Tag: A named reference to a specific commit, typically marking a release version (e.g., v2.1.0).
- Diff: Shows the exact changes between two versions of a file: lines added, removed, or modified. Central to code review and change-aware AI models.
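As a rough illustration of what a diff contains, the sketch below compares two versions of a small file with Python's standard `difflib` module. The file name `Calc.java` and both versions are made up for illustration, and Git computes diffs with its own machinery, so this only mirrors the unified format.

```python
import difflib

# Two hypothetical versions of the same file, as lists of lines.
old_version = [
    "int add(int a, int b) {",
    "    return a - b;",
    "}",
]
new_version = [
    "int add(int a, int b) {",
    "    return a + b;",
    "}",
]

# unified_diff yields header lines first, then context lines (prefixed with a
# space), removed lines (prefixed with "-") and added lines (prefixed with "+").
diff_lines = list(difflib.unified_diff(
    old_version, new_version,
    fromfile="a/Calc.java", tofile="b/Calc.java", lineterm="",
))

# Separate real changes from the "---" / "+++" file headers.
removed = [l for l in diff_lines if l.startswith("-") and not l.startswith("---")]
added = [l for l in diff_lines if l.startswith("+") and not l.startswith("+++")]

print("\n".join(diff_lines))
```

The `removed` and `added` lists are the kind of paired signal a change-aware model would consume: what the line looked like before and after the fix.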
Rule-Based vs. Data-Driven Approaches
Not all AI systems learn from data. Some encode expert knowledge directly as hand-crafted rules — IF/THEN patterns created by domain experts. These rule-based systems require no training data but are brittle, fail on edge cases, and are hard to scale.
Modern software engineering automation has moved decisively toward data-driven approaches, particularly deep learning. These models learn patterns automatically from large datasets, generalize to unseen examples, and improve with more data. The trade-off: they require vast amounts of high-quality training data.
Why So Much Data?
Deep learning models are fundamentally statistical pattern matchers. They learn by observing millions of examples and discovering regularities — correlations between inputs and outputs that generalize to unseen data. The more complex the task, the more examples the model needs to learn robust patterns rather than memorizing surface-level noise.
Consider an analogy from computer vision: a model trained to classify cat breeds needs millions of labeled photographs to distinguish a Maine Coon from a Norwegian Forest Cat. With only a few hundred images, it might latch onto background color or image resolution instead of actual feline features. The same principle applies to code. A model that learns to summarize Java methods needs millions of method–summary pairs to understand that return a + b; is an addition regardless of whether the variables are named x and y, left and right, or salary and bonus.
What Do We Mine?
Software repositories are uniquely rich because they contain both source code and natural language, tightly interleaved:
- Source code — method bodies, class definitions, configuration files
- Comments & documentation — Javadoc, docstrings, inline annotations
- Commit messages — concise descriptions of what changed and why
- Issue reports & discussions — bug descriptions, feature requests, design debates
- Code reviews — pull request comments explaining improvements, catching mistakes
Each of these artifacts provides a different view of developer intent. By mining all of them, we can train models that understand not just what code does, but why it was written that way.
Building a Dataset: Selecting Repositories
Garbage in, garbage out. A model trained on low-quality data will produce low-quality predictions. Data curation is arguably the most important — and most underappreciated — step in the ML pipeline.
Repository selection uses quality proxies to filter the hundreds of millions of public repos down to a manageable, high-quality subset:
- Minimum stars — popularity as a quality signal
- Active maintenance — recent commits indicate a living project
- Non-fork status — avoid counting duplicated repositories
- Proper licensing — ensure legal use for training
- Meaningful commit history — enough data to be useful
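The proxies above can be sketched as a single predicate over a repository metadata record. This is a minimal sketch, assuming records shaped like GitHub REST API repository objects (`stargazers_count`, `fork`, `pushed_at`, `license.spdx_id`). The thresholds and the license allow-list are hypothetical, and commit-history depth is left out because it needs a separate query.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; tune them for your own dataset.
MIN_STARS = 100
MAX_INACTIVE_DAYS = 365
ALLOWED_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause"}

def passes_quality_filters(repo, now=None):
    """Return True if a repository metadata dict passes every quality proxy."""
    now = now or datetime.now(timezone.utc)
    # Minimum stars: popularity as a quality signal.
    if repo.get("stargazers_count", 0) < MIN_STARS:
        return False
    # Non-fork status: skip duplicated repositories.
    if repo.get("fork", False):
        return False
    # Proper licensing: the license field may be missing or null.
    license_info = repo.get("license") or {}
    if license_info.get("spdx_id") not in ALLOWED_LICENSES:
        return False
    # Active maintenance: the last push must be recent enough.
    pushed_at = repo.get("pushed_at")
    if not pushed_at:
        return False
    # Assumes timestamps formatted like "2024-05-01T12:00:00Z".
    last_push = datetime.strptime(pushed_at, "%Y-%m-%dT%H:%M:%SZ")
    last_push = last_push.replace(tzinfo=timezone.utc)
    return now - last_push <= timedelta(days=MAX_INACTIVE_DAYS)
```

Passing `now` explicitly keeps the filter reproducible: the same snapshot of metadata always yields the same subset.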
Tools for Mining at Scale
Specialized APIs and search platforms make large-scale dataset construction possible: the GitHub REST and GraphQL APIs (rate-limited; the REST API allows 5,000 requests per hour with authentication), SEART-GHS (a search engine for GitHub repos with advanced filtering), and client libraries like PyGithub (Python), Octokit (JS), and go-github (Go).
Here’s how to query GitHub’s API for high-quality Java repositories, filtering by stars and excluding forks:
```python
import requests

def fetch_top_java_repos(num_repos=200, per_page=100):
    """Collect the most-starred non-fork Java repositories from GitHub search."""
    url = "https://api.github.com/search/repositories"
    repos = []
    page = 1
    while len(repos) < num_repos:
        params = {
            "q": "language:java stars:>1000",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,
            "page": page,
        }
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()  # surface rate-limit and other HTTP errors
        items = response.json().get("items", [])
        if not items:
            break  # no more results; avoids looping forever
        for item in items:
            if item.get("fork", False):
                continue
            repos.append({
                "full_name": item["full_name"],
                "clone_url": item["clone_url"],
                "stars": item["stargazers_count"],
            })
        page += 1
    return repos[:num_repos]
```
Once we have the repo list, we shallow-clone each one — --depth 1 grabs only the latest snapshot, saving time and disk space:
```python
import subprocess

def clone_repo(clone_url, dest_dir):
    # Returns True when git exits successfully.
    cmd = ["git", "clone", "--depth", "1", "--quiet", clone_url, dest_dir]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    return result.returncode == 0
```
Data Quality Challenges
Selecting high-quality repositories is necessary but not sufficient. The raw code extracted from even the best projects contains numerous quality issues that can corrupt model training if left unaddressed.
Common Quality Issues
- Encoding problems — files with mixed encodings (UTF-8, Latin-1, Shift-JIS) produce garbled tokens.
- Auto-generated code — protobuf stubs, IDE-generated boilerplate, ORM mapping files inflate the dataset with repetitive, formulaic code.
- Test code vs. production code — unit tests follow very different patterns from production code (setup/teardown, assertions, mocking).
- Dead code and commented-out blocks — abandoned code paths add noise without contributing meaningful signal.
- Minified or obfuscated code — JavaScript bundles contain valid syntax but no readable structure.
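Several of these issues can be caught with cheap file-level heuristics before any parsing happens. The sketch below is one possible set of checks, not a complete detector: the marker strings, the 20-line header window, and the 200-character line-length threshold are all assumptions chosen for illustration.

```python
# Hypothetical marker strings that often appear in headers of generated files.
GENERATED_MARKERS = (
    "generated by the protocol buffer compiler",
    "do not edit",
    "@generated",
    "auto-generated",
)

def looks_generated(source, header_lines=20):
    """Check the first few lines for a 'generated code' marker."""
    header = "\n".join(source.splitlines()[:header_lines]).lower()
    return any(marker in header for marker in GENERATED_MARKERS)

def looks_minified(source, max_avg_line_length=200):
    """Flag files whose non-empty lines are implausibly long on average."""
    lines = [line for line in source.splitlines() if line.strip()]
    if not lines:
        return False
    return sum(len(line) for line in lines) / len(lines) > max_avg_line_length

def decode_or_none(raw_bytes):
    """Decode as UTF-8, returning None instead of producing garbled text."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

Dropping a file that fails `decode_or_none` is the conservative choice; guessing the encoding risks feeding garbled identifiers into the tokenizer.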
A Filtering Example
Here is a realistic breakdown of what happens when you apply quality filters to a raw Java method dataset:
| Filter step | Removed | Remaining |
|---|---|---|
| Raw extracted methods | — | 500,000 |
| Remove auto-generated (protobuf, Lombok, IDE stubs) | −62,000 | 438,000 |
| Remove encoding errors / non-ASCII identifiers | −18,000 | 420,000 |
| Remove methods <3 tokens or >512 tokens | −45,000 | 375,000 |
| Remove trivial getters, setters, constructors | −38,000 | 337,000 |
| Remove exact and near-duplicates | −17,000 | 320,000 |
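The arithmetic in the table can be replayed in a few lines, which also makes the overall retention rate explicit. The step names are shortened from the table rows; the counts are the ones shown above.

```python
raw_total = 500_000

# (filter step, number of methods removed), in the order the filters run.
steps = [
    ("auto-generated", 62_000),
    ("encoding errors / non-ASCII identifiers", 18_000),
    ("too short or too long", 45_000),
    ("trivial getters, setters, constructors", 38_000),
    ("exact and near-duplicates", 17_000),
]

remaining = raw_total
for name, removed in steps:
    remaining -= removed
    share = remaining / raw_total
    print(f"{name}: -{removed:,} -> {remaining:,} ({share:.1%} of raw)")

retention = remaining / raw_total
```

In this example only 64% of the raw methods survive, so more than a third of the extracted data is discarded before training starts.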
Extracting and Filtering Methods
Once repositories are cloned, we need to extract individual methods or functions from the source files. For Java, this means finding .java files, parsing them with tools like javalang or JavaParser, and extracting method bodies along with their signatures.
Raw extracted methods need systematic cleaning before they can be used for training:
- Remove duplicates: duplicates create data leakage between training and test sets
- Keep ASCII-only methods: this avoids encoding issues across different systems
- Remove outliers: methods that are extremely long (1000+ lines) or extremely short (single-line getters)
- Remove boilerplate: trivial code like getters, setters, and auto-generated constructors
- Strip comments: remove all inline and block comments from the method body
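Most of the cleaning steps above can be sketched as one function that returns a cleaned method or `None`. This is a rough sketch under stated assumptions: the comment regex is naive (it would also strip `//` inside string literals), the getter/setter pattern `TRIVIAL_RE` is a hypothetical heuristic, and the 3 to 512 token bounds are borrowed from the filtering table. Deduplication is handled separately.

```python
import re

# Block comments (/* ... */) and line comments (// ...). Naive: it does not
# know about string literals, so "http://..." inside a string is affected too.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

# Hypothetical pattern for one-statement getters and setters such as
# `public int getX() { return x; }` or `void setX(int x) { this.x = x; }`.
TRIVIAL_RE = re.compile(
    r"^[\w<>\[\],\s]*\b(get|set|is)[A-Z]\w*\s*\([^)]*\)\s*\{\s*"
    r"(return\s+(this\.)?\w+|(this\.)?\w+\s*=\s*\w+)\s*;\s*\}$"
)

def strip_comments(source):
    return COMMENT_RE.sub("", source)

def clean_method(source, min_tokens=3, max_tokens=512):
    """Return the comment-free method, or None if it should be dropped."""
    # Keep ASCII-only methods.
    if not source.isascii():
        return None
    body = strip_comments(source).strip()
    # Remove outliers, using a rough word/punctuation token count.
    tokens = re.findall(r"\w+|[^\w\s]", body)
    if not (min_tokens <= len(tokens) <= max_tokens):
        return None
    # Remove boilerplate: match against the whitespace-normalized method.
    if TRIVIAL_RE.match(" ".join(body.split())):
        return None
    return body
```

Returning `None` rather than raising makes it easy to count how many methods each rule removes, which is useful for the kind of report shown in the filtering table.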
The core extraction logic uses brace-counting to find where each method starts and ends:
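```python
def extract_method_source(source_code, method_node, lines):
    # javalang line numbers are 1-based; `lines` is 0-indexed.
    start_line = method_node.position.line - 1
    brace_count = 0
    started = False
    end_line = start_line
    for i in range(start_line, len(lines)):
        # Note: braces inside string literals or comments are counted too.
        for char in lines[i]:
            if char == '{':
                brace_count += 1
                started = True
            elif char == '}':
                brace_count -= 1
                if started and brace_count == 0:
                    end_line = i
                    break
        # The inner break only leaves the character loop, so stop the
        # line loop as well once the method's closing brace is found.
        if started and brace_count == 0:
            break
    return '\n'.join(lines[start_line:end_line + 1])
```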
Tokenization: From Code to Tokens
Tokenization breaks down raw source code into smaller units (tokens) that can be analyzed separately. For code, this means converting raw source into a structured sequence of keywords, identifiers, operators, and literals.
Lexer-Based vs. BPE Tokenization
Lexer-based tokenization is language-aware — it knows Java keywords, operators, and types. It produces whole identifiers as single tokens (e.g., getMaxValue → 1 token). This is great for analysis but creates a fixed, language-specific vocabulary.
BPE (Byte Pair Encoding) is language-agnostic — it learns frequent character sequences from training data and builds a vocabulary of 32K–100K subword tokens. It splits identifiers into common subwords (e.g., getMaxValue → [get, Max, Value]). This handles any vocabulary, including unseen identifiers and mixed languages.
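To make the merge-learning idea concrete, here is a toy character-level BPE written from scratch. It is a teaching sketch, not how production tokenizers are implemented: real ones typically work on bytes, pre-split text, and learn tens of thousands of merges. The function names and the four-identifier corpus are made up for illustration.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere, then repeat.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = ["getValue", "getMax", "setValue", "getName"]
merges = learn_bpe(corpus, num_merges=10)
tokens = apply_bpe("getMaxValue", merges)
print(tokens)
```

Because tokenization always falls back to single characters, an identifier that never appeared in the corpus (here `getMaxValue`) still gets a valid token sequence. That is the property that lets BPE handle unseen identifiers.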
Each extracted method is tokenized into space-separated tokens using javalang’s lexer:
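```python
from javalang.tokenizer import tokenize

def tokenize_method(source_code):
    # javalang's lexer yields token objects; keep only their text values.
    tokens = list(tokenize(source_code))
    token_values = [token.value for token in tokens]
    return ' '.join(token_values)
```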
Abstract Syntax Trees
Lexer-based tokenization and BPE both produce flat sequences of tokens — they treat code as a linear stream, much like reading a sentence word by word. But code has a deeper, hierarchical structure that flat sequences discard. An Abstract Syntax Tree (AST) captures this structure explicitly, representing code as a tree where each node corresponds to a syntactic construct.
Consider this simple Java method:
```java
public int add(int a, int b) {
    return a + b;
}
```
Its AST looks like this:
```
MethodDeclaration (name="add", returnType="int")
├── Modifier: public
├── FormalParameter (name="a", type="int")
├── FormalParameter (name="b", type="int")
└── BlockStatement
    └── ReturnStatement
        └── BinaryExpression (operator="+")
            ├── NameExpr: a
            └── NameExpr: b
```
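The same structure can be inspected programmatically. As a stand-in for a Java parser, the sketch below uses Python's built-in `ast` module on the Python equivalent of the `add` method. Node names differ between languages (Python has `FunctionDef` and `BinOp` where the tree above has `MethodDeclaration` and `BinaryExpression`), but the shape is the same.

```python
import ast

# Python equivalent of the Java `add` method shown above.
source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)

# ast.walk visits every node in the tree; record each node's type name.
node_types = [type(node).__name__ for node in ast.walk(tree)]

# The function definition is the first statement in the module body.
func = tree.body[0]
params = [arg.arg for arg in func.args.args]

print(func.name, params)
print(ast.dump(tree))
```

Renaming `a` and `b` changes only the leaf names, not the node types or their nesting, which is why tree-based representations are less sensitive to surface variation than flat token sequences.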
Structure over Surface
Code Understanding
Tree-Based Neural Models
Parsing Tools
ast module make AST extraction straightforward at scale.
Deduplication
Duplicate code inflates datasets and causes data leakage — if the same function appears in both training and test sets, the model appears to generalize but is actually recalling memorized examples. Studies show 10–30% of GitHub code is duplicated.
Exact Duplicates: SHA-256 Hashing
Compute a hash for each file or method. Identical hashes mean identical content. Fast, simple, and catches the same file copied across multiple repos.
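A minimal sketch of hash-based exact deduplication, using Python's standard `hashlib`. The helper names are illustrative, and normalizing line endings and surrounding whitespace before hashing is an assumption: skip it if you want byte-identical matches only.

```python
import hashlib

def content_hash(text):
    """SHA-256 digest of a snippet, as a 64-character hex string."""
    # Assumption: treat CRLF/LF and leading/trailing whitespace as irrelevant.
    normalized = text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def drop_exact_duplicates(snippets):
    """Keep the first occurrence of each distinct snippet, preserving order."""
    seen = set()
    unique = []
    for snippet in snippets:
        digest = content_hash(snippet)
        if digest not in seen:
            seen.add(digest)
            unique.append(snippet)
    return unique
```

Storing digests instead of full texts keeps the `seen` set small, which matters once the corpus no longer fits comfortably in memory.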
Near-Duplicates: MinHash + LSH
Jaccard similarity on token sets measures overlap. For scalability, use MinHash + Locality-Sensitive Hashing (LSH) to find near-duplicates across millions of files without expensive pairwise comparison.
These two snippets are near-duplicates — same logic, renamed variables:
```java
public int calculateSum(int x, int y) {
    int result = x + y;
    return result;
}
```
```java
public int addNumbers(int a, int b) {
    int sum = a + b;
    return sum;
}
```
An exact hash check would miss this pair entirely. MinHash + LSH catches them because their token sets overlap significantly.
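To put a number on that overlap, the sketch below computes the exact Jaccard similarity of the two snippets' token sets and a MinHash estimate of it. It is a from-scratch illustration, not a library API: the regex tokenizer, the use of seeded SHA-1 as a hash family, and the choice of 128 hash functions are assumptions, and the LSH step (bucketing signatures by bands to avoid pairwise comparison) is omitted.

```python
import hashlib
import re

def token_set(code):
    """Rough tokenizer: words and single punctuation characters, as a set."""
    return set(re.findall(r"\w+|[^\w\s]", code))

def jaccard(a, b):
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def minhash_signature(tokens, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all tokens.

    Assumes `tokens` is non-empty.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.sha1(f"{seed}:{tok}".encode("utf-8")).hexdigest(), 16)
            for tok in tokens
        ))
    return signature

def estimate_similarity(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

snippet_a = "public int calculateSum(int x, int y) { int result = x + y; return result; }"
snippet_b = "public int addNumbers(int a, int b) { int sum = a + b; return sum; }"

set_a, set_b = token_set(snippet_a), token_set(snippet_b)
exact = jaccard(set_a, set_b)
approx = estimate_similarity(minhash_signature(set_a), minhash_signature(set_b))
print(f"exact Jaccard = {exact:.3f}, MinHash estimate = {approx:.3f}")
```

With this tokenizer the two sets share 11 of 19 distinct tokens, so the exact similarity is 11/19 (about 0.58). Whether that counts as a near-duplicate depends on the threshold you pick; normalizing identifiers before comparison would push renamed copies much closer to 1.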
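```python
def is_clean_method(tokenized_code):
    # More than one access modifier usually means several methods were
    # captured together in one fragment.
    method_keywords = (tokenized_code.count("public ") +
                       tokenized_code.count("private ") +
                       tokenized_code.count("protected "))
    if method_keywords > 1:
        return False
    # A complete method must end with its closing brace.
    if not tokenized_code.endswith("}"):
        return False
    return True
```
```python
# Exact deduplication on the tokenized text. `tokenized_methods` is expected
# to be a list of dicts that each carry a 'tokenized_code' key.
seen = set()
unique_methods = []
for m in tokenized_methods:
    if m['tokenized_code'] not in seen:
        seen.add(m['tokenized_code'])
        unique_methods.append(m)
```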
Splitting the Dataset
How you split data into train, validation, and test sets matters as much as the data itself. The wrong strategy can silently invalidate your results.
- Random split: Shuffle all methods and split. Fast but dangerous: methods from the same project can appear in both train and test sets, causing data leakage.
- Project-based split: All methods from one project go into the same split. This prevents leakage across splits, since methods in the same project share coding style, API usage, and naming conventions.
- Temporal split: Train on older commits, test on newer ones. This simulates real-world deployment, where the model must handle code written after its training data was collected.
Typical split ratios are 80/10/10 or 70/15/15 (train / validation / test). Larger test sets give more reliable evaluation estimates.
```python
import random

def project_based_split(methods, train_ratio=0.8, val_ratio=0.1):
    # Group methods by project so that a project never spans two splits.
    projects = {}
    for m in methods:
        proj = m["project"]
        projects.setdefault(proj, []).append(m)
    proj_names = list(projects.keys())
    random.shuffle(proj_names)
    total = len(methods)
    train, val, test = [], [], []
    count = 0  # number of methods assigned so far
    for name in proj_names:
        group = projects[name]
        # Ratios are approximate: a whole project goes to whichever split
        # is still below its cumulative quota.
        if count < total * train_ratio:
            train.extend(group)
        elif count < total * (train_ratio + val_ratio):
            val.extend(group)
        else:
            test.extend(group)
        count += len(group)
    return train, val, test
```
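For comparison, a temporal split can be sketched in a few lines. This assumes each method record carries a `commit_date` field holding an ISO 8601 string; that field name is hypothetical and would have to be populated from the commit history during mining.

```python
def temporal_split(methods, train_ratio=0.8, val_ratio=0.1):
    """Oldest methods go to train, the most recent ones to test."""
    # ISO 8601 strings in a single consistent format sort chronologically.
    ordered = sorted(methods, key=lambda m: m["commit_date"])
    n = len(ordered)
    train_end = int(n * train_ratio)
    val_end = int(n * (train_ratio + val_ratio))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]
```

Unlike the project-based split, this one can still place methods from the same project in different splits, so the two strategies guard against different kinds of leakage.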
Code as Data: What Makes It Special?
Formal Syntax
Executable Semantics
Multi-level Representation
Bimodal Nature
Real-World MSR Datasets
Researchers have curated benchmark datasets from mined repositories. These standardized datasets enable reproducible experiments and fair comparisons across techniques.
| Dataset | Languages | Size | Primary task |
|---|---|---|---|
| CodeSearchNet | 6 languages | 2M code-NL pairs | Code search & retrieval |
| Defects4J | Java | 835 real bugs | Automated program repair |
| BigCloneBench | Java | 8M clone pairs | Clone detection |
| The Stack | 300+ languages | 6 TB | Pre-training code LLMs |
| Methods2Test | Java | 780K focal-method/test pairs | Test generation |
Ethics, Licensing, and Provenance
Public code is not necessarily free to use for any purpose. Ethical and legal considerations are critical when building MSR datasets.
- MIT — permissive, almost no restrictions
- Apache 2.0 — permissive with patent grants
- GPL — copyleft, derivatives must also be GPL
The Copilot controversy highlighted the tension: GitHub Copilot was trained on public repos regardless of license, sparking a class-action lawsuit. Developers argued that their copyleft code was used without respecting license terms.
Repositories often contain privacy risks — PII in comments (names, emails), hardcoded API keys, database credentials, internal URLs — all must be scrubbed.
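A first pass at scrubbing can be done with regular expressions. The sketch below is illustrative only: the email pattern is simplified, the `AKIA` prefix followed by 16 characters reflects the commonly documented shape of AWS access key IDs, and `ASSIGNED_SECRET_RE` is a hypothetical catch-all. Regexes miss many secret formats, so dedicated secret scanners are usually layered on top.

```python
import re

# Simplified email pattern; not a full RFC-compliant matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

# Hypothetical catch-all for quoted values assigned to suspicious names,
# e.g. password = "..." or api_key: '...'.
ASSIGNED_SECRET_RE = re.compile(
    r"(?i)\b(password|passwd|secret|api_key|token)\b(\s*[:=]\s*)([\"'])[^\"']+\3"
)

def scrub(text):
    """Replace emails, AWS-style keys, and assigned secrets with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = AWS_KEY_RE.sub("<AWS_KEY>", text)
    # Keep the variable name, separator, and quotes; replace only the value.
    text = ASSIGNED_SECRET_RE.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{m.group(3)}<SECRET>{m.group(3)}",
        text,
    )
    return text
```

Replacing values with placeholders rather than deleting whole lines keeps the surrounding code syntactically intact, so the scrubbed file can still be parsed and tokenized.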
The Complete Pipeline
From raw repositories to a clean, tokenized dataset ready for model training:
git clone --depth 1, parse methods with javalang.
Try It Yourself
Put your knowledge into practice. Clone 3 Java repositories from GitHub (50+ stars), extract all methods using javalang, compute basic statistics (vocabulary size, average method length, duplicate count), and save the cleaned methods as a JSONL file. The lab notebooks below walk through this end-to-end. You will use this dataset in the next module on Source Code Modeling.
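As a starting point for the statistics and export steps, here is a minimal sketch. It assumes methods are already tokenized into space-separated strings (as in the tokenization step), and the function names and output keys are illustrative rather than a required format.

```python
import json
from collections import Counter

def dataset_stats(tokenized_methods):
    """Compute basic statistics over a list of space-separated token strings."""
    vocab = set()
    total_tokens = 0
    for code in tokenized_methods:
        tokens = code.split()
        vocab.update(tokens)
        total_tokens += len(tokens)
    # Every occurrence beyond the first counts as a duplicate.
    counts = Counter(tokenized_methods)
    duplicates = sum(c - 1 for c in counts.values())
    n = len(tokenized_methods)
    return {
        "num_methods": n,
        "vocabulary_size": len(vocab),
        "avg_method_length": total_tokens / n if n else 0.0,
        "duplicate_count": duplicates,
    }

def save_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

JSONL is convenient here because each method is an independent line: the file can be streamed, split, or sampled without loading the whole corpus.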
Course materials
Lecture slides, lab handouts, and reference papers from the spring cohort — the canonical sources the article above was built on.
Lab A · Repository mining & dedup
Clone three permissively-licensed Java repos, extract methods, tokenize, and dedup. Deliverable: a clean JSONL corpus plus a short report on what got cut and why.