
Module 01 · Data Collection

Mining Software Repositories

Learn how to collect, clean, and prepare source-code data from public repositories — the critical first step before any deep learning model can learn.

Reading time: ~20 min · Instructor: Dr. Mastropaolo · Cohort: Spring 2026

What is a Software Repository?

A repository is a central storage location that developers use to make and manage changes to an application’s source code. Version control systems like Git track what was changed, who made the change, when, and why, so teams can collaborate efficiently and retrieve any previous version of their code.

For AI researchers, repositories are goldmines. Every commit, every bug report, every discussion thread is a potential data point for training models that understand and generate code.

Three Types of Repositories

Source Repositories

Store the complete history of source code changes. Examples: GitHub, GitLab, Bitbucket. They contain source code files, commit messages, diffs, branch history, pull request reviews, and configuration files.

Bug Repositories

Track defects, feature requests, and tasks. Examples: Jira, Bugzilla, GitHub Issues. They contain bug reports, labels, priority metadata, discussion threads, status transitions, and links to fixing commits.

Communication Repositories

Capture developer discussions. Examples: mailing lists, Slack, Stack Overflow, IRC logs. They contain Q&A threads, chat logs, meeting notes, design documents, and announcements.

Key Terms

Commit
A snapshot of changes to one or more files, with a message describing what changed and why. Each commit has a unique SHA hash.
Branch
An independent line of development allowing parallel work on features, bug fixes, or experiments.
Merge / PR
Integrating changes from one branch into another, often through a reviewed pull request that adds a code-review layer.
Tag
A named reference to a specific commit, typically marking a release version (e.g., v2.1.0).
Diff
Shows the exact changes between two versions of a file: lines added, removed, or modified. Central to code review and change-aware AI models.
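
To make the Diff entry concrete, here is a minimal sketch that uses Python's standard difflib module to build a unified diff between two versions of a file and count the added and removed lines. The file name and the two code versions are made up for illustration; real mining pipelines usually read diffs straight from Git.

```python
import difflib

def diff_stats(old_text, new_text, name="Example.java"):
    """Return (diff_lines, added, removed) for two versions of one file."""
    diff = list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=f"a/{name}",
        tofile=f"b/{name}",
        lineterm="",
    ))
    # The first two lines are the '---' / '+++' file headers, not changes.
    body = diff[2:]
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    return diff, added, removed

old = "int add(int a, int b) {\n    return a - b;\n}"
new = "int add(int a, int b) {\n    return a + b;\n}"

diff, added, removed = diff_stats(old, new)
print("\n".join(diff))
print(f"added={added} removed={removed}")
```

A modified line shows up as one removal plus one addition, which is why change-aware models typically consume both sides of the diff.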

Rule-Based vs. Data-Driven Approaches

Not all AI systems learn from data. Some encode expert knowledge directly as hand-crafted rules — IF/THEN patterns created by domain experts. These rule-based systems require no training data but are brittle, fail on edge cases, and are hard to scale.

Modern software engineering automation has moved decisively toward data-driven approaches, particularly deep learning. These models learn patterns automatically from large datasets, generalize to unseen examples, and improve with more data. The trade-off: they require vast amounts of high-quality training data.

Why So Much Data?

Deep learning models are fundamentally statistical pattern matchers. They learn by observing millions of examples and discovering regularities — correlations between inputs and outputs that generalize to unseen data. The more complex the task, the more examples the model needs to learn robust patterns rather than memorizing surface-level noise.

Consider an analogy from computer vision: a model trained to classify cat breeds needs millions of labeled photographs to distinguish a Maine Coon from a Norwegian Forest Cat. With only a few hundred images, it might latch onto background color or image resolution instead of actual feline features. The same principle applies to code. A model that learns to summarize Java methods needs millions of method–summary pairs to understand that return a + b; is an addition regardless of whether the variables are named x and y, left and right, or salary and bonus.

What Do We Mine?

Software repositories are uniquely rich because they contain both source code and natural language, tightly interleaved:

  • Source code — method bodies, class definitions, configuration files
  • Comments & documentation — Javadoc, docstrings, inline annotations
  • Commit messages — concise descriptions of what changed and why
  • Issue reports & discussions — bug descriptions, feature requests, design debates
  • Code reviews — pull request comments explaining improvements, catching mistakes

Each of these artifacts provides a different view of developer intent. By mining all of them, we can train models that understand not just what code does, but why it was written that way.

Building a Dataset: Selecting Repositories

Garbage in, garbage out. A model trained on low-quality data will produce low-quality predictions. Data curation is arguably the most important — and most underappreciated — step in the ML pipeline.

Repository selection uses quality proxies to filter the hundreds of millions of public repos down to a manageable, high-quality subset:

  • Minimum stars — popularity as a quality signal
  • Active maintenance — recent commits indicate a living project
  • Non-fork status — avoid counting duplicated repositories
  • Proper licensing — ensure legal use for training
  • Meaningful commit history — enough data to be useful
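
The selection criteria above can be sketched as a single predicate over repository metadata. This is an illustrative sketch, not a recommended policy: the thresholds and the license allow-list are hypothetical, and the dictionary keys assume the shape of a GitHub REST API repository object (`fork`, `stargazers_count`, `license.key`, `pushed_at`). Commit-history depth is not part of that object, so it is not checked here.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds, chosen only for illustration.
MIN_STARS = 100
MAX_INACTIVE_DAYS = 365
ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}

def passes_quality_filters(repo, now=None):
    """Return True if a repository metadata dict passes all quality proxies."""
    now = now or datetime.now(timezone.utc)
    if repo.get("fork", False):                      # non-fork status
        return False
    if repo.get("stargazers_count", 0) < MIN_STARS:  # minimum stars
        return False
    license_info = repo.get("license") or {}         # proper licensing
    if (license_info.get("key") or "").lower() not in ALLOWED_LICENSES:
        return False
    pushed_at = repo.get("pushed_at")                # active maintenance
    if not pushed_at:
        return False
    last_push = datetime.strptime(pushed_at, "%Y-%m-%dT%H:%M:%SZ")
    last_push = last_push.replace(tzinfo=timezone.utc)
    return now - last_push <= timedelta(days=MAX_INACTIVE_DAYS)

example = {
    "fork": False,
    "stargazers_count": 500,
    "license": {"key": "mit"},
    "pushed_at": "2025-10-01T00:00:00Z",
}
print(passes_quality_filters(example, now=datetime(2026, 1, 1, tzinfo=timezone.utc)))
```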

Tools for Mining at Scale

Specialized APIs and search platforms make large-scale dataset construction possible: the GitHub REST and GraphQL APIs (rate-limited: 5,000 requests/hour for authenticated REST calls, with a separate, lower limit on the search endpoint), SEART-GHS (a search engine for GitHub repos with advanced filtering), and client libraries such as PyGithub (Python), Octokit (JS), and go-github (Go).

Here’s how to query GitHub’s API for high-quality Java repositories, filtering by stars and excluding forks:

python
import requests

def fetch_top_java_repos(num_repos=200, per_page=100):
    """Collect up to num_repos popular, non-fork Java repositories."""
    url = "https://api.github.com/search/repositories"
    repos = []
    page = 1
    while len(repos) < num_repos:
        params = {
            "q": "language:java stars:>1000",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,
            "page": page,
        }
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()  # surface rate-limit and other HTTP errors
        items = response.json().get("items", [])
        if not items:
            break  # no more results; without this the loop would never end
        for item in items:
            if item.get("fork", False):
                continue
            repos.append({
                "full_name": item["full_name"],
                "clone_url": item["clone_url"],
                "stars": item["stargazers_count"],
            })
        page += 1
    return repos[:num_repos]

Once we have the repo list, we shallow-clone each one — --depth 1 grabs only the latest snapshot, saving time and disk space:

python
import subprocess

def clone_repo(clone_url, dest_dir):
    # Returns True on success; a clone exceeding 120 s raises subprocess.TimeoutExpired.
    cmd = ["git", "clone", "--depth", "1", "--quiet", clone_url, dest_dir]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    return result.returncode == 0

Data Quality Challenges

Selecting high-quality repositories is necessary but not sufficient. The raw code extracted from even the best projects contains numerous quality issues that can corrupt model training if left unaddressed.

Common Quality Issues

  • Encoding problems — files with mixed encodings (UTF-8, Latin-1, Shift-JIS) produce garbled tokens.
  • Auto-generated code — protobuf stubs, IDE-generated boilerplate, ORM mapping files inflate the dataset with repetitive, formulaic code.
  • Test code vs. production code — unit tests follow very different patterns from production code (setup/teardown, assertions, mocking).
  • Dead code and commented-out blocks — abandoned code paths add noise without contributing meaningful signal.
  • Minified or obfuscated code — JavaScript bundles contain valid syntax but no readable structure.
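
Several of the issues above can be flagged with cheap heuristics before any parsing happens. The sketch below is one possible approach under stated assumptions: the marker strings, the 20-line header window, and the 200-character average line length are hypothetical choices, not established thresholds.

```python
# Hypothetical marker list; extend it for the generators present in your corpus.
GENERATED_MARKERS = (
    "@generated",
    "do not edit",
    "generated by the protocol buffer compiler",
    "auto-generated",
)

def looks_auto_generated(source, header_lines=20):
    """Check the first few lines for common 'generated file' markers."""
    header = "\n".join(source.splitlines()[:header_lines]).lower()
    return any(marker in header for marker in GENERATED_MARKERS)

def looks_minified(source, max_avg_line_len=200):
    """Very long average line length is a cheap signal for minified bundles."""
    lines = [line for line in source.splitlines() if line.strip()]
    if not lines:
        return False
    return sum(len(line) for line in lines) / len(lines) > max_avg_line_len

def decode_or_none(raw_bytes):
    """Return the text if the bytes are valid UTF-8, otherwise None."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return None

print(looks_auto_generated("// Generated by the protocol buffer compiler.  DO NOT EDIT!"))
print(looks_minified("var a=1;" * 100))
print(decode_or_none("café".encode("latin-1")))
```

Files rejected by `decode_or_none` could be dropped or re-decoded with a detected encoding; which is appropriate depends on how much non-UTF-8 code the corpus contains.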

A Filtering Example

Here is a realistic breakdown of what happens when you apply quality filters to a raw Java method dataset:

| Filter step | Removed | Remaining |
| --- | --- | --- |
| Raw extracted methods | | 500,000 |
| Remove auto-generated (protobuf, Lombok, IDE stubs) | −62,000 | 438,000 |
| Remove encoding errors / non-ASCII identifiers | −18,000 | 420,000 |
| Remove methods <3 tokens or >512 tokens | −45,000 | 375,000 |
| Remove trivial getters, setters, constructors | −38,000 | 337,000 |
| Remove exact and near-duplicates | −17,000 | 320,000 |
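
The arithmetic behind the table can be replayed in a few lines, which also gives the overall retention rate. The numbers are copied from the table; only the shortened step labels are new.

```python
raw = 500_000
steps = [
    ("auto-generated", 62_000),
    ("encoding errors / non-ASCII identifiers", 18_000),
    ("length outliers (<3 or >512 tokens)", 45_000),
    ("trivial getters, setters, constructors", 38_000),
    ("exact and near-duplicates", 17_000),
]

remaining = raw
for name, removed in steps:
    remaining -= removed
    share = removed / raw  # share of the raw dataset removed by this step
    print(f"{name:42s} -{removed:>7,d}  {remaining:>8,d}  ({share:.1%} of raw)")

retained = remaining / raw
print(f"retained: {retained:.0%}, discarded: {1 - retained:.0%}")
```

In this example 320,000 of 500,000 methods survive, so roughly a third of the raw data is discarded before training begins.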

Extracting and Filtering Methods

Once repositories are cloned, we need to extract individual methods or functions from the source files. For Java, this means finding .java files, parsing them with tools like javalang or JavaParser, and extracting method bodies along with their signatures.

Raw extracted methods need systematic cleaning before they can be used for training:

  1. Remove duplicates — their presence creates data leakage between training and test sets
  2. Keep only methods made of ASCII characters to avoid encoding issues across different systems
  3. Remove outliers — methods that are incredibly long (1000+ lines) or incredibly short (single-line getters)
  4. Remove boilerplate — trivial code like getters, setters, and auto-generated constructors
  5. Strip comments — remove all inline and block comments from the method body
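
Steps 2 to 5 above can be approximated with regular expressions, as in the following sketch. It is deliberately naive: the comment stripper does not understand string literals, the accessor pattern is a hypothetical heuristic, and the line limits are illustrative defaults. A parser-based implementation would be more reliable.

```python
import re

# Naive patterns: a "//" or "/*" inside a Java string literal is treated as a comment.
_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")

def strip_comments(code):
    """Remove block and line comments, then drop lines left empty."""
    code = _BLOCK_COMMENT.sub("", code)
    code = _LINE_COMMENT.sub("", code)
    return "\n".join(line.rstrip() for line in code.splitlines() if line.strip())

# Heuristic for bodies like "{ return x; }" or "{ this.x = x; }" in get/set/is methods.
_TRIVIAL_ACCESSOR = re.compile(
    r"^\s*(public|protected|private)?\s*[\w<>\[\]]+\s+(get|set|is)[A-Z]\w*\s*\([^)]*\)\s*"
    r"\{\s*(return\s+(this\.)?\w+|(this\.)?\w+\s*=\s*\w+)\s*;\s*\}\s*$"
)

def is_trivial_accessor(code):
    return bool(_TRIVIAL_ACCESSOR.match(strip_comments(code)))

def keep_method(code, min_lines=2, max_lines=1000):
    """Apply the ASCII, length-outlier, and boilerplate checks to one method."""
    body = strip_comments(code)
    n_lines = len(body.splitlines())
    return (
        body.isascii()
        and min_lines <= n_lines <= max_lines
        and not is_trivial_accessor(body)
    )

getter = "public int getX() {\n    return x;\n}"
adder = "public int add(int a, int b) {\n    // sum\n    return a + b;\n}"
print(keep_method(getter), keep_method(adder))
```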

The core extraction logic uses brace-counting to find where each method starts and ends:

python
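def extract_method_source(source_code, method_node, lines):
    """Return one method's source by matching braces from its start line.

    lines is the file split into lines (e.g. source_code.splitlines()).
    Heuristic only: braces inside string literals or comments are counted too,
    and a method without a body would run on into the next one.
    """
    start_line = method_node.position.line - 1  # parser line numbers are 1-based
    brace_count = 0
    started = False  # becomes True at the first opening brace
    end_line = start_line
    for i in range(start_line, len(lines)):
        for char in lines[i]:
            if char == '{':
                brace_count += 1
                started = True
            elif char == '}':
                brace_count -= 1
        # The method ends on the line where the braces balance again.
        if started and brace_count == 0:
            end_line = i
            break
    return '\n'.join(lines[start_line:end_line + 1])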

Tokenization: From Code to Tokens

Tokenization breaks down raw source code into smaller units (tokens) that can be analyzed separately. For code, this means converting raw source into a structured sequence of keywords, identifiers, operators, and literals.

Lexer-Based vs. BPE Tokenization

Lexer-based tokenization is language-aware — it knows Java keywords, operators, and types. It produces whole identifiers as single tokens (e.g., getMaxValue → 1 token). This is great for analysis but creates a fixed, language-specific vocabulary.

BPE (Byte Pair Encoding) is language-agnostic: it learns frequent character sequences from training data and builds a vocabulary of 32K–100K subword tokens. It splits identifiers into common subwords (e.g., getMaxValue → [get, Max, Value]). This handles any vocabulary, including unseen identifiers and mixed languages.

Each extracted method is tokenized into space-separated tokens using javalang’s lexer:

python
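from javalang.tokenizer import tokenize

def tokenize_method(source_code):
    # javalang's lexer yields token objects; .value holds each token's text
    tokens = list(tokenize(source_code))
    token_values = [token.value for token in tokens]
    return ' '.join(token_values)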

Abstract Syntax Trees

Lexer-based tokenization and BPE both produce flat sequences of tokens — they treat code as a linear stream, much like reading a sentence word by word. But code has a deeper, hierarchical structure that flat sequences discard. An Abstract Syntax Tree (AST) captures this structure explicitly, representing code as a tree where each node corresponds to a syntactic construct.

Consider this simple Java method:

java
public int add(int a, int b) {
    return a + b;
}

Its AST looks like this:

ast
MethodDeclaration (name="add", returnType="int")
├── Modifier: public
├── FormalParameter (name="a", type="int")
├── FormalParameter (name="b", type="int")
└── BlockStatement
    └── ReturnStatement
        └── BinaryExpression (operator="+")
            ├── NameExpr: a
            └── NameExpr: b

Structure over Surface

ASTs capture syntactic structure, not surface tokens. Two methods with different variable names but identical logic produce different token sequences but structurally similar ASTs.

Code Understanding

Tasks like clone detection, bug finding, and code classification benefit from structural representations that reveal what code does rather than how it looks.

Tree-Based Neural Models

Architectures like Tree-LSTM and code2seq operate directly on AST nodes, learning to compose meaning bottom-up from leaves to root.

Parsing Tools

Libraries like JavaParser (Java), tree-sitter (multi-language), and Python’s built-in ast module make AST extraction straightforward at scale.
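
The "structure over surface" idea can be demonstrated with Python's built-in ast module. The sketch below parses Python rather than Java so that it runs without third-party dependencies; the three sample functions are made up. It replaces every identifier with a placeholder and compares the resulting tree dumps.

```python
import ast

class _NormalizeNames(ast.NodeTransformer):
    """Replace every identifier with a placeholder so only structure remains."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)  # continue into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_Name(self, node):
        node.id = "_"
        return node

def structure_of(source):
    """Return a string dump of the AST with all names erased."""
    tree = _NormalizeNames().visit(ast.parse(source))
    return ast.dump(tree)

a = "def calculate_sum(x, y):\n    result = x + y\n    return result\n"
b = "def add_numbers(a, b):\n    total = a + b\n    return total\n"
c = "def product(a, b):\n    total = a * b\n    return total\n"

print(structure_of(a) == structure_of(b))  # renamed variables, same structure
print(structure_of(a) == structure_of(c))  # different operator, different structure
```

Collapsing every name to the same placeholder is a blunt normalization: it also makes `x + y` and `x + x` look identical. Clone detectors usually rename identifiers consistently instead of erasing them.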

Deduplication

Duplicate code inflates datasets and causes data leakage — if the same function appears in both training and test sets, the model appears to generalize but is actually recalling memorized examples. Studies show 10–30% of GitHub code is duplicated.

Exact Duplicates: SHA-256 Hashing

Compute a hash for each file or method. Identical hashes mean identical content. Fast, simple, and catches the same file copied across multiple repos.
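
A minimal sketch of hash-based exact deduplication using Python's hashlib follows. Collapsing whitespace before hashing is an assumption added here so that formatting-only differences do not hide duplicates; hashing the raw bytes is the stricter alternative.

```python
import hashlib

def content_hash(code):
    """SHA-256 hex digest of the code after collapsing runs of whitespace."""
    normalized = " ".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def drop_exact_duplicates(snippets):
    """Keep the first occurrence of each distinct snippet, preserving order."""
    seen, unique = set(), []
    for code in snippets:
        digest = content_hash(code)
        if digest not in seen:
            seen.add(digest)
            unique.append(code)
    return unique

snippets = ["int x = 1;", "int  x = 1;\n", "int y = 2;"]
print(drop_exact_duplicates(snippets))
```

Storing digests rather than full snippets keeps the `seen` set small even when the corpus holds millions of files.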

Near-Duplicates: MinHash + LSH

Jaccard similarity on token sets measures overlap. For scalability, use MinHash + Locality-Sensitive Hashing (LSH) to find near-duplicates across millions of files without expensive pairwise comparison.

These two snippets are near-duplicates — same logic, renamed variables:

Version A
public int calculateSum(int x, int y) {
  int result = x + y;
  return result;
}
Version B
public int addNumbers(int a, int b) {
  int sum = a + b;
  return sum;
}

An exact hash check would miss this pair entirely. MinHash + LSH catches them because their token sets overlap significantly.
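
For the two versions above, the token-set overlap can be computed directly and then approximated with a MinHash signature. This sketch makes several assumptions: a crude regex stands in for a real lexer, the hash family is simulated by prefixing a seed to each token before SHA-1, and the LSH banding step that makes lookups sub-quadratic is left out.

```python
import hashlib
import re

def token_set(code):
    # Crude lexer: runs of word characters, or any single non-space symbol.
    return set(re.findall(r"\w+|[^\w\s]", code))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def minhash_signature(tokens, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all tokens."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

version_a = "public int calculateSum(int x, int y) {\n  int result = x + y;\n  return result;\n}"
version_b = "public int addNumbers(int a, int b) {\n  int sum = a + b;\n  return sum;\n}"

A, B = token_set(version_a), token_set(version_b)
exact = jaccard(A, B)
estimate = estimate_jaccard(minhash_signature(A), minhash_signature(B))
print(f"exact Jaccard:    {exact:.3f}")
print(f"MinHash estimate: {estimate:.3f}")
```

With this tokenizer the two versions share 11 of 19 distinct tokens, an exact Jaccard similarity of about 0.58. Whether the pair is flagged therefore depends on the chosen threshold; normalizing identifiers before hashing would raise the similarity.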

python
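def is_clean_method(tokenized_code):
    # More than one access modifier suggests several methods were captured together.
    method_keywords = (tokenized_code.count("public ") +
                       tokenized_code.count("private ") +
                       tokenized_code.count("protected "))
    if method_keywords > 1:
        return False
    # A complete method must end with its closing brace.
    if not tokenized_code.endswith("}"):
        return False
    return True

# Exact-match deduplication on the tokenized text (near-duplicates are not caught here).
# tokenized_methods is a list of dicts with a 'tokenized_code' key from the earlier steps.
seen = set()
unique_methods = []
for m in tokenized_methods:
    if m['tokenized_code'] not in seen:
        seen.add(m['tokenized_code'])
        unique_methods.append(m)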

Splitting the Dataset

How you split data into train, validation, and test sets matters as much as the data itself. The wrong strategy can silently invalidate your results.

Random split
Shuffle all methods and split. Fast but dangerous — methods from the same project can appear in both train and test sets, causing data leakage.
Project-based split
All methods from one project go into the same split. Prevents cross-project leakage since methods in the same project share coding style, API usage, and naming conventions.
Temporal split
Train on older commits, test on newer ones. Simulates real-world deployment where the model must predict code it has never seen from the future.

Typical split ratios are 80/10/10 or 70/15/15 (train / validation / test). Larger test sets give more reliable evaluation estimates.

python
import random

def project_based_split(methods, train_ratio=0.8, val_ratio=0.1):
    # Group methods by project so that no project straddles two splits.
    projects = {}
    for m in methods:
        proj = m["project"]
        projects.setdefault(proj, []).append(m)
    proj_names = list(projects.keys())
    random.shuffle(proj_names)
    total = len(methods)
    train, val, test = [], [], []
    count = 0  # methods assigned so far
    for name in proj_names:
        group = projects[name]
        # Ratios are measured in methods, not projects. Each project goes whole
        # into the split that is still filling, so final sizes are approximate.
        if count < total * train_ratio:
            train.extend(group)
        elif count < total * (train_ratio + val_ratio):
            val.extend(group)
        else:
            test.extend(group)
        count += len(group)
    return train, val, test
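
The temporal split described above can be sketched in the same style. The `commit_date` key is a hypothetical field (an ISO-8601 timestamp of the commit that introduced each method); it is not produced by the extraction code shown earlier and would have to be collected from the full Git history rather than a shallow clone.

```python
from datetime import datetime

def temporal_split(methods, train_ratio=0.8, val_ratio=0.1):
    """Oldest methods go to train, the next slice to validation, the newest to test."""
    ordered = sorted(methods, key=lambda m: datetime.fromisoformat(m["commit_date"]))
    n = len(ordered)
    train_end = int(n * train_ratio)
    val_end = train_end + int(n * val_ratio)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]

# Ten made-up methods, deliberately listed newest first.
methods = [
    {"id": i, "commit_date": f"2024-01-{i + 1:02d}T00:00:00"}
    for i in reversed(range(10))
]
train, val, test = temporal_split(methods)
print(len(train), len(val), len(test))
print(train[-1]["commit_date"], "<", test[0]["commit_date"])
```

Every training example predates every test example, so the evaluation never rewards the model for having seen the future.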

Code as Data: What Makes It Special?

Formal Syntax

Code must compile or parse. A single misplaced semicolon breaks everything. This rigid structure is both a constraint and an advantage for learning.

Executable Semantics

Code has deterministic meaning: we can run it, test it, and verify outputs. This enables automatic labeling and evaluation.

Multi-level Representation

The same code can be viewed as characters, tokens, AST nodes, control-flow graphs, or data-flow graphs. Each level reveals different patterns.

Bimodal Nature

Repositories contain both code and natural language (comments, docs, commit messages). Models can learn the mapping between intent and implementation.

Real-World MSR Datasets

Researchers have curated benchmark datasets from mined repositories. These standardized datasets enable reproducible experiments and fair comparisons across techniques.

| Dataset | Languages | Size | Primary task |
| --- | --- | --- | --- |
| CodeSearchNet | 6 languages | 2M code-NL pairs | Code search & retrieval |
| Defects4J | Java | 835 real bugs | Automated program repair |
| BigCloneBench | Java | 8M clone pairs | Clone detection |
| The Stack | 300+ languages | 6 TB | Pre-training code LLMs |
| Methods2Test | Java | 780K focal-method/test pairs | Test generation |

Ethics, Licensing, and Provenance

Public code is not necessarily free to use for any purpose. Ethical and legal considerations are critical when building MSR datasets.

  • MIT — permissive, almost no restrictions
  • Apache 2.0 — permissive with patent grants
  • GPL — copyleft, derivatives must also be GPL

The Copilot controversy highlighted the tension: GitHub Copilot was trained on public repositories regardless of license, which prompted a class-action lawsuit. Developers argued that their copyleft code had been used without respecting its license terms.

Repositories often contain privacy risks — PII in comments (names, emails), hardcoded API keys, database credentials, internal URLs — all must be scrubbed.
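
As a rough illustration of scrubbing, the sketch below masks email addresses and quoted literals assigned to key-like names. Both patterns are hypothetical heuristics that will miss many real secrets and flag some harmless strings; dedicated secret scanners are more thorough.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical pattern: a name containing api_key/secret/token/password that is
# assigned a quoted literal of at least 8 characters.
SECRET_ASSIGNMENT = re.compile(
    r"(?i)\b(\w*(?:api_?key|secret|token|password)\w*\s*=\s*)([\"'])[^\"']{8,}\2"
)

def scrub(text):
    """Mask emails and secret-looking string literals in source text."""
    text = EMAIL.sub("<EMAIL>", text)
    # Keep the variable name and the quotes, replace only the literal.
    text = SECRET_ASSIGNMENT.sub(r"\1\2<REDACTED>\2", text)
    return text

sample = (
    "// contact: jane.doe@example.com\n"
    'String apiKey = "sk_live_1234567890";\n'
    "int total = 5;"
)
print(scrub(sample))
```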

The Complete Pipeline

From raw repositories to a clean, tokenized dataset ready for model training:

Select repos
Quality filters: stars, activity, license, non-fork.
Clone & extract
git clone --depth 1, parse methods with javalang.
Preprocess
Strip comments, drop trivial getters/setters, ASCII-only.
Tokenize
BPE for LLMs, lexer for structural analysis.
Deduplicate
SHA-256 for exact, MinHash+LSH for near-duplicates.
Split
Project-based 80/10/10, optionally temporal.
Repos queried: 3,847
Raw pairs: 2.1M
After cleaning: 810K
Discard rate: 62%

Try It Yourself

Put your knowledge into practice. Clone 3 Java repositories from GitHub (50+ stars), extract all methods using javalang, compute basic statistics (vocabulary size, average method length, duplicate count), and save the cleaned methods as a JSONL file. The lab notebooks below walk through this end-to-end. You will use this dataset in the next module on Source Code Modeling.
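
For the statistics and the JSONL deliverable, something along these lines may serve as a starting point. It assumes methods are already tokenized into space-separated strings, as in the tokenization step; the record field name `tokenized_code` and the output file name are placeholders.

```python
import json
from collections import Counter

def corpus_stats(tokenized_methods):
    """Basic statistics over a list of space-separated token strings."""
    vocab = Counter()
    total_tokens = 0
    for method in tokenized_methods:
        tokens = method.split()
        vocab.update(tokens)
        total_tokens += len(tokens)
    n = len(tokenized_methods)
    return {
        "num_methods": n,
        "vocabulary_size": len(vocab),
        "avg_tokens_per_method": total_tokens / n if n else 0.0,
        "duplicate_count": n - len(set(tokenized_methods)),
    }

def save_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

methods = ["int a = 1 ;", "int a = 1 ;", "return a ;"]
print(corpus_stats(methods))

# Deduplicate while preserving order, then write the cleaned corpus.
unique = list(dict.fromkeys(methods))
save_jsonl([{"tokenized_code": m} for m in unique], "cleaned_methods.jsonl")
```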

Course materials

Lecture slides, lab handouts, and reference papers from the spring cohort — the canonical sources the article above was built on.

Lab A · Repository mining & dedup

Clone three permissively-licensed Java repos, extract methods, tokenize, and dedup. Deliverable: a clean JSONL corpus plus a short report on what got cut and why.

  • lab-01-msr-handout.pdf (PDF, 241 KB)
  • PyDriller.ipynb
  • Preprocessing_Code.ipynb
  • srcML.ipynb