How Code Similarity Checks Catch Open Source License Violations

How Open Source License Violations Get Caught

Every year, companies receive demand letters from the Software Freedom Conservancy, the Free Software Foundation, or individual copyright holders. The accusation is usually the same: your proprietary codebase contains GPL-licensed code, and you have neither released your source nor obtained a commercial license. The evidence is often a source code fingerprint match.

These fingerprints aren't found by lawyers reading diffs. They're produced by the same code similarity algorithms that universities use to catch students copying assignments. The difference is that when you're auditing a 10-million-line commercial codebase against millions of open source projects, you need tools that scale, tolerate refactoring, and produce legally admissible evidence.

This article explains how code similarity checks work for open source license compliance, what techniques separate an accidental match from a real violation, and what your audit pipeline should look like if you're serious about avoiding litigation.

The Algorithms Behind the Audit

Token-Based Fingerprinting

The most common approach is the winnowing algorithm (Schleimer, Wilkerson, Aiken, 2003), which underlies MOSS; JPlag uses a related token-based technique, greedy string tiling. The tool tokenizes the source code (stripping comments and whitespace and normalizing identifiers), then hashes overlapping k-grams of tokens. Only a subset of those hashes (the "winnowed" set) is stored, which reduces fingerprint size while preserving detection power.

For license compliance, the auditor hashes the proprietary codebase and compares it against a pre-indexed fingerprint database of known open source projects (GitHub, SourceForge, Google Code archives). Any matching hash suggests a shared code fragment. The tool then reports the file, line numbers, matched project, license, and confidence score.
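The lookup described above can be sketched as an inverted index from fingerprint hash to the projects that contain it. This is only an illustration, not how any particular scanner stores its data: the project names, licenses, and hash values are made up, and `build_index` and `find_matches` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical reference data: project -> (license, fingerprint hashes).
REFERENCE: Dict[str, Tuple[str, Set[int]]] = {
    "libfoo": ("GPL-2.0-only", {101, 102, 103, 104}),
    "libbar": ("MIT", {103, 200, 201}),
}

def build_index(reference: Dict[str, Tuple[str, Set[int]]]) -> Dict[int, Set[str]]:
    """Invert the reference data: fingerprint hash -> projects containing it."""
    index: Dict[int, Set[str]] = defaultdict(set)
    for project, (_license, hashes) in reference.items():
        for h in hashes:
            index[h].add(project)
    return index

def find_matches(file_hashes: Set[int], index, reference) -> List[Tuple[str, str, float]]:
    """Return (project, license, score) tuples, highest score first.

    The score is the fraction of the file's fingerprints that also
    appear in the project, a crude stand-in for a confidence score.
    """
    if not file_hashes:
        return []
    shared: Dict[str, int] = defaultdict(int)
    for h in file_hashes:
        for project in index.get(h, ()):
            shared[project] += 1
    results = [(p, reference[p][0], n / len(file_hashes)) for p, n in shared.items()]
    return sorted(results, key=lambda r: r[2], reverse=True)

index = build_index(REFERENCE)
matches = find_matches({101, 102, 103, 999}, index, REFERENCE)
for project, license_id, score in matches:
    print(f"{project} ({license_id}): {score:.0%} of fingerprints shared")
```

A production index would also keep file paths and line ranges per hash so the report can point at the exact fragment.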

Here's a simplified Python example of the tokenization step:

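import re
from typing import Dict, List

# Illustrative subset of C-like keywords; a real tool would use a full lexer.
KEYWORDS = {"if", "else", "for", "while", "do", "return", "break", "continue",
            "int", "char", "float", "double", "void", "struct", "static", "const"}

# One scanner pass: a comment marker inside a string (or a quote inside a
# comment) is consumed by whichever construct starts first.
TOKEN_RE = re.compile(r'''
    (?P<comment>//[^\n]*|/\*[\s\S]*?\*/)
  | (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<punct>\S)
''', re.VERBOSE)

def tokenize(source: str) -> List[str]:
    tokens: List[str] = []
    id_map: Dict[str, str] = {}
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "comment":
            continue  # comments carry no logic
        if kind == "string":
            tokens.append("STR")  # replace string literals with a placeholder
        elif kind == "ident" and text not in KEYWORDS:
            # Number identifiers by first appearance, so consistent renaming
            # produces the same token stream.
            tokens.append(id_map.setdefault(text, f"ID{len(id_map)}"))
        else:
            tokens.append(text)  # keywords, numbers, punctuation
    return tokens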

This normalization is critical because lawyers don't care about variable naming — they care about copied logic. A GPL violation is still a violation even if you renamed all the functions.
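Once a file is reduced to normalized tokens, the fingerprinting step hashes every k-gram and keeps one hash per sliding window. The following is a minimal sketch of that idea, assuming a truncated SHA-1 as the k-gram hash and illustrative values for `k` and `w`; real tools typically use a rolling hash and tune both parameters.

```python
import hashlib
from typing import List, Set, Tuple

def kgram_hashes(tokens: List[str], k: int) -> List[int]:
    """Hash every run of k consecutive tokens."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = "\x00".join(tokens[i:i + k]).encode("utf-8")
        # Truncated SHA-1 (an assumption) keeps results stable across runs.
        hashes.append(int.from_bytes(hashlib.sha1(gram).digest()[:4], "big"))
    return hashes

def winnow(hashes: List[int], w: int) -> Set[Tuple[int, int]]:
    """Select the minimum hash of each window of w hashes.

    Ties go to the rightmost occurrence. Returns (hash, position) pairs;
    the result is empty if there are fewer than w hashes.
    """
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        offset = w - 1 - window[::-1].index(smallest)  # rightmost minimum
        selected.add((smallest, start + offset))
    return selected

def fingerprint(tokens: List[str], k: int = 5, w: int = 4) -> Set[int]:
    """Winnowed fingerprint of a token stream, positions dropped."""
    return {h for h, _pos in winnow(kgram_hashes(tokens, k), w)}

sample = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(sorted(winnow(sample, 4), key=lambda pair: pair[1]))
```

Because every window contributes its minimum, any run of at least `w + k - 1` shared tokens is guaranteed to leave at least one common hash in both fingerprints, which is what makes the reduced set safe to index.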

AST Comparison for Structural Matches

Token-based fingerprinting can miss code that has been substantially restructured but still preserves the original control flow. Abstract Syntax Tree (AST) comparison addresses this by comparing the structural skeleton of functions. Tools like Codequiry combine both approaches: token fingerprinting for broad matching and AST analysis for deeper structural similarity, especially useful when code has been obfuscated or ported between languages.

For example, if a developer took a GPL-licensed C function, rewrote it in Java, and changed all variable names, a pure token matcher might miss it. But the AST structure — loops, conditionals, assignment patterns — would remain recognizable if the translation was mechanical.
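To make the idea of a structural skeleton concrete, here is a small sketch that uses Python's own `ast` module on Python source. That choice is purely for illustration: comparing C against Java would need a parser per language and a shared node vocabulary, which this example does not attempt. The function names `skeleton` and `structural_similarity` are hypothetical.

```python
import ast
from difflib import SequenceMatcher
from typing import List

def skeleton(source: str) -> List[str]:
    """Sequence of AST node type names, with identifiers and constants dropped."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

def structural_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how much of the two skeletons lines up."""
    return SequenceMatcher(None, skeleton(a), skeleton(b), autojunk=False).ratio()

original = """
def total(values):
    result = 0
    for v in values:
        if v > 0:
            result += v
    return result
"""

renamed = """
def sum_positive(items):
    acc = 0
    for x in items:
        if x > 0:
            acc += x
    return acc
"""

different = """
def greet(name):
    return "hello " + name
"""

print(structural_similarity(original, renamed))
print(structural_similarity(original, different))
```

The renamed copy has an identical skeleton even though no identifier survives, while the unrelated function scores lower. A flat sequence of node types is a coarse proxy; tree edit distance or subtree hashing would be more discriminating.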

Real-World Audit Data: How Often Do Violations Occur?

In 2022, a mid-sized enterprise security vendor ran an automated compliance audit on their own codebase. Using a commercial scanner, they found that 3.7% of their code files matched GPL-licensed projects. Most matches were in third-party libraries they had knowingly used, but 0.4% were in files their developers had hand-coded. That 0.4% represented about 200 files — a significant legal exposure.

Another study by a major university's software engineering lab audited 200 proprietary Android apps and found that 28% contained GPL-licensed code that the apps' developers had not acknowledged. The most common offenders were old versions of FFmpeg (LGPL by default, but GPLv2 or later when built with its optional GPL components) and various GPL-licensed PNG libraries.

These numbers aren't surprising. Developers copy code from Stack Overflow, Git repositories, or personal "snippets" folders without checking licensing. Even when they do check, they often misinterpret permissive vs. copyleft licensing. The result is an invisible legal liability that compounds over every release.

The Audit Pipeline: From Source to Report

An effective open source license compliance audit follows these stages:

1. Source collection – Gather all in-house and third-party source code, including build scripts, documentation, and configuration files. Licensing applies to all forms.
2. Fingerprint generation – Run token-based or AST-based hashing across the entire corpus. Store the fingerprints in a database for comparison.
3. Reference indexing – Obtain a comprehensive index of open source project fingerprints. Services like FOSSA, Black Duck, and even Codequiry's own open source scanning API maintain such indexes.
4. Comparison and thresholding – Compare proprietary fingerprints against the reference index. Set a similarity threshold (e.g., contiguous runs of matching hashes spanning at least 8 lines) to reduce noise.
5. Manual verification – A human reviewer examines flagged matches, checking the actual context: Did the developer modify the code substantially? Is the match a common algorithm? Is there a separate license for that portion?
6. Remediation – Depending on the license, the company removes or rewrites the code, obtains a commercial license, or complies with the license terms, which for the GPL means releasing the source of the derivative work.

The key step is manual verification. Automated similarity detection is excellent at finding candidates, but it cannot determine if a match is de minimis or if a clean-room implementation exists. One company I worked with rejected 80% of automated matches after human review because the matched code was a standard algorithm with multiple implementations (e.g., quicksort, base64 encoding).

To reduce false positives, audits should filter out generic patterns. A winnowing-based scanner may flag any function that resembles memcpy from glibc (LGPL), but a simple byte-copy loop is a standard idiom that many projects write independently, so a match alone does not establish a license violation. Context matters.
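One way to approximate "generic" automatically is to treat any fingerprint that appears in many unrelated reference projects as an idiom and drop it before scoring. This is a heuristic sketch, not a feature of any specific scanner; the `max_projects` cutoff and the sample data are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Set

def common_hashes(reference: Dict[str, Set[int]], max_projects: int) -> Set[int]:
    """Hashes that appear in more than max_projects reference projects."""
    counts = Counter(h for hashes in reference.values() for h in hashes)
    return {h for h, n in counts.items() if n > max_projects}

def filter_generic(file_hashes: Set[int],
                   reference: Dict[str, Set[int]],
                   max_projects: int = 2) -> Set[int]:
    """Remove fingerprints that are too widespread to indicate copying."""
    return file_hashes - common_hashes(reference, max_projects)

# Hypothetical reference corpus: project -> fingerprint hashes.
reference = {
    "proj_a": {1, 2, 3},
    "proj_b": {1, 4},
    "proj_c": {1, 2, 5},
}

print(filter_generic({1, 2, 9}, reference))
```

Hash 1 occurs in all three projects and is discarded; hash 2 occurs in only two and survives. Whatever the cutoff, suppressed fragments should still be checked once by a human to confirm their license is permissive.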

Why Code Similarity Isn't Just for Plagiarism

Universities use tools like MOSS, JPlag, and Codequiry to detect student plagiarism. The underlying mathematics is identical to that used in open source compliance. In both cases, you are answering the same question: Does this code derive from that code?

But the stakes are different. A student caught copying might fail a course; a company caught violating the GPL can face injunctions, legal fees, and, in the worst case, having to release source code it intended to keep proprietary. The BusyBox GPL enforcement lawsuits of the late 2000s ended with multiple companies settling publicly and paying settlements.

More recently, the Software Freedom Conservancy has escalated enforcement against companies using GPL-licensed kernel code in embedded devices without distributing source. The evidence in those cases came from code similarity checks comparing the device firmware to known GPL source releases.

Building Your Compliance Toolchain

If you're setting up an audit process for your organization, here's a practical checklist:

Choose an open source scanning engine. Options include FOSSA (commercial, comprehensive), scancode-toolkit (open source, Python), and the now-sunset Google Open Source Code Search (whose fingerprinting method lives on in many tools). For enterprises, Codequiry's API can also be used to check proprietary code against a vast reference corpus.
Integrate into CI/CD. Run scans on every pull request to catch newly introduced licensed code before it merges. Tools like license-checker for Node.js or golic for Go can provide a first pass.
Maintain an allowed license list. Define which licenses are acceptable (MIT, Apache 2.0, BSD) and which require immediate review (GPLv2, AGPL, SSPL). Block merges that introduce code from disallowed licenses.
Track derived code. Even if you have a commercial license for a particular library, you must ensure that your derivative work doesn't inadvertently incorporate additional GPL code that triggers the copyleft.
Educate developers. Most violations happen because developers simply don't know. A 30-minute training on how to check licenses on Stack Overflow snippets can eliminate the vast majority of accidental matches.
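The allowed-license step of this checklist can be sketched as a small policy check that a CI job runs over the dependency list. The license identifiers below are SPDX identifiers; the dependency names, the policy sets, and the functions `check_licenses` and `should_block` are hypothetical and would need to reflect your own legal guidance.

```python
from typing import Dict, List

# Example policy; adjust to your organization's legal guidance.
ALLOWED = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}
NEEDS_REVIEW = {"GPL-2.0-only", "GPL-2.0-or-later", "AGPL-3.0-only", "SSPL-1.0"}

def check_licenses(dependencies: Dict[str, str]) -> Dict[str, List[str]]:
    """Sort dependencies into allowed, review, and unknown buckets."""
    report: Dict[str, List[str]] = {"allowed": [], "review": [], "unknown": []}
    for name, license_id in sorted(dependencies.items()):
        if license_id in ALLOWED:
            report["allowed"].append(name)
        elif license_id in NEEDS_REVIEW:
            report["review"].append(name)
        else:
            # Unrecognized licenses are treated as blocking until reviewed.
            report["unknown"].append(name)
    return report

def should_block(report: Dict[str, List[str]]) -> bool:
    """Block the merge if anything needs review or is unrecognized."""
    return bool(report["review"] or report["unknown"])

# Hypothetical dependency list, e.g. parsed from a lockfile.
deps = {"fastjson-ish": "MIT", "corelib": "GPL-2.0-only", "mystery": "Custom"}
report = check_licenses(deps)
print(report, "block merge:", should_block(report))
```

Treating unknown licenses as blocking is a deliberately conservative choice: a missing or custom license is not a permissive one.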

Limitations of Code Similarity for Licensing

No technique is perfect. Code similarity tools have three key weaknesses in this domain:

1. False positives from common idioms. Standard algorithms and boilerplate code (e.g., strcmp implementations, print helpers) will match against many open source libraries. You must tune thresholds and maintain a list of "known common" patterns to suppress.

2. Missed matches from heavy modification. If a developer rewrites GPL code from scratch while preserving the same functional design, token and AST tools may not fire. Only a semantic analysis (e.g., comparing control-flow graphs with bounded symbolic execution) could catch that, and such analysis is generally too expensive to run across large codebases.

3. Ambiguous provenance across matches. A file might match a GPL project and also an Apache 2.0 project that independently arrived at the same code. Determining which match is the true origin requires provenance tracking (for example, commit history and dates), which similarity alone cannot provide.

Despite these limitations, automated code similarity scanning remains the most effective first-pass filter for license compliance. It turns a needle-in-haystack search into a manageable set of candidates.

Frequently Asked Questions

How do I choose a similarity threshold for license audits?

Start at 8 contiguous matching tokens (after normalization) and adjust based on manual review of false positives. If you get too many hits from standard library code, raise the threshold to 12 or 16. The goal is to minimize false positives while still catching any copied fragment of more than, say, 20 lines.
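As a rough illustration of what such a threshold measures, the sketch below finds the longest contiguous run of identical tokens between two normalized token streams using the standard library's `difflib`. Pairwise comparison like this does not scale to a large reference corpus (real scanners work on fingerprints), so treat it as a way to experiment with threshold values on individual flagged pairs. The function names and sample tokens are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List

def longest_shared_run(a: List[str], b: List[str]) -> int:
    """Length of the longest contiguous token run present in both streams."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size

def is_candidate(a: List[str], b: List[str], threshold: int = 8) -> bool:
    """Flag the pair for review if the shared run meets the threshold."""
    return longest_shared_run(a, b) >= threshold

# Hypothetical normalized token streams sharing an 11-token function body.
shared = ["ID0", "(", "ID1", ")", "{", "return", "ID1", "*", "2", ";", "}"]
ours = ["int"] + shared
theirs = ["static", "long"] + shared + ["ID2"]

print(longest_shared_run(ours, theirs))
print(is_candidate(ours, theirs, threshold=8))
print(is_candidate(ours, theirs, threshold=12))
```

Raising the threshold from 8 to 12 makes this pair drop out, which is the trade-off the tuning advice describes: fewer idiom-sized hits at the cost of missing short copied fragments.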

Can code similarity detection distinguish GPL from LGPL violations?

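The similarity tool itself cannot interpret licenses; it only reports code matches. You need a secondary step that identifies the license of the matched project and determines whether your usage is permitted or triggers a source-release obligation. For example, the LGPL generally permits linking the library into a proprietary program (dynamic linking is the simplest route to compliance), whereas incorporating GPL code into a distributed binary generally does not. That's a legal, not technical, decision.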

Should I run similarity checks on my own code regularly?

Yes. Many companies run quarterly audits of their entire codebase, plus per-PR checks. Violations are most often discovered late, during code reviews or acquisition due diligence, and by then the liability has already accrued. Preventive scanning is cheap insurance.

How do I handle false positives from common code snippets?

Maintain a whitelist of "standard" fragments (MIT/BSD licensed) that frequently trigger matches. Also use a tool that allows per-pattern suppression. For example, if you use a standard CRC32 implementation that appears in dozens of projects, add it to your pre-approved list after verifying its license is permissive.