Cross-Language Code Plagiarism Detection Methods Tested

Why Cross-Language Plagiarism Detection Is a Harder Problem

Every CS professor has seen it: a student turns in a Python solution that looks suspiciously like a Java example from the lecture slides. The variables are renamed, the loops converted from for to while, the comment style changed, yet the structure holds. Traditional plagiarism detectors like MOSS and JPlag operate within a single language, hashing token sequences or normalizing ASTs according to that language’s grammar. Compare across languages and the approach collapses: the same algorithm in Python and Java produces radically different token sets.

Over the past three semesters at a large state university, we ran a controlled study: 100 programming assignments originally written in Java had to be ported to Python (or C++) as part of a “language-agnostic data structures” course. We knew from previous work that ~15% of submissions were direct translations (often from Stack Overflow or GitHub). We needed a detection method that could see through the syntax barrier. This article compares three approaches to cross-language detection (token-based fingerprinting, AST fingerprinting, and control-flow graph fingerprinting) on that dataset.

Test Methodology: 100 Assignments, 3 Languages, 4 Scenarios

We took 100 student Java assignments (binary search trees, graph traversals, and a maze solver) and hand-crafted a “translation” for each into Python (both idiomatic and unidiomatic) and C++ (with manual pointer management). Each translation preserved the core algorithm but changed:

  • Variable names entirely (temp → _tmp)
  • Loop constructs (for(int i=0; i < n; i++) → while i < n / for i in range(n))
  • Iteration direction (reverse order, different increment)
  • Comment style and whitespace

We then injected these translations into a background set of 500 genuine Python assignments (written from scratch) and 500 genuine C++ assignments. The detection task: given a suspect Python submission, find its Java source progenitor. We tested three tools and one custom pipeline:

  1. MOSS (winnowing algorithm, default settings) – run on Java vs. Python directly.
  2. JPlag (token-based; its language-agnostic mode is limited to same-language comparison, so we used its “custom tokenizer” API to feed it normalized tokens).
  3. Dolos (AST-based, supports multiple languages via tree-sitter grammars).
  4. Codequiry’s cross-language fingerprinting – which uses a combination of normalized token sequences and control-flow graph hashing.

Algorithm Comparison: How Each Approach Fails or Succeeds

Tokenization and Winnowing (MOSS)

MOSS converts source code into tokens—for, if, identifier, number—and then selects a subset of k-grams as fingerprints. Within one language, this is remarkably effective. But across languages, the token lists diverge. Python’s for x in list: produces tokens like FOR, ID, IN, ID, COLON. Java’s for (int i=0; i < n; i++) yields FOR, LPAREN, INT, ID, ASSIGN, INT_LIT, SEMI, ID, LT, ID, SEMI, ID, INC, RPAREN. The k-gram fingerprints rarely align. In our test, MOSS returned a true positive rate of 8% across Java→Python, while flagging 22% of unrelated Python submissions as “similar” because common keywords like if and else spuriously matched.

// Java snippet  
public int factorial(int n) {
    if (n == 0) return 1;
    return n * factorial(n - 1);
}

# Python translation  
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)

MOSS sees two different token streams—one with braces, semicolons, type keywords; one with colons and indentation. The core recursion structure is invisible to a token-based hash.
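The mismatch can be made concrete with a small sketch. It hashes k-grams of the two for-loop token streams quoted earlier and keeps the minimum hash per window, which is a simplified form of winnowing. The values k=3 and window=2 are assumptions chosen to fit these very short streams, the function names are illustrative, and none of this reflects MOSS’s actual implementation or settings.

```python
import hashlib

# Token streams copied from the for-loop example in the text.
PYTHON_TOKENS = ["FOR", "ID", "IN", "ID", "COLON"]
JAVA_TOKENS = [
    "FOR", "LPAREN", "INT", "ID", "ASSIGN", "INT_LIT", "SEMI",
    "ID", "LT", "ID", "SEMI", "ID", "INC", "RPAREN",
]


def kgram_hashes(tokens, k):
    """Hash every window of k consecutive tokens."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k])
        digest = hashlib.sha1(gram.encode("utf-8")).hexdigest()
        hashes.append(int(digest[:8], 16))
    return hashes


def winnow(hashes, window):
    """Keep the minimum hash of each sliding window (simplified winnowing)."""
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(tokens_a, tokens_b, k=3, window=2):
    """Jaccard overlap of the winnowed k-gram fingerprints of two streams."""
    fp_a = winnow(kgram_hashes(tokens_a, k), window)
    fp_b = winnow(kgram_hashes(tokens_b, k), window)
    return jaccard(fp_a, fp_b)
```

The two streams share no 3-gram at all, so similarity(PYTHON_TOKENS, JAVA_TOKENS) comes out as 0.0, while comparing either stream with itself gives 1.0. Shorter k-grams would produce some overlap, but only on fragments too generic to be evidence of copying.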

AST Fingerprinting (Dolos / tree-sitter)

Dolos parses source into a language-agnostic AST via tree-sitter grammars. It then normalizes node types across languages: a while loop in Python and a while loop in C++ both become WhileStatement. Variable nodes are replaced with generic Identifier markers. The resulting tree is hashed subtree by subtree. This approach improves cross-language detection significantly. For the factorial example, both trees share a FunctionDeclaration containing an IfStatement with a BinaryExpression and a ReturnStatement, followed by a recursive call (a function calling itself).
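The normalize-then-hash idea can be sketched with Python’s standard ast module. This is a stand-in, not Dolos’s code: it parses only Python, whereas a cross-language pipeline needs a parser per language (tree-sitter, in Dolos’s case). The NEUTRAL_LABELS table and the function names are hypothetical, chosen to mirror the node names used above.

```python
import ast
import hashlib

# Hypothetical mapping from Python ast node names to language-neutral labels.
NEUTRAL_LABELS = {
    "FunctionDef": "FunctionDeclaration",
    "If": "IfStatement",
    "While": "WhileStatement",
    "For": "ForStatement",
    "Return": "ReturnStatement",
    "Compare": "BinaryExpression",
    "BinOp": "BinaryExpression",
    "Call": "CallExpression",
    "Name": "Identifier",
    "Constant": "Literal",
}

ORIGINAL = "def factorial(n):\n    if n == 0:\n        return 1\n    return n * factorial(n - 1)\n"
RENAMED = "def fact(k):\n    if k == 0:\n        return 1\n    return k * fact(k - 1)\n"


def normalize(node):
    """Return a list of (label, children) trees with names and values erased."""
    children = []
    for child in ast.iter_child_nodes(node):
        children.extend(normalize(child))
    label = NEUTRAL_LABELS.get(type(node).__name__)
    if label is None:
        # Unmapped nodes (operators, argument lists, contexts) are dropped
        # and their mapped descendants are spliced into the parent.
        return children
    return [(label, tuple(children))]


def subtree_hashes(tree, out):
    """Hash a subtree bottom-up, adding every subtree hash to `out`."""
    label, children = tree
    child_hashes = [subtree_hashes(child, out) for child in children]
    text = label + "(" + ",".join(child_hashes) + ")"
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
    out.add(digest)
    return digest


def fingerprint(source):
    """Set of subtree hashes for a piece of Python source."""
    out = set()
    for tree in normalize(ast.parse(source)):
        subtree_hashes(tree, out)
    return out
```

Here fingerprint(ORIGINAL) and fingerprint(RENAMED) are equal, because identifiers and literal values never enter the hash. The same property means two independently written solutions with the same shape also collide.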

Dolos achieved a true positive rate of 53% on Java→Python and 61% on Java→C++. False positives remained high (18%) because trivial algorithms (e.g., “sum of array”) produce identical tree shapes even when written independently. Tree normalization alone cannot distinguish a deliberately translated selection sort from two independently written selection sorts.

Semantic Fingerprinting with Control-Flow Graphs (Codequiry)

The most effective approach we tested used normalized token sequences paired with control-flow graph hashing. Codequiry’s method first strips away language-specific keywords, converts all identifiers to placeholders, and maps control-flow paths (condition branches, loop back-edges) into a language-agnostic graph. The graph is then hashed into a set of “structural fingerprints.” A linear scan of a student’s Python code versus a database of Java fingerprints can detect algorithmic isomorphism even when syntax differs.
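To illustrate what hashing a language-agnostic control-flow graph can look like, the sketch below uses Weisfeiler-Lehman style label refinement: each node starts with its kind and repeatedly absorbs the labels of its successors and predecessors. This technique is swapped in for illustration. The article does not specify Codequiry’s hashing scheme, and the graphs, node kinds, and function names here are hand-built assumptions.

```python
import hashlib


def _h(text):
    """Short stable hash of a string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]


def cfg_hash(kinds, edges, rounds=3):
    """Hash a control-flow graph independently of node names and edge order.

    kinds: dict mapping node id -> kind ("entry", "branch", "block", "exit")
    edges: iterable of (source, target) node id pairs
    """
    succ = {node: [] for node in kinds}
    pred = {node: [] for node in kinds}
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)

    labels = {node: _h(kind) for node, kind in kinds.items()}
    for _ in range(rounds):
        # Each new label depends on the old label plus neighbour labels.
        labels = {
            node: _h(
                labels[node]
                + "|out:" + ",".join(sorted(labels[m] for m in succ[node]))
                + "|in:" + ",".join(sorted(labels[m] for m in pred[node]))
            )
            for node in kinds
        }
    return _h(",".join(sorted(labels.values())))


# A counting loop as it might be extracted from Java: test, body, back-edge.
JAVA_LOOP = (
    {"j0": "entry", "j1": "branch", "j2": "block", "j3": "exit"},
    [("j0", "j1"), ("j1", "j2"), ("j2", "j1"), ("j1", "j3")],
)
# The same loop from Python, with different node ids and edge order.
PYTHON_LOOP = (
    {"start": "entry", "test": "branch", "body": "block", "end": "exit"},
    [("start", "test"), ("test", "body"), ("test", "end"), ("body", "test")],
)
# An if/else with no back-edge.
IF_ELSE = (
    {"a": "entry", "b": "branch", "c": "block", "d": "block", "e": "exit"},
    [("a", "b"), ("b", "c"), ("b", "d"), ("c", "e"), ("d", "e")],
)
```

cfg_hash(*JAVA_LOOP) equals cfg_hash(*PYTHON_LOOP) and differs from cfg_hash(*IF_ELSE). Label refinement of this kind never separates isomorphic graphs, but it can fail to tell apart some non-isomorphic ones, so a match is a candidate rather than proof.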

For our dataset, this pipeline achieved a true positive rate of 87% on Java→Python and 92% on Java→C++. False positive rate dropped to 4.3%. The only failures occurred when translations introduced new auxiliary variables or broke atomic operations into multiple steps (e.g., a Java x += 1 turned into x = x + 1). Those cases required a more expensive graph edit-distance comparison that we treated as a fallback.
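One inexpensive way to blunt the x += 1 versus x = x + 1 case might be to canonicalize such forms before fingerprinting. The sketch below does this for Python with the standard ast module (ast.unparse needs Python 3.9 or later). It is an illustration of the idea, not a step the tested pipeline is known to perform, and the class and function names are hypothetical.

```python
import ast
import copy


class ExpandAugAssign(ast.NodeTransformer):
    """Rewrite `x += 1` as `x = x + 1` so both spellings normalize alike."""

    def visit_AugAssign(self, node):
        self.generic_visit(node)
        # The assignment target has Store context; the copy used on the
        # right-hand side must be read with Load context.
        read_target = copy.deepcopy(node.target)
        read_target.ctx = ast.Load()
        new_node = ast.Assign(
            targets=[node.target],
            value=ast.BinOp(left=read_target, op=node.op, right=node.value),
            type_comment=None,
        )
        return ast.copy_location(new_node, node)


def canonical_source(source):
    """Return the source with augmented assignments expanded."""
    tree = ExpandAugAssign().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

With this pass, canonical_source("x += 1") and canonical_source("x = x + 1") both yield "x = x + 1". The expansion is meant only for comparison: for a target such as a[f()] it would evaluate f() twice, so the output should not be executed in place of the original.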

Results Summary Table

| Approach | True Positive (Java→Python) | True Positive (Java→C++) | False Positive Rate |
|---|---|---|---|
| MOSS (token winnowing) | 8% | 12% | 22% |
| JPlag (custom tokenizer) | 31% | 34% | 15% |
| Dolos (AST normalization) | 53% | 61% | 18% |
| Codequiry (CFG + normalized tokens) | 87% | 92% | 4.3% |

Practical Implications for Educators and Code Reviewers

If you suspect a student has translated code from one language to another, a single-language detector will miss it. Pair a cross-language check with a traditional one. Here’s what we recommend:

  • Run all submissions through a same-language plagiarism tool first (MOSS or JPlag for the assignment’s primary language). This catches verbatim copies.
  • Then run a cross-language pass on any submission whose language differs from the course’s primary language (e.g., student submits Python in a Java course). For that, use a tool like Codequiry that supports multi-language fingerprint databases, or build a pipeline using tree-sitter ASTs with careful subtree normalization.
  • Be suspicious of specific patterns: a naming style carried over from the source language (camelCase in Python), identical algorithm structure under different variable names, or identical error-handling paths.
  • Manual verification is still essential. No automated match should, on its own, lead to an honor-code charge. The structural fingerprints we used still produce false positives (4.3% in our test). Always compare the source and target side by side, focusing on algorithmic choices (order of if-else branches, loop bounds, helper function decomposition).

For enterprise code-reuse audits in mixed-stack projects (e.g., a microservice written in Java and a newer one in Go with identical logic), the same pipeline applies. Static analysis tools that only compare within a language miss the biggest intellectual property risks: systematic translation of an entire codebase into another language to evade detection.

Where Cross-Language Detection Breaks Down

Our test also exposed two edge cases. First, trivial algorithms (“swap two numbers”, “find max in array”) produce nearly identical structures across languages, so any detector will produce many false positives on them. We recommend setting a similarity threshold above 0.7 for formal reports, and manually reviewing borderline matches.

Second, heavily refactored translations that flatten recursion into iteration, or vice versa, defeat both AST and CFG approaches. In our dataset, six translations converted a recursive tree traversal into an explicit stack-based loop. None of the tools we tested linked them in its primary comparison. Detecting such cases requires heavyweight semantic analysis (e.g., symbolic execution or program synthesis), which is not yet practical for grading at scale. The one partial exception was Codequiry’s fallback graph-edit-distance module, which flagged two of the six, a promising direction.
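The threshold advice can be turned into a small triage step. In the sketch below only the 0.7 cutoff comes from the recommendation above. The lower “borderline” bound, the minimum-size guard for trivial programs, and all names are assumptions for illustration and would need tuning per course.

```python
from dataclasses import dataclass

REPORT_THRESHOLD = 0.7   # from the recommendation above
REVIEW_THRESHOLD = 0.5   # assumption: lower bound for a "borderline" match
MIN_FINGERPRINTS = 20    # assumption: smaller programs count as trivial


@dataclass
class Match:
    suspect: str            # path of the suspect submission
    source: str             # path of the candidate source
    similarity: float       # 0.0 to 1.0
    fingerprint_count: int  # fingerprints in the smaller of the two programs


def triage(match):
    """Return 'report', 'review', or 'ignore' for one candidate match."""
    if match.similarity > REPORT_THRESHOLD:
        # Tiny programs match by accident, so they never skip manual review.
        if match.fingerprint_count < MIN_FINGERPRINTS:
            return "review"
        return "report"
    if match.similarity >= REVIEW_THRESHOLD:
        return "review"
    return "ignore"
```

For example, triage(Match("bst.py", "Bst.java", 0.9, 50)) returns "report", while the same score on a five-fingerprint “swap two numbers” exercise returns "review". Even a "report" result is only a candidate for a formal report and still needs the side-by-side comparison described earlier.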

Frequently Asked Questions

Can I use MOSS to detect cross-language plagiarism?

Not directly. MOSS is designed for same-language comparisons. You would need to pre-process both source files into a common intermediate representation (e.g., normalized AST dumps) before feeding them as “language X” to MOSS. That workaround is fragile; we recommend a dedicated cross-language tool instead.

Are there open-source options for cross-language detection?

Dolos is open source and supports multiple languages through tree-sitter. It works reasonably well for detection but has a higher false positive rate than commercial offerings. You can improve accuracy by adding a second pass that normalizes control-flow graphs.

How does Codequiry’s approach differ from JPlag’s language-agnostic mode?

JPlag’s language-agnostic mode uses a generic tokenizer that treats any code as sequences of a few token types (keyword, identifier, literal, etc.). It discards too much structure. Codequiry builds a language-aware AST for each language, then extracts a language-neutral fingerprint from the AST’s control-flow skeleton. This preserves algorithmic structure while discarding keyword and bracket differences.

What languages are supported for cross-language checks today?

Currently, practical support exists for Java, Python, C++, JavaScript, and C#. Tools like Dolos and Codequiry are adding new grammars regularly. If you need to check between Lisp and Java, you are likely out of luck—the structural differences are too great for current fingerprinting.

Cross-language plagiarism is a reality in both academic and industrial settings. As students and developers become more polyglot, detection tools must evolve beyond single-language token matching. Our tests show that structural fingerprinting, which combines normalized token sequences with control-flow graph hashes, is the most practical approach today: it reached true positive rates of 87% to 92% with a false positive rate of 4.3%. Recursive-to-iterative translations are still an open research challenge, but for the vast majority of copy-paste translations, the tools are ready.