In this project, we create 3 algorithms for analysing source code plagiarism.
- Naive algorthim condenses code logic to the root node, drawing inspiration from merkle root. We compare trees by the root node. (Scalability)
- Snapshot algorithm looks at each tree breadth-wise and depth-wise to analyse code similarity. We use frequency analysis to give inverted weights to each node type for a more nuanced score. Weights are based on fibonacci sequence, which highlights differences (~60%) and reduces noise in the similarity comparison. (Accuracy)
- Progressive algorithm compares two trees, iteratively exploring each tree level. (Efficiency)
Main data structures used are abstract syntax trees, hashmaps, linked lists and arrays. Dataset taken from IEEE Dataport: https://ieee-dataport.org/open-access/programming-homework-dataset-plagiarism-detection#files.
1. Naive implementation
2. Snapshot implementation
3. Progression/Dynamic implementation
"c-to-json" directory -> contains c code from https://github.com/deiuch/c-to-json
=> To convert C language codes to ASTs(Abstract Syntax Tree) in the form of JSON.
=> Segregation of datasets into small, medium and large file sizes.
=> To perform EDA for the datasets.
Import the project and press the play button.
pip install maven
cd codes
mvn compile
mvn exec:java -D"exec.mainClass"="copycat.Application"
cd c2json
sh parse.sh
Specify the c file you wish to convert and the directory it is found at.
Download WSL
- To allow for comparisons of entire repos, not just single-file source code.
- To allow for analysis of larger source code, which would require progressive algorithm to build compare the trees iteratively instead of having the requisite of --having the entire tree built in-memory.
- To allow for multiple comparisons usin indexing, which would build on the naive merkle-root algorithm.
- To highlight plagiarised portions through a UI.
Plagiarism is subjective. There is still need for manual review of each code on a case by case basis.
https://www.youtube.com/watch?v=t68WSdoI_Do&feature=youtu.be
Job Seow, John Luke, Kenneth Kho, Wen Yu, Christopher Lim