To have solutions with non-essential differences have the same representation, the Representer should apply normalizations. In general, the process of create a normalized representation looks like this:
- Parse the solution's code into an Abstract Syntax Tree (AST)
- Apply normalizations to the AST
- Convert the normalized AST to a string
- Write the normalized AST string to a file named
representation.txt
(see the interface)
Note though, that the representation doesn't have to be an AST, it could be regular (normalized) code, whichever works best for your track.
To help you get started, we'll now present some common normalization strategies.
Note 1: the normalization examples don't build on each other, they only show one specific normalization. An actual Representer would want to apply them in succession.
Note 2: the code for these guidelines will be in C#, but the guidelines are language-agnostic.
To allow for representations to be naming-agnostic, user-defined names (such as variables, functions, etc) can be replaced with placeholders. In this case, a mapping.json
should be produced (see the interface).
It is important to note that all identical names must be replaced with the same placeholder, irrespective of scope.
public static class Fake
{
public static int Test(int input)
{
var test = input + 2;
return test;
}
}
public static class PLACEHOLDER_1
{
public static int PLACEHOLDER_2(int PLACEHOLDER_3)
{
var PLACEHOLDER_4 = PLACEHOLDER_3 + 2;
return PLACEHOLDER_4;
}
}
Inconsistent white space is so frequent, that normalizing it is a common normalization step. Line endings should also be normalized.
using System;
public static class Fake
{
public static DateTime Add (DateTime birthDate)
{
return birthDate.Add( TimeSpan.FromSeconds ( 10 ) ) ;
}
}
public static class Fake
{
public static DateTime Add(DateTime birthDate)
{
return birthDate.Add(TimeSpan.FromSeconds(10));
}
}
In many languages, the user has some freedom in how to define a block (or scope). For example, in most C-like languages, scope is declared between curly braces. Usually, it doesn't matter if you put them on the same line or the next, so one could normalize this.
public static class Fake {
public static int Test() {
if (1 > 2) {
return 0;
}
return 1;
}
}
public static class Fake
{
public static int Test()
{
if (1 > 2)
{
return 0;
}
return 1;
}
}
Not all bits of code are significant to the Representer, which could then be removed. As an example, in most languages comments will be insignificant and can safely be removed.
/*
These are some very nice
comments spanning multiple lines
*/
public static class Fake
{
// Nice method
public static string Test()
{
return "Test"; // This is very nice
}
}
public static class Fake
{
public static string Test()
{
return "Test";
}
}
In some cases the order of code does not matter. To prevent the same code in different order to create different representations, it might be useful to sort it. Usually the items to sort are functions or declarations. It might be tricky to find the metric(s) to sort by and also tricky to implement. One metric could be the how many AST node children the node contains, another the name of the type of the first child node.
This example sorts by some kind of function length:
public static class Fake
{
public static string Test2()
{
int a = "Test2";
return a;
}
public static string Test()
{
return "Test";
}
}
public static class Fake
{
public static string Test()
{
return "Test";
}
public static string Test2()
{
int a = "Test2";
return a;
}
}