Skip to content

Technical Documentation

Daniil Tiganov edited this page Dec 18, 2019 · 8 revisions

Introduction

SWAN has four primary functions:

  1. Translate Swift Intermediate Language (SIL) to an IR for analysis.
  2. Analysis framework (or extension to a framework) for analyzing the translated IR.
  3. Taint analysis built on the analysis framework.
  4. Provide a frontend for the analysis.

This page will go over these four functions in detail separately.

1. SIL translation

Compiler invocation

The Swift compiler is manually invoked with specific arguments (more on how these are determined below) and provided with an 'observer'. The 'observer' handles some callbacks from the compiler, including the produced SIL data which is inside of the returned SILModule object.

Determining arguments for the compiler

For a single file, the arguments are very simple and look like this.

["", "-emit-silgen", "-Onone", <path_to_file>]

and optionally an -sdk path.

For an xcodeproject, the arguments are intercepted from an xcodebuild shim, which basically intercepts the arguments that would normally go to the Swift compiler if compiling that xcodeproject.

Initial instruction data

We use the SIL instruction visitor provided by the compiler to look at the produced instructions one by one. We do not immediately translate these instructions in the C++ code, but instead package any data we need about that instruction into CAstNode jobjects using the Java Native Interface (JNI) and some C++ code from WALA. CAstNodes are meant to be used for building WALA's frontend AST representation, CAst. However, we want to translate on the Java side of SWAN so we need to transfer all information needed to translate the SIL to Java. Therefore, we use nodes for the sake of convenience and arbitrarily use the CAstNode kind PRIMITIVE to store the information.

Note: The format of the nodes is coupled between the C++ side, which creates the nodes in a specific way, and the Java side, which reads the nodes and knows what format they are in. For instance, a node representing a function needs to contain the function name, source position, arguments (name and type), return type, etc.

This initial data representation is referred to as "raw representation" or simply "raw" in the codebase. Most instructions have the same format: 0..* results, and 0..* operands. These instructions are referred to as "simple instructions". This just means that most instructions don't need to be uniquely packaged into nodes, and instead can be handled by a function that packages this instruction format. The Java side also has a way of easily grabbing information from this format, instead of manually indexing the node's children.

Translation from "raw" to SILIR

The next step in the translation process is to unpackage the data about the functions, basic blocks, and instructions and translate them to an IR. SWAN uses its own IR called SILIR (temporary name). Due to SIL's nature and nuances, it is difficult to translate SIL directly to CAst. Having a custom IR also allows SWAN to be used with other frameworks since SILIR is relatively easy to translate.

The translation process from "raw" to SILIR is naunced, but the following are the most important points about the process.

  1. Does away with pointers by replacing a pointer access with a field access. e.g. a = *x --> a = x.value Access path aliasing, referred to as "field aliasing" in the codebase, is also necessary to resolve cases where the address of an object's field is copied.
  2. Inlines (asymmetric) coroutines by decomposing them and replacing the caller/callee flow with intra-procedural control flow. This is a messy process, but it is necessary since WALA does not have a way of dealing with coroutines.
  3. Resolves calls to builtin functions by either looking up a summary, which is hardcoded in SILIR, or generating an empty function to resolve the call to if a summary is not available. This generated function just allocates a new object based on the return type and returns it.

Other nuances include operator handling and handling the way SIL allocates and reads/writes to arrays.

SILIR instructions themselves should be easy to understand just by looking at the name and fields of each instruction class. Note: Field instructions can have a literal variable as the field name instead of a static literal. WALA is able to resolve this with its constant keys.

Translation from SILIR to CAst

This translation is fairly straightforward and can be understood by looking at the translator. The only thing worth mentioning here is that switches need to be converted to "if, else, if, ..." style and array operations are not yet supported.


The CAst is then translated to WALA IR in the usual WALA way. The CAst to WALA IR translator is mostly borrowed from WALA's JavaScript translator.

2. Analysis framework

SWAN uses WALA as the analysis framework. There are multiple ways to do analysis in WALA, but we use the System Dependence Graph (SDG) representation. An SDG just represents all dataflow in the program, including through the heap. We generate a fully context sensitive SDG. A flow sensitive slicer is also available to use on SDGs, but we do not use one. An analysis on an SDG basically becomes a graph reachability problem.

An "analysis engine" serves as the driver for the WALA (analysis) side of SWAN. The engine sets up the entry points, scope, calls the translator, generates the CHA, CG, and SDG. We then feed the resulting SDG to our own analysis.

3. Taint analysis

Our goal is to find paths from sources to sinks in the SDG. If there is a path from a source (a NORMAL_RET_CALLER node with a source name) to a sink (a PARAM_CALLEE node with a sink name), then we have a dangerous path. A simple propagation-based forward analysis is done on the SDG that propagates all sources of the data. SDG nodes are annotated with a container that holds sources , which are represented as WALA statements. A node with at least one source is considered tainted. If we reach a sink node and the data is tainted, we record the source and sink to an external recorder. After the analysis is complete for the entire SDG, we use BFS and the recorded sources and sinks to find all paths. This is done to record the data path with source position information.

The result of the analysis is an ArrayList<ArrayList<Position>>, which represents dangerous paths. A Position is a WALA class representing source position. We then feed this information to our frontend to report back to the user.

Note: Sanitizers don't work very well yet.

4. Frontend

SWAN is ran using a Visual Studio Code extension available here. The user can set up the programs they want to translate in the settings, and they have the option to define their own sources, sinks, and sanitizers.

Dangerous paths are reported to the user in a tree so that they can see the data path through their program. They can click on a path element and the extension will open up the program at that exact point.

Note: These paths are not perfect and sometimes are not in order, due to the nature of a SDG.

The extension is very important as it also calls the compiler shim which intercepts arguments needed to invoke the compiler, and does some vital argument manipulation.

The extension will start a new SWAN JVM if one is not already running. If one is already running, it will use that one. This is useful for debugging and viewing the console output of SWAN. The extension communicates with the JVM using sockets.

Note: Error reporting is not that great yet. If something goes wrong in the JVM, the extension may not report it to the user in the extension UI.