caltechlibrary/trinomial

Trinomial

Trinomial is a simple Python library for performing a one-way transformation from a text string (such as a person's name or email address) to a short hexadecimal character sequence. The result can be used in place of the original string to hide a person's identity in log messages and similar situations.

Introduction

If you want to preserve users' privacy in software applications, you need to avoid storing or printing user identities to the maximum extent possible. One of the situations in which user identities can leak is software logging or debugging messages. Even when stored only in server logs, user identities are at risk of exposure to systems administrators, hackers, or the developers of the software. The challenge is that it's often important for debugging or other analysis to be able to recognize the same user in multiple messages, even if we don't need to know their real identity. Thus, what we need is a way to tell user A from user B, even if we don't care to know who A and B are in the real world.

Trinomial (trivial anonymization library) is a Python package that can help keep users anonymous in such situations. It takes a string (such as an email address, or a name) and transforms it in a consistent way – the same input will always yield the same output – that is also irreversible: given only the output, it is impossible to determine the unique original input that produced it, even knowing Trinomial's source code. You can apply Trinomial to names in error messages in your application, and the names will be transformed to short strings of (essentially) random hexadecimal digits everywhere they appear.

Using Trinomial in code is simply a matter of calling a certain function when you want to print something that may be identifiable. Here is a hypothetical example:

import logging

from trinomial import anon

# do some stuff ...

email = request.forms.get('email')
logging.info(f'got submission from user {anon(email)}')

# do some other stuff ...

logging.info(f'redirecting {anon(email)} to page /flowers')

Please be aware that this kind of approach only offers pseudoanonymity at best. It cannot protect against a number of other methods of breaking anonymity, such as analyzing correlations between information in your logs or reading IP addresses (if your logs also contain IP addresses). Trinomial can help improve anonymity, but it cannot do everything alone. It is not intended for sensitive applications, for meeting legal requirements such as those of the GDPR or HIPAA, for producing public data sets, or for similar situations.

Installation

The instructions below assume you have a Python interpreter installed on your computer; if that's not the case, please first install Python version 3 and familiarize yourself with running Python programs on your system.

On Linux, macOS, and Windows operating systems, you should be able to install trinomial with pip. To install trinomial from the Python package repository (PyPI), run the following command:

python3 -m pip install trinomial

As an alternative to getting it from PyPI, you can use pip to install trinomial directly from GitHub, like this:

python3 -m pip install git+https://github.com/caltechlibrary/trinomial.git

Usage

The main function provided by Trinomial is anon. It takes an input string of characters and returns a transformed, shorter string.

>>> from trinomial import anon
>>> email = '[email protected]'
>>> anon(email)
'bcb403adb7'

The output of anon is a string of hexadecimal digits. The function anon accepts an optional argument to control the length of the output string. The default length is 10 hex digits. (See the section on Known issues and limitations for more information about the implications of this.)

>>> anon(email, length = 5)
'ed598'

Special functions

Trinomial takes measures to increase anonymity beyond what would be obtained by simply hashing text strings. One is that it computes hashes by incorporating a unique key derived from the identity of the computer on which it is running. Thus, a given input to the anon function will produce different results on two different computers. This is deliberate: without access to the machine that produced the output (and thus to its unique key), someone cannot easily mount an offline brute-force preimage attack to guess what input produced a given output of anon. Nevertheless, for some purposes such as software testing, it may be desirable to set the unique key to a known value. This can be done using the function set_unique_key:

>>> import trinomial
>>> trinomial.set_unique_key('my secret unique key here')

Do not do this in production code. Setting the value in your code makes it much easier for someone to try to reverse the process of producing the output. The function set_unique_key is meant for testing and debugging.
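
The effect of a per-machine key can be illustrated with a small standalone sketch. This is not Trinomial's actual implementation: the function names sketch_anon and machine_key are hypothetical, and the choice of keyed BLAKE2b, the use of the network hardware address as the machine identity, and truncation to control the output length are all assumptions made purely for illustration.

```python
import hashlib
import uuid


def machine_key() -> bytes:
    """Hypothetical stand-in for a key derived from the machine's identity."""
    # Assumption: the 48-bit hardware address reported by uuid.getnode()
    # is used as the machine identity. Trinomial may derive its key differently.
    return uuid.getnode().to_bytes(6, 'big')


def sketch_anon(text: str, key: bytes, length: int = 10) -> str:
    """Return `length` hex digits of a keyed hash of `text` (illustrative only)."""
    # hashlib.blake2b supports keyed hashing directly (key of up to 64 bytes).
    digest = hashlib.blake2b(text.encode('utf-8'), key=key).hexdigest()
    # Assumption: shorter outputs are obtained by truncating the hex digest.
    return digest[:length]


# The same input with the same key always gives the same output ...
same_a = sketch_anon('user@example.org', key=b'key for machine A')
same_b = sketch_anon('user@example.org', key=b'key for machine A')

# ... while a different key (a different machine) gives a different output.
other = sketch_anon('user@example.org', key=b'key for machine B')

print(same_a == same_b, same_a == other)
```

Because the key is mixed into the hash, someone who sees only the output cannot test guesses for the input without also knowing the key, which is why fixing the key in source code weakens the scheme.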

Known issues and limitations

Trinomial is intended as a simple package to replace meaningful textual information with meaningless identifiers, such that (a) it is impractically difficult to discover the original text given only such an identifier, and (b) correlations between occurrences of the original text are preserved. However, it is at best a pseudoanonymization tool. It is not intended for sensitive applications, for meeting legal requirements such as those of the GDPR or HIPAA, or for similar situations.

The possibility of output collisions between two or more different input values is low, but not zero. The estimate of collision risk for a hash function is based on the number of bits $b$ in the hashed output value: by the birthday bound, collisions become likely once roughly $2^{b/2}$ distinct inputs have been hashed. A hexadecimal character encodes 4 bits, which means a hexadecimal string of length $n$ holds $n \times 4$ bits. With the Trinomial default length of 10 output characters, there are $2^{40}$ possible output values, and collisions become likely at around $2^{(4 \times 10)/2} = 2^{20}$ = 1,048,576 distinct inputs. In the author's opinion, this is reasonable for a situation such as (e.g.) anonymizing email addresses in the logs of a program at a small educational institution, but may be too low for other situations. Users may want to increase the length parameter to anon accordingly.
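
To get a feel for how the output length affects the chance of a collision, the standard birthday approximation can be evaluated directly. The sketch below is independent of Trinomial; the function name collision_probability is hypothetical, and the calculation assumes that outputs are uniformly distributed over all hexadecimal strings of the given length.

```python
import math


def collision_probability(num_inputs: int, hex_length: int = 10) -> float:
    """Approximate probability that at least two of `num_inputs` distinct
    inputs share the same output of `hex_length` hexadecimal digits."""
    bits = 4 * hex_length          # each hex digit carries 4 bits
    space = 2 ** bits              # number of possible output values
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-num_inputs * (num_inputs - 1) / (2 * space))


# A small institution with 10,000 users and the default length of 10:
print(collision_probability(10_000))

# About 2**20 distinct inputs with the default length (roughly 0.39):
print(collision_probability(2 ** 20))

# The same number of inputs with a longer output of 16 hex digits:
print(collision_probability(2 ** 20, hex_length=16))
```

Under these assumptions, the collision risk for a few thousand inputs is very small at the default length, and each additional hexadecimal digit reduces it by roughly a factor of 16.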

Getting help

If you find an issue, please submit it in the GitHub issue tracker for this repository.

Contributing

We would be happy to receive your help and participation with enhancing Trinomial! Please visit the guidelines for contributing for some tips on getting started.

License

Software produced by the Caltech Library is Copyright (C) 2021, Caltech. This software is freely distributed under a BSD/MIT type license. Please see the LICENSE file for more information.

Authors and history

Trinomial was designed and implemented by Michael Hucka.

Acknowledgments

This work was funded by the California Institute of Technology Library.

The vector artwork used as a starting point for the logo for this repository was created by Rflor for the Noun Project. It is licensed under the Creative Commons Attribution 3.0 Unported license. The vector graphic was modified by Mike Hucka to change the color.