Discussion forum text anonymization

The forum anonymization software consists of several sequential steps, outlined below. The only thing needed to get started is some input data in the correct format. See the step 1 help information (-h option) for specifics. An example, test_posts.csv is included here.

The software includes pre-trained machine learning models to distinguish names from non-name words, but custom models can be trained between steps 2a and 2b. See the annotation guide, anonymization_coding_scheme.txt and the model training code train_model.py if custom models are needed.

A detailed description of the anonymization method is available in:

Bosch, N., Crues, R. W., Shaik, N., & Paquette, L. (2020). "Hello, [REDACTED]": Protecting student privacy in analyses of online discussion forums. In Proceedings of the 13th International Conference on Educational Data Mining (EDM 2020) (pp. 39–49). International Educational Data Mining Society.

Set up the python environment

Specific versions of Python packages should be installed to ensure the machine learning models will work. Packages like Scikit-learn and TensorFlow update frequently, and newer versions of these packages are likely incompatible with the machine learning models.

Assuming you have Anaconda or Miniconda installed, run these commands:

conda create -n forum-redact python==3.7
conda activate forum-redact
pip install tensorflow==1.13.1 scikit-learn==0.22 pandas==0.25.3 numpy==1.17.3 scipy==1.3.0 Keras==2.2.4 tqdm==4.40.0

For future use, after installation, the environment can be activated by running conda activate forum-redact.

Step 1: Extract possible names

This step identifies possible names, consisting of words that are either not in the dictionary or are known names. See the help page for details.

python anonymize_forums_step1.py -h
python anonymize_forums_step1.py test_posts.csv test_posts-step1.csv

Step 2a: Extract features

This step extracts features that characterize the possible names identified in step 1. If using the pre-trained machine learning models, it is necessary to use the --match-context-file option to ensure that the context words match those the model expects, instead of whichever words are most frequent in your specific dataset. If training custom models or evaluating accuracy versus some annotated data on a new dataset, this step can also align annotations (see the -h output).

python anonymize_forums_step2a.py -h
python anonymize_forums_step2a.py test_posts-step1.csv test_posts-step2a.csv --match-context-file context_template.csv

Step 2b: Filter down possible names

This step applies the pre-trained machine learning models to identify what is actually a name or not from among the set of possible names identified in step 1.

python anonymize_forums_step2b.py -h
python anonymize_forums_step2b.py test_posts-step2a.csv test_posts-step2b.csv

Edit test_posts-step2b.csv to remove false positives if needed. The file will be sorted in descending order of word frequency, so egregious false positives will appear toward the top. In the example, url and numbers could probably be removed, though in a typical application only the top few rows with high-frequency, high-impact false positives need to be examined.

Step 3: Redact names

This step removes the names identified from step 2b from the forum posts, along with other potential identifying information like URLs and numbers.

python anonymize_forums_step3.py -h
python anonymize_forums_step3.py test_posts-step2b.csv test_posts.csv test_posts-step3.csv

The output file should now contain the anonymized forum posts.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.gitignore		.gitignore
README.md		README.md
anonymization_coding_scheme.txt		anonymization_coding_scheme.txt
anonymize_forums_step1.py		anonymize_forums_step1.py
anonymize_forums_step2a.py		anonymize_forums_step2a.py
anonymize_forums_step2b.py		anonymize_forums_step2b.py
anonymize_forums_step3.py		anonymize_forums_step3.py
census-derived-all-first.txt		census-derived-all-first.txt
census-dist-2500-last.txt		census-dist-2500-last.txt
context_template.csv		context_template.csv
holdout_model_extratrees.pkl		holdout_model_extratrees.pkl
holdout_model_nn.pkl		holdout_model_nn.pkl
test_posts.csv		test_posts.csv
train_model.py		train_model.py
train_nn_def.py		train_nn_def.py
ubuntu-american-english.txt		ubuntu-american-english.txt
world-cities.csv		world-cities.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Discussion forum text anonymization

Set up the python environment

Step 1: Extract possible names

Step 2a: Extract features

Step 2b: Filter down possible names

Step 3: Redact names

About

Releases

Packages

Languages

pnb/forum-anonymization

Folders and files

Latest commit

History

Repository files navigation

Discussion forum text anonymization

Set up the python environment

Step 1: Extract possible names

Step 2a: Extract features

Step 2b: Filter down possible names

Step 3: Redact names

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages