Bring Your Own LIWC #281

Open
xehu opened this issue Aug 16, 2024 · 2 comments · Fixed by #322
Labels: enhancement (New feature or request); now (Good immediate issues to work on in the next couple days); priority 2 (Important tasks after addressing P1)

Comments

@xehu
Collaborator

xehu commented Aug 16, 2024

Our features include LIWC, a corpus of words whose license allows free distribution only for academic purposes. The version currently in our repository is the 2007 version of the lexicon. However, users may wish to "bring" another version of the dictionary and ask us to generate up-to-date features for them.

Proposed Feature - Bring your own LIWC: LIWC is a lexicon, which means it is literally a list of words. We simply ask the user to point us to a local directory where the word list is stored, and we will calculate the metrics for them.

@xehu xehu added the enhancement (New feature or request) and now (Good immediate issues to work on in the next couple days) labels Aug 16, 2024
@xehu xehu added the priority 2 (Important tasks after addressing P1) label Aug 28, 2024
@xehu
Collaborator Author

xehu commented Sep 10, 2024

Implementation Specification: "Bring Your Own LIWC"

Per email communications with Jamie Pennebaker and Ryan Boyd, the LIWC team is OK with us having the encrypted/compressed version of the old LIWC dictionary as part of the Team Communication Toolkit. However, users with access to more recent versions of the LIWC dictionary may be interested in "plugging in" their own local copy of a more up-to-date dictionary and using that instead of using our cached 2007 version.

The objective of "Bring Your Own LIWC," or BYO-LIWC, is to make this possible. Broadly, it requires four steps:

  • Read in a LIWC dictionary file in the official format (the .dic format shared by Ryan)
  • Confirm that the LIWC dictionary file matches the expected format; if not, throw an error.
  • Convert the LIWC dictionary to a regular expression
  • Using the regular expressions, compute "new" LIWC features exactly the same way as we compute our existing features.

Step 1: Read in the Dictionary

Step 1a: Modify FeatureBuilder to Accept Custom LIWC Dictionary

First, we must modify the FeatureBuilder to accept a parameter called "custom_liwc_dictionary", which holds the path to the user's dictionary file.

Step 1b: Read in .dic File

Once we are able to get a parameter for the path to the dictionary, we need to parse the contents.

The .dic file has two parts: the header, which maps numbers to "category names," and the body, which maps words in the lexicon to different categories.

The Header

The header is a portion of text data between two '%' symbols. It looks like this...

1	category1
2	category2
3	category3
4	category4
...

This means that the number "1" corresponds to "category1," the number "2" to "category2," and so on.

The Body

In the body, the leftmost item is a word or word stem (e.g., run* is a stem that captures "run," "running," etc.). Next to each word or stem is a sequence of numbers, which tells us the category or categories that it belongs to:

word1	20	30	31	50	51
word2*	30	31	50	51
word3	20	30	31	50	51	90

For example, in this case, word1 belongs to multiple categories (20, 30, 31, 50, and 51).

Ryan has provided in his email some possible starting scripts for parsing this format: https://github.com/ryanboyd/ContentCoder-Py/blob/main/ContentCodingDictionary.py#L81
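
As a rough illustration of the two-part layout described above (this is not Ryan's script), here is a minimal parsing sketch. It assumes that fields are tab-separated, as in the examples, and that the header is delimited by lines beginning with '%'. The function name parse_dic_text is hypothetical.

```python
from typing import Dict, List, Tuple


def parse_dic_text(text: str) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Split .dic text into (category number -> name, entry -> category names)."""
    categories: Dict[str, str] = {}
    words: Dict[str, List[str]] = {}
    percent_lines_seen = 0  # 0: before the header, 1: inside the header, 2+: in the body

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("%"):
            percent_lines_seen += 1
            continue
        # Assumption: fields are separated by tabs, as in the examples above
        fields = [field.strip() for field in line.split("\t") if field.strip()]
        if percent_lines_seen == 1:
            # Header row: "<number> <category name>"
            categories[fields[0]] = " ".join(fields[1:])
        elif percent_lines_seen >= 2:
            # Body row: "<word or stem*> <number> <number> ..."
            # Numbers that do not appear in the header are silently dropped
            words[fields[0]] = [categories[num] for num in fields[1:] if num in categories]
    return categories, words


sample = "%\n1\tcategory1\n2\tcategory2\n%\nword1\t1\t2\nword2*\t2\n"
categories, words = parse_dic_text(sample)
print(categories)  # {'1': 'category1', '2': 'category2'}
print(words)       # {'word1': ['category1', 'category2'], 'word2*': ['category2']}
```

Resolving the numbers to category names at parse time means later steps never need the header again. Keeping the numbers instead would work equally well.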

Step 2: Confirm that the Dictionary Matches the Expected Format

In this step, we should add error handling in case:

  • No file exists at the path
  • The path leads to something that does not match our expected .dic format.

In either case, present a warning and continue building the features without the custom LIWC features.
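
One possible shape for this check is sketched below. The function name read_custom_liwc, the "utf-8-sig" encoding, and the exact structural test are all assumptions for illustration. The real file may need a different encoding or stricter checks.

```python
import os
import warnings
from typing import Optional


def read_custom_liwc(path: Optional[str]) -> Optional[str]:
    """Return the raw .dic text, or None (after a warning) if the file is unusable."""
    if path is None:
        return None  # no custom dictionary requested; nothing to warn about
    if not os.path.isfile(path):
        warnings.warn(f"No file exists at '{path}'; skipping custom LIWC features.")
        return None
    try:
        # Assumption: the file is UTF-8 (with or without a byte-order mark)
        with open(path, encoding="utf-8-sig") as dic_file:
            text = dic_file.read()
    except (OSError, UnicodeDecodeError) as err:
        warnings.warn(f"Could not read '{path}' ({err}); skipping custom LIWC features.")
        return None

    lines = [line.strip() for line in text.splitlines() if line.strip()]
    percent_rows = [i for i, line in enumerate(lines) if line.startswith("%")]
    # Expected shape: a '%' row, at least one header row, a second '%' row, then body rows
    looks_valid = (
        len(percent_rows) >= 2
        and percent_rows[0] == 0
        and percent_rows[1] > 1
        and percent_rows[1] < len(lines) - 1
    )
    if not looks_valid:
        warnings.warn(f"'{path}' does not match the expected .dic format; skipping custom LIWC features.")
        return None
    return text
```

Returning None rather than raising lets the caller treat "no dictionary given" and "bad dictionary given" the same way: in both cases, only the existing features are built.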

Step 3: Convert the LIWC dictionary to a regular expression

Next, we need to convert the LIWC dictionary format to a regular expression that we can easily incorporate into the way that we compute features. Currently, an example of how we process lexicons is in: https://github.com/Watts-Lab/team_comm_tools/blob/main/src/team_comm_tools/utils/check_embeddings.py (utils/check_embeddings)

import os
import re

# Read in the lexicons (helper function for generating the pickle file)
def read_in_lexicons(directory, lexicons_dict):
    for filename in os.listdir(directory):
        # skip hidden files (e.g., .DS_Store) before trying to open them
        if filename.startswith("."):
            continue
        lines = []
        with open(os.path.join(directory, filename), encoding = "mac_roman") as lexicons:
            for lexicon in lexicons:
                # get rid of surrounding whitespace (including the newline) and parentheses
                lexicon = lexicon.strip()
                lexicon = lexicon.replace('(', '')
                lexicon = lexicon.replace(')', '')
                if '*' not in lexicon:
                    lines.append(r"\b" + lexicon + r"\b")
                else:
                    # build the final lexicon; dropping every '*' also handles
                    # cases of multiple repeat -- e.g., '**'
                    lines.append(r"\b" + lexicon.replace("*", "") + r"\S*\b")
        # strip only a trailing ".txt" extension from the filename
        clean_name = re.sub(r'\.txt$', '', filename)
        lexicons_dict[clean_name] = "|".join(lines)
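
For the custom dictionary, the same idea could be applied per category. The sketch below assumes the .dic file has already been parsed into a mapping from each entry to its list of category names. The function names are hypothetical. Unlike the helper above, it escapes regex metacharacters with re.escape instead of stripping parentheses, which is a deliberate substitution rather than the existing behavior.

```python
import re
from collections import defaultdict
from typing import Dict, List


def entry_to_pattern(entry: str) -> str:
    """Turn one dictionary entry into a regex fragment; a trailing '*' marks a stem."""
    is_stem = entry.endswith("*")
    core = re.escape(entry.rstrip("*"))
    return rf"\b{core}\S*\b" if is_stem else rf"\b{core}\b"


def build_category_regexes(words: Dict[str, List[str]]) -> Dict[str, str]:
    """Group entries by category and join each group into one alternation regex."""
    patterns = defaultdict(list)
    for entry, category_names in words.items():
        for name in category_names:
            patterns[name].append(entry_to_pattern(entry))
    return {name: "|".join(fragments) for name, fragments in patterns.items()}


regexes = build_category_regexes({"run*": ["motion"], "walk": ["motion"], "happy": ["affect"]})
print(re.findall(regexes["motion"], "she was running and walking, then took a walk"))
# ['running', 'walk']
```

The output has the same shape as lexicons_dict (name mapped to one regex string), so the result could be passed to the existing feature code without further conversion.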

Step 4: Generate the features

Currently, an example of how we generate the LIWC features is as follows (https://github.com/Watts-Lab/team_comm_tools/blob/main/src/team_comm_tools/features/lexical_features_v2.py):

import os
import pickle

import pandas as pd

# NOTE: get_liwc_rate(regex, chat) is assumed to be defined elsewhere; it is not shown in this excerpt.

def liwc_features(chat_df: pd.DataFrame, message_col) -> pd.DataFrame:
	"""
	This function takes in the chat level input dataframe and computes lexical features
	(rates at which the message contains words from a given lexicon, such as LIWC).

	Args:
		chat_df (pd.DataFrame): This is a pandas dataframe of the chat level features. Should contain 'message' column.
		message_col (str): This is a string with the name of the column containing the message / text.

	Returns:
		pd.DataFrame: Dataframe of the lexical features stacked as columns.
	"""
	# Load the preprocessed lexical regular expressions
	try:
		current_dir = os.path.dirname(__file__)
		lexicon_pkl_file_path = os.path.join(current_dir, './assets/lexicons_dict.pkl')
		lexicon_pkl_file_path = os.path.abspath(lexicon_pkl_file_path)
		with open(lexicon_pkl_file_path, "rb") as lexicons_pickle_file:
			lexicons_dict = pickle.load(lexicons_pickle_file)

		# Return the lexical features stacked as columns
		return pd.concat(
			# Finding the # of occurrences of lexicons of each type for all the messages.
			[pd.DataFrame(chat_df[message_col + "_original"].apply(lambda chat: get_liwc_rate(regex, chat)))\
											.rename({message_col + "_original": lexicon_type + "_lexical_per_100"}, axis=1)\
				for lexicon_type, regex in lexicons_dict.items()],
			axis=1
		)
	except Exception:
		# Falls through and returns None when the lexicons cannot be loaded
		print("WARNING: Lexicons not found. Skipping feature...")

The lexicons_dict.pkl file contains lexicons other than just LIWC, and my thought is that comparing the 2007 version to the 2015 version might actually be valuable. So, rather than simply replacing the 2007 version, those who pass in a valid dictionary file would just get a new set of additional features generated.

What this could look like is a modification to liwc_features above, so that it takes the dictionary as an argument; then, we could call it once on the "regular" (2007) dictionary, and once on the "new" (user-provided) dictionary. We could then modify calculate_chat_level_features so that the function will make the second call only if the user provided a valid dictionary; otherwise, we maintain the status quo.
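
To make the "call it twice" idea concrete, here is a dependency-free sketch that works on a plain list of messages rather than a DataFrame. The function names, the "custom_lexical_per_100" suffix, and the rate_per_100_words helper (a stand-in for get_liwc_rate, whose exact definition is not shown here) are all assumptions.

```python
import re
from typing import Dict, List, Optional


def rate_per_100_words(regex: str, text: str) -> float:
    """Hypothetical stand-in for get_liwc_rate: matches per 100 whitespace-separated tokens."""
    n_words = len(text.split())
    if n_words == 0:
        return 0.0
    return 100 * len(re.findall(regex, text, flags=re.IGNORECASE)) / n_words


def lexical_features(messages: List[str], lexicons_dict: Dict[str, str],
                     suffix: str) -> Dict[str, List[float]]:
    """One column per lexicon; the suffix keeps built-in and custom columns apart."""
    return {
        f"{name}_{suffix}": [rate_per_100_words(regex, message) for message in messages]
        for name, regex in lexicons_dict.items()
    }


def all_lexical_features(messages: List[str], default_lexicons: Dict[str, str],
                         custom_lexicons: Optional[Dict[str, str]] = None) -> Dict[str, List[float]]:
    """Always compute the built-in features; add the custom ones only when provided."""
    features = lexical_features(messages, default_lexicons, "lexical_per_100")
    if custom_lexicons:
        features.update(lexical_features(messages, custom_lexicons, "custom_lexical_per_100"))
    return features


messages = ["I am happy", ""]
builtin_only = all_lexical_features(messages, {"posemo": r"\bhappy\b"})
with_custom = all_lexical_features(messages, {"posemo": r"\bhappy\b"}, {"posemo": r"\bhapp\S*\b"})
print(sorted(with_custom))  # ['posemo_custom_lexical_per_100', 'posemo_lexical_per_100']
```

Because the two calls use different column suffixes, a category that exists in both dictionaries yields two separate columns, which is what allows the 2007 and user-provided versions to be compared.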

Step 5: Clean up, Test, and Document

When developing this feature, we should try it with the 2015 version of the dictionary that Ryan provided, as well as invalid versions (to check that error checking functions as expected).

We'll also want to update the documentation to explain how users should expect to use this feature, e.g., here: https://conversational-featurizer.readthedocs.io/en/latest/basics.html

@xehu xehu changed the title from "Required Infrastructure Updates to Make LIWC Compliant" to "Bring Your Own LIWC" Sep 10, 2024
@sundy1994
Collaborator

Read in .dic File

I borrowed Ryan's code for loading .dic files: https://github.com/Watts-Lab/team_comm_tools/blob/yuxuan/bring_your_own_liwc/src/team_comm_tools/utils/check_embeddings.py#L179

In feature builder, if the user provides a valid path that ends in .dic, we'll try to load the dictionary. We can add more error checking in the future. https://github.com/Watts-Lab/team_comm_tools/blob/yuxuan/bring_your_own_liwc/src/team_comm_tools/feature_builder.py#L132

Convert to RegEx

I followed the same method as in read_in_lexicons(). I updated it a bit because lexicon = lexicon.replace('\**', '\*') will throw an invalid syntax error.

Generate the features

Per the instruction above, if the custom LIWC dict is provided, we'll generate a second set of lexicon features. I can further update this after #306 is solved.
