Bring Your Own LIWC #281
Comments
Implementation Specification: "Bring Your Own LIWC"
Per email communications with Jamie Pennebaker and Ryan Boyd, the LIWC team is OK with us shipping the encrypted/compressed version of the old LIWC dictionary as part of the Team Communication Toolkit. However, users with access to more recent versions of the LIWC dictionary may want to "plug in" their own local copy of a more up-to-date dictionary and use it instead of our cached 2007 version. The objective of "Bring Your Own LIWC," or BYO-LIWC, is to make this possible. Broadly, it requires the following steps:
Step 1: Read in the Dictionary
Step 1a: Modify FeatureBuilder to Accept Custom LIWC Dictionary
First, we must modify the FeatureBuilder to accept a parameter called "custom_liwc_dictionary".
Step 1b: Read in .dic File
Once we have a parameter for the path to the dictionary, we need to parse the file's contents.
The Header
The header is the portion of text between two '%' symbols. It looks like this...
This means that the number "1" corresponds to "category 1."
The Body
In the body, the leftmost item is a word or word stem (e.g.,
For example, in this case, word1 belongs to multiple categories (20, 30, 31, 50, and 51). In his email, Ryan provided some possible starting scripts for parsing this format: https://github.com/ryanboyd/ContentCoder-Py/blob/main/ContentCodingDictionary.py#L81
Step 2: Confirm that the Expected Dictionary Matches the Format
In this step, we should add error handling in case:
If this is the case, present a warning and continue building the features without the custom LIWC features.
Step 3: Convert the LIWC Dictionary to a Regular Expression
Next, we need to convert the LIWC dictionary format to a regular expression that we can easily incorporate into the way that we compute features. Currently, an example of how we process lexicons is in: https://github.com/Watts-Lab/team_comm_tools/blob/main/src/team_comm_tools/utils/check_embeddings.py (
Step 4: Generate the Features
Currently, an example of how we generate the LIWC features is as follows (https://github.com/Watts-Lab/team_comm_tools/blob/main/src/team_comm_tools/features/lexical_features_v2.py):
What this could look like is a modification to
Step 5: Clean up, Test, and Document
When developing this feature, we should try it with the 2015 version of the dictionary that Ryan provided, as well as with invalid versions (to check that the error checking functions as expected). We'll also want to update the documentation to explain how users should expect to use this feature, e.g., here: https://conversational-featurizer.readthedocs.io/en/latest/basics.html
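The header/body layout described in Step 1b, together with the format checks from Step 2, could be sketched roughly as follows. This is a simplified illustration and not the toolkit's or Ryan's actual parser: the function name `parse_liwc_dic` is hypothetical, and it assumes every entry is a single whitespace-free token followed by plain numeric category ids, which real dictionaries may not always satisfy.

```python
from typing import Dict, List, Tuple


def parse_liwc_dic(text: str) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Parse .dic-style text into (category id -> name, entry -> category names).

    Hypothetical helper: assumes one token per entry and a header enclosed
    by two lines that each contain only '%'.
    """
    lines = [line.strip() for line in text.splitlines()]
    delimiters = [i for i, line in enumerate(lines) if line == "%"]
    if len(delimiters) < 2:
        raise ValueError("expected a header enclosed by two '%' lines")
    start, end = delimiters[0], delimiters[1]

    # Header: each line maps a category id to a category name.
    categories: Dict[str, str] = {}
    for line in lines[start + 1:end]:
        if not line:
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError(f"malformed header line: {line!r}")
        categories[parts[0]] = parts[1]

    # Body: the leftmost item is the word or stem, the rest are category ids.
    words: Dict[str, List[str]] = {}
    for line in lines[end + 1:]:
        if not line:
            continue
        entry, *ids = line.split()
        unknown = [i for i in ids if i not in categories]
        if not ids or unknown:
            raise ValueError(f"malformed body line: {line!r}")
        words[entry] = [categories[i] for i in ids]
    return categories, words


# Made-up sample in the same shape as the spec's description.
sample = "%\n1\tcategory1\n2\tcategory2\n%\nword1\t1\t2\nstem*\t2\n"
categories, words = parse_liwc_dic(sample)
```

Raising `ValueError` on a malformed file leaves the caller free to turn the failure into a warning and carry on without the custom features, as Step 2 asks.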
Read in .dic File
I borrowed Ryan's code for loading .dic files: https://github.com/Watts-Lab/team_comm_tools/blob/yuxuan/bring_your_own_liwc/src/team_comm_tools/utils/check_embeddings.py#L179 In the feature builder, if the user provides a valid path that ends in .dic, we'll try to load the dictionary. We can add more error checking in the future. https://github.com/Watts-Lab/team_comm_tools/blob/yuxuan/bring_your_own_liwc/src/team_comm_tools/feature_builder.py#L132
Convert to RegEx
I followed the same method in
Generate the features
Per the instructions above, if a custom LIWC dictionary is provided, we'll generate a second set of lexicon features. I can further update this after #306 is solved.
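To make the regex conversion and feature generation discussed above concrete, here is a minimal sketch. It is not the method used in check_embeddings.py: the helper names (`entry_to_pattern`, `build_category_regexes`, `count_matches`) and the category labels are made up for illustration, and it assumes a trailing `*` marks a stem and that entries start with a word character so that `\b` behaves as expected.

```python
import re
from typing import Dict, List


def entry_to_pattern(entry: str) -> str:
    """Turn one dictionary entry into a regex fragment (hypothetical helper)."""
    if entry.endswith("*"):
        # Stem: match the prefix followed by any run of word characters.
        return r"\b" + re.escape(entry[:-1]) + r"\w*"
    # Exact word: require a word boundary on both sides.
    return r"\b" + re.escape(entry) + r"\b"


def build_category_regexes(words: Dict[str, List[str]]) -> Dict[str, re.Pattern]:
    """Group entries by category and compile one alternation per category."""
    by_category: Dict[str, List[str]] = {}
    for entry, cats in words.items():
        for cat in cats:
            by_category.setdefault(cat, []).append(entry_to_pattern(entry))
    return {
        cat: re.compile("|".join(sorted(patterns)), re.IGNORECASE)
        for cat, patterns in by_category.items()
    }


def count_matches(text: str, regexes: Dict[str, re.Pattern]) -> Dict[str, int]:
    """Count non-overlapping matches per category in a single message."""
    return {cat: len(rx.findall(text)) for cat, rx in regexes.items()}


# Illustrative entries and category labels, not taken from any LIWC release.
regexes = build_category_regexes(
    {"happy": ["positive"], "abandon*": ["negative"], "sad": ["negative"]}
)
counts = count_matches("Happy people abandoned the sad, unhappy plan.", regexes)
```

Compiling one pattern per category keeps the per-message work to a single `findall` call for each category, which is one reasonable way to produce a second set of lexicon features alongside the built-in ones.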
Our features include a word corpus called LIWC, whose license says it can be free only if it is distributed for academic purposes. The version of these features currently in our repository is the 2007 version of the lexicon. However, users may wish to "bring" another version of the dictionary and ask us to generate up-to-date features for them.
Proposed Feature - Bring your own LIWC: LIWC is a lexicon, which means it is literally a list of words. We simply ask the user to point us to a local directory where the word list is stored, and we will calculate the metrics for them.
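As a rough illustration of how a user-supplied path might be handled, the sketch below loads the file if it looks valid and otherwise warns and returns `None`, so the rest of the features can still be built. The function name `load_custom_liwc` and the injected `parse` callable are assumptions for this example, not the FeatureBuilder's actual interface.

```python
import os
import warnings
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def load_custom_liwc(path: Optional[str], parse: Callable[[str], T]) -> Optional[T]:
    """Return the parsed dictionary, or None (with a warning) if it cannot be used.

    Hypothetical helper: `parse` is any function that takes the file's text
    and raises ValueError when the contents do not match the expected format.
    """
    if not path:
        # No custom dictionary requested; nothing to warn about.
        return None
    if not path.endswith(".dic") or not os.path.isfile(path):
        warnings.warn(f"Custom LIWC dictionary not found or not a .dic file: {path}")
        return None
    try:
        with open(path, encoding="utf-8") as handle:
            return parse(handle.read())
    except (OSError, UnicodeDecodeError, ValueError) as err:
        warnings.warn(f"Could not read custom LIWC dictionary ({err}); skipping it.")
        return None
```

Returning `None` instead of raising lets the caller treat the custom dictionary as optional and fall back to the bundled 2007 features.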