[pytx] Tech Against Terrorism SignalTypeAPI Implementation #1622

Bruce-Pouncey-TAT · 2024-09-10T13:24:29Z

Summary

Here is a draft PR for part 2 of issue_1610.

I'm looking for some support on a few key details of my implementation.

JSON File download
Given that our approach to delivering the hash list differs from StopNCII and NCMEC, I have wondered how you want to do this. In the draft I am simply using a GET to:

get the tat_api response
use the file_url in the response to fetch the pre-signed JSON file
write it to a local temp file
load the temp file into memory
yield the FetchDelta instances.

I wanted to check this with you before I continued with this approach.

The code for this logic is in impl/techagainstterrorism_api.py, lines 148-161.
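
The five steps above can be sketched roughly as follows. This is a minimal illustration rather than the PR's code: `fetch_hash_records` and the `http_get` callable are hypothetical stand-ins, and the only field taken from the thread is `file_url`.

```python
import json
import tempfile
import typing as t

# Hypothetical stand-in for an HTTP GET that returns the raw response body.
HttpGet = t.Callable[[str], bytes]


def fetch_hash_records(
    api_url: str, http_get: HttpGet
) -> t.Iterator[t.Dict[str, str]]:
    """Yield hash records by following the file_url in the API response."""
    # 1. Get the API response, which carries a pre-signed file_url.
    response = json.loads(http_get(api_url))
    # 2. Use file_url to fetch the pre-signed JSON file.
    payload = http_get(response["file_url"])
    # 3. Write it to a local temp file.
    with tempfile.TemporaryFile() as tmp:
        tmp.write(payload)
        tmp.seek(0)
        # 4. Load the temp file back into memory.
        records = json.load(tmp)
    # 5. Yield each record (the real code would yield FetchDelta instances).
    yield from records
```

Because the whole file ends up in memory anyway, the temp-file step could arguably be replaced by calling `json.loads(payload)` directly; the sketch keeps it only to mirror the steps as listed.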

TATSignalMetadata
I'm stumped for now about what we want to put into TATSignalMetadata, or perhaps my understanding of it is not quite there yet. In the response from the API, alongside the JSON file URL, we have the fields: file_name, created_on, total_hashes, ideology.

We also have additional information in each hash entry, see below what each hash entry in the JSON file contains.

[
   ...
   {
      "hash_digest": "12345abcde",
      "algorithm": "MD5",
      "ideology": "far-right",
      "file_type": "jpg"
   }
   ...
]

Am I right to assume we want this information lifted up into TATSignalMetadata?
Or is this metadata that gets populated when a match lookup takes place?

Thanks!

Dcallies

Am I right to assume we want this information lifted up into TATSignalMetadata?

Yup, that's where that lives.
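
As one possible shape, the per-hash fields could be carried on the metadata object like this. It is a minimal sketch: the field choice is an assumption based on the JSON entry in the question, and in pytx the class would presumably extend `state.FetchedSignalMetadata` rather than stand alone.

```python
import typing as t
from dataclasses import dataclass


@dataclass
class TATSignalMetadata:
    """Illustrative only: per-hash context lifted from a hash-list entry."""

    ideology: str
    file_type: str

    @classmethod
    def from_record(cls, record: t.Dict[str, str]) -> "TATSignalMetadata":
        # Keep the descriptive fields; the digest and algorithm form the key.
        return cls(ideology=record["ideology"], file_type=record["file_type"])
```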

Make sure to rebase your PR on main so it doesn't contain your previous PR in the change contents!

python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py

Dcallies

Additionally, for your simplest test plan:
Assume python-threatexchange CLI is aliased to tx

Add Tech Against Terrorism to the default list of SignalTypeAPIs so that it will show up when you list exchanges on the CLI
tx collab edit <whatever is needed to set up TAT>
tx fetch
Confirm you have real data by doing tx dataset
Match a piece of content by doing tx match ...
Make sure it has sane re-fetch behavior by doing tx fetch again

python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py

Bruce-Pouncey-TAT · 2024-09-13T11:39:38Z

Hey @Dcallies

I have pushed the first working version of the TAT cli!

Manual test

Using Docker, assuming a blank start:

Build tx
docker build --tag threatexchange .
Create credentials
docker run -v $HOME/.threatexchange:/var/lib/threatexchange threatexchange config api tat --credentials '<TCAP_USERNAME>' '<TCAP PASSWORD>'
Create collab config
docker run -v $HOME/.threatexchange:/var/lib/threatexchange threatexchange config collab edit tat --create "TAT"
Fetch hash list with verbose output
docker run -v $HOME/.threatexchange:/var/lib/threatexchange threatexchange -v fetch
View dataset information
docker run -v $HOME/.threatexchange:/var/lib/threatexchange threatexchange dataset

To test sane fetching, run steps 3 and 4 again; you will see that the output of dataset has not changed (no new hashes, and the new list has not been appended to the old one). We don't support incremental fetching.

Automatic testing

pytest

A couple of thoughts 🤔
Our hash list contains the following algorithms for each piece of content:
MD5, PDQ, SHA512, and SHA256, for every media type.

I could only find VideoMD5Signal and PdqSignal and was hesitant to start creating new Signals.

Dcallies

A couple of thoughts 🤔
Our hash list contains the following algorithms for each piece of content:
MD5, PDQ, SHA512, and SHA256, for every media type.

I could only find VideoMD5Signal and PdqSignal and was hesitant to start creating new Signals.

It is fairly straightforward to add more cryptographic hash types. I am of the opinion that cryptographic hashing on image types is a waste of time, but we did have a version of pytx that had it in the past. For video, MD5 has in general been the common denominator for video hashes.

A question for you: You said that you support PDQ for every piece of content. How are you handling that for videos?

My recommendation is to start with just image PDQ and video MD5. If a user wants support for other types, they are easy to add later.

On finishing this PR: You are almost there! There is one bug with the wrong signal types being returned in some cases, and I think you should reduce the monkeypatching in your tests to only the client. I think you will be done in the next iteration, and I'm willing to land-and-iterate as long as you fix the correctness issues.

I think the NCMEC test does this the best: https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/exchanges/impl/tests/test_ncmec.py

It has a mocked client that is hardcoded to return certain values from its get() methods, and then the exchange is hardcoded to use the mocked client.

Dcallies · 2024-09-13T14:19:03Z

python-threatexchange/threatexchange/cli/config_cmd.py

+    def init_argparse(cls, settings: CLISettings, ap: argparse.ArgumentParser) -> None:
+        ap.add_argument(
+            "--credentials",
+            metavar="STR",


This might be the default; remove it and see if it changes the output.

python-threatexchange/threatexchange/cli/config_cmd.py

python-threatexchange/README.md

Dcallies · 2024-09-13T14:25:59Z

python-threatexchange/threatexchange/exchanges/clients/techagainstterrorism/tests/test_api.py

    TATHashListAPI,
 )


 def mock_get_hash_list(
    ideology: str = TATIdeology._all.value,
-) -> t.Union[TATHashListResponse, t.Dict[str, str]]:
+) -> t.Union[t.List[t.Dict[str, str]], t.Dict[str, str]]:


This union is a pretty complex type, and I don't think it can return a bare dict anymore, can it? Shouldn't it throw an exception?
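
One way to avoid the union entirely is to raise on failure, so the success path has a single return type. A hedged sketch: `TATAPIError` and `parse_hash_list` are hypothetical names, not part of the PR.

```python
import typing as t


class TATAPIError(Exception):
    """Hypothetical error type for a failed hash-list request."""


def parse_hash_list(payload: t.Any) -> t.List[t.Dict[str, str]]:
    """Return the hash records, or raise instead of returning an error dict."""
    if not isinstance(payload, list):
        # An error body (for example a dict with a message) is not a result.
        raise TATAPIError(f"unexpected hash list payload: {payload!r}")
    return payload
```

With this shape, both the client method and its mock can be annotated as returning `t.List[t.Dict[str, str]]` only.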

python-threatexchange/threatexchange/exchanges/clients/fb_threatexchange/api.py

Dcallies · 2024-09-14T20:46:47Z

python-threatexchange/threatexchange/exchanges/clients/techagainstterrorism/api.py

@@ -133,19 +138,24 @@ def get_auth_token(self, username: str, password: str) -> t.Optional[str]:

    def get_hash_list(
        self, ideology: str = TATIdeology._all.value
-    ) -> TATHashListResponse:
+    ) -> t.List[t.Dict[str, str]]:


Dcallies · 2024-09-14T20:47:55Z

python-threatexchange/threatexchange/exchanges/impl/tests/test_techagainstterrorism.py

+def test_get_name(monkeypatch):
+    monkeypatch.setattr(
+        "threatexchange.exchanges.impl.techagainstterrorism_api._API_NAME",
+        "test_api_name",
+    )
+    assert TATSignalExchangeAPI.get_name() == "test_api_name"


blocking: This is a pretty tautological test. Why not just check get_name() against the string "tat"?
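
A non-tautological version of the test pins the expected literal. The sketch below uses a stand-in class so it runs on its own; the real test would import `TATSignalExchangeAPI` from the PR's module instead.

```python
class TATSignalExchangeAPI:
    """Stand-in for the real exchange class, for illustration only."""

    @classmethod
    def get_name(cls) -> str:
        return "tat"


def test_get_name() -> None:
    # Compare against the literal so an accidental rename fails the test.
    assert TATSignalExchangeAPI.get_name() == "tat"
```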

Dcallies · 2024-09-14T20:48:25Z

python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py

+from threatexchange.signal_type.pdq.signal import PdqSignal
+from threatexchange.signal_type.md5 import VideoMD5Signal
+
+_API_NAME: str = "tat"


blocking: It's more confusing to have this outside than inlined. I think in the place where you are copying it from, it is being reused somewhere.

Dcallies · 2024-09-14T20:51:38Z

python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py

+def _is_compatible_signal_type(record: t.Dict[str, str]) -> bool:
+    return record["file_type"] in ["mov", "m4v", "mp4"] or record["algorithm"] == "PDQ"
+
+
+def _type_mapping() -> t.Dict[str, str]:
+    return {
+        "PDQ": PdqSignal.get_name(),
+        "MD5": VideoMD5Signal.get_name(),
+    }
+
+
+def _get_delta_mapping(
+    record: t.Dict[str, str],
+) -> t.Tuple[t.Tuple[str, str], t.Optional[state.FetchedSignalMetadata]]:
+
+    if not _is_compatible_signal_type(record):
+        return (("", ""), None)
+
+    type_str = _type_mapping().get(record["algorithm"])
+
+    metadata = state.FetchedSignalMetadata()
+    return ((type_str or "", record["hash_digest"]), metadata)


blocking: You are returning photo MD5 hashes as VideoMD5s here.
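
A sketch of one way to rule that out: choose the signal type from the algorithm and the file type together, so MD5 maps to a video signal only for video files. The string names stand in for `PdqSignal.get_name()` and `VideoMD5Signal.get_name()` and are assumptions, as is the helper name `signal_type_for`.

```python
import typing as t

_VIDEO_FILE_TYPES = {"mov", "m4v", "mp4"}  # same extensions as the diff above

# Assumed names; the real code would call get_name() on the signal classes.
PDQ_NAME = "pdq"
VIDEO_MD5_NAME = "video_md5"


def signal_type_for(record: t.Dict[str, str]) -> t.Optional[str]:
    """Return the signal type name for a record, or None if unsupported."""
    algorithm = record["algorithm"]
    if algorithm == "PDQ":
        return PDQ_NAME
    if algorithm == "MD5" and record["file_type"] in _VIDEO_FILE_TYPES:
        return VIDEO_MD5_NAME
    # Photo MD5, SHA256, SHA512 and anything else are skipped.
    return None
```

Returning `None` for unsupported records also lets the caller skip them outright instead of emitting an empty `("", "")` key.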

Dcallies · 2024-09-14T20:54:09Z

python-threatexchange/threatexchange/exchanges/impl/tests/test_techagainstterrorism.py

+def test_fetch_iter(monkeypatch):
+    api_instance = TATSignalExchangeAPI(username="test_user", password="test_pass")
+    mock_client_instance = type(
+        "MockClient",
+        (object,),
+        {"get_hash_list": lambda self: [{"id": 1, "data": "test_data"}]},
+    )()
+    monkeypatch.setattr(api_instance, "get_client", lambda: mock_client_instance)
+
+    def mock_get_delta_mapping(entry):
+        return (("signal_type", "signal_value"), entry)
+
+    monkeypatch.setattr(
+        "threatexchange.exchanges.impl.techagainstterrorism_api._get_delta_mapping",
+        mock_get_delta_mapping,
+    )
+
+    result = list(api_instance.fetch_iter([], None))
+    assert len(result) == 1
+    assert isinstance(result[0], state.FetchDelta)
+    assert result[0].checkpoint == state.NoCheckpointing()
+    assert result[0].updates == {
+        ("signal_type", "signal_value"): {"id": 1, "data": "test_data"}
+    }


blocking: This test is mocking so much that you aren't really testing any of the logic, which is true of many of the tests here.

You probably only need to patch exactly one piece of functionality: the return value of client.get_hashes(). You should be able to copy the shape of a few records from your real API (you can replace the hashes with hashes from test files, or from the example signal types).
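
A sketch of the shape being suggested: replace only the client with a fake that returns realistic records, and let the mapping logic run for real. Everything here is a self-contained stand-in (`FakeClient`, `fetch_updates`, and the record values are hypothetical), not the pytx API.

```python
import typing as t

Record = t.Dict[str, str]


class FakeClient:
    """Stands in for the HTTP client; returns records shaped like the API's."""

    def get_hash_list(self) -> t.List[Record]:
        return [
            {"hash_digest": "aaa", "algorithm": "MD5", "file_type": "mp4"},
            {"hash_digest": "bbb", "algorithm": "MD5", "file_type": "jpg"},
            {"hash_digest": "ccc", "algorithm": "PDQ", "file_type": "jpg"},
        ]


def fetch_updates(client: FakeClient) -> t.Dict[t.Tuple[str, str], Record]:
    """Toy version of the logic under test: map records to (type, hash) keys."""
    updates: t.Dict[t.Tuple[str, str], Record] = {}
    for record in client.get_hash_list():
        is_video = record["file_type"] in {"mov", "m4v", "mp4"}
        if record["algorithm"] == "PDQ":
            updates[("pdq", record["hash_digest"])] = record
        elif record["algorithm"] == "MD5" and is_video:
            updates[("video_md5", record["hash_digest"])] = record
    return updates


def test_fetch_only_client_is_faked() -> None:
    updates = fetch_updates(FakeClient())
    # The photo MD5 ("bbb") must be dropped; the other two must be kept.
    assert set(updates) == {("video_md5", "aaa"), ("pdq", "ccc")}
```

Because nothing but the client is replaced, a regression in the mapping (such as treating a photo MD5 as a video MD5) would fail this test.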

Bruce-Pouncey-TAT · 2024-09-16T13:56:19Z

Hi! @Dcallies

I have made the changes you requested. Thank you for the support and feedback on the tests. I have used the existing pytest api fixture in clients/techagainstterrorism/tests/test_api.py (line 43), which returns the right data for testing and already has the key methods mocked.

To answer your question: my mistake, we don't support PDQ for videos. We use TMK for videos and PDQ for images for perceptual hashing. Every piece of content we hash will have four hashes: MD5, SHA256, SHA512, and either TMK or PDQ.

Dcallies

All of my blocking comments are addressed, so I accept! If there are more fixes we can do them in future PRs.

Thanks again for all of your hard work @Bruce-Pouncey-TAT !

Dcallies · 2024-09-18T01:03:55Z

python-threatexchange/threatexchange/exchanges/clients/techagainstterrorism/tests/test_api.py

+        {
+            "hash_digest": "12345abcde",
+            "algorithim": "MD5",
+            "ideology": ideology,
+            "file_type": "jpg",
+        },
+        {
+            "hash_digest": "12345abcde",
+            "algorithim": "MD5",
+            "ideology": ideology,
+            "file_type": "jpg",
+        },
+        {
+            "hash_digest": "12345abcde",
+            "algorithim": "MD5",
+            "ideology": ideology,
+            "file_type": "jpg",
+        },


ignorable: I can't tell whether or not these are changing between cases. You could put this in a global and reference it multiple times during the test.

facebook-github-bot added the CLA Signed label Sep 10, 2024

Dcallies requested changes Sep 10, 2024

Dcallies changed the title ~~Issue 1610~~ [pytx] Tech Against Terrorism SignalTypeAPI Implementation Sep 10, 2024

Dcallies reviewed Sep 10, 2024

python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py

Bruce-Pouncey-TAT marked this pull request as ready for review September 13, 2024 11:40

Bruce-Pouncey-TAT and others added 15 commits September 13, 2024 16:34

initial exchange impl

54f4850

Woek in progress. TAT Exchange interface implementation

061bb72

black

664eb60

extended self._get in client to accept optional full_url

cfc6193

removed test file

cefed79

impl working correctly

1044a70

black

5ef4943

unittests

0b939bd

updated README with TAT instructions

2323097

[pytx] Add Tech Against Terrorism Hash API (facebook#1617)

afa1996

[pdq] Add logo (facebook#1619)

756c466

We have a PDQ logo, so let's add it

[pdq] Add logo in README.md (facebook#1621)

6984cde

We have a logo, so link it in the README

[tmk] Add logo (facebook#1620)

fd8b80c

We have an approved logo for TMK, so lets add it

updated type to t.Type for tests

092fc2c

rebased

6ad6f15

Bruce-Pouncey-TAT force-pushed the issue_1610 branch from 712c770 to 6ad6f15 Compare September 13, 2024 15:56

black

2895215

Bruce-Pouncey-TAT force-pushed the issue_1610 branch from 05806bb to 2895215 Compare September 13, 2024 16:00

Dcallies requested changes Sep 14, 2024

updated TimeoutHTTPAdapter to absolute import

849e173

Bruce-Pouncey-TAT force-pushed the issue_1610 branch from 92f7fe9 to 849e173 Compare September 16, 2024 08:30

Bruce-Pouncey-TAT and others added 3 commits September 16, 2024 09:32

Merge branch 'main' into issue_1610

6c537d8

removed metavar STR from config api command

008561b

removed complex union type in test

9bae806

Bruce-Pouncey-TAT added 4 commits September 16, 2024 10:28

removed _API_NAME from test_get-name | also changed it to inline in a…

7c7a654

…pi file

enforced checking if MD5 hash is from a video

b00ae08

updated test impl

fc78dde

re-wrote tests using less patching and more logic testing

fad646c

Bruce-Pouncey-TAT added 2 commits September 16, 2024 19:45

code golf: full_url easier to read

bd4ce21

nit: removed unecessary credentials check

0acae2f

Dcallies approved these changes Sep 18, 2024

Dcallies merged commit b58abe6 into facebook:main Sep 18, 2024
6 checks passed

Dcallies assigned Bruce-Pouncey-TAT Sep 18, 2024
