Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pytx] Tech Against Terrorism SignalTypeAPI Implementation #1622

Merged
merged 26 commits into from
Sep 18, 2024

Conversation

Bruce-Pouncey-TAT
Copy link
Contributor

@Bruce-Pouncey-TAT Bruce-Pouncey-TAT commented Sep 10, 2024

Summary

Hey @Dcallies

Here is a draft PR for part 2 of the issue_1610

I'm looking for some support on a few key details of my implementation.

JSON File download
Given our approach to delivering the hash list is differently from StopNCII and NCME I have wondered how you want to do this . In the draft I am simply using a get to..

  1. get the tat_api response
  2. use the file_url in the response to fetch the pre-signed JSON file
  3. write it to a local temp file
  4. load the temp file into memory
  5. yield the FetchDelta instances.

I wanted to check this with you before I continued with this approach.

The code for this logic is in impl/techagainstterrorism_api.py lines: 148-161

TATSignalMetadata
I'm stumped at this time as to what we want to put into TATSignalMetadata or my understanding of it is not quite there yet. Outside of the response from the API and along side the JSON file URL we have the fields: file_name created_on total_hashes ideology .

We also have additional information in each hash entry, see below what each hash entry in the JSON file contains.

[
   ...
   {
      "hash_digest": "12345abcde",
       "algorithim": "MD5"
       "ideology": "far-right"
       "file_type": "jpg"
   }
   ...
]

Am I right to assume we want this information lifting up into the TATSignalMetadata ?
or
Is this Metadata that gets populated when a match lookup takes place ?

Thanks!

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right to assume we want this information lifting up into the TATSignalMetadata ?

Yup, that's where that lives.

Make sure to rebase your PR on main so it doesn't contain your previous PR in the change contents!

@Dcallies Dcallies changed the title Issue 1610 [pytx] Tech Against Terrorism SignalTypeAPI Implementation Sep 10, 2024
Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, for your simplest test plan:
Assume python-threatexchange CLI is aliased to tx

  1. Add Tech Against Terrorism in the default list of SignalTypeAPIs so that it will show up when you list exchanges on the CLI
  2. tx collab edit <whatever is needed to set up TAT>
  3. tx fetch
  4. Confirm you have real data by doing tx dataset
  5. Match a piece of content by doing tx match ...
  6. Make sure it has sane re-fetch behavior by doing tx fetch again

@Bruce-Pouncey-TAT
Copy link
Contributor Author

Hey @Dcallies

I have pushed the first working version of the TAT cli!

Manual test

Using docker assuming a blank start

  1. Build tx
    docker build --tag threatexchange .

  2. Create credentials
    docker run -v $HOME/.threatexchange:/var/lib/threatexchange threatexchange config api tat --credentials '<TCAP_USERNAME>' '<TCAP PASSWORD>'

  3. Create collab config
    docker run -v $HOME/.threatexchange:/var/lib/threatexchange threatexchange config collab edit tat --create "TAT"

  4. Fetch hash list with verbose output
    docker run -v $HOME/.threatexchange:/var/lib/threatexchange threatexchange -v fetch

  5. View dataset information
    docker run -v $HOME/.threatexchange:/var/lib/threatexchange threatexchange dataset

To test sane fetching run step 3 and 4 again and you will see that output of dataset has not changed (no new hashes or new hasn't been appended to old). We don't support incremental fetching.

Automatic testing

pytest

A couple of thoughts 🤔
Our hash list contains the following algorithms for each piece of content
MD5 SH256 PDQ SHA512 SHA256 for every media type.

I could only find VideoMD5Signal and PdqSignal and was hesitant to start creating new Signals.

@Bruce-Pouncey-TAT Bruce-Pouncey-TAT marked this pull request as ready for review September 13, 2024 11:40
Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of thoughts 🤔
Our hash list contains the following algorithms for each piece of content
MD5 SH256 PDQ SHA512 SHA256 for every media type.

I could only find VideoMD5Signal and PdqSignal and was hesitant to start creating new Signals.

It is fairly straightforward to add more cryptographic hash types. I am of the opinion that cryptographic hashing on image types is a waste of time, but we did have a version of pytx that had it in the past. For video, In general MD5 has been the most common denominator for video hashes.

A questions for you: You said that you support PDQ for every piece of content. How are you handling that for videos?

My recommendation is to start with just image PDQ and video MD5, and if you have a user who wants support, they are easy to add later.

On finishing this PR: You are almost there! There is one bug with the wrong signal types being returned in some cases, and I think you should reduce the amount of monkeypatching in your test to just only the client. I think you will be done in the next iteration, and I'm willing to land-and-iterate as long as you fix the correctness issues.

I think the NCMEC test does this the best: https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/exchanges/impl/tests/test_ncmec.py

It has a mocked client that is hardcoded to return certain values from it's get() methods, and then the exchange is hardcoded to use the mocked client.

def init_argparse(cls, settings: CLISettings, ap: argparse.ArgumentParser) -> None:
ap.add_argument(
"--credentials",
metavar="STR",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be the default, remove it and see if it changes the output

python-threatexchange/README.md Show resolved Hide resolved
TATHashListAPI,
)


def mock_get_hash_list(
ideology: str = TATIdeology._all.value,
) -> t.Union[TATHashListResponse, t.Dict[str, str]]:
) -> t.Union[t.List[t.Dict[str, str]], t.Dict[str, str]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This union is a pretty complex type, and I don't think it can return a bare dict anymore can it? Shouldn't it throw an exception?

@@ -133,19 +138,24 @@ def get_auth_token(self, username: str, password: str) -> t.Optional[str]:

def get_hash_list(
self, ideology: str = TATIdeology._all.value
) -> TATHashListResponse:
) -> t.List[t.Dict[str, str]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 36 to 41
def test_get_name(monkeypatch):
monkeypatch.setattr(
"threatexchange.exchanges.impl.techagainstterrorism_api._API_NAME",
"test_api_name",
)
assert TATSignalExchangeAPI.get_name() == "test_api_name"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: This is a pretty tautological test. Why not just check get_name() against the string "tat" ?

from threatexchange.signal_type.pdq.signal import PdqSignal
from threatexchange.signal_type.md5 import VideoMD5Signal

_API_NAME: str = "tat"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: It's more confusing to have this outside than inlined. I think in the place where you are copying it from, it is being reused somewhere.

Comment on lines 108 to 129
def _is_compatible_signal_type(record: t.Dict[str, str]) -> bool:
return record["file_type"] in ["mov", "m4v", "mp4"] or record["algorithm"] == "PDQ"


def _type_mapping() -> t.Dict[str, str]:
return {
"PDQ": PdqSignal.get_name(),
"MD5": VideoMD5Signal.get_name(),
}


def _get_delta_mapping(
record: t.Dict[str, str],
) -> t.Tuple[t.Tuple[str, str], t.Optional[state.FetchedSignalMetadata]]:

if not _is_compatible_signal_type(record):
return (("", ""), None)

type_str = _type_mapping().get(record["algorithm"])

metadata = state.FetchedSignalMetadata()
return ((type_str or "", record["hash_digest"]), metadata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: You are returning photo MD5 hashes as VideoMD5s here.

Comment on lines 63 to 86
def test_fetch_iter(monkeypatch):
api_instance = TATSignalExchangeAPI(username="test_user", password="test_pass")
mock_client_instance = type(
"MockClient",
(object,),
{"get_hash_list": lambda self: [{"id": 1, "data": "test_data"}]},
)()
monkeypatch.setattr(api_instance, "get_client", lambda: mock_client_instance)

def mock_get_delta_mapping(entry):
return (("signal_type", "signal_value"), entry)

monkeypatch.setattr(
"threatexchange.exchanges.impl.techagainstterrorism_api._get_delta_mapping",
mock_get_delta_mapping,
)

result = list(api_instance.fetch_iter([], None))
assert len(result) == 1
assert isinstance(result[0], state.FetchDelta)
assert result[0].checkpoint == state.NoCheckpointing()
assert result[0].updates == {
("signal_type", "signal_value"): {"id": 1, "data": "test_data"}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: This test is mocking so much that you aren't really testing any of the logic, which is true of much of the tests in this function.

You probably only need to patch exactly one functionality - the return of the client.get_hashes(), which you should be able to copy the shape of a few records from your real API (you can replace the hashes with hashes from test files, or from the example signal types.

@Bruce-Pouncey-TAT
Copy link
Contributor Author

Hi! @Dcallies

I have made the changes you requested - thank you for the support and feedback on the tests. I have used the existing pytest api fixture in clients/techagainstterrorism/tests/test_api.py line:43 which will return the right data for testing and has the key methods for this to work already mocked.

And to answer your question. My mistake - we don't support PDQ for videos. We use TMK for videos and PDQ for images, for perceptual hashing. Every piece of content we hash will have 4 hashes, MD5, SHA256, SHA512, TMK or PDQ

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of my blocking comments are addressed, so I accept! If there are more fixes we can do them in future PRs.

Thanks again for all of your hard work @Bruce-Pouncey-TAT !

Comment on lines 17 to 34
{
"hash_digest": "12345abcde",
"algorithim": "MD5",
"ideology": ideology,
"file_type": "jpg",
},
{
"hash_digest": "12345abcde",
"algorithim": "MD5",
"ideology": ideology,
"file_type": "jpg",
},
{
"hash_digest": "12345abcde",
"algorithim": "MD5",
"ideology": ideology,
"file_type": "jpg",
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignorable: I can't tell whether ot not these are changing between cases, you can could put this in a global to reference multiple times during the test

@Dcallies Dcallies merged commit b58abe6 into facebook:main Sep 18, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants