-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pytx] Tech Against Terrorism SignalTypeAPI Implementation #1622
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I right to assume we want this information lifting up into the TATSignalMetadata ?
Yup, that's where that lives.
Make sure to rebase your PR on main so it doesn't contain your previous PR in the change contents!
python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py
Outdated
Show resolved
Hide resolved
python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, for your simplest test plan:
Assume python-threatexchange CLI is aliased to tx
- Add Tech Against Terrorism in the default list of SignalTypeAPIs so that it will show up when you list exchanges on the CLI
tx collab edit <whatever is needed to set up TAT>
tx fetch
- Confirm you have real data by doing
tx dataset
- Match a piece of content by doing
tx match ...
- Make sure it has sane re-fetch behavior by doing
tx fetch
again
python-threatexchange/threatexchange/exchanges/impl/techagainstterrorism_api.py
Outdated
Show resolved
Hide resolved
Hey @Dcallies I have pushed the first working version of the TAT cli! Manual testUsing docker assuming a blank start
To test sane fetching run step 3 and 4 again and you will see that output of Automatic testing
A couple of thoughts 🤔 I could only find |
We have a PDQ logo, so let's add it
We have a logo, so link it in the README
We have an approved logo for TMK, so lets add it
712c770
to
6ad6f15
Compare
05806bb
to
2895215
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of thoughts 🤔
Our hash list contains the following algorithms for each piece of content
MD5 SH256 PDQ SHA512 SHA256 for every media type.I could only find VideoMD5Signal and PdqSignal and was hesitant to start creating new Signals.
It is fairly straightforward to add more cryptographic hash types. I am of the opinion that cryptographic hashing on image types is a waste of time, but we did have a version of pytx that had it in the past. For video, In general MD5 has been the most common denominator for video hashes.
A questions for you: You said that you support PDQ for every piece of content. How are you handling that for videos?
My recommendation is to start with just image PDQ and video MD5, and if you have a user who wants support, they are easy to add later.
On finishing this PR: You are almost there! There is one bug with the wrong signal types being returned in some cases, and I think you should reduce the amount of monkeypatching in your test to just only the client. I think you will be done in the next iteration, and I'm willing to land-and-iterate as long as you fix the correctness issues.
I think the NCMEC test does this the best: https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/exchanges/impl/tests/test_ncmec.py
It has a mocked client that is hardcoded to return certain values from it's get() methods, and then the exchange is hardcoded to use the mocked client.
def init_argparse(cls, settings: CLISettings, ap: argparse.ArgumentParser) -> None: | ||
ap.add_argument( | ||
"--credentials", | ||
metavar="STR", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be the default, remove it and see if it changes the output
TATHashListAPI, | ||
) | ||
|
||
|
||
def mock_get_hash_list( | ||
ideology: str = TATIdeology._all.value, | ||
) -> t.Union[TATHashListResponse, t.Dict[str, str]]: | ||
) -> t.Union[t.List[t.Dict[str, str]], t.Dict[str, str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This union is a pretty complex type, and I don't think it can return a bare dict anymore can it? Shouldn't it throw an exception?
python-threatexchange/threatexchange/exchanges/clients/fb_threatexchange/api.py
Outdated
Show resolved
Hide resolved
@@ -133,19 +138,24 @@ def get_auth_token(self, username: str, password: str) -> t.Optional[str]: | |||
|
|||
def get_hash_list( | |||
self, ideology: str = TATIdeology._all.value | |||
) -> TATHashListResponse: | |||
) -> t.List[t.Dict[str, str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
def test_get_name(monkeypatch): | ||
monkeypatch.setattr( | ||
"threatexchange.exchanges.impl.techagainstterrorism_api._API_NAME", | ||
"test_api_name", | ||
) | ||
assert TATSignalExchangeAPI.get_name() == "test_api_name" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: This is a pretty tautological test. Why not just check get_name()
against the string "tat" ?
from threatexchange.signal_type.pdq.signal import PdqSignal | ||
from threatexchange.signal_type.md5 import VideoMD5Signal | ||
|
||
_API_NAME: str = "tat" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: It's more confusing to have this outside than inlined. I think in the place where you are copying it from, it is being reused somewhere.
def _is_compatible_signal_type(record: t.Dict[str, str]) -> bool: | ||
return record["file_type"] in ["mov", "m4v", "mp4"] or record["algorithm"] == "PDQ" | ||
|
||
|
||
def _type_mapping() -> t.Dict[str, str]: | ||
return { | ||
"PDQ": PdqSignal.get_name(), | ||
"MD5": VideoMD5Signal.get_name(), | ||
} | ||
|
||
|
||
def _get_delta_mapping( | ||
record: t.Dict[str, str], | ||
) -> t.Tuple[t.Tuple[str, str], t.Optional[state.FetchedSignalMetadata]]: | ||
|
||
if not _is_compatible_signal_type(record): | ||
return (("", ""), None) | ||
|
||
type_str = _type_mapping().get(record["algorithm"]) | ||
|
||
metadata = state.FetchedSignalMetadata() | ||
return ((type_str or "", record["hash_digest"]), metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: You are returning photo MD5 hashes as VideoMD5s here.
def test_fetch_iter(monkeypatch): | ||
api_instance = TATSignalExchangeAPI(username="test_user", password="test_pass") | ||
mock_client_instance = type( | ||
"MockClient", | ||
(object,), | ||
{"get_hash_list": lambda self: [{"id": 1, "data": "test_data"}]}, | ||
)() | ||
monkeypatch.setattr(api_instance, "get_client", lambda: mock_client_instance) | ||
|
||
def mock_get_delta_mapping(entry): | ||
return (("signal_type", "signal_value"), entry) | ||
|
||
monkeypatch.setattr( | ||
"threatexchange.exchanges.impl.techagainstterrorism_api._get_delta_mapping", | ||
mock_get_delta_mapping, | ||
) | ||
|
||
result = list(api_instance.fetch_iter([], None)) | ||
assert len(result) == 1 | ||
assert isinstance(result[0], state.FetchDelta) | ||
assert result[0].checkpoint == state.NoCheckpointing() | ||
assert result[0].updates == { | ||
("signal_type", "signal_value"): {"id": 1, "data": "test_data"} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: This test is mocking so much that you aren't really testing any of the logic, which is true of much of the tests in this function.
You probably only need to patch exactly one functionality - the return of the client.get_hashes(), which you should be able to copy the shape of a few records from your real API (you can replace the hashes with hashes from test files, or from the example signal types.
92f7fe9
to
849e173
Compare
Hi! @Dcallies I have made the changes you requested - thank you for the support and feedback on the tests. I have used the existing pytest api fixture in And to answer your question. My mistake - we don't support PDQ for videos. We use TMK for videos and PDQ for images, for perceptual hashing. Every piece of content we hash will have 4 hashes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of my blocking comments are addressed, so I accept! If there are more fixes we can do them in future PRs.
Thanks again for all of your hard work @Bruce-Pouncey-TAT !
{ | ||
"hash_digest": "12345abcde", | ||
"algorithim": "MD5", | ||
"ideology": ideology, | ||
"file_type": "jpg", | ||
}, | ||
{ | ||
"hash_digest": "12345abcde", | ||
"algorithim": "MD5", | ||
"ideology": ideology, | ||
"file_type": "jpg", | ||
}, | ||
{ | ||
"hash_digest": "12345abcde", | ||
"algorithim": "MD5", | ||
"ideology": ideology, | ||
"file_type": "jpg", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignorable: I can't tell whether ot not these are changing between cases, you can could put this in a global to reference multiple times during the test
Summary
Hey @Dcallies
Here is a draft PR for part 2 of the issue_1610
I'm looking for some support on a few key details of my implementation.
JSON File download
Given our approach to delivering the hash list is differently from StopNCII and NCME I have wondered how you want to do this . In the draft I am simply using a
get
to..file_url
in the response to fetch the pre-signed JSON fileFetchDelta
instances.I wanted to check this with you before I continued with this approach.
The code for this logic is in
impl/techagainstterrorism_api.py
lines: 148-161TATSignalMetadata
I'm stumped at this time as to what we want to put into
TATSignalMetadata
or my understanding of it is not quite there yet. Outside of the response from the API and along side the JSON file URL we have the fields:file_name
created_on
total_hashes
ideology
.We also have additional information in each hash entry, see below what each hash entry in the JSON file contains.
Am I right to assume we want this information lifting up into the
TATSignalMetadata
?or
Is this Metadata that gets populated when a match lookup takes place ?
Thanks!