Confusion about Affordance Features #1
Hi, thanks for your interest. Regards,
OK, got it. In that case, I am curious what would happen if we just used human features as affordance features. Affordance features are basically concatenated with object features to compose new HOIs, so it would be interesting to see the results. I am not sure whether you have tested this already. Anyway, thanks for the reply.
Hi, with the human box feature as the affordance feature, HOI detection performance decreases noticeably compared to the union box (see Table 3 in VCL). However, the compositional approach still clearly improves over the human box baseline. We also evaluated the human box in FCL and observed a similar trend (human box baseline: 22.91 / 16.66 / 24.77; human box FCL: 23.83 / 18.62 / 25.39). Thus, using the human box does not affect the effectiveness of the compositional approach. However, we did not evaluate the human box on affordance recognition, since we found that the union box achieves consistently better results than the human box. We expect that the human box would not change the effectiveness of the visual compositional approach on affordance recognition relative to its baseline. As for comparing the human box and the union box on affordance recognition, our intuition is that the union box might be better because it yields a better verb representation, but we are not sure. We have removed the human box model weights, so we cannot evaluate this right now. While considering your question, we found a set of experiments on the verb auxiliary loss, which achieves a better verb representation and better HOI detection results:
The table (reported in mAP; we first evaluated ATL with F1, but when preparing the camera-ready version we found that F1 might be less robust than mAP) corresponds to Tab 5 in the Appendix. However, the auxiliary loss does not seem to always improve affordance recognition. Thanks for your comments; we hadn't noticed this before.
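To make the FCL comparison above concrete, the snippet below subtracts the human box baseline from the human box FCL result, keeping the three numbers in the order they are quoted in the thread (no column labels are assumed). All three differences are positive, which is what supports the claim that the human box does not hurt the compositional approach.

```python
# Human box results quoted in the thread, three numbers per setting, same order.
baseline = [22.91, 16.66, 24.77]
fcl = [23.83, 18.62, 25.39]

# Per-column improvement of FCL over the baseline, rounded to two decimals.
gains = [round(f - b, 2) for b, f in zip(baseline, fcl)]
print(gains)  # [0.92, 1.96, 0.62]
```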
Thanks a lot for your clarification. I am just a bit skeptical about using union boxes for affordance features, since union boxes still contain the original object's features.
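The concern above follows from how a union box is usually defined: the smallest axis-aligned box enclosing both the human box and the object box. The sketch below assumes that common definition and `(x1, y1, x2, y2)` corner coordinates; the box values are made up for illustration, and this is not taken from the ATL code. It shows that the union box always contains the whole object box, so features pooled from it also see the object region.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), corner format (assumed)


def union_box(human: Box, obj: Box) -> Box:
    """Smallest axis-aligned box enclosing both the human and the object box."""
    return (min(human[0], obj[0]), min(human[1], obj[1]),
            max(human[2], obj[2]), max(human[3], obj[3]))


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


# Hypothetical boxes for one human-object pair.
human = (10.0, 20.0, 60.0, 120.0)
obj = (50.0, 80.0, 140.0, 130.0)

u = union_box(human, obj)
print(u)                 # (10.0, 20.0, 140.0, 130.0)
print(contains(u, obj))  # True: the union box covers the whole object region
```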
You are welcome. Your question is valuable. I think the compositional approach (composing verbs and objects across different images) also forces the verb representation to be more discriminative (see the t-SNE figure in VCL). This might alleviate the effect of the original object features; otherwise, VCL would not improve over the corresponding union box baseline. When the affordance recognition result of the human box model is finished, I'll post it.
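A minimal sketch of the composition idea discussed in this thread: verb (affordance) features from one image are concatenated with object features from another image to form new, composed HOI features. The feature sizes, the random features, and the `compose` helper are hypothetical and for illustration only; this is not the VCL/ATL implementation, which also has to assign labels to the composed pairs and feed them to a classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
D_VERB, D_OBJ = 4, 3  # toy feature sizes (hypothetical)

# Image A contributes one verb feature, image B contributes two object features.
verb_a = rng.standard_normal((1, D_VERB))
objs_b = rng.standard_normal((2, D_OBJ))


def compose(verb_feats: np.ndarray, obj_feats: np.ndarray) -> np.ndarray:
    """Pair every verb feature with every object feature by concatenation.

    Row order is (v0, o0), (v0, o1), ..., (v1, o0), ...
    """
    n_v, n_o = len(verb_feats), len(obj_feats)
    v = np.repeat(verb_feats, n_o, axis=0)   # v0, v0, ..., v1, v1, ...
    o = np.tile(obj_feats, (n_v, 1))         # o0, o1, ..., o0, o1, ...
    return np.concatenate([v, o], axis=1)


composed = compose(verb_a, objs_b)
print(composed.shape)  # (2, 7): two composed HOI features, each verb + object
```

Because the verb part comes from one image and the object part from another, a verb feature that still encodes its original object (the union box concern) would be paired with a mismatched object, which is one intuition for why composition pushes the verb representation to be more discriminative.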
Well, the mAP of the human box ATL (HICO) model on the HICO test set is 46.32, which is much worse than the result (59.44) of the corresponding union box model in Tab 12 in the Appendix. I'll check the result again after the model converges.
The HOI detection performance of the human box model (ATL (HICO)) is 22.99% mAP. The mAP on COCO val2017 is 39.40%, which is also much worse than the 52.01 of the corresponding union box model in Tab 12 in the Appendix. All the results are worse than I expected.
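For a quick sense of how large the affordance recognition gap is, the snippet below computes the difference between the union box and human box models using only the mAP values quoted in the thread (the HICO number was reported before the model had fully converged).

```python
# (human box mAP, union box mAP from Tab 12 in the Appendix), as quoted in the thread.
results = {
    "HICO test": (46.32, 59.44),
    "COCO val2017": (39.40, 52.01),
}

# How far the human box model falls behind the union box model on each split.
gaps = {split: round(union - human, 2) for split, (human, union) in results.items()}
for split, gap in gaps.items():
    print(f"{split}: human box is {gap:.2f} mAP lower")
```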
I really appreciate you taking the time to run experiments to answer my questions. You might want to add this experiment to the supplemental material of the paper.
Thanks. It is just that I have benefited a lot from taking questions (especially comments from peer review) seriously. The first two works (VCL & ATL) were rejected at their first submission, but the reviewers' comments made the papers better and sometimes inspired me a lot. I'll consider adding this experiment to the Appendix.
Hi, I have added more experiments and updated the preprint on arXiv: https://arxiv.org/abs/2104.02867. Interestingly, I found that with the human box verb representation, the performance of the baseline increases, while the performance of ATL drops.
Hello, thanks for your nice work. After reading the ATL paper, I am confused about the affordance features. You said in the paper
What are these affordance features, actually? I mean, where are these features pooled from?