
Wrong Sampling Frame Rate and Questions about Data #1

Closed
MeiliMa opened this issue Jul 24, 2022 · 15 comments


MeiliMa commented Jul 24, 2022

Your work looks awesome. While trying to reproduce it myself, I found that the sampling frame rate in the code differs from the one reported in the paper.

The sampling is controlled by the following code

sample_rate, frame_rate = self.video_clip.paras
frame_step = int(0.4 / (sample_rate / frame_rate))

where frame_step is 1, leading to an interval of 6 frames in ETH/UCY.
However, the interval mentioned in the paper and used by the other baselines is actually 10. Using a 6-frame interval greatly shortens the trajectories and makes the prediction easier.
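For illustration, here is roughly how that frame_step works out for the two paras configurations in the repository (a minimal sketch, not your exact code; the function name is only illustrative):

def frame_step(sample_rate, frame_rate, interval_s=0.4):
    # mirrors: frame_step = int(0.4 / (sample_rate / frame_rate))
    return int(interval_s / (sample_rate / frame_rate))

print(frame_step(6, 25))   # eth:             int(0.4 / 0.24) = 1 -> one annotated row = 6 video frames
print(frame_step(10, 25))  # hotel/univ/zara: int(0.4 / 0.40) = 1 -> one annotated row = 10 video frames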

For ETH/UCY, your data differ from those of some baselines you compare against in the paper. For example, the data used by Trajectron++ can be found at https://github.com/StanfordASL/Trajectron-plus-plus/tree/master/experiments/pedestrians/raw. They use the same data as most baselines, including SocialGAN, AgentFormer, TransformerTF, etc.

For SDD, the data used by YNet can be found at https://github.com/HarshayuGirase/Human-Path-Prediction/tree/master/ynet#pretrained-models-data-and-config-files, which is also different from what you used.

To draw a fair comparison, I think you need to make sure you use the same training/testing data as the baselines listed in your paper.


cocoon2wong commented Jul 24, 2022

Hi! @MeiliMa
Thanks for your question!

  • ETH-UCY

The dataset files we used are the original true_pos.csv files, in which the ETH-eth sub-dataset is annotated every 6 frames and the others every 10 frames. When we reproduced the code of Social LSTM and SR-LSTM, we found that they also used this configuration on THAT sub-dataset. For the other sets (hotel, univ, zara), we use (sample_rate, frame_rate) = (10, 25), the same as most previous works and as reported in our paper:

subsets['eth'] = dict(
    dataset='eth',
    dataset_dir='./data/eth/univ',
    order=[1, 0],
    paras=[6, 25],
    # ...
)

subsets['hotel'] = dict(
    dataset='hotel',
    dataset_dir='./data/eth/hotel',
    order=[0, 1],
    paras=[10, 25],
    # ...
)

subsets['zara1'] = dict(
    dataset='zara1',
    dataset_dir='./data/ucy/zara/zara01',
    order=[1, 0],
    paras=[10, 25],
    # ...
)

  • SDD

The SDD dataset files we used are the original annotations.txt files, like:

0 1354 1121 1406 1184 4000 1 0 0 "Biker"
0 1354 1121 1406 1184 4001 1 0 1 "Biker"
0 1354 1121 1406 1184 4002 1 0 1 "Biker"
0 1354 1121 1406 1184 4003 1 0 1 "Biker"
0 1354 1121 1406 1184 4004 1 0 1 "Biker"
0 1354 1121 1406 1184 4005 1 0 1 "Biker"
0 1354 1121 1406 1184 4006 1 0 1 "Biker"

These annotations use bounding boxes; we process the files and split the trajectories with ./scripts/sdd_txt2csv.py and ./scripts/add_sdd.py in these lines:

csv_data_c = [
    float(data_original[5]),
    float(data_original[0]),
    (float(data_original[1]) + float(data_original[3])) / (2 * scale),
    (float(data_original[2]) + float(data_original[4])) / (2 * scale),
]

Here scale = 100 is a scaling parameter that brings the data to a scale similar to ETH-UCY. The results reported in our paper have been corrected (*100) back with this parameter.
for base_set in set_index:
    for index in set_index[base_set][0]:
        subsets['{}{}'.format(base_set, index)] = dict(
            dataset='{}{}'.format(base_set, index),
            dataset_dir='./data/sdd/{}/video{}'.format(
                base_set, index),
            order=[1, 0],
            paras=[1, 30],
            video_path='./videos/sdd_{}_{}.mov'.format(
                base_set, index),
            weights=[set_index[base_set][1], 0.0,
                     set_index[base_set][1], 0.0],
            scale=2,
        )

Here frame_step = 0.4 / (1/30) = 12 frames; the video is annotated at 30 fps, so the sampling interval is also 0.4 seconds.
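Put together, one annotation line is converted roughly as follows (a simplified sketch of what ./scripts/sdd_txt2csv.py does, not the exact code; the function name is only illustrative):

# SDD columns: track_id xmin ymin xmax ymax frame lost occluded generated "label"
def parse_sdd_line(line, scale=100.0):
    d = line.split()
    frame, agent_id = float(d[5]), float(d[0])
    x = (float(d[1]) + float(d[3])) / (2 * scale)   # bounding-box center, scaled down
    y = (float(d[2]) + float(d[4])) / (2 * scale)
    return [frame, agent_id, x, y]

print(parse_sdd_line('0 1354 1121 1406 1184 4000 1 0 0 "Biker"'))
# -> [4000.0, 0.0, 13.8, 11.525]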

I hope this helps explain the issue (i.e., why our data seem different from the Trajectron++ split, which seems to have been used first by Social GAN, even though both ours and theirs come from the same original files).
But I still have some doubts about the way the eth sub-dataset is handled in ETH-UCY. For example, the CVPR2019/TPAMI2020 paper "SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction" treats it as a pre-processing trick. They say

ETH-univ frame rate issue (EUf): For ETH-Univ scenario, the original video from [4] is an accelerated version. We treat every 6 frames as 0.4s, rather than 10 frames in [11].

I see that some previous works (not based on the Trajectron++ sources) have also actually used the 6-frame step on the eth sub-dataset. Further, I have checked the ETH-UCY datasets used by Y-Net and Trajectron++ that you mentioned above. It seems that Y-Net uses pixel-based ETH-UCY data, and its total amount (884 lines in eth) is smaller than the original data used by both Trajectron++ (5492 lines in eth) and our method (8909 lines in eth). We will try to use the data obtained after the Trajectron++ interpolation on this dataset in the future. Please also point out if you have better insights.


cocoon2wong commented Jul 24, 2022

@MeiliMa
I have checked the eth dataset's video file (which I have uploaded to my Google Drive: https://drive.google.com/file/d/1SELMRrVE9M3kIp7piNhic6aODVDBcNjh/view?usp=sharing), and this video was indeed accelerated. Therefore, the 6-frame annotation interval on the eth sub-dataset (and only on that sub-dataset) already accounts for the acceleration, which means that the Trajectron++ source data may lead to a longer prediction period (rather than the intended 3.2 s -> 4.8 s).


cocoon2wong commented Jul 24, 2022

I compared the number of annotation lines in the ETH-UCY dataset used by the different approaches, where one 2D coordinate of one agent at a given moment counts as one line:

Dataset                         eth    hotel   univ1   zara1   zara2   annotation type
Ours (Social LSTM, ...)         8908   6154    21813   5153    9722    real-world meters
Trajectron++ (Social GAN, ...)  5492   6543    21813   5153    9722    real-world meters
Y-Net                           884    3188    20983   4966    9429    pixels

It shows that the data used to train our model is almost the same as Trajectron++'s (except for the ETH-eth subset). After a rough comparison, we also found that the amount of SDD data used by Y-Net is less than the original SDD dataset. Therefore, to ensure a fair comparison, we produced our data from the raw datasets (SDD: the original SDD release, ETH: the BIWI Walking Pedestrians dataset), since there are now so many versions of processed data.

Information about the frame_rate in the ETH-eth dataset:
The original download contains a file ewap_dataset/seq_eth/info.txt, which says

INFO:
The annotation was done at 2.5 fps, that is with a timestep of 0.4 seconds.  

NOTES:
This sequence was acquired from the top of the ETH main building, Zurich, by Stefano Pellegrini and Andreas Ess in 2009. 

Therefore, the annotation step is 0.4 s, consistent with our dataset configuration paras = [6, 25].


MeiliMa commented Jul 24, 2022

A 6-frame sampling interval leads to observation and prediction horizons of 1.92 s and 2.88 s given 8 and 12 samples respectively, while a 10-frame interval leads to 3.2 s and 4.8 s. So the time horizon of the data used by Trajectron++ is about 1.7 times as long as that of your data. Could this be a reason why your results on ETH are much better than Trajectron++'s?
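For reference, the arithmetic (a quick sketch, assuming 25 fps video and 8/12-step observation/prediction windows):

fps = 25
obs_steps, pred_steps = 8, 12
for interval in (6, 10):                 # frames between consecutive samples
    step_s = interval / fps              # seconds per sampled step
    print(interval, obs_steps * step_s, pred_steps * step_s)
# 6-frame interval:  1.92 s observed, 2.88 s predicted
# 10-frame interval: 3.20 s observed, 4.80 s predicted (10/6 ≈ 1.7x longer)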

Besides the frame-sampling difference, the data used by Trajectron++, SocialGAN, etc. have only two decimal places, while your data have four. I do not know how much error this would introduce, given that the gap between your results and Trajectron++'s is quite small on datasets other than ETH.

For SDD, the following are scenarios used by YNet for training and testing, according to the data they shared:

Training Set: bookstore_{0-3}, coupa_3, deathCircle_{0-4}, gates_{0,1,3-8}, hyang_{4-7,9}, nexus_{0-9}
Testing Set: coupa_{0,1}, gates_2, hyang_{0,1,3,8}, little_{0-3}, nexus_{5,6}, quad_{0-3}

The files you used are, according to sdd.plist:

Training Set: bookstore_{0-5}, coupa_{0,2,3}, deathCircle_{0-4}, gates_{0,2,5,6}, hyang_{0,1,5,8,10,12,14}, little_0, nexus_{0,1,2,5,6,8-11}, quad_{0,1,3}
Testing Set: bookstore_6, deathCircle_4, hyang_{2,3,6,7,11,13}, little_1, gates_{7,8}, nexus_3

So, you use 36 training scenarios and 12 testing scenarios, while YNet uses only 30 training scenarios but 17 testing scenarios.

Anyway, this is great work. I hope you can show its advantages clearly by drawing a fairer comparison with existing approaches.


MeiliMa commented Jul 24, 2022

Why was this issue closed? I think you need to solve the problem, or add a note about it in your paper, before closing this issue.


cocoon2wong commented Jul 24, 2022

Why was this issue closed? I think you need to solve the problem, or add a note about it in your paper, before closing this issue.

  1. The dataset files used by Trajectron++ for the eth sub-dataset are interpolated from the original annotation file, which you can download at https://icu.ee.ethz.ch/research/datsets.html. Please note that the video is 25 fps and the annotation is 2.5 fps, so the original annotation file lists the frames as [780, 786, 792, ...]. Only in this sub-dataset is the sampling interval 0.4 s when the frame step is 6. The data Trajectron++ used is therefore at the wrong time scale, which means that they actually use 10x3.2/6 = 5.333 s of observations to predict 10x4.8/6 = 8 s trajectories (see the sketch after this list), and this could be a reason why some methods' performance is quite bad. If my understanding is correct, the dataset we are using actually has the correct prediction-time configuration. (See the ewap_dataset/seq_eth/info.txt file above.)
  2. The dataset split file we used is copied from Multiverse and SimAug (CVPR and ECCV 2020). However, judging from the file size, the SDD data used by Y-Net is only a fraction of the complete dataset. We cannot tell whether these are carefully selected parts, especially those used for testing. Simply put, you can tell roughly how much training data there is from the file size: their train_trajnet.pkl is only 7.1 MB, but the saved model weights are 213 MB, which is quite a mismatch. We doubt that such a small amount of data is enough to train such a large model. Therefore, we used the Multiverse and SimAug data splits to produce training samples from the full SDD dataset.
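As a quick sketch of the arithmetic in point 1 (assuming the 0.4 s per 6 annotated frames stated in info.txt, and 8/12-step windows):

step_s = 0.4 * 10 / 6      # real time covered by a 10-frame step, if 6 frames = 0.4 s
obs_s = 8 * step_s         # 10 * 3.2 / 6 ≈ 5.333 s observed
pred_s = 12 * step_s       # 10 * 4.8 / 6 ≈ 8.0 s predicted
print(step_s, obs_s, pred_s)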

original ETH dataset files:
ewap_dataset.zip


MeiliMa commented Jul 24, 2022

I do not think your statement makes sense. Subsampling from 25 FPS to 2.5 FPS means you need to draw samples every 10 frames. Why do you say it is correct to use a 6-frame interval?

I know there are different versions of the training/testing splits and data processing for these datasets. That is normal, as many researchers work in this field. But it is your duty to draw a fair comparison when publishing the paper. Since Trajectron++ and the others have published their code, you can run it with your data if you think their data usage is wrong, rather than saying nothing in your paper and misleading the readers.

I think this issue should be left open to let others know there is a data-usage difference between this work and other baselines. Or you should explicitly state it in the README file and in your paper.


cocoon2wong commented Jul 24, 2022

I do not think your statement makes sense. Subsampling from 25 FPS to 2.5 FPS means you need to draw samples every 10 frames. Why do you say it is correct to use a 6-frame interval?

I apologize for the lack of clarity in my explanation. In simple terms, the original dataset file labels frames at intervals of six, e.g. [780, 786, 792, ...]. However, the dataset collectors state that this interval is 0.4 seconds. Therefore, you can see from the 25 fps video that the video was accelerated.
The video is available at https://drive.google.com/file/d/1SELMRrVE9M3kIp7piNhic6aODVDBcNjh/view?usp=sharing.
As the SR-LSTM authors said,

ETH-univ frame rate issue (EUf): For ETH-Univ scenario, the original video from [4] is an accelerated version. We treat every 6 frames as 0.4s, rather than 10 frames in [11].

But note that this phenomenon only appears in the ETH-eth sub-dataset.


MeiliMa commented Jul 24, 2022

This should be the actual original data annotated at 2.5 FPS: https://github.com/crowdbotp/OpenTraj/blob/master/datasets/ETH/seq_eth/biwi_eth_10fps.txt
rather than the one you use.


MeiliMa commented Jul 24, 2022

The data you used were wrongly annotated; the clarification can be found at
https://github.com/crowdbotp/OpenTraj/tree/master/datasets/ETH#obstacles


cocoon2wong commented Jul 24, 2022

This should be the actual original data annotated at 2.5 FPS: https://github.com/crowdbotp/OpenTraj/blob/master/datasets/ETH/seq_eth/biwi_eth_10fps.txt
rather than the one you use.

I do not agree with you.
The original ETH dataset can be found at:
https://icu.ee.ethz.ch/research/datsets.html,
and this file was uploaded in 2009 with the ICCV09 paper "You'll Never Walk Alone: Modeling Social Behavior for Multi-target Tracking".
I think you'd better download and check it out carefully.

The original file can also be found at
https://raw.githubusercontent.com/crowdbotp/OpenTraj/master/datasets/ETH/seq_eth/obsmat.txt
The 10fps file is an interpolation of that one.

And the info can be found at:
https://github.com/crowdbotp/OpenTraj/blob/e7b12a0897e57a94b02a735248145c85d84dc01f/datasets/ETH/seq_eth/info.txt#L2-L6


MeiliMa commented Jul 24, 2022

The data you used were wrongly annotated; the clarification can be found at https://github.com/crowdbotp/OpenTraj/tree/master/datasets/ETH#obstacles

The same statement can be found in README.txt from the data downloaded from the official link.

Regardless of which version of the data is correct, yours or the one used by SocialGAN, Trajectron++, AgentFormer, etc., you should mention it in your paper rather than saying nothing and misleading the readers. This should also cover the differences in the other datasets besides ETH.

As the data with the 10-frame interval has been widely used, if you do think it is wrong, I really hope you will point that out in your paper and help the community correct it.


cocoon2wong commented Jul 24, 2022

WARNING: on 17/09/2009 the dataset have been modified, the frame number in the obsmat had a wrong offset (Thanks for corrections to Paul Scovanner)

The offset refers not to the interval, but to the 780 start frame.
But thank you very much for your valuable advice anyway. I have updated the README file to describe the controversy related to this issue, and I will add a special note about the eth dataset in a later version of the paper. Thank you very much for your opinion!

cocoon2wong reopened this Jul 24, 2022
@cocoon2wong

The original video file can be downloaded at https://data.vision.ee.ethz.ch/cvl/aem/ewap_dataset_full.tgz.
You can see that seq_eth.avi has been accelerated, while seq_hotel.avi has not.


cocoon2wong commented Jul 26, 2022

A 6-frame sampling interval leads to observation and prediction horizons of 1.92 s and 2.88 s given 8 and 12 samples respectively, while a 10-frame interval leads to 3.2 s and 4.8 s. So the time horizon of the data used by Trajectron++ is about 1.7 times as long as that of your data. Could this be a reason why your results on ETH are much better than Trajectron++'s?

Besides the frame-sampling difference, the data used by Trajectron++, SocialGAN, etc. have only two decimal places, while your data have four. I do not know how much error this would introduce, given that the gap between your results and Trajectron++'s is quite small on datasets other than ETH.

For SDD, the following are scenarios used by YNet for training and testing, according to the data they shared:

Training Set: bookstore_{0-3}, coupa_3, deathCircle_{0-4}, gates_{0,1,3-8}, hyang_{4-7,9}, nexus_{0-9}
Testing Set: coupa_{0,1}, gates_2, hyang_{0,1,3,8}, little_{0-3}, nexus_{5,6}, quad_{0-3}

The files you used are, according to sdd.plist:

Training Set: bookstore_{0-5}, coupa_{0,2,3}, deathCircle_{0-4}, gates_{0,2,5,6}, hyang_{0,1,5,8,10,12,14}, little_0, nexus_{0,1,2,5,6,8-11}, quad_{0,1,3}
Testing Set: bookstore_6, deathCircle_4, hyang_{2,3,6,7,11,13}, little_1, gates_{7,8}, nexus_3

So, you use 36 training scenarios and 12 testing scenarios, while YNet uses only 30 training scenarios but 17 testing scenarios.

Anyway, this is great work. I hope you can show its advantages clearly by drawing a fairer comparison with existing approaches.

@MeiliMa I re-read the Y-Net paper carefully, and they say

The data contains various types of agents beyond pedestrians (bicyclists, skateboarders, cars, buses, and golf carts); we filter out all non-pedestrians and short trajectories below n_p + n_f.

In fact, the SDD dataset contains many other kinds of agents, such as bicycles and cars, which are filtered out of Y-Net's training data. This makes the prediction easier than on the original full dataset (like the one we use), because bicycles, cars, etc. all move faster than people and also have different interaction relationships. However, most of the baselines before Y-Net considered all kinds of agents at the same time. Why don't you ask them to use the full dataset, instead of assuming that I am using the wrong one? We fully respect and understand what the Y-Net authors are doing, as they provide good work. But I do not agree with your use of the word wrong.
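For context, that filtering amounts to something like the following (a rough sketch of the pre-processing Y-Net describes, based on the annotations.txt format shown above; the function name is only illustrative and this is not their actual code):

from collections import defaultdict

def keep_pedestrian_tracks(lines, obs_steps=8, pred_steps=12):
    # group annotation lines by track id, keeping only the "Pedestrian" label
    tracks = defaultdict(list)
    for line in lines:
        d = line.split()
        if d[-1].strip('"') == "Pedestrian":
            tracks[d[0]].append(d)
    # drop trajectories shorter than one observation + prediction window
    return {tid: rows for tid, rows in tracks.items()
            if len(rows) >= obs_steps + pred_steps}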

At the same time, this issue has been noted in the recent CVPR2022 work "End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps":

Most of our tests are conducted on the Stanford Drone Dataset (SDD) [40], which provides top-down RGB videos captured on the Stanford University campus by drones at 60 different scenes, containing annotated trajectories of more than 20,000 targets such as pedestrians, bicyclists and cars. Early works [5, 23, 43] consider all trajectories in SDD and subsequent works [27–29, 56] focus on pedestrian trajectories using the TrajNet benchmark [42]. On these two splits, we report the results of predicting the 12-step future with the 8-step history with 0.4 seconds step interval.

where the cited [27] and [28] are both works from the Y-Net authors.
(They seem to be among the few authors of existing approaches who have noticed this difference. Even I would not have noticed it without your query.)
(And the cited [49] is the result from the preprint version of our paper, even though we were classified under the wrong dataset type there.)

Besides SDD, a series of classical works on trajectory prediction (like Social LSTM) use the same ETH-UCY dataset files as we do. We cannot accept the charge that our data are wrong. In our paper, we explicitly state the source of the datasets we used (ETH, UCY and SDD from their original papers, not any other splits). We have also given a footnote saying that our SDD split file is the same as SimAug/Multiverse, who also use the full SDD dataset to train and test their results. We believe that we, as readers, should fully respect the work of other authors, so we did not inspect or reproduce other baselines' code, but directly used the results they reported.

Finally, we will point out the differences between these datasets, such as 6-frame eth vs. 10-frame eth, and SDD vs. SDD-TrajNet (none of which we consider wrong), in a future version of the article. We acknowledge that it was our mistake not to state this clearly, but we do not believe that the datasets we used are wrong. As authors, we accept the criticism of our readers, but as researchers we do not agree with framing this as right or wrong. In any case, I appreciate your comments. If you have further questions, please continue to leave them.
