-
Notifications
You must be signed in to change notification settings - Fork 86
As for issue #484: remove pandas dependency #1072
Conversation
This PR needs Approvals as follows.
Please choose reviewers and requet reviews! Click to see how to approve each reviewsYou can approve this PR by triggered comments as follows.
See all trigger commentsPlease replace [Target] to review target
|
@lm-lily Can you remove pandas dependency from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lm-lily Thank you. Some comments added.
@@ -54,12 +54,12 @@ def num_max_boxes(self): | |||
return 42 | |||
|
|||
def _annotation_file_from_image_id(self, image_id): | |||
annotation_file = os.path.join(self.annotations_dir, "{:06d}.xml".format(image_id)) | |||
annotation_file = os.path.join(self.annotations_dir, "{}.xml".format(image_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why :06d
removed?
Related to pandas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file name extracted using my submitted code (after removing pandas dataframe method) failed with :06d
format.
I have to adjust this part to make it right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file name extracted using my submitted code (after removing pandas dataframe method) failed with :06d format.
Sorry, Why this happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file name extracted using my submitted code (after removing pandas dataframe method) failed with :06d format.
Sorry, Why this happen?
The image_id has the format of sample data: 2007_000032
. If we apply format {:06}
to it, the filename will not match the format we need. It will be 2007000032
instead of 2007_000032
, and hence cause the error of file not found
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lm-lily I'm still not sure why this is related to removing pandas. If this is an existing bug, please make another PR to fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfujiwar
Thank you Fujiwara san.
I will revert this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfujiwar
lmnet-test
failed when I revert the above code.
[2020-07-20T05:41:28Z] self = <blueoil.datasets.pascalvoc_2007.Pascalvoc2007 object at 0x7ffb8908dc50>
--
| [2020-07-20T05:41:28Z] image_id = '000085'
| [2020-07-20T05:41:28Z]
| [2020-07-20T05:41:28Z] def _annotation_file_from_image_id(self, image_id):
| [2020-07-20T05:41:28Z] > annotation_file = os.path.join(self.annotations_dir, "{:06d}.xml".format(image_id))
| [2020-07-20T05:41:28Z] E ValueError: Unknown format code 'd' for object of type 'str'
| [2020-07-20T05:41:28Z]
| [2020-07-20T05:41:28Z] ../blueoil/datasets/pascalvoc_2007.py:57: ValueError
My previous explanation was a fault.
In the original code, the data is handled by pandas dataframe, which will handle the data type conversion internally.
My code takes in the file name as string type and that caused the lmnet-test failure.
Since the file name format is safe to be read as whole, I replace {:06d}
to {}
as the quick fix.
Since this is pandas removal related issue, I will fix this code again and please advice if you have any better suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your investigation. I've understood the reason.
return annotation_file | ||
|
||
def _image_file_from_image_id(self, image_id): | ||
"""Return image file name of a image.""" | ||
return os.path.join(self.jpegimages_dir, "{:06d}.jpg".format(image_id)) | ||
return os.path.join(self.jpegimages_dir, "{}.jpg".format(image_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is the same as reply above.
names=['image_id']) | ||
image_id = list() | ||
|
||
with open(filename) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you use csv module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lines here are about reading a txt file, not csv file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use readlines
to read lines as list?
https://docs.python.org/ja/3/library/codecs.html?highlight=readlines#codecs.StreamReader.readlines
|
||
image_files = df.image_files.tolist() | ||
label_files = df.label_files.tolist() | ||
image_files, label_files = list(), list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why Don't you use csv module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for reading a text file instead of a csv file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python csv
module can load space separated file.
https://docs.python.org/ja/3/library/csv.html#csv.Dialect.delimiter
It's good to use standard library than implement by ourself.
@iizukak |
@lm-lily OK. Then, It's good to remove DeLTA Mark dataset loader first. |
This PR is put to pending until PR #1086 is merged into Blueoil. |
|
@lm-lily Thank you for the review request. I'll review again soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lm-lily
I leave some comments.
Some of the code, I can not understand perfectly.
If my comment is not reasonable, sorry.
names=['image_id']) | ||
image_id = list() | ||
|
||
with open(filename) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK Thanks.
names=['image_id']) | ||
image_id = list() | ||
|
||
with open(filename) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use readlines
to read lines as list?
https://docs.python.org/ja/3/library/codecs.html?highlight=readlines#codecs.StreamReader.readlines
blueoil/datasets/pascalvoc_base.py
Outdated
delim_whitespace=True, | ||
header=None, | ||
names=['image_id']) | ||
image_id = list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
image_id
looks like single value.
How about image_ids
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In reply to #1072 (comment)
read().splitlines()
is possible.
I will modify my code to use read()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
image_id
looks like single value.
How aboutimage_ids
?
Noted.
@@ -54,12 +54,12 @@ def num_max_boxes(self): | |||
return 42 | |||
|
|||
def _annotation_file_from_image_id(self, image_id): | |||
annotation_file = os.path.join(self.annotations_dir, "{:06d}.xml".format(image_id)) | |||
annotation_file = os.path.join(self.annotations_dir, "{}.xml".format(image_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file name extracted using my submitted code (after removing pandas dataframe method) failed with :06d format.
Sorry, Why this happen?
image_files = df.image_files.tolist() | ||
label_files = df.label_files.tolist() | ||
image_files, label_files = list(), list() | ||
with open(filename) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python csv
module can load space separated file.
https://docs.python.org/ja/3/library/csv.html#csv.Dialect.delimiter
It's good to use standard library than implement by ourself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reply same as #1072 (comment), as same for #1072 (comment).
|
||
image_files = df.image_files.tolist() | ||
label_files = df.label_files.tolist() | ||
image_files, label_files = list(), list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python csv
module can load space separated file.
https://docs.python.org/ja/3/library/csv.html#csv.Dialect.delimiter
It's good to use standard library than implement by ourself.
blueoil/cmd/output_event.py
Outdated
with open(output_csv, "w") as fp: | ||
wr = csv.writer(fp) | ||
wr.writerow(columns) | ||
for row in data_by_row: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use writerows
?
https://docs.python.org/ja/3/library/csv.html#csv.csvwriter.writerows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! It's a good suggestion, I have changed it.
blueoil/cmd/output_event.py
Outdated
columns.sort() | ||
df = pd.DataFrame([], columns=columns) | ||
|
||
values_step_dict = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this initialize code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not necessary. I will remove this.
Thank you for pointing it out.
df = df[["step"] + columns] | ||
for step in step_list: | ||
if step not in values_step_dict: | ||
values_step_dict[step] = '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Key of values_step_dict
looks value
. Not step
. Correct?
I'm not sure why this line's key is step
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iizukak
I am sorry, it's my mistake, I forget to change line#37 to (event.value, event.step)
. The sequence is reversed.
Thank you for capturing this significant mistake!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OA
@lm-lily |
@tfujiwar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! Sorry for taking a long time... I left some comments.
blueoil/datasets/ilsvrc_2012.py
Outdated
linelist = [line.rstrip('\n') for line in open(os.path.join(self.data_dir, 'imagenet_classes.txt'))] | ||
return linelist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you update the code to close the file?
with open(...) as f:
return [line.rstrip('\n') for ...]
@@ -54,12 +54,12 @@ def num_max_boxes(self): | |||
return 42 | |||
|
|||
def _annotation_file_from_image_id(self, image_id): | |||
annotation_file = os.path.join(self.annotations_dir, "{:06d}.xml".format(image_id)) | |||
annotation_file = os.path.join(self.annotations_dir, "{}.xml".format(image_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your investigation. I've understood the reason.
for metrics_key in metrics_keys: | ||
if not value_matrix: | ||
step_list = sorted(_step_list(event_accumulator, metrics_key), reverse=True) | ||
value_matrix.append(step_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With these lines, the list of steps comes from the first metrics_key. Is that the same with the original code? Did you check the result is the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfujiwar
Yes, I have verified the result.
The list of steps is the same and is repeated for each metrics_key, it is enough to just read it once from any metrics_key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Then what about this code? I think we don't need to use a for-loop.
value_matrix = [sorted(_step_list(event_accumulator, metrics_keys[0]), reverse=True)]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfujiwar
This can't work, there will be TypeError: 'set' object is not subscriptable
because metrics_keys
is a set.
How about the following options?
for metrics_key in metrics_keys:
step_list = sorted(_step_list(event_accumulator, metrics_key), reverse=True)
value_matrix.append(step_list)
break
or
metrics_key = next(iter(metrics_keys))
step_list = sorted(_step_list(event_accumulator, metrics_key), reverse=True)
value_matrix.append(step_list)
Which solution is more preferable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I've understood the situation. It might be better to refactor the existing code also but this PR is OK 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RA
/ready |
⏳Merge job is queued... |
What this patch does to fix the issue.
Remove pandas dependencies.
Link to any relevant issues or pull requests.
Fix for issue #484.