Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

558 Multiple disks issue on windows and latest rdflib release. #588

Merged

Conversation

kysrpex
Copy link
Contributor

@kysrpex kysrpex commented Mar 8, 2021

Among the solutions proposed in issue #558, I opted for the second. Once any part of osp-core is imported, the function rdflib.Graph.serialize is replaced by a decorated version of itself so that it always receives file:// at the beginning. I included a small assert statement at the beginning of each file in osp-core using the rdflib.Graph.serialize function and tried importing something from each of these files to check whether the patch is indeed being applied. Seems to be the case.

…changes on RDFLib when the user imports it by him/herself.

Includes a little test to be performed once manually.
@kysrpex kysrpex added 🐛 bug 🔨 simple fix Likely to be solvable in at most a few hours (full-time). labels Mar 8, 2021
@kysrpex kysrpex requested a review from urbanmatthias March 8, 2021 16:16
@kysrpex kysrpex self-assigned this Mar 8, 2021
@kysrpex kysrpex linked an issue Mar 8, 2021 that may be closed by this pull request
@kysrpex kysrpex changed the base branch from master to dev March 8, 2021 16:19
@kysrpex
Copy link
Contributor Author

kysrpex commented Mar 8, 2021

As said in the first commit message, the solution has the drawback of also reflecting the changes on RDFLib when the user imports it by him/herself. I guess this is not an issue since the only change is a bug being fixed.

After testing that the solution works, I have removed the assert statements. I will leave the issue as a draft until tomorrow, maybe a better solution comes to my mind in the meantime.

@kysrpex
Copy link
Contributor Author

kysrpex commented Mar 9, 2021

The previous commit had a mistake, and the problem turned out to be much trickier than expected. Actually I am not even sure that the fix in rdflib actually will solve the problem in newer versions, as urllib produces the following output.

location = 'file:///C:\\Users\\user\\Downloads\\ontology.ttl'
urlparse(location)
ParseResult(scheme='file', netloc='', path='/C:\\Users\\user\\Downloads\\ontology.ttl', params='', query='', fragment='')

Note the leading / in the path.

The approach proposed is still valid though, if standard paths, i.e. C:\Users\user\Downloads\ontology.ttl are converted to paths of the form the form \\.\Volume{22e70953-0000-0000-0000-100000000000}\Users\user\Downloads\ontology.ttl. I just committed this workaround. It is a hack, but it is what we have.

kysrpex added 2 commits March 9, 2021 13:27
…e ctypes library and a win32 API call. Hopefully fixes the security issue the CI is referring to.
@kysrpex kysrpex marked this pull request as ready for review March 9, 2021 12:44
@kysrpex kysrpex removed the 🔨 simple fix Likely to be solvable in at most a few hours (full-time). label Mar 9, 2021
Copy link
Member

@urbanmatthias urbanmatthias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a shame I cannot easily test this because I have only a single drive in my dev laptop...

When I wrote my solution idea I was thinking more about calling the serialize always with "file://" no matter if on windows or linux. I am also fine with decorating the serialize method like you did. Why do you need the case distiction on the platform, though? I think calling graph.serialize("file:///home/user/something") should work fine on linux as well?

osp/__init__.py Outdated
# Call the win32 API to get the volume ID.
windows_func = ctypes.windll.kernel32.GetVolumeNameForVolumeMountPointW
windows_func.argtypes = wintypes.LPCWSTR, wintypes.LPWSTR, wintypes.DWORD
lpszVolumeMountPoint = wintypes.LPCWSTR(f'{scheme}:\\')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks a bit complicated for me. What is this? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is the same thing that I though, and if you check the evolution of the code, it's what I intended to do but after trying several times I think that it is not really possible both to just wrap the serialize function and to do it this way.

The problem is the way that urllib is handling the windows paths inside the serialize function. The actual path needs to be on the path field of the ParseResult, and it does not happens with the windows path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the ctypes stuff, yeah it looks very complicated also for me, but it was the only way that I found of getting windows paths by volume (which work with urlparse) and also passing the CI. At first I thought of just querying windows itself via the command line, opening a subprocess that returns a the volume ID and capturing the output, but then bandit, the security vulnerability scanner complained about it being potentially insecure (which is true).

Thus, the solution was to query it in some other way. It turns out that python has no easy solution to call the win32 API to get this information. There are modules on pypy but I did not want to have an extra dependency just for this. ctypes on the other hand belongs to the standard library, and lets you call C functions from Python, which is precisely a way to call the win32 API to get the ID of the disk.

@urbanmatthias
Copy link
Member

location = 'file:///C:\Users\user\Downloads\ontology.ttl'

But this isn't a valid windows path? At least I always thought three slashes after file: is only possible linux. Two come from the protocol file:// and the third from the file system root, which is "/". On windows the file system root is C:\ and therefore no third slash.

file:///some/linux/path
file://C:\some\windows\path

@urbanmatthias
Copy link
Member

I read somewhere that init.py files with __path__ = __import__('pkgutil').extend_path(__path__, __name__) in it should contain no further statements. I cannot find the site though, and I don't remember the reasoning behind it. If we can still import wrappers and osp-core with import osp.<...> without anything overriding something else it should be fine IMO

@kysrpex
Copy link
Contributor Author

kysrpex commented Mar 9, 2021

I'll further investigate what you said about init files.

@kysrpex
Copy link
Contributor Author

kysrpex commented Mar 9, 2021

After investigating after the init file I could not find major drawbacks. I just found a confirmation of the reason why it works (see section 5.2.1), but nothing significantly useful.

So the only remarks that I will make is that the names created in __init__.py "pollute" the osp namespace (i.e. you would have osp._rdflib), but anyway there is nothing interesting for the user there. The code in __init__.py is run every time any subpackage or submodule is imported, before the actual module's code, so of course you would not want to have any heavy computation in there. Also, as said, the Graph.serialize method from rdflib is changed for the whole instance of the python interpreter, but this should have no drawbacks.

@urbanmatthias
Copy link
Member

Thanks for your effort. I'll test it a bit and then I'll approve!

@urbanmatthias urbanmatthias self-requested a review March 10, 2021 06:53
osp/__init__.py Outdated Show resolved Hide resolved
@urbanmatthias urbanmatthias self-requested a review March 10, 2021 07:56
urbanmatthias
urbanmatthias previously approved these changes Mar 10, 2021
Copy link
Member

@urbanmatthias urbanmatthias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my side it can be merged. Maybe we should add a unittest for the version comparison, but I'm not sure about that. What do you think?

osp/__init__.py Outdated
return v < o

# check remaining numbers of other_version
return all(o <= 0 for o in other_version)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since other_version is a iterator (because of map), here it should only iterate over the items not covered in the loop. If there are no more, all of an empty sequence is defined to be true.

osp/__init__.py Outdated
if _osp_urlparse(kwargs['destination']).path.startswith('\\'): # The destination is a windows path.
# Call the win32 API to get the volume ID.
windows_func = _osp_ctypes.windll.kernel32.GetVolumeNameForVolumeMountPointW
windows_func.argtypes = _osp_wintypes.LPCWSTR, _osp_wintypes.LPWSTR, _osp_wintypes.DWORD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any resource describing this. Or did you come up with this on your own?

Copy link
Contributor Author

@kysrpex kysrpex Mar 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The win32 API documentation for GetVolumeNameForVolumeMountPointW and the ctypes library documentation. I did know about ctypes, but not about this function in the Windows API, but it was mentioned online as the standard way to get the disk ID.

@kysrpex
Copy link
Contributor Author

kysrpex commented Mar 10, 2021

From my side it can be merged. Maybe we should add a unittest for the version comparison, but I'm not sure about that. What do you think?

Yeah I will write a test for this, better safe than sorry.

@kysrpex
Copy link
Contributor Author

kysrpex commented Mar 10, 2021

@urbanmatthias Now it's ready again.

@kysrpex kysrpex requested a review from urbanmatthias March 10, 2021 10:43
@kysrpex kysrpex merged commit 57c6310 into dev Mar 10, 2021
@kysrpex kysrpex deleted the 558-Multiple_disks_issue_on_windows_and_latest_rdflib_release_ branch March 10, 2021 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multiple disks issue on windows and latest rdflib release.
2 participants