-
-
Notifications
You must be signed in to change notification settings - Fork 381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for reading and writing files directly to/from ftp #723
Conversation
4b531bd
to
66fa8b2
Compare
66fa8b2
to
4c07b26
Compare
smart_open/ftp.py
Outdated
|
||
def _connect(hostname, username, port, password, transport_params): | ||
try: | ||
from ftplib import FTP_TLS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this work if the FTP server doesn't support TLS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a great point. Sorry about that, I have changed the implementation to:
from ftplib import FTP
The purpose of this PR is to just add functionality for reading from an FTP server, not an FTPS server (with TLS protocol). Thank you for bringing this up. In a next PR, I can add functionality for handling FTPS servers.
smart_open/ftp.py
Outdated
try: | ||
ftp.connect(hostname, port) | ||
ftp.sendcmd(f'USER {username}') | ||
ftp.sendcmd(f'PASS {password}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imagine many users who want to use TLS would like a secure connection, which would require adding ftp.prot_p()
https://docs.python.org/3/library/ftplib.html#ftplib.FTP_TLS.prot_p
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point. For now, I am adding functionality to handle reading from a pure FTP server, but I am planning on making a future PR for handling reading from an FTPS server.
8a8c990
to
9ce942a
Compare
smart_open/ftp.py
Outdated
|
||
def convert_transport_params_to_args(transport_params): | ||
supported_keywords = [ | ||
"keyfile", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are outdated from FTP_TLS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for bringing this up. I have updated the convert_transport_params_to_args
function to reflect the possible arguments that can be passed into FTP, as specified here.
smart_open/tests/test_ftp.py
Outdated
smart_open.open( | ||
"ftp://user@localhost/dir1/dir2/dir3/file", | ||
transport_params={ | ||
"keyfile": "some_key_file", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same outdated note as above for these tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for bringing this up. I have updated the tests to reflect the change from FTP_TLS to FTP.
9ce942a
to
2cc4c78
Compare
smart_open/ftp.py
Outdated
if not transport_params: | ||
transport_params = {} | ||
conn = _connect(host, user, port, password, transport_params) | ||
fobj = conn.transfercmd(f"RETR {path}").makefile(mode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RETR
is only used for reading a file. I think we should add support for writing to a file on a FTP server as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also what is the memory footprint of this? Does it put the entire file into memory? Every other example of ftplib I've found uses FTP.retrlines
, which is documented as a way to stream the file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good question. The entire file is not downloaded into memory. In fact, when open is called, it only returns a file like object representing the file on the FTP server, but it will not have downloaded any part of that file into memory. How the file is downloaded depends on what the developer wants to do with the object returned by open
. For example:
# At this point, the file has not been downloaded at all
ftp_file_obj = open("ftp://username:password@host:port/file)
# Now, the file object could be read line by line (this is where the file from the ftp server is actually read)
for line in ftp_file_obj:
print(line)
More specifically: conn.transfercmd(f"RETR {path}")
returns a socket object. When created, this object doesn't download any part of the file. You can see an example of the makefile function being called on a socket object here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good find and good to know we are streaming the file. In the ftplib.py file you linked it looks like retrlines
has quite a lot of useful functionality for reading a file that we don't get with our current implementation though such as ensuring that we close the socket you mentioned so we don't leak resources as well as ensuring that we handle various newlines and CLRF's properly. To me it seems like we should use the higher level retrlines
method with all of the functionality baked in
smart_open/ftp.py
Outdated
ftp = FTP(**kwargs) | ||
try: | ||
ftp.connect(hostname, port) | ||
ftp.sendcmd(f"USER {username}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use the FTP.login
method instead of two sendcmd
's here
kwargs = convert_transport_params_to_args(transport_params) | ||
ftp = FTP(**kwargs) | ||
try: | ||
ftp.connect(hostname, port) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an exception was thrown during connect
we would raise it as an exception during logging in, which seems quite confusing for the user. What do we get out of throwing our own exception here? What are the exceptions for a failed connect
and a failed login
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a failed connect the following error is thrown:
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
For a failed login, the following is thrown:
ftplib.error_perm: 530 Login incorrect.
I felt that these errors are somewhat confusing and don't really tell the user the real issue, which is that either their host or port is wrongly entered or a ftp server doesn't exist at that host or port (if connect fails) or the username or password is wrong (if login fails).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those errors are somewhat confusing, but are more informative and specific than anything we could provide. I wouldn't recommend suppressing the exceptions thrown by ftplib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a best of both worlds situation would be
try:
ftp.connect(hostname, port)
except Exception e:
logger.error("Unable to connect to FTP server: try checking the host and port!")
raise e
try:
ftp.login(username, password)
except Exception e:
logger.error("Unable to login to FTP server: try checking the username and password!")
raise e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on not rephrasing the traceback exception messages. Because this is what users google. We want them to hit the actual support threads of the core libraries, where people share and discuss the same issue based on the same error message – and not hit smart_open, which is just a wrapper.
smart_open/tests/test_ftp.py
Outdated
|
||
|
||
class FTPOpen(unittest.TestCase): | ||
@patch("smart_open.ftp.FTP") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I'm not sure what these tests really give us. All it is testing is that we are calling certain methods from the ftplib. I'd suggest:
- Running an FTP server in CI and writing integration tests against it
- Implementing mocks that imitate functionality of a FTP server. There's some good examples of us doing this here for GCS
f2864e2
to
c51a9d3
Compare
0f36f3b
to
71d3e14
Compare
Just to summarize the changes that I have made:
Let me know if there is anything that you think should be changed or added. |
from smart_open import open | ||
|
||
def test_nonbinary(): | ||
file_contents = "Test Test \n new test \n another tests" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add some tests for CLRF and B_CLRF? I noticed there is quite a lot of explicit handling of these in ftplib
https://github.com/python/cpython/blob/f00645d5dbf4cfa0b8f382c8977724578dff191d/Lib/ftplib.py#L70
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point. I have added the tests test_line_endings_non_binary
and test_line_endings_binary
. My logic for these tests is that, if you look at retrlines vs retrbinary in ftplib here, you will notice that retrlines
explicitly removes CRLF line endings for each line, but retrbinary
does not. As a result, in both of the aforementioned tests, I test to make sure that when the file is read in non binary mode, there is no CRLF ending in any line, but when the file is read in binary mode, each line contains the CRLF endings (in order to make sure that the behavior mimics ftplib
).
I have tested ftplib retrlines
vs retrbinary
and have confirmed that, if you upload a file with CRLF endings, then calling retrbinary
on that file will result in each line containing a CRLF ending, but when calling retrlines
each line will not contain those endings.
Let me know if there is anything I should clarify or add.
smart_open/ftp.py
Outdated
raise e | ||
try: | ||
ftp.login(username, password) | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can be more explicit for this error and use ftp.error_reply
https://github.com/python/cpython/blob/f00645d5dbf4cfa0b8f382c8977724578dff191d/Lib/ftplib.py#L418
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't think I was clear here. Raising the original exception seems fine, no need to wrap it. What I meant is that we should only be catching ftp.error_reply
for a failed login, instead of the more general Exception
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem. I have updated the code accordingly.
smart_open/transport.py
Outdated
@@ -100,6 +100,7 @@ def get_transport(scheme): | |||
register_transport('smart_open.http') | |||
register_transport('smart_open.s3') | |||
register_transport('smart_open.ssh') | |||
register_transport('smart_open.ftp') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit but it looks like we should keep the alphabetization here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good! Just a few more small comments
963e5bf
to
46bc025
Compare
@petedannemann Thank you! I have added your recommended changes. Let me know if you have any questions or if anything is confusing. |
94c6740
to
2a9ef86
Compare
smart_open/ftp.py
Outdated
ftp = FTP(**kwargs) | ||
try: | ||
ftp.connect(hostname, port) | ||
except error_reply as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only login throws an error_reply, not connect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try:
ftp.connect(hostname, port)
except Exception e:
logger.error("Unable to connect to FTP server: try checking the host and port!")
raise e
try:
ftp.login(username, password)
except error_reply as e:
logger.error("Unable to login to FTP server: try checking the username and password!")
raise e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that. I have fixed accordingly.
2a9ef86
to
cbec937
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work on this!
@mpenkov Do these changes look good to you? |
smart_open/ftp.py
Outdated
|
||
def parse_uri(uri_as_string): | ||
split_uri = urllib.parse.urlsplit(uri_as_string) | ||
assert split_uri.scheme in SCHEME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert split_uri.scheme in SCHEME | |
assert split_uri.scheme == SCHEME, 'unexpected scheme: %r' % split_uri.scheme |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, left you some comments.
smart_open/transport.py
Outdated
@@ -17,7 +17,7 @@ | |||
|
|||
logger = logging.getLogger(__name__) | |||
|
|||
NO_SCHEME = '' | |||
NO_SCHEME = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in this module appear cosmetic and irrelevant, please roll them back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that, I have removed the change.
smart_open/ftp.py
Outdated
if not host: | ||
raise ValueError("you must specify the host to connect to") | ||
if not user: | ||
user = getpass.getuser() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a bad idea to use getpass here.
smart_open is a library, and there's no guarantee that anyone will be watching standard error for the password prompt. This kind of logic is better handled by whatever application is using the library. So, if they don't specify a password, then error out (same as for hostname)
smart_open/ftp.py
Outdated
try: | ||
ftp_mode, file_obj_mode = mode_to_ftp_cmds[mode] | ||
except KeyError: | ||
raise ValueError(f"unsupported mode: {mode}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise ValueError(f"unsupported mode: {mode}") | |
raise ValueError(f"unsupported mode: {mode!r}") |
cbec937
to
fdb8ca5
Compare
fdb8ca5
to
6aebba4
Compare
@mpenkov Thank you for the suggestions. I have integrated the changes. Let me know if there is anything else that should be added. |
Thank you for your work @RachitSharma2001 !! |
Title
Support for reading and writing files directly to/from ftp
Motivation
As elaborated in issue #33, there is currently no support for reading from ftp servers. With my change, developers will be able to read ftp files using smart open by doing the following:
Tests
Added
test_ftp::FTPOpen