Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for reading and writing files directly to/from ftp #723

Merged
merged 1 commit into from
Nov 3, 2022

Conversation

RachitSharma2001
Copy link
Contributor

@RachitSharma2001 RachitSharma2001 commented Sep 16, 2022

Title

Support for reading and writing files directly to/from ftp

Motivation

As elaborated in issue #33, there is currently no support for reading from ftp servers. With my change, developers will be able to read ftp files using smart open by doing the following:

from smart_open import open
with open('ftp://user:password@host:port/dir1/dir2/file') as fin:
     for line in fin:
          print(line)

Tests

Added test_ftp::FTPOpen

smart_open/ftp.py Outdated Show resolved Hide resolved

def _connect(hostname, username, port, password, transport_params):
try:
from ftplib import FTP_TLS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work if the FTP server doesn't support TLS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a great point. Sorry about that, I have changed the implementation to:
from ftplib import FTP
The purpose of this PR is to just add functionality for reading from an FTP server, not an FTPS server (with TLS protocol). Thank you for bringing this up. In a next PR, I can add functionality for handling FTPS servers.

try:
ftp.connect(hostname, port)
ftp.sendcmd(f'USER {username}')
ftp.sendcmd(f'PASS {password}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine many users who want to use TLS would like a secure connection, which would require adding ftp.prot_p()
https://docs.python.org/3/library/ftplib.html#ftplib.FTP_TLS.prot_p

Copy link
Contributor Author

@RachitSharma2001 RachitSharma2001 Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. For now, I am adding functionality to handle reading from a pure FTP server, but I am planning on making a future PR for handling reading from an FTPS server.

smart_open/ftp.py Outdated Show resolved Hide resolved
@RachitSharma2001 RachitSharma2001 force-pushed the develop branch 6 times, most recently from 8a8c990 to 9ce942a Compare September 20, 2022 16:53
smart_open/ftp.py Outdated Show resolved Hide resolved

def convert_transport_params_to_args(transport_params):
supported_keywords = [
"keyfile",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are outdated from FTP_TLS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for bringing this up. I have updated the convert_transport_params_to_args function to reflect the possible arguments that can be passed into FTP, as specified here.

smart_open.open(
"ftp://user@localhost/dir1/dir2/dir3/file",
transport_params={
"keyfile": "some_key_file",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same outdated note as above for these tests

Copy link
Contributor Author

@RachitSharma2001 RachitSharma2001 Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for bringing this up. I have updated the tests to reflect the change from FTP_TLS to FTP.

if not transport_params:
transport_params = {}
conn = _connect(host, user, port, password, transport_params)
fobj = conn.transfercmd(f"RETR {path}").makefile(mode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RETR is only used for reading a file. I think we should add support for writing to a file on a FTP server as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also what is the memory footprint of this? Does it put the entire file into memory? Every other example of ftplib I've found uses FTP.retrlines, which is documented as a way to stream the file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good question. The entire file is not downloaded into memory. In fact, when open is called, it only returns a file like object representing the file on the FTP server, but it will not have downloaded any part of that file into memory. How the file is downloaded depends on what the developer wants to do with the object returned by open. For example:

# At this point, the file has not been downloaded at all
ftp_file_obj = open("ftp://username:password@host:port/file)

# Now, the file object could be read line by line (this is where the file from the ftp server is actually read)
for line in ftp_file_obj:
    print(line)

More specifically: conn.transfercmd(f"RETR {path}") returns a socket object. When created, this object doesn't download any part of the file. You can see an example of the makefile function being called on a socket object here.

Copy link
Contributor

@petedannemann petedannemann Sep 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find and good to know we are streaming the file. In the ftplib.py file you linked it looks like retrlines has quite a lot of useful functionality for reading a file that we don't get with our current implementation though such as ensuring that we close the socket you mentioned so we don't leak resources as well as ensuring that we handle various newlines and CLRF's properly. To me it seems like we should use the higher level retrlines method with all of the functionality baked in

ftp = FTP(**kwargs)
try:
ftp.connect(hostname, port)
ftp.sendcmd(f"USER {username}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use the FTP.login method instead of two sendcmd's here

kwargs = convert_transport_params_to_args(transport_params)
ftp = FTP(**kwargs)
try:
ftp.connect(hostname, port)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an exception was thrown during connect we would raise it as an exception during logging in, which seems quite confusing for the user. What do we get out of throwing our own exception here? What are the exceptions for a failed connect and a failed login?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a failed connect the following error is thrown:
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

For a failed login, the following is thrown:
ftplib.error_perm: 530 Login incorrect.

I felt that these errors are somewhat confusing and don't really tell the user the real issue, which is that either their host or port is wrongly entered or a ftp server doesn't exist at that host or port (if connect fails) or the username or password is wrong (if login fails).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those errors are somewhat confusing, but are more informative and specific than anything we could provide. I wouldn't recommend suppressing the exceptions thrown by ftplib

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a best of both worlds situation would be

try:
    ftp.connect(hostname, port)
except Exception e:
   logger.error("Unable to connect to FTP server: try checking the host and port!")
   raise e
try:
    ftp.login(username, password)
except Exception e:
   logger.error("Unable to login to FTP server: try checking the username and password!")
   raise e

Copy link
Owner

@piskvorky piskvorky Sep 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on not rephrasing the traceback exception messages. Because this is what users google. We want them to hit the actual support threads of the core libraries, where people share and discuss the same issue based on the same error message – and not hit smart_open, which is just a wrapper.



class FTPOpen(unittest.TestCase):
@patch("smart_open.ftp.FTP")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I'm not sure what these tests really give us. All it is testing is that we are calling certain methods from the ftplib. I'd suggest:

  1. Running an FTP server in CI and writing integration tests against it
  2. Implementing mocks that imitate functionality of a FTP server. There's some good examples of us doing this here for GCS

@RachitSharma2001 RachitSharma2001 force-pushed the develop branch 12 times, most recently from f2864e2 to c51a9d3 Compare September 23, 2022 02:37
@RachitSharma2001 RachitSharma2001 force-pushed the develop branch 4 times, most recently from 0f36f3b to 71d3e14 Compare September 24, 2022 17:56
@RachitSharma2001
Copy link
Contributor Author

@mpenkov @petedannemann

Just to summarize the changes that I have made:

  1. I have added integration tests which bring up an FTP server and tests reading, writing and appending to it.
  2. I have created a wrapper around the file object returned by open("ftp://...") so that when the file is closed, the underlying socket is closed as well
  3. For bringing up the FTP server during ci tests, there doesn't seem to be any viable lightweight solutions, besides perhaps twisted.

Let me know if there is anything that you think should be changed or added.

from smart_open import open

def test_nonbinary():
file_contents = "Test Test \n new test \n another tests"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add some tests for CLRF and B_CLRF? I noticed there is quite a lot of explicit handling of these in ftplib
https://github.com/python/cpython/blob/f00645d5dbf4cfa0b8f382c8977724578dff191d/Lib/ftplib.py#L70

Copy link
Contributor Author

@RachitSharma2001 RachitSharma2001 Oct 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. I have added the tests test_line_endings_non_binary and test_line_endings_binary. My logic for these tests is that, if you look at retrlines vs retrbinary in ftplib here, you will notice that retrlines explicitly removes CRLF line endings for each line, but retrbinary does not. As a result, in both of the aforementioned tests, I test to make sure that when the file is read in non binary mode, there is no CRLF ending in any line, but when the file is read in binary mode, each line contains the CRLF endings (in order to make sure that the behavior mimics ftplib).

I have tested ftplib retrlines vs retrbinary and have confirmed that, if you upload a file with CRLF endings, then calling retrbinary on that file will result in each line containing a CRLF ending, but when calling retrlines each line will not contain those endings.

Let me know if there is anything I should clarify or add.

raise e
try:
ftp.login(username, password)
except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can be more explicit for this error and use ftp.error_reply
https://github.com/python/cpython/blob/f00645d5dbf4cfa0b8f382c8977724578dff191d/Lib/ftplib.py#L418

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't think I was clear here. Raising the original exception seems fine, no need to wrap it. What I meant is that we should only be catching ftp.error_reply for a failed login, instead of the more general Exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. I have updated the code accordingly.

@@ -100,6 +100,7 @@ def get_transport(scheme):
register_transport('smart_open.http')
register_transport('smart_open.s3')
register_transport('smart_open.ssh')
register_transport('smart_open.ftp')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit but it looks like we should keep the alphabetization here

Copy link
Contributor

@petedannemann petedannemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good! Just a few more small comments

@RachitSharma2001 RachitSharma2001 force-pushed the develop branch 3 times, most recently from 963e5bf to 46bc025 Compare October 2, 2022 19:00
@RachitSharma2001
Copy link
Contributor Author

@petedannemann Thank you! I have added your recommended changes. Let me know if you have any questions or if anything is confusing.

@RachitSharma2001 RachitSharma2001 force-pushed the develop branch 2 times, most recently from 94c6740 to 2a9ef86 Compare October 7, 2022 20:30
ftp = FTP(**kwargs)
try:
ftp.connect(hostname, port)
except error_reply as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only login throws an error_reply, not connect

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try: 
    ftp.connect(hostname, port) 
except Exception e: 
    logger.error("Unable to connect to FTP server: try checking the host and port!") 
    raise e 
try: 
    ftp.login(username, password) 
except error_reply as e: 
    logger.error("Unable to login to FTP server: try checking the username and password!") 
    raise e

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that. I have fixed accordingly.

Copy link
Contributor

@petedannemann petedannemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this!

@RachitSharma2001
Copy link
Contributor Author

@mpenkov Do these changes look good to you?


def parse_uri(uri_as_string):
split_uri = urllib.parse.urlsplit(uri_as_string)
assert split_uri.scheme in SCHEME
Copy link
Collaborator

@mpenkov mpenkov Oct 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert split_uri.scheme in SCHEME
assert split_uri.scheme == SCHEME, 'unexpected scheme: %r' % split_uri.scheme

Copy link
Collaborator

@mpenkov mpenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, left you some comments.

@@ -17,7 +17,7 @@

logger = logging.getLogger(__name__)

NO_SCHEME = ''
NO_SCHEME = ""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this module appear cosmetic and irrelevant, please roll them back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that, I have removed the change.

if not host:
raise ValueError("you must specify the host to connect to")
if not user:
user = getpass.getuser()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a bad idea to use getpass here.

smart_open is a library, and there's no guarantee that anyone will be watching standard error for the password prompt. This kind of logic is better handled by whatever application is using the library. So, if they don't specify a password, then error out (same as for hostname)

try:
ftp_mode, file_obj_mode = mode_to_ftp_cmds[mode]
except KeyError:
raise ValueError(f"unsupported mode: {mode}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise ValueError(f"unsupported mode: {mode}")
raise ValueError(f"unsupported mode: {mode!r}")

@RachitSharma2001
Copy link
Contributor Author

@mpenkov Thank you for the suggestions. I have integrated the changes. Let me know if there is anything else that should be added.

@mpenkov mpenkov merged commit 4268a1a into piskvorky:develop Nov 3, 2022
@mpenkov
Copy link
Collaborator

mpenkov commented Nov 3, 2022

Thank you for your work @RachitSharma2001 !!

RachitSharma2001 added a commit to RachitSharma2001/smart_open that referenced this pull request Dec 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants