Skip to content

Commit

Permalink
Support for reading and writing files directly to/from ftp (piskvorky…
Browse files Browse the repository at this point in the history
  • Loading branch information
RachitSharma2001 committed Dec 24, 2022
1 parent d79ab9f commit c01272b
Show file tree
Hide file tree
Showing 22 changed files with 724 additions and 1,275 deletions.
6 changes: 6 additions & 0 deletions .github/workflows/python-package.yml
Original file line number Diff line number Diff line change
Expand Up @@ -137,6 +137,10 @@ jobs:

- run: bash ci_helpers/helpers.sh enable_moto_server
if: ${{ matrix.moto_server }}

- run: |
sudo apt-get install vsftpd
sudo bash ci_helpers/helpers.sh create_ftp_ftps_servers
- name: Run integration tests
run: python ci_helpers/run_integration_tests.py
Expand All @@ -146,6 +150,8 @@ jobs:

- run: bash ci_helpers/helpers.sh disable_moto_server
if: ${{ matrix.moto_server }}

- run: sudo bash ci_helpers/helpers.sh delete_ftp_ftps_servers

benchmarks:
needs: [linters,unit_tests]
Expand Down
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -65,3 +65,4 @@ target/

# env files
.env
.venv
12 changes: 12 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,17 @@
# Unreleased

## 6.3.0, 2022-12-12

* Refactor Google Cloud Storage to use blob.open (__[ddelange](https://github.com/ddelange)__, [#744](https://github.com/RaRe-Technologies/smart_open/pull/744))
* Add FTP/FTPS support (#33) (__[RachitSharma2001](https://github.com/RachitSharma2001)__, [#739](https://github.com/RaRe-Technologies/smart_open/pull/739))
* Bring back compression_wrapper(filename) + use case-insensitive extension matching (__[piskvorky](https://github.com/piskvorky)__, [#737](https://github.com/RaRe-Technologies/smart_open/pull/737))
* Fix avoidable S3 race condition (#693) (__[RachitSharma2001](https://github.com/RachitSharma2001)__, [#735](https://github.com/RaRe-Technologies/smart_open/pull/735))
* setup.py: Remove pathlib2 (__[jayvdb](https://github.com/jayvdb)__, [#733](https://github.com/RaRe-Technologies/smart_open/pull/733))
* Add flake8 config globally (__[cadnce](https://github.com/cadnce)__, [#732](https://github.com/RaRe-Technologies/smart_open/pull/732))
* Added buffer_size parameter to http module (__[mullenkamp](https://github.com/mullenkamp)__, [#730](https://github.com/RaRe-Technologies/smart_open/pull/730))
* Added documentation to support GCS anonymously (__[cadnce](https://github.com/cadnce)__, [#728](https://github.com/RaRe-Technologies/smart_open/pull/728))
* Reconnect inactive sftp clients automatically (__[Kache](https://github.com/Kache)__, [#719](https://github.com/RaRe-Technologies/smart_open/pull/719))

# 6.2.0, 14 September 2022

- Fix quadratic time ByteBuffer operations (PR [#711](https://github.com/RaRe-Technologies/smart_open/pull/711), [@Joshua-Landau-Anthropic](https://github.com/Joshua-Landau-Anthropic))
Expand Down
28 changes: 28 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# Quickstart

Clone the repo and use a python installation to create a venv:

```sh
git clone [email protected]:RaRe-Technologies/smart_open.git
cd smart_open
python -m venv .venv
```

Activate the venv to start working and install test deps:

```sh
.venv/bin/activate
pip install -e ".[test]"
```

Tests should pass:

```sh
pytest
```

Thats it! When you're done, deactivate the venv:

```sh
deactivate
```
44 changes: 44 additions & 0 deletions ci_helpers/helpers.sh
100755 → 100644
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,52 @@ enable_moto_server(){
moto_server -p5000 2>/dev/null&
}

create_ftp_ftps_servers(){
#
# Must be run as root
#
home_dir=/home/user
user=user
pass=123
ftp_port=21
ftps_port=90

mkdir $home_dir
useradd -p $(echo $pass | openssl passwd -1 -stdin) -d $home_dir $user
chown $user:$user $home_dir

server_setup='''
listen=YES
listen_ipv6=NO
write_enable=YES
pasv_enable=YES
pasv_min_port=40000
pasv_max_port=40009
chroot_local_user=YES
allow_writeable_chroot=YES'''

additional_ssl_setup='''
ssl_enable=YES
allow_anon_ssl=NO
force_local_data_ssl=NO
force_local_logins_ssl=NO
require_ssl_reuse=NO
'''

cp /etc/vsftpd.conf /etc/vsftpd-ssl.conf
echo -e "$server_setup\nlisten_port=${ftp_port}" >> /etc/vsftpd.conf
echo -e "$server_setup\nlisten_port=${ftps_port}\n$additional_ssl_setup" >> /etc/vsftpd-ssl.conf

service vsftpd restart
vsftpd /etc/vsftpd-ssl.conf &
}

disable_moto_server(){
lsof -i tcp:5000 | tail -n1 | cut -f2 -d" " | xargs kill -9
}

delete_ftp_ftps_servers(){
service vsftpd stop
}

"$@"
1 change: 1 addition & 0 deletions ci_helpers/run_integration_tests.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
'pytest',
'integration-tests/test_207.py',
'integration-tests/test_http.py',
'integration-tests/test_ftp.py'
]
)

Expand Down
84 changes: 84 additions & 0 deletions integration-tests/test_ftp.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
from __future__ import unicode_literals
import pytest
from smart_open import open


@pytest.fixture(params=[("ftp", 21), ("ftps", 90)])
def server_info(request):
return request.param

def test_nonbinary(server_info):
server_type = server_info[0]
port_num = server_info[1]
file_contents = "Test Test \n new test \n another tests"
appended_content1 = "Added \n to end"

with open(f"{server_type}://user:123@localhost:{port_num}/file", "w") as f:
f.write(file_contents)

with open(f"{server_type}://user:123@localhost:{port_num}/file", "r") as f:
read_contents = f.read()
assert read_contents == file_contents

with open(f"{server_type}://user:123@localhost:{port_num}/file", "a") as f:
f.write(appended_content1)

with open(f"{server_type}://user:123@localhost:{port_num}/file", "r") as f:
read_contents = f.read()
assert read_contents == file_contents + appended_content1

def test_binary(server_info):
server_type = server_info[0]
port_num = server_info[1]
file_contents = b"Test Test \n new test \n another tests"
appended_content1 = b"Added \n to end"

with open(f"{server_type}://user:123@localhost:{port_num}/file2", "wb") as f:
f.write(file_contents)

with open(f"{server_type}://user:123@localhost:{port_num}/file2", "rb") as f:
read_contents = f.read()
assert read_contents == file_contents

with open(f"{server_type}://user:123@localhost:{port_num}/file2", "ab") as f:
f.write(appended_content1)

with open(f"{server_type}://user:123@localhost:{port_num}/file2", "rb") as f:
read_contents = f.read()
assert read_contents == file_contents + appended_content1

def test_line_endings_non_binary(server_info):
server_type = server_info[0]
port_num = server_info[1]
B_CLRF = b'\r\n'
CLRF = '\r\n'
file_contents = f"Test Test {CLRF} new test {CLRF} another tests{CLRF}"

with open(f"{server_type}://user:123@localhost:{port_num}/file3", "w") as f:
f.write(file_contents)

with open(f"{server_type}://user:123@localhost:{port_num}/file3", "r") as f:
for line in f:
assert not CLRF in line

with open(f"{server_type}://user:123@localhost:{port_num}/file3", "rb") as f:
for line in f:
assert B_CLRF in line

def test_line_endings_binary(server_info):
server_type = server_info[0]
port_num = server_info[1]
B_CLRF = b'\r\n'
CLRF = '\r\n'
file_contents = f"Test Test {CLRF} new test {CLRF} another tests{CLRF}".encode('utf-8')

with open(f"{server_type}://user:123@localhost:{port_num}/file4", "wb") as f:
f.write(file_contents)

with open(f"{server_type}://user:123@localhost:{port_num}/file4", "r") as f:
for line in f:
assert not CLRF in line

with open(f"{server_type}://user:123@localhost:{port_num}/file4", "rb") as f:
for line in f:
assert B_CLRF in line
42 changes: 42 additions & 0 deletions integration-tests/test_ssh.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
# -*- coding: utf-8 -*-
#
# Copyright (C) 2022 Radim Rehurek <[email protected]>
#
# This code is distributed under the terms and conditions
# from the MIT License (MIT).
#

import os
import tempfile
import pytest

import smart_open
import smart_open.ssh


def explode(*args, **kwargs):
raise RuntimeError("this function should never have been called")


@pytest.mark.skipif("SMART_OPEN_SSH" not in os.environ, reason="this test only works on the dev machine")
def test():
with smart_open.open("ssh://misha@localhost/Users/misha/git/smart_open/README.rst") as fin:
readme = fin.read()

assert 'smart_open — utils for streaming large files in Python' in readme

#
# Ensure the cache is being used
#
assert ('localhost', 'misha') in smart_open.ssh._SSH

try:
connect_ssh = smart_open.ssh._connect_ssh
smart_open.ssh._connect_ssh = explode

with smart_open.open("ssh://misha@localhost/Users/misha/git/smart_open/howto.md") as fin:
howto = fin.read()

assert 'How-to Guides' in howto
finally:
smart_open.ssh._connect_ssh = connect_ssh
2 changes: 2 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
[tool.pytest.ini_options]
testpaths = ["smart_open"]
7 changes: 4 additions & 3 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,16 +37,16 @@ def read(fname):


aws_deps = ['boto3']
gcs_deps = ['google-cloud-storage>=1.31.0']
gcs_deps = ['google-cloud-storage>=2.6.0']
azure_deps = ['azure-storage-blob', 'azure-common', 'azure-core']
http_deps = ['requests']
ssh_deps = ['paramiko']

all_deps = aws_deps + gcs_deps + azure_deps + http_deps
all_deps = aws_deps + gcs_deps + azure_deps + http_deps + ssh_deps
tests_require = all_deps + [
'moto[server]',
'responses',
'boto3',
'paramiko',
'pytest',
'pytest-rerunfailures'
]
Expand Down Expand Up @@ -79,6 +79,7 @@ def read(fname):
'all': all_deps,
'http': http_deps,
'webhdfs': http_deps,
'ssh': ssh_deps,
},
python_requires=">=3.6,<4.0",

Expand Down
1 change: 0 additions & 1 deletion smart_open/azure.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,6 @@
import base64
import io
import logging
from typing import Union

import smart_open.bytebuffer
import smart_open.constants
Expand Down
18 changes: 9 additions & 9 deletions smart_open/compression.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,7 @@ def register_compressor(ext, callback):
"""
if not (ext and ext[0] == '.'):
raise ValueError('ext must be a string starting with ., not %r' % ext)
ext = ext.lower()
if ext in _COMPRESSOR_REGISTRY:
logger.warning('overriding existing compression handler for %r', ext)
_COMPRESSOR_REGISTRY[ext] = callback
Expand Down Expand Up @@ -103,24 +104,23 @@ def _handle_gzip(file_obj, mode):
return result


def compression_wrapper(file_obj, mode, compression):
def compression_wrapper(file_obj, mode, compression=INFER_FROM_EXTENSION, filename=None):
"""
This function will wrap the file_obj with an appropriate
[de]compression mechanism based on the specified extension.
Wrap `file_obj` with an appropriate [de]compression mechanism based on its file extension.
file_obj must either be a filehandle object, or a class which behaves
like one. It must have a .name attribute.
If the filename extension isn't recognized, simply return the original `file_obj` unchanged.
If the filename extension isn't recognized, will simply return the original
file_obj.
`file_obj` must either be a filehandle object, or a class which behaves like one.
If `filename` is specified, it will be used to extract the extension.
If not, the `file_obj.name` attribute is used as the filename.
"""
if compression == NO_COMPRESSION:
return file_obj
elif compression == INFER_FROM_EXTENSION:
try:
filename = file_obj.name
filename.upper() # make sure this thing is a string
filename = (filename or file_obj.name).lower()
except (AttributeError, TypeError):
logger.warning(
'unable to transparently decompress %r because it '
Expand Down
Loading

0 comments on commit c01272b

Please sign in to comment.