problem with python get_file_properties #713

Open
peder45552 opened this issue Jul 14, 2023 · 2 comments

@peder45552

Environment

Ubuntu 20.04 on both servers
python 3.8.10 on both servers

MS modules on dev
azure-core 1.28.0
azure-storage-blob 12.17.0
azure-storage-file-datalake 12.12.0

MS modules on prod
azure-core 1.25.0
azure-storage-blob 12.14.0b2
azure-storage-file-datalake 12.9.0b1
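
A small sketch for capturing these versions the same way on each server, so the two listings can be compared line by line. It only uses the standard library (`importlib.metadata`, available in Python 3.8); the function name `installed_versions` is made up for this example.

```python
from importlib import metadata

# Package names taken from the environment listing above.
PACKAGES = ["azure-core", "azure-storage-blob", "azure-storage-file-datalake"]


def installed_versions(names):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name} {version}")
```

Running it on both servers and diffing the output shows at a glance whether the module versions really match.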

Background

We have a Linux prod server from which we upload files into Azure Datalake using Python scripts.
We have the Microsoft module azure-storage-file-datalake installed, which pulled in azure-core and azure-storage-blob.
Our scripts use an adls class in which we wrapped our functions around the MS functions. Our upload code checks
whether a file already exists and whether its size is different. We used the MS function get_file_properties for that,
and then uploaded the file.
Sometime during fall 2022 our script failed during the upload, and it turned out that our function using the MS function get_file_properties
timed out after 3+ minutes. We could not get any info on why that started happening. We ended up rewriting our code for checking
whether a file exists and getting its size with a function that uses MS get_paths, loops through all the files until the file is found (or not),
and returns data about that file.
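
For readers who want to see what such a workaround can look like, here is a minimal sketch. It is not our actual wrapper code: the function name `find_file_via_get_paths` is made up, and it assumes `file_system_client` is a `FileSystemClient` from azure-storage-file-datalake, of which only `get_paths()` is used.

```python
import posixpath


def find_file_via_get_paths(file_system_client, remote_path):
    """Return the listing entry for remote_path, or None if no such file.

    Assumption: file_system_client behaves like FileSystemClient and its
    get_paths() yields entries with name, is_directory and content_length.
    """
    target = remote_path.strip("/")
    # None means "list from the file system root".
    parent = posixpath.dirname(target) or None
    # List only the parent folder (non-recursive) and scan for the file.
    for entry in file_system_client.get_paths(path=parent, recursive=False):
        if entry.name == target and not entry.is_directory:
            return entry
    return None
```

The file size is then `entry.content_length` on the returned entry. Note that this sketch does not handle a missing parent folder; in that case the listing itself is expected to raise an error.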

Problem

This year we needed to test (from the prod server) against our Azure dev Datalake. We discovered that the function using get_paths
did not work against adls dev, but the MS function get_file_properties did.
Same code, same MS module versions.

We were able to get a dev Linux server, installed our software, and installed the MS modules. On the dev server the MS function get_file_properties worked.
We noticed that the two servers had different versions of the MS modules. We wrote a handful of test scripts that run against adls prod and dev.
Tested functions for:
get metadata for a file, and print the file size
get metadata for a folder, and print last_modified
(used MS function get_file_properties)

list folders in a folder
list files in a folder
(used MS get_paths, check for file or dir, return object)

upload a file; this includes checking if the parent exists, if the file exists, and checking the size.

download a file; this includes checking if the file exists.
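
As an illustration of the upload test, a rough sketch of "upload unless a same-size file is already there" could look like the following. The function name `upload_if_changed` is made up, the parent-folder check is left out, and `file_system_client` is assumed to be a `FileSystemClient` from azure-storage-file-datalake.

```python
import os


def upload_if_changed(file_system_client, local_path, remote_path):
    """Upload local_path unless a remote file of the same size exists.

    Returns True if an upload happened, False if it was skipped.
    """
    file_client = file_system_client.get_file_client(remote_path)
    local_size = os.path.getsize(local_path)
    if file_client.exists():
        # get_file_properties() is the call that timed out in this report.
        if file_client.get_file_properties().size == local_size:
            return False
    with open(local_path, "rb") as handle:
        file_client.upload_data(handle, overwrite=True)
    return True
```

Comparing sizes only is a cheap check; it will not notice a changed file whose size happens to be the same.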

These test scripts were run against adls prod and dev (2 different file systems).
All tests ran successfully on the Linux dev server.

We assumed the difference in the MS module versions was the reason for our problem on the Linux prod server.
Since all tests were successful on Linux dev, we upgraded our Linux prod server: we installed the same test scripts
and upgraded the MS modules to the same versions.

On Linux prod, all test scripts against adls dev failed.
The test script to upload files to adls prod worked after we changed the part of the code that uses the MS function get_file_properties.
The test to download from adls prod failed.

We ended up rolling back our software and the MS module versions on Linux prod.

How do we troubleshoot this?
It works on one Linux server with the same OS, the same Python, and the same tokens for adls prod and dev.

@peder45552
Author

I went back to the Linux prod server and installed the latest Azure Python module versions.
Instead of just a generic timeout message I now get:
Connection to hpimdpdatalake.blob.core.windows.net timed out. (connect timeout=20)

After some more searching, I found this bug page:
Azure/azure-sdk-for-python#28643

It explains that the Python module uses the dfs REST API when pointed at a dfs endpoint, but for get_file_properties it uses the blob REST API.

It also suggests trying the function get_access_control (which uses the dfs REST API), so I wrote a new test script and confirmed that get_access_control does not time out against dfs (on the Linux prod server).
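
A test script along these lines could look roughly like the sketch below. It is not the exact script I ran: the function name `probe_file_calls` is made up, and `file_client` is assumed to be a `DataLakeFileClient` from azure-storage-file-datalake. Which REST API each method uses is taken from the linked bug page, not verified independently here.

```python
import time


def probe_file_calls(file_client):
    """Call get_access_control and get_file_properties; report each outcome.

    Returns {method name: (ok, elapsed seconds, error text or None)}.
    """
    results = {}
    for method_name in ("get_access_control", "get_file_properties"):
        started = time.monotonic()
        try:
            getattr(file_client, method_name)()
            ok, error = True, None
        except Exception as exc:  # broad on purpose: this is a diagnostic
            ok, error = False, f"{type(exc).__name__}: {exc}"
        results[method_name] = (ok, time.monotonic() - started, error)
    return results
```

If one call succeeds quickly and the other fails only after the connect timeout, that points at the network path to one endpoint rather than at the code or the credentials.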

I will work with our IT support to see why the prod server seems to be blocked when Linux dev is not.
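
To give IT support something concrete, a plain TCP check against both storage endpoints can be run on each server. This is a sketch using only the standard library; the function names are made up, the 20 second default mirrors the connect timeout in the error message, and port 443 assumes HTTPS access.

```python
import socket


def storage_hosts(account_name):
    """Return the blob and dfs host names for a storage account."""
    return {
        "blob": f"{account_name}.blob.core.windows.net",
        "dfs": f"{account_name}.dfs.core.windows.net",
    }


def can_connect(host, port=443, timeout=20.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures.
        return False


if __name__ == "__main__":
    # Account name taken from the error message above.
    for kind, host in storage_hosts("hpimdpdatalake").items():
        print(kind, host, "reachable" if can_connect(host, timeout=5.0) else "NOT reachable")
```

A server where dfs is reachable but blob is not would match the behaviour described here. This only tests that a TCP connection opens; it says nothing about authentication.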

@peder45552
Author

Prod Linux has been unblocked in the firewall, and test cases reading/writing from Linux prod to adls prod are working.
