[C++] Unrecognized filesystem type in URI: abfss:// #32912

Closed
asfimport opened this issue Sep 10, 2022 · 5 comments

@asfimport
Collaborator

I am running the commands below in Databricks.

When I try to read a file stored in ADLS using pandas:

pip install adlfs 
import pandas as pd
data = pd.read_parquet("abfss://data.parquet", storage_options= {})

I got the following error:

  File "/databricks/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 310, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/databricks/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 125, in read
    path, columns=columns, **kwargs
  File "/databricks/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1573, in read_table
    ignore_prefixes=ignore_prefixes,
  File "/databricks/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1434, in __init__
    ignore_prefixes=ignore_prefixes)
  File "/databricks/python/lib/python3.7/site-packages/pyarrow/dataset.py", line 667, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/databricks/python/lib/python3.7/site-packages/pyarrow/dataset.py", line 424, in _filesystem_dataset
    fs, paths_or_selector = _ensure_single_source(source, filesystem)
  File "/databricks/python/lib/python3.7/site-packages/pyarrow/dataset.py", line 371, in _ensure_single_source
    filesystem, path = FileSystem.from_uri(path)
  File "pyarrow/_fs.pyx", line 347, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unrecognized filesystem type in URI: abfss://data.parquet

Reporter: Prakhar Sandhu

Note: This issue was originally created as ARROW-17672. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Kouhei Sutou / @kou:
We need to implement a filesystem module for Azure Data Lake Storage in C++ like ARROW-2034 to support this case.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
You are using adlfs, which is an fsspec-compatible filesystem. Normally I would expect the pandas read_parquet call to convert the "abfss://data.parquet" URI to an fsspec filesystem and pass that to the underlying pyarrow function. We do support fsspec filesystems, and that is how we can support filesystems that have no native implementation in Arrow C++ (such as Azure at the moment).

So something is going wrong here. To start, can you indicate which versions of pyarrow, pandas, fsspec and adlfs you are using (e.g. the output of pip list or conda list)?

@asfimport
Collaborator Author

Prakhar Sandhu:
Please find the versions used below:
 
pandas==1.3.5
pyarrow==4.0.0
python==3.7.6
adlfs==2022.2.0
fsspec==2022.8.2

@Tom-Newton
Contributor

If you really want to use adlfs, this issue is definitely solvable with changes to the user code alone. However, I think it will also be solved by #39317, which will connect up the new C++ AzureFileSystem on the Python side and provide much better performance and reliability than adlfs.

@jorisvandenbossche
Member

Let's close this issue in favor of #39317
