Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when joining dataframes with duplicate column names if dataframes generated from file #14147

Open
fullstart opened this issue Jan 16, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@fullstart
Copy link

Describe the bug

Encountered an issue joining dataframes with duplicate column names if they generated from file read (I tried csv and parquet).
Dataframes produced from python dict do join without problem.

I did my testing with latest version of Datafusion on Windows.

To Reproduce

Fine with dataframes from dict

from datafusion import SessionContext
ctx = SessionContext()
x1 = ctx.from_pydict({'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [3, 4, 1, 2, 3]})
x2 = ctx.from_pydict({'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [5, 6, 7, 8, 9]})
x1.join(x2, on="id1")
Out[16]:
DataFrame()
+-----+------+------+-----+------+------+
| id1 | col2 | col3 | id1 | col2 | col3 |
+-----+------+------+-----+------+------+
| 1   | 3    | 3    | 1   | 3    | 5    |
| 2   | 4    | 4    | 2   | 4    | 6    |
| 4   | 3    | 1    | 4   | 3    | 7    |
| 5   | 5    | 2    | 5   | 5    | 8    |
| 6   | 2    | 3    | 6   | 2    | 9    |
+-----+------+------+-----+------+------+

Continue to file read

x1.write_csv("df1.csv")
x2.write_csv("df2.csv")

x1_f = ctx.read_csv("df1.csv")
x2_f = ctx.read_csv("df2.csv")

x1_f.join(x2_f, on="id1")
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[21], line 1
----> 1 x1_f.join(x2_f, on="id1")

File ~\prj\datafusion_test\venv\Lib\site-packages\datafusion\dataframe.py:468, in DataFrame.join(self, right, on, how, left_on, right_on, join_keys)
    465 if isinstance(right_on, str):
    466     right_on = [right_on]
--> 468 return DataFrame(self.df.join(right.df, how, left_on, right_on))

Exception: Schema error: No field named id1. Valid fields are "?table?"."1", "?table?"."3".

Expected behavior

No response

Additional context

No response

@fullstart fullstart added the bug Something isn't working label Jan 16, 2025
@chenkovsky
Copy link

x1.write_csv("df1.csv", with_header=True) 
x2.write_csv("df2.csv", with_header=True)

@fullstart
Copy link
Author

You're right, it works with CSV-created dataframes, I missed header stuff

But originally issue arised with parquet files. I tested with parquet this time, and problem is there:

   2: from datafusion import SessionContext, lit, col, functions as f
   3: ctx = SessionContext()
   4: x1 = ctx.from_pydict({"id1": [1, 2, 3], "val1": ["a", "b", "c"]})
   5: x2 = ctx.from_pydict({"id1": [2, 3, 4], "val1": ["b", "c", "d"]})
   6: x1.write_parquet("df1.parquet")
   7: x2.write_parquet("df2.parquet")
   8: xf1 = ctx.read_parquet("df1.parquet")
   9: xf2 = ctx.read_parquet("df2.parquet")
  10: xf1
Out[10]:
DataFrame()
+-----+------+
| id1 | val1 |
+-----+------+
| 1   | a    |
| 2   | b    |
| 3   | c    |
+-----+------+
  11: xf2
Out[11]:
DataFrame()
+-----+------+
| id1 | val1 |
+-----+------+
| 2   | b    |
| 3   | c    |
| 4   | d    |
+-----+------+
  12: x1.join(x2, on="id1")
Out[12]:
DataFrame()
+-----+------+-----+------+
| id1 | val1 | id1 | val1 |
+-----+------+-----+------+
| 2   | b    | 2   | b    |
| 3   | c    | 3   | c    |
+-----+------+-----+------+
  13: xf1.join(xf2, on="id1")
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[13], line 1
----> 1 xf1.join(xf2, on="id1")

File ~\prj\datafusion\venv\Lib\site-packages\datafusion\dataframe.py:468, in DataFrame.join(self, right, on, how, left_on, right_on, join_keys)
    465 if isinstance(right_on, str):
    466     right_on = [right_on]
--> 468 return DataFrame(self.df.join(right.df, how, left_on, right_on))

Exception: Schema error: Schema contains duplicate qualified field name "?table?".id1

@chenkovsky
Copy link

i created an issue in data fusion python apache/datafusion-python#996

@Omega359
Copy link
Contributor

Datafusion currently requires all field names to be unique. Recently added was alias support for tables which might help in the future. Hopefully that will make it to the pythonlib #14127

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants