Error when joining dataframes with duplicate column names if dataframes generated from file #14147

fullstart · 2025-01-16T06:17:39Z

Describe the bug

Encountered an issue joining dataframes with duplicate column names if they generated from file read (I tried csv and parquet).
Dataframes produced from python dict do join without problem.

I did my testing with latest version of Datafusion on Windows.

To Reproduce

Fine with dataframes from dict

from datafusion import SessionContext
ctx = SessionContext()
x1 = ctx.from_pydict({'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [3, 4, 1, 2, 3]})
x2 = ctx.from_pydict({'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [5, 6, 7, 8, 9]})
x1.join(x2, on="id1")
Out[16]:
DataFrame()
+-----+------+------+-----+------+------+
| id1 | col2 | col3 | id1 | col2 | col3 |
+-----+------+------+-----+------+------+
| 1   | 3    | 3    | 1   | 3    | 5    |
| 2   | 4    | 4    | 2   | 4    | 6    |
| 4   | 3    | 1    | 4   | 3    | 7    |
| 5   | 5    | 2    | 5   | 5    | 8    |
| 6   | 2    | 3    | 6   | 2    | 9    |
+-----+------+------+-----+------+------+

Continue to file read

x1.write_csv("df1.csv")
x2.write_csv("df2.csv")

x1_f = ctx.read_csv("df1.csv")
x2_f = ctx.read_csv("df2.csv")

x1_f.join(x2_f, on="id1")
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[21], line 1
----> 1 x1_f.join(x2_f, on="id1")

File ~\prj\datafusion_test\venv\Lib\site-packages\datafusion\dataframe.py:468, in DataFrame.join(self, right, on, how, left_on, right_on, join_keys)
    465 if isinstance(right_on, str):
    466     right_on = [right_on]
--> 468 return DataFrame(self.df.join(right.df, how, left_on, right_on))

Exception: Schema error: No field named id1. Valid fields are "?table?"."1", "?table?"."3".

Expected behavior

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

chenkovsky · 2025-01-16T07:34:37Z

x1.write_csv("df1.csv", with_header=True) 
x2.write_csv("df2.csv", with_header=True)

fullstart · 2025-01-16T11:41:25Z

You're right, it works with CSV-created dataframes, I missed header stuff

But originally issue arised with parquet files. I tested with parquet this time, and problem is there:

   2: from datafusion import SessionContext, lit, col, functions as f
   3: ctx = SessionContext()
   4: x1 = ctx.from_pydict({"id1": [1, 2, 3], "val1": ["a", "b", "c"]})
   5: x2 = ctx.from_pydict({"id1": [2, 3, 4], "val1": ["b", "c", "d"]})
   6: x1.write_parquet("df1.parquet")
   7: x2.write_parquet("df2.parquet")
   8: xf1 = ctx.read_parquet("df1.parquet")
   9: xf2 = ctx.read_parquet("df2.parquet")
  10: xf1
Out[10]:
DataFrame()
+-----+------+
| id1 | val1 |
+-----+------+
| 1   | a    |
| 2   | b    |
| 3   | c    |
+-----+------+
  11: xf2
Out[11]:
DataFrame()
+-----+------+
| id1 | val1 |
+-----+------+
| 2   | b    |
| 3   | c    |
| 4   | d    |
+-----+------+
  12: x1.join(x2, on="id1")
Out[12]:
DataFrame()
+-----+------+-----+------+
| id1 | val1 | id1 | val1 |
+-----+------+-----+------+
| 2   | b    | 2   | b    |
| 3   | c    | 3   | c    |
+-----+------+-----+------+
  13: xf1.join(xf2, on="id1")
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[13], line 1
----> 1 xf1.join(xf2, on="id1")

File ~\prj\datafusion\venv\Lib\site-packages\datafusion\dataframe.py:468, in DataFrame.join(self, right, on, how, left_on, right_on, join_keys)
    465 if isinstance(right_on, str):
    466     right_on = [right_on]
--> 468 return DataFrame(self.df.join(right.df, how, left_on, right_on))

Exception: Schema error: Schema contains duplicate qualified field name "?table?".id1

chenkovsky · 2025-01-16T11:51:53Z

i created an issue in data fusion python apache/datafusion-python#996

Omega359 · 2025-01-16T15:10:37Z

Datafusion currently requires all field names to be unique. Recently added was alias support for tables which might help in the future. Hopefully that will make it to the pythonlib #14127

fullstart added the bug Something isn't working label Jan 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error when joining dataframes with duplicate column names if dataframes generated from file #14147

Error when joining dataframes with duplicate column names if dataframes generated from file #14147

fullstart commented Jan 16, 2025

chenkovsky commented Jan 16, 2025

fullstart commented Jan 16, 2025

chenkovsky commented Jan 16, 2025

Omega359 commented Jan 16, 2025

Error when joining dataframes with duplicate column names if dataframes generated from file #14147

Error when joining dataframes with duplicate column names if dataframes generated from file #14147

Comments

fullstart commented Jan 16, 2025

Describe the bug

To Reproduce

Expected behavior

Additional context

chenkovsky commented Jan 16, 2025

fullstart commented Jan 16, 2025

chenkovsky commented Jan 16, 2025

Omega359 commented Jan 16, 2025