EntitySet.add_relationship should error if the child variable is also the index variable of the child entity #1009

Closed
martysai opened this issue Jun 3, 2020 · 5 comments · Fixed by #1034
Labels: bug (Something isn't working)

Comments

martysai commented Jun 3, 2020

Deep Feature Synthesis Error

I am trying to apply DFS to my EntitySet. It is simple and has the following structure:
Image of Structure
However, ft.dfs fails with both target_entity='orders' and target_entity='customers'. I found an open issue in your repository: issue
But it is still open, and I can't figure out how to fix the errors that appear with either target entity.

Bug/Feature Request Description

There are warnings of the following kind:
WARNING Attempting to add feature <Feature: customer_zip_code_prefix / 1> which is already present. This is likely a bug.
Error message:
ValueError: 'order_id' is both an index level and a column label, which is ambiguous.

Expectations

How can I get a feature_matrix_spec here without any errors or warnings?
Thanks for your time!

Output of featuretools.show_info()

Featuretools version: 0.15.0
Featuretools installation directory: /usr/local/lib/python3.6/dist-packages/featuretools

SYSTEM INFO

python: 3.6.9.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.104+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

INSTALLED VERSIONS

numpy: 1.18.4
pandas: 1.0.4
tqdm: 4.41.1
PyYAML: 3.13
cloudpickle: 1.3.0
dask: 2.12.0
distributed: 1.25.3
psutil: 5.4.8
pip: 19.3.1
setuptools: 47.1.1

rwedge (Contributor) commented Jun 3, 2020

Hi @MaratSaidov, thanks for the bug report. Happy to help you with these issues. Could you share the code you used to create the EntitySet and attach the structure image directly to your comment?

martysai (Author) commented Jun 4, 2020

Hi @rwedge, thank you for your help. Here is the structure of the EntitySet:
table-description

Here is the code which creates this EntitySet:

import featuretools as ft

# olist_orders and olist_customers are pandas DataFrames loaded beforehand
es = ft.EntitySet(id='automl')
es = es.entity_from_dataframe(entity_id='orders', dataframe=olist_orders, index='order_id')
es = es.entity_from_dataframe(entity_id='customers', dataframe=olist_customers, index='customer_id')

# Parent variable first, then child variable
r_customer_order = ft.Relationship(es['customers']['customer_id'], es['orders']['customer_id'])
es = es.add_relationships([r_customer_order])

Then I call the deep feature synthesis function:

feature_matrix_spec, feature_names_spec = ft.dfs(entityset=es, target_entity='customers',  
                                                 agg_primitives=agg_primitives,
                                                 trans_primitives=trans_primitives,
                                                 max_depth=2, features_only=False)

Here agg_primitives and trans_primitives are just lists of primitives chosen at random from your library.

The dataframes described above (olist_orders, olist_customers) do have the unique identifiers order_id and customer_id, respectively.

rwedge (Contributor) commented Jun 9, 2020

Hi @MaratSaidov,

I made a mock dataset and was able to replicate the warnings you were getting; we can investigate those. I didn't encounter the ValueError you reported. Are you able to share the data and code you used to create the EntitySet? A reproducible example would help a lot with diagnosing this.

Other information that would help:

  • The full stack trace of the ValueError
  • The smallest group of primitives necessary to create this error

martysai (Author) commented

Hi @rwedge,

I have narrowed the problem down a bit. Consider two structures:

three-tables

This input works for a bunch of DFS runs. However, if I try to add a table without any relationships (edges) to this structure:

four-tables

In this case I get the ValueError described above.

You can find a notebook that reproduces the bug here: Bug Example

The data tables are here: data

The original dataset is here: dataset

Notes about other information:

  • The full stack trace is quite long, so please see it in the notebook I attached.
  • The problem appears with many randomly generated sets of primitives. Consider the following, for example:
agg_primitives = ['last']
trans_primitives = ['age']

Thank you!

rwedge (Contributor) commented Jun 12, 2020

Hi @MaratSaidov,

Thanks for the example notebook! I was able to reproduce the error and figure out what was causing it.

In both of the graphs you included in your last post, the index column for the order_items entity, order_id, is listed with the id variable type instead of the expected index type. This happened when adding the relationship between orders and order_items. Featuretools does not support having a column be both the index column for an entity and a foreign key column. There should be an error when trying to add a relationship that would cause this scenario. Since there was no error when adding the relationship, an error occurs later during the calculation, where it is harder to diagnose.
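
As a rough illustration of how this kind of failure can surface, here is a pandas-only sketch (not Featuretools code). It assumes the later error comes from a pandas operation that sees order_id as both the name of the index and a regular column. The toy data and variable names are made up for the example.

```python
import pandas as pd

# Toy stand-in for the order_items data (values are invented)
order_items = pd.DataFrame({"order_id": [1, 1, 2], "price": [10.0, 5.0, 7.5]})

# Keep order_id as a column AND use it as the index name
ambiguous = order_items.set_index("order_id", drop=False)

try:
    # pandas cannot tell whether "order_id" means the index level or the column
    ambiguous.groupby("order_id")["price"].sum()
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

The message pandas reports here has the same wording as the ValueError in the original report, which is why the failure points at order_id rather than at the relationship that caused it.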

However, we can fix this relationship problem. Instead of using order_id as the index for the order_items entity, we should use a different index column. Featuretools can create a new index column automatically when creating an entity from dataframe.

We can remove one preprocessing step:

# choose only unique indices
print("olist_order_items.shape:", olist_order_items.shape)
olist_order_items = olist_order_items.iloc[olist_order_items.drop_duplicates(['order_id']).index, :]
print("olist_order_items.shape:", olist_order_items.shape)

I removed the code above that drops rows. Once the order_items entity uses an index other than order_id, it is ok to have multiple rows with the same order id.

Then we need to add a new index. entity_from_dataframe has a make_index option, so we just need to update the line where we add the order_items entity. I called the new index column "order_item_unique_id".

es = es.entity_from_dataframe(entity_id='order_items', dataframe=olist_order_items, index='order_item_unique_id', make_index=True)

After making those changes I was able to run dfs using Last and Age without errors.
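
For intuition, the effect of adding a separate index column can be sketched with plain pandas. This only approximates what make_index=True gives you and is an assumption about the result, not the Featuretools implementation. The toy data is invented.

```python
import pandas as pd

# Toy stand-in for olist_order_items: order_id repeats across rows
order_items = pd.DataFrame({
    "order_id": ["a", "a", "b"],
    "price": [10.0, 5.0, 7.5],
})

# Add a surrogate key column that is unique per row
order_items.insert(0, "order_item_unique_id", range(len(order_items)))

# order_id stays a plain column, free to act as the foreign key to orders
print(order_items)
```

With a surrogate key like this, no rows need to be dropped: duplicate order_id values are expected in a child table.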

I'm changing the name of the issue to reflect what we should do to fix it in the future, which is to prevent adding a relationship where the child variable is also the index of the child entity.
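
The kind of check described above could look roughly like the sketch below. The function name check_relationship, its parameters, and the error text are hypothetical; this is not the Featuretools API and not necessarily how the eventual fix is implemented.

```python
def check_relationship(child_index: str, child_variable: str) -> None:
    """Hypothetical guard: reject a child variable that is also the child's index."""
    if child_variable == child_index:
        raise ValueError(
            f"Child variable '{child_variable}' cannot also be the index "
            "of the child entity"
        )


# orders is indexed by order_id and links to customers via customer_id: accepted
check_relationship(child_index="order_id", child_variable="customer_id")

# order_items indexed by order_id and linked to orders via order_id: rejected
try:
    check_relationship(child_index="order_id", child_variable="order_id")
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```

Failing at this point would put the error next to the add_relationship call that causes it, instead of deep inside the feature calculation.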

rwedge changed the title from "ValueError: 'order_id' is both an index level and a column label, which is ambiguous." to "EntitySet.add_relationship should error if the child variable is also the index variable of the child entity" Jun 12, 2020
rwedge added the bug (Something isn't working) label Jun 12, 2020
frances-h self-assigned this Jun 24, 2020