
Lazy instantiate a node with unique property #353

Closed
lerela opened this issue Jul 11, 2018 · 9 comments
Labels
enhancement marked_for_imminent_closure Fully or partially obsolete issue. Request clarification, close if none provided.

Comments

@lerela
Contributor

lerela commented Jul 11, 2018

Hello,

To delete a node without loading it first (because why load it to delete it?) one can do:

node = MyNode.inflate(node_id)
node.delete()

However using the internal Neo4j id is a bad practice. Also, this approach fails for updates, as the query built by neomodel does not take the node id into account, meaning that:

MyNode.create_or_update({
    "id": node_id,
    "my_property": "bob"
})

fails to update the object (it creates a new one).

The recommended way to reference Neo4j objects is to use uids; however, neomodel does not support lazily inflating objects from their uids, for instance with:

node = MyNode.inflate(node_uid)

All logic currently relies on the internal Neo4j id, which requires a database hit to fetch it from the uid. This database hit is unnecessary in many cases (deletes, merges) and leads to a huge performance impact when updating a batch of nodes (say, setting the properties of 10,000 nodes whose ids/uids we already have).
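For illustration, the kind of single-round-trip update I have in mind could be expressed in raw Cypher today. This is only a sketch: `build_batch_update` is a hypothetical helper, while `db.cypher_query` is neomodel's existing raw-query entry point (the call is left commented out since it needs a live server):

```python
# Sketch only: a hypothetical helper that builds one UNWIND query to set a
# property on many nodes identified by a unique `uid` property.
def build_batch_update(label, prop, updates):
    """updates: iterable of (uid, new_value) pairs."""
    query = (
        "UNWIND $rows AS row "
        f"MATCH (n:{label} {{uid: row.uid}}) "
        f"SET n.{prop} = row.value"
    )
    params = {"rows": [{"uid": u, "value": v} for u, v in updates]}
    return query, params

query, params = build_batch_update("Person", "name", [("abc-123", "bob")])
# Against a live database this would be a single round trip:
# from neomodel import db
# db.cypher_query(query, params)
```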

I believe that being able to lazily inflate a node through a unique property, and to perform queries that rely on it instead of the id, would solve this. What are your thoughts?

@aanastasiou
Collaborator

However using the internal Neo4j id is a bad practice.

In what sense? All nodes have to have a node id associated with them and node ids are expected to be consistent at least within the context of a session. Node ids would be unreliable if business logic assumed that a node id never changes for as long as a node remains stored in a database instance.

From the point of view of Neomodel, the way you demonstrate create_or_update cannot work because of the way the actual query is expanded internally.

This could be a use-case for a batched mode way of running operations on a database though.

As long as node_uid is an int, MyNode.inflate() should inflate the node.

All logic currently rely on the internal Neo4j id to be loaded which requires a database hit to be fetched from the uid.

What / where is the uid (?).

... lazy inflate a node through an unique property and perform queries that rely on it instead of id...

Basically, what you are saying is that any property with a unique constraint should be usable to fetch a node, because it is guaranteed to reference that node uniquely (?). Would your use-case require this kind of referencing at the level of inflate?

This kind of thing sounds useful (to me at least) but the description is probably a bit simplistic at the moment. The scenario needs to become a bit more specific (what if there is more than one unique constraint?). Personally, I would see it as a modification to create_or_update or the addition of a function to do property updates en masse.

Just as a side note which you might not find entirely helpful. There are certain things you can do with Neomodel and others that you currently cannot but could do with other ways. For those things that you cannot do with Neomodel at the moment (and provided that they are not bugs) it would be useful to capture how you went about solving it in your specific use-case and whether it could be turned into a generic solution expressed over the Graph data model. This could then be reviewed on how it fits with the current way Neomodel works and see how it could be included in an upcoming release.

@lerela
Contributor Author

lerela commented Aug 7, 2018

However using the internal Neo4j id is a bad practice.

In what sense? All nodes have to have a node id associated with them and node ids are expected to be consistent at least within the context of a session. Node ids would be unreliable if business logic assumed that a node id never changes for as long as a node remains stored in a database instance.

Indeed, our basic use case is a SQL record associated with a Neo4J node, so I need to store a unique reference to this Neo4J node.
I can do it with the Neo4J internal id, which is not advised in the long run as you state (cf).
The other way is to use a unique property id (which I called uid without defining it, sorry) with UniqueIdProperty().

Basically, what you are saying is that any property with a unique constraint should be used to fetch a node because it is guaranteed that it will reference that node uniquely (?). Would your use-case require this kind of referencing at the level of inflate?

Exactly. The main issue I'm trying to raise is that the Neo4J-recommended way to store a persistent reference to a Neo4J node is to use such a unique property. Yet the only way in neomodel to inflate a node without fetching it first is to use internal ids (via .inflate).
Therefore, if I store the unique property to reference the Neo4J node, I'll need an extra Neo4J query for any DELETE or MERGE (to fetch the internal id), whereas it could be done in a single Cypher query (MATCH (n) WHERE n.uid='...' DETACH DELETE n).

On the contrary, the only way to perform batch merges without fetching the nodes first is to use a unique id property (using .create_or_update). Internal ids won't work in this scenario, as stated previously.

This sounds inconsistent to me and causes important performance issues on large batch operations.

To sum it all up:

  • if I store the internal ids, I can't perform batch merges (and storing internal ids is bad),
  • if I store the unique properties, I can't perform batch deletes,
  • without unique-property-based inflate, there is an overhead of one query for each MERGE or DELETE, because neomodel needs to fetch the node from the database even though it is not needed.

This API could solve those limitations:

Person.objects.batch_delete(id=(1, 2, 3, ...)) # delete the Person nodes with internal ids 1, 2, 3 in a single Neo4J query

Person.objects.batch_delete(uid=('136b42f8-8002-4a03-81a2-6acd041de739', '72f45813-99a4-4889-8177-712d64a1f015', ..., ...)) # delete the Person nodes with the unique property `uid` set to provided values in a single Neo4J query

Person.inflate(uid='72f45813-99a4-4889-8177-712d64a1f015', prop='updated value')
# Lazy instantiate a Person object with this UID so that some properties can be updated with .save()
# Currently this fails as .inflate expects the internal Neo4J id
# This also fails: Person(uid='72f45813-99a4-4889-8177-712d64a1f015', prop='updated value').save() with "UniqueProperty: Node(53338) already exists with label `Person` and property `uid`='72f45813-99a4-4889-8177-712d64a1f015'"
# So one should first fetch the Person: p=Person.objects.get(uid=...) then delete it with p.delete(), which is wasting a query

In the case of multiple unique properties, this approach would still work, since I'm passing the name of the property in all those calls.
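As a point of comparison, the uid-based batch delete could already be approximated today with a single raw Cypher query. A sketch, assuming neomodel's `db.cypher_query` raw-query API (`build_batch_delete` is hypothetical; the actual call is commented out since it needs a live server):

```python
# Sketch only: what a batch_delete could issue under the hood.
def build_batch_delete(label, uids):
    # Deletes every node of the given label whose unique `uid` property
    # is in the supplied list, in one statement.
    query = f"MATCH (n:{label}) WHERE n.uid IN $uids DETACH DELETE n"
    return query, {"uids": list(uids)}

query, params = build_batch_delete(
    "Person",
    ["136b42f8-8002-4a03-81a2-6acd041de739",
     "72f45813-99a4-4889-8177-712d64a1f015"],
)
# One round trip against a live database:
# from neomodel import db
# db.cypher_query(query, params)
```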

What are your thoughts?

@aanastasiou
Collaborator

...inflate a node without fetching it first...

I have been working along the lines of this in a different package that makes use of neomodel and (I am hoping) I will be able to release soon. However, there is something we have briefly discussed with the rest of the people involved in Neomodel around "Batch Operations".

This is nothing more than CRUD calls that appear to return immediately because they don't really apply the changes to the database but simply add the transaction to a list and return. The main reason for this is that when you know that you have large quantities of operations then you can take this into account when translating operations to queries (unwind lists of values, execute operations in transactions, etc). Once the size of this list reaches a user-defined level, the operations are "auto-committed" to the database. This can be done in an asynchronous or blocking way (the default is blocking). At the moment, these "batch" operations cover Creation of nodes and Creation of relationships.

The way these are implemented respects the principles behind Neomodel. So, re-use the abstract objects to communicate data back and forth to the backend rather than raw queries, re-use established (and very sensible) data structures and operations and extend them.

Updating properties en masse looks like another broad pattern we can bring in to Neomodel, possibly within the context of "Batch Operations".

I think that the point about .inflate() is valid...BUT, you have to understand that .inflate() is weaved into the way Neomodel operates. It is used to resolve relationships and return nodes wherever this is required. So, changing its operation, in the short term is not going to make Neomodel 10x faster unless this change is propagated everywhere.

So:

  • Part of your use case can be solved by .nodes.get(unique_id_property='something'), at least for retrieval, followed by property setting and .save().

    • Internally, in the short term this is how .inflate() could deal with retrieval if it is passed a named parameter that happens to be a unique_index property. StructuredNode's metamodel knows everything about properties so it would have to get an additional dictionary for fast lookups of uniquely indexed properties. Otherwise, the gain is lost in iterating through all properties to find out which one was declared with unique_index.
  • "Batch operations", at least as envisaged until now, support Creation... which is just 1/4 of the CRUD functionality, so, yeah, there is work to be done on this :). The way Batch Operations work right now is by "batching" existing operations. So, you can batch .connect, which is a valid operation for relationships. And given the way things work in a transaction right now, there is little difference between issuing repeated atomic queries and one huge UNWIND type of query.

    • Perhaps we can work on a version of .save() that figures out that this is an update for an existing node (of the same class it represents!) judging by the existence (or not) of a unique field.
      • But you can see here that because of the validation layer that the OGM brings in, there is a number of checks in the middle that may be killing performance. So, this might have to be offered as something like a "special function" that makes certain assumptions on how it is going to be used.
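The "additional dictionary for fast lookups of uniquely indexed properties" mentioned above could be sketched like this. These are deliberately simplified stand-ins, not neomodel's actual Property or StructuredNode classes; the point is only that the dictionary is built once at class-creation time so later lookups avoid scanning all declared properties:

```python
# Illustrative stand-in for a declared property with an optional
# unique index flag (simplified, not neomodel's Property class).
class Property:
    def __init__(self, unique_index=False):
        self.unique_index = unique_index

class NodeMeta(type):
    def __new__(mcs, name, bases, attrs):
        cls = super().__new__(mcs, name, bases, attrs)
        # Built once when the class is defined; O(1) lookups afterwards,
        # instead of iterating all properties on every inflate call.
        cls.unique_properties = {
            k: v for k, v in attrs.items()
            if isinstance(v, Property) and v.unique_index
        }
        return cls

class Person(metaclass=NodeMeta):
    uid = Property(unique_index=True)
    name = Property()

print(sorted(Person.unique_properties))  # ['uid']
```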

@robinedwards would be interesting to get your thoughts on this too at your convenience.

@aanastasiou
Collaborator

@lerela Shall we close this for the moment? Did .nodes.get() work for you as indicated above?

@lerela
Contributor Author

lerela commented Aug 14, 2018

You can close the ticket but I don't feel .nodes.get() solves my use case.

It returns the expected node indeed, but to update a node it's still making one unneeded query (since the uid was enough to uniquely identify the node in the MERGE query).
In the MERGE case, the queries can be carefully batched into create_or_update but that's just one aspect of it (and this still requires extra work & caution).
Say you have 2 instances, each with a uid, and you now need to create a relationship between them. You need to fetch the 2 instances before you can connect them. I'm arguing that none of this is needed and one could simply do:

     # No query to inflate the nodes:
     node0, node1 = MyNode.inflate(uid=node0_uid), MyNode.inflate(uid=node1_uid)
     # A single query, that can be wrapped in a transaction if we have multiple relationships to create
     node0.relation.connect(node1)

So instead of fetching one node, waiting for the Neo4J response, fetching another node, waiting for the Neo4J response, and then creating the relationship, we'd only have one request.

Yet I understand your points and that it might require quite a few changes in different places, but implementing more batch operations and the ability to manipulate a node without fetching it when there's no need to do so would greatly improve performance in our use cases.

@aanastasiou
Collaborator

@lerela There are two things here:

  1. Helping you through your current situation within the constraints of Neomodel (which, at least partially, I think we can tick that box :) ). Some more notes:

    1. Use Transactions
      • Because of the way things work between you issuing a Neomodel "call" and that "call" actually being submitted to the server, your query times are going to be slashed if you make sure that they run WITHIN a transaction. So, this [MyObject(k).save() for k in range(0,10)] is slower than if it happened within a with neomodel.db.transaction:.
    2. Make sure that your server doesn't "go to sleep". In my case, I noticed that the server process was getting locked (for a few seconds) due to garbage collection. I am now using dbms.jvm.additional=-XX:+UseConcMarkSweepGC instead of dbms.jvm.additional=-XX:+UseG1GC.
  2. Taking the use case you underline into account in further work on Neomodel.

    • We would try to do this by looking at the general case and again by expressing things in the way Neomodel works. So:
      1. If you can describe the general case from what you are dealing with specifically, that would be helpful;
      2. If you can provide a minimal example of how you would expect it to work, even better;
      3. If you can spare some time to see what is currently available and contribute to the development, ..., we would be hitting "Ideal" levels :)
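To illustrate the transaction tip in point 1.1, here is a toy model of why batching work inside one transaction is cheaper than autocommitting every statement. FakeDB is purely illustrative (it is not neomodel code and simplifies what the driver really does); it just counts commits, which is the overhead an explicit `with db.transaction:` block amortizes:

```python
# Toy illustration: autocommit pays one commit per statement, while an
# explicit transaction pays a single commit for the whole batch.
class FakeDB:
    def __init__(self):
        self.commits = 0
        self._in_tx = False

    def run(self, statement):
        if not self._in_tx:
            self.commits += 1  # autocommit: each statement is its own tx

    def __enter__(self):
        self._in_tx = True
        return self

    def __exit__(self, *exc):
        self._in_tx = False
        self.commits += 1      # one commit for everything in the block

db = FakeDB()
for k in range(10):
    db.run(f"CREATE (:MyObject {{k: {k}}})")      # 10 commits

with db:
    for k in range(10):
        db.run(f"CREATE (:MyObject {{k: {k}}})")  # 1 commit for all 10

print(db.commits)  # 11
```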

All the best

@lerela
Contributor Author

lerela commented Aug 16, 2018

Thanks for the tips, I'll definitely try those! I'm already wrapping queries in transactions, but of course things are slower when the Neo4J response must be awaited.

I can work on specifying the requirements more thoroughly, but the minimal API I described above sums up most of the needs. If my company can allocate time to develop those, we'll do it for sure, but it would probably be helpful to get some feedback from the neomodel team first, to give us some clues about the implementation and to make sure it follows your standards and expectations so that it can be merged at some point.

@aanastasiou
Collaborator

aanastasiou commented Aug 17, 2018

...things are slower when Neo4J response must be awaited.

Can we unpick this a bit, please? I am not sure I get it. What is the "problem" here? The fact that the OGM returns a complete record for an operation when it could be ignoring the response from the server? Or something else? Is your server on a different machine or the same one? What sort of delays are we talking about, and at what point? (Is it possible to profile?)

EDIT: Sorry, forgot to add: get_or_create also has a lazy parameter to only return the id...

Regarding the rest, I understand, I had similar concerns before I took the plunge to work with Neomodel.

@aanastasiou
Collaborator

@lerela I just had another read through this thread.

After this, you can now do SomeNode.inflate(SomeNode.nodes.get(uid="a864b92...325a5c", lazy=True)).delete(), but this still involves two queries: one for the get (which here only returns the node's id) and one for the inflate, which brings the node in.

I feel that this is the least amount of steps you could perform and at the same time ensuring that the operation will indeed run on an existing node.

What I think you are describing here is something like SomeNode.nodes.get(uid="a864b92...325a5c") which, however, does not execute (yet). And following this, you would like to be able to do something like .delete(), at which point the full expression is effectively executed.

If that is correct (?), then this series of operations does not guarantee that a Node with uid="a864b92...325a5c" exists in the database... and that is where this would break down.

But, here is what I am thinking:

  1. Your intention is to form queries without first inflating nodes to cut down (what you perceive as) unnecessary delays.

    1. This is a valid concern, but you can still do this with the lazy inflate option (at the same time ensuring that the node is indeed there).
    2. If somehow you already have stored uids and you now want to apply a delete, then apply it through a direct cypher_query (Probably the most straightforward way to do this).
    3. At the moment, I can see this mode of operation being applicable to DELETE and UPDATE operations only.
  2. I am going to close this issue but retain our discussion as part of a future "enhancement" solely for the idea of instantiating "ghost" nodes or constructing some operations that can be applied as queries, as a potential way of applying "batch operations". The difficulty there would be in ensuring that the query is consistent with the state of the db.

All the best

@aanastasiou aanastasiou changed the title Lazy instanciate a node with unique property Lazy instantiate a node with unique property Mar 7, 2019
@aanastasiou aanastasiou added enhancement marked_for_imminent_closure Fully or partially obsolete issue. Request clarification, close if none provided. labels Mar 7, 2019