Facing SEVER Magic number verification failed for page 0 of xxxxx.pcl. [OWOWCache] #8560

Closed
ssenapat opened this issue Sep 23, 2018 · 24 comments

Comments

@ssenapat

OrientDB Version: 3.0.7

Java Version: 1.8

OS: CentOS

RAM: 512 MB

Expected behavior

It should work without losing vertexes, edges, or data.

Actual behavior

I was creating a new database model, creating vertexes and edges, and inserting data.
Suddenly OrientDB crashed and restarted. After the restart, the vertexes and edges were lost.

The log shows many "verification failed" messages, for example:
SEVER Magic number verification failed for page 0 of cvcontactprops.pcl. [OWOWCache]

Steps to reproduce

It has happened a couple of times.

I am sure something is wrong. My guess is that the 512 MB of RAM is causing the crash, but I am not sure.
I would appreciate your help with this. I am attaching the log file.

@andrii0lomakin andrii0lomakin self-assigned this Sep 24, 2018
@andrii0lomakin andrii0lomakin added this to the 3.0.x milestone Sep 24, 2018
@andrii0lomakin
Member

Hi @ssenapat, sure, I will try this case. A small question: do you create new classes while adding edges/vertexes, or do you create the whole schema beforehand?

@ssenapat
Author

ssenapat commented Sep 24, 2018 via email

@ssenapat
Author

ssenapat commented Sep 24, 2018 via email

@ssenapat
Author

ssenapat commented Sep 24, 2018 via email

@andrii0lomakin
Member

That, for sure, cannot cause a problem. @ssenapat I will look into this issue in a couple of days; I need to fix another issue first.

@ssenapat
Author

ssenapat commented Sep 25, 2018 via email

@gtadudeps

@Laa any update on this? We are also facing this issue.

@ssenapat
Author

ssenapat commented Oct 11, 2018 via email

@gtadudeps

@Laa is there anything we can do to avoid this? It is mainly observed when OrientDB crashes or restarts.
@ssenapat unfortunately we are very close to going to production with OrientDB, and this issue is putting the project at risk.

@andrii0lomakin
Member

@gtadudeps if you provide me with a test that reproduces the issue, I will fix it quickly. As I wrote to @ssenapat, right now we are busy with commercial support issues. @ssenapat was not right in saying that we cannot reproduce the issue; we are simply working on other issues at the moment, not on this one. A test that is likely to reproduce the issue will speed up the fix for your problem.

@nicolasembleton

nicolasembleton commented Oct 11, 2018

Confirmed that we are also seeing this in 3.0.8 and it's quite unfortunate.

I can't provide much to test as I'm not sure why it happens but we had to renew a lot of our servers as we had a few issues and had them synced over time. Also we have updated the writeQuorum from 3 to 1 quite painfully.

2018-10-11 13:39:53:076 SEVER Magic number verification failed for page 37842 of XXX.pcl

That's about all we can see. This is pretty serious. Everything syncs back in distributed mode then this happens randomly.

We have 12 GB of RAM and a distributed setup of 3 servers, all synced and started. They work for a couple of hours (after a long downtime), then the error starts to appear and the cluster never recovers from it.

Note that aside from the previous resync, there was no downtime and no crash; we always use shutdown.sh to stop a server, so we are using it properly.

@nicolasembleton

Switching off the WAL (as recommended by the docs, now that we use SSDs) produces a "WAL is unavailable, unable to restore" error. It's a bit of a bummer.

@gtadudeps

@nicolasembleton wouldn't switching off the WAL cause reliability issues? It seems too risky for data consistency. Moreover, does it solve this issue?

@gtadudeps

@Laa this issue is quite random, so there are no fixed steps to reproduce it, but one way could be to continuously perform write operations and intermittently restart OrientDB. Please treat it as high priority, as we need to clear this to go to beta. We were also pushing management to procure the Enterprise Edition of OrientDB, but this severely weakens our case.
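
A minimal sketch of what such a write-and-restart loop might look like, in Python. Everything here is hypothetical: `stress_with_restarts`, its three callables, and `InMemoryStore` are illustrative names, and nothing below calls an actual OrientDB API. To use it against a real server, the callables would have to be wired to a database client and to whatever restarts the server (for example a container restart).

```python
import random
from typing import Callable, List


def stress_with_restarts(
    write_batch: Callable[[int], int],   # writes n records, returns how many were written
    restart_server: Callable[[], None],  # restarts the database (hypothetical hook)
    count_records: Callable[[], int],    # returns the number of records currently stored
    cycles: int = 10,
    batch_size: int = 100,
    restart_probability: float = 0.3,
    seed: int = 0,
) -> List[int]:
    """Alternate write batches with random restarts.

    Returns the indexes of the cycles in which the record count after a
    restart did not match the number of records written so far.
    """
    rng = random.Random(seed)
    expected = 0
    mismatches: List[int] = []
    for cycle in range(cycles):
        expected += write_batch(batch_size)
        if rng.random() < restart_probability:
            restart_server()
            actual = count_records()
            if actual != expected:
                mismatches.append(cycle)
                # Re-baseline so later cycles report only new losses.
                expected = actual
    return mismatches


class InMemoryStore:
    """Stand-in for a real database, used only to dry-run the harness."""

    def __init__(self, lose_on_restart: int = 0) -> None:
        self.records = 0
        self.lose_on_restart = lose_on_restart

    def write_batch(self, n: int) -> int:
        self.records += n
        return n

    def restart(self) -> None:
        # Simulate losing some records across a restart.
        self.records = max(0, self.records - self.lose_on_restart)

    def count(self) -> int:
        return self.records
```

The callables are injected so the loop itself can be dry-run against `InMemoryStore` before pointing it at a real server. A non-empty result only shows that records went missing across a restart; the server log would still need to be checked for the magic number message.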

@andrii0lomakin
Member

Hi, we have made several changes to improve the durability system. Could you try 3.0.9?

@nicolasembleton

@gtadudeps yes, it does solve the magic number issue when using an SSD. It is the recommended setting when using an SSD, so I'd assume it's still OK (although we are going to switch back to HDD, as the WAL is indeed quite important).

@Laa that's great. Thanks for the heads up. We are going to test that out.

@andrii0lomakin
Member

@nicolasembleton switching off the WAL is strongly discouraged. Looking forward to your feedback.

@nicolasembleton

nicolasembleton commented Oct 20, 2018

@Laa I think the doc should be updated, because this page: https://orientdb.com/docs/last/Write-Ahead-Log.html says "If you have a SSD we suggest to use for database files only, not WAL."

Update: reading it again, I think I understand the ambiguity that I missed. The passage means "Don't use SSD for WAL, use normal HDD for WALs", not "If you have an SSD, don't use WAL". It's written in an ambiguous way.

@209

209 commented Feb 24, 2019

We have this problem now.
VPS: 1600 GB SSD (SSD only), 60 GB RAM, 10 cores.
We run OrientDB v3.0.15 in a Docker container.

In the log: "Magic number verification failed for page 9198".
So far we have only been adding many records (with duplicates, which are handled as insert errors), and we often start and stop the container.
Current size: 6 million vertexes, 30 million edges.
We want to increase the vertex count 20x, and the edge count even more.

But problems like this seriously hold up the process.

@andrii0lomakin
Member

@209 could you send me the stack trace that is printed when you see this exception?

@andrii0lomakin
Member

But please run the load starting from an empty database.

@andrii0lomakin
Member

@209 I have made a small change, just to be sure that your problem is fixed in 3.0.16, which we will release next week. Please try that distribution and send me feedback with a stack trace if it happens again (hopefully not).

@209 209 mentioned this issue Feb 24, 2019
@209

209 commented Feb 28, 2019

@Laa I see that OrientDB was updated, but the Docker container wasn't, so I can't try the new version.

@luigidellaquila
Member

Hi @209

The pull request was submitted to Docker a few hours ago, but it takes some time to be approved. I'd suggest checking again in the next 24 to 48 hours.

Thanks

Luigi
