Importing VertexID with UTF8 characters in CSV file #257
Comments
cc @veezhang
@goranc Hi, can you paste your CSV data here?
OK, I've spent a bit more time on this issue. What has changed since the previous test is that I now handle the strings as complete UTF-8 characters, so we avoid escape sequences that start with hexadecimal escape codes (like \xe6 in the previous example). Handling those could be a completely different feature. So let's concentrate on importing regular UTF-8 strings as VertexID. You can try to import the Domain data I got errors for and see that the issue here is the size. The tag definition for these records is:
And the INSERT commands that have issues look like this example:
Just to explain this hybrid VertexID structure: it is a combination of a lexicographic prefix and a hashing function, which we use to avoid collisions in the graph space. The VertexID is generated from the TAG property domain.name as: Note:
Dear @goranc, sorry, @whitewum was not aware that you cannot read Chinese. We have this screen capture in the Chinese documentation mentioning that one Chinese UTF-8 character takes 3 bytes (which may not be what you expected when calculating its length). As follows:
In this case the length is:
In [7]: 3 * len("中醫中藥") + len("cn.t4508929864515433325")
Out[7]: 35
We will add this information to the English documentation later, sorry for this.
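The arithmetic above can be reproduced with a short check. This is only a minimal sketch: `utf8_len` and `fits_fixed_string` are hypothetical helper names (not part of Nebula or nebula-importer), the limit of 28 comes from the `fixed_string(28)` space definition in this issue, and the order in which the two parts are concatenated is assumed for illustration.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_fixed_string(vid: str, limit: int = 28) -> bool:
    """Check a VID against a fixed_string(limit) space, counting bytes.

    Assumption: the limit applies to the UTF-8 byte length, as the
    comment above describes.
    """
    return utf8_len(vid) <= limit


cjk_part = "中醫中藥"                   # 4 characters, 3 bytes each in UTF-8
ascii_part = "cn.t4508929864515433325"  # 23 ASCII characters, 1 byte each

total = utf8_len(cjk_part) + utf8_len(ascii_part)
print(len(cjk_part), utf8_len(cjk_part), total)
print(fits_fixed_string(cjk_part + ascii_part))
```

Counting characters gives 27 for this pair, which looks like it fits, while the byte count is 35 and does not.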
We have this note in cn docs already. related issue: vesoft-inc/nebula-importer#257
OK, it is clear now what is going on behind the scenes.
Thanks @goranc, do you think this patch to the docs is enough? https://github.com/vesoft-inc/nebula-docs/pull/1871/files
Closing it, thanks @goranc!
Importing data with escaped UTF-8 characters for a string-type VertexID does not convert the input string to UTF-8 characters; the escaped characters are inserted as-is from the CSV file.
My Nebula cluster is running version 3.3.0.
VertexID is a fixed string type with a length of 28 characters.
I'm using a custom algorithm for the VertexID to avoid collisions. It is a combination of a lexicographic prefix, based on a string that is 8 characters long, concatenated with a hash (the standard Nebula hash function) converted to a string.
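As a rough illustration of such a prefix-plus-hash scheme, the sketch below builds a VID whose prefix is limited by bytes rather than characters. It is not the code used in this report: `byte_prefix`, `make_vid`, and `stand_in_hash` are hypothetical names, and `stand_in_hash` is a placeholder built on `hashlib`, not Nebula's hash function.

```python
import hashlib


def stand_in_hash(text: str) -> int:
    """Placeholder 64-bit signed hash; NOT Nebula's hash() function."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


def byte_prefix(text: str, budget: int = 8) -> str:
    """Longest prefix of `text` whose UTF-8 encoding fits in `budget` bytes."""
    # Slicing the byte string may cut a multi-byte character in half;
    # errors="ignore" drops the incomplete trailing bytes.
    return text.encode("utf-8")[:budget].decode("utf-8", errors="ignore")


def make_vid(name: str) -> str:
    """Hypothetical hybrid VID: byte-limited prefix + decimal hash string."""
    return byte_prefix(name) + str(stand_in_hash(name))


for name in ("stubhub手数料3.xyz", "downloads.sourceforge.net"):
    vid = make_vid(name)
    print(byte_prefix(name), len(vid.encode("utf-8")))
```

A signed 64-bit value printed in decimal takes at most 20 characters, so an 8-byte prefix keeps the whole VID within 28 bytes even when the name contains multi-byte characters.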
Steps to reproduce the behavior:
Create a space whose VertexID is defined as a fixed string of 28 characters
Create a TAG for the URL vertex
Import URL data containing specific UTF-8 characters
Examples:
CREATE SPACE IF NOT EXISTS graph(partition_num=128, replica_factor=3, vid_type=fixed_string(28));
USE graph;
CREATE TAG url(link string, subdomain_name string, domain_name string, protocol string, classification string);
Try to import data into TAG
"stubhub\xe6-2541048767624938324": ("http://stubhub手数料3.xyz","stubhub手数料3.xyz","stubhub手数料3.xyz","http",""),
"download1336853390718461484": ("http://downloads.sourceforge.net/project/orz123/a23.mp3?r=&ts=1448325706&use_mirror=heanet","downloads.sourceforge.net","sourceforge.net","http",""),
"oss.jfro1186231920510779202": ("https://oss.jfrog.org/artifactory/jcenter-remote/com/google/apis/google-api-services-cloudkms/v1-rev20-1.21.0/google-api-services-cloudkms-v1-rev20-1.21.0.jar","oss.jfrog.org","jfrog.org","https","");
ErrMsg: Storage Error: The VID must be a 64-bit integer or a string fitting space vertex id length limit., ErrCode: -1005
We expect the VertexID field to contain the specific characters exactly as they appear in the Domain field, but instead they are not converted and we get an error that the VertexID exceeds the length limit.
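To see why only the first record is rejected, the byte lengths of the three VIDs above can be compared with the limit. This is a small sketch that assumes the `\xe6` text reaches the storage layer unchanged, as described in this report; raw string literals are used so that Python keeps the backslash sequence as four literal characters.

```python
LIMIT = 28  # from vid_type=fixed_string(28)

# Raw strings keep the backslash sequence exactly as it appears in the CSV file.
vids = [
    r"stubhub\xe6-2541048767624938324",
    r"download1336853390718461484",
    r"oss.jfro1186231920510779202",
]

lengths = {vid: len(vid.encode("utf-8")) for vid in vids}
too_long = [vid for vid, n in lengths.items() if n > LIMIT]

for vid, n in lengths.items():
    print(f"{n:2d} bytes  {'TOO LONG' if n > LIMIT else 'ok'}  {vid}")
```

The first VID comes to 31 bytes when the four characters of `\xe6` are counted literally; if that sequence occupied a single byte, the same string would be exactly 28 bytes.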