Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Term::text() doesn't panic if the term isn't valid UTF-8 #138

Closed
kaedroho opened this issue May 9, 2017 · 8 comments
Closed

Term::text() doesn't panic if the term isn't valid UTF-8 #138

kaedroho opened this issue May 9, 2017 · 8 comments
Milestone

Comments

@kaedroho
Copy link
Contributor

kaedroho commented May 9, 2017

There's a problem with the implementation of the following method:

    /// Returns the text associated with the term.
    ///
    /// # Panics
    /// If the value is not valid utf-8. This may happen
    /// if the index is corrupted or if you try to 
    /// call this method on a non-string type.
    pub unsafe fn text(&self) -> &str {
        str::from_utf8_unchecked(self.value())
    }

(https://github.com/tantivy-search/tantivy/blob/0dad02791c74e572ec005abb1f75e3601bd25db5/src/schema/term.rs#L110-L118)

Unlike the comment suggests, str::from_utf8_unchecked does not panic if the passed value isn't valid UTF-8 (If it did, the function wouldn't be unsafe). This is pretty bad as if it's invalid, the returned string would be invalid too and could cause a confusing error later on.

The following implementation should make it behave correctly:

    /// Returns the text associated with the term.
    ///
    /// # Panics
    /// If the value is not valid utf-8. This may happen
    /// if the index is corrupted or if you try to 
    /// call this method on a non-string type.
    pub fn text(&self) -> &str {
        str::from_utf8(self.value()).expect("Term is not a valid UTF-8 string")
    }
@fulmicoton
Copy link
Collaborator

Good point. Will fix

fulmicoton added a commit that referenced this issue May 11, 2017
Also.

Cleaned up the code to make sure that the logic
is only in one place.
Removed allocate_vec

Closes #141
Closes #139
Closes #142
Closes #138
fulmicoton added a commit that referenced this issue May 11, 2017
Also.

Cleaned up the code to make sure that the logic
is only in one place.
Removed allocate_vec

Closes #141
Closes #139
Closes #142
Closes #138
fulmicoton added a commit that referenced this issue May 11, 2017
Also.

Cleaned up the code to make sure that the logic
is only in one place.
Removed allocate_vec

Closes #141
Closes #139
Closes #142
Closes #138
@BurntSushi
Copy link

Why not return an error and let the caller decide whether to panic or not?

@fulmicoton
Copy link
Collaborator

@BurntSushi Not sure what is the best here.

Let me list the case where this can panic...

String terms are supposedly built by passing a &stras an argument, so right now :
we panics in the following case

a) the data is corrupted.
b) there is a bug in tantivy
c) the user is trying to extract text from a int fields
d) the user has built the Term using some non-standard API, and filled non-valid data

I chose (I think you have the same policy in fst) that tantivy may panic if its data is corrupted. b) is not really a valid reason to return a result.

So we are left with c) and d).
Ideally I should rely more on the type system to remove the problem of c). I'll see if I can fix that.

I just marked as pub(crate) the functions Term::set_content and Term::from_bytes so that d) is not really possible.

What do you guys think? Should I return a Result for the case c) ?

@BurntSushi
Copy link

@fulmicoton Thanks for explaining! I think (a) and (b) would be fine reasons to panic. You're right that the fst crate could panic if the data is corrupted (it could also produce silently wrong results).

(c) and (d) are trickier. I don't know what the right answer is. One possible alternative is that you leave text as is, and add a try_text method which returns a Result. (There is some precedence for this naming convention in std, although not specifically for string types.)

@fulmicoton
Copy link
Collaborator

Just to make it clear. d) has been ruled out.

I am wondering if user might misunderstand the semantics of try_text() and assume that it can be used to test if the field was textual. try_text().is_ok() could happen with some u64 values.

I will let like that for the moment and fix the API eventually to had a typed version of field and term.

@fulmicoton fulmicoton added this to the 0.4.0 milestone Jul 13, 2017
@ansjsun
Copy link

ansjsun commented Mar 6, 2020

hi @fulmicoton

my _id is [u8] but I need it to index for delete.

if set unsafe { String::from_utf8_unchecked(Vec::from(key)) }

but when search has panic .how to solve it?

@fulmicoton
Copy link
Collaborator

@ansjun Tantivy does not support indexed &[u8] at the moment.

As a workaround, could you encode your "_id" in base64 ?

@ansjsun
Copy link

ansjsun commented Mar 6, 2020

Yes, that's the only way right now

thx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants