Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linking in Oniguruma (-lonig) breaks POSIX regex #233

Closed
SammyK opened this issue Apr 1, 2021 · 3 comments
Closed

Linking in Oniguruma (-lonig) breaks POSIX regex #233

SammyK opened this issue Apr 1, 2021 · 3 comments

Comments

@SammyK
Copy link

SammyK commented Apr 1, 2021

This issue relates to #210 and it has been fixed since 6.9.5-rc1. But I'm a bit stuck on how to accommodate for older versions of Oniguruma from the perspective of shared libraries.

The core issue is that Oniguruma <= 6.9.4 will break POSIX regex when it is linked in (-lonig). For example:

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t regex = {0};
    regcomp(&regex, "^[0-9]+:.*$", REG_EXTENDED | REG_NOSUB);

    const char str[] = "1234:abcd\n";

    if (regexec(&regex, str, 0, NULL, 0) == 0) {
        fprintf(stdout, "Matched.\n");
    } else {
        fprintf(stderr, "Failed to match.\n");
    }

    regfree(&regex);
    return 0;
}
$ gcc test.c && ./a.out
Matched.
$ gcc test.c -lonig && ./a.out
Failed to match.

The bug does not seem to affect BRE, so if I change the syntax to BRE and remove REG_EXTENDED, the above code will work as expected when linking in Oniguruma. However, BRE returns unexpected results when using pmatch.

#include <regex.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    regex_t regex = {0};
    regcomp(&regex, "[0-9]\\{1,20\\}", 0);

    const char str[] = "1234:abcd\n";

    regmatch_t pmatch[1];
    if (regexec(&regex, str, 1, pmatch, 0) == 0) {
        char buf[64];

        size_t len = (size_t)pmatch[0].rm_eo - pmatch[0].rm_so;
        fprintf(stdout, "Matched len: %zu.\n", len);
        
        memcpy(buf, str + pmatch[0].rm_so, len);
        buf[len] = '\0';
        fprintf(stdout, "Matched str: '%s'.\n", buf);
    } else {
        fprintf(stderr, "Failed to match.\n");
    }

    regfree(&regex);
    return 0;
}
$ gcc -g test_match.c && ./a.out
Matched len: 4.
Matched str: '1234'.
$ gcc -g test_match.c -lonig && ./a.out
Matched len: 18446744056529682432.
Illegal instruction: 4

In my case, I maintain a PHP extension that is always built as a shared library and is dynamically linked into a build of PHP that is often built with -lonig since it was unbundled from ext/mbstring in PHP 7.4. The part of the extension I'm working on is decoupled from the Zend Engine so there is no way to detect if Oniguruma has been loaded or not. I can still detect "broken regex behavior", but I don't know what the workaround for this issue would be.

Do you have any advice for shared libraries to get POSIX regex working when they are dynamically loaded into a build with -lonig?

@kkos
Copy link
Owner

kkos commented Apr 2, 2021

I am sorry for any inconvenience this may cause.
However, since it is impossible to fix something that has been released in the past, it is difficult to deal with.
I was able to get it working on MacOS X using the following method.

#include <regex.h>
#include <string.h>
#include <stdio.h>
#include <dlfcn.h>

/* MacOS X */
#define POSIX_REGEX_PATH   "libSystem.B.dylib"

typedef int regcomp_type(regex_t* preg, const char* pattern, int cflags);
typedef int regexec_type(const regex_t* preg, const char* string,
                         size_t nmatch, regmatch_t pmatch[], int eflags);
typedef void regfree_type(regex_t *preg);

int main(void)
{
  void* handle;
  regcomp_type* regcomp_func;
  regexec_type* regexec_func;
  regfree_type* regfree_func;

  regmatch_t pmatch[1];
  regex_t regex = {0};
  const char str[] = "1234:abcd\n";

  if ((handle = (void* )dlopen(POSIX_REGEX_PATH, RTLD_NOLOAD)) == NULL) {
    char* e = dlerror();
    fprintf(stderr, "ERROR: dlopen: %s\n", e);
    return -1;
  }

  regcomp_func = dlsym(handle, "regcomp");
  if (regcomp_func == NULL) {
    char* e = dlerror();
    fprintf(stderr, "ERROR: %s\n", e);
    return -1;
  }

  if (regcomp_func == regcomp) {
    fprintf(stdout, "Use POSIX regex as is.\n");

    regexec_func = regexec;
    regfree_func = regfree;
  }
  else {
    fprintf(stdout, "Escape Oniguruma regex.\n");

    regexec_func = dlsym(handle, "regexec");
    if (regexec_func == NULL) {
      char* e = dlerror();
      fprintf(stderr, "ERROR: %s\n", e);
      return -1;
    }
    regfree_func = dlsym(handle, "regfree");
    if (regfree_func == NULL) {
      char* e = dlerror();
      fprintf(stderr, "ERROR: %s\n", e);
      return -1;
    }
  }

  regcomp_func(&regex, "[0-9]\\{1,20\\}", 0);

  if (regexec_func(&regex, str, 1, pmatch, 0) == 0) {
    char buf[64];

    size_t len = (size_t)pmatch[0].rm_eo - pmatch[0].rm_so;
    fprintf(stdout, "Matched len: %zu.\n", len);
        
    memcpy(buf, str + pmatch[0].rm_so, len);
    buf[len] = '\0';
    fprintf(stdout, "Matched str: '%s'.\n", buf);
  } else {
    fprintf(stderr, "Failed to match.\n");
  }

  regfree_func(&regex);
  return 0;
}

@rofl0r
Copy link

rofl0r commented Apr 2, 2021

Illegal instruction: 4

looks like the regmatch_t/regoff_t types you got the libc's definition of dont match the ones ONIG is using, therefore accessing the pmatch member invokes UB.

i'd suggest to use the pmatch-less test you developed to check whether old onig is loaded and then do one of the following 2 things:

  • print an error message that onig needs to be updated to latest and abort
  • copy the definition of onig's regmatch_t and regoff_t into your code and use those from then on. that should make even old onig behave as expected. you basically need 2 different codepaths, one using the libc types and one the custom types you defined.

another alternative is to always use the onig_ prefixed version and types (using the onig header rather than regex.h) and make onig a hard dependency.

@SammyK
Copy link
Author

SammyK commented Apr 2, 2021

Thank you both for the advice and suggestions.

In our case, since we're doing some pretty basic pattern matching on small strings, the workaround we ended up going with is to match using BRE and REG_NOSUB. Once we have a match we can use sscanf or traverse the string character by character to extract the pattern.

So just to summarize for anyone else who maintains a shared lib running into this issue, some workaround options are:

  1. Use dlsym to resolve the symbols at runtime. (See @kkos's example above.)
  2. Detect broken regex behavior and fall back to copied versions of Oniguruma's regmatch_t and regoff_t; assuming you can detect that regex is broken because of Oniguruma and not some other reason. (As @rofl0r suggested.)
  3. Vendor in/have a hard dependency on Oniguruma for regex. (As @rofl0r also suggested.)
  4. Use basic regular expression (BRE) syntax with no position matching (REG_NOSUB) and manually extract the desired pattern from the matched string with sscanf or something similar.

Thanks again for the help!

@SammyK SammyK closed this as completed Apr 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants