utf8tok and utf8tok_r #89

warmwaffles · 2021-06-15T15:57:19Z

I've been playing with adding utf8tok but the problem with the original implementation is that it is not re-entrant.

I've been looking at how musl implements strtok_r, and it's relatively simple. Here is my adaptation as utf8tok_r:

void *utf8tok_r(void *utf8_restrict str, const void *utf8_restrict sep, void **utf8_restrict ptr) {
  char* s = (char*) str;
  char** p = (char**) ptr;

  if (!s && !(s = *p)) {
    return NULL;
  }

  s += utf8spn(s, sep);
  if (!*s) {
    return *p = 0;
  }

  *p = s + utf8cspn(s, sep);
  if (**p) {
    *(*p)++ = 0;
  } else {
    *p = 0;
  }

  return s;
}

The following is the test I implemented (it fails at the assert for föőf):

UTEST(utf8tok_r, token_walking) {
    char* string = utf8dup("this|aäáé|föőf|that|");
    char* ptr = NULL;

    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "this", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "aäáé", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "föőf", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "that", 4));

    free(string);
}

After playing with this for a bit, I am kind of at a loss for what to do.

Anyways, leaving this here in case someone else wants to pick it up and go on.


gulrak · 2021-12-12T10:16:33Z

I stumbled across this and had a look. The problem is that while utf8spn/utf8cspn are logically equivalent to strspn/strcspn, their results are counts of codepoints, not bytes, so they cannot simply be added to char pointers. Besides that, the strtok_r API requires the first parameter to be NULL on subsequent calls. Combining the two and adding another helper (I didn't find anything matching in utf8.h), I came up with:

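/* Advance str by up to len codepoints, stopping early at the terminating null. */
void *utf8incr(void *utf8_restrict str, size_t len) {
    char* s = (char*) str;
    while(*s && len--) {
        /* Size in bytes of the codepoint starting at s. */
        size_t l = utf8codepointcalcsize(s);
        /* Skip its bytes, but never step past the terminator. */
        while(*s && l--) ++s;
    }
    return s;
}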

void *utf8tok_r(void *utf8_restrict str, const void *utf8_restrict sep, void **utf8_restrict ptr) {
  char* s = (char*) str;
  char** p = (char**) ptr;

  if (!s && !(s = *p)) {
    return NULL;
  }

  s = utf8incr(s, utf8spn(s, sep));
  if (!*s) {
    return *p = 0;
  }

  *p = utf8incr(s, utf8cspn(s, sep));
  if (**p) {
    *(*p)++ = 0;
  } else {
    *p = 0;
  }

  return s;
}
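To make the mismatch concrete, here is a minimal standalone sketch that does not depend on utf8.h. `codepoints_before` and `bytes_before` are hypothetical helpers written only for this illustration, and the separator is assumed to be a single ASCII byte:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of codepoints before the first `sep` byte
 * (or before the terminator). Every byte that is not a UTF-8 continuation
 * byte (bit pattern 10xxxxxx) starts a new codepoint. */
size_t codepoints_before(const char *s, char sep) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s && *s != sep; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

/* Hypothetical helper: number of bytes before the first `sep` byte
 * (or before the terminator). This is the offset a char pointer needs. */
size_t bytes_before(const char *s, char sep) {
    size_t n = 0;
    assert(s != NULL);
    while (s[n] && s[n] != sep) {
        ++n;
    }
    return n;
}
```

For the token aäáé followed by |, the codepoint count is 4 but the separator sits at byte offset 7, so `s + 4` lands on the second byte of á rather than on the separator.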

And as a small change to the test:

UTEST(utf8tok_r, token_walking) {
    char* string = utf8dup("this|aäáé|föőf|that|");
    char* ptr = NULL;

    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "this", 4));
    /* Subsequent calls pass NULL; string itself stays intact so it can be freed. */
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(NULL, "|", &ptr), "aäáé", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(NULL, "|", &ptr), "föőf", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(NULL, "|", &ptr), "that", 4));

    free(string);
}
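For comparison, the calling convention being mirrored can be sketched with the POSIX strtok_r itself. `count_tokens` is a hypothetical helper for this illustration only, and the sketch assumes a POSIX environment where strtok_r is declared in string.h:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the tokens in buf, which is modified in place.
 * The first call passes the buffer; every later call passes NULL and relies
 * on the save pointer to resume where the previous call stopped. */
size_t count_tokens(char *buf, const char *sep) {
    char *save = NULL;
    char *tok;
    size_t n = 0;
    assert(buf != NULL && sep != NULL);
    for (tok = strtok_r(buf, sep, &save); tok != NULL;
         tok = strtok_r(NULL, sep, &save)) {
        ++n;
    }
    return n;
}
```

Passing the original buffer again instead of NULL would restart tokenizing from the beginning, which is why the test should only hand the string over on the first call.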

warmwaffles · 2021-12-13T22:46:15Z

@sheredom this is a pretty interesting find
