Support for charsets and unify internal character storage type #3

Closed
wants to merge 3 commits into from

mplucinski
Contributor

This pull request introduces two changes:

  • 2e6add4: wherever text characters are processed, replaces the char type with the YY_CHAR macro. This distinguishes a character as a byte of the file ("physical") from a character as the smallest unit of text ("logical"); logical characters should use YY_CHAR. For now both have the same representation, but in the future YY_CHAR may be replaced by a wider type. That is a prerequisite for supporting extended character sets (any set with more than 256 code points).

    To introduce YY_CHAR consistently, some tests no longer use the str* family of standard C functions and instead use equivalents that operate on YY_CHAR values. The new functions live in the tests/strutils.h header, and the test case for that header is in the tests/test-tests_strutils directory.

  • f1456f9, 0d3107e: add support for character sets. The first commit adds a test case, the second the actual support. This provides a way to convert between "physical" bytes and "logical" characters using a user-provided function, so input data can be encoded in a different character set than the one used to write the input ".l" file.

    Charset support is completely optional and is enabled by %option charset. When it is enabled, two new variables become available: yycharset and yycharset_handler. The first should be set to the character set currently in use before calling yylex(). The second should be set to the function that will be called to convert incoming bytes into characters. Additionally, %option charset-source="ENCODING" should be set to the name of the encoding used to write the ".l" file; it is used as the scanner's internal character encoding. When the incoming data is in the same encoding as the internal one, the conversion function is not called.
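
To illustrate the conversion step described above, here is a minimal standalone sketch in C. It does not use the generated scanner at all, and everything in it is an assumption made for illustration: the `logical_char` type (standing in for a widened YY_CHAR), the `charset_handler_fn` signature, and the `to_logical` helper are hypothetical, since the actual prototype of yycharset_handler is not shown in this thread. The sketch only shows the general shape: a user-provided function turns physical bytes into logical characters, and it is bypassed when the input encoding matches the internal one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for YY_CHAR, widened here only for illustration. */
typedef unsigned int logical_char;

/* Hypothetical handler type (not the PR's real prototype): converts in_len
 * physical bytes into at most out_cap logical characters and returns how
 * many logical characters were written. */
typedef size_t (*charset_handler_fn)(const unsigned char *in, size_t in_len,
                                     logical_char *out, size_t out_cap);

/* Example handler for UTF-8 input. Only 1- and 2-byte sequences are decoded;
 * every other byte becomes '?' to keep the sketch short. */
static size_t utf8_handler(const unsigned char *in, size_t in_len,
                           logical_char *out, size_t out_cap)
{
    size_t i = 0, n = 0;
    while (i < in_len && n < out_cap) {
        unsigned char b = in[i];
        if (b < 0x80) {                       /* ASCII: one byte, one character */
            out[n++] = b;
            i += 1;
        } else if (b >= 0xC2 && b <= 0xDF && i + 1 < in_len
                   && (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence: 5 payload bits, then 6 payload bits. */
            out[n++] = ((logical_char)(b & 0x1F) << 6) | (in[i + 1] & 0x3F);
            i += 2;
        } else {                              /* Not handled in this sketch */
            out[n++] = '?';
            i += 1;
        }
    }
    return n;
}

/* Hypothetical helper: skip the handler when both encodings have the same
 * name, otherwise delegate the conversion to the user-provided function. */
static size_t to_logical(const char *input_charset, const char *internal_charset,
                         charset_handler_fn handler,
                         const unsigned char *in, size_t in_len,
                         logical_char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);
    if (strcmp(input_charset, internal_charset) == 0) {
        size_t n = in_len < out_cap ? in_len : out_cap;
        for (size_t i = 0; i < n; i++)
            out[i] = in[i];                   /* one byte is one character */
        return n;
    }
    assert(handler != NULL);
    return handler(in, in_len, out, out_cap);
}
```

With this sketch, the five UTF-8 bytes of "café" passed through `to_logical("UTF-8", "ISO-8859-1", utf8_handler, ...)` come out as four logical characters, the last being 0xE9, which is the same value a Latin-1 scanner would see for "é". Passing Latin-1 input with both charset names equal copies the bytes through without calling any handler.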

@mplucinski
Contributor Author

Forgot to mention: the first commit also introduces a "char header file". The idea is to put the generated YY_CHAR definition in a separate file, so that the macro is available in places where the general scanner header cannot be included, e.g. Bison files.
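
As a rough illustration of the char header idea and of string helpers that operate on YY_CHAR rather than char, a sketch might look like the following. The header contents and the helper names `yychar_strlen` and `yychar_strcmp` are hypothetical; the real generated header and the real functions in tests/strutils.h are not shown in this thread and may differ.

```c
#include <stddef.h>

/* Hypothetical content of a generated char header: YY_CHAR is plain char
 * here, but a later version could define a wider type instead. */
#ifndef YY_CHAR
#define YY_CHAR char
#endif

/* Hypothetical helper: length of a YY_CHAR string ended by a zero element. */
static size_t yychar_strlen(const YY_CHAR *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: three-way comparison of two YY_CHAR strings.
 * Returns a negative value, zero, or a positive value. Elements are compared
 * as YY_CHAR values, not as unsigned char the way the standard strcmp does. */
static int yychar_strcmp(const YY_CHAR *a, const YY_CHAR *b)
{
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}
```

Because the helpers never mention char directly, code written against them would keep compiling if YY_CHAR were later defined as a wider type, which is the point of routing logical characters through the macro.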

@westes
Owner

westes commented Jul 25, 2014

Could you rebase this onto the current tip of master?

The usual pull methods are generating conflicts.

@mplucinski mplucinski mentioned this pull request Jul 25, 2014
@mplucinski
Contributor Author

Sure, I've just submitted it as request #8.

@mplucinski mplucinski closed this Jul 25, 2014
@westes westes mentioned this pull request Apr 16, 2018
eric-s-raymond referenced this pull request in eric-s-raymond/flex Sep 21, 2020
#3 in the retargeting patch series.