Support for charsets and unify internal character storage type #3
This pull request introduces two changes:
2e6add4: in places where text characters are processed, replace the char type with the YY_CHAR macro. This introduces a distinction between a character as a byte of a file ("physical") and a character as the smallest unit of text ("logical"). For the logical kind, the YY_CHAR macro should be used. For now both represent the same value, but in the future YY_CHAR may be replaced by a wider type. This is essential for supporting extended character sets (any set with more than 256 code points).
To introduce YY_CHAR usage properly, some tests were changed so that they no longer use the str* family of standard C functions and instead use versions that operate on YY_CHAR values. The new functions live in the tests/strutils.h header, and the test case for that file is located in the tests/test-tests_strutils directory.
f1456f9, 0d3107e: add support for character sets. The first commit adds a test case, the second adds the actual support. This makes it possible to convert between "physical" bytes and "logical" characters with a user-provided function. It allows input data to be encoded in a different character set than the one used to write the input ".l" file.
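To illustrate the first change (2e6add4), here is a minimal sketch of what str*-style helpers operating on YY_CHAR values could look like. The names yychar_strlen and yychar_strcmp are hypothetical and are not taken from tests/strutils.h; the typedef is only a stand-in so the sketch compiles on its own.

```c
#include <stddef.h>

/* Assumption: stand-in for the scanner's YY_CHAR so this compiles alone.
   In a generated scanner, YY_CHAR is provided by flex itself. */
typedef unsigned char YY_CHAR;

/* Hypothetical helper: length of a zero-terminated YY_CHAR string.
   Unlike strlen(), it keeps working if YY_CHAR becomes wider than a byte. */
static size_t yychar_strlen(const YY_CHAR *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: compare two zero-terminated YY_CHAR strings.
   Returns -1, 0 or 1 according to the ordering of YY_CHAR values. */
static int yychar_strcmp(const YY_CHAR *a, const YY_CHAR *b)
{
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    if (*a == *b)
        return 0;
    return (*a < *b) ? -1 : 1;
}
```

Because the helpers never assume sizeof(YY_CHAR) == 1, the same test code would stay valid if YY_CHAR were later replaced by a wider type.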
Charset support is completely optional. It is activated by %option charset. When it is active, two new variables become available: yycharset and yycharset_handler. The first should be set to the character set currently in use before calling yylex(). The second should be set to the function that will be called to convert incoming bytes into characters. Additionally, %option charset-source="ENCODING" should be set to the name of the encoding used to write the ".l" file; it is used as the scanner's internal character encoding. When the incoming data is in the same encoding as the internal one, the conversion function is not used.
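The conversion step described above can be sketched in plain C. This is only an illustration under stated assumptions: the description does not give the actual type of yycharset or the signature of yycharset_handler, so the handler shape, the wide_yy_char type, and the read_chars and latin1_handler functions below are all hypothetical. The call counter exists only to make the bypass observable.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: a wider stand-in for a possible future YY_CHAR. */
typedef unsigned int wide_yy_char;

/* Hypothetical handler shape: decode up to in_len bytes into out
   (capacity out_cap) and return the number of characters written. */
typedef size_t (*charset_handler_fn)(const unsigned char *in, size_t in_len,
                                     wide_yy_char *out, size_t out_cap);

/* Counts handler invocations; illustration only. */
static int handler_calls = 0;

/* Example handler for ISO-8859-1 input: every byte value is already
   the Unicode code point of the character it encodes. */
static size_t latin1_handler(const unsigned char *in, size_t in_len,
                             wide_yy_char *out, size_t out_cap)
{
    size_t n = in_len < out_cap ? in_len : out_cap;
    handler_calls++;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i];
    return n;
}

/* Hypothetical dispatch: when the input charset equals the charset the
   ".l" file was written in, bytes pass through and the handler is skipped.
   Assumption: charset names are compared as exact, case-sensitive strings. */
static size_t read_chars(const char *input_charset, const char *source_charset,
                         charset_handler_fn handler,
                         const unsigned char *in, size_t in_len,
                         wide_yy_char *out, size_t out_cap)
{
    if (strcmp(input_charset, source_charset) == 0) {
        size_t n = in_len < out_cap ? in_len : out_cap;
        for (size_t i = 0; i < n; i++)
            out[i] = in[i];
        return n;
    }
    return handler(in, in_len, out, out_cap);
}
```

In terms of the options above, input_charset plays the role of yycharset, source_charset the value given to %option charset-source, and handler the role of yycharset_handler.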