Mysql mb4 #6409

Open
wants to merge 28 commits into
base: release-2.1
Commits
81aa8b7 Installer & upgrader to utf8mb4 (sbulen, Dec 8, 2020)
2a3c0ef UTF8MB4 maint functions (sbulen, Dec 9, 2020)
eef6185 Only utf8 going forward (sbulen, Dec 13, 2020)
4ffa40d Only utf8 going forward (sbulen, Dec 13, 2020)
ddec442 Only utf8 going forward (sbulen, Dec 13, 2020)
d2bfbe8 All utf8 going forward (sbulen, Dec 13, 2020)
a95298f All utf8 going forward (sbulen, Dec 13, 2020)
d7638d3 Only utf8 going forward (sbulen, Dec 13, 2020)
de79aa8 Only utf8 going forward (sbulen, Dec 13, 2020)
89a913d Only utf8 going forward (sbulen, Dec 13, 2020)
0d64abd Only utf8 going forward (sbulen, Dec 13, 2020)
8438e2b Only utf8 going forward (sbulen, Dec 13, 2020)
4af44dc Only utf8 going forward (sbulen, Dec 13, 2020)
26b7ad8 Too small; make consistent with other calls (sbulen, Dec 14, 2020)
398739a Only utf8 going forward (sbulen, Sep 24, 2021)
d10533d Use unicode (sbulen, Sep 24, 2021)
0adbe34 Only utf8 going forward (sbulen, Sep 24, 2021)
b588dc8 Only utf8 going forward (sbulen, Sep 24, 2021)
ed6bb67 Must also clean up old alt index names (sbulen, Sep 24, 2021)
26b310a Use unicode (sbulen, Sep 25, 2021)
9ac9de4 Installer & upgrader to utf8mb4 (sbulen, Sep 25, 2021)
b64d621 Merge remote-tracking branch 'upstream/release-2.1' into mysql_mb4 (sbulen, Jul 9, 2023)
056f035 Preserve legacy setting in case used by mods (sbulen, Jul 9, 2023)
13a6271 Preserve legacy settings in case used by mods (sbulen, Jul 9, 2023)
583f69f Only utf8 going forward (sbulen, Jul 12, 2023)
a9bac51 More mb4 index changes (sbulen, Jul 13, 2023)
0250924 mb_substr required here for upgrades (sbulen, Jul 16, 2023)
3168adb Merge remote-tracking branch 'upstream/release-2.1' into mysql_mb4 (sbulen, Nov 5, 2023)
2 changes: 1 addition & 1 deletion SSI.php
@@ -203,7 +203,7 @@

// @todo: probably not the best place, but somewhere it should be set...
if (!headers_sent())
-header('content-type: text/html; charset=' . (empty($modSettings['global_character_set']) ? (empty($txt['lang_character_set']) ? 'ISO-8859-1' : $txt['lang_character_set']) : $modSettings['global_character_set']));
+header('content-type: text/html; charset=UTF-8');

// Take care of any banning that needs to be done.
if (isset($_REQUEST['ssi_ban']) || (isset($ssi_ban) && $ssi_ban === true))
2 changes: 2 additions & 0 deletions Sources/DbPackages-mysql.php
@@ -226,6 +226,8 @@ function smf_db_create_table($table_name, $columns, $indexes = array(), $paramet
$table_query .= ') ENGINE=' . $parameters['engine'];
if (!empty($db_character_set) && $db_character_set == 'utf8')
$table_query .= ' DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci';
+else
Collaborator

Maybe an else if would be better here?

Contributor Author

It's either 'utf8' (currently an alias of 'utf8mb3') or 'utf8mb4'. These are the only two options at this point. And as noted in the description to this PR, we are defaulting to 'utf8mb4' going forward.
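
For illustration, the branch under discussion can be sketched as a standalone function. This is a minimal sketch, not code from the PR: the helper name `table_charset_clause()` is hypothetical, and it assumes, as the reply above states, that 'utf8' and 'utf8mb4' are the only two values in play, so a plain else is enough.

```php
<?php
// Hypothetical helper (not part of the PR): returns the table options suffix
// chosen by the if/else branch in smf_db_create_table().
function table_charset_clause($db_character_set)
{
	// Installs that explicitly request 'utf8' (an alias of utf8mb3) keep it.
	if (!empty($db_character_set) && $db_character_set == 'utf8')
		return ' DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci';

	// Anything else, including an unset value, defaults to utf8mb4.
	return ' DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci';
}

echo table_charset_clause('utf8'), "\n";
echo table_charset_clause('utf8mb4'), "\n";
echo table_charset_clause(null), "\n";
```

With an else if that tested for 'utf8mb4' explicitly, an empty or unexpected value would produce no charset clause at all, which is the case the plain else avoids.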

+$table_query .= ' DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci';

// Create the table!
$smcFunc['db_query']('', $table_query,
2 changes: 1 addition & 1 deletion Sources/Drafts.php
@@ -495,7 +495,7 @@ function XmlDraft($id_draft)
{
global $txt, $context;

-header('content-type: text/xml; charset=' . (empty($context['character_set']) ? 'ISO-8859-1' : $context['character_set']));
+header('content-type: text/xml; charset=UTF-8');

echo '<?xml version="1.0" encoding="', $context['character_set'], '"?>
<drafts>
45 changes: 20 additions & 25 deletions Sources/Load.php
@@ -101,9 +101,8 @@ function reloadSettings()
if (empty($modSettings['force_ssl']))
$image_proxy_enabled = false;

-// UTF-8 ?
-$utf8 = (empty($modSettings['global_character_set']) ? $txt['lang_character_set'] : $modSettings['global_character_set']) === 'UTF-8';
-$context['utf8'] = $utf8;
+// Preserve legacy utf8 variables in case used by mods
+$context['utf8'] = $utf8 = 'UTF-8';

// Set a list of common functions.
$ent_list = '&(?:#' . (empty($modSettings['disableEntityCheck']) ? '\d{1,7}' : '021') . '|quot|amp|lt|gt|nbsp);';
@@ -115,9 +114,9 @@ function reloadSettings()
{
return (string) $string;
};
-$fix_utf8mb4 = function($string) use ($utf8, $smcFunc)
+$fix_utf8mb4 = function($string) use ($smcFunc)
{
-if (!$utf8 || $smcFunc['db_mb4'])
+if ($smcFunc['db_mb4'])
return $string;

$i = 0;
@@ -162,26 +161,26 @@ function reloadSettings()
$num = $string[0] === 'x' ? hexdec(substr($string, 1)) : (int) $string;
return $num < 0x20 || $num > 0x10FFFF || ($num >= 0xD800 && $num <= 0xDFFF) || $num === 0x202E || $num === 0x202D ? '' : '&#' . $num . ';';
},
-'htmlspecialchars' => function($string, $quote_style = ENT_COMPAT, $charset = 'ISO-8859-1') use ($ent_check, $utf8, $fix_utf8mb4, &$smcFunc)
+'htmlspecialchars' => function($string, $quote_style = ENT_COMPAT, $charset = 'UTF-8') use ($ent_check, $fix_utf8mb4, &$smcFunc)
{
$string = $smcFunc['normalize']($string);

-return $fix_utf8mb4($ent_check(htmlspecialchars($string, $quote_style, $utf8 ? 'UTF-8' : $charset)));
+return $fix_utf8mb4($ent_check(htmlspecialchars($string, $quote_style, $charset)));
},
-'htmltrim' => function($string) use ($utf8, $ent_check)
+'htmltrim' => function($string) use ($ent_check)
{
// Preg_replace space characters depend on the character set in use
-$space_chars = $utf8 ? '\p{Z}\p{C}' : '\x00-\x20\x80-\xA0';
+$space_chars = '\p{Z}\p{C}';

-return preg_replace('~^(?:[' . $space_chars . ']|&nbsp;)+|(?:[' . $space_chars . ']|&nbsp;)+$~' . ($utf8 ? 'u' : ''), '', $ent_check($string));
+return preg_replace('~^(?:[' . $space_chars . ']|&nbsp;)+|(?:[' . $space_chars . ']|&nbsp;)+$~u', '', $ent_check($string));
},
-'strlen' => function($string) use ($ent_list, $utf8, $ent_check)
+'strlen' => function($string) use ($ent_list, $ent_check)
{
-return strlen(preg_replace('~' . $ent_list . ($utf8 ? '|.~u' : '~'), '_', $ent_check($string)));
+return strlen(preg_replace('~' . $ent_list . '|.~u', '_', $ent_check($string)));
},
-'strpos' => function($haystack, $needle, $offset = 0) use ($utf8, $ent_check, $ent_list, $modSettings)
+'strpos' => function($haystack, $needle, $offset = 0) use ($ent_check, $ent_list, $modSettings)
{
-$haystack_arr = preg_split('~(' . $ent_list . '|.)~' . ($utf8 ? 'u' : ''), $ent_check($haystack), -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
+$haystack_arr = preg_split('~(' . $ent_list . '|.)~u', $ent_check($haystack), -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

if (strlen($needle) === 1)
{
@@ -190,7 +189,7 @@ function reloadSettings()
}
else
{
-$needle_arr = preg_split('~(' . $ent_list . '|.)~' . ($utf8 ? 'u' : '') . '', $ent_check($needle), -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
+$needle_arr = preg_split('~(' . $ent_list . '|.)~u' . '', $ent_check($needle), -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$needle_size = count($needle_arr);

$result = array_search($needle_arr[0], array_slice($haystack_arr, $offset));
@@ -204,9 +203,9 @@ function reloadSettings()
return false;
}
},
-'substr' => function($string, $start, $length = null) use ($utf8, $ent_check, $ent_list, $modSettings)
+'substr' => function($string, $start, $length = null) use ($ent_check, $ent_list, $modSettings)
{
-$ent_arr = preg_split('~(' . $ent_list . '|.)~' . ($utf8 ? 'u' : '') . '', $ent_check($string), -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
+$ent_arr = preg_split('~(' . $ent_list . '|.)~u' . '', $ent_check($string), -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
return $length === null ? implode('', array_slice($ent_arr, $start)) : implode('', array_slice($ent_arr, $start, $length));
},
'strtolower' => function($string) use (&$smcFunc)
@@ -220,10 +219,10 @@ function reloadSettings()
'truncate' => function($string, $length) use ($utf8, $ent_check, $ent_list, &$smcFunc)
{
$string = $ent_check($string);
-preg_match('~^(' . $ent_list . '|.){' . $smcFunc['strlen'](substr($string, 0, $length)) . '}~' . ($utf8 ? 'u' : ''), $string, $matches);
+preg_match('~^(' . $ent_list . '|.){' . $smcFunc['strlen'](substr($string, 0, $length)) . '}~u', $string, $matches);
$string = $matches[0];
while (strlen($string) > $length)
-$string = preg_replace('~(?:' . $ent_list . '|.)$~' . ($utf8 ? 'u' : ''), '', $string);
+$string = preg_replace('~(?:' . $ent_list . '|.)$~u', '', $string);
return $string;
},
'ucfirst' => function($string) use (&$smcFunc)
@@ -321,15 +320,12 @@ function reloadSettings()

return random_bytes($length);
},
-'normalize' => function($string, $form = 'c') use ($utf8)
+'normalize' => function($string, $form = 'c')
{
global $sourcedir;

$string = (string) $string;

-if (!$utf8)
-return $string;
-
require_once($sourcedir . '/Subs-Charset.php');

$normalize_func = 'utf8_normalize_' . strtolower((string) $form);
@@ -3302,7 +3298,6 @@ function getBoardParents($id_parent)

/**
* Attempt to reload our known languages.
-* It will try to choose only utf8 or non-utf8 languages.
*
* @param bool $use_cache Whether or not to use the cache
* @return array An array of information about available languages
@@ -3507,7 +3502,7 @@ function template_include($filename, $once = false)
ob_start();

if (isset($_GET['debug']))
-header('content-type: application/xhtml+xml; charset=' . (empty($context['character_set']) ? 'ISO-8859-1' : $context['character_set']));
+header('content-type: application/xhtml+xml; charset=UTF-8');

// Don't cache error pages!!
header('expires: Mon, 26 Jul 1997 05:00:00 GMT');
10 changes: 5 additions & 5 deletions Sources/ManageLanguages.php
@@ -1115,13 +1115,13 @@ function($val1, $val2)
// Read in the file's contents and process it into entries.
// Also, remove any lines for uneditable variables like $forum_copyright from the working data.
$entries = array();
-foreach (preg_split('~^(?=\$(?:' . implode('|', $string_types) . ')\[\'([^\n]+?)\'\])~m' . ($context['utf8'] ? 'u' : ''), preg_replace('~\s*\n(\$(?!(?:' . implode('|', $string_types) . '))[^\n]*)~', '', file_get_contents($current_file))) as $blob)
+foreach (preg_split('~^(?=\$(?:' . implode('|', $string_types) . ')\[\'([^\n]+?)\'\])~mu', preg_replace('~\s*\n(\$(?!(?:' . implode('|', $string_types) . '))[^\n]*)~', '', file_get_contents($current_file))) as $blob)
{
// Comment lines at the end of the blob can make terrible messes
-$blob = preg_replace('~(\n[ \t]*//[^\n]*)*$~' . ($context['utf8'] ? 'u' : ''), '', $blob);
+$blob = preg_replace('~(\n[ \t]*//[^\n]*)*$~u', '', $blob);

// Extract the variable
-if (preg_match('~^\$(' . implode('|', $string_types) . ')\[\'([^\n]+?)\'\](?:\[\'?([^\n]+?)\'?\])?\s?=\s?(.+);([ \t]*(?://[^\n]*)?)$~ms' . ($context['utf8'] ? 'u' : ''), strtr($blob, array("\r" => '')), $matches))
+if (preg_match('~^\$(' . implode('|', $string_types) . ')\[\'([^\n]+?)\'\](?:\[\'?([^\n]+?)\'?\])?\s?=\s?(.+);([ \t]*(?://[^\n]*)?)$~msu', strtr($blob, array("\r" => '')), $matches))
{
// If no valid subkey was found, we need it to be explicitly null
$matches[3] = isset($matches[3]) && $matches[3] !== '' ? $matches[3] : null;
@@ -1196,7 +1196,7 @@ function($val1, $val2)
# Followed by a comma or the end of the string
(?=,|$)

-/x' . ($context['utf8'] ? 'u' : ''), $entryValue['entry'], $matches);
+/xu', $entryValue['entry'], $matches);

if (empty($m))
continue;
@@ -1427,7 +1427,7 @@ function($val1, $val2)
foreach ($final_saves as $save)
{
if (!empty($save['is_regex']))
-$file_contents = preg_replace('~' . $save['find'] . '~' . ($context['utf8'] ? 'u' : ''), $save['replace'], $file_contents);
+$file_contents = preg_replace('~' . $save['find'] . '~u', $save['replace'], $file_contents);
else
$file_contents = str_replace($save['find'], $save['replace'], $file_contents);
}