Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emojis in Comments #2077

Open
JoeCohen opened this issue Mar 29, 2024 · 6 comments · May be fixed by #2477
Open

Emojis in Comments #2077

JoeCohen opened this issue Mar 29, 2024 · 6 comments · May be fixed by #2477
Labels
bug Error Throws an Error

Comments

@JoeCohen
Copy link
Member

Emojis throw errors when creating Comments.
The code block below shows what happened locally when I clicked Create after putting an emoji in a Summary. The same thing happens for emojis in the Comment body.
Not sure what to do about this:

  • Allow emojis, vs
  • Prohibit emojis but don't throw Errors.
ActiveRecord::StatementInvalid in CommentsController#create
Trilogy::ProtocolError: 1366: Incorrect string value: '\xF0\x9F\xA5\xB8' for column 'summary' at row 1
Rails.root: /Users/joe/mushroom-observer

Application Trace | Framework Trace | Full Trace
Request
Parameters:

{"utf8"=>"✓", "authenticity_token"=>"[FILTERED]", "comment"=>{"summary"=>"asdf🥸", "comment"=>""}, "commit"=>"Create", "q"=>"1oaSB", "target"=>"547363", "type"=>"Observation"}
@JoeCohen JoeCohen added bug Error Throws an Error labels Mar 29, 2024
@JoeCohen
Copy link
Member Author

JoeCohen commented Mar 30, 2024

@mo-nathan suggests migrating the entire database to use utf8mb4 and also standardize the collation on utf8mb4_0900_ai_ci, which are the current Rails standards.

2024-03-10 04:12
nathan
Looks like it has to do with the table charset. Articles are cool with them and that table uses utf8mb4 whereas the comments table uses utf8mb3. I think the default in Rails changed at some point. We should probably migrate the entire database to use utf8mb4 and also standardize the collation on utf8mb4_0900_ai_ci (which also appears to be the current Rails standard).
04:14
Of our 73 tables, 8 currently use utf8mb4 and 65 use utf8mb3.
05:43
jdcohenesq
If I’m understanding you correctly, we should do a simple one-time migration like:
def self.up
execute "ALTER DATABASE #{ActiveRecord::Base.connection.current_database} CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;"
end
Are any of these things (from CoPilot) an issue:
Using the utf8mb4 character set in your MySQL database can have some performance implications. Let’s explore them:

  1. Storage Space:
    utf8mb4 uses 4 bytes per character, compared to 3 bytes per character in the older utf8 character set.
    This means that if your application stores a lot of text data (such as blog posts, comments, or user-generated content), the storage requirements will be higher with utf8mb4.
    However, storage is relatively inexpensive nowadays, so this may not be a significant concern for most applications.
  2. Index Size:
    utf8mb4 indexes are larger than utf8 indexes due to the increased character size.
    If you have large tables with many indexed columns, the index size can impact query performance.
    Consider using appropriate indexing strategies (e.g., composite indexes, partial indexes) to mitigate this.
  3. Memory Usage:
    When querying utf8mb4 data, MySQL needs to allocate more memory for sorting, grouping, and temporary tables.
    If your server has limited memory, this could affect performance during complex queries or high traffic.
  4. Query Execution Time:
    utf8mb4 character comparisons are more computationally expensive than utf8 comparisons.
    Queries involving string comparisons (e.g., sorting, filtering, joining) may be slightly slower with utf8mb4.
    However, the impact is usually minimal unless you have extremely high query loads.
  5. Connection Overhead:
    utf8mb4 requires more network bandwidth for transmitting data between the application and the database server.
    If your application serves a large number of concurrent users, this additional overhead may become noticeable.
  6. Full-Text Search Performance:
    If your application uses MySQL’s full-text search capabilities, utf8mb4 may affect performance.
    Full-text indexes on utf8mb4 columns are larger and slower to update.
  7. Migration Considerations:
    If you’re migrating an existing database to utf8mb4, be prepared for potential downtime during the migration process.
    Also, ensure that your application code and libraries (e.g., ActiveRecord in Rails) handle utf8mb4 correctly.
    In summary, while utf8mb4 provides better support for Unicode characters, it’s essential to weigh the benefits against the performance considerations. For most applications, the impact is manageable, especially with modern hardware and proper optimization.

06:12
nathan
I think the query you gave only changes the default values used by that database. I’m pretty sure we need to
ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;
for each table with the desired charset_name (‘utf8mb4’ for up). I expect we’ll also need a to run something similar for the collation for each table. (edited)
06:15
The performance/data concerns that CoPilot mentions should not be an issue. The standard unicode encodings only use the fourth byte when needed so it would depend on how much people actually use emoji’s which I expect will be very rare. You could check the size of the database dump before and after if you want more confidence in that. Since it is now the default in Rails, I expect any performance impact to be negligible.
06:16
Honestly, I’m really disappointed in that answer from CoPilot. It strikes me as essentially fear mongering to sound like they have an answer.
06:16
Also the first sentence is plain wrong, It only uses 4 bytes for characters that need 4 bytes.
06:21
https://dba.stackexchange.com/questions/285336/having-charset-utf8mb4-does-it-mean-every-character-takes-4-bytes-or-only-t
Database Administrators Stack ExchangeDatabase Administrators Stack Exchange
Having charset "utf8mb4" - does it mean every character takes 4 bytes, or only those that need 4 bytes?
I have a VARCHAR(80) utf8mb4 column, and I'm testing adding some ASCII & Emoji characters, and use LENGTH() and CHAR_LENGTH() to understand the differences.
From reading in different places, my

See Slack General Discussion

@mo-nathan
Copy link
Member

I looked a bit more at this and I don't think there's direct support for this in the rails migration framework, but I think it just needs to run some direct SQL per this page: https://stackoverflow.com/questions/8906813/how-to-change-the-default-charset-of-a-mysql-table.

Probably should be written as a migration that has an up and down and runs the appropriate ALTER TABLE queries. It's possible that this has to be a one way migration since it might throw an error trying to switch from utf8mb4 to utf8mb3.

@JoeCohen
Copy link
Member Author

Before I read the above comment, CoPIlot suggested:

def self.up
  execute "ALTER DATABASE `#{ActiveRecord::Base.connection.current_database}` CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;"
end

But:

  1. per @mo-nathan's comment above, the migration should instead be on a per-table basis, affecting only the tables which are not on utf8mb4 or utf8mb4_0900_ai_ci.
  2. Do we have to use CONVERT CHARACTER SET instead of CHARACTER SET?

@mo-nathan
Copy link
Member

Yes, I think we need "CONVERT CHARACTER SET". The point is that we are changing the table not the database (ALTER TABLE vs. ALTER DATABASE). The CoPilot suggestion would only change the default for new tables which are already getting created correctly with the current Rails default of utf8mb4/utf8mb4_0900_ai_ci.

@JoeCohen
Copy link
Member Author

JoeCohen commented Mar 30, 2024

Thanks!
Maybe I'll deal with this later. ("later" = some indefinite future date.) It's really low priority for me. But at least we have a record of how to do it.

@JoeCohen
Copy link
Member Author

JoeCohen commented Oct 8, 2024

I ran the migration below, which updated the schema.
Than manually created a Comment with some emoji's, and got this Trilogy error:

Trilogy::ProtocolError: 1366: Incorrect string value: '\xF0\x9F\x91\x8D u...' for column 'notes' at row 1 Rails.root: /Users/joe/mushroom-observer

Application Trace | Framework Trace | Full Trace Request Parameters:

{"utf8"=>"✓", "authenticity_token"=>"[FILTERED]", "comment"=>{"summary"=>"👍", "comment"=>"🍄‍🟫🍄‍🟫"}, "commit"=>"Create", "q"=>"1vU6y", "target"=>"561686", "type"=>"Observation"}

When I go back and display the Observation, the Comment is correct:
Screenshot 2024-10-08 at 3 03 36 PM

Here's my generator:

rails generate migration ChangeCommentsCharsetAndCollation

and migration:

class ChangeCommentsCharsetAndCollation < ActiveRecord::Migration[7.1]
  def up
    execute <<-SQL
      ALTER TABLE comments CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
    SQL
  end

  def down
    execute <<-SQL
      ALTER TABLE comments CONVERT TO CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci;
    SQL
  end
end

JoeCohen added a commit that referenced this issue Oct 8, 2024
- Prevents error from being thrown when someone uses an emoji in a Comment
- Changes Comments table charset and collation
- Delivers #2077
@JoeCohen JoeCohen linked a pull request Oct 8, 2024 that will close this issue
@JoeCohen JoeCohen linked a pull request Oct 8, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Error Throws an Error
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants