Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean instagram input #2400

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Clean instagram input #2400

wants to merge 2 commits into from

Conversation

VirginiaDooley
Copy link
Contributor

@VirginiaDooley VirginiaDooley commented Jun 26, 2024

This work cleans instagram entries to the Person Identifier form so that the result is a username with the value of the instagram url. I've also updated the people/README.md to list and describe the input and output of all the person identifiers. Finally, I've included a management command to reformat existing usernames to urls.


@VirginiaDooley VirginiaDooley marked this pull request as ready for review June 26, 2024 15:34
@VirginiaDooley VirginiaDooley changed the title WIP Clean and standardise instagram entries Clean instagram entries Jun 26, 2024
@VirginiaDooley VirginiaDooley changed the title Clean instagram entries Clean instagram input Jun 27, 2024
@symroe
Copy link
Member

symroe commented Jun 27, 2024

I think the regexs might need some work here.

First off, we have x.com in there (I guess copied from the twitter validator).

I think there's also instagr.am that we need to match (I don't know how much this is used), and matching .* at the start might be a problem that applies to Twitter too.

Can you use url parse (as we've done elsewhere) to match the domain to one of [ "instagram.com", "www.instagram.com", "instagr.am", "www.instagr.am", ]?

This is the first pass.

Then for usernames, we need to take the path part of the parsed value. I found this regex for validating usernames:

https://blog.jstassen.com/2016/03/code-regex-for-instagram-username-and-hashtags/

@VirginiaDooley
Copy link
Contributor Author

I think the regexs might need some work here.

First off, we have x.com in there (I guess copied from the twitter validator).

I think there's also instagr.am that we need to match (I don't know how much this is used), and matching .* at the start might be a problem that applies to Twitter too.

Can you use url parse (as we've done elsewhere) to match the domain to one of [ "instagram.com", "www.instagram.com", "instagr.am", "www.instagr.am", ]?

This is the first pass.

Then for usernames, we need to take the path part of the parsed value. I found this regex for validating usernames:

https://blog.jstassen.com/2016/03/code-regex-for-instagram-username-and-hashtags/

I've added a fixup commit with these suggested changes and made it clear to the user that a URL not a username is expected on the form.

@symroe
Copy link
Member

symroe commented Jul 1, 2024

The regex is looking good now.

If the idea is to store the username only, then we shouldn't make the change to call the field "Instagram URL". As things stand, someone will enter a URL and it will be replaced with a username. But if they enter a username (or edit a profile with just a username in the field), it will fail validation due to the field not containing one of the valid domains.

We would also need to update WCIVF and all previous values in this field to only expect the username.

It might be better to standardise the URL with a username, rather than changing it from being a URL

Copy link
Member

@symroe symroe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of inline comments.

I think we're going to need to think about validating all existing data in this field somehow. If we deploy this code as-is, some profiles might not be saveable because the value in this field isn't valid. Users can just remove it, but we should clean this up before that happens.

The quickest thing to do, I imagine, is to run the regex over a CSV export with the instagram URL added. We can then see if there are any profiles with bad values and manually clean them up.

We'll want to do the same for LinkedIn URLs (as per the other open PR), so I expect running one report for both makes sense.

ynr/apps/people/helpers.py Show resolved Hide resolved
@@ -169,6 +169,8 @@ def get_value_html(self):

if self.value_type == "twitter_username":
url = format_html("https://twitter.com/{}", self.value)
if self.value_type == "instagram_url":
url = format_html("https://instagram.com/{}", self.value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with the www. here

@@ -169,6 +169,8 @@ def get_value_html(self):

if self.value_type == "twitter_username":
url = format_html("https://twitter.com/{}", self.value)
if self.value_type == "instagram_url":
url = format_html("https://www.instagram.com/{}", self.value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm reading this wrong, but isn't clean_instagram_url returning a full URL? In that case, the url here should just be self.value?

@VirginiaDooley VirginiaDooley force-pushed the hotfix/clean-instagram branch 2 times, most recently from 8f883ca to bd0b93d Compare July 17, 2024 13:19
@VirginiaDooley
Copy link
Contributor Author

Couple of inline comments.

I think we're going to need to think about validating all existing data in this field somehow. If we deploy this code as-is, some profiles might not be saveable because the value in this field isn't valid. Users can just remove it, but we should clean this up before that happens.

The quickest thing to do, I imagine, is to run the regex over a CSV export with the instagram URL added. We can then see if there are any profiles with bad values and manually clean them up.

We'll want to do the same for LinkedIn URLs (as per the other open PR), so I expect running one report for both makes sense.

We have 4,227 instances of an instagram PersonIdentifier. All but one username passed the regex validation. However, we have approx 60 values stored as "@username" rather than a url. I could alter the script I used to detect dead links to find and replace those values with valid urls?

- Audit all person identifiers and their validation
- Check the domain and the username for valid entries
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants