Generated fake data not unique for larger databases #29

Open
tyrann0us opened this issue May 24, 2019 · 1 comment

@tyrann0us

We anonymize MailPoet subscribers like this:

// Callback for `wpmdb_anonymization_config` filter.
$config['mailpoet_subscribers'] = [
  'first_name' => [
    'fake_data_type' => 'firstName',
  ],
  'last_name' => [
    'fake_data_type' => 'lastName',
  ],
  'email' => [
    'fake_data_type' => 'email',
  ],
  'subscribed_ip' => [
    'fake_data_type' => 'ipv4',
  ],
  'confirmed_ip' => [
    'fake_data_type' => 'ipv4',
  ],
  'unconfirmed_data' => [
    'fake_data_type' => 'randomLetter', // Random data type, could be anything.
    'post_process_function' => '__return_null',
  ],
];
return $config;

(Posted full snippet, but only email is relevant here.)

It works for small subscriber lists. For roughly 6,500 subscribers or more, however, Faker no longer seems to generate unique email addresses. Since MailPoet requires the email column to be UNIQUE, importing an anonymized database fails with the error Duplicate entry '[email protected]' for key 'email'.
In fact, for the ~6,500 entries, Faker seems to generate about 10 email addresses twice, according to the error messages.
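
A rough birthday-problem estimate may explain why small lists pass and larger ones fail. It rests on an assumption that is not measured anywhere in this issue: that Faker draws each email roughly uniformly from N distinct possible addresses. Under that assumption, the expected number of colliding pairs among n generated rows, and the pool size implied by about 10 duplicates in 6,500 rows, would be:

```latex
\mathbb{E}[\text{colliding pairs}] = \binom{n}{2}\frac{1}{N} = \frac{n(n-1)}{2N}

% Solving for N with n = 6500 and about 10 observed duplicates:
N \approx \frac{6500 \cdot 6499}{2 \cdot 10} \approx 2.1 \times 10^{6}

% The same pool with a small list of n = 500 subscribers:
\frac{500 \cdot 499}{2 \cdot 2.1 \times 10^{6}} \approx 0.06
```

So a pool of about two million addresses would almost never collide for a few hundred rows, yet would be expected to collide several times at 6,500. Faker's real distribution is not uniform, so this is only an order-of-magnitude illustration.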

So I checked Faker's modifiers (https://github.com/fzaninotto/Faker/#modifiers) and, for testing, changed

$data = call_user_func_array( array( $faker, $this->fake_data_type ), $args );

to

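// Pass $args through call_user_func_array so they are spread as individual
// arguments, as in the original call, instead of being passed as one array.
if ( 'email' === $this->fake_data_type ) {
  $data = call_user_func_array( array( $faker->unique(), $this->fake_data_type ), $args );
} else {
  $data = call_user_func_array( array( $faker, $this->fake_data_type ), $args );
}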

(This could be simplified by always using unique(), but I'm not sure whether that would have unwanted side effects:)

$data = call_user_func_array( array( $faker->unique(), $this->fake_data_type ), $args );

That reduced the number of duplicates, but some remained. Even updating Faker to v1.8 (which introduced more German email providers, fzaninotto/Faker#1320, see #25) did not solve it (and would be no solution for other languages anyway). And even if an export ran without creating duplicates, we couldn't be sure that it would still work for larger data sets.

I'm not entirely sure why Faker still creates duplicates despite the unique() call, but I think it might be related to the fact that the plugin is bootstrapped every time admin-ajax.php is called, which also reinitializes Faker every time. If it's true, I have no idea how to deal with this. @polevaultweb, do you?
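
To make the suspected cause concrete, here is a minimal sketch of how uniqueness tracking that lives only in memory is lost when the generator is recreated for each request. It does not use Faker at all: `UniquePicker` and `runRequests` are hypothetical names invented for this illustration, the pool of four addresses is made up, and the picker is deterministic only so that the result is reproducible.

```php
<?php
// Hypothetical stand-in for a generator with a unique() guarantee.
// It remembers what it has returned, but only inside this instance.
final class UniquePicker
{
    /** @var array<string, true> Values already handed out by this instance. */
    private $seen = [];

    /** @var string[] Candidate values (a tiny, made-up pool). */
    private $pool;

    public function __construct(array $pool)
    {
        $this->pool = $pool;
    }

    public function next(): string
    {
        // Return the first candidate this instance has not used yet.
        foreach ($this->pool as $candidate) {
            if (!isset($this->seen[$candidate])) {
                $this->seen[$candidate] = true;
                return $candidate;
            }
        }
        throw new OverflowException('Pool exhausted.');
    }
}

// Simulates several requests that each anonymize a batch of rows.
function runRequests(int $requests, int $rowsPerRequest, bool $sharedState): array
{
    $pool   = ['a@example.org', 'b@example.org', 'c@example.org', 'd@example.org'];
    $picker = new UniquePicker($pool);
    $emails = [];

    for ($r = 0; $r < $requests; $r++) {
        if (!$sharedState) {
            // Assumption: mimics a fresh bootstrap on every admin-ajax.php call,
            // which throws away the record of values used so far.
            $picker = new UniquePicker($pool);
        }
        for ($i = 0; $i < $rowsPerRequest; $i++) {
            $emails[] = $picker->next();
        }
    }

    return $emails;
}

$perRequest = runRequests(2, 2, false); // state reset between batches
$shared     = runRequests(2, 2, true);  // state kept across batches
```

With the state reset per batch, the second batch repeats the first batch's values, so only two of the four results are distinct. With shared state, all four are distinct. If the hypothesis above holds, any fix would have to keep the set of used values somewhere that survives between requests.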

Thanks!

pajtai commented Apr 13, 2020

The way Anonimatron handles this problem in general, not just for emails, is to maintain a list of synonyms in a separate file. The synonyms file consists of a mapping from input production data to anonymized output data. This synonyms file should be treated as sensitive production data.

The big advantage of the synonyms file is that it allows consistency across tables, and it allows one to keep the same anonymized test names across multiple anonymizations, which can be very helpful for the non-production QA team.

It would be a simple check to see whether the generated email is already in the synonyms file. So an inelegant fix would be: if the generated email is in the synonyms file, rerun Faker until it comes up with an email that is not.
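
The retry idea can be sketched roughly as follows. This is not Anonimatron's implementation and not part of the plugin: `SynonymMap`, `resolve`, and the `$generate` callable are hypothetical names for this illustration. The map is kept in memory here, and persisting it to a file between runs is left out.

```php
<?php
// Hypothetical synonym map: original value => anonymized value.
final class SynonymMap
{
    /** @var array<string, string> original => anonymized */
    private $forward = [];

    /** @var array<string, true> anonymized values already assigned */
    private $taken = [];

    /**
     * Returns the anonymized value for $original, generating one if needed.
     *
     * @param callable $generate Returns a candidate fake value (for example,
     *                           a closure wrapping a Faker call).
     */
    public function resolve(string $original, callable $generate, int $maxAttempts = 1000): string
    {
        // Reuse the earlier mapping so the same input always gets the same output.
        if (isset($this->forward[$original])) {
            return $this->forward[$original];
        }

        // Retry until the candidate is not assigned to some other input.
        for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
            $candidate = $generate();
            if (!isset($this->taken[$candidate])) {
                $this->forward[$original] = $candidate;
                $this->taken[$candidate]  = true;
                return $candidate;
            }
        }

        // Give up rather than loop forever if the generator's pool is too small.
        throw new OverflowException("No unused value found after {$maxAttempts} attempts.");
    }
}

// Usage with a generator that deliberately produces a collision.
$queue    = ['x@example.org', 'x@example.org', 'y@example.org'];
$generate = function () use (&$queue) {
    return array_shift($queue);
};

$map    = new SynonymMap();
$first  = $map->resolve('alice@real.test', $generate); // takes x@example.org
$second = $map->resolve('bob@real.test', $generate);   // x is taken, retries, gets y@example.org
$again  = $map->resolve('alice@real.test', $generate); // served from the map, no new draw
```

The attempt cap is a design choice for the sketch: without it, a generator with fewer possible values than rows would never terminate.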
