This repository has been archived by the owner on Aug 27, 2020. It is now read-only.

Guide to Ethical Data Collections Practices #59

Open
shiffman opened this issue Oct 7, 2019 · 2 comments

shiffman (Member) commented Oct 7, 2019

The question came up today in class: "What if I want to collect data? Is there a helpful guide / document of tips / common strategies for ethical data collection?". Please add your suggestions here:

Also, noting these two topics I referenced:

Duke University MTMC

Atlanta Asks Google Whether It Targeted Black Homeless People

@ellennickles (Member)

The Datasheets for Datasets paper (mentioned in #10) advises that dataset creators answer ~60 questions (!) regarding motivation, curation (composition, collection, and data cleaning), and integration (uses, distribution, and maintenance).

Would it make sense to focus on a select number of these, especially the questions that relate specifically to data from other people? I copied these verbatim, but we can edit further…

(Motivation)

  • For what purpose was the dataset created?

(Composition)

  • Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.
  • Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.
  • If the dataset relates to people, does it identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.
  • Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.
  • Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.

(Collection Process)

  • If the dataset relates to people, did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?
  • Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.
  • Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.
  • If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

(Uses)

  • Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?
  • Are there tasks for which the dataset should not be used? If so, please provide a description.

(Maintenance)

  • If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.

The full list is here. Of note, this paper is frequently referenced in the Partnership on AI’s About ML project which ultimately aims to establish documentation standards across industries for the transparency of entire ML systems—both datasets and models.
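
If it helps to keep a shortlist like this next to the data itself, here is one possible way to track the answers in code. This is only a rough sketch, not something from the Datasheets for Datasets paper: the names `DatasheetEntry`, `new_datasheet`, `unanswered`, and `render` are made up for illustration, the sample answer is invented, and two of the questions are shortened paraphrases of the ones quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasheetEntry:
    """One question from the shortlist plus the dataset creator's answer."""
    section: str
    question: str
    answer: str = ""  # an empty string means "not answered yet"

    def is_answered(self) -> bool:
        return bool(self.answer.strip())


def new_datasheet() -> List[DatasheetEntry]:
    """Return a blank datasheet holding a small subset of the questions.

    The selection here is an assumption made for illustration; edit freely.
    """
    return [
        DatasheetEntry("Motivation",
                       "For what purpose was the dataset created?"),
        # Paraphrased from the longer question in the list above.
        DatasheetEntry("Composition",
                       "Is it possible to identify individuals, directly or indirectly?"),
        DatasheetEntry("Collection Process",
                       "Were the individuals in question notified about the data collection?"),
        DatasheetEntry("Collection Process",
                       "Did the individuals in question consent to the collection and use of their data?"),
        # Paraphrased from the longer question in the list above.
        DatasheetEntry("Maintenance",
                       "Are there applicable limits on the retention of the data?"),
    ]


def unanswered(entries: List[DatasheetEntry]) -> List[DatasheetEntry]:
    """Return the entries that still have no answer."""
    return [e for e in entries if not e.is_answered()]


def render(entries: List[DatasheetEntry]) -> str:
    """Format the datasheet as plain text, grouped by section heading."""
    lines = []
    current_section = None
    for e in entries:
        if e.section != current_section:
            current_section = e.section
            lines.append(f"({e.section})")
        lines.append(f"  Q: {e.question}")
        lines.append(f"  A: {e.answer.strip() or 'TODO'}")
    return "\n".join(lines)


if __name__ == "__main__":
    sheet = new_datasheet()
    # Hypothetical answer, for demonstration only.
    sheet[0].answer = "Class project on gesture recognition."
    print(render(sheet))
    print(f"{len(unanswered(sheet))} question(s) still unanswered")
```

The rendered text could be saved as a README alongside the dataset, and the `unanswered` check could serve as a reminder before sharing the data with anyone.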

shiffman (Member, Author) commented Oct 9, 2019

Thank you so much @ellennickles, this is fantastic. (And thank you for summarizing, super helpful.) I plan on discussing this in class tomorrow!
