We collect tweets posted by politicians in the U.S. and India and save the JSON provided by the Twitter API. Lists of politicians are generated by NivaDuck, software developed at Microsoft Research - India for automatically identifying accounts that belong to politicians.
As of April 21, 2021, the data includes:
- USA
  - Number of accounts: 9608 (8994 with state metadata; all current members of Congress (MCs) with complete metadata)
  - New celebrity accounts added from https://github.com/webis-de/ACL-19/tree/master/celebrity-profiling
- India
  - Number of political Twitter handles: 33074 (27300 with party metadata, 16027 with state metadata)
  - New handles added from the DISMISS database
Two scripts ran daily, one each for India and the U.S., to pull the new tweets posted each day by every account in the respective lists. For India, the list of accounts includes journalists, media outlets, celebrities, and influencers.
You can view the collection scripts in the scripts folder.
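As a rough illustration of the daily pull described above, the sketch below tracks the newest tweet ID already saved for each account and requests only newer tweets on the next run. It is not the code in the scripts folder: `collect_new_tweets`, `fake_fetch`, and the sample tweets are hypothetical, and the network call is replaced by a stand-in so the example runs offline. In a real run, `fetch` might wrap a tweepy call such as `API.user_timeline` with its `since_id` parameter.

```python
import json


def collect_new_tweets(account_ids, fetch, since_ids):
    """Return JSON lines for tweets newer than the last saved ID per account.

    fetch(user_id, since_id) is a hypothetical callable that returns a list of
    tweet dicts. since_ids maps a user ID to the newest tweet ID already saved
    and is updated in place.
    """
    lines = []
    for user_id in account_ids:
        tweets = fetch(user_id, since_ids.get(user_id))
        for tweet in tweets:
            # Keep the full JSON object for each tweet, one object per line.
            lines.append(json.dumps(tweet))
        if tweets:
            since_ids[user_id] = max(tweet["id"] for tweet in tweets)
    return lines


# Stand-in for the API: made-up tweets keyed by a user ID string.
FAKE_TIMELINES = {
    "986781648": [{"id": 101, "text": "first"}, {"id": 105, "text": "second"}],
}


def fake_fetch(user_id, since_id):
    """Return the fake tweets whose ID is greater than since_id."""
    tweets = FAKE_TIMELINES.get(user_id, [])
    return [t for t in tweets if since_id is None or t["id"] > since_id]


state = {}
first_run = collect_new_tweets(["986781648"], fake_fetch, state)
second_run = collect_new_tweets(["986781648"], fake_fetch, state)
print(len(first_run), len(second_run), state)
```

Persisting `state` between runs (for example as a small JSON file) is what would let a daily job skip tweets it has already saved.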
The data is archived at the Social Media Archive (SOMAR) at ICPSR. Visit SOMAR to apply for access to the data.
We are manually checking all accounts NivaDuck identified and will provide periodic metadata updates.
See the codebook for a list of metadata fields, descriptions, variable types, valid values, etc.
Here's an example of the minimum metadata:
| | id | id_str | screen_name | confirmed_account_type | state | twitter_name | real_name | bioguide | office_holder | party | district | level | woman | birthday | last_updated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 986781648 | 986781648 | jeffsessions | 1 | Alabama | Jeff Sessions | | | | | | | | | 4/20/21 |
| 29 | 1155335864 | 1155335864 | repdonaldpayne | 1 | New Jersey | Rep. Donald Payne Jr | Donald Payne | P000604 | 1 | 1 | 10 | 3 | FALSE | 12/17/58 | 4/20/21 |
| 74 | 2970462034 | 2970462034 | repkathleenrice | 1 | New York | Kathleen Rice | Kathleen Rice | R000602 | 1 | 1 | 4 | 3 | TRUE | 2/15/65 | 4/20/21 |
Archived metadata files are available in the metadata folder as well.
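A minimal sketch of reading such a metadata file with Python's standard library follows. It assumes the file is a comma-separated export with the column names shown in the example; the actual file format and names in the metadata folder may differ, so check the codebook first. `load_metadata` and `accounts_with_field` are hypothetical helpers, and the sample text reuses the three example rows.

```python
import csv
import io

# Assumed CSV layout built from the example rows; the real files may differ.
SAMPLE = """id,id_str,screen_name,confirmed_account_type,state,twitter_name,real_name,bioguide,office_holder,party,district,level,woman,birthday,last_updated
986781648,986781648,jeffsessions,1,Alabama,Jeff Sessions,,,,,,,,,4/20/21
1155335864,1155335864,repdonaldpayne,1,New Jersey,Rep. Donald Payne Jr,Donald Payne,P000604,1,1,10,3,FALSE,12/17/58,4/20/21
2970462034,2970462034,repkathleenrice,1,New York,Kathleen Rice,Kathleen Rice,R000602,1,1,4,3,TRUE,2/15/65,4/20/21
"""


def load_metadata(text):
    """Return one dict per account, with empty cells converted to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Values stay as strings, so id_str is never rounded or reformatted.
        rows.append({key: (value if value != "" else None) for key, value in row.items()})
    return rows


def accounts_with_field(rows, field):
    """List the screen names of accounts that have a value for `field`."""
    return [row["screen_name"] for row in rows if row[field] is not None]


rows = load_metadata(SAMPLE)
with_bioguide = accounts_with_field(rows, "bioguide")
print(with_bioguide)
```

Keeping every value as a string avoids surprises with large account IDs and with the two-digit years in `birthday`, which need an explicit century rule before they can be parsed as dates.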
Anmol Panda and Armand Burks wrote the scripts to collect and archive tweets using the Twitter Public API (via tweepy). Joyojeet Pal conceived the project at MSR India with Anmol Panda, and his team regularly contributes new accounts for the India dataset. Libby Hemphill generated this documentation and manages the team that collects and updates data and metadata. Evan Parres handled metadata updates, and Najmin Ahmed manually verified many state labels for 2020 election candidates.
This project was a continuation of work initiated by Joyojeet Pal and Anmol Panda at Microsoft Research India.
Funding for the staff and infrastructure was provided by:
- Michigan Institute for Data Science
- Advanced Research Computing - Technology Services
- Assoc. Professor Libby Hemphill
We are grateful to Ballot Ready for providing data on political candidates in the U.S.
@techreport{panda2023,
author = {Panda, Anmol and Hemphill, Libby and Pal, Joyojeet},
year = {2023},
title = {Politweets: Tweets of politicians, celebrities, news media, and influencers from India and the United States},
institution = {Inter-university Consortium for Political and Social Research},
number = {SOMAR44-v1},
address = {Ann Arbor, MI},
note = {DOI:10.3886/xm68-rw44},
}
@inproceedings{10.1145/3400806.3400830,
author = {Panda, Anmol and Gonawela, A’ndre and Acharyya, Sreangsu and Mishra, Dibyendu and Mohapatra, Mugdha and Chandrasekaran, Ramgopal and Pal, Joyojeet},
title = {NivaDuck - A Scalable Pipeline to Build a Database of Political Twitter Handles for India and the United States},
year = {2020},
isbn = {9781450376884},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3400806.3400830},
doi = {10.1145/3400806.3400830},
abstract = {We present a scalable methodology to identify Twitter handles of politicians in a given region and test our framework in the context of Indian and US politics. The main contribution of our work is the list of the curated Twitter handles of 18500 Indian and 8000 US politicians. Our work leveraged machine learning-based classification and human verification to build a data set of Indian politicians on Twitter. We built NivaDuck, a highly precise, two-staged classification pipeline that leverages Twitter description text and tweet content to identify politicians. For India, we tested NivaDuck’s recall using Twitter handles of the members of the Indian parliament while for the US we used state and local level politicians in California state and San Diego county respectively. We found that while NivaDuck has lower recall scores, it produces large, diverse sets of politicians with precision exceeding 90 percent for the US dataset. We discuss the need for an ML-based, scalable method to compile such a dataset and its myriad use cases for the research community and its wide-ranging utilities for research in political communication on social media. },
booktitle = {International Conference on Social Media and Society},
pages = {200–209},
numpages = {10},
keywords = {united states, india, archive, twitter, politics},
location = {Toronto, ON, Canada},
series = {SMSociety'20}
}
Panda, A., Gonawela, A., Acharyya, S., Mishra, D., Mohapatra, M., Chandrasekaran, R., & Pal, J. (2020). NivaDuck - A Scalable Pipeline to Build a Database of Political Twitter Handles for India and the United States. International Conference on Social Media and Society, 200–209. https://doi.org/10.1145/3400806.3400830
@article{Arya_De_Mishra_Shekhawat_Sharma_Panda_Lalani_Singh_Mothilal_Grover_Nishal_Dash_Shora_Akbar_Pal_2022,
title={DISMISS: Database of Indian Social Media Influencers on Twitter},
volume={16},
url={https://ojs.aaai.org/index.php/ICWSM/article/view/19370},
DOI={10.1609/icwsm.v16i1.19370},
number={1},
journal={Proceedings of the International AAAI Conference on Web and Social Media},
author={Arya, Arshia and De, Soham and Mishra, Dibyendu and Shekhawat, Gazal and Sharma, Ankur and Panda, Anmol and Lalani, Faisal and Singh, Parantak and Mothilal, Ramaravind Kommiya and Grover, Rynaa and Nishal, Sachita and Dash, Saloni and Shora, Shehla and Akbar, Syeda Zainab and Pal, Joyojeet},
year={2022},
month={May},
pages={1201-1207} }
Arya, A., De, S., Mishra, D., Shekhawat, G., Sharma, A., Panda, A., Lalani, F., Singh, P., Mothilal, R. K., Grover, R., Nishal, S., Dash, S., Shora, S., Akbar, S. Z., & Pal, J. (2022). DISMISS: Database of Indian Social Media Influencers on Twitter. Proceedings of the International AAAI Conference on Web and Social Media, 16(1), 1201-1207. https://doi.org/10.1609/icwsm.v16i1.19370
Use issues to report bugs and request changes to the collection process or metadata. We will not be providing hands-on help with the data, but we will try to answer questions if they come up.