Service for creating Twitter datasets for research and archiving.

gwu-libraries/TweetSets

TweetSets

Twitter datasets for research and archiving.

  • Create your own Twitter dataset from existing datasets.
  • Conforms with Twitter policies.

TweetSets allows users to (1) select from existing datasets; (2) limit the dataset by querying on keywords, hashtags, and other parameters; (3) generate and download dataset derivatives such as the list of tweet ids and mention nodes/edges.
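As an illustration of what the tweet id and mention derivatives contain, the sketch below derives both from tweets in the standard Twitter v1.1 JSON format (id_str, user.id_str, entities.user_mentions). The function names are hypothetical and the output shape is an assumption; this is not the code TweetSets itself uses.

```python
from collections import Counter


def tweet_ids(tweets):
    """Return the id of every tweet, in input order."""
    return [tweet["id_str"] for tweet in tweets]


def mention_edges(tweets):
    """Count (author id, mentioned user id) pairs across an iterable of tweets.

    Each key is one edge of the mention graph; the value is how many
    times that author mentioned that user.
    """
    edges = Counter()
    for tweet in tweets:
        author = tweet["user"]["id_str"]
        # Tweets without mentions may lack the list entirely.
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            edges[(author, mention["id_str"])] += 1
    return edges
```

The nodes of the mention graph are then simply the distinct user ids that appear in the edge keys.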

Modes

TweetSets can be run in different modes. The mode determines which datasets are available and which types of dataset derivatives can be generated.

  • public mode: Source datasets that are marked as local only are excluded. Dataset derivatives that include the text of the tweet cannot be generated.
  • local mode: All source datasets are included, including those that are marked as local only. All dataset derivatives can be generated, including those that include the text of the tweet.
  • both mode: For configured network IP ranges, the user is placed in local mode. Otherwise, the user is placed in public mode.

These modes allow conforming with the Twitter policy that prohibits sharing complete tweets with 3rd parties.
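The mode rules above can be sketched as a small decision function. The names resolve_mode and can_include_tweet_text and the way the IP ranges are passed in are assumptions made for illustration; the actual configuration lives in the .env file and may be handled differently.

```python
import ipaddress


def resolve_mode(server_mode, client_ip, local_ranges):
    """Return the effective mode ("public" or "local") for one request.

    server_mode  -- "public", "local", or "both"
    client_ip    -- the requesting user's IP address as a string
    local_ranges -- CIDR strings treated as local when server_mode is "both"
    """
    if server_mode in ("public", "local"):
        return server_mode
    if server_mode != "both":
        raise ValueError(f"unknown mode: {server_mode!r}")
    address = ipaddress.ip_address(client_ip)
    networks = [ipaddress.ip_network(cidr) for cidr in local_ranges]
    # In "both" mode, only users inside a configured range get local mode.
    return "local" if any(address in network for network in networks) else "public"


def can_include_tweet_text(effective_mode):
    """Derivatives containing tweet text are only allowed in local mode."""
    return effective_mode == "local"
```

For example, with local_ranges set to ["10.0.0.0/8"], a request from 10.1.2.3 resolves to local mode and a request from any address outside that range resolves to public mode.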

Modes are configured in the .env file as described below.

Installing

Prerequisites

  • Docker
  • Docker-compose
  • Set vm.max_map_count as described in the ElasticSearch documentation. Each node of the cluster may require this setting.
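A quick way to confirm the kernel setting on a Linux host is to read it back and compare it with the required minimum. The sketch below assumes a minimum of 262144, the value the ElasticSearch documentation has commonly listed; check the documentation for your ElasticSearch version, since the requirement may differ. The function names are hypothetical.

```python
from pathlib import Path

# Assumed minimum; confirm against the ElasticSearch docs for your version.
REQUIRED_MAX_MAP_COUNT = 262144


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Return the raw kernel setting as text (Linux only)."""
    return Path(path).read_text()


def max_map_count_ok(raw_value, required=REQUIRED_MAX_MAP_COUNT):
    """Return True when the setting meets the required minimum."""
    return int(raw_value.strip()) >= required
```

On each node, max_map_count_ok(read_max_map_count()) returning False indicates that the setting still needs to be raised.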

Installation for non-cluster ElasticSearch

  1. Create data directories on a volume with adequate storage:

     mkdir -p /tweetset_data/redis
     mkdir -p /tweetset_data/datasets
     mkdir -p /tweetset_data/full_datasets
     mkdir -p /tweetset_data/elasticsearch/esdata1
     mkdir -p /tweetset_data/elasticsearch/esdata2
     chown -R 1000:1000 /tweetset_data/elasticsearch
    

Note:

  • Create an esdata<number> directory for each ElasticSearch container.
  • On OS X, the redis and esdata<number> directories must be ugo+rwx.
  1. Create a directory, to be named as you choose, where tweet data files will be stored for loading.

     mkdir /datasets_loading
    
  2. Clone or download this repository:

     git clone https://github.com/gwu-libraries/TweetSets.git
    
  3. Change to the docker directory:

     cd docker
    
  4. Copy the example docker files:

     cp example.docker-compose.yml docker-compose.yml
     cp example.env .env
    
  5. Edit .env. This file is annotated to help you select appropriate values.

  6. Create dataset_list_msg.txt in the docker directory. The contents of this file will be displayed on the dataset list page. It can be used to list other datasets that are available but not yet loaded. To leave the file empty, create it with:

     touch dataset_list_msg.txt
    
  7. Bring up the containers:

     docker-compose up -d
    

For HTTPS support, uncomment and configure the nginx-proxy container in docker-compose.yml.

Cluster installation

Clusters must have at least a primary node and two additional nodes.

Primary node

  1. Create data directories on a volume with adequate storage. Note that in order to use the Spark loader, the full_datasets and datasets_loading directories (see below) will need to be shared between the primary and cluster nodes as an NFS mount. (The other directories do not need to be shared.)

     mkdir -p /tweetset_data/redis
     mkdir -p /tweetset_data/datasets
     mkdir -p /tweetset_data/full_datasets
     mkdir -p /tweetset_data/elasticsearch
     chown -R 1000:1000 /tweetset_data/elasticsearch
    
  2. Create a directory, to be named as you choose, where tweet data files will be stored for loading.

     mkdir /datasets_loading
    
  3. Set up the tweetset_data/full_datasets and datasets_loading NFS mounts as described here.

  4. Clone or download this repository:

     git clone https://github.com/gwu-libraries/TweetSets.git
    
  5. Change to the docker directory:

     cd docker
    
  6. Copy the example docker files:

     cp example.cluster-primary.docker-compose.yml docker-compose.yml
     cp example.env .env
    
  7. Update .env. This file is annotated to help you select appropriate values.

  8. Create dataset_list_msg.txt in the docker directory. The contents of this file will be displayed on the dataset list page. It can be used to list other datasets that are available but not yet loaded. To leave the file empty, create it with:

     touch dataset_list_msg.txt
    

For HTTPS support, uncomment and configure the nginx-proxy container in docker-compose.yml.

Cluster node(s)

  1. Create data directories on a volume with adequate storage:

     mkdir -p /tweetset_data/elasticsearch
     mkdir -p /tweetset_data/full_datasets
     chown -R 1000:1000 /tweetset_data/elasticsearch
     mkdir /datasets_loading
    
  2. Clone or download this repository:

     git clone https://github.com/gwu-libraries/TweetSets.git
    
  3. Set up the tweetset_data/full_datasets and datasets_loading NFS mounts as described here.

  4. Change to the docker directory:

     cd docker
    
  5. Copy the example docker files:

     cp example.cluster-node.docker-compose.yml docker-compose.yml
     cp example.cluster-node.env .env
    
  6. Edit .env. This file is annotated to help you select appropriate values. Note that 2 cluster nodes must have MASTER set to true.

  7. Bring up the containers, starting with the cluster nodes and then moving to the primary node.

     docker-compose up -d
    

Loading a source dataset

Prepping the source dataset

  1. Create a dataset directory within the dataset filepath configured in your .env.
  2. Place tweet files in the directory. The tweet files can be line-oriented JSON (.json) or gzip compressed line-oriented JSON (.json.gz).
  3. Create a dataset description file in the directory named dataset.json. See example.dataset.json for the format of the file.
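Before loading, it can help to check that a prepared directory matches the layout described above. The following is a minimal sketch, not part of TweetSets: iter_tweets and check_dataset_dir are hypothetical helpers that read .json and .json.gz files line by line and confirm that dataset.json is present. It does not validate the contents of dataset.json; see example.dataset.json for that format.

```python
import gzip
import json
from pathlib import Path


def iter_tweets(dataset_dir):
    """Yield one parsed tweet per non-empty line of each tweet file."""
    for path in sorted(Path(dataset_dir).iterdir()):
        if path.name == "dataset.json":
            continue  # the description file is not a tweet file
        if path.name.endswith(".json.gz"):
            opener = gzip.open
        elif path.name.endswith(".json"):
            opener = open
        else:
            continue  # ignore anything that is not a tweet file
        with opener(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def check_dataset_dir(dataset_dir):
    """Return the number of tweets found; raise if dataset.json is missing."""
    directory = Path(dataset_dir)
    if not (directory / "dataset.json").is_file():
        raise FileNotFoundError("dataset.json is missing from " + str(directory))
    return sum(1 for _ in iter_tweets(directory))
```

A json.JSONDecodeError raised while counting points to a file that is not line-oriented JSON.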

Loading

Use this method when Elasticsearch is on the same machine as TweetSets (non-cluster option), or for otherwise loading without using Spark.

  1. Start and connect to a loader container:

     docker-compose run --rm loader /bin/bash
    
  2. Invoke the loader:

     python tweetset_loader.py create /dataset/path/to
    

To see other loader commands:

    python tweetset_loader.py

Note that tweets are never added to an existing index. When using the reload command, a new index is created for a dataset that replaces the existing index. The new index replaces the old index only after the new index has been created, so users are not affected by reloading.
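The reload behavior described above is a build-then-swap pattern. The in-memory sketch below illustrates only the ordering; it does not use ElasticSearch, and the IndexRegistry class, the index naming scheme, and the deletion of the old index after the swap are all assumptions made for illustration.

```python
class IndexRegistry:
    """Toy model: each dataset id points at exactly one active index."""

    def __init__(self):
        self.indexes = {}  # index name -> list of tweets
        self.active = {}   # dataset id -> name of the index users query
        self._counter = 0

    def _build(self, dataset_id, tweets):
        # Hypothetical naming scheme, used only to keep names unique.
        self._counter += 1
        name = f"tweets-{dataset_id}-{self._counter}"
        docs = list(tweets)  # may raise; nothing has been swapped yet
        self.indexes[name] = docs
        return name

    def create(self, dataset_id, tweets):
        self.active[dataset_id] = self._build(dataset_id, tweets)

    def reload(self, dataset_id, tweets):
        old_name = self.active[dataset_id]
        new_name = self._build(dataset_id, tweets)  # fully built first
        self.active[dataset_id] = new_name          # then swapped in
        del self.indexes[old_name]                  # assumption: old index dropped

    def search(self, dataset_id):
        return self.indexes[self.active[dataset_id]]
```

Because the swap happens only after the new index is complete, a reload that fails partway leaves users querying the old index.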

Loading with Apache Spark

When using the Spark loader, the dataset files must be located at the dataset filepath on all nodes. All nodes must also have access to the shared directory (tweetset_data/full_datasets) for creating the full extracts. For creating full extracts, this process is more efficient than the method described below ("Creating a manual extract").

In general, using Spark within Docker is tricky because the Spark driver, Spark master, and Spark nodes all need to be able to communicate, and the ports are dynamically selected. (Some of the ports can be fixed, but supporting multiple simultaneous loaders requires leaving some dynamic.) This doesn't play well with Docker's port mapping, since the hostnames and ports that Spark advertises internally must match what is available through Docker. A further complication is that host networking (which is used to support the dynamic ports) does not work correctly on Mac. Use the regular loader rather than the Spark loader when Elasticsearch is on the same machine as TweetSets (e.g., in a small development environment, not a cluster).

Cluster mode

  1. Start and connect to a loader container:

     docker-compose -f loader.docker-compose.yml run --rm loader /bin/bash
    
  2. Invoke the loader:

     spark-submit \
     --jars elasticsearch-hadoop.jar \
     --master spark://$SPARK_MASTER_HOST:7101 \
     --py-files dist/TweetSets-2.2.0-py3.8.egg,dependencies.zip \
     --conf spark.driver.bindAddress=0.0.0.0 \
     --conf spark.driver.host=$SPARK_DRIVER_HOST \
     --conf spark.driver.port=7003 \
     --conf spark.blockManager.port=7020 \
     tweetset_loader.py spark-create /dataset/path/to
    
  3. Extracts will be stored in /tweetset_data/full_datasets and will be visible in the UI.

Reloading an existing set with Apache Spark

  1. Start and connect to a loader container:

     docker-compose -f loader.docker-compose.yml run --rm loader /bin/bash
    
  2. Invoke the loader:

     spark-submit \
     --jars elasticsearch-hadoop.jar \
     --master spark://$SPARK_MASTER_HOST:7101 \
     --py-files dist/TweetSets-2.2.0-py3.8.egg,dependencies.zip \
     --conf spark.driver.bindAddress=0.0.0.0 \
     --conf spark.driver.host=$SPARK_DRIVER_HOST \
     --conf spark.driver.port=7003 \
     --conf spark.blockManager.port=7020 \
     tweetset_loader.py spark-reload dataset-id /dataset/path/to
    

where dataset-id is the id of the dataset, which can be found by viewing the collection's ID metadata field via the TweetSets UI.

Note that running spark-reload does not re-read dataset.json and update the dataset descriptive metadata. To update the dataset descriptive metadata to match dataset.json if it has been changed, invoke the loader with an update command:

    spark-submit \
    --jars elasticsearch-hadoop.jar \
    --master spark://$SPARK_MASTER_HOST:7101 \
    --py-files dist/TweetSets-2.2.0-py3.8.egg,dependencies.zip \
    --conf spark.driver.bindAddress=0.0.0.0 \
    --conf spark.driver.host=$SPARK_DRIVER_HOST \
    --conf spark.driver.port=7003 \
    --conf spark.blockManager.port=7020 \
    tweetset_loader.py update dataset-id /dataset/path/to
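The three spark-submit invocations above differ only in the arguments passed to tweetset_loader.py. If you script them, a small helper can keep the shared options in one place. This is a sketch under the assumption that the options are exactly those shown above; spark_submit_args is a hypothetical name and not part of TweetSets.

```python
def spark_submit_args(loader_args, master_host, driver_host):
    """Build the spark-submit argument list for a tweetset_loader.py command.

    loader_args -- e.g. ["spark-create", "/dataset/path/to"]
    master_host -- value of SPARK_MASTER_HOST
    driver_host -- value of SPARK_DRIVER_HOST
    """
    return [
        "spark-submit",
        "--jars", "elasticsearch-hadoop.jar",
        "--master", f"spark://{master_host}:7101",
        "--py-files", "dist/TweetSets-2.2.0-py3.8.egg,dependencies.zip",
        "--conf", "spark.driver.bindAddress=0.0.0.0",
        "--conf", f"spark.driver.host={driver_host}",
        "--conf", "spark.driver.port=7003",
        "--conf", "spark.blockManager.port=7020",
        "tweetset_loader.py", *loader_args,
    ]
```

Inside the loader container, the resulting list can be passed to subprocess.run, with the two hosts read from os.environ["SPARK_MASTER_HOST"] and os.environ["SPARK_DRIVER_HOST"].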

Creating a manual extract (dataset)

Full extracts of existing datasets can be created from the command line. Note that this command does not use the Spark loader, so it does not generate the mentions or user tweet count files. It generates .zip versions of the JSON, CSV, and tweet ID files.

  1. Launch a shell session in the server container:

     docker exec -it ts_server_1 /bin/bash
    
     or

     docker exec -it ts_server-flaskrun_1 /bin/bash
    
  2. Issue the command to create the extract, where dataset_id is the id of the dataset, which can be found by viewing the collection's ID metadata field via the TweetSets UI:

     flask create-extract dataset_id
    
  3. Upon completion, an email will be sent to the address in the ADMIN_EMAIL field of the .env file.

Kibana

Elastic's Kibana is a general-purpose framework for exploring, analyzing, and visualizing data. Since the tweets are already indexed in ElasticSearch, they are ready to be used from Kibana.

To enable Kibana, uncomment the Kibana service in your docker-compose.yml. By default, Kibana will run on port 5601.

A few notes about Kibana:

  • When starting Kibana, the first thing you will need to do is select an index pattern. Each index represents a dataset, and the index name has the form tweets- followed by the dataset id. The dataset id is available under the dataset details when selecting source datasets in TweetSets.
  • The time period of the tweets is controlled by the date picker at the top right of the Kibana screen. By default the time period is very short; you will probably want to adjust it to cover a longer period.

Citing

Please cite TweetSets as:

    Justin Littman, Laura Wrubel, Dan Kerchner, Dolsy Smith, Will Bonnett. (2020). TweetSets. Zenodo. https://doi.org/10.5281/zenodo.1289426

Development

Unit tests

Run outside the container.

    python -m unittest

The Spark loader has its own set of unit tests. These will be copied to the TweetSets/tests directory when creating the loader container. Run them within the loader container with python -m unittest.

Kibana TODO

  • Consider multiple Kibana users.
  • Consider persistence.
  • Provide a default dashboard.
  • Consider approaches to index patterns.

TweetSets TODO

  • Loading:
    • Hydration of tweet ids lists.
  • Limiting:
    • Limit by mention user ids
    • Limit by user ids
    • Limit by verified users
  • Scroll additional sample tweets
  • Dataset derivatives:
    • Additional top derivatives:
      • URL
      • Quotes/retweets
    • Options to limit top derivatives by:
      • Top number (e.g., top 500)
      • Count greater than (e.g., more than 5 mentions)
    • Additional nodes/edges derivatives:
      • Replies
      • Quotes/retweets
    • Provide nodes/edges in additional formats such as Gephi.
  • Separate counts of tweets available for public / local on home page.