epfl-shs-class

Set of instructions for using data in the frame of EPFL SHS class.

Part 1 - How to get the data

  1. Read and sign the NDA, and give it back to the teachers.

  2. Get your credentials on SWITCHengines and retrieve the EC2 credentials (for use with s3cmd):
  • go to https://engines.switch.ch/ and authenticate with your credentials
  • from the menu on the left, select Project >> API Access
  • click on "View credentials"
  • copy the EC2 Access Key and EC2 Secret Key fields into your .s3cfg config file (see below)

  3. Install and configure s3cmd:
  • with brew on macOS
  • with sudo apt-get install s3cmd on e.g. Ubuntu
  • configure it:
    • copy the file .s3cfg from this repo to your home directory (e.g. ~/)
    • add access_key and secret_key to .s3cfg
    • type s3cmd ls: you should get a list of all buckets in the project

  4. Download the data:
  • s3cmd get --recursive s3://impresso-data/ ~/impresso-data/
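For reference, a minimal sketch of the two lines to add to your .s3cfg (the values are placeholders, not real keys):

```ini
# Paste the values copied from the SWITCHengines "View credentials" panel
access_key = YOUR_EC2_ACCESS_KEY
secret_key = YOUR_EC2_SECRET_KEY
```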

Part 2 - How to transform it

NB: before reading further, install jq if it is not already on your system.

The data comes as bz2 archives, one per journal and year, containing newspaper articles that have been 'rebuilt' from the OCR output. The format is JSON Lines: each line is a JSON object, i.e. one article.

Each article contains more information than you need, so it is a good idea to filter it and keep only the fields that interest you. In the folder where you have the archives, execute the following command:

for f in *[0-9].jsonl.bz2; do bzcat "$f" | jq -c '{id: .id, type: .tp, date: .d, title: .t, fulltext: .ft}' | bzip2 > "${f%.jsonl.bz2}-reduced.jsonl.bz2"; done

What the command does:

  • iterate over the files whose name ends in a digit followed by .jsonl.bz2 (the current filename is held in the variable $f)
  • decompress the archive (bzcat) into a stream of JSON lines
  • send (pipe |) this stream into jq
  • keep (and rename) only the five fields of interest
  • compress the result (bzip2) and write it to a file named after the input file, with -reduced appended

From now on you will work with the -reduced.jsonl.bz2 archives. You can delete the others.
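As an illustration, the same field filtering can be sketched in Python with the standard json module (the sample article line below is made up for the demonstration, not real data):

```python
import json

def reduce_article(line):
    """Keep only the fields the jq filter keeps, under their new names."""
    article = json.loads(line)
    return {"id": article.get("id"), "type": article.get("tp"),
            "date": article.get("d"), "title": article.get("t"),
            "fulltext": article.get("ft")}

# Hypothetical sample line mimicking one article from a .jsonl.bz2 archive
sample = ('{"id": "GDL-1900-01-01-a-i0001", "tp": "ar", "d": "1900-01-01", '
          '"t": "Un titre", "ft": "Texte...", "lg": "fr"}')
print(json.dumps(reduce_article(sample)))
```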

Part 3 - Setting up your working environment

Python environment

  1. Download Anaconda in order to get the Conda environment manager.
  2. Familiarize yourself with Conda.
  3. Open a terminal, go to your working repository and create an environment: conda create -n NAME python=3.6, where NAME is the name you want to give to the environment (e.g. digital-history).
  4. Activate it: source activate NAME
  5. Install dependencies with pip install -r requirements.txt

Useful commands (and more info here):

conda info --envs => list your environments
source deactivate => deactivate an env
conda remove --name NAME --all => remove environment 'NAME'

Working with Jupyter notebook

What it is: see this tutorial

Conda installs Jupyter by default when you create an environment.

To launch a notebook, just execute this in your activated env: jupyter notebook

Starting to work with the data

We have put a Jupyter notebook in this repo (Example.ipynb) to give you an idea of where to start.
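As a complement to the notebook, a minimal sketch for reading a reduced archive with Python's standard bz2 and json modules (the sample file and field values are made up for the demonstration):

```python
import bz2
import json

def read_articles(path):
    """Yield one article dict per line of a -reduced.jsonl.bz2 archive."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# For the demonstration, write a tiny archive with one made-up article;
# in practice, point read_articles at one of your own reduced archives.
sample = {"id": "GDL-1900-01-01-a-i0001", "type": "ar",
          "date": "1900-01-01", "title": "Un titre", "fulltext": "Texte..."}
with bz2.open("sample-reduced.jsonl.bz2", "wt", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")

for article in read_articles("sample-reduced.jsonl.bz2"):
    print(article["date"], article["title"])
```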

If you want to use Iramuteq, you will have to isolate the textual parts and print them as specified here.