This project is a web interface that attempts to tame the overwhelming flood of papers on Arxiv. It allows researchers to keep track of recent papers, search for papers, sort papers by similarity to any paper, see recent popular papers, add papers to a personal library, and get personalized recommendations of (new or old) Arxiv papers. This code is currently running live at www.arxiv-sanity.com/, where it serves 25,000+ Arxiv papers from Machine Learning (cs.[AI|CL|CV|LG|NE|SD]/eess.[AS|IV]/stat.ML) over all years. With this code base you can replicate the website for any of your favorite subsets of Arxiv by simply changing the categories in fetch_papers.py.
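To make the category change more concrete, here is a minimal sketch of how a category list can be turned into an arXiv API query URL. It is not the code from fetch_papers.py: the function name `build_query_url` and its parameters are hypothetical, and only the public arXiv API endpoint and its `search_query`, `sortBy`, `start` and `max_results` parameters are assumed.

```python
from urllib.parse import urlencode

# Public arXiv API endpoint (the script may use a different base URL).
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(categories, start=0, max_results=100):
    """Return an arXiv API URL listing papers from any of the given categories.

    Hypothetical helper for illustration; fetch_papers.py may build its
    query differently.
    """
    # Join the categories into one boolean query, e.g. "cat:cs.CV OR cat:cs.LG".
    search_query = " OR ".join(f"cat:{c}" for c in categories)
    params = {
        "search_query": search_query,
        "sortBy": "lastUpdatedDate",
        "start": start,
        "max_results": max_results,
    }
    # urlencode escapes the colons and turns spaces into "+".
    return f"{ARXIV_API}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url(["cs.CV", "cs.LG", "stat.ML"]))
```

Swapping in a different list of categories is then enough to point the query at another subset of Arxiv.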
The code consists of the following scripts:
- fetch_papers.py queries the Arxiv API and creates a file db.p that contains all information for each paper.
- download_pdfs.py iterates over all papers in the parsed pickle and downloads them into the folder pdf.
- thumb_pdf.py exports thumbnail pictures of all downloaded PDFs to the folder thumb.
- analyze.py computes tf-idf vectors from the fetched information and saves them to tfidf.p, tfidf_meta.p and sim_dict.p.
- buildsvm.py trains SVMs for all users (if any) and exports a pickle user_sim.p.
- make_cache.py precomputes some data for fast searching from the previous outputs and saves it to the file db2.p.
- twitter_daemon.py is optional. It uses your Twitter API credentials (stored in twitter.txt) to query Twitter periodically, looking for mentions of papers in the database, and writes the results to the pickle file twitter.p.
- serve.py runs the server.
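The scripts can be chained by hand. The sketch below assumes that the order in which they are listed is also the order in which they should run, and that each one is started without arguments. The names `PIPELINE`, `pipeline_commands` and `run_pipeline` are made up for this example and are not part of the code base.

```python
import subprocess
import sys

# Assumed run order, taken from the order of the script list.
# twitter_daemon.py (optional) and serve.py (the web server) are left out
# because they are long-running processes rather than batch steps.
PIPELINE = [
    "fetch_papers.py",
    "download_pdfs.py",
    "thumb_pdf.py",
    "analyze.py",
    "buildsvm.py",
    "make_cache.py",
]


def pipeline_commands(python=sys.executable):
    """Return one command (as an argument list) per pipeline step."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each step and, unless dry_run is set, execute it."""
    for cmd in pipeline_commands():
        print("running:", " ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Running with `dry_run=True` only prints the commands, which is a cheap way to check the order before starting the long download and analysis steps.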
You need to install the following software:
- Python 3: all of the code depends on it
- ImageMagick: converts PDFs to thumbnails
- Ghostscript: ImageMagick needs it for PDF conversion
- MongoDB: stores information from Twitter
- sqlite-tools: stores information about registered users
Then install the Python dependencies:
$ pip install -r requirements.txt
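Before running anything, it can help to check that the external tools are on your PATH. The following is a small sketch using the standard library. The executable names (`convert` or `magick` for ImageMagick, `gs` for Ghostscript, `mongod` for MongoDB, `sqlite3` for SQLite) are assumptions and may differ on your platform or version, and `find_missing_tools` is a hypothetical helper, not part of this repository.

```python
import shutil

# Tool name -> candidate executable names. These names are assumptions and
# vary between platforms and versions (e.g. ImageMagick 7 ships "magick").
REQUIRED_TOOLS = {
    "ImageMagick": ("convert", "magick"),
    "Ghostscript": ("gs",),
    "MongoDB": ("mongod",),
    "SQLite": ("sqlite3",),
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools for which no candidate executable is found."""
    missing = []
    for name, candidates in tools.items():
        # A tool counts as present if any of its candidate names resolves.
        if not any(which(exe) for exe in candidates):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all external tools found")
```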
all_in_one.py bundles all of the data preparation steps described above, so you can run it on its own to do the fetching, downloading, analyzing, etc.:
$ python all_in_one.py
Run python serve.py and visit your_ip:5000. You can change the port with the port parameter.
If you'd like to expose the server to the outside world (e.g. on AWS), run it as python serve.py --prod to use Tornado instead of Flask.
You also want to create a secret_key.txt file and fill it with random text (see top of serve.py).
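One way to produce that random text is Python's `secrets` module. This is only a sketch: it assumes serve.py reads the whole file content as the key, which you should confirm at the top of serve.py, and `write_secret_key` is a hypothetical helper name.

```python
import secrets
from pathlib import Path


def write_secret_key(path="secret_key.txt", nbytes=32):
    """Write a random hex string to `path` and return it.

    Assumes the server uses the full file content as its secret key.
    """
    # token_hex(nbytes) returns 2 * nbytes hexadecimal characters.
    key = secrets.token_hex(nbytes)
    Path(path).write_text(key)
    return key


if __name__ == "__main__":
    write_secret_key()
    print("wrote secret_key.txt")
```

Keep the generated file out of version control, since anyone who knows the key could forge session data.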