Overview

This repository contains the work I did for my master's thesis at the University of British Columbia. It is a machine learning project that tests whether SVD (singular value decomposition) can find related articles using only the keywords in each article's abstract, and which parameters work best with SVD.

Here are the basic steps for each experiment:

Download papers from MedLine or the PubMed API and extract key biomedical terms from their abstracts
Create a document-term matrix using an appropriate TF-IDF function
Perform SVD with GraphLab, with a chosen number of singular values
Measure row-row distances in the decomposed matrix that represents the document space
Find the closest rows for the rows corresponding to the target papers
Compare the predictions with the PubMed database or, in future work, with human annotations
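
The steps above can be sketched end to end on a toy corpus. This is a minimal illustration, not the thesis code: it assumes NumPy in place of GraphLab, one common TF-IDF weighting (raw count times log(N/df)), cosine distance, and made-up keyword lists. All function names are hypothetical.

```python
import numpy as np

def tfidf_matrix(docs):
    """Build a dense document-term matrix weighted by count * log(N / df)."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: j for j, term in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for term in doc:
            counts[i, index[term]] += 1
    df = (counts > 0).sum(axis=0)           # document frequency of each term
    idf = np.log(len(docs) / df)
    return counts * idf, vocab

def document_space(matrix, nsv):
    """Truncated SVD: keep nsv singular values, one row per document."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :nsv] * s[:nsv]

def closest_rows(space, target, top_n):
    """Rank all other rows by cosine distance to the target row."""
    norms = np.linalg.norm(space, axis=1)
    norms[norms == 0] = 1.0                 # avoid division by zero
    unit = space / norms[:, None]
    dist = 1.0 - unit @ unit[target]
    order = [int(i) for i in np.argsort(dist) if i != target]
    return order[:top_n]

# Toy keyword lists standing in for terms extracted from abstracts.
docs = [
    ["insulin", "glucose", "diabetes"],
    ["glucose", "diabetes", "obesity"],
    ["tumor", "chemotherapy", "carcinoma"],
    ["carcinoma", "tumor", "metastasis"],
]
matrix, vocab = tfidf_matrix(docs)
space = document_space(matrix, nsv=2)
neighbors = closest_rows(space, target=0, top_n=1)
```

Here document 0 shares terms only with document 1, so document 1 is its nearest row in the reduced space. A dense SVD like this is only practical for small matrices.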

For reproducibility

Presentations_Writings

Contains write-ups, presentations, posters, reports, etc.

Shared

Contains shared resources such as common code used by different experiments, instructions on obtaining third-party tools and resources (e.g. GraphLab, geniatagger, the biomedical phrase list, Medline papers), as well as code written by other lab members.

Directories for different experiments

These directories contain the individual experiments. For more information, see the README in each folder.

TREC2005_Training

Basic experiment on the use of SVD on a small set of abstracts.

PubMed_100_Random_Papers

This experiment selects a set of 100 random PubMed papers and measures how well SVD can find their closest papers. The results are evaluated by querying PubMed's API for the articles that PubMed labels as related. This experiment was used to choose some initial parameters, such as the distance function and the term-frequency function.
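
As a rough sketch of this kind of evaluation, the papers returned by SVD for one target can be scored against the set PubMed lists as related. The IDs below are made up, and the function name is hypothetical; the actual scoring code lives in the experiment folder.

```python
def precision_recall(predicted, relevant):
    """Score predicted related papers against a reference set of IDs."""
    predicted, relevant = set(predicted), set(relevant)
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical IDs: closest papers found by SVD vs. PubMed's related articles.
predicted_ids = ["101", "102", "103", "104"]
pubmed_related_ids = ["102", "104", "105"]
precision, recall = precision_recall(predicted_ids, pubmed_related_ids)
```

Two of the four predictions appear in the reference set, so precision is 2/4 and recall is 2/3.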

Shallow_Deep_Neighbors

Experiment on how well SVD can find related articles for each of the 100 random PubMed papers when the dataset contains not just the 100 random papers and their related articles, but also the related articles of the related articles (second-level neighbors). To see how much the addition of second-level neighbors helps, the result is compared against another dataset that contains random papers in place of the second-level neighbors.
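
One way to picture the two dataset levels is as hops in a related-article graph. The sketch below assumes the related-article lists are already available as a dictionary; the paper IDs and the function name are hypothetical.

```python
def expand_neighbors(seeds, related, depth):
    """Return the seeds plus every paper reachable within `depth` hops."""
    selected = set(seeds)
    frontier = set(seeds)
    for _ in range(depth):
        # Follow related-article links from the current frontier only.
        frontier = {n for p in frontier for n in related.get(p, [])} - selected
        selected |= frontier
    return selected

# Hypothetical related-article lists keyed by paper ID.
related = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "E"],
    "D": ["F"],
}
shallow = expand_neighbors(["A"], related, depth=1)  # first-level neighbors
deep = expand_neighbors(["A"], related, depth=2)     # adds second-level neighbors
```

With `depth=1` the set holds the seed and its related articles; with `depth=2` it also holds their related articles, and nothing further out.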

TFIDF_Variant_Comparison

Testing the effect of different TF-IDF variants on precision and recall.
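
For illustration, a few common textbook TF and IDF variants are sketched below. These are assumptions chosen to show how the weighting can vary; they are not necessarily the variants compared in this experiment, and the function names are hypothetical.

```python
import math

def tf_raw(count):
    """Raw term count."""
    return float(count)

def tf_log(count):
    """Log-scaled term count, which dampens frequent terms."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def tf_binary(count):
    """Presence or absence only."""
    return 1.0 if count > 0 else 0.0

def idf_plain(n_docs, df):
    """Plain inverse document frequency."""
    return math.log(n_docs / df)

def idf_smooth(n_docs, df):
    """Smoothed IDF that stays positive even when df == n_docs."""
    return math.log(1 + n_docs / df)

def weight(count, n_docs, df, tf=tf_raw, idf=idf_plain):
    """Combine one TF variant with one IDF variant for a single entry."""
    return tf(count) * idf(n_docs, df)

# A term seen 3 times in a document, present in 10 of 100 documents.
w_raw = weight(3, 100, 10)
w_log = weight(3, 100, 10, tf=tf_log)
w_binary = weight(3, 100, 10, tf=tf_binary)
```

Swapping the `tf` and `idf` arguments changes every entry of the document-term matrix, which is why the choice can shift precision and recall.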

Distance_Threshold

Testing if cosine distances measured from the decomposed matrices are bimodal and thus allow a cutoff to find related articles.
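
If the distances do fall into two groups, a cutoff can be placed between them. The sketch below uses Otsu's method (maximizing between-class variance), which is a swapped-in technique for illustration and not necessarily what this experiment uses. It picks a cutoff but does not itself test for bimodality. The distances are made up.

```python
def otsu_cutoff(distances):
    """Pick the split that maximizes between-class variance (Otsu's method)."""
    values = sorted(distances)
    n = len(values)
    total = sum(values)
    best_cut, best_score = None, -1.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        if values[i] == values[i - 1]:
            continue                        # no gap to cut between equal values
        w0, w1 = i / n, (n - i) / n
        m0 = left_sum / i
        m1 = (total - left_sum) / (n - i)
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_score = score
            best_cut = (values[i - 1] + values[i]) / 2
    return best_cut

# Hypothetical cosine distances from one target paper to eight others.
distances = [0.05, 0.08, 0.10, 0.12, 0.85, 0.90, 0.92, 0.95]
cutoff = otsu_cutoff(distances)
related = [d for d in distances if d < cutoff]
```

On this toy input the cutoff lands midway between 0.12 and 0.85, so the four small distances are treated as related.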

Medline_13M

SVD on all the papers available on MedLine (until March 2015), excluding the ones that do not have abstracts.

Graph_Sparsification

Experiment on whether sampling the entries in a matrix can help us estimate the number of singular values (nsv) to use.
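
A minimal sketch of the idea, under stated assumptions: entries are kept uniformly at random with probability `keep_prob` and rescaled so the sampled matrix matches the original in expectation, and the number of singular values is chosen by a 90% energy rule. Both choices, the synthetic matrix, and the function names are hypothetical; the experiment folder describes the scheme actually used.

```python
import numpy as np

def sample_entries(matrix, keep_prob, rng):
    """Keep each entry with probability keep_prob, rescaled by 1 / keep_prob."""
    mask = rng.random(matrix.shape) < keep_prob
    return np.where(mask, matrix / keep_prob, 0.0)

def count_significant(singular_values, energy=0.9):
    """Smallest number of singular values covering `energy` of the squared sum."""
    squared = singular_values ** 2
    cumulative = np.cumsum(squared) / squared.sum()
    return int(np.searchsorted(cumulative, energy) + 1)

rng = np.random.default_rng(0)
# Synthetic rank-3 matrix standing in for a document-term matrix.
low_rank = rng.random((30, 3)) @ rng.random((3, 20))
sampled = sample_entries(low_rank, keep_prob=0.5, rng=rng)

full_k = count_significant(np.linalg.svd(low_rank, compute_uv=False))
sampled_k = count_significant(np.linalg.svd(sampled, compute_uv=False))
```

Comparing `full_k` with `sampled_k` at several values of `keep_prob` shows how much sampling distorts the estimate.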

santina/Master_Thesis_UBC