This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Add sample ML-based topic modeling support #170

Open
wants to merge 107 commits into base: master

Commits on Jun 29, 2017

  1. Create token_pool.py

    tokenize articles
    DonggeLiu authored Jun 29, 2017
    e24f3b7

Commits on Jul 3, 2017

  1. 9535b81
  2. 934da4b
  3. 1. Two LDA models (using different packages; not sure which one is better yet)

    2. A path helper to assist imports
    3. Modified token_pool to make it compatible with the LDA models
    DonggeLiu committed Jul 3, 2017
    2a8a0f2
  4. e888805
  5. a23aa13

Commits on Jul 10, 2017

  1. General

    1. Made every variable and method private where possible
    2. Reformatted code with the PyCharm shortcut
    3. Added tests for TokenPool (works well) and ModelGensim (does not work due to a 'no module named XXX' problem when model_gensim calls its abstract parent)
    4. Decoupled token_pool and model_*
    5. Used if __name__ == '__main__' to give a simple demonstration of how to use each method
    
    Model_*
    1. Renamed mode_lda.py and model_lda2.py to model_gensim.py (which uses the Gensim package) and model_lda.py (which uses the LDA package)
    2. Added an abstract parent class TopicModel.py
    3. Moved some code from summarise() to add_stories() (a. better structure of code; b. improved performance)
    4. Changed some constants to function arguments (e.g. total_topic_num, iteration_num, etc.)
    
    TokenPool
    1. Added mc_root_path() when locating the stopwords file
    2. Modified query in token pool:
    	1. added "INNER JOIN stories WHERE language='en'" to guarantee all stories are in English
    	2. added "LIMIT" and the corresponding "SELECT DISTINCT ... ORDER BY..." to guarantee that only the required number of stories is fetched (which improves performance)
    	3. added "OFFSET"
    3. Restructured token_pool.py, so that the stories are traversed only once (thus improves performance)
    4. Decoupled DB from token_pool.py
    5. Replaced regex tokenization with nltk.tokenizer
    6. Added nltk.stem.WordNetLemmatizer to lemmatize tokens (which gives better results than stemming)
    DonggeLiu committed Jul 10, 2017
    bc462ba
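
The TokenPool query changes listed in commit bc462ba (English-only join, "SELECT DISTINCT ... ORDER BY", "LIMIT", "OFFSET") can be illustrated with a minimal sketch. It assumes a hypothetical, simplified two-table schema in an in-memory SQLite database; the table layout, the column names and the function fetch_english_story_ids are illustrative assumptions, not the query used in this PR.

```python
import sqlite3


def make_sample_db() -> sqlite3.Connection:
    """Build an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE stories (stories_id INTEGER PRIMARY KEY, language TEXT);
        CREATE TABLE story_sentences (
            stories_id INTEGER, sentence_number INTEGER, sentence TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO stories VALUES (?, ?)",
        [(1, "en"), (2, "es"), (3, "en"), (4, "en")],
    )
    # Two sentences per story, so DISTINCT is needed to get one row per story.
    conn.executemany(
        "INSERT INTO story_sentences VALUES (?, ?, ?)",
        [(sid, n, f"sentence {n} of story {sid}")
         for sid in (1, 2, 3, 4) for n in (0, 1)],
    )
    return conn


def fetch_english_story_ids(conn, limit, offset=0):
    """Return at most `limit` distinct English story IDs, skipping `offset`."""
    rows = conn.execute(
        """
        SELECT DISTINCT ss.stories_id
        FROM story_sentences AS ss
        INNER JOIN stories AS s ON s.stories_id = ss.stories_id
        WHERE s.language = 'en'
        ORDER BY ss.stories_id
        LIMIT ? OFFSET ?
        """,
        (limit, offset),
    ).fetchall()
    return [row[0] for row in rows]


conn = make_sample_db()
first_page = fetch_english_story_ids(conn, limit=2)
second_page = fetch_english_story_ids(conn, limit=2, offset=2)
```

The ORDER BY keeps the pages stable, so LIMIT and OFFSET can walk through the stories in fixed-size batches without fetching more rows than required.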

Commits on Jul 11, 2017

  1. 83a31a7
  2. ced8bb4

Commits on Jul 17, 2017

  1. 943c696
  2. 3db49ee
  3. 06d1d37
  4. e027dad
  5. 336c0d8

Commits on Jul 18, 2017

  1. 178226b
  2. ebc4715

Commits on Jul 20, 2017

  1. 39c5e8c

Commits on Jul 24, 2017

  1. 716fe91
  2. f66ead6
  3. 2d6c12d
  4. added model_nmf.py to model topics with the NMF algorithm

    The results of this algorithm are similar to, but slightly different from, those of the LDA model.
    In addition, it allows multiple topics for each story.
    DonggeLiu committed Jul 24, 2017
    6c50ed2
  5. test cases for model_nmf.py

    DonggeLiu committed Jul 24, 2017
    679fef0
  6. 3ab2124
  7. 025dece
  8. 61517d1
  9. cache WordNet

    DonggeLiu committed Jul 24, 2017
    36817b9
  10. b5562ad
  11. relocate test files

    DonggeLiu committed Jul 24, 2017
    e6b126c
  12. c93fe63
  13. 1. removed JSON serialization after fetching sentences from the database

    2. renamed a few methods/variables to reflect their changed functionality
    DonggeLiu committed Jul 24, 2017
    730a4e9
  14. add .close to open file

    DonggeLiu committed Jul 24, 2017
    3b38dff
  15. add .close() to opened file

    DonggeLiu committed Jul 24, 2017
    154f96d
  16. 5ea449a
  17. 34fdcbc
  18. baca56c
  19. fe78de8
  20. 1. Changed the SQL query to match the one suggested in the previous PR review, leaving the alternative query and related code as comments

    2. Allowed TokenPool to take either a DBHandler or a TextIOWrapper
    DonggeLiu committed Jul 24, 2017
    91d725e
  21. Separated test cases for the three models from db_connection

    they now take the stories in the sample file as input
    DonggeLiu committed Jul 24, 2017
    0ca1eca
  22. dc0b73b
  23. 96f566c
  24. c488c08

Commits on Jul 26, 2017

  1. remove import path_helper

    DonggeLiu committed Jul 26, 2017
    6d8555e
  2. 6182c4f
  3. 9c68669
  4. silent wget

    DonggeLiu committed Jul 26, 2017
    0e04ff1
  5. d995cb8

Commits on Jul 27, 2017

  1. a361b01
  2. db1c584
  3. 2a88eab
  4. Don't --force-reinstall stuff needlessly

    pypt authored and DonggeLiu committed Jul 27, 2017
    b62e71d
  5. Install only WordNet data from NLTK data

    1) Faster (Travis doesn't have all day)
    2) We only use WordNet at the moment
    pypt authored and DonggeLiu committed Jul 27, 2017
    7922d3c
  6. Revert "added COMMAND_PREFIX to use sudo on linux"

    This reverts commit db1c584.
    pypt authored and DonggeLiu committed Jul 27, 2017
    7ce27cc
  7. Revert "turn on -n switch of unzip gh-pages.zip, preventing rewrite existing files"

    This reverts commit a361b01.
    pypt authored and DonggeLiu committed Jul 27, 2017
    29d460c
  8. Revert "adding more echos and comments"

    This reverts commit d995cb8.
    pypt authored and DonggeLiu committed Jul 27, 2017
    4008366
  9. Revert "silent wget"

    This reverts commit 0e04ff1.
    pypt authored and DonggeLiu committed Jul 27, 2017
    c1da604
  10. Revert "Use wget instead of nltk.download() to avoid 405 error"

    This reverts commit 9c68669.
    pypt authored and DonggeLiu committed Jul 27, 2017
    7b6beaf
  11. Install NLTK data from own mirror on S3

    pypt authored and DonggeLiu committed Jul 27, 2017
    bf2c962
  12. Install only WordNet data from NLTK data

    1) Faster (Travis doesn't have all day)
    2) We only use WordNet at the moment
    pypt authored and DonggeLiu committed Jul 27, 2017
    482f01e
  13. Don't --force-reinstall stuff needlessly

    pypt authored and DonggeLiu committed Jul 27, 2017
    00633aa

Commits on Aug 1, 2017

  1. 6f09e31

Commits on Aug 7, 2017

  1. 179da05
  2. 1. make use of sample_handler.py to access the sample file

    2. fix newly raised PyCharm warnings (expected an iterator, got a list)
    DonggeLiu committed Aug 7, 2017
    1cf5601
  3. 1d3ad5e
  4. 81d6892

Commits on Aug 8, 2017

  1. Temporarily disable unit tests for Travis to cache dependencies

    Before running unit tests, Travis installs all Perl and Python
    dependency modules which takes up a lot of time and doesn't always leave
    enough time (of the available 50 minutes) to complete all the unit
    tests.
    
    After a successful unit test run, Travis caches all the installed
    dependencies so that it doesn't have to install anymore and can get to
    running unit tests themselves faster.
    
    So, we temporarily disable the unit tests (replace them with a simple
    "echo" statement) for Travis to be able to install the dependencies and
    cache them. Subsequent Travis runs (with actual unit tests reenabled)
    will then be able to use the pre-cached dependencies.
    pypt committed Aug 8, 2017
    8861d9e
  2. Revert "cache WordNet"

    This reverts commit 36817b9.
    
    Caching fails because Travis is unable to find /usr/share/nltk_data for
    whatever reason:
    
    https://travis-ci.org/berkmancenter/mediacloud#L3361
    
    ...and so nothing gets cached (including Perl dependencies which take a
    long time to install), and so builds time out.
    pypt committed Aug 8, 2017
    c732a50
  3. 65c505b

Commits on Aug 9, 2017

  1. 73f7e2e
  2. unify the name of the model used in each class to self._model, as in the abstract class

    added a method named evaluate, as in the abstract class
    DonggeLiu committed Aug 9, 2017
    ef35923
  3. 89882cd
  4. 73e518c
  5. e2d6655
  6. 5289a85
  7. 00831af
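
The arrangement described in commit ef35923 (every model class keeps its model in self._model and implements evaluate, as declared in the abstract class) might look roughly like the sketch below. Only the names TopicModel, add_stories, summarise, evaluate and self._model come from the commit messages; all signatures, the ModelWordCount subclass and its scoring are hypothetical stand-ins rather than the code in this PR.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, List


class TopicModel(ABC):
    """Hypothetical abstract parent shared by the concrete model classes."""

    def __init__(self) -> None:
        self._model = None  # every subclass stores its fitted model here
        self._stories: Dict[int, List[str]] = {}

    def add_stories(self, stories: Dict[int, List[str]]) -> None:
        """Register tokenized stories, keyed by story ID (assumed format)."""
        self._stories.update(stories)

    @abstractmethod
    def summarise(self, topic_word_num: int = 2) -> Dict[int, List[str]]:
        """Return the topic words of each story."""

    @abstractmethod
    def evaluate(self) -> float:
        """Return a score describing the fitted model."""


class ModelWordCount(TopicModel):
    """Toy stand-in: the 'topics' are each story's most frequent tokens."""

    def summarise(self, topic_word_num: int = 2) -> Dict[int, List[str]]:
        self._model = {sid: Counter(tokens)
                       for sid, tokens in self._stories.items()}
        return {sid: [word for word, _ in counts.most_common(topic_word_num)]
                for sid, counts in self._model.items()}

    def evaluate(self) -> float:
        if not self._model:
            raise RuntimeError("call summarise() before evaluate()")
        # Toy score: average number of distinct tokens per story.
        return sum(len(c) for c in self._model.values()) / len(self._model)


model = ModelWordCount()
model.add_stories({1: ["cat", "dog", "cat"], 2: ["tax", "tax", "vote", "tax"]})
topics = model.summarise(topic_word_num=1)
score = model.evaluate()
```

Because the abstract methods are declared on the parent, a subclass that forgets evaluate cannot be instantiated, which keeps the models interchangeable in tests.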

Commits on Aug 12, 2017

  1. 59bcb50
  2. 2c8e6eb

Commits on Aug 13, 2017

  1. a finder that can identify the max/min points of a polynomial computed based on a few points

    DonggeLiu committed Aug 13, 2017
    d1129a6
  2. 4d5b9e4
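
The idea behind commit d1129a6 (fit a polynomial to a few points, then locate its maximum or minimum) can be sketched as follows. This is a simplified illustration that assumes a quadratic through exactly three points; the function names fit_parabola and find_extremum are hypothetical, and the degree and fitting method used in the PR may differ.

```python
from typing import Tuple

Point = Tuple[float, float]


def fit_parabola(p0: Point, p1: Point, p2: Point) -> Tuple[float, float, float]:
    """Return (a, b, c) of y = a*x^2 + b*x + c through three points."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    denom = (x0 - x1) * (x0 - x2) * (x1 - x2)
    if denom == 0:
        raise ValueError("the three x values must be distinct")
    a = (x2 * (y1 - y0) + x1 * (y0 - y2) + x0 * (y2 - y1)) / denom
    b = (x2 ** 2 * (y0 - y1) + x1 ** 2 * (y2 - y0) + x0 ** 2 * (y1 - y2)) / denom
    c = (x1 * x2 * (x1 - x2) * y0
         + x2 * x0 * (x2 - x0) * y1
         + x0 * x1 * (x0 - x1) * y2) / denom
    return a, b, c


def find_extremum(p0: Point, p1: Point, p2: Point) -> Tuple[float, float, str]:
    """Return (x, y, kind) of the fitted parabola's single turning point."""
    a, b, c = fit_parabola(p0, p1, p2)
    if a == 0:
        raise ValueError("points are collinear, so there is no extremum")
    x = -b / (2 * a)  # the derivative 2*a*x + b is zero here
    y = a * x * x + b * x + c
    return x, y, "max" if a < 0 else "min"


# Hypothetical (parameter value, score) samples lying on y = -(x - 3)^2 + 5.
x_best, y_best, kind = find_extremum((1, 1), (2, 4), (4, 4))
```

In a tuning setting the x values would be candidate parameter values and the y values their measured scores, so only a handful of model runs are needed to estimate where the score peaks.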

Commits on Aug 14, 2017

  1. 8e77ed4
  2. 809aad7

Commits on Aug 19, 2017

  1. f819366
  2. no longer test tune_with_iteration, as the polynomial approach has significantly better efficiency and performance

    I will combine these two later
    DonggeLiu committed Aug 19, 2017
    9869ca8
  3. e185dd0
  4. 3545e0e

Commits on Aug 20, 2017

  1. 7816ec8
  2. 94ebc24
  3. c1c257e
  4. removed unnecessary tune_with_iteration, as its advantages/features have been combined into tune_with_polynomial

    DonggeLiu committed Aug 20, 2017
    6d09265
  5. 2479107
  6. removed useless code

    DonggeLiu committed Aug 20, 2017
    51dd0ec
  7. 620afb4
  8. Disable unit tests temporarily for Travis to have a chance to compile and cache dependencies

    DonggeLiu committed Aug 20, 2017
    5ead4f2
  9. Cache WordNet of NLTK

    DonggeLiu committed Aug 20, 2017
    0fb4e4a
  10. set test cases back

    DonggeLiu committed Aug 20, 2017
    87efd01
  11. 6ea203b

Commits on Aug 21, 2017

  1. added more story samples

    DonggeLiu committed Aug 21, 2017
    b675559
  2. 8753442
  3. e39415b
  4. changed sample file name

    DonggeLiu committed Aug 21, 2017
    a674d26
  5. this sample file has been replaced by 3 files of different sizes

    This allows more flexibility in Travis (i.e. we can use larger samples if tests are allowed to run longer in Travis)
    DonggeLiu committed Aug 21, 2017
    6267f72
  6. d4e9d48
  7. 1. break large blocks of code up into more functions

    2. improve performance based on empirical results
    DonggeLiu committed Aug 21, 2017
    0c3f7ee
  8. remove unnecessary code

    DonggeLiu committed Aug 21, 2017
    4c12748
  9. 720dd7a

Commits on Aug 22, 2017

  1. further improvements on the code structure

    added more comments
    DonggeLiu committed Aug 22, 2017
    97afc48
  2. remove redundant code

    DonggeLiu committed Aug 22, 2017
    016d01c

Commits on Sep 1, 2017

  1. 9ff15ff