Code collected from:
https://github.com/IanLewis/tensorflow-examples
https://github.com/leriomaggio/deep-learning-keras-tensorflow
https://github.com/rouseguy/scipyUS2016_dl-image
http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html
https://github.com/mikekestemont/ghentDL
http://brohrer.github.io/how_convolutional_neural_networks_work.html
https://ronxin.github.io/wevi/
- ChIP-seq data: https://www.encodeproject.org/experiments/ENCSR000AQU/
- Reference: http://www.nature.com/nbt/journal/v33/n8/full/nbt.3300.html
- Supplementary Notes: http://www.nature.com/nbt/journal/v33/n8/extref/nbt.3300-S2.pdf
train.data: Train sample size 77531
file | name | sequence | label |
---|---|---|---|
train.data | name | length 101 | 0 negative, 1 positive |
test.data: Test sample size 19383 (9709 positive, 9674 negative)
test_ans.data: (updated) test data with answers.
encodingSeq.py - sequence encoding
# Change the first line (#!/home/fish/anaconda3/bin/python) to the path of your own Python interpreter.
# Usage: encodingSeq.py train.data flanking_length
# For example:
encodingSeq.py train.data 10
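
The source of encodingSeq.py is not shown here, so the following is only a minimal sketch of what a one-hot sequence encoding might look like. The column order `ACGT`, the all-zero row for unknown bases, and the function name `one_hot_dna` are assumptions made for illustration. The `flanking_length` argument is not modeled because its meaning is not described above.

```python
# Hypothetical sketch of one-hot DNA encoding; this is NOT the actual encodingSeq.py.
BASES = "ACGT"  # assumed column order


def one_hot_dna(seq):
    """Return a len(seq) x 4 matrix (list of lists).

    Bases outside ACGT (for example N) are encoded as an all-zero row,
    which is an assumption of this sketch.
    """
    matrix = []
    for base in seq.upper():
        row = [0] * len(BASES)
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    for row in one_hot_dna("ACGTN"):
        print(row)
```

A 101-base sequence from train.data would become a 101 x 4 matrix under this scheme.
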
Homework submission
- Programs (.txt, .py)
- Training and validation accuracy (trend chart).
- Predictions for the 19383 samples in test.data (19383 rows; 0 negative, 1 positive).
- Description (1~2 pages) of which parameters you tried in this homework.
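
The accuracy reported for each group can be computed by comparing the 0/1 predictions with the answers. The sketch below assumes both have already been read into lists of integers. The file layout of test_ans.data is not described here, so parsing is left out, and the function name `accuracy` is illustrative.

```python
# Minimal sketch: fraction of predictions that match the answers.
def accuracy(predictions, answers):
    """Return the share of positions where prediction == answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(1 for p, a in zip(predictions, answers) if p == a)
    return correct / len(answers)


if __name__ == "__main__":
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```
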
Due date
- Before noon on 2/22.
Submission method
Groups
- Groups are formed in class on 2/15.
Group | Members | Accuracy |
---|---|---|
1 | winiel559 | 0.8493 |
2 | chou.yuta | 0.4959 |
3 | wtwang, jason | 0.5167 |
4 | rouanshen, ilunteng | 0.6773 |
5 | bomson, andrewkuo | 0.8845 |
6 | yichun1492 | 0.5299 |
7 | alicetuan | 0.7831 |
8 | jill | 0.6538 |
9 | fish | 0.8260 |
/notebooks/fish/udacity/4_convolutions-HW.ipynb <== baseline code
Fish's parameters | Value |
---|---|
convolution | 2 * (12 filter, size 11 * 4) |
NN | 64 hidden |
GD | SGD, batch_size 1000, 30000 runs |
Learning rate | 0.05 |
relu | unused |
pooling | unused |
k-fold | unused |
strides | 2 ([1, 2, 2, 1]) |
padding | VALID (no padding) |
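
To see how stride and padding change the feature-map size, the sketch below applies the output-length formulas that TensorFlow documents for `VALID` and `SAME` padding along a single axis. It only covers the sequence axis (length 101, filter length 11, stride 2). How the 11 * 4 filter maps onto the input layout in the baseline notebook is not stated above, so that part is left out, and the function name is illustrative.

```python
import math


def conv_output_length(input_length, filter_length, stride, padding):
    """Output length of a convolution along one axis.

    VALID: no padding, the filter must fit entirely inside the input.
    SAME:  input is zero-padded so only the stride shrinks the output.
    """
    if padding == "VALID":
        return math.ceil((input_length - filter_length + 1) / stride)
    if padding == "SAME":
        return math.ceil(input_length / stride)
    raise ValueError("padding must be 'VALID' or 'SAME'")


if __name__ == "__main__":
    print(conv_output_length(101, 11, 2, "VALID"))  # 46
    print(conv_output_length(101, 11, 2, "SAME"))   # 51
```
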
Tunable parameters
- CNN (width, depth, filter size, number of filters, max_pool/avg_pool), fully connected network (number of hidden units), learning rate, stochastic GD, dropout, etc.
Reference template
Descriptions
- Build a CNN that uses features from DNA sequence data (HW1) and protein sequence data (HW2) separately.
- Before the classifier, concatenate the two feature vectors and feed the result into the classifier (NN).
- The filter size for amino acids should be length * 20 (the number of amino acids).
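
The protein branch can be pictured as follows: each residue becomes a row of 20 columns (hence a filter of length * 20), and the two branch outputs are joined into one vector before the classifier. This is only a sketch. The column order in `AMINO_ACIDS`, the function names, and the tiny placeholder feature vectors are assumptions. In a real model the two vectors would come from the DNA and protein CNNs.

```python
# Hypothetical sketch: 20-column one-hot encoding and feature concatenation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids, assumed order


def one_hot_protein(seq):
    """Return a len(seq) x 20 matrix; unknown residues become all-zero rows."""
    matrix = []
    for residue in seq.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:
            row[idx] = 1
        matrix.append(row)
    return matrix


def concatenate_features(dna_features, protein_features):
    """Join the two branch outputs into a single classifier input vector."""
    return list(dna_features) + list(protein_features)


if __name__ == "__main__":
    encoded = one_hot_protein("MEGDAVEAIV")  # first 10 residues of CTCF
    print(len(encoded), len(encoded[0]))     # 10 20
    # Placeholder vectors standing in for the outputs of the two CNN branches.
    print(concatenate_features([0.2, 0.7], [0.1, 0.9, 0.4]))
```
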
Training data
- Take positive DNA sequences paired with CTCF as positive samples (labels should be [0, 1]).
- Take negative DNA sequences paired with CTCF as negative samples (labels should be [1, 0]).
- Take all DNA sequences paired with the 4 non-DNA-binding proteins as negative samples.
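
The three rules above can be sketched as a function that builds (DNA, protein, label) triples. The function and argument names are hypothetical, and using [1, 0] for the non-DNA-binding protein pairs is an assumption that follows the negative label given in the second rule.

```python
# Hypothetical sketch of assembling HW2 training samples.
POSITIVE = [0, 1]
NEGATIVE = [1, 0]


def build_samples(pos_dna, neg_dna, ctcf, negative_proteins):
    """Return a list of (dna_sequence, protein_sequence, label) triples."""
    samples = []
    # Rule 1: positive DNA + CTCF -> positive sample.
    for dna in pos_dna:
        samples.append((dna, ctcf, POSITIVE))
    # Rule 2: negative DNA + CTCF -> negative sample.
    for dna in neg_dna:
        samples.append((dna, ctcf, NEGATIVE))
    # Rule 3: every DNA sequence + each non-DNA-binding protein -> negative.
    for protein in negative_proteins:
        for dna in list(pos_dna) + list(neg_dna):
            samples.append((dna, protein, NEGATIVE))
    return samples


if __name__ == "__main__":
    demo = build_samples(["AAA"], ["CCC", "GGG"], "MEG", ["MDE", "MNK"])
    print(len(demo))  # 9
```

With P positive and N negative DNA sequences and 4 negative proteins, this yields 5 * (P + N) samples, of which only P are positive, so the classes are heavily imbalanced.
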
Testing data
- Test DNA sequences paired with CTCF.
CTCF_HUMAN.fasta
>sp|P49711|CTCF_HUMAN Transcriptional repressor CTCF OS=Homo sapiens GN=CTCF PE=1 SV=1
MEGDAVEAIVEESETFIKGKERKTYQRRREGGQEEDACHLPQNQTDGGEVVQDVNSSVQM
VMMEQLDPTLLQMKTEVMEGTVAPEAEAAVDDTQIITLQVVNMEEQPINIGELQLVQVPV
PVTVPVATTSVEELQGAYENEVSKEGLAESEPMICHTLPLPEGFQVVKVGANGEVETLEQ
GELPPQEDPSWQKDPDYQPPAKKTKKTKKSKLRYTEEGKDVDVSVYDFEEEQQEGLLSEV
NAEKVVGNMKPPKPTKIKKKGVKKTFQCELCSYTCPRRSNLDRHMKSHTDERPHKCHLCG
RAFRTVTLLRNHLNTHTGTRPHKCPDCDMAFVTSGELVRHRRYKHTHEKPFKCSMCDYAS
VEVSKLKRHIRSHTGERPFQCSLCSYASRDTYKLKRHMRTHSGEKPYECYICHARFTQSG
TMKMHILQKHTENVAKFHCPHCDTVIARKSDLGVHLRKQHSYIEQGKKCRYCDAVFHERY
ALIQHQKSHKNEKRFKCDQCDYACRQERHMIMHKRTHTGEKPYACSHCDKTFRQKQLLDM
HFKRYHDPNFVPAAFVCSKCGKTFTRRNTMARHADNCAGPDGVEGENGGETKKSKRGRKR
KMRSKKEDSSDSENAEPDLDDNEDEEEPAVEIEPEPEPQPVTPAPPPAKKRRGRPPGRTN
QPKQNQPTAIIQVEDQNTGAIENIIVEVKKEPDAEPAEGEEEEAQPAATDAPNGDLTPEM
ILSMMDR
4 non-DNA-binding proteins (negative proteins)
>1GND:_ GUANINE NUCLEOTIDE DISSOCIATION INHIBITOR
MDEEYDVIVLGTGLTECILSGIMSVNGKKVLHMDRNPYYGGESSSITPLEELYKRFQLLE
GPPETMGRGRDWNVDLIPKFLMANGQLVKMLLYTEVTRYLDFKVVEGSFVYKGGKIYKVP
STETEALASNLMGMFEKRRFRKFLVFVANFDENDPKTFEGVDPQNTSMRDVYRKFDLGQD
VIDFTGHALALYRTDDYLDQPCLETINRIKLYSESLARYGKSPYLYPLYGLGELPQGFAR
LSAIYGGTYMLNKPVDDIIMENGKVVGVKSEGEVARCKQLICDPSYVPDRVRKAGQVIRI
ICILSHPIKNTNDANSCQIIIPQNQVNRKSDIYVCMISYAHNVAAQGKYIAIASTTVETT
DPEKEVEPALELLEPIDQKFVAISDLYEPIDDGSESQVFCSCSYDATTHFETTCNDIKDI
YKRMAGSAFDFENMKRKQNDVFGEADQ
>1PHP:_ 3-PHOSPHOGLYCERATE KINASE (PGK) (E.C.2.7.2.3) - CHAIN _
MNKKTIRDVDVRGKRVFCRVDFNVPMEQGAITDDTRIRAALPTIRYLIEHGAKVILASHL
GRPKGKVVEELRLDAVAKRLGELLERPVAKTNEAVGDEVKAAVDRLNEGDVLLLENVRFY
PGEEKNDPELAKAFAELADLYVNDAFGAAHRAHASTEGIAHYLPAVAGFLMEKELEVLGK
ALSNPDRPFTAIIGGAKVKDKIGVIDNLLEKVDNLIIGGGLAYTFVKALGHDVGKSLLEE
DKIELAKSFMEKAKEKGVRFYMPVDVVVADRFANDANTKVVPIDAIPADWSALDIGPKTR
ELYRDVIRESKLVVWNGPMGVFEMDAFAHGTKAIAEALAEALDTYSVIGGGDSAAAVEKF
GLADKMDHISTGGGASLEFMEGKQLPGVVALEDK
>1LKI:_ LEUKEMIA INHIBITORY FACTOR (LIF) - CHAIN _
SPLPITPVNATCAIRHPCHGNLMNQIKNQLAQLNGSANALFISYYTAQGEPFPNNVEKLC
APNMTDFPSFHGNGTEKTKLVELYRMVAYLSASLTNITRDQKVLNPTAVSLQVKLNATID
VMRGLLSNVLCRLCNKYRVGHVDVPPVPDHSDKEAFQRKKLGCQLLGTYKQVISVVVQAF
>1MRP:_ FERRIC IRON BINDING PROTEIN
DITVYNGQHKEAATAVAKAFEQETGIKVTLNSGKSEQLAGQLKEEGDKTPADVFYTEQTA
TFADLSEAGLLAPISEQTIQQTAQKGVPLAPKKDWIALSGRSRVVVYDHTKLSEKDMEKS
VLDYATPKWKGKIGYVSTSGAFLEQVVALSKMKGDKVALNWLKGLKENGKLYAKNSVALQ
AVENGEVPAALINNYYWYNLAKEKGVENLKSRLYFVRHQDPGALVSYSGAAVLKASKNQA
EAQKFVDFLASKKGQEALVAARAEYPLRADVVSPFNLEPYEKLEAPVVSATTAQDKEHAI
KLIEEAGLK
Homework submission
- Same as HW1.
- Compare the results of HW1 and HW2: which one is better? Is the protein sequence helpful?
Due date
- Before noon on 3/8.
Submission method
Groups
Group | Members | Accuracy |
---|---|---|
1 | winiel559 | ---------- |
2 | chou.yuta | ---------- |
3 | wtwang, jason | ---------- |
4 | rouanshen, ilunteng | ---------- |
5 | bomson, andrewkuo | ---------- |
6 | yichun1492 | ---------- |
7 | alicetuan | ---------- |
8 | jill | ---------- |
9 | fish | ---------- |