A question about the PSTNPss function in util/FileProcessing.py #16

Open
DellCode233 opened this issue Oct 20, 2023 · 0 comments

Hello, I am interested in your project and have some questions after reading your source code. I hope you can help me answer them.

My questions are about the PSTNPss function in util/FileProcessing.py.
Question 1: In this function, you subtract one from the total number of samples for the corresponding label, and one from the trinucleotide count at the corresponding position. I don’t understand the purpose of this or the principle behind it.

p_num, n_num = positive_number, negative_number
po_number = matrix_po[j][order[sequence[j: j + 3]]]
if i[0] in positive_key and po_number > 0:
    po_number -= 1
    p_num -= 1
ne_number = matrix_ne[j][order[sequence[j: j + 3]]]
if i[0] in negative_key and ne_number > 0:
    ne_number -= 1
    n_num -= 1
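
For reference, here is a minimal sketch of what the lines above appear to compute, on toy data. The helper names (`trinucleotide_count`, `pstnp_value`) and the sequences are hypothetical and not part of the project. One possible reading, which is an assumption and not confirmed by the maintainers, is that the subtraction removes the sample's own contribution from the statistics of its class, as in a leave-one-out estimate.

```python
from collections import Counter


def trinucleotide_count(seqs, j, tri):
    """Number of sequences in seqs whose trinucleotide at position j is tri."""
    return Counter(s[j:j + 3] for s in seqs)[tri]


def pstnp_value(seq, j, pos_seqs, neg_seqs, in_pos=False, in_neg=False):
    """Positive frequency minus negative frequency at position j.

    in_pos / in_neg stand in for the `i[0] in positive_key` and
    `i[0] in negative_key` checks in the snippet (hypothetical flags).
    """
    tri = seq[j:j + 3]
    po_number = trinucleotide_count(pos_seqs, j, tri)
    ne_number = trinucleotide_count(neg_seqs, j, tri)
    p_num, n_num = len(pos_seqs), len(neg_seqs)
    if in_pos and po_number > 0:
        po_number -= 1  # drop the sample's own trinucleotide count
        p_num -= 1      # and the sample itself from the class total
    if in_neg and ne_number > 0:
        ne_number -= 1
        n_num -= 1
    return po_number / p_num - ne_number / n_num


# Toy data, made up for illustration only.
positive = ["ACGTA", "ACGTT", "TTGCA"]
negative = ["GGGTA", "CCCTA", "ACGAA"]
sample = positive[0]

plain = pstnp_value(sample, 0, positive, negative)
adjusted = pstnp_value(sample, 0, positive, negative, in_pos=True)
# Same sample scored against positives that never contained it.
held_out = pstnp_value(sample, 0, positive[1:], negative)

print(plain, adjusted, held_out)
```

On this toy data the adjusted value matches the value obtained by removing the sample from the positive set before counting, which is what suggests the leave-one-out reading.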

Question 2: This function also handles the training dataset and the testing dataset differently. For training samples, you perform the subtraction described above, but for testing samples you do not. I don’t understand why there is such a difference. I have attached your code snippet for your convenience. Thank you for your time and help!

    def PSTNPss(self):
        try:
            if not self.is_equal:
                self.error_msg = 'PSTNPss descriptor need fasta sequence with equal length.'
                return False

            fastas = []
            for item in self.fasta_list:
                if item[3] == 'training':
                    fastas.append(item)
                    fastas.append([item[0], item[1], item[2], 'testing'])
                else:
                    fastas.append(item)

            for i in fastas:
                if re.search('[^ACGT-]', i[1]):
                    self.error_msg = 'Illegal character included in the fasta sequences, only the "ACGT[U]" are allowed by this encoding scheme.'
                    return False

            encodings = []
            header = ['SampleName', 'label']
            for pos in range(len(fastas[0][1]) - 2):
                header.append('Pos.%d' % (pos + 1))
            encodings.append(header)

            positive = []
            negative = []
            positive_key = []
            negative_key = []
            for i in fastas:
                if i[3] == 'training':
                    if i[2] == '1':
                        positive.append(i[1])
                        positive_key.append(i[0])
                    else:
                        negative.append(i[1])
                        negative_key.append(i[0])

            nucleotides = ['A', 'C', 'G', 'T']
            trinucleotides = [n1 + n2 + n3 for n1 in nucleotides for n2 in nucleotides for n3 in nucleotides]
            order = {}
            for i in range(len(trinucleotides)):
                order[trinucleotides[i]] = i

            matrix_po = self.CalculateMatrix(positive, order)
            matrix_ne = self.CalculateMatrix(negative, order)

            positive_number = len(positive)
            negative_number = len(negative)

            for i in fastas:
                if i[3] == 'testing':
                    name, sequence, label = i[0], i[1], i[2]
                    code = [name, label]
                    for j in range(len(sequence) - 2):
                        if re.search('-', sequence[j: j + 3]):
                            code.append(0)
                        else:
                            p_num, n_num = positive_number, negative_number
                            po_number = matrix_po[j][order[sequence[j: j + 3]]]
                            if i[0] in positive_key and po_number > 0:
                                po_number -= 1
                                p_num -= 1
                            ne_number = matrix_ne[j][order[sequence[j: j + 3]]]
                            if i[0] in negative_key and ne_number > 0:
                                ne_number -= 1
                                n_num -= 1
                            code.append(po_number / p_num - ne_number / n_num)
                            # print(sequence[j: j+3], order[sequence[j: j+3]], po_number, p_num, ne_number, n_num)
                    encodings.append(code)
            self.encoding_array = np.array([])
            self.encoding_array = np.array(encodings, dtype=str)
            self.column = self.encoding_array.shape[1]
            self.row = self.encoding_array.shape[0] - 1
            del encodings
            if self.encoding_array.shape[0] > 1:
                return True
            else:
                return False
        except Exception as e:
            self.error_msg = str(e)
            return False
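
To make Question 2 concrete, here is a small sketch of how the function above seems to decide which rows receive the subtraction. The helper names (`expand`, `adjusted_rows`) and the sample records are hypothetical. As far as the quoted code shows, every training record is duplicated with a 'testing' tag, only 'testing' rows are encoded, and the subtraction is triggered by the sample name appearing among the training names, not by the tag itself.

```python
def expand(fasta_list):
    """Mirror the duplication step: each training record also gets a 'testing' copy."""
    fastas = []
    for item in fasta_list:
        fastas.append(item)
        if item[3] == 'training':
            fastas.append([item[0], item[1], item[2], 'testing'])
    return fastas


def adjusted_rows(fasta_list):
    """For each encoded ('testing') row, report whether its name is a training name."""
    fastas = expand(fasta_list)
    training_names = {i[0] for i in fastas if i[3] == 'training'}
    return [(i[0], i[0] in training_names) for i in fastas if i[3] == 'testing']


# Records follow the [name, sequence, label, split] layout used in the snippet.
records = [
    ['s1', 'ACGT', '1', 'training'],
    ['s2', 'TTTT', '0', 'training'],
    ['s3', 'ACGA', '1', 'testing'],
]

print(adjusted_rows(records))
```

Under this reading, the copies of training samples are the only rows whose own counts are already inside matrix_po or matrix_ne, so they are the only rows that get the subtraction, while genuine testing samples are scored against the unmodified counts. Whether that is the intended design is exactly what the question asks.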