Skip to content

Latest commit

 

History

History
98 lines (78 loc) · 3.58 KB

187. Repeated DNA Sequences.md

File metadata and controls

98 lines (78 loc) · 3.58 KB
title toc date tags top
187. Repeated DNA Sequences
false
2017-10-10
Leetcode
Hash Table
Bit Manipulation
187

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Example:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output: ["AAAAACCCCC", "CCCCCAAAAA"]

分析

最直接的方法是依次取出长度为10的子字符串,放入哈希表中,计算子字符串出现的次数。当子字符串出现的次数大于1时,加入到结果中。

public List<String> findRepeatedDnaSequences(String s) {
    List<String> list = new ArrayList<>();
    if (s == null || s.length() < 10) return list;
    HashMap<String, Integer> map = new HashMap<>();
    for (int i = 10; i <= s.length(); i++) {
        String sub = s.substring(i - 10, i);
        map.put(sub, map.getOrDefault(sub, 0) + 1);
    }
    
    for (Map.Entry<String, Integer> entry : map.entrySet())
        if (entry.getValue() > 1) list.add(entry.getKey());
    
    return list;
}

利用位操作也可以寻找到重复的子字符串。方法是将字符A、C、G、T映射成数字0,1,2,3。注意到数字0,1,2,3对应的二进制数字的有效位数为2,那么我们可以用两个二进制位来表示一个字符,也就是说用20个二进制位来表示一个长度为10的字符串。既然一个整数int长度为32位,那么可以用int来表示长度为10的字符串。所以只需要验证整数是否重复出现,来代替验证子字符串是否重复出现。一方面节省了空间复杂度(一个整数比10个字符占用的空间小),另一方面节省了时间复杂度(整数的hashCode()计算比字符串要简单,整数的hash值即为整数本身)。

public List<String> findRepeatedDnaSequences(String s) {
    Set<Integer> words = new HashSet<>();
    Set<Integer> doubleWords = new HashSet<>();
    List<String> rv = new ArrayList<>();
    // 将字符映射成数字
    char[] map = new char[26];
    map['A' - 'A'] = 0;
    map['C' - 'A'] = 1;
    map['G' - 'A'] = 2;
    map['T' - 'A'] = 3;

    for(int i = 0; i < s.length() - 9; i++) {
        int v = 0;
        // 将长度位10的字符串转化为一个整数
        for (int j = i; j < i + 10; j++) {
            v <<= 2;
            v |= map[s.charAt(j) - 'A'];
        }
        // 查看是否已经访问过该字符串
        if(!words.add(v) && doubleWords.add(v))
            rv.add(s.substring(i, i + 10));

    }
    return rv;
}

Can we do better? YES! 上面说到只需要20位二进制数字就可以表示长度为10的子字符串,但是在构建对应的数字的时候,每次都是从头开始的,有这个必要吗?其实每次左移2位,然后取最低20位就可以了!

public List<String> findRepeatedDnaSequences(String s) {
    Set<Integer> words = new HashSet<>();
    Set<Integer> doubleWords = new HashSet<>();
    List<String> rv = new ArrayList<>();
    char[] map = new char[26];
    map['A' - 'A'] = 0;
    map['C' - 'A'] = 1;
    map['G' - 'A'] = 2;
    map['T' - 'A'] = 3;
    
    int v = 0;
    for (int i = 0 i < s.length(); i++) {
        // 每次取出最低20位二进制数字
        v = ((v << 2) | (map[s.charAt(i) - 'A'])) & 0xfffff;
        if (i < 9) continue;
        if(!words.add(v) && doubleWords.add(v))
            rv.add(s.substring(i - 9, i + 1));
    }
    return rv;
}