10. 正则表达式匹配 [hard]

https://leetcode-cn.com/problems/regular-expression-matching

给你一个字符串 s 和一个字符规律 p，请你来实现一个支持 '.' 和 '*' 的正则表达式匹配。

'.' 匹配任意单个字符
'*' 匹配零个或多个前面的那一个元素

所谓匹配，是要涵盖整个字符串 s的，而不是部分字符串。

说明:

s 可能为空，且只包含从 a-z 的小写字母。
p 可能为空，且只包含从 a-z 的小写字母，以及字符 . 和 *。

示例 1:

输入:
s = "aa"
p = "a"
输出: false
解释: "a" 无法匹配 "aa" 整个字符串。

示例 2:

输入:
s = "aa"
p = "a*"
输出: true
解释: 因为 '*' 代表可以匹配零个或多个前面的那一个元素, 在这里前面的元素就是 'a'。因此，字符串 "aa" 可被视为 'a' 重复了一次。

示例 3:

输入:
s = "ab"
p = ".*"
输出: true
解释: ".*" 表示可匹配零个或多个（'*'）任意字符（'.'）。

示例 4:

输入:
s = "aab"
p = "c*a*b"
输出: true
解释: 因为 '*' 表示零个或多个，这里 'c' 为 0 个, 'a' 被重复一次。因此可以匹配字符串 "aab"。

示例 5:

输入:
s = "mississippi"
p = "mis*is*p*."
输出: false

通过次数81,627 | 提交次数289,182

Second Try

2020-06-13

First Try的加速版本，当时还没想到有重复的路径需要用用cache缓存，只是考虑到每次递归调用都需要复制字符串，担心耗费太多时间，于是传入同一个字符串和当前处理的指针位置。速度最终是有点提升，但没用cache缓存的时候还是挂掉了，问题还是在于状态的保存，动态规划。

既然Fist Try中的string消耗递归也没影响多少时间，这步优化可以不用。

class Solution(object):
    def isMatch(self, s, p):
        """
        :type s: str
        :type p: str
        :rtype: bool

        - 还是考虑使用回溯法，看是否能够同时消耗掉string和pattern
        - *号依然每次考虑两个点
        """
        # global k
        # k = 0

        cache = [["" for i in range(len(p)+1)] for j in range(len(s)+1)]
        def match(string, pattern, si, pi):
            # string或者pattern有一个用完了，需要判断是否两个都用完
            # 如果pattern用完了，那么可以肯定认为string必须得用完
            # 如果string用完了，但是pattern没用完，有可能pattern剩下的是无效的，需要允许在string为空的时候继续进入判断
            # 允许string为空的时候进入判断，这个就麻烦了，又写出了几个bug
            # print(string, pattern)
            # print(id(string))  # 每次调用的string都一样，没有被复制
            # global k
            # k += 1
            if cache[si][pi] != "":
                return cache[si][pi]
            sn, pn = len(string), len(pattern)
            # print(si, sn, pi, pn)          
            if pi == pn:  # 溢出才算数
                return si == sn
            first_match = si != sn and (pattern[pi] in (string[si], "."))
            rv = False
            if pn - pi >= 2 and pattern[pi + 1] == "*":
                # choice 1: just pass over the pattern
                # choice 2: consume the string, pattern reserved, 但是需要注意string仍然有内容可供跳转，不然就是无限递归
                choice_one = match(string, pattern, si, pi + 2)
                choice_two = first_match  and match(string, pattern, si + 1, pi)              
                rv = choice_one or choice_two
            elif first_match:
                rv = match(string, pattern, si + 1, pi + 1)
            else:
                rv = False
                
            cache[si][pi] = rv
            return cache[si][pi]
    
        rv =  match(s, p, 0, 0)
        # print(k)
        return rv

执行用时 :32 ms, 在所有 Python 提交中击败了96.98%的用户
内存消耗 :12.9 MB, 在所有 Python 提交中击败了33.33%的用户

First Try

2020-06-13

这道题的坑非常多，重点是考虑“*"号多重路径的处理，最后还要考察动态规划的状态缓存。

*号的坑：
- 首先**是不允许存在的，*也不会出现在第一个，不过该题目好像不需要考虑这个问题
- 在遇到*号的时候，需要考虑到跳两格处理问题，同时还需要能想出递归的编写方式。
- 每次遇到*号，有两种路径选择需要分开考虑:
  - choice 1: just pass over the pattern, 跳过当前pattern不去管，让递归自动进入string和pattern下一级的判断。
  - choice 2: consume the string, pattern reserved, 消耗当前string一个字符，让递归继续判断string下一级和当前pattern。
结束条件有坑：
- string或者pattern有一个用完了，需要判断是否两个都用完才算是结束条件
- 如果pattern用完了，那么string必须得用完，否则挂掉
- 如果string用完了但是pattern没用完，有可能pattern剩下的是无效的，需要允许在string为空的时候继续进入判断。由于这一步的影响，需要在当string为空的时候跳过其他判断，只运行判断"*"号的处理。
cache缓存动态规划：
- 本来使用string和pattern字符串递归传递，很难用字符来做缓存数组，但考虑到剩余长度为n的string是相同的，直接用剩下的string和pattern长度作为缓存即可。

时间复杂度方面，由于每个i,j的匹配都被缓存至多计算一次，因此复杂度为O(len(s)*len(p)),还行。

class Solution(object):
    def isMatch(self, s, p):
        """
        :type s: str
        :type p: str
        :rtype: bool

        - 还是考虑使用回溯法，看是否能够同时消耗掉string和pattern
        - *号依然每次考虑两个点

        """
        cache = [["" for i in range(len(p)+1)] for j in range(len(s) + 1)]
        def match(string, pattern):
            # string或者pattern有一个用完了，需要判断是否两个都用完
            # 如果pattern用完了，那么可以肯定认为string必须得用完
            # 如果string用完了，但是pattern没用完，有可能pattern剩下的是无效的，需要允许在string为空的时候继续进入判断
            # 允许string为空的时候进入判断，这个就麻烦了，又写出了几个bug
            # print(string, pattern)
            if cache[len(string)][len(pattern)] != "":
                return cache[len(string)][len(pattern)]
            if pattern == "": 
                return string == ""
            if len(pattern) >= 2 and pattern[1] == "*":
                # choice 1: just pass over the pattern
                # choice 2: consume the string, pattern reserved, 但是需要注意string仍然有内容可供跳转，不然就是无限递归
                choice_one = match(string, pattern[2:])
                choice_two = len(string) != 0 and (pattern[0] in (string[0], ".")) and match(string[1:], pattern)
                rv = choice_one or choice_two
            elif len(string) != 0 and pattern[0] in (string[0], "."):
                rv = match(string[1:], pattern[1:])
            else:
                rv = False
            cache[len(string)][len(pattern)] = rv
            return rv
        return match(s, p)

执行用时 :40 ms, 在所有 Python 提交中击败了90.06%的用户
内存消耗 :12.9 MB, 在所有 Python 提交中击败了33.33%的用户

failed try（time exceeded version)

没有用动态规划的递归版本，一路顺利，一直到某个string和patter的时候直接超时，挂掉。

class Solution(object):
    def isMatch(self, s, p):
        """
        :type s: str
        :type p: str
        :rtype: bool

        - 还是考虑使用回溯法，看是否能够同时消耗掉string和pattern
        - *号依然每次考虑两个点

        超出时间限制， oops
        "aaaaaaaaaaaaab"
        "a*a*a*a*a*a*a*a*a*a*a*a*b"
        """

        def match(string, pattern):
            # string或者pattern有一个用完了，需要判断是否两个都用完
            # 如果pattern用完了，那么可以肯定认为string必须得用完
            # 如果string用完了，但是pattern没用完，有可能pattern剩下的是无效的，需要允许在string为空的时候继续进入判断
            # 允许string为空的时候进入判断，这个就麻烦了，又写出了几个bug
            # print(string, pattern)
            if pattern == "": 
                return string == ""
            if len(pattern) >= 2 and pattern[1] == "*":
                # choice 1: just pass over the pattern
                # choice 2: consume the string, pattern reserved, 但是需要注意string仍然有内容可供跳转，不然就是无限递归
                choice_one = match(string, pattern[2:])
                choice_two = len(string) != 0 and (pattern[0] in (string[0], ".")) and match(string[1:], pattern)
                return choice_one or choice_two
            if len(string) != 0 and pattern[0] in (string[0], "."):
                return match(string[1:], pattern[1:])
            return False
        return match(s, p)

Bug Try

2020-06-15

因为有了解过正则的图形处理方法，所有有点思路，遇到分叉路口多种选择，比如*号的时候把多余的路径保存起来，其他路径失败的时候取出保存的路径即可。照理来说这个思路是没问题的，但没有构建图形状态切换，直接裸写实在太多bug了，浪费了无数次提交机会。

一直想要一个字符一个字符处理，但是遇到号的时候需要考虑回退、往前走等各种情况，非常头疼。解决了号问题，又出现string处理完，pattern没有处理完的情况，不停打补丁还是解决不了这个bug。

按照自己的写法，总共花了大半天时间一直到深夜，每次都觉得有希望，最后还没解决bug，太惨了。

# f**k 还是挂掉了， 在处理s结束，但是p还没结束上悲剧了，不知道如何处理
# *号连续处理两个
class Solution(object):
    def isMatch(self, s, p):
        """
        :type s: str
        :type p: str
        :rtype: bool
        
        特殊情况: .*, 还有a*， 
        遇到*号往回溯即可，但问题来了，如果.*A遇到BAA的时候，到底什么时候走到A的位置？
        直接蒙蔽。

        挂在这个例子上:
        "aaa"
        "ab*a*c*a"

        但s被消耗完，p没消耗完，需要考察p剩下的是否不影响，太tmd难了

        还有这个例子
        "a"
        "ab*"

        还是挂在这个上面，连续提交了8次了
        "ab"
        ".*c"


        "a"
        ".*..a*"
        """

        # 每次遇到 *号有两种可能，要么完全跳过，要么消耗一个string长度; 每次只选择一个可能，另一个可能先放在栈中。
        # 由于*号总会影响到前面的判断，因此最好每次找两个，遇到星号一并处理了
        # 每次遇到*号，直接跳过的往前走(s, p+2)，消耗的放在额外的栈中(s+2, p)
        # 当s被消耗完，p没消耗完，需要考察p剩下的是否不影响，太tmd难了
        # 可以缓存每种交叉路口的选择，证明已经被走过了

        # 不允许连续两个"*"，如果允许直接替换为"*"也可以方便处理；同时不允许第一个符号为*
        
        if "**" in p or p[0] == "*":
            return False
        reserved = [(0, 0)]
        choice_recorded = set()

        i = 0
        j = 0
        while len(reserved) != 0:
            i, j = reserved.pop()
            # 问题在于如果i已经走完，p还没走完，怎么继续测试
            while i <= len(s) - 1 and j <= len(p) - 1:
                print("inner", i, j, reserved)
                if j <= len(p) - 2 and p[j + 1] == "*":
                    # choice one, pass the pattern
                    reserved.append((i, j + 2))
                    # choice two, consume the string
                    if p[j] in (s[i], '.'):
                        i = i + 1
                    # if p[j] != s[i], need to bypass the pattern, just use the reserved
                    else:
                        break
                elif p[j] in (s[i], '.'): # match one
                    i += 1
                    j += 1
                else:  
                    break  # p[j] != s[j], and p[j+1] != "*"
            if i == len(s) and j == len(p):
                return True
            elif i == len(s) and j < len(p): # 这个不知道如何处理才好，当前的j有时候没被消耗，有时候又被消耗过
                print('wtf', i, j, reserved)
                if (len(p) - j) % 2 != 0:
                    continue
                # 还是没法判断成功
                # "a"， ".*..a*" 在（1, 0)的时候
                for x in range(1, len(p) - j, 2): 
                    if p[j + x] != '*':
                        continue
                return True
                # print("wtf",i, j, reserved)
                # for x, y in reserved[::-1]:
                #     if self.isMatch(s[x:], p[y:]):
                #         return True
                # return False
        return False

class Solution(object):
    def isMatch(self, s, p):
        """
        :type s: str
        :type p: str
        :rtype: bool
        
        特殊情况: .*, 还有a*， 
        遇到*号往回溯即可，但问题来了，如果.*A遇到BAA的时候，到底什么时候走到A的位置？
        直接蒙蔽。

        挂在这个例子上:
        "aaa"
        "ab*a*c*a"

        但s被消耗完，p没消耗完，需要考察p剩下的是否不影响，太tmd难了

        还有这个例子
        "a"
        "ab*"

        还是挂在这个上面，连续提交了8次了
        "ab"
        ".*c"
        """

        # 每次遇到 *号有两种可能，要么完全跳过，要么消耗一个string长度
        # 由于*号总会影响到前面的判断，因此最好每次找两个，遇到星号一并处理了
        reserved = [(0, 0)]
        i = 0
        j = 0
        k = 0
        failed = set()
        while len(reserved) != 0:
            i, j = reserved.pop()
            while i <= len(s) - 1 and j <= len(p) - 1:
                if (i, j) in failed:
                    k += 1
                    print(k)
                    break
                failed.add((i, j))
                
                # print(i, j, reserved)
                if p[j] == ".":
                    i += 1
                    j += 1
                # 重点是遇到*, 代表两种可能，要么j往前走一步，要么不走。先选择不走，而走的那一步保存在状态中，留待取出。
                # *不允许出现在第一位
                # 如果 x*不匹配，则j跳到下一步
                elif p[j] == "*":  
                    if j == 0:
                        break
                    reserved.append([i, j + 1]) # 遇到*的时候，可以直接跳过，消耗一个p
                    if p[j-1]  in (".", s[i]):
                        reserved.append([i + 1, j + 1])   # 可以消耗一个s，同时消耗一个p
                        i += 1  # 可以消耗一个s，但是不消耗当前的p
                    else:  # p[j-1] != s[i], move j to next point
                        j += 1
                        continue
                elif p[j] == s[i]: 
                    i += 1
                    j += 1
                # 如果某个字母不匹配，需要考虑下一个字母是否为"*"
                # condition like "ca" vs "b*ca"
                elif p[j] != s[i]:
                    if j + 1 < len(p) and p[j + 1] == "*":
                        j += 2
                else:
                    break
                        # print(i, j)
                    
            # print(i, j)
            if i == len(s)  and j == len(p):
                return True
            # 还需要考虑特殊情况，最后一位为"*"则不占用位置
            # 但是还要考虑到p的后面为a*b*c*之类的情况，f**k
            # 还有比如 s = "a"  p = "ab*"， 但i为1的时候，p为1
            elif i == len(s) and j != len(p):
                print("wtf", j)
                if self.isMatch("", p[j + 1:]):
                    return True
        return False