python：正则表达式匹配字符串的数字范围

[假设您需要这个，因为它是一些需要regexp的奇怪的第三方系统]

我对弗雷德里克的评论的思考越多，我就越同意。即使输入字符串很长，regexp引擎也应该能够将其编译为紧凑的DFA。在许多情况下，以下是一个明智的解决方案：

import re

def regexp(lo, hi):
    fmt = '%%0%dd' % len(str(hi))
    return re.compile('(%s)' % '|'.join(fmt % i for i in range(lo, hi+1)))

（它适用于以下测试中的所有数值范围，包括99519000-99519099。粗略的信封计算表明9位数字约为1GB内存的限制。匹配；如果只有几个匹配，则可以更大。）

[再次更新以提供更短的结果-除了合并偶尔出现\d\d的效果与手工生成的效果差不多之外]

假设所有数字的长度相同（即，如有必要，则在左侧使用零填充），这可以实现：

import re

def alt(*args):
    '''format regexp alternatives'''
    if len(args) == 1: return args[0]
    else: return '(%s)' % '|'.join(args)

def replace(s, c): 
     '''replace all characters in a string with a different character'''
    return ''.join(map(lambda x: c, s))

def repeat(s, n):
    '''format a regexp repeat'''
    if n == 0: return ''
    elif n == 1: return s
    else: return '%s{%d}' % (s, n)

def digits(lo, hi): 
    '''format a regexp digit range'''
    if lo == 0 and hi == 9: return r'\d'
    elif lo == hi: return str(lo)
    else: return '[%d-%d]' % (lo, hi)

def trace(f):
    '''for debugging'''
    def wrapped(lo, hi):
        result = f(lo, hi)
        print(lo, hi, result)
        return result
    return wrapped

#@trace  # uncomment to get calls traced to stdout (explains recursion when bug hunting)
def regexp(lo, hi):
    '''generate a regexp that matches integers from lo to hi only.
       assumes that inputs are zero-padded to the length of hi (like phone numbers).
       you probably want to surround with ^ and $ before using.'''

    assert lo <= hi
    assert lo >= 0

    slo, shi = str(lo), str(hi)
    # zero-pad to same length
    while len(slo) < len(shi): slo = '0' + slo
    # first digits and length
    l, h, n = int(slo[0]), int(shi[0]), len(slo)

    if l == h:
        # extract common prefix
        common = ''
        while slo and slo[0] == shi[0]:
            common += slo[0]
            slo, shi = slo[1:], shi[1:]
        if slo: return common + regexp(int(slo), int(shi))
        else: return common

    else:
        # the core of the routine.
        # split into 'complete blocks' like 200-599 and 'edge cases' like 123-199
        # and handle each separately.

        # are these complete blocks?
        xlo = slo[1:] == replace(slo[1:], '0')
        xhi = shi[1:] == replace(shi[1:], '9')

        # edges of possible complete blocks
        mlo = int(slo[0] + replace(slo[1:], '9'))
        mhi = int(shi[0] + replace(shi[1:], '0'))

        if xlo:
            if xhi:
                # complete block on both sides
                # this is where single digits are finally handled, too.
                return digits(l, h) + repeat('\d', n-1)
            else:
                # complete block to mhi, plus extra on hi side
                prefix = '' if l or h-1 else '0'
                return alt(prefix + regexp(lo, mhi-1), regexp(mhi, hi))
        else:
            prefix = '' if l else '0'
            if xhi:
                # complete block on hi side plus extra on lo
                return alt(prefix + regexp(lo, mlo), regexp(mlo+1, hi))
            else:
                # neither side complete, so add extra on both sides
                # (and maybe a complete block in the middle, if room)
                if mlo + 1 == mhi:
                    return alt(prefix + regexp(lo, mlo), regexp(mhi, hi))
                else:
                    return alt(prefix + regexp(lo, mlo), regexp(mlo+1, mhi-1), regexp(mhi, hi))


# test a bunch of different ranges
for (lo, hi) in [(0, 0), (0, 1), (0, 2), (0, 9), (0, 10), (0, 11), (0, 101),
                 (1, 1), (1, 2), (1, 9), (1, 10), (1, 11), (1, 101),
                 (0, 123), (111, 123), (123, 222), (123, 333), (123, 444),
                 (0, 321), (111, 321), (222, 321), (321, 333), (321, 444),
                 (123, 321), (111, 121), (121, 222), (1234, 4321), (0, 999),
                 (99519000, 99519099)]:
    fmt = '%%0%dd' % len(str(hi))
    rx = regexp(lo, hi)
    print('%4s - %-4s  %s' % (fmt % lo, fmt % hi, rx))
    m = re.compile('^%s$' % rx)
    for i in range(0, 1+int(replace(str(hi), '9'))):
        if m.match(fmt % i):
            assert lo <= i <= hi, i
        else:
            assert i < lo or i > hi, i

该函数regexp(lo, hi)会建立一个正则表达式，以匹配介于lo和之间的值hi（零填充到最大长度）。您可能需要在^前后放置一个$字符串（如测试代码中所示），以将匹配项强制为整个字符串。

该算法实际上非常简单- 它以递归方式将事物划分为通用前缀和“完整块”。完整的区块类似于200-599，并且可以可靠地匹配（在这种情况下为[2-5]\d{2}）。

因此123-599被分为123-199和200-599。后半部分是一个完整的块，上半部分有一个公共前缀1和23-99，将其递归处理为23-29（公共前缀）和30-99（完整块）（我们最终终止，因为参数每个通话的时间都比初始输入短）。

唯一令人讨厌的细节是prefix，这是必需的，因为to的参数regexp()是整数，因此在调用生成00-09的regexp时，实际上会生成0-9的regexp，而前导0除外。

输出是一堆测试用例，显示了范围和正则表达式：

   0 - 0     0
   0 - 1     [0-1]
   0 - 2     [0-2]
   0 - 9     \d
  00 - 10    (0\d|10)
  00 - 11    (0\d|1[0-1])
 000 - 101   (0\d\d|10[0-1])
   1 - 1     1
   1 - 2     [1-2]
   1 - 9     [1-9]
  01 - 10    (0[1-9]|10)
  01 - 11    (0[1-9]|1[0-1])
 001 - 101   (0(0[1-9]|[1-9]\d)|10[0-1])
 000 - 123   (0\d\d|1([0-1]\d|2[0-3]))
 111 - 123   1(1[1-9]|2[0-3])
 123 - 222   (1(2[3-9]|[3-9]\d)|2([0-1]\d|2[0-2]))
 123 - 333   (1(2[3-9]|[3-9]\d)|2\d\d|3([0-2]\d|3[0-3]))
 123 - 444   (1(2[3-9]|[3-9]\d)|[2-3]\d{2}|4([0-3]\d|4[0-4]))
 000 - 321   ([0-2]\d{2}|3([0-1]\d|2[0-1]))
 111 - 321   (1(1[1-9]|[2-9]\d)|2\d\d|3([0-1]\d|2[0-1]))
 222 - 321   (2(2[2-9]|[3-9]\d)|3([0-1]\d|2[0-1]))
 321 - 333   3(2[1-9]|3[0-3])
 321 - 444   (3(2[1-9]|[3-9]\d)|4([0-3]\d|4[0-4]))
 123 - 321   (1(2[3-9]|[3-9]\d)|2\d\d|3([0-1]\d|2[0-1]))
 111 - 121   1(1[1-9]|2[0-1])
 121 - 222   (1(2[1-9]|[3-9]\d)|2([0-1]\d|2[0-2]))
1234 - 4321  (1(2(3[4-9]|[4-9]\d)|[3-9]\d{2})|[2-3]\d{3}|4([0-2]\d{2}|3([0-1]\d|2[0-1])))
 000 - 999   \d\d{2}
99519000 - 99519099  995190\d\d

最后的测试循环超过99999999个数字需要一段时间。

表达式应该足够紧凑以避免任何缓冲区限制（我想最坏情况下的内存大小与最大数位的平方成正比）。

PS：我使用的是python 3，但我认为在这里没有太大的区别。

python 2022/1/1 18:29:47 有502人围观

撰写回答

你尚未登录，登录后可以

和开发者交流问题的细节

关注并接收问题和回答的更新提醒

参与内容的编辑和改进，让解决方法与时俱进

请先登录

python：正则表达式匹配字符串的数字范围

撰写回答

推荐问题

尝试使用selenium和python登录网页时出错

从Python访问errno？

gcloud compute copy-files：复制文件时拒绝权限

在服务器上运行selenium浏览器（Flask / Python / Heroku）

ImportError：没有使用Python2的名为mysql.connector的模块

Python：无法在网页中使用selenium下载

带有Selenium的Python“元素未附加到页面文档中”

在Jenkins中设置特定的Python

Python：从文件中选择随机行，然后删除该行

从Python字符串中删除不在允许列表中的HTML标签

从python读取json文件

通过Python3使用Selenium和WebDriver切换选项卡时，“ NoSuchWindowException：没有这样的窗口：窗口已经关闭”

pythonselenium多个测试用例

连接所有PostgreSQL表并创建一个Python字典

带有selenium的Python：无法找到真正存在的元素

列出用户和组的Python脚本

在capybara中选择具有多个类的元素

如何以正确的顺序导入Scrapy项目密钥？

如何确定是否为Selenium + Python加载了某些HTML元素？

使用Java与Python的Selenium Webdriver

分类汇总

您的鼓励是对我最大的支持