Memory usage of lxml when parsing huge XML files in Python

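It looks like you are following some good advice about lxml, particularly etree.iterparse(..), but I think your implementation approaches the problem from the wrong angle. The idea of iterparse(..) is to get away from collecting and storing data and instead to process tags as they are read in. Your readAllChildren(..) function saves everything to rowList, which keeps growing until it covers the whole document tree. I made a few changes to show what is going on: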

from lxml import etree
def parseXml(context,attribList):
    for event, element in context:
        print "%s element: %s" % (event, element)
        fieldMap = {}
        rowList = []
        readAttribs(element, fieldMap, attribList)
        readAllChildren(element, fieldMap, attribList, rowList)
        for row in rowList:
            yield row
        element.clear()

def readAttribs(element, fieldMap, attribList):
    for attrib in attribList:
        fieldMap[attrib] = element.get(attrib,'')
    print "fieldMap:", fieldMap

def readAllChildren(element, fieldMap, attribList, rowList):
    for childElem in element:
        print "Found child:", childElem
        readAttribs(childElem, fieldMap, attribList)
        if len(childElem) > 0:
            readAllChildren(childElem, fieldMap, attribList, rowList)
        rowList.append(fieldMap.copy())
        print "len(rowList) =", len(rowList)
        childElem.clear()

def process_xml_original(xml_file):
    attribList=['name','age','id']
    context=etree.iterparse(xml_file, events=("start",))
    for row in parseXml(context,attribList):
        print "Row:", row

Running it with some dummy data:

>>> from cStringIO import StringIO
>>> test_xml = """\
... <family>
...     <person name="somebody" id="5" />
...     <person age="45" />
...     <person name="Grandma" age="62">
...         <child age="35" id="10" name="Mom">
...             <grandchild age="7 and 3/4" />
...             <grandchild id="12345" />
...         </child>
...     </person>
...     <something-completely-different />
... </family>
... """
>>> process_xml_original(StringIO(test_xml))
start element: <Element family at 0x105ca58>
fieldMap: {'age': '', 'name': '', 'id': ''}
Found child: <Element person at 0x105ca80>
fieldMap: {'age': '', 'name': 'somebody', 'id': '5'}
len(rowList) = 1
Found child: <Element person at 0x105c468>
fieldMap: {'age': '45', 'name': '', 'id': ''}
len(rowList) = 2
Found child: <Element person at 0x105c7b0>
fieldMap: {'age': '62', 'name': 'Grandma', 'id': ''}
Found child: <Element child at 0x106e468>
fieldMap: {'age': '35', 'name': 'Mom', 'id': '10'}
Found child: <Element grandchild at 0x106e148>
fieldMap: {'age': '7 and 3/4', 'name': '', 'id': ''}
len(rowList) = 3
Found child: <Element grandchild at 0x106e490>
fieldMap: {'age': '', 'name': '', 'id': '12345'}
len(rowList) = 4
len(rowList) = 5
len(rowList) = 6
Found child: <Element something-completely-different at 0x106e4b8>
fieldMap: {'age': '', 'name': '', 'id': ''}
len(rowList) = 7
Row: {'age': '', 'name': 'somebody', 'id': '5'}
Row: {'age': '45', 'name': '', 'id': ''}
Row: {'age': '7 and 3/4', 'name': '', 'id': ''}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': '12345'}
Row: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105ca80>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105c468>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element person at 0x105c7b0>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element child at 0x106e468>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element grandchild at 0x106e148>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element grandchild at 0x106e490>
fieldMap: {'age': '', 'name': '', 'id': ''}
start element: <Element something-completely-different at 0x106e4b8>
fieldMap: {'age': '', 'name': '', 'id': ''}

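It is a little hard to read, but you can see that on the first pass it climbs down the whole tree from the root tag, building up rowList for every element in the entire document. You will also notice that it does not even stop there: because the element.clear() call comes after the yield statement in parseXml(..), it is not executed until the second iteration (i.e. the next element in the tree).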
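The point about code placed after a yield can be checked in isolation. The following is a minimal sketch in Python 3 (the listings in this answer are Python 2); rows_then_cleanup and the log list are hypothetical names used only for illustration, with the log entry standing in for element.clear():

```python
def rows_then_cleanup(items, log):
    """Yield each item; log when the post-yield cleanup actually runs."""
    for item in items:
        yield item
        # Stand-in for element.clear(): only reached when the caller resumes us.
        log.append("cleanup after %s" % item)


log = []
rows = rows_then_cleanup(["root", "child"], log)

first = next(rows)
log_after_first = list(log)    # snapshot: nothing has been cleaned up yet

second = next(rows)
log_after_second = list(log)   # the cleanup for "root" ran when we resumed

print(first, log_after_first)
print(second, log_after_second)
```

The cleanup for an item runs only when the consumer asks for the next one, so anything that must be released promptly is better done before the yield.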

An easy fix is to let iterparse(..) do its job: parse iteratively! The following pulls out the same information and processes it incrementally:

def do_something_with_data(data):
    """This just prints it out. Yours will probably be more interesting."""
    print "Got data: ", data

def process_xml_iterative(xml_file):
    # by using the default 'end' event, you start at the _bottom_ of the tree
    ATTRS = ('name', 'age', 'id')
    for event, element in etree.iterparse(xml_file):
        print "%s element: %s" % (event, element)
        data = {}
        for attr in ATTRS:
            data[attr] = element.get(attr, u"")
        do_something_with_data(data)
        element.clear()
        del element # for extra insurance

Running it on the same dummy XML:

>>> print test_xml
<family>
    <person name="somebody" id="5" />
    <person age="45" />
    <person name="Grandma" age="62">
        <child age="35" id="10" name="Mom">
            <grandchild age="7 and 3/4" />
            <grandchild id="12345" />
        </child>
    </person>
    <something-completely-different />
</family>
>>> process_xml_iterative(StringIO(test_xml))
end element: <Element person at 0x105cc10>
Got data:  {'age': u'', 'name': 'somebody', 'id': '5'}
end element: <Element person at 0x106e468>
Got data:  {'age': '45', 'name': u'', 'id': u''}
end element: <Element grandchild at 0x106e148>
Got data:  {'age': '7 and 3/4', 'name': u'', 'id': u''}
end element: <Element grandchild at 0x106e490>
Got data:  {'age': u'', 'name': u'', 'id': '12345'}
end element: <Element child at 0x106e508>
Got data:  {'age': '35', 'name': 'Mom', 'id': '10'}
end element: <Element person at 0x106e530>
Got data:  {'age': '62', 'name': 'Grandma', 'id': u''}
end element: <Element something-completely-different at 0x106e558>
Got data:  {'age': u'', 'name': u'', 'id': u''}
end element: <Element family at 0x105c6e8>
Got data:  {'age': u'', 'name': u'', 'id': u''}

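This should be a big improvement in both speed and memory use for your script. Also, by hooking the 'end' event, you are free to clear and delete elements as you go, rather than waiting until all children have been processed.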
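One caveat that the answer does not cover: element.clear() empties an element, but the emptied element stays attached to its parent, so on a very large file the root can still accumulate one empty child per record. A commonly used lxml idiom is to also delete siblings that have already been processed. The sketch below is written for Python 3 and requires lxml (getparent() and getprevious() are lxml-specific); iter_attribute_rows and SAMPLE are hypothetical names introduced for illustration.

```python
import io

from lxml import etree

ATTRS = ("name", "age", "id")
SAMPLE = b"<family><person name='a'/><person name='b'/><person name='c'/></family>"


def iter_attribute_rows(xml_file):
    """Yield one attribute dict per element, pruning the tree as we go."""
    for _event, element in etree.iterparse(xml_file, events=("end",)):
        row = {attr: element.get(attr, "") for attr in ATTRS}
        # Release memory *before* yielding, so it does not wait on the caller.
        element.clear()
        parent = element.getparent()
        if parent is not None:
            # Drop earlier siblings, which have already been handled.
            while element.getprevious() is not None:
                del parent[0]
        yield row


rows = list(iter_attribute_rows(io.BytesIO(SAMPLE)))
print(rows)
```

Only preceding siblings are removed; the current element and its ancestors are left in place because the parser may still need them.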

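Depending on your data set, it might be a good idea to process only certain types of elements. The root element, for one, probably is not very meaningful, and other nested elements may also produce a lot of {'age': u'', 'id': u'', 'name': u''} rows.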
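As a rough illustration of processing only certain element types, here is a minimal Python 3 sketch that checks the tag inside the loop. The set WANTED_TAGS, the function iter_selected, and the sample document are assumptions made for this example; the stdlib fallback import is there only so the sketch runs where lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:  # fallback for illustration; the stdlib offers a similar iterparse
    import xml.etree.ElementTree as etree

ATTRS = ("name", "age", "id")
WANTED_TAGS = {"person", "child"}  # hypothetical choice of interesting tags
SAMPLE = b"""<family>
    <person name="somebody" id="5" />
    <person age="45" />
    <person name="Grandma" age="62">
        <child age="35" id="10" name="Mom">
            <grandchild age="7 and 3/4" />
            <grandchild id="12345" />
        </child>
    </person>
    <something-completely-different />
</family>"""


def iter_selected(xml_file, wanted_tags=WANTED_TAGS):
    """Yield (tag, attribute dict) for wanted elements only; clear every element."""
    for _event, element in etree.iterparse(xml_file, events=("end",)):
        tag = element.tag
        row = None
        if tag in wanted_tags:
            row = {attr: element.get(attr, "") for attr in ATTRS}
        element.clear()  # clear whether or not the element was wanted
        if row is not None:
            yield tag, row


selected = list(iter_selected(io.BytesIO(SAMPLE)))
for tag, row in selected:
    print(tag, row)
```

lxml's iterparse also accepts a tag keyword argument that restricts events to matching elements, which may be more convenient when only one tag is of interest.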

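As an aside, whenever I read "XML" and "low memory" my thoughts jump straight to SAX, which is another way you could attack this problem. Using the built-in xml.sax module: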

import xml.sax

class AttributeGrabber(xml.sax.handler.ContentHandler):
    """SAX Handler which will store selected attribute values."""
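    def __init__(self, target_attrs=()):
        # initialise the base handler before adding our own state
        xml.sax.handler.ContentHandler.__init__(self)
        self.target_attrs = target_attrs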

    def startElement(self, name, attrs):
        print "Found element: ", name
        data = {}
        for target_attr in self.target_attrs:
            data[target_attr] = attrs.get(target_attr, u"")

        # (no xml trees or elements created at all)
        do_something_with_data(data)

def process_xml_sax(xml_file):
    grabber = AttributeGrabber(target_attrs=('name', 'age', 'id'))
    xml.sax.parse(xml_file, grabber)

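You will have to evaluate both options based on what works best in your situation (and maybe run a couple of benchmarks, if this is something you will be doing often).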
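A benchmark along those lines might look like the following Python 3 sketch. It times both approaches on synthetic data generated in memory; make_xml, count_iterparse, count_sax and timed are hypothetical helpers, and the document size is an arbitrary assumption. It measures wall-clock time only, not memory, and the numbers depend heavily on the real data.

```python
import io
import time
import xml.sax
import xml.sax.handler

try:
    from lxml import etree
except ImportError:  # fallback for illustration only
    import xml.etree.ElementTree as etree

ATTRS = ("name", "age", "id")


def make_xml(n):
    """Build a synthetic document with n <person> elements under one root."""
    people = "".join(
        '<person name="p%d" age="%d" id="%d"/>' % (i, i % 90, i) for i in range(n)
    )
    return ("<family>%s</family>" % people).encode("utf-8")


def count_iterparse(data):
    count = 0
    for _event, element in etree.iterparse(io.BytesIO(data)):
        _row = {a: element.get(a, "") for a in ATTRS}
        count += 1
        element.clear()
    return count


class CountingHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        _row = {a: attrs.get(a, "") for a in ATTRS}
        self.count += 1


def count_sax(data):
    handler = CountingHandler()
    xml.sax.parse(io.BytesIO(data), handler)
    return handler.count


def timed(func, data):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(data)
    return result, time.perf_counter() - start


data = make_xml(10000)
for func in (count_iterparse, count_sax):
    elements, seconds = timed(func, data)
    print("%s: %d elements in %.4f s" % (func.__name__, elements, seconds))
```

Both functions should report the same element count, which is a quick sanity check that the two code paths do comparable work before their timings are compared.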

Be sure to follow up with how things work out!

python 2022/1/1 18:53:08, 398 views