
Python: Reading and Writing Line by Line to Improve Program Performance

2022/1/14 8:25:23 · python · Source: www.jb51.cc/python

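Overview

My laptop has 4 GB of RAM, with roughly 40% already in use. Before leaving last night I started a program to process 300 MB of data. When I came back the next day it still had not finished, and that is when I realized how serious the problem was.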

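The problematic code was as follows:

Problems with the code:

(1) Coding issue: fd.readlines() reads the entire file into memory at once, which takes up space.

(2) Logic issue: too much is processed in a single pass. The code looks concise, but it actually nests a large number of loops and wastes resources.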
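To make issue (1) concrete, here is a minimal sketch that compares peak Python-level memory when a file is loaded with readlines() versus iterated line by line. The helper names (peak_bytes, count_with_readlines, count_with_iterator, make_sample_file) and the generated sample file are illustrative assumptions, not part of the original program; tracemalloc only tracks allocations made by Python itself, so treat the numbers as a rough indication.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(read_func, path):
    """Return the peak traced memory (in bytes) while read_func(path) runs."""
    tracemalloc.start()
    try:
        read_func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def count_with_readlines(path):
    # readlines() builds a list that holds every line of the file at once.
    with open(path, "r") as fd:
        lines = fd.readlines()
    return len(lines)


def count_with_iterator(path):
    # Iterating over the file object yields one line at a time.
    count = 0
    with open(path, "r") as fd:
        for _ in fd:
            count += 1
    return count


def make_sample_file(n_lines=50000):
    """Write a hypothetical uid,action,subact file and return its path."""
    handle, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(handle, "w") as fw:
        for i in range(n_lines):
            fw.write("%d,117,2\n" % (i // 10))
    return path


if __name__ == "__main__":
    sample = make_sample_file()
    print("readlines peak:", peak_bytes(count_with_readlines, sample))
    print("iterator  peak:", peak_bytes(count_with_iterator, sample))
    os.remove(sample)
```

With readlines() the peak grows with the size of the file, while the iterator version stays roughly constant regardless of how many lines there are.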

Memory usage was as follows (the program stayed stuck in the loop and could not exit, running for 12+ hours):

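Solution

(1) Use the file iterator to read and process the file line by line.

(2) Adjust the logic: process each user as soon as that user's records have been read sequentially. (Previously the code searched all over the data for each user's records, wasting the spatial locality the data already had.)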
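The two points above can also be sketched with the standard library alone. This is an alternative illustration rather than the author's implementation: it assumes the input rows are uid,action,subact and that all rows for one user are adjacent, and it swaps in itertools.groupby and collections.Counter for the counting step. The function name iter_top_actions and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter
from itertools import groupby
from operator import itemgetter


def iter_top_actions(rows, top_n=3):
    """Yield (uid, [(action_key, count), ...]) for each run of consecutive
    rows that share a uid. Assumes rows are already grouped by uid."""
    for uid, user_rows in groupby(rows, key=itemgetter(0)):
        # Only one user's rows are consumed and counted at a time.
        counts = Counter(action + "," + subact for _, action, subact in user_rows)
        yield uid, counts.most_common(top_n)


# Hypothetical sample input standing in for the real file.
sample = io.StringIO(
    "u1,117,2\n"
    "u1,117,2\n"
    "u1,118,1\n"
    "u2,120,3\n"
)
for uid, top in iter_top_actions(csv.reader(sample)):
    print(uid, top)
```

Because groupby only looks at consecutive rows, memory use is bounded by the largest single user rather than by the whole file. If the same uid could appear in several separate runs, the input would need to be sorted first.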

The code is as follows:

import pickle

import pandas as pd

file_sim = "a.csv"
file_feature = "a.feature"
file_pickle = "a.pickle"

def processUser(temp_user_lis, user_browse_dict, fw):
    # Count how often each "action,subact" pair occurs for one user
    uid = temp_user_lis[0][0]
    df = pd.DataFrame(temp_user_lis, columns=["uid", "actions"])
    temp_browse_dict = {k: len(v) for k, v in df.groupby("actions")}
    # Keep the 3 most frequent pairs, e.g. ("117,2", 8); sort on the integer
    # count, not on its string form, so that 10 ranks above 9
    temp_vlist = sorted(temp_browse_dict.items(), key=lambda kv: kv[1], reverse=True)[:3]
    user_browse_dict[uid] = temp_vlist
    fw.write(uid + "," + ",".join(k + "," + str(n) for k, n in temp_vlist) + "\n")

def getTopModes():
    user_browse_dict = dict()
    temp_user_lis = []
    with open(file_sim, "r") as fd, open(file_feature, "w") as fw:
        # Read line by line and process one user at a time
        # (rows for the same uid are expected to be adjacent in the file)
        print("Count begin")
        for line in fd:
            uid, action, subact = line.strip().split(",")
            if temp_user_lis and uid != temp_user_lis[-1][0]:
                # A new uid starts: process the previous user's records,
                # then clear the buffer for the next user
                processUser(temp_user_lis, user_browse_dict, fw)
                temp_user_lis = []
            temp_user_lis.append([uid, action + "," + subact])
        # Process the last user's records
        if temp_user_lis:
            processUser(temp_user_lis, user_browse_dict, fw)
    # Dump the data
    print("Dump begin!")
    with open(file_pickle, "wb") as fp:
        pickle.dump(user_browse_dict, fp)

getTopModes()

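Memory usage is shown below (46.1 MB; the program has changed from memory-bound to CPU-bound):

The program finished in 361 s in total. Note that the system's baseline memory usage had already reached 46%.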

