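Downloading Sina Blog Articles and Saving Them as Text Files (Python)

Today I wrote a downloader in Python that fetches the articles on Han Han's Sina blog. Its basic features are:

1. Batch-download articles from Sina Blog and create one file per article, named after the article title.

2. Format the downloaded articles: skip lines that carry layout markup, convert full-width punctuation to ASCII, and strip leftover <wbr>, &nbsp;, and <br /> fragments.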
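As a rough illustration of what the formatting in item 2 involves, the sketch below normalizes a single line of article text: it maps full-width punctuation to ASCII with a translation table and strips a few HTML leftovers. The names normalize_line, PUNCT_TABLE, and RESIDUE_RE are hypothetical and introduced only for this example, and the code assumes Python 3. It is a sketch of the idea, not the downloader's exact code.

```python
import re

# Full-width punctuation mapped to ASCII equivalents (hypothetical table name).
PUNCT_TABLE = str.maketrans({
    '，': ',', '：': ':', '！': '!', '（': '(', '）': ')',
    '？': '?', '；': ';', '⋯': '...',
})

# HTML leftovers to drop: <wbr>, &nbsp;, and <br/> with optional whitespace.
# Assumption: <br/> without a space should be stripped as well.
RESIDUE_RE = re.compile(r'<wbr>|&nbsp;|<br\s*/>')


def normalize_line(line):
    """Return line with full-width punctuation converted and HTML residue removed."""
    return RESIDUE_RE.sub('', line.translate(PUNCT_TABLE))


if __name__ == '__main__':
    print(normalize_line('你好，世界！<br />&nbsp;（测试）'))
```

A translation table handles all single-character substitutions in one pass, which may be easier to extend than a chain of separate substitutions.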

Known bug: the formatting of long articles gets garbled.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import re
import urllib.request


def article_format(usock, basedir):
    """Save the article read from usock as a text file named after its title."""
    title_flag = True           # still looking for the <title> line
    context_start_flag = True   # still looking for the "正文开始" (body start) marker
    context_end_flag = True     # still looking for the "正文结束" (body end) marker
    fobj = None
    for line in usock:
        # urlopen yields bytes; the pages are treated as UTF-8
        line = line.decode('utf-8', 'ignore')
        if title_flag:
            title = re.findall(r'<title>(.+?)<', line)
            if title:
                title = title[0]
                filename = os.path.join(basedir, title)
                print(filename)
                try:
                    fobj = open(filename, 'w', encoding='utf-8')
                    fobj.write(title + '\n')
                    title_flag = False
                except IOError as e:
                    print("Open %s error: %s" % (filename, e))
                    return
        elif context_start_flag:
            if re.search(r'<.+?正文开始.+?>', line):
                context_start_flag = False
        elif context_end_flag:
            if re.search(r'<.+?正文结束.+?', line):
                context_end_flag = False
                fobj.write('\nEND')
                break
            # skip lines that carry layout markup
            if 'div' in line or 'span' in line or '<p>' in line:
                continue
            # convert full-width punctuation to ASCII
            line = re.sub('，', ',', line)
            line = re.sub('：', ':', line)
            line = re.sub('！', '!', line)
            line = re.sub('（', '(', line)
            line = re.sub('）', ')', line)
            line = re.sub('⋯', '...', line)
            line = re.sub('？', '?', line)
            line = re.sub('；', ';', line)
            # strip leftover HTML fragments
            line = re.sub(r'<wbr>', '', line)
            line = re.sub(r'&nbsp;', '', line)
            line = re.sub(r'<br\s+?/>', '', line)
            fobj.write(line)
    # close the file even if the end marker was never found
    if fobj is not None:
        fobj.close()


if __name__ == '__main__':
    basedir = '/home/tmyyss/article/'
    if not os.path.exists(basedir):
        os.makedirs(basedir)

    usock = urllib.request.urlopen("http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html")
    context = usock.read().decode('utf-8', 'ignore')
    # each article link on the list page looks like <a title="..." ... href="http...html">
    raw_url_list = re.findall(r'(<a\s+title.+?href="http.+?html)', context)
    for url in raw_url_list:
        url = re.findall('(http.+?html)', url)[0]
        article_usock = urllib.request.urlopen(url)
        article_format(article_usock, basedir)
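The regular expression that pulls article links out of the list page depends on the exact attribute order in Sina's HTML. As a hedged alternative, the sketch below collects the links with the standard library's html.parser.HTMLParser instead. The class ArticleLinkParser, the helper extract_article_links, and the sample HTML are all made up for illustration. It assumes Python 3, and that article anchors carry a title attribute and an href ending in .html, as the script's regular expression implies.

```python
from html.parser import HTMLParser


class ArticleLinkParser(HTMLParser):
    """Collect href values of <a> tags that have a title and point to .html pages."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)
        href = attr_map.get('href') or ''
        # Assumption: article links have a title attribute and end in .html
        if 'title' in attr_map and href.startswith('http') and href.endswith('.html'):
            self.links.append(href)


def extract_article_links(html_text):
    """Return the article URLs found in html_text, in document order."""
    parser = ArticleLinkParser()
    parser.feed(html_text)
    return parser.links


if __name__ == '__main__':
    sample = (
        '<a title="first" target="_blank" href="http://example.com/s/blog_1.html">first</a>'
        '<a href="http://example.com/about.html">no title</a>'
        '<a title="img" href="http://example.com/pic.jpg">img</a>'
    )
    print(extract_article_links(sample))
```

Each returned URL could then be passed to urlopen and on to article_format, in the same way the main block of the script does.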
