Python Web Scraping in Practice | Scraping Southern Weekly (南方周末) News Articles

Author: 机灵鹤
Published: November 17, 2021
Category: Web Scraping in Practice

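A few days ago a follower asked me to scrape the news articles on the Southern Weekly (南方周末) website.

The requirements are not complicated and are similar to those of my earlier People's Daily (人民日报) and Jiefang Daily (解放日报) scrapers.

Without further ado, let's get started.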

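1. Analyzing the site

The Southern Weekly site is at: http://www.infzm.com/contents?term_id=1

[Figure: Southern Weekly home page]

Looking at the home page, we can see a channel list on the left and a news list in the middle.

Clicking through the channels on the left changes the term_id value in the browser's address bar accordingly, which suggests that the term_id parameter is the channel id.

Scrolling down keeps loading more articles while the URL in the address bar stays the same. This indicates that the news list uses infinite scrolling (a waterfall-style feed) and that the data is loaded dynamically via Ajax.

With that quick look done, we open the browser's developer tools and switch to the Network tab to inspect the requests.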

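1.1 Analyzing the news list

[Figure: news list]

As the page scrolls down, new requests keep appearing.

The request URL looks like: http://www.infzm.com/contents?term_id=1&page=2&format=json

The response is shown in the figure:

[Figure: article list data endpoint]

This is the news list data endpoint we are looking for.

Look at the endpoint URL: http://www.infzm.com/contents?term_id=1&page=2&format=json

It takes three parameters: term_id, page and format.

term_id, as analyzed above, is the channel id. The other two can be read literally: page is the page number and format is the response data format.

The response is standard JSON. The article list is located at data -> contents, and each entry includes the article title, article id, author name, publish time and other fields.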
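
To make the data -> contents layout concrete, here is a minimal sketch that walks a payload of that shape. The sample JSON is invented for illustration; only the field names id, subject and publish_time are taken from the parsing code later in this post, and whether the real id is a number or a string is an assumption.

```python
import json

# Hypothetical sample shaped like the list response; the values are invented,
# and real responses carry more fields than the three shown here.
sample = '''
{
  "data": {
    "contents": [
      {"id": 217973, "subject": "Sample headline A", "publish_time": "2021-11-16 10:00:00"},
      {"id": 217974, "subject": "Sample headline B", "publish_time": "2021-11-17 09:30:00"}
    ]
  }
}
'''

def list_articles(text):
    """Return (id, subject, publish_time) tuples from a list response."""
    payload = json.loads(text)
    # .get() with defaults keeps a payload without data/contents from raising KeyError.
    contents = payload.get("data", {}).get("contents", [])
    return [(c["id"], c["subject"], c["publish_time"]) for c in contents]

articles = list_articles(sample)
for pid, subject, publish_time in articles:
    print(pid, publish_time, subject)
```

Returning an empty list for a payload without data -> contents is a design choice of this sketch, so a caller can loop over the result without extra checks.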

1.2 Analyzing the article detail page

Open any article's detail page, for example: http://www.infzm.com/contents/217973 .

The detail page URL is built as http://www.infzm.com/contents/ + article id.
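
Both URL patterns seen so far can be captured in two small helpers. This is only a sketch: the function names list_url and detail_url are made up for illustration, and the URLs are the ones observed above.

```python
from urllib.parse import urlencode

BASE = "http://www.infzm.com/contents"

def list_url(term_id, page):
    """URL of one page of a channel's article list, requested as JSON."""
    query = urlencode({"term_id": term_id, "page": page, "format": "json"})
    return f"{BASE}?{query}"

def detail_url(article_id):
    """URL of an article's detail page."""
    return f"{BASE}/{article_id}"

print(list_url(1, 2))
print(detail_url(217973))
```

Building the query string with urlencode instead of by hand means any value that needs escaping is percent-encoded automatically.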

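Inspecting the page with the developer tools shows that the article body is rendered directly in the HTML source.

[Figure: article body]

As the figure shows, the article content sits in the <div class="nfzm-content__content"> tag. The lead paragraph is in the <blockquote class="nfzm-bq"> tag, and the body text is in the p tags under <div class="nfzm-content__fulltext">.

The page structure looks roughly like this: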

<div class="nfzm-content__content">
    <blockquote class="nfzm-bq">Lead paragraph</blockquote>
    <div class="nfzm-content__fulltext">
        <p>First paragraph</p>
        <p>Second paragraph</p>
        <p>Third paragraph</p>
    </div>
</div>
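
As a quick check of this structure, the sketch below pulls the lead paragraph and the body paragraphs out of the simplified markup. It uses the standard-library html.parser so that it runs without extra dependencies; this is a swapped-in technique for illustration, and the scraper in section 2 uses BeautifulSoup instead. The class name ArticleParser is hypothetical.

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect the lead blockquote and the body paragraphs of an article page."""

    def __init__(self):
        super().__init__()
        self.lead = ""
        self.paragraphs = []
        self._fulltext_depth = 0   # > 0 while inside the fulltext div
        self._target = None        # "lead", "p" or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._fulltext_depth:
            if tag == "div":
                self._fulltext_depth += 1      # track nested divs
            elif tag == "p":
                self._target = "p"
                self.paragraphs.append("")
        elif tag == "div" and "nfzm-content__fulltext" in classes:
            self._fulltext_depth = 1
        elif tag == "blockquote" and "nfzm-bq" in classes:
            self._target = "lead"

    def handle_endtag(self, tag):
        if tag == "div" and self._fulltext_depth:
            self._fulltext_depth -= 1
        elif tag in ("p", "blockquote"):
            self._target = None

    def handle_data(self, data):
        if self._target == "lead":
            self.lead += data
        elif self._target == "p":
            self.paragraphs[-1] += data

html = """
<div class="nfzm-content__content">
    <blockquote class="nfzm-bq">Lead paragraph</blockquote>
    <div class="nfzm-content__fulltext">
        <p>First paragraph</p>
        <p>Second paragraph</p>
    </div>
</div>
"""
parser = ArticleParser()
parser.feed(html)
print(parser.lead)
print(parser.paragraphs)
```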

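1.3 Checking for anti-scraping measures

Let's write a short Python snippet to test how the site responds to a script.

1.3.1 News list

Set some browser-like headers and send the request.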

import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
url = "http://www.infzm.com/contents?term_id=1&page=2&format=json"
r = requests.get(url, headers=headers)
r.encoding = r.apparent_encoding
print(r.text)

The data comes back normally.

[Figure: output of fetching the news list]

1.3.2 Article body

import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
url = "http://www.infzm.com/contents/217973"
r = requests.get(url, headers=headers)
r.encoding = r.apparent_encoding
print(r.text)

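The article body can be fetched successfully as well.

[Figure: output of fetching the article body]

However, not every article can be scraped freely. Some articles show only part of the text, and the full text requires logging in.

[Figure: login required to read the full text]

After registering an account and refreshing the page, I found that reading the full text also requires a paid membership.

[Figure: membership required to read the full text]

I won't subscribe for now.

If you need those articles, you can subscribe yourself, then put the cookies from your logged-in session into the headers in the code and scrape with them.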

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Cookie': "your own cookie"
}

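The Cookie value can be copied from the request headers of any request to the site shown in the developer tools' Network tab.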
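
If you prefer to pass the cookies separately instead of as a raw header, the copied string can be split into a dict first. This is a minimal sketch; the helper name cookie_header_to_dict and the cookie names in the sample string are hypothetical.

```python
def cookie_header_to_dict(raw):
    """Split a raw Cookie header value into a {name: value} dict."""
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # ignore empty or malformed fragments
            cookies[name] = value
    return cookies

# Hypothetical cookie string; copy the real one from your own logged-in session.
raw = "session_id=abc123; user_token=xyz789"
cookies = cookie_header_to_dict(raw)
print(cookies)
```

The resulting dict can then be passed to requests through its cookies argument, for example requests.get(url, headers=headers, cookies=cookies).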

2. Writing the code

Now for the actual code.

First, import the libraries the scraper needs:

import requests
import json
from bs4 import BeautifulSoup
import os

Next, the network request function fetchUrl:

def fetchUrl(url):
    '''
    Purpose: request the page at url and return its content
    Args: url of the target page
    Returns: the response text, or None if the request fails
    '''
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    try:
        # A timeout keeps a stalled connection from hanging the scraper
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as e:
        print(e)
        return None

The news list parser parseNewsList:

def parseNewsList(html):
    '''
    Purpose: parse a news list response and yield its entries one by one
    Args: list data (JSON text)
    Yields: article id, title, publish time
    '''
    try:
        jsObj = json.loads(html)
        contents = jsObj["data"]["contents"]
        for cnt in contents:
            pid = cnt["id"]
            subject = cnt["subject"]
            publish_time = cnt["publish_time"]
            yield pid, subject, publish_time

    except Exception as e:
        print("parseNewsList error!")
        print(e)

The article body parser parseNewsContent:

def parseNewsContent(html):
    '''
    Purpose: parse an article detail page and return the article text
    Args: page source (HTML)
    Returns: the article text as a string, or None if parsing fails
    '''
    try:
        bsObj = BeautifulSoup(html, "html.parser")
        cntDiv = bsObj.find("div", attrs={"class": "nfzm-content__content"})
        blockQuote = cntDiv.find("blockquote", attrs={"class": "nfzm-bq"})
        fulltextDiv = cntDiv.find("div", attrs={"class": "nfzm-content__fulltext"})
        pList = fulltextDiv.find_all("p")
        
        # The lead paragraph is optional; body paragraphs of at most one character are skipped
        ret = blockQuote.text + "\n" if blockQuote else ""
        ret += "\n".join([p.text for p in pList if len(p.text) > 1])
        return ret
        
    except Exception as e:
        print("parseNewsContent error!")
        print(e)
        return None

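The file saving function saveFile:

def saveFile(path, filename, content):
    '''
    Purpose: save the article text content to a local file
    Args: path      output directory
          filename  file name
          content   text to save
    '''
    # Create the directory if it does not exist yet
    os.makedirs(path, exist_ok=True)
    # Write the file
    with open(os.path.join(path, filename), 'w', encoding='utf-8') as f:
        f.write(content)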

The crawl scheduler download_nfzm:

def download_nfzm(termId, page, savePath):
    '''
    Purpose: scrape every article on the given page of channel termId and save them under savePath
    Args: termId    channel id
          page      page number
          savePath  output directory
    '''
    url = f"http://www.infzm.com/contents?term_id={termId}&page={page}&format=json"
    html = fetchUrl(url)
    try:
        for pid, title, publish_time in parseNewsList(html):
            print(pid, publish_time, title)
            pLink = f"http://www.infzm.com/contents/{pid}"
            content = parseNewsContent(fetchUrl(pLink))
            # Skip articles whose page could not be fetched or parsed
            if content is None:
                continue
            content = title + "\n\n" + publish_time + "\n\n" + content
            date = publish_time.split(" ")[0]
            filename = f"{date}-{pid}.txt"
            
            saveFile(savePath, filename, content)
            
    except Exception as e:
        print("download_nfzm Error")
        print(e)

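Finally, the main function, which starts the scraper.

if __name__ == '__main__':
    # Entry point: scrape pages beginPage through endPage of one channel
    beginPage = 1
    endPage = 10
    term_id = 1

    for page in range(beginPage, endPage + 1):
        download_nfzm(term_id, page, 'infzm_News/')
    
    print("Crawl finished")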

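3. Filtering by keyword

Some readers may want to scrape only the articles that match a keyword rather than everything.

So I tried the site's search feature.

[Figure: site search]

I captured the requests on the search results page in the same way as before.

[Figure: keyword search results]

It turns out that the site's keyword search takes the same parameters as the list endpoint plus a new parameter k. Note that the captured request goes to the /search path:

http://www.infzm.com/search?term_id=&page=2&k=%E7%BB%8F%E6%B5%8E&format=json

Here %E7%BB%8F%E6%B5%8E is the URL-encoded form of the keyword 经济 ("economy").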
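
The encoding can be reproduced with the standard library. A minimal sketch: quote percent-encodes the UTF-8 bytes of the keyword, and unquote reverses it.

```python
from urllib.parse import quote, unquote

keyword = "经济"
encoded = quote(keyword)   # percent-encodes the UTF-8 bytes of the keyword
print(encoded)             # %E7%BB%8F%E6%B5%8E
print(unquote(encoded))    # 经济
```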

So, building on the earlier code, a small change to the main function and to download_nfzm turns the plain news scraper into one with keyword filtering.

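from urllib.parse import quote

def download_nfzm(termId, page, kw, savePath):
    '''
    Purpose: scrape the articles matching keyword kw on the given page of channel termId and save them under savePath
    Args: termId    channel id
          page      page number
          kw        search keyword
          savePath  output directory
    '''
    # Same parameters as the list endpoint plus k, sent to the /search path seen in the
    # captured request (term_id was empty there); the keyword is URL-encoded with quote
    url = f"http://www.infzm.com/search?term_id={termId}&page={page}&k={quote(kw)}&format=json"
    html = fetchUrl(url)
    try:
        for pid, title, publish_time in parseNewsList(html):
            print(pid, publish_time, title)
            pLink = f"http://www.infzm.com/contents/{pid}"
            content = parseNewsContent(fetchUrl(pLink))
            # Skip articles whose page could not be fetched or parsed
            if content is None:
                continue
            content = title + "\n\n" + publish_time + "\n\n" + content
            date = publish_time.split(" ")[0]
            filename = f"{date}-{pid}.txt"
            
            saveFile(savePath, filename, content)
            
    except Exception as e:
        print("download_nfzm Error")
        print(e)

if __name__ == '__main__':
    # Entry point: scrape pages beginPage through endPage for one keyword
    beginPage = 1
    endPage = 10
    term_id = 1
    kw = "经济"

    for page in range(beginPage, endPage + 1):
        download_nfzm(term_id, page, kw, 'infzm_News/')
    
    print("Crawl finished")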

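4. Results

Run the code and scrape the first 10 pages as a test.

[Figure: console output]

[Figure: saved article files]

[Figure: scraped article content]

If anything in this post is unclear or wrong, corrections are welcome in the comments, and I'm happy to discuss so we can learn and improve together.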


Last modified: November 19, 2021, 02:44 PM


