A Python-Based News Crawler



Author: mmseoamin    Date: 2024-02-22

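Our task here is to scrape news content from a given website and save it neatly to local disk. Specifically, we head over to a section of Guangming Online (光明网, gmw.cn), look through the news listed there, and save the stories one by one.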

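First we need a URL, which is where we're headed. Then we use the handy requests library to send a GET request to that URL, which is like telling the site, "Hey, send me your content."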
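To make the request step concrete, here is a minimal sketch that assumes the listing URL used by the script later in this post. The helper name `fetch_page`, the optional `session` argument, and the 10-second timeout are illustrative choices of mine, not something the site requires.

```python
import requests

# Listing page used by the script later in this post
LIST_URL = "https://politics.gmw.cn/node_9844.htm"


def fetch_page(url, session=None, timeout=10):
    """Fetch a page and return its HTML as text (hypothetical helper)."""
    # Anything with a requests-style .get() can be passed in as `session`;
    # by default the plain requests module is used.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise on 4xx/5xx so an error page is not parsed as if it were news
    response.raise_for_status()
    # Assumption: the page is UTF-8, as the script in this post also assumes
    response.encoding = "utf-8"
    return response.text
```

Calling `fetch_page(LIST_URL)` should return the listing page's HTML as a string, ready to hand to the parser.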

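Next, we parse the page with the lxml library. It's like being handed a book: we need to know where the table of contents is and where the body text is before we can find what we need.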
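Here is a small sketch of the parsing step. It runs on a made-up HTML snippet, not the real gmw.cn markup, and just shows how `html.fromstring` builds a tree that XPath can then query.

```python
from lxml import html

# Made-up HTML standing in for a downloaded page; the real site's markup differs
SAMPLE_PAGE = """
<html><body>
  <div class="nav"><a href="/">Home</a></div>
  <div class="main">
    <h1> Sample headline </h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE_PAGE)

# xpath() returns a list, even when only one node matches
headline = tree.xpath('//div[@class="main"]/h1/text()')
paragraphs = tree.xpath('//div[@class="main"]/p/text()')

print(headline[0].strip())
print(paragraphs)
```

Because `xpath()` hands back a list, it is worth checking that the list is non-empty before indexing into it on a real page.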

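Our goal is to grab the news links on the page, which sit inside a series of ul and li tags. So we go through the ul elements one by one; each ul holds a bunch of li elements, and each li holds one of the news links we're after.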
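The ul/li walk can be sketched on a made-up snippet like the one below. Note that it swaps in a relative XPath over all matching lists instead of an absolute path with index counters, and the `news-list` class name is hypothetical.

```python
from lxml import html

# Made-up listing markup; the class name "news-list" is hypothetical
SAMPLE_LISTING = """
<div>
  <ul class="news-list">
    <li><a href="2023-06/28/content_1.htm">Story one</a></li>
    <li><a href="2023-06/28/content_2.htm">Story two</a></li>
  </ul>
  <ul class="news-list">
    <li><a href="https://example.com/content_3.htm">Story three</a></li>
    <li>No link in this item</li>
  </ul>
</div>
"""


def extract_links(page_html):
    """Return the href of every link found in the ul/li lists, in page order."""
    tree = html.fromstring(page_html)
    links = []
    # Outer loop: each ul; inner loop: each li inside that ul
    for ul in tree.xpath('//ul[@class="news-list"]'):
        for li in ul.xpath('./li'):
            hrefs = li.xpath('./a/@href')
            if hrefs:  # skip list items that carry no link
                links.append(hrefs[0])
    return links


print(extract_links(SAMPLE_LISTING))
```

A relative XPath tends to survive small layout changes better than a long absolute path, at the cost of needing some stable attribute to match on.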

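Once we have a link, we use requests again to visit it and pull down the full story. We want both the title and the body; we tidy them up and save each story as its own txt file, numbered in the order we scraped it, which keeps things easy to manage.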
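The saving step can be sketched with the standard library alone. The helper name `save_article` and its parameters are illustrative assumptions; the title-blank-line-body layout and the two-digit numbering follow the script in this post.

```python
from pathlib import Path


def save_article(out_dir, file_num, title, body):
    """Write one article to a zero-padded, numbered .txt file and return its path."""
    out_path = Path(out_dir)
    # Create the folder if it is missing; writing into a missing folder would fail
    out_path.mkdir(parents=True, exist_ok=True)
    # zfill(2) gives 01.txt, 02.txt, ... so the files sort in scraping order
    target = out_path / f"{str(file_num).zfill(2)}.txt"
    # Title first, then a blank line, then the body
    target.write_text(title + "\n\n" + body, encoding="utf-8")
    return target
```

For example, `save_article("./txt", 1, "Some title", "Some body")` would write `./txt/01.txt`. With more than 99 stories, a wider padding such as `zfill(3)` would keep the names sorting correctly.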

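Along the way, keep in mind that some links on the page are complete URLs while others are only relative paths (just the trailing part), so we need to handle both to make sure we actually reach each story's detail page. After that, we extract the title and body, strip the surrounding whitespace, and write them neatly to the file.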
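Both the link handling and the whitespace cleanup can be sketched with the standard library. This version uses `urllib.parse.urljoin` to resolve links, and the helper names `absolute_url` and `clean_paragraphs` are my own.

```python
from urllib.parse import urljoin

BASE_URL = "https://politics.gmw.cn/"  # same base URL as the script in this post


def absolute_url(href, base=BASE_URL):
    """Turn a relative href into a full URL; full URLs pass through unchanged."""
    return urljoin(base, href)


def clean_paragraphs(texts):
    """Strip each text fragment, drop the empty ones, join the rest with newlines."""
    stripped = (text.strip() for text in texts)
    return "\n".join(text for text in stripped if text)


print(absolute_url("2023-06/28/content_36660331.htm"))
print(absolute_url("https://politics.gmw.cn/2023-06/27/content_36657735.htm"))
print(clean_paragraphs(["  First paragraph. ", "\n", "\u3000Second paragraph.\n"]))
```

`str.strip()` also removes the full-width space (`\u3000`) that Chinese pages often use for paragraph indentation, so no special handling is needed for it.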

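With that in place, running the code automatically saves the site's news stories to local disk one by one. It saves time and effort, and we can go back over the collected news whenever we like.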

If you need further processing or additions later on, feel free to message me privately.

import os
from urllib.parse import urljoin

import requests
from lxml import html

# URL of the target site
base_url = "https://politics.gmw.cn/"
url = base_url + "node_9844.htm"

# Create the output folder up front; open() does not create missing directories
os.makedirs('./txt', exist_ok=True)

# Send a GET request to the target site with requests
response = requests.get(url, timeout=10)
response.encoding = 'utf-8'  # try decoding as UTF-8

# Parse the HTML content
tree = html.fromstring(response.text)  # use text rather than content

# File number
file_num = 1

# Loop over ul[1] through ul[10]
for ul_index in range(1, 11):
    # Loop over the li tags in each ul, starting at li[1]; leave the loop when no li is found
    li_index = 1
    while True:
        try:
            # Build the XPath
            xpath = f'/html/body/div[6]/div[1]/div[2]/ul[{ul_index}]/li[{li_index}]/a'

            # Look up the a tag with the XPath
            a_tag = tree.xpath(xpath)

            # If an a tag was found
            if a_tag:
                # Read the a tag's href attribute, i.e. the URL
                href = a_tag[0].get('href')
                if not href:
                    # Skip anchors that have no href and move on to the next li
                    li_index += 1
                    continue
                # urljoin leaves complete URLs alone and resolves relative ones
                sub_url = urljoin(url, href)
                print("Sub-URL:", sub_url)

                # Fetch the detail page
                sub_response = requests.get(sub_url, timeout=10)
                sub_response.encoding = 'utf-8'  # try decoding as UTF-8
                sub_tree = html.fromstring(sub_response.text)  # use text rather than content

                # Extract the title
                title = sub_tree.xpath('/html/body/div[6]/div[1]/h1/text()')
                title = title[0].strip() if title else ''  # strip surrounding whitespace

                # Extract the body: strip each paragraph and drop empty lines
                contents = sub_tree.xpath('//*[@id="article_inbox"]/div[5]/p/text()')
                contents = '\n'.join(content.strip() for content in contents if content.strip())

                # Write to file
                with open(f'./txt/{str(file_num).zfill(2)}.txt', 'w', encoding='utf-8', errors='ignore') as f:
                    f.write(title + '\n\n' + contents)

                # Advance the file number
                file_num += 1
            else:
                # No a tag found, so this ul is finished
                break

            # Move on to the next li tag
            li_index += 1
        except Exception as e:
            print(f"Error while processing XPath {xpath}: {e}")
            break

The output looks like this:

Sub-URL: https://politics.gmw.cn/2023-06/28/content_36660331.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36660279.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36660246.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36660217.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36660215.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36660103.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36659630.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36659390.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36659337.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36659325.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36659297.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36659135.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658702.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658613.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658674.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658631.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658595.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658527.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658463.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658416.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658377.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658411.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658401.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658372.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658356.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36657735.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36657732.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36657622.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36657620.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36657627.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658305.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36657625.htm
Sub-URL: https://politics.gmw.cn/2023-06/28/content_36658293.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36657544.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36657204.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36657203.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36657192.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655447.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655793.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655772.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655744.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655734.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655703.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655712.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655729.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655735.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655693.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655613.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655425.htm
Sub-URL: https://politics.gmw.cn/2023-06/27/content_36655404.htm
