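If the Douban movie scraper from the previous post left you a little lost, today's post covers a simpler case: collecting data from a dynamically loaded HTML page.
Today we will scrape the job listings on Tencent Careers.
Link: https://careers.tencent.com/search.htm
1. Page analysis
First, open the link, as shown in the figure below:
Viewing the page source shows that the job listings are not part of the static HTML, so we can tentatively conclude that the page is loaded dynamically.
That makes the direction clear: we only need to find the API endpoint that serves the data. Open the browser's developer tools and search the network requests to locate it.
That completes the analysis. The next step is to try fetching the full response from that endpoint, because nothing else can proceed until we have the data.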
# encoding: utf-8
'''
@author 李华鑫
@create 2020-10-19 12:28
Mycsdn:https://buwenbuhuo.blog.csdn.net/
@contact: 459804692@qq.com
@software: Pycharm
@file: test.py
@Version:1.0
'''
import requests

url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1603079775850&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
}

def parse_json(url, params=None):
    """Request the URL and return the JSON response as a dict."""
    # params defaults to None rather than {} to avoid a shared mutable default
    response = requests.get(url=url, headers=headers, params=params)
    return response.json()

content = parse_json(url)
print(content)
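Good, it looks like we fetched the data successfully.
Next, let's analyze how the URLs of successive pages relate to each other.
First, look at a few of them: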
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1603079775850&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1603079981956&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=2&pageSize=10&language=zh-cn&area=cn
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1603079981956&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=3&pageSize=10&language=zh-cn&area=cn
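Comparing them shows that, apart from timestamp (which appears to simply record when the request was sent), pageIndex is the only parameter that changes. We can now verify that guess.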
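The same comparison can be done programmatically rather than by eye. The following is a minimal sketch using only the standard library; the helper names query_dict and changed_keys are hypothetical and introduced here for illustration, and the three URLs are the ones listed above, rebuilt from shared pieces to keep the lines short.

```python
from urllib.parse import urlsplit, parse_qs

BASE = "https://careers.tencent.com/tencentcareer/api/post/Query"
COMMON = ("&countryId=&cityId=&bgIds=&productId=&categoryId="
          "&parentCategoryId=&attrId=&keyword=")
TAIL = "&pageSize=10&language=zh-cn&area=cn"

# The three URLs from the article, assembled from their common parts.
urls = [
    f"{BASE}?timestamp=1603079775850{COMMON}&pageIndex=1{TAIL}",
    f"{BASE}?timestamp=1603079981956{COMMON}&pageIndex=2{TAIL}",
    f"{BASE}?timestamp=1603079981956{COMMON}&pageIndex=3{TAIL}",
]


def query_dict(url):
    """Return the query string of a URL as a flat {name: value} dict."""
    # keep_blank_values=True keeps empty parameters such as countryId=
    parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {name: values[0] for name, values in parsed.items()}


def changed_keys(urls):
    """Return the sorted parameter names whose values differ across URLs."""
    dicts = [query_dict(u) for u in urls]
    names = set().union(*dicts)
    return sorted(n for n in names if len({d.get(n) for d in dicts}) > 1)


print(changed_keys(urls))  # ['pageIndex', 'timestamp']
```

Only pageIndex and timestamp differ, which matches the manual comparison.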
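The comparison confirms the guess. Next, we only need to find out how many pages there are in total.
Once the page count is known, crawling every page only takes a loop that sets pageIndex for each request:
for i in range(1, 635):  # visits pages 1 through 634
    params["pageIndex"] = i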
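Instead of hard-coding the upper bound, the page count could be derived from the first response. The sketch below assumes the JSON carries the total number of postings in a Data.Count field; that field name is an assumption not confirmed by this article, so check the real response before relying on it. The helper total_pages is hypothetical, and the sample payload is made up purely to show the arithmetic.

```python
import math


def total_pages(response, page_size=10):
    """Compute how many pages are needed to cover every posting.

    Assumes response["Data"]["Count"] holds the total number of postings
    (an unverified assumption about the API's response shape).
    """
    count = response["Data"]["Count"]
    return math.ceil(count / page_size)


# Made-up sample payload, for illustration only.
sample = {"Data": {"Count": 6338, "Posts": []}}
pages = total_pages(sample)
print(pages)  # 634

# The crawl loop would then visit pages 1 through `pages` inclusive.
page_numbers = range(1, pages + 1)
```

Rounding up with math.ceil ensures the last, partially filled page is still requested.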
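2. Implementation
2.1 Building the URL
From the analysis above, we know that only pageIndex needs to change from one request to the next.
First, copy the query parameters out of the developer tools,
then convert them into a dictionary: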
from pprint import pprint

xx = """timestamp: 1603081773061
countryId:
cityId:
bgIds:
productId:
categoryId:
parentCategoryId:
attrId:
keyword:
pageIndex: 2
pageSize: 10
language: zh-cn
area: cn"""

params = {}
for x in xx.splitlines():
    # Split on the first colon only, and strip whitespace so the values
    # come out as "cn" rather than " cn".
    key, _, value = x.partition(":")
    params[key.strip()] = value.strip()

pprint(params)
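Now let's implement this part in code:
import requests

url = "https://careers.tencent.com/tencentcareer/api/post/Query"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
}
# Values are stored without the leading spaces left over from copying.
params = {'area': 'cn',
          'attrId': '',
          'bgIds': '',
          'categoryId': '',
          'cityId': '',
          'countryId': '',
          'keyword': '',
          'language': 'zh-cn',
          'pageIndex': '1',
          'pageSize': '10',
          'parentCategoryId': '',
          'productId': '',
          'timestamp': '1602211262824'}

def parse_json(url, params=None):
    """Request the URL and return the JSON response as a dict."""
    response = requests.get(url=url, headers=headers, params=params)
    return response.json()

def start():
    # Walk through pages 1 to 634, updating pageIndex each time.
    for i in range(1, 635):
        params["pageIndex"] = i

if __name__ == '__main__':
    start()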
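The timestamp value above is a fixed number copied from the browser. It looks like a Unix time in milliseconds, although the article does not say so, and that reading is an assumption. Under that assumption, the sketch below builds a fresh parameter dict for each page instead of mutating a shared one. BASE_PARAMS and build_params are hypothetical names introduced for illustration.

```python
import time

# Parameters that stay the same for every request (taken from the article).
BASE_PARAMS = {
    "area": "cn",
    "attrId": "",
    "bgIds": "",
    "categoryId": "",
    "cityId": "",
    "countryId": "",
    "keyword": "",
    "language": "zh-cn",
    "parentCategoryId": "",
    "productId": "",
}


def build_params(page_index, page_size=10):
    """Return a new params dict for one page, leaving BASE_PARAMS untouched."""
    params = dict(BASE_PARAMS)
    params["pageIndex"] = page_index
    params["pageSize"] = page_size
    # Assumption: timestamp is the current Unix time in milliseconds.
    params["timestamp"] = int(time.time() * 1000)
    return params


page_three = build_params(3)
print(page_three["pageIndex"], page_three["pageSize"])
```

Returning a copy keeps each request's parameters independent, which makes the loop easier to reason about than updating one module-level dict in place.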
2.2 Extracting the data
This part was covered in an earlier post, so only the code is given here:
def get_position(data):
    """Extract the job posting fields from one page of results."""
    item = {
        "postion_name": "",            # job title
        "postion_department": "",      # department
        "postion_location": "",        # location
        "postion_country": "",         # country
        "postion_category": "",        # job category
        "postion_responsibility": "",  # responsibilities
        "postion_url": "",             # URL of the posting
    }
    data_list = data["Data"]["Posts"]
    for post in data_list:
        # Map each field of the API response onto our own key names.
        item["postion_name"] = post["RecruitPostName"]
        item["postion_department"] = post["BGName"]
        item["postion_location"] = post["LocationName"]
        item["postion_country"] = post["CountryName"]
        item["postion_category"] = post["CategoryName"]
        item["postion_responsibility"] = post["Responsibility"]
        item["postion_url"] = post["PostURL"]
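The function above refills a single dict on every iteration, which works when each item is saved immediately. As an alternative sketch, the version below returns a list with one new dict per posting, and falls back to an empty string when a field is missing. FIELD_MAP and extract_positions are hypothetical names, the field names come from the code above, and the sample payload is made up for illustration.

```python
# Our key name -> field name in the API response (as used in the article).
FIELD_MAP = {
    "postion_name": "RecruitPostName",
    "postion_department": "BGName",
    "postion_location": "LocationName",
    "postion_country": "CountryName",
    "postion_category": "CategoryName",
    "postion_responsibility": "Responsibility",
    "postion_url": "PostURL",
}


def extract_positions(data):
    """Return a list of dicts, one per posting on the page."""
    # Tolerate a missing or null "Data" / "Posts" entry by using an empty list.
    posts = (data.get("Data") or {}).get("Posts") or []
    return [
        {key: post.get(source, "") for key, source in FIELD_MAP.items()}
        for post in posts
    ]


# Made-up sample payload, for illustration only.
sample_page = {
    "Data": {
        "Posts": [
            {"RecruitPostName": "Backend Engineer", "LocationName": "Shenzhen"},
            {"RecruitPostName": "Data Analyst", "CountryName": "China"},
        ]
    }
}
positions = extract_positions(sample_page)
print(len(positions), positions[0]["postion_name"])
```

Because each posting gets its own dict, the returned list can be kept around or written out in one batch without earlier rows being overwritten.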
3. Complete code
# encoding: utf-8
'''
@author 李华鑫
@create 2020-10-09 9:38
Mycsdn:https://buwenbuhuo.blog.csdn.net/
@contact: 459804692@qq.com
@software: Pycharm
@file: 腾讯招聘.py
@Version:1.0
'''
import csv

import requests

url = "https://careers.tencent.com/tencentcareer/api/post/Query"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
}
# Values are stored without the leading spaces left over from copying.
params = {'area': 'cn',
          'attrId': '',
          'bgIds': '',
          'categoryId': '',
          'cityId': '',
          'countryId': '',
          'keyword': '',
          'language': 'zh-cn',
          'pageIndex': '1',
          'pageSize': '10',
          'parentCategoryId': '',
          'productId': '',
          'timestamp': '1602211262824'}

def parse_json(url, params=None):
    """Request the URL and return the JSON response as a dict."""
    response = requests.get(url=url, headers=headers, params=params)
    return response.json()

def get_position(data):
    """Extract every job posting on one page and save each one."""
    item = {
        "postion_name": "",            # job title
        "postion_department": "",      # department
        "postion_location": "",        # location
        "postion_country": "",         # country
        "postion_category": "",        # job category
        "postion_responsibility": "",  # responsibilities
        "postion_url": "",             # URL of the posting
    }
    data_list = data["Data"]["Posts"]
    for post in data_list:
        item["postion_name"] = post["RecruitPostName"]
        item["postion_department"] = post["BGName"]
        item["postion_location"] = post["LocationName"]
        item["postion_country"] = post["CountryName"]
        item["postion_category"] = post["CategoryName"]
        item["postion_responsibility"] = post["Responsibility"]
        item["postion_url"] = post["PostURL"]
        # Save each posting right away, before item is refilled.
        save(item)
        print(item)
    print("Saved")

def save(item):
    """Append one posting to the CSV file."""
    # newline="" stops the csv module from writing blank rows on Windows.
    with open("./腾讯招聘.csv", "a", encoding="utf-8", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(item.values())

def start():
    # Fetch pages 1 to 634 and save the postings on each page.
    for i in range(1, 635):
        params["pageIndex"] = i
        data = parse_json(url, params)
        get_position(data)

if __name__ == '__main__':
    start()
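The CSV written by the script above has no header row, so the column meanings are only implied by the dict's key order. As a variation, the sketch below uses csv.DictWriter from the standard library and writes the header only when the file is new or empty. The helper save_rows and the file name positions.csv are hypothetical, and the rows are made up for illustration; the column names are the ones used in the script.

```python
import csv
import os
import tempfile

FIELDNAMES = [
    "postion_name", "postion_department", "postion_location",
    "postion_country", "postion_category", "postion_responsibility",
    "postion_url",
]


def save_rows(rows, path):
    """Append rows to a CSV file, writing the header only once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" is what the csv module expects for files it writes to.
    with open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        # Keys missing from a row are written as empty strings.
        writer.writerows(rows)


# Demonstration in a temporary directory with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "positions.csv")
    save_rows([{"postion_name": "Backend Engineer"}], path)
    save_rows([{"postion_name": "Data Analyst"}], path)
    with open(path, encoding="utf-8", newline="") as f:
        read_back = list(csv.DictReader(f))

print(len(read_back), read_back[1]["postion_name"])
```

Calling save_rows once per page appends that page's postings while keeping a single header line at the top of the file.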
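4. Saved results
Good times never last long. I would love to keep chatting, but this post ends here. If that was not enough, don't worry: see you in the next one!
Reposted from: CSDN. Author: 不温卜火
Original link: https://buwenbuhuo.blog.csdn.net/article/details/109351951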