
Hands-On: Build Your Own IP Pool and Scrape Data Without Getting Your IP Banned

User-Agent Pools and IP Proxy Pools

1. User-agent pool

A UA pool, or user-agent pool, works by setting the User-Agent field in the HTTP request headers so that every packet sent to the server looks like it came from a real browser. This disguise is one of the important techniques for getting past anti-crawler checks.

For each page fetched, pick a user-agent at random from a predefined list.
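Outside of Scrapy, the same rotation idea takes only a few lines. A minimal sketch (the two user-agent strings below are illustrative samples, not a recommended list):

```python
import random

# Illustrative user-agent strings; any browser-like values work.
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
]

def random_headers():
    """Build request headers carrying a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENT_LIST)}
```

Passing `random_headers()` to each request gives every fetch a different browser fingerprint.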

Add the following to settings.py:

 DOWNLOADER_MIDDLEWARES = {
     # Disable the built-in user-agent middleware in favour of ours.
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
     'qiubai_proj.middlewares.RotateUserAgentMiddleware': 400,
 }
  

Also add a USER_AGENT_LIST setting in settings.py:

 USER_AGENT_LIST = [
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
      "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
      "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
  ]

In middlewares.py, add the downloader-middleware class RotateUserAgentMiddleware:

 import random

 from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

 from qiubai_proj.settings import USER_AGENT_LIST


 class RotateUserAgentMiddleware(UserAgentMiddleware):
     '''
     User-agent middleware (sits in the downloader-middleware chain).
     '''

     def process_request(self, request, spider):
         user_agent = random.choice(USER_AGENT_LIST)
         if user_agent:
             request.headers.setdefault('User-Agent', user_agent)
             print(f"User-Agent: {user_agent}")
  

When crawling a site with Python you quickly run into anti-crawler limits, and an important way to get around them is to use IP proxies. Plenty of proxies can be found online, but the stable ones are expensive, so building your own pool out of free proxies is well worth it.
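Routing a request through a proxy with requests only takes a proxies mapping. A minimal sketch (the address used in the comment is a placeholder, not a live proxy):

```python
def build_proxies(ip_port):
    """Map both URL schemes to one proxy address, as requests expects."""
    return {"http": f"http://{ip_port}", "https": f"http://{ip_port}"}

# With a real proxy you would then do, e.g.:
# requests.get("http://httpbin.org/ip", proxies=build_proxies("1.2.3.4:8080"), timeout=10)
```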

The fetcher needs a browser disguise as well; here is the collector built around the requests library:

 # -*- coding: utf-8 -*-
 __author__ = 'zhougy'
 __date__ = '2018/9/7 下午2:32'

 import time

 import requests

 import threading
 from threading import Lock
 import queue

 g_lock = Lock()

 n_thread = 10

 headers = {
     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko)"
                   " Chrome/68.0.3440.106 Safari/537.36",
 }


 def fetch_web_data(url, proxy=None, timeout=10):
     try:
         r = requests.get(url, timeout=timeout, headers=headers, proxies=proxy)
         data = r.text
         return data
     except Exception:
         print(f"fetch_web_data has error, url: {url}")
         return None


 def write_ip_pair(ip_pair):
     '''
     Append a validated ip:port pair to proxy_ip_list_<date>.txt.
     :param ip_pair:
     :return:
     '''
     proxy_file_name = "proxy_ip_list_%s.txt" % (time.strftime("%Y.%m.%d", time.localtime(time.time())))
     with open(proxy_file_name, "a+", encoding="utf-8") as f:
         f.write(f"{ip_pair}\n")


 class IpProxyCheckThread(threading.Thread):
     def __init__(self, queue):
         threading.Thread.__init__(self)
         self.__queue = queue

     def run(self):
         global g_lock
         while True:
             data = self.__queue.get()
             ip_port_pair = data.split(",")[0]
             print(f"the check ip is {ip_port_pair}")
             proxy = {
                 "http": ip_port_pair,
             }
             # NOTE: the validation URL was lost when this article was scraped;
             # any stable page you are allowed to request will do.
             url = "http://httpbin.org/ip"
             data = fetch_web_data(url, proxy=proxy, timeout=15)
             if data is None:
                 print(f"ip {ip_port_pair} failed validation, discarded")
                 self.__queue.task_done()
                 continue
             print(f"ip {ip_port_pair} passed validation, usable")
             g_lock.acquire()
             write_ip_pair(ip_port_pair)
             g_lock.release()
             self.__queue.task_done()


 class FetchProxyListThread(threading.Thread):
     def __init__(self, url, mq):
         threading.Thread.__init__(self)
         self.__url = url
         self.__mq = mq

     def run(self):
         data = fetch_web_data(self.__url)
         print(data)
         ip_pool_list = data.split("\n")
         [self.__mq.put(ip_pool) for ip_pool in ip_pool_list]


 def process():
     mq = queue.Queue()

     thread_list = []
     for i in range(n_thread):
         t = IpProxyCheckThread(mq)
         t.daemon = True
         thread_list.append(t)

     [t.start() for t in thread_list]

     # NOTE: the free-proxy source URL was also lost in extraction;
     # substitute the list page of whichever proxy provider you use.
     url = "http://example.com/free-proxy-list"
     fth = FetchProxyListThread(url, mq)
     fth.start()

     fth.join()
     # The checker threads loop forever as daemons, so wait on the queue
     # (this relies on the task_done() calls above) rather than joining them.
     mq.join()


 if __name__ == "__main__":
     process()
  
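The file written by write_ip_pair above can later be loaded back for use in Scrapy. A small helper (the parsing is split out so it can be reused on any line source):

```python
def parse_proxy_lines(lines):
    """Keep non-empty ip:port entries, stripping surrounding whitespace."""
    return [line.strip() for line in lines if line.strip()]

def load_proxy_list(path):
    """Read validated pairs from the file written by write_ip_pair."""
    with open(path, encoding="utf-8") as f:
        return parse_proxy_lines(f)
```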

Adding an IP proxy pool to Scrapy

Lesson day05 covered how to obtain a set of IPs with good connectivity.

(1) Put the previously filtered working IP:port pairs into a list (for example, by reading the saved file into a list).

Create a my_proxies.py file along these lines:

 PROXY = [
     "187.65.49.137:3128",
     "108.61.246.98:8088",
     "167.99.197.73:8080",
 ]

(2) Add the IP-proxy middleware to middlewares.py:

 import random
 from . import my_proxies

 class MyProxyMiddleware(object):

     def process_request(self, request, spider):
         # Scrapy expects a full proxy URL, scheme included.
         request.meta['proxy'] = "http://" + random.choice(my_proxies.PROXY)

(3) Register the middleware in the settings file:

 DOWNLOADER_MIDDLEWARES = {
     'qiubai_proj.middlewares.MyProxyMiddleware': 300,
 }
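The middleware's proxy-picking logic can be sanity-checked outside Scrapy with a stand-in request object (FakeRequest and the sample pool below are illustrative, not Scrapy API):

```python
import random

PROXY = ["187.65.49.137:3128", "108.61.246.98:8088"]  # sample pool

class FakeRequest:
    """Stand-in for scrapy.Request, exposing only .meta."""
    def __init__(self):
        self.meta = {}

def assign_proxy(request):
    """The same assignment the middleware performs in process_request."""
    request.meta['proxy'] = "http://" + random.choice(PROXY)

req = FakeRequest()
assign_proxy(req)
```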

(4) Launch Scrapy and watch the proxy pool in action:

scrapy crawl qiubai


Source: 智云一二三科技 (https://www.zhihuclub.com/152288.shtml); original title: 干货!黑马程序员,轻松带你自建ip池,抓取数据拒封ip; author: 智云科技.
