I’m working on a project that fetches data from www.zillow.com, an online real-estate website. The whole program is built on the Python Scrapy package, and you can find it in my GitHub repository Tests/zillow_scrapy.
But I’ve hit a real problem: I have tried every way I can think of to get past its anti-crawling system, such as using proxy IP addresses, changing my headers (especially the randomly generated Cookie value), and slowing down the crawl.
They all failed.
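For context, my throttling and header setup looks roughly like this (a sketch of the approach, not my exact config; the specific values and the `scrapy-rotating-proxies` plugin setting are illustrative):

```python
# settings.py (sketch) -- the throttling / header changes I tried.

DOWNLOAD_DELAY = 2                     # base delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True        # jitter: 0.5x - 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1     # one request at a time per domain

AUTOTHROTTLE_ENABLED = True            # back off automatically when responses slow down
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30

# Illustrative: proxy rotation via the scrapy-rotating-proxies plugin.
ROTATING_PROXY_LIST = [
    "http://proxy1.example.com:8000",  # placeholder proxies
    "http://proxy2.example.com:8000",
]

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}
```

Even with all of this enabled, the captcha still appears.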
So here are some interesting facts I’ve found. I’d really like to know the reasons behind them, and I hope someone can explain them.
First: even when I use exactly the same headers as my Chrome browser, I get the captcha response while Chrome gets the actual data. In my view, if both have the same headers and the same network environment, they should get the same result, so why don’t they?
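To rule out a subtle header mismatch, this is roughly how I copy the headers out of Chrome DevTools into the spider (a sketch; `parse_raw_headers` is a hypothetical helper of mine, not part of the repo):

```python
def parse_raw_headers(raw: str) -> dict:
    """Parse a raw header block copied from Chrome DevTools
    (Network tab -> request -> Headers) into a dict usable
    as the `headers=` argument of a scrapy.Request."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blanks and HTTP/2 pseudo-headers like ":authority",
        # which cannot be set as normal request headers.
        if not line or line.startswith(":"):
            continue
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


# Example block as copied from DevTools (shortened).
chrome_raw = """
:authority: www.zillow.com
accept: text/html,application/xhtml+xml
accept-language: en-US,en;q=0.9
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
"""

headers = parse_raw_headers(chrome_raw)
print(headers["user-agent"])  # -> Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```

Even with the dict matching Chrome exactly, the spider still gets the captcha page, which is what puzzles me.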
Second: after about two minutes of crawling, the server returns a captcha page, which means it has flagged me as suspicious. So I stop for a minute and restart the crawling program. Two things strike me as strange here.
Every time, it stops after crawling the same number of pages. EVERY time! That confuses me about the second point: I guessed I was being banned by a per-minute or per-hour rate limit, but then why does the crawler keep working each time I restart it? Is this a problem with Scrapy?
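My current guess is that the ban is tied to a per-session request counter rather than a wall-clock rate limit, which would explain both observations: a fixed page count before the captcha, and a fresh start after every restart (new session, new cookies). A toy model of such server-side logic (purely my speculation; I have no idea what Zillow actually does):

```python
class SessionRateLimiter:
    """Toy model of server-side logic: block a session after
    MAX_REQUESTS pages. A brand-new session (fresh cookies after a
    restart) starts from zero, so the crawler always dies after the
    same number of pages yet works again after every restart.
    Purely illustrative -- the threshold and mechanism are guesses.
    """

    MAX_REQUESTS = 100  # hypothetical threshold

    def __init__(self):
        self.counts = {}  # session id -> number of requests seen

    def allow(self, session_id: str) -> bool:
        self.counts[session_id] = self.counts.get(session_id, 0) + 1
        return self.counts[session_id] <= self.MAX_REQUESTS


limiter = SessionRateLimiter()
first_run = [limiter.allow("session-A") for _ in range(101)]
print(first_run.count(False))          # -> 1 (only the 101st request is blocked)
# Restarting the spider means a new session id, so the counter resets:
print(limiter.allow("session-B"))      # -> True
```

If this model is right, the behavior isn’t a Scrapy bug at all; it would simply be the server counting requests per session.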
Any help with these questions would be very welcome, and I really appreciate your kindness.