Some questions about crawling data from zillow.com

I’m working on a project that fetches data from www.zillow.com, an online real-estate website. The whole program is based on the Python Scrapy package, and you can find it in my GitHub repository Tests/zillow_scrapy.

But I have run into a problem: I’ve tried every way I can think of to get around its anti-crawl system, such as using proxy IP addresses, changing my headers (especially the randomly generated cookies in them), and slowing down the crawl.
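In Scrapy terms, the measures above would look roughly like the following `settings.py` fragment. This is only an illustrative sketch with made-up values, not the actual configuration from my repository:

```python
# Illustrative Scrapy settings.py fragment for the anti-ban measures
# described above. The concrete values here are examples, not the ones
# used in the real project.

DOWNLOAD_DELAY = 5                  # seconds to wait between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay (0.5x to 1.5x of DOWNLOAD_DELAY)
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # crawl one page at a time per domain
AUTOTHROTTLE_ENABLED = True         # back off automatically when responses slow down
COOKIES_ENABLED = False             # don't carry one session cookie across all requests

# Mimic a real browser's headers (values would be copied from Chrome's DevTools;
# the strings below are truncated placeholders):
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Accept": "text/html,application/xhtml+xml,...",
    "Accept-Language": "en-US,en;q=0.9",
}

# A proxy is usually attached per request in the spider, e.g.:
# yield scrapy.Request(url, meta={"proxy": "http://user:pass@proxyhost:8080"})
```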

And they all failed.

So here are some interesting facts I’ve found, and I’d really like to know the reasons behind them. I hope someone can explain.

First: even when I use exactly the same headers as my Chrome browser, I get the captcha response while Chrome gets the correct data. In my view, if they have the same headers and the same network environment, they should get the same result, so why don’t they?

Second: after about two minutes of crawling, the server returns a captcha page to me, which means it finds me suspicious. So I just stop for a minute and restart the crawling program. Two things seem strange here.

One: every time, it stops after crawling the same number of pages. EVERY time! Two, which confuses me even more: I guessed the site banned me for exceeding some per-minute or per-hour rate limit, but if so, why does the crawl keep working every time I restart it? Is there some problem in Scrapy?

Any help with these questions would be greatly appreciated. Thank you for your kindness.

Well, that anti-crawl protection is probably there for a reason, and the exact rules those CAPTCHA systems use to decide whom to challenge aren’t public. They likely look at more than just headers (for example TLS fingerprints, JavaScript execution, and request timing patterns), which would explain why a request with identical headers can still be treated differently from a real browser.

However, since you say Chrome is getting the correct data, you might want to try headless Chromium. It is essentially the browser without a window, so pages are rendered the same way, and you can easily fetch the resulting data from it. Good luck!
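A minimal sketch of that approach, assuming a `chromium` (or `google-chrome`) binary is installed on your PATH; the binary name and the example URL are assumptions, not something from your project:

```python
# Hypothetical sketch: fetching a browser-rendered page with headless Chrome.
# Assumes a Chromium/Chrome binary is available on PATH.
import subprocess


def headless_fetch_cmd(url, binary="chromium"):
    """Build the headless-Chrome command line for dumping a page's rendered DOM."""
    return [
        binary,
        "--headless",     # run without opening a window
        "--disable-gpu",  # often needed on servers without a display
        "--dump-dom",     # print the rendered HTML to stdout
        url,
    ]


def fetch(url, binary="chromium"):
    """Run headless Chrome and return the rendered HTML as text."""
    result = subprocess.run(
        headless_fetch_cmd(url, binary),
        capture_output=True,
        text=True,
        timeout=60,
    )
    return result.stdout


# Example usage (requires the browser binary to be installed):
# html = fetch("https://www.zillow.com/homes/Seattle_rb/")
```

Because the page is rendered by a real browser engine, any JavaScript checks run as they would in normal Chrome; whether that alone gets past the site's detection is something you would have to test.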
