I created a scraper with Scrapy. It is designed to scrape keyword search results from the Ask.com search engine and write the scraped data to a JSON-formatted file. Here is the code of my scraper:
import scrapy
import datetime


class PagesearchSpider(scrapy.Spider):
    name = 'pageSearch'

    def start_requests(self):
        queries = ['love']
        for query in queries:
            url = 'https://www.ask.com/web?q=' + query
            yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        print('url:', response.url)
        start_pos = response.meta['pos']
        print('start pos:', start_pos)
        dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        items = response.css('div.PartialSearchResults-item')
        for pos, result in enumerate(items, start_pos + 1):
            yield {
                'title': result.css('a.PartialSearchResults-item-title-link.result-link::text').get().strip(),
                'snippet': result.css('p.PartialSearchResults-item-abstract::text').get().strip(),
                'link': result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href'),
                'position': pos,
                'date': dt,
            }
        # --- after loop ---
        next_page = response.css('.PartialWebPagination-next a')
        if next_page:
            url = next_page.attrib.get('href')
            print('next_page:', url)  # relative URL
            # use `follow()` to prepend `https://www.ask.com/` and create an absolute URL
            yield response.follow(url, callback=self.parse, meta={'pos': pos + 1})


# --- run without a project, and save the results to a file ---
from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    # save to a CSV, JSON or XML file
    'FEEDS': {'test.json': {'format': 'json'}},
    #'ROBOTSTXT_OBEY': True,  # this stops scraping
})
c.crawl(PagesearchSpider)
c.start()
Here, the search word is 'love', but you can put in as many words as you want. The reason I'm looking to create a Scrapy API is to let users send their keywords to my scraper automatically through the URL, because my basic scraper only works if I manually write the word to search for in the source code. The versions of my file.py shown below are the ones designed to allow my API to pass in the keywords to be searched.
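From what I have read in the Scrapy docs, a spider can receive keyword arguments through its constructor, so I believe the hard-coded list could be replaced by something like this (a minimal sketch; the query argument name is my own choice):

import scrapy

class PagesearchSpider(scrapy.Spider):
    name = 'pageSearch'

    def __init__(self, query='love', *args, **kwargs):
        # the search word now arrives as a spider argument instead of
        # being hard-coded inside start_requests
        super().__init__(*args, **kwargs)
        self.query = query

    def start_requests(self):
        # parse() stays exactly the same as in the spider above
        url = 'https://www.ask.com/web?q=' + self.query
        yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})

so that the word could be supplied from outside, e.g. with c.crawl(PagesearchSpider, query='weather'). So I created a Flask API, which is: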
from flask import Flask, render_template
from numpy import empty
from script import PagesearchSpider

app = Flask(__name__)
request = PagesearchSpider()

@app.route('/')
def index():
    return request.start_requests('{ }')
This works, but whenever I pass it an argument to send to the scraper through the GET method, I always get the response Internal Server Error. I think the error happens because I can't connect my Scrapy file to the API, since in the terminal where I started the Flask server I got no error. That's why I tested various versions of the code for my API and my scraper to try to connect them, but without success.
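One thing I have noticed while debugging (if I understand the Scrapy docs correctly) is that start_requests is a generator function, so calling it directly only gives back a generator of scrapy.Request objects, not scraped results, and Flask has nothing useful to turn into a response. A quick check in the Python shell seems to confirm this (assuming the spider above lives in script.py):

from script import PagesearchSpider

spider = PagesearchSpider()
gen = spider.start_requests()
print(gen)        # <generator object PagesearchSpider.start_requests at 0x...>
print(next(gen))  # <GET https://www.ask.com/web?q=love> -- a Request object, not data

So it seems I need another way to actually run the spider and collect its output from inside Flask.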
In my API file, namely main.py, I first tried the following code, inspired by this blog:
@app.route('/{ }', methods=['GET'])
def submit():
    if '{cat}' != empty:
        return request.start_requests('{ }')
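Rereading the Flask documentation, I suspect the { } placeholders are not valid Flask routing syntax in the first place; as far as I can tell, Flask uses angle-bracket variable rules and passes the captured value to the view function, roughly like this (a minimal sketch, not my real code):

@app.route('/<query>', methods=['GET'])
def submit(query):
    # `query` now holds whatever word the user puts in the URL,
    # e.g. a GET request to /love gives query == 'love'
    return query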
Then I tried:
from flask import Flask, render_template
from numpy import empty
from script import PagesearchSpider

app = Flask(__name__)
request = PagesearchSpider()

@app.route('/')
def index():
    return render_template("index.html")  # returns the index.html file in the templates folder

@app.route('/', methods=['GET'])
def submit():
    if request.method == 'GET':
        return request.start_data('{ }')
Here above, I took care to rename Scrapy's start_requests method to start_data in the associated scraping file. Then I tried this one, inspired by this video:
from flask import Flask
from script import PagesearchSpider

app = Flask(__name__)
request = PagesearchSpider()

@app.get("/{ }")
async def read_item():
    return request.start_requests(' ')
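The closest thing to a working idea I have found so far is to launch the spider in a separate process from the Flask route and then return the JSON feed it writes, along these lines (a rough, untested sketch; it assumes script.py is changed to read the keyword from sys.argv and forward it with c.crawl(PagesearchSpider, query=...) as in the sketch above):

import json
import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/search')
def search():
    query = request.args.get('q', 'love')  # e.g. GET /search?q=love
    # run the spider in its own process so Scrapy's Twisted reactor
    # does not clash with the Flask process
    subprocess.run(['python', 'script.py', query], check=True)
    # read back the file written by the FEEDS setting
    with open('test.json') as f:
        return jsonify(json.load(f))

But I am not sure this is the right approach either.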
Being a beginner in this field, I don't know which of my files is wrong from the start, if not both. I tested a LOT of solutions and did a lot of research, but without success. I don't know where to turn anymore, which is why I'm asking for your help. I hope I can count on the help of the experienced members of the community. Thank you!