How to connect a Flask API to a web scraper using scrapy?

I created a scraper using Scrapy. It is designed to scrape keyword search results on the Ask.com search engine and write the scraped data to a JSON-formatted file. Here is the code of my scraper:

import scrapy
import datetime

class PagesearchSpider(scrapy.Spider):

    name = 'pageSearch'

    def start_requests(self):
        queries = [ 'love']
        for query in queries:
            url = 'https://www.ask.com/web?q='+query
            yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        print('url:', response.url)
        
        start_pos = response.meta['pos']
        print('start pos:', start_pos)

        dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')    
        
        items = response.css('div.PartialSearchResults-item')
        
        for pos, result in enumerate(items, start_pos+1):
            yield {
                'title':    result.css('a.PartialSearchResults-item-title-link.result-link::text').get().strip(), 
                'snippet':  result.css('p.PartialSearchResults-item-abstract::text').get().strip(), 
                'link':     result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href'), 
                'position': pos, 
                'date':     dt,
            }

        # --- after loop ---
        
        next_page = response.css('.PartialWebPagination-next a')
        
        if next_page:
            url = next_page.attrib.get('href')
            print('next_page:', url)  # relative URL
            # use `follow()` to add `https://www.ask.com/` to URL and create absolute URL
            yield response.follow(url, callback=self.parse, meta={'pos': pos+1})


# --- run without project, and save in file ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    # save to a CSV, JSON or XML file
    'FEEDS': {'test.json': {'format': 'json'}},
    #'ROBOTSTXT_OBEY': True,  # this stops scraping
})
c.crawl(PagesearchSpider)
c.start() 


Here, the search word is “love”, but you can put in as many words as you want. The reason I'm looking to create a Scrapy API is to allow users to send their keywords to my scraper automatically through the URL, because my basic scraper only works if I manually write the word to search for into the source code. The different versions in my file.py are attempts designed to allow keywords to be passed to the scraper by my API. So I created a Flask API, which is:

from flask import Flask , render_template
from numpy import empty
from script import PagesearchSpider

app = Flask(__name__)
request = PagesearchSpider()

@app.route('/')
def index():
	return request.start_requests('{ }')


which works, but whenever I pass it an argument to send to the scraper through the GET method, I always get the response Internal Server Error. I think the error is that I can't connect my Scrapy file with the API, because in the terminal where I started the Flask server I get no error. That's why I tested various versions of the code for my API and scraper to connect them, but without success.
In my API file, namely main.py, I first tried the following code, inspired by this blog:

@app.route('/{ }', methods=['GET'])
def submit():
    if '{cat}' != empty :
        return request.start_requests('{ }')
 

then

from flask import Flask , render_template
from numpy import empty
from script import PagesearchSpider

app = Flask(__name__)
request = PagesearchSpider()

@app.route('/')
def index():
	return render_template("index.html") # Returns index.html file in templates folder.


@app.route('/', methods=['GET'])
def submit():
    if request.method == 'GET':
        return request.start_data('{ }')


Above, I took care to rename the start_requests method from Scrapy to start_data in the associated scraping file. Then I tried this one, inspired by this video:

from flask import Flask
from script import PagesearchSpider

app = Flask(__name__)
request = PagesearchSpider()

@app.get("/{ }")
async def read_item( ):
    return request.start_requests(' ')


Being a beginner in this field, I don't know which of my files is wrong from the start, or whether it's both. I tested a LOT of solutions and did a lot of research, but without success. I don't know where to turn anymore, which is why I'm asking for your help. I hope I can count on the help of experienced members of the community. Thank you!


What is the '{cat}' in that call (and similar ones in the other versions) meant to do? From the context it almost looks like you’re trying to transfer a parameter from the incoming request, but it’s just a fixed string.

The Gist you posted for the PagesearchSpider is hard to read because it’s a mix of different versions, unfortunately.

{cat} has no specific function; it can be replaced by spaces. It was just a way to show how the keyword should be used. To be more specific, it's useless and just serves as an example. When running, the {cat} was never taken into account by the computer, which is what I expected.

And what do you mean by it being a mix of different versions? Admittedly, I took my sources from blogs and different videos, but I don't understand what you mean by that.

There are multiple definitions of your PagesearchSpider class in the file, plus other things. You write that you tried this and that, but that only makes it harder for anyone else to understand how your code is supposed to work, what you actually want.

This kind of has the same problem: it makes it hard to understand what you want the code to do. Especially considering '{cat}' looks like f-string syntax where someone just forgot the leading f. :sweat_smile:
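For comparison, in Flask a value from the URL path is captured with a placeholder in the route decorator and handed to the view function as an argument. A minimal sketch, independent of your scraper code:

from flask import Flask

app = Flask(__name__)

@app.route('/search/<query>')
def search(query):
    # `query` holds whatever the client put after /search/ in the URL
    return f'you searched for {query}'

Requesting /search/love would then call search('love').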

As far as I understand you have two parts here: The scraper, and a Flask-based API using it. So I recommend you approach the problem in two steps:

  • Work on the scraper and write tests for it, until those work like you want them to. I actually recommend writing the tests first, so they show how you want the scraper to work (a sketch of such a test follows below).
  • Then work on the API, and make it use the scraper.

This makes it much easier to see where any problems are, and if you still have trouble smaller parts will make it easier for us here to understand what you want and what’s going on. :slightly_smiling_face:
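For the first step, such a test could look roughly like this. It's only a sketch: it assumes the spider lives in script.py, that the CrawlerProcess block at the bottom of that file is removed or guarded by if __name__ == '__main__', and the HTML snippet is made up to mimic one Ask.com result:

from scrapy.http import HtmlResponse, Request
from script import PagesearchSpider

def test_parse_extracts_one_result():
    html = b'''
    <div class="PartialSearchResults-item">
      <a class="PartialSearchResults-item-title-link result-link" href="https://example.com">Example</a>
      <p class="PartialSearchResults-item-abstract">A snippet.</p>
    </div>
    '''
    url = 'https://www.ask.com/web?q=love'
    request = Request(url=url, meta={'pos': 0})
    response = HtmlResponse(url=url, body=html, request=request)

    items = list(PagesearchSpider().parse(response))

    assert items[0]['title'] == 'Example'
    assert items[0]['link'] == 'https://example.com'
    assert items[0]['position'] == 1

Running it with pytest exercises parse() without any network access, so you can adjust the selectors quickly.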

:sweat_smile: Let's say that for the multiple versions of my code, it's a habit I picked up on other forums: show the community what you've tried so they can see you're putting in the effort and not trying to offload the work, so that people will be more open to helping you. I thought it was the same for the GitHub community.

That’s not wrong, showing context is important. But while I often have to remind people who post here to show context, too much can be a problem too: when it's so much that it gets hard to find the important parts. Different versions can be important if you have a situation like “A works, but B doesn't. If I make this change, B works, but A breaks.” Otherwise it's usually best to post the version that's closest to doing what you want, and to describe what you want and the issues you face. :wink:

So, looking just at the code in your top post right now:

  • All places where PagesearchSpider.start_requests() is called pass a string to the method; however, the code there doesn't expect a parameter. That would lead to an exception, which Flask would usually handle by returning an Internal Server Error.

  • The body of the PagesearchSpider.start_requests() method yields rather than returning. That is correct code, but it means calling the function returns a generator. You'll need to add code that reads from the generator to turn it into a response body that Flask can send to the HTTP client; if your handler function returns something Flask doesn't understand, it will also result in an error. (See the sketch below.)
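To make both points concrete, here is a minimal, self-contained sketch. It is not your real spider (whose start_requests() yields scrapy.Request objects, not result dicts); it only shows a function that takes the query as a parameter, and a handler that reads the resulting generator into a list before handing it to Flask:

from flask import Flask, jsonify, request

app = Flask(__name__)

def fake_start_requests(query):
    # Stand-in for the spider method: because it yields, calling it only
    # returns a generator object until something iterates over it.
    for pos, word in enumerate([query, query + '!'], start=1):
        yield {'position': pos, 'title': word}

@app.route('/')
def index():
    query = request.args.get('q', 'love')       # e.g. GET /?q=love
    results = list(fake_start_requests(query))  # consume the generator
    return jsonify(results)                     # something Flask can turn into a response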

Another thing to try is to run the Flask server in debug mode, so you can see more of what’s happening.
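A minimal sketch of what that looks like, assuming the server is started by running main.py directly:

# at the bottom of main.py, where `app` is the Flask application defined above
if __name__ == '__main__':
    app.run(debug=True)  # auto-reloader, plus full tracebacks shown in the browser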

When I run my api file in debug mode, I get the following errors:

/usr/bin/env C:\\Python310\\python.exe c:\\Users\\user\\.vscode\\extensions\\ms-python.python-2022.8.0\\pythonFiles\\lib\\python\\debugpy\\launcher 63667 -- c:\\Users\\user\\Documents\\AAprojects\\Whelpsgroups1\\API\\main.py
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "c:\Users\user\.vscode\extensions\ms-python.python-2022.8.0\pythonFiles\lib\python\debugpy\launcher\__main__.py", line 97, in <module>
    main()
  File "c:\Users\user\.vscode\extensions\ms-python.python-2022.8.0\pythonFiles\lib\python\debugpy\launcher\__main__.py", line 53, in main
    launcher.connect(host, port)
  File "c:\Users\user\.vscode\extensions\ms-python.python-2022.8.0\pythonFiles\lib\python\debugpy\launcher/../..\debugpy\launcher\__init__.py", line 34, in connect
    sock.connect((host, port))
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

on the command line. Yet with the command python main.py it works fine.

That looks like an error in the debugger code of VSC, which I’m not familiar with. I mean you should use the debug mode of Flask, as described in the documentation I linked.

I took your advice about the multiple versions too, and I would like to edit my code, but the button to edit my post does not appear. I tried reloading the page several times and stopping and restarting my browser, without success. I would like to show exactly where I am with my scraper and my API and remove the other versions of my files so that my code hangs together better.

I don’t see any link in your post :thinking:

You can post your current code in a new post (in this thread), or put it in a repository and link to the current commit.

The previous post, here (that’s why I wrote “linked”, in past tense):

After following the administrator's advice, I edited my gist file.py. This version of my code is the current one.

I have also changed my API code a lot, so I would ask readers to stick to this version; here it is:

import crochet
crochet.setup()

from flask import Flask , render_template, jsonify, request, redirect, url_for
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher
import time
import os

# Importing our scraping spider from the askScraping file

from scrap.askScraping import AskScrapingSpider

# Creating Flask App Variable

app = Flask(__name__)

output_data = []
crawl_runner = CrawlerRunner()

# By default Flask will come into this when we run the file
@app.route('/')
def index():
	return render_template("index.html") # Returns index.html file in templates folder.


# After clicking the Submit Button FLASK will come into this
@app.route('/', methods=['POST'])
def submit():
	if request.method == 'POST':
		s = request.form['url'] # Getting the keyword typed into the form field named 'url'
		global baseURL
		baseURL = s
		# Remove any existing output file so that Scrapy will not append data to a previous run.
		if os.path.exists("<path_to_outputfile.json>"):
			os.remove("<path_to_outputfile.json>")
		return redirect(url_for('scrape')) # Passing to the scrape function

@app.route("/scrape")
def scrape():

	scrape_with_crochet(baseURL="https://www.ask.com/web?q={baseURL}") # Passing that URL to our Scraping Function

	time.sleep(20) # Pause the function while the scrapy spider is running
	
	return jsonify(output_data) # Returns the scraped data after running for 20 seconds.


@crochet.run_in_reactor
def scrape_with_crochet(baseURL):
	# Connect the item_scraped signal, so every item the spider yields is passed to _crawler_result.
	dispatcher.connect(_crawler_result, signal=signals.item_scraped)

	# Start the AskScrapingSpider from our scrapy file; after each yield the crawler fires the signal above.
	eventual = crawl_runner.crawl(AskScrapingSpider, category = baseURL)
	return eventual

#This will append the data to the output data list.
def _crawler_result(item, response, spider):
	output_data.append(dict(item))


if __name__== "__main__":
	app.run(debug=True)

Thanks

Now that code looks like something that’d require deep knowledge of crochet and scrapy, which I don’t have (only Python and Flask). :sweat_smile:

What would be helpful for anyone who does is if you’d describe how your code isn’t working:

  • What you want it to do
  • What it’s doing instead
  • Any errors, etc. you’re getting

Here is how my scraping API should normally work. I created a form with a search bar, named index.html, which I haven't shown in my code. When the Flask API starts, it returns this form. When the user enters a word in the form's search bar, the word is sent to my API via the POST method. The API must check that the method used is indeed POST, pass the entered keyword to the scraper through the baseURL variable, and check whether a JSON file already exists on the scraper's path, deleting it if so. The scraper then searches for this keyword on the Ask.com site and scrapes the search results, namely the title, URL, description, position and scraping date of each web page, follows the site's pagination to scrape further results for the searched words, and writes these results to a JSON file, which is in turn sent back to the Flask API, which displays it on a web page at localhost:5000/scrape between square brackets [ ]. But unfortunately, for a reason I don't know, when I send a keyword from my form by the POST method, it returns empty square brackets [ ] at localhost:5000/scrape; on the command line I get no errors, and no JSON file is created, as if nothing was sent to my scraper.
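To make that flow concrete, this is how I would expect to drive it from a small test script instead of the HTML form (a hypothetical example, assuming the Flask server is running on localhost:5000 and the form field is named url as in the code above):

import requests

# Simulate submitting the search form, then fetch whatever the spider collected.
requests.post('http://localhost:5000/', data={'url': 'love'})
print(requests.get('http://localhost:5000/scrape').json())

With the current code, that second call is what comes back as empty square brackets.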

I also replaced crawl_runner = CrawlerRunner() in my main.py file with

project_settings = get_project_settings()
crawl_runner = CrawlerProcess(settings = project_settings)

and added the following imports

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

but when I reload my Flask server I get the following errors:

2022-06-21 11:44:55 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-06-21 11:44:57 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform Windows-10-10.0.19044-SP0
2022-06-21 11:44:57 [werkzeug] WARNING:  * Debugger is active!
2022-06-21 11:44:57 [werkzeug] INFO:  * Debugger PIN: 107-226-838
2022-06-21 11:44:57 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_TIMEOUT': 15}
2022-06-21 11:44:57 [werkzeug] INFO: 127.0.0.1 - - [21/Jun/2022 11:44:57] "GET / HTTP/1.1" 200 -
2022-06-21 11:44:58 [twisted] CRITICAL: Unhandled error in EventualResult
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1315, in run
    self.mainLoop()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1325, in mainLoop
    reactorBaseSelf.runUntilCurrent()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 964, in runUntilCurrent
    f(*a, **kw)
  File "C:\Python310\lib\site-packages\crochet\_eventloop.py", line 420, in runs_in_reactor
    d = maybeDeferred(wrapped, *args, **kwargs)
--- <exception caught here> ---
  File "C:\Python310\lib\site-packages\twisted\internet\defer.py", line 190, in maybeDeferred
    result = f(*args, **kwargs)
  File "C:\Users\user\Documents\AAprojects\Whelpsgroups1\API\main.py", line 62, in scrape_with_crochet
    eventual = crawl_runner.crawl(AskScrapingSpider, category = baseURL)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 82, in __init__
    default.install()
  File "C:\Python310\lib\site-packages\twisted\internet\selectreactor.py", line 194, in install
    installReactor(reactor)
  File "C:\Python310\lib\site-packages\twisted\internet\main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

2022-06-21 11:44:58 [twisted] CRITICAL: Unhandled error in EventualResult
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1315, in run
    self.mainLoop()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1325, in mainLoop
    reactorBaseSelf.runUntilCurrent()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 964, in runUntilCurrent
    f(*a, **kw)
  File "C:\Python310\lib\site-packages\crochet\_eventloop.py", line 420, in runs_in_reactor
    d = maybeDeferred(wrapped, *args, **kwargs)
--- <exception caught here> ---
  File "C:\Python310\lib\site-packages\twisted\internet\defer.py", line 190, in maybeDeferred
    result = f(*args, **kwargs)
  File "C:\Users\user\Documents\AAprojects\Whelpsgroups1\API\main.py", line 62, in scrape_with_crochet
    eventual = crawl_runner.crawl(AskScrapingSpider, category = baseURL)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 82, in __init__
    default.install()
  File "C:\Python310\lib\site-packages\twisted\internet\selectreactor.py", line 194, in install
    installReactor(reactor)
  File "C:\Python310\lib\site-packages\twisted\internet\main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

2022-06-21 11:45:54 [werkzeug] INFO: 127.0.0.1 - - [21/Jun/2022 11:45:54] "POST / HTTP/1.1" 302 -
2022-06-21 11:45:54 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_TIMEOUT': 15}
2022-06-21 11:45:54 [twisted] CRITICAL: Unhandled error in EventualResult
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1315, in run
    self.mainLoop()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1325, in mainLoop
    reactorBaseSelf.runUntilCurrent()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 964, in runUntilCurrent
    f(*a, **kw)
  File "C:\Python310\lib\site-packages\crochet\_eventloop.py", line 420, in runs_in_reactor
    d = maybeDeferred(wrapped, *args, **kwargs)
--- <exception caught here> ---
  File "C:\Python310\lib\site-packages\twisted\internet\defer.py", line 190, in maybeDeferred
    result = f(*args, **kwargs)
  File "C:\Users\user\Documents\AAprojects\Whelpsgroups1\API\main.py", line 62, in scrape_with_crochet
    eventual = crawl_runner.crawl(AskScrapingSpider, category = baseURL)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 82, in __init__
    default.install()
  File "C:\Python310\lib\site-packages\twisted\internet\selectreactor.py", line 194, in install
    installReactor(reactor)
  File "C:\Python310\lib\site-packages\twisted\internet\main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

2022-06-21 11:45:54 [twisted] CRITICAL: Unhandled error in EventualResult
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1315, in run
    self.mainLoop()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1325, in mainLoop
    reactorBaseSelf.runUntilCurrent()
  File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 964, in runUntilCurrent
    f(*a, **kw)
  File "C:\Python310\lib\site-packages\crochet\_eventloop.py", line 420, in runs_in_reactor
    d = maybeDeferred(wrapped, *args, **kwargs)
--- <exception caught here> ---
  File "C:\Python310\lib\site-packages\twisted\internet\defer.py", line 190, in maybeDeferred
    result = f(*args, **kwargs)
  File "C:\Users\user\Documents\AAprojects\Whelpsgroups1\API\main.py", line 62, in scrape_with_crochet
    eventual = crawl_runner.crawl(AskScrapingSpider, category = baseURL)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 82, in __init__
    default.install()
  File "C:\Python310\lib\site-packages\twisted\internet\selectreactor.py", line 194, in install
    installReactor(reactor)
  File "C:\Python310\lib\site-packages\twisted\internet\main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed
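The last line of every traceback points at the conflict itself: crochet.setup() has already installed a Twisted reactor, and CrawlerProcess then tries to install its own, which Twisted refuses. A minimal sketch of an alternative that keeps the project settings without installing a second reactor (assuming the crochet-based main.py above) would be to stay with CrawlerRunner, which only uses the reactor that already exists:

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

crawl_runner = CrawlerRunner(settings=get_project_settings())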
