How to browse a CSV in Python by intervals of letters at the beginning of lines?

I have a CSV that contains a lot of data. When I launch a web-scraping run, I receive a:

TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

In order to limit the amount of data processed per web-scraping run, I would like to divide the following script into several scripts, each browsing one interval of the CSV file:

import csv

# Get the data from the csv containing the pmid list by author:
with open("D:/Nancy/Pèse-Savants/Excercice Covid-19/Exercice 3/pmid_par_auteur.csv", 'r', encoding='utf-8') as f:
    # Separate the author list from the pmid list into 2 columns:
    with open("pmid_par_auteur_uniformise.csv", "w", encoding='utf-8') as fu:
        csv_f = csv.reader(f, delimiter=';')
        for ligne in csv_f:
            fu.write(ligne[0] + '\n')

from Bio import Entrez
# Entrez.email = "your.address@example.com"  # NCBI asks every client to identify itself

auteur_pmid_doi = []

# Clean up the data encoded in 'utf-8':
with open("pmid_par_auteur_uniformise.csv", encoding='utf-8') as fu:
    csv_fu = csv.reader(fu)

    for ligne in csv_fu:
        ligne[1] = ligne[1].replace("'", " ")
        ligne[1] = ligne[1].replace("[", " ")
        ligne[1] = ligne[1].replace("]", " ")
        ligne[1] = ligne[1].split(" , ")

        # Get the DOI of each pmid for each author who wrote on Covid-19:
        pmid_doi = []

        for pmid in ligne[1]:
            try:
                handle = Entrez.esummary(db="pubmed", id=pmid)
                record = Entrez.read(handle)
                # Handles are a finite resource; close each one so a large
                # dataset does not exhaust the handle supply.
                handle.close()
                record = record[0]['DOI']
            except (IndexError, KeyError):
                print('Missing DOI')
            else:
                pmid_doi.append([pmid, record])

        auteur_pmid_doi.append([ligne[0], pmid_doi])

        # Delete temporary variables to free some RAM:
        del ligne[1]
        del handle
        del record
        del pmid_doi


Each script would run through a data interval like this:

- From the first line starting with the letter A to the last line starting with the letter E.
- From the first line starting with the letter F to the last line starting with the letter J.
- From the first line starting with the letter K to the last line starting with the letter O.
- And so on up to "Z".

How do you browse the lines of a csv through these types of intervals?

I add the link to my csv and thank you in advance for your help.


First, you might try adding a time delay between requests, so that you are not overwhelming the service. E.g.

import time

time.sleep(5)    # wait 5 seconds before the next request

Another thing to consider is that you are requesting publications by author. If a given author has many publications, the service could be timing out because it has a problem returning hundreds of publications at a time.
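If that turns out to be the culprit, one mitigation is to query in small batches with a pause between them. A minimal sketch (the helper name, batch size, and the commented-out usage are illustrative, and it assumes you first gather each author's PMIDs into a plain list):

```python
import time

def batches(items, size):
    """Yield successive chunks of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical usage: fetch summaries 20 PMIDs at a time,
# sleeping between requests so the service is not overwhelmed.
# for chunk in batches(pmid_list, 20):
#     handle = Entrez.esummary(db="pubmed", id=",".join(chunk))
#     ...
#     time.sleep(1)
```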

But to answer your specific question, I’d use a simple brute-force method. First, get the first character of the line, which I’m guessing would be ligne[1][0]. You’ll need to test this and tweak as necessary; also make it lower case.

Once you have the first char, you can test against your desired range:

if first_char not in 'abcde':
    continue  # skip to the next line

Then repeat (presumably separate scripts) for each range you want.
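Putting those pieces together, a minimal sketch of one such script's filter (assuming the author name sits in column 0 of the semicolon-delimited file; the function name is mine) might look like:

```python
import csv

def rows_in_range(path, first, last):
    """Yield only rows whose first column starts with a letter in [first, last]."""
    with open(path, encoding='utf-8') as f:
        for row in csv.reader(f, delimiter=';'):
            first_char = row[0][:1].lower()
            if not (first <= first_char <= last):
                continue  # skip to the next line
            yield row
```

Each script would then call it with its own bounds, e.g. `rows_in_range("pmid_par_auteur.csv", 'a', 'e')`, and process only the rows it yields.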

Finally, you may need another script to handle first letters which are NOT in the range ‘a’…‘z’, for example accented characters.
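Alternatively, you could fold accented initials into the plain a–z buckets before testing the range, using the standard-library unicodedata module (the helper name here is illustrative):

```python
import unicodedata

def ascii_first_char(name):
    """Map an accented initial to its plain ASCII letter, e.g. 'É' -> 'e'."""
    # NFKD splits 'é' into 'e' + a combining accent; encoding to ASCII
    # with errors='ignore' then drops the accent.
    norm = unicodedata.normalize('NFKD', name[:1].lower())
    return norm.encode('ascii', 'ignore').decode('ascii')
```

That way an author like "Émile" lands in the a–e bucket instead of needing a separate script.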

There’s probably a smarter way but this is what came to mind first.