Group by similar company names - Python

Hello!

I have a list of thousands of companies that manufacture certain products. Some of those companies are actually part of the same group and partially share the same name (e.g. 1- Git Company, Git and Sons co, FrenchGit, Go-Git-US; 2- ChinaCooling co, Cooling International, BabyCool, etc.).

Any advice on what code I could use in Python to group the companies that share part of their names?

Many thanks,

Xavier

One way to consider doing this would be to tokenize each company name, then gather companies together based on matching tokens.

E.g. “Git Company” is tokenized to “Git”, “Company”; “FrenchGit” to “French”, “Git”; etc. For this, you need to decide what separates tokens, beyond the usual spaces, punctuation, and so on. This link might be a useful intro.
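
To make the tokenizing step concrete, here is a minimal sketch. The `tokenize` helper is hypothetical, and it assumes that tokens are separated either by non-alphanumeric characters or by a lowercase-to-uppercase boundary (as in “FrenchGit”). It will not split names without such a boundary (e.g. “SpacegitSpace”), and it does not relate “Cool” to “Cooling”; those cases would need stemming or substring matching on top.

```python
import re

def tokenize(name):
    """Split a company name into lowercase tokens (illustrative helper)."""
    # Insert a space at each lowercase-to-uppercase boundary: "FrenchGit" -> "French Git".
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    # Treat any run of non-alphanumeric characters as a separator.
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]

print(tokenize("Git Company"))  # ['git', 'company']
print(tokenize("FrenchGit"))    # ['french', 'git']
print(tokenize("Go-Git-US"))    # ['go', 'git', 'us']
```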

You’d then need to figure out some “matching” logic. For example, “Company”, “Inc”, etc. might be useless tokens to match on. Then there are the “conflicting” tokens: does the company “Cool Git” match with “Cool” or “Git”? (Or both?)
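
One possible way to express that matching logic is an index from token to company names, skipping the generic tokens. This is only a sketch: `STOP_TOKENS` is a made-up list you would tune for your data, `tokenize` and `group_by_token` are hypothetical helpers, and here a name with two meaningful tokens (“Cool Git”) is simply placed in both groups.

```python
import re
from collections import defaultdict

# Assumed list of tokens too generic to match on; adjust for real data.
STOP_TOKENS = {"company", "co", "inc", "and", "sons", "international"}

def tokenize(name):
    # Split on lowercase-to-uppercase boundaries and on non-alphanumeric characters.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]

def group_by_token(names, stop_tokens=STOP_TOKENS):
    """Map each meaningful token to the names containing it."""
    index = defaultdict(list)
    for name in names:
        for token in set(tokenize(name)) - stop_tokens:
            index[token].append(name)
    # Keep only tokens shared by at least two companies.
    return {tok: members for tok, members in index.items() if len(members) > 1}

names = ["Git Company", "Git and Sons co", "FrenchGit",
         "Go-Git-US", "Cool Git", "BabyCool"]
groups = group_by_token(names)
for token, members in sorted(groups.items()):
    print(token, members)
```

With this sample list, “Cool Git” appears under both `cool` and `git`; whether that is acceptable depends on how you want to resolve conflicts.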

I’d try a brute-force matching process first; if it would take until the heat death of the universe to finish, then you can think about optimization.
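
As a rough illustration of the brute-force idea, the sketch below compares every pair of names and merges two companies into one group whenever they share a non-generic token. The helper names and the `STOP_TOKENS` list are assumptions, not anything from a library. Note that merging is transitive here, so a name like “Cool Git” would pull the “Cool” and “Git” groups together. The pairwise loop is quadratic, which should still be workable for a few thousand names.

```python
import re
from itertools import combinations

# Assumed list of tokens too generic to match on.
STOP_TOKENS = {"company", "co", "inc", "and", "sons"}

def tokens(name):
    # Token set for one name, without the generic tokens.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    return {t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t} - STOP_TOKENS

def brute_force_groups(names):
    """Compare every pair of names; merge those sharing at least one token."""
    token_sets = [tokens(n) for n in names]
    parent = list(range(len(names)))  # union-find forest over name indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if token_sets[i] & token_sets[j]:
            parent[find(j)] = find(i)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())

names = ["Git Company", "FrenchGit", "Go-Git-US",
         "ChinaCooling co", "Cooling International", "First company"]
for group in brute_force_groups(names):
    print(group)
```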

Hi @levesxv,

Welcome to the GitHub community forum!

# Let's create a small test list.
companies_list = ['First company', 'Git Company', 'FrenchGit', 'ItalianGit',
                  'GitEurope', 'SpacegitSpace', 'another companies',
                  'Stop with companies', 'O git k']

# Print all the companies.
print(companies_list)
# Output:
# ['First company', 'Git Company', 'FrenchGit', 'ItalianGit', 'GitEurope',
#  'SpacegitSpace', 'another companies', 'Stop with companies', 'O git k']

# Split the list into "git" and "no git" companies,
# using a case-insensitive substring match.
git_companies = []
nogit_companies = []
for company in companies_list:
    if 'git' in company.lower():
        git_companies.append(company)
    else:
        nogit_companies.append(company)

# Print the results.
print(git_companies)
# Output:
# ['Git Company', 'FrenchGit', 'ItalianGit', 'GitEurope', 'SpacegitSpace', 'O git k']
print(nogit_companies)
# Output:
# ['First company', 'another companies', 'Stop with companies']

Hope that is what you want :)

-Gabriele-

Mark helpful posts with Accept as Solution to help other users locate important info. Don’t forget to give Kudos for great content!
