I have a list of thousands of companies that manufacture certain products. Some of those companies are actually part of the same group and share part of their name (e.g. 1: Git Company, Git and Sons co, FrenchGit, Go-Git-US; 2: ChinaCooling co, Cooling International, BabyCool, etc.).

Any advice on what code I could use in Python to group the companies that share part of their names?

One way to do this would be to tokenize each company name, then group companies together based on matching tokens.

E.g. “Git Company” is tokenized to “Git”, “Company”; “FrenchGit” to “French”, “Git”; etc. For this, you need to decide what separates one token from the next, beyond the usual spaces, punctuation, and so on (in “FrenchGit” it is the change in capitalization). This link might be a useful intro.
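As a rough sketch of that tokenizing step, here is one way it could look. The `tokenize` function is a name I made up for illustration, and it assumes tokens are separated by whitespace, punctuation, or a lowercase-to-uppercase boundary, and that names are plain ASCII.

```python
import re

def tokenize(name):
    """Split a company name into lowercase tokens.

    Assumption: tokens are separated by whitespace, punctuation,
    or a lowercase-to-uppercase boundary (as in "FrenchGit").
    """
    # Insert a space at each lowercase/digit -> uppercase boundary.
    spaced = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', name)
    # Keep runs of ASCII letters and digits; everything else is a separator.
    return [tok.lower() for tok in re.findall(r'[A-Za-z0-9]+', spaced)]

print(tokenize("Git Company"))   # ['git', 'company']
print(tokenize("FrenchGit"))     # ['french', 'git']
print(tokenize("Go-Git-US"))     # ['go', 'git', 'us']
```

Note that this will not split a name like “SpacegitSpace” into “Space”, “git”, “Space”, and it treats “Cool” and “Cooling” as different tokens. Handling those cases would need a word list or some kind of stemming on top.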

You’d then need to figure out some “matching” logic. For example, “Company”, “Inc”, etc. are probably useless tokens to match on. Then there are the “conflicting” tokens: does the company “Cool Git” match with “Cool” or “Git”? (Or both?)
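To make the matching idea concrete, here is a minimal sketch that ignores generic tokens and indexes names by the tokens that remain. The `STOP_TOKENS` set and the `group_by_token` helper are hypothetical, the tokenizer is a simplified assumption, and this version answers the “Cool Git” question with “both”.

```python
import re
from collections import defaultdict

# Hypothetical stop list: tokens assumed too generic to match on.
STOP_TOKENS = {"company", "co", "inc", "and", "sons", "international"}

def tokenize(name):
    # Simplified tokenizer (assumption): split on CamelCase boundaries
    # and on anything that is not an ASCII letter or digit.
    spaced = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', name)
    return [t.lower() for t in re.findall(r'[A-Za-z0-9]+', spaced)]

def group_by_token(names):
    """Map each non-stop token to the list of names containing it."""
    groups = defaultdict(list)
    for name in names:
        for token in set(tokenize(name)) - STOP_TOKENS:
            groups[token].append(name)
    # Keep only tokens shared by at least two companies.
    return {tok: members for tok, members in groups.items() if len(members) > 1}

names = ["Git Company", "Git and Sons co", "FrenchGit", "Go-Git-US",
         "Cool Git", "BabyCool"]
groups = group_by_token(names)
print(groups["git"])   # "Cool Git" is listed here...
print(groups["cool"])  # ...and here as well
```

If a company should belong to only one group, you would need an extra rule on top of this, such as preferring the token shared by the most companies.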

I’d try a brute-force matching process first. Only if it would take until the heat death of the universe to finish do you need to think about optimization.
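A brute-force pass could look something like the sketch below: compare every pair of names and merge the ones that share a token. The `brute_force_groups` function and the sample token sets are made up for illustration, and the input is assumed to be already tokenized with generic words removed.

```python
from itertools import combinations

def brute_force_groups(token_sets):
    """Group names whose token sets overlap, comparing every pair.

    token_sets: dict mapping company name -> set of meaningful tokens
    (assumed to be already tokenized and stripped of generic words).
    """
    # Each name starts as its own group; parent is a simple union-find.
    parent = {name: name for name in token_sets}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    # Quadratic number of comparisons: this is the brute-force part.
    for a, b in combinations(token_sets, 2):
        if token_sets[a] & token_sets[b]:
            parent[find(a)] = find(b)

    groups = {}
    for name in token_sets:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())

token_sets = {
    "Git Company": {"git"},
    "FrenchGit": {"french", "git"},
    "Go-Git-US": {"go", "git", "us"},
    "BabyCool": {"baby", "cool"},
    "Cool International": {"cool"},
}
for group in brute_force_groups(token_sets):
    print(group)
```

With n names this does n(n-1)/2 comparisons, so a few thousand names means a few million set intersections, which is probably tolerable. Be aware that merging is transitive here: a single name containing both “cool” and “git” would join the two groups into one.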

# Let's create some sort of test list
companies_list = ['First company', 'Git Company', 'FrenchGit', 'ItalianGit',
                  'GitEurope', 'SpacegitSpace', 'another companies',
                  'Stop with companies', 'O git k']

# This will print all the companies
print(companies_list)
# Output:
# ['First company', 'Git Company', 'FrenchGit', 'ItalianGit', 'GitEurope',
#  'SpacegitSpace', 'another companies', 'Stop with companies', 'O git k']

# Split the list into git and no-git companies
git_companies = list()
nogit_companies = list()
for company in companies_list:
    # Case-insensitive substring test, so 'SpacegitSpace' also matches
    if 'git' in company.lower():
        git_companies.append(company)
    else:
        nogit_companies.append(company)

# Print the results
print(git_companies)
# Output:
# ['Git Company', 'FrenchGit', 'ItalianGit', 'GitEurope', 'SpacegitSpace', 'O git k']
print(nogit_companies)
# Output:
# ['First company', 'another companies', 'Stop with companies']
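If you already know several group keywords, the same substring test can be extended to more than one group. This is only a sketch: the `keywords` list is hypothetical, and a company that contains two keywords ends up in both groups.

```python
companies_list = ['Git Company', 'FrenchGit', 'ChinaCooling co',
                  'BabyCool', 'First company']

# Hypothetical keywords, one per group you already know about.
keywords = ['git', 'cool']

groups = {keyword: [] for keyword in keywords}
unmatched = []
for company in companies_list:
    # Case-insensitive substring test against every keyword
    hits = [k for k in keywords if k in company.lower()]
    for k in hits:
        groups[k].append(company)
    if not hits:
        unmatched.append(company)

print(groups)
print(unmatched)
```

Because this matches substrings rather than whole tokens, 'cool' also catches 'ChinaCooling co', but short keywords can produce false matches inside unrelated words.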

Hope that is what you want :slight_smile:


