Extract Dictionary Words From droplist

Hook

I shared this on the old forum, so I thought I might as well post it on this one too.

Replace "file-here.txt" with the path to your .txt file containing the droplist. The script then creates a file called domains.csv with 3 columns (url, word(s), suffix). Use the filters in Excel, or sort column 3 A-Z, and you'll get all the .co.uk domains together. It has a bug with .uk names, which I have yet to fix as it causes no problems for my use.

You need Python with nltk installed (csv and re are part of the standard library). The script calls nltk.download('words') each time you run it, so the word list is fetched for you.

Python:
import re
import csv
import nltk
from nltk.corpus import words

# Ensure you have the words corpus
nltk.download('words')

# List of valid English words
english_words = set(words.words())

# Function to split domain into words
def split_into_words(domain):
    # Extract each unbroken run of letters; digits, dots and hyphens act as separators
    words = re.findall(r'[a-zA-Z]+', domain)
    return words

# Function to check if a word is valid English
def is_valid_word(word):
    return word.lower() in english_words

# Function to process domains
def process_domains(domains):
    processed_domains = []
    for domain in domains:
        # Skip any domain that contains a digit or a hyphen
        if re.search(r'[\d-]', domain):
            continue
        # Extract words from the domain
        words = split_into_words(domain.split('.')[0])
        # Filter out non-English words
        valid_words = [word for word in words if is_valid_word(word)]
        if valid_words:
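            # Take the last two dot-separated parts as the suffix, e.g. "co.uk".
            # Known bug: a name directly under .uk (example.uk) yields "example.uk".
            suffix = '.'.join(domain.split('.')[-2:])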
            processed_domains.append((domain, ' '.join(valid_words), suffix))
    return processed_domains

# Read domains from file
file_path = "file-here.txt"  # Adjust the file path as necessary
with open(file_path, mode='r') as file:
    domains = file.read().splitlines()

# Process domains
processed_domains = process_domains(domains)

# Write to CSV
with open('domains.csv', mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['url', 'word(s)', 'suffix'])  # Header with three columns
    writer.writerows(processed_domains)

print("CSV file 'domains.csv' created successfully.")
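
One possible way to handle the .uk bug mentioned above, as a minimal sketch rather than a tested fix for the script: treat everything after the first dot as the suffix, instead of always taking the last two parts. The helper name `split_suffix` is hypothetical, and the approach assumes every droplist line is a bare registrable name with no subdomain such as "www.".

```python
def split_suffix(domain):
    """Return (name, suffix), where the suffix is everything after the first dot.

    Assumes the droplist holds bare names only (no "www." or other subdomains).
    """
    name, _, suffix = domain.strip().lower().partition('.')
    return name, suffix


for d in ["example.co.uk", "example.uk"]:
    print(d, split_suffix(d))
```

With this, "example.co.uk" gives the suffix "co.uk" and "example.uk" gives "uk", so sorting on the suffix column should keep the two groups apart.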
 
ukdroplists gives you dictionary words via a filter. Is this doing something different?
 
It's not perfect, but it's good enough. I wrote it about a decade ago. If I were to do it again I'd use AI tokens, which would be more accurate. It will always struggle with strings it interprets as valid words that aren't actually the words the domain was built from. This is where you'd benefit from AI validation. It's obviously great with a single dictionary word, but it can fall down when it hits 3 or more words in a domain.
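
To illustrate the ambiguity described above, here is a rough sketch of dictionary-based splitting. It is not taken from the script; the function name `segmentations` and the tiny vocabulary are made up for the example, and the `english_words` set from the script could be passed in instead.

```python
def segmentations(name, vocabulary):
    """Return every way to split `name` into consecutive words from `vocabulary`."""
    name = name.lower()
    results = []

    def extend(start, so_far):
        # Reached the end of the name: record this complete split.
        if start == len(name):
            results.append(so_far)
            return
        # Try every dictionary word that begins at position `start`.
        for end in range(start + 1, len(name) + 1):
            piece = name[start:end]
            if piece in vocabulary:
                extend(end, so_far + [piece])

    extend(0, [])
    return results


# Hypothetical mini vocabulary, just for illustration.
vocab = {"no", "now", "here", "where", "nowhere"}
print(segmentations("nowhere", vocab))
# [['no', 'where'], ['now', 'here'], ['nowhere']]
```

All three splits are made of valid words, so a dictionary alone cannot say which one the registrant meant. With a full word list and longer names the number of candidate splits can grow quickly, which is roughly why some extra validation step would be needed.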
 