I want to find a relationship between the number of characters in a word and the number of such valid words in English, and also in other languages. I was considering writing a programme to send a string to an online engine and get a response saying whether it is a valid word or not (a spell check). Is there any such online programme? Alternatively, is it possible to ask how many words of length x characters are in a dictionary? I know this topic is stretching relevance, but someone might come up with a good suggestion. Regards.
Hey @Clem73188
Something simple 
You might be looking for this popular Python library.
https://pyenchant.github.io/pyenchant/
# from the docs
import enchant
d = enchant.Dict("en_US")   # load the US English dictionary
d.check("Hello")            # True for a valid word, False otherwise
Use case matters.
I think what you're asking for here is the ratio between all possible combinations of characters of some length and the number of those that are real words.
the_kings_english : every_possible_combination_of_A-Z_with_repetition
If that's all you want, you can calculate the total number of possible strings with the permutation-with-repetition formula: an alphabet of n characters gives n^r strings of length r.
For example: how many four-letter strings can you make with the 26 chars of the Latin alphabet?
26^4 = 456,976
450,000 checks is nothing for Python, even on a Raspberry Pi.
However… note that the total of all combinations shorter than 10 chars is about 5.6 trillion (26^1 + 26^2 + … + 26^9 ≈ 5.65 × 10^12).
That's… not strictly beyond Python.
But it will be hard going.
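Just to make that blow-up concrete, a quick sketch (mine, nothing assumed beyond the arithmetic above):

# strings of each length over a 26-letter alphabet, plus the running total
total = 0
for k in range(1, 10):
    total += 26 ** k
    print(f"length {k}: {26 ** k:,} strings, {total:,} cumulative")

The cumulative line for length 9 is where that 5.6 trillion figure comes from.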
If I had to brute-force this, lord have mercy, I would use Julia.
Julia was born for this.
I would download this massive dataset of all the words. Then I'd make sure I have a sneaky bit of free RAM and start running membership checks with any() (tight loops like that are Julia's heavyweight event).
# fixed up from memory -- CSV.read needs a sink type like DataFrame
using CSV, DataFrames
checkMe = "cupcake"
data = CSV.read("englishwordsA-G.csv", DataFrame)
any(==(checkMe), data[:, 1])   # exact match; contains(checkMe) would also hit "cupcakes"
I have not checked the Julia forums, but I almost guarantee you some poor and lonely English Literature student has asked for this and some maths nerd has delivered a beautiful one-liner for you to copy and paste. (And they probably posted the results too.)
Counting words
This is the way, I reckon. Just grab that massive dataset of words I mentioned above and pandas will munch it up, no problemo.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
# Untested... I'm just going off memory here.
# Assumes the CSV is a single column of words with no header row.
import pandas as pd
df = pd.read_csv('all_the_words.csv', header=None, names=['word'])
df['num_char'] = df['word'].str.len()
print(df.groupby('num_char').size())   # one row per word length: how many words in each bin
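And to get at the ratio you actually asked about (valid words of length k versus all 26^k possible strings of that length), a couple more lines on top of the same df (again untested, my sketch):

counts = df.groupby('num_char').size()
possible = 26.0 ** counts.index.to_numpy()   # 26^k possible strings for each length k
print(counts / possible)                     # fraction of length-k strings that are real words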
Anyway… that was fun 
Thanks for giving me something to think about during the Boxing Day Test match ads
Pix
Hi Pixmusix, thank you for your post - I knew there had to be someone out there wanting something to think about during the ads! Thanks for the link to the word library. I did not understand your code, but I could write some code on the Pi to read in these files, extract all legitimate words, and then count the number in each character-length bin. Would there be a similar library for other languages like Russian, Greek, German, etc.? I am doing a bit of a study on intelligent patterns in random noise. My thought is that the longer a code sequence is, the smaller the proportion of useful codes compared to all possible codes of the same length, and that this is universal. It would be interesting to compare different languages. Someone said that among protein sequences of amino acids there is about 1 useful protein per 1E+70 possible combinations of the same length. Proteins are usually about 200 to 300 amino acids long. Regards.
Enchant can do other languages. Hit the docs for more info.
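For instance (assuming the relevant dictionaries are installed on your system; pyenchant only exposes whatever your Enchant backend can see):

import enchant
print(enchant.list_languages())   # which dictionaries this machine actually has
de = enchant.Dict("de_DE")        # German, if the de_DE dictionary is installed
print(de.check("Hallo"))          # True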
Oh yeah, I think I get what you're going for.
If your search doesn't have to be exhaustive, you can use whatever tools and languages you want.
For instance, it's easy to run 10,000,000 checks per bin size and then form a statistical argument. Anything better than 3.5 standard deviations would satisfy most people.
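A minimal sketch of that sampling idea, reusing pyenchant from earlier (my own code; shrink trials while testing, since ten million dictionary calls in pure Python is a coffee break):

import random
import string
import enchant

d = enchant.Dict("en_US")

def hit_rate(length, trials=10_000_000):
    # fraction of uniformly random lowercase strings of this length the dictionary accepts
    hits = 0
    for _ in range(trials):
        candidate = ''.join(random.choices(string.ascii_lowercase, k=length))
        if d.check(candidate):
            hits += 1
    return hits / trials

for k in range(2, 9):
    print(k, hit_rate(k, trials=100_000))

Past 9 or 10 characters the hit rate gets so tiny that a random sample comes back all zeros, which is itself a data point for the pattern you're describing.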
The numbers you're talking about are hefty bois, even for computers. If your search has to be exhaustive, the quality of your algorithm and the speed of your language do matter.