Tags: stats, language, weird_questions 2nd Feb 2023
Acronymic Groups (Part 2)
This second post looks at the general case of words other than BERMUDA, showing the code for that (where interesting) and the results when we plug in a dictionary.
For an English wordlist I ended up using this word list, of about 50,000 words, which contains far fewer obscure words than other word lists.
Results
With a word list and the calculated FLDs we can calculate real probabilities for every word in every scenario: with and without order, and with and without choices. We begin by looking at Scenario 1.
I calculate the probabilities, written as prob, by simply going through each letter in a word, and multiplying by its probability in the given FLD.
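As a minimal sketch of that calculation (the function name `scenario1` is mine for illustration, and I am assuming the words contain only the letters A to Z), using the unisex FLD that appears in the Scenario 4 listing later in this post:

```python
from math import prod

# Unisex FLD, indexed A..Z (same values as in the Scenario 4 listing).
unisex_fld = [0.11496, 0.03309, 0.06283, 0.03468, 0.07475, 0.02670, 0.02488,
              0.04748, 0.02877, 0.08992, 0.03774, 0.07547, 0.08621, 0.02522,
              0.02882, 0.01771, 0.00054, 0.05018, 0.06493, 0.03924, 0.00134,
              0.00536, 0.01096, 0.00071, 0.00419, 0.01318]

def scenario1(fld, word):
    """Multiply the first-letter probability of each letter in the word."""
    # ord('A') == 65, so ord(letter) - 65 maps A..Z onto indices 0..25
    return prod(fld[ord(letter) - 65] for letter in word.upper())

for w in ["am", "as", "me"]:
    print(f"{w} {scenario1(unisex_fld, w):.6f}")
```

Because the FLD values printed here are rounded to five decimal places, the output can differ from the tables in the last digit or so.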
Here are the top 20 most likely words, using the unisex FLD. The unfortunate result is that the most probable are boring two-letter words.
word prob normalised
am 0.009912 0.099558
as 0.007465 0.086401
me 0.006445 0.080281
em 0.006445 0.080281
ms 0.005599 0.074825
ah 0.005459 0.073884
ha 0.005459 0.073884
at 0.004511 0.067167
mr 0.004327 0.065782
re 0.003752 0.061251
he 0.003550 0.059578
eh 0.003550 0.059578
an 0.002900 0.053854
im 0.002481 0.049808
be 0.002474 0.049737
pa 0.002036 0.045124
ne 0.001886 0.043427
so 0.001872 0.043268
is 0.001869 0.043226
eg 0.001860 0.043129
This ‘Normalised’ metric is not actually a normalised probability but is equal to \(\sqrt[n]{prob}\), where \(n\) is the number of letters in the word. It gives a sense of which words are most likely whilst allowing longer words to rank equal to shorter words.
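A small worked example of that metric, assuming \(n\) is the word length (which is consistent with the table values). The helper name `normalised` is illustrative only:

```python
def normalised(prob, word):
    """n-th root of prob, with n the number of letters (a per-letter geometric mean)."""
    return prob ** (1 / len(word))

# prob values taken from the tables in this post
print(f"am     {normalised(0.009912, 'am'):.4f}")
print(f"salaam {normalised(6.420e-7, 'salaam'):.4f}")
```

Taking the root per letter is what lets a six-letter word like salaam sit next to a two-letter word like am in the ranking.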
Poisson: this column calculates the probability under the assumption that the number of exes follows a Poisson distribution with an average of 5, so it may represent a more realistic scenario than prob, which suggests that having 1 ex is just as likely as having 5.
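A minimal sketch of how such a column could be computed, assuming it is simply prob multiplied by the Poisson probability of having exactly as many exes as the word has letters. That reading agrees with the published values for am (prob 9.911e-3, poisson 8.348e-4) and ala (prob 9.974e-4, poisson 1.400e-4). The names `poisson_weighted` and `mean_exes` are illustrative:

```python
from math import exp, factorial

def poisson_weighted(prob, word, mean_exes=5):
    """Weight prob by the Poisson chance of having exactly len(word) exes."""
    n = len(word)
    pmf = exp(-mean_exes) * mean_exes ** n / factorial(n)  # Poisson pmf at n
    return prob * pmf

print(f"am  {poisson_weighted(9.911e-3, 'am'):.4g}")
print(f"ala {poisson_weighted(9.974e-4, 'ala'):.4g}")
```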
And if we sort by normalised it gets a tad bit more interesting.
both: | women: | men:
word prob normalised poisson | word prob normalised poisson | word prob normalised poisson
ala 9.974e-4 0.0999 1.400e-4 | ala 0.001435 0.1127 2.014e-4 | jam 1.089e-3 0.1028 1.528e-4
am 9.911e-3 0.0995 8.348e-4 | mama 0.000151 0.1108 2.652e-5 | ajar 7.907e-5 0.0943 1.387e-5
mama 9.824e-5 0.0995 1.723e-5 | am 0.012294 0.1108 1.035e-3 | raj 7.934e-4 0.0925 1.113e-4
lama 8.600e-5 0.0963 1.509e-5 | lama 0.000135 0.1078 2.376e-5 | jar 7.934e-4 0.0925 1.113e-4
jam 8.913e-4 0.0962 1.251e-4 | ease 0.000130 0.1067 2.282e-5 | am 7.779e-3 0.0882 6.552e-4
salaam 6.420e-7 0.0928 9.387e-8 | lea 0.001213 0.1066 1.702e-4 | mama 6.051e-5 0.0882 1.061e-5
mammal 6.393e-7 0.0928 9.348e-8 | ale 0.001213 0.1066 1.702e-4 | ala 6.591e-4 0.0870 9.253e-5
llama 6.490e-6 0.0917 1.138e-6 | salaam 0.000001 0.1064 2.126e-7 | jarl 5.265e-5 0.0851 9.238e-6
mamas 6.379e-6 0.0914 1.119e-6 | sea 0.001182 0.1057 1.658e-4 | jams 5.168e-5 0.0847 9.069e-6
lam 7.480e-4 0.0907 1.050e-4 | mamas 0.000012 0.1044 2.185e-6 | lama 5.145e-5 0.0846 9.028e-6
alas 6.477e-5 0.0897 1.136e-5 | alas 0.000118 0.1042 2.075e-5 | rajah 4.155e-6 0.0838 7.290e-7
lamas 5.584e-6 0.0890 9.799e-7 | as 0.010734 0.1036 9.041e-4 | jab 5.814e-4 0.0834 8.162e-5
mammals 4.151e-8 0.0881 4.336e-9 | male 0.000114 0.1034 2.008e-5 | mammal 3.134e-7 0.0824 4.583e-8
ajar 5.964e-5 0.0878 1.046e-5 | meal 0.000114 0.1034 2.008e-5 | jamb 4.538e-5 0.0820 7.963e-6
jams 5.788e-5 0.0872 1.015e-5 | lame 0.000114 0.1034 2.008e-5 | mara 4.409e-5 0.0814 7.736e-6
alabama 3.761e-8 0.0869 3.928e-9 | mammal 0.000001 0.1031 1.764e-7 | llama 3.414e-6 0.0806 5.991e-7
mara 5.718e-5 0.0869 1.003e-5 | seam 0.000112 0.1027 1.957e-5 | aha 5.219e-4 0.0805 7.326e-5
all 6.548e-4 0.0868 9.192e-5 | same 0.000112 0.1027 1.957e-5 | lam 5.162e-4 0.0802 7.246e-5
mall 5.646e-5 0.0866 9.907e-6 | llama 0.000011 0.1027 2.010e-6 | alabama 2.129e-8 0.0801 2.224e-9
llamas 4.215e-7 0.0865 6.163e-8 | lamas 0.000011 0.1022 1.958e-6 | cam 4.995e-4 0.0793 7.012e-5
Most probable words
The following two tables show the top 20 most probable words grouped by word length. These values have been calculated using the combined unisex FLD. Gendered results are in the next section. Each table also includes total_prob, which is the sum of the probabilities of all words of length K, and N, which is the number of words of length K in the vocabulary.
2-letter prob 3-letter prob 4-letter prob 5-letter prob
am 0.009912 ala 0.000997 mama 0.000098 llama 0.000006
as 0.007465 jam 0.000891 lama 0.000086 mamas 0.000006
me 0.006445 lam 0.000748 alas 0.000065 lamas 0.000006
em 0.006445 all 0.000655 ajar 0.000060 lemma 0.000005
ms 0.005599 lea 0.000649 jams 0.000058 amass 0.000005
ha 0.005459 ale 0.000649 mara 0.000057 james 0.000004
ah 0.005459 sam 0.000644 mall 0.000056 alarm 0.000004
at 0.004511 mas 0.000644 lame 0.000056 salsa 0.000004
mr 0.004327 aha 0.000628 male 0.000056 areal 0.000004
re 0.003752 cam 0.000623 meal 0.000056 small 0.000004
eh 0.003550 mac 0.000623 area 0.000050 malls 0.000004
he 0.003550 las 0.000563 elal 0.000049 meals 0.000004
an 0.002900 sea 0.000558 slam 0.000049 males 0.000004
im 0.002481 ace 0.000540 alms 0.000049 salem 0.000004
be 0.002474 raj 0.000519 seam 0.000048 camel 0.000004
pa 0.002036 jar 0.000519 same 0.000048 madam 0.000003
ne 0.001886 ram 0.000497 clam 0.000047 malta 0.000003
so 0.001872 arm 0.000497 calm 0.000047 mamba 0.000003
is 0.001869 mar 0.000497 mace 0.000047 areas 0.000003
eg 0.001860 elm 0.000486 came 0.000047 slams 0.000003
total_prob = 0.1036 total_prob = 0.05914 total_prob = 0.01355 total_prob = 0.001317
N = 47 N = 589 N = 2294 N = 4266
6-letter prob 7-letter prob 8-letter prob
salaam 6.420e-7 mammals 4.151e-8 mammalia 2.114e-9
mammal 6.393e-7 alabama 3.761e-8 caracals 1.475e-9
llamas 4.215e-7 marsala 3.222e-8 amalgams 1.377e-9
lemmas 3.130e-7 mascara 2.682e-8 caramels 1.316e-9
alaska 2.810e-7 amasses 2.332e-8 seamless 1.144e-9
alarms 2.802e-7 caracal 2.271e-8 massacre 1.132e-9
allele 2.761e-7 jamaica 2.129e-8 malarial 1.077e-9
camera 2.685e-7 amalgam 2.121e-8 alacarte 1.060e-9
madame 2.547e-7 mahatma 2.104e-8 almanacs 1.017e-9
smalls 2.380e-7 caramel 2.027e-8 massless 9.943e-10
sahara 2.351e-7 cascara 1.954e-8 sarcasms 9.839e-10
sesame 2.335e-7 alleles 1.793e-8 saleable 9.040e-10
camels 2.281e-7 measles 1.762e-8 escalate 8.924e-10
armada 2.280e-7 cameras 1.744e-8 callable 8.831e-10
madams 2.212e-7 almanac 1.567e-8 marshals 8.642e-10
jackal 2.127e-7 sarcasm 1.515e-8 teammate 8.454e-10
mambas 2.111e-7 armadas 1.480e-8 carcases 8.253e-10
pajama 2.086e-7 malaria 1.427e-8 escalade 7.887e-10
leases 2.044e-7 lacteal 1.387e-8 releases 7.670e-10
easels 2.044e-7 academe 1.387e-8 scalable 7.598e-10
total_prob = 0.0001057 total_prob = 6.781e-6 total_prob = 3.3527e-7
N = 6936 N = 9203 N = 9396
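The total_prob and N rows in the tables above can be reproduced with a simple grouping pass. This is a sketch only: `summarise_by_length` is an illustrative name, and the three-word list and uniform FLD are stand-ins for the real word list and FLD.

```python
from collections import defaultdict
from math import prod

def summarise_by_length(fld, words):
    """Return {length: (total_prob, N)} for the given words."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for word in words:
        p = prod(fld[ord(c) - 65] for c in word.upper())  # scenario 1 probability
        totals[len(word)] += p
        counts[len(word)] += 1
    return {k: (totals[k], counts[k]) for k in sorted(totals)}

uniform_fld = [1 / 26] * 26          # placeholder FLD, not the real one
toy_words = ["am", "as", "ala"]      # placeholder word list
for length, (total_prob, n) in summarise_by_length(uniform_fld, toy_words).items():
    print(f"length {length}: total_prob = {total_prob:.4g}, N = {n}")
```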
Boy results and girl results
Gendered results can be found here: Gendered results
Scenario 2
To calculate the probabilities, we calculate the number of distinct permutations of the letters in each word using the multinomial coefficient formula, and multiply that by the word's probability from Scenario 1. The result tables can be found here: Gendered results
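A minimal sketch of that step, with illustrative function names and a uniform FLD as a placeholder. The multinomial coefficient is \(n! / (c_1! \, c_2! \cdots)\), where the \(c_i\) are the counts of each distinct letter:

```python
from collections import Counter
from math import factorial, prod

def letter_permutations(word):
    """Number of distinct orderings of the letters (multinomial coefficient)."""
    counts = Counter(word.upper())
    denominator = prod(factorial(c) for c in counts.values())
    return factorial(len(word)) // denominator

def scenario2(fld, word):
    """Scenario 1 probability times the number of distinct letter orderings."""
    ordered = prod(fld[ord(c) - 65] for c in word.upper())
    return letter_permutations(word) * ordered

uniform_fld = [1 / 26] * 26  # placeholder FLD, not the real one
print(letter_permutations("bermuda"))  # 7 distinct letters, so 7! orderings
print(letter_permutations("mama"))     # repeated letters reduce the count
print(f"{scenario2(uniform_fld, 'mama'):.4g}")
```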
Scenario 3
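The following code solves scenario 3, which can be thought of as scenario 1, except that at each stage we need the probability of getting letter \(L\) at least once among \(K\) choices. That is one minus the probability of not getting it in any of the \(K\) choices, a simple binomial.
def scenario3(fld, word, k):
    # fld: 26 first-letter probabilities indexed A..Z; k: choices available per letter
    word = word.upper()
    prob = 1
    for letter in word:
        p = fld[ord(letter) - 65]
        # chance that at least one of the k choices starts with this letter
        prob *= 1 - pow(1 - p, k)
    return prob

values = list(range(1, 11)) + [15, 20, 30, 40, 50, 100, 500]
fld = [1/26 for _ in range(26)]  # uniform FLD, handy as a sanity check
# unisex FLD (same values as in the Scenario 4 listing)
unisex_fld = [0.11496, 0.03309, 0.06283, 0.03468, 0.07475, 0.02670, 0.02488, 0.04748, 0.02877, 0.08992, 0.03774, 0.07547, 0.08621, 0.02522, 0.02882, 0.01771, 0.00054, 0.05018, 0.06493, 0.03924, 0.00134, 0.00536, 0.01096, 0.00071, 0.00419, 0.01318]
word = "bermuda"
for k in values:
    answer = scenario3(unisex_fld, word, k)
    print(f"{k: 5d} {answer:.4g}")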
It seems that the ranking of the most probable words stays roughly the same for most values of K, up to some large value of K, so I will omit dumping too much data here. You can confirm that by looking at the results for word length 5, which show the consistent ordering. When K=1 (a single choice per letter) this reduces to scenario 1, so you can refer to the data there to see what is most likely for any word and any small K value you might be interested in.
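The K=1 reduction is easy to check numerically. A quick sketch, restating both functions so it runs on its own and using a uniform FLD as a placeholder (`scenario1` is an illustrative name):

```python
from math import isclose, prod

def scenario1(fld, word):
    return prod(fld[ord(c) - 65] for c in word.upper())

def scenario3(fld, word, k):
    prob = 1.0
    for c in word.upper():
        p = fld[ord(c) - 65]
        prob *= 1 - (1 - p) ** k  # at least one of k choices has this letter
    return prob

uniform_fld = [1 / 26] * 26  # placeholder FLD, not the real one
word = "bermuda"

# With a single choice per letter, 1 - (1 - p)**1 is just p.
print(isclose(scenario3(uniform_fld, word, 1), scenario1(uniform_fld, word)))

# More choices can only help, so the probability grows with k.
print(scenario3(uniform_fld, word, 2) > scenario3(uniform_fld, word, 1))
```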
Scenario 4
The code used here is similar to the version from the previous post but generalised to work for any word, and it has been optimised to use memoisation.
from functools import lru_cache

@lru_cache(maxsize=None)
def solution(state, k, fld):
    # state: tuple of letter counts still needed; fld: tuple of probabilities in the same order
    current_state = list(state)
    if all(x == 0 for x in current_state):
        return 1.0
    total = 0
    cur = 1  # running product of (1 - branch_prob) over needed letters already tried
    for i in range(len(current_state)):
        if current_state[i] > 0:
            branch_prob = 1 - pow(1 - fld[i], k)
            current_state[i] -= 1
            total += cur * branch_prob * solution(tuple(current_state), k, fld)
            current_state[i] += 1
            if k > 1:
                cur *= (1 - branch_prob)
    return total

# Acronym group case 4 generic case
def presolution(fld, word, k):
    word = word.upper()
    letter_freq = [0 for _ in range(26)]
    for i in word:
        letter_freq[ord(i) - 65] += 1
    # order the letters from least to most probable, keeping counts aligned with the FLD
    sorted_freq = sorted(zip(letter_freq, fld), key=lambda x: x[1])
    sorted_state = [x[0] for x in sorted_freq]
    sorted_fld = sorted(fld)
    # lru_cache needs hashable arguments, so pass tuples rather than lists
    return solution(tuple(sorted_state), k, tuple(sorted_fld))

values = list(range(1, 11)) + [15, 20, 30, 40, 50, 100, 500]
unisex_fld = [0.11496, 0.03309, 0.06283, 0.03468, 0.07475, 0.02670, 0.02488, 0.04748, 0.02877, 0.08992, 0.03774, 0.07547, 0.08621, 0.02522, 0.02882, 0.01771, 0.00054, 0.05018, 0.06493, 0.03924, 0.00134, 0.00536, 0.01096, 0.00071, 0.00419, 0.01318]
word = "bermuda"
for k in values:
    answer = presolution(unisex_fld, word, k)
    print(f"{k: 5d} {answer:.4g}")
Using this code, with some changes, I searched for the most probable words for K in [2, 5, 10] and word lengths from 3 to 9. The results are here. Interestingly enough, the most probable 5-letter word, using the same FLD, is in fact my name: James.
One aspect not yet addressed (but briefly mentioned above) is the conditional probability distribution of group sizes (the chance of Lee Mack having exactly seven exes), which would actually be relevant when trying to figure out whether Lee Mack is lying. However, I will leave this as an exercise for the reader.
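The shape of such a search can be sketched as follows. This is not the code behind the linked results: `top_words` is an illustrative helper, the word list is a four-word placeholder, and `toy_score` with its made-up letter weights stands in for a call to the scenario 4 solver at a fixed K.

```python
import heapq
from math import prod

def top_words(words, score, length, n=20):
    """Return the n highest-scoring words of the given length."""
    candidates = (w for w in words if len(w) == length)
    return heapq.nlargest(n, candidates, key=score)

# Hypothetical letter weights, for illustration only (not a real FLD).
toy_weights = {"a": 0.11, "e": 0.07, "j": 0.09, "l": 0.08, "m": 0.09, "s": 0.06}

def toy_score(word):
    # Stand-in for the scenario 4 probability of the word
    return prod(toy_weights[c] for c in word)

toy_words = ["james", "llama", "jam", "mamas"]  # placeholder word list
print(top_words(toy_words, toy_score, length=5, n=2))
```

Using `heapq.nlargest` avoids sorting the whole word list when only the top few entries per length are needed.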
Addendum
- Original link of wordlist: http://www.mieliestronk.com/wordlist.html
- Copy of the wordlist on my github.
- Full code
Here are the changes in the FLDs for boy and girl names from 1996 to 2021, using the census data.

