Talk:Random sentence from book


:In this task, words that are more likely to follow other words should be more likely to occur next. These weights need accumulating and applying in this task, whereas that task does not require it.

: --[[User:Paddy3118|Paddy3118]] ([[User talk:Paddy3118|talk]]) 10:56, 15 February 2021 (UTC)
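
:To make the point about weights concrete, here is a minimal sketch of accumulating follower counts and then applying them as weights. The name <code>word2next</code> matches the task's Python example; the helper names <code>accumulate_weights</code> and <code>weighted_next</code> and the tiny sample text are made up for illustration and are not part of that example.

<lang python>import random
from collections import Counter, defaultdict

def accumulate_weights(words):
    """Count how often each word follows each other word."""
    word2next = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        word2next[current][following] += 1
    return word2next

def weighted_next(word2next, word, rng=random):
    """Pick a follower of `word` with probability proportional to its count."""
    followers, weights = zip(*word2next[word].items())
    return rng.choices(followers, weights=weights)[0]

# Hypothetical sample text, already split into tokens.
sample = "the cat sat . the cat ran . the dog sat .".split()
word2next = accumulate_weights(sample)
print(dict(word2next['the']))           # 'cat' was seen twice, 'dog' once
print(weighted_next(word2next, 'the'))  # 'cat' is twice as likely as 'dog'</lang>

:A task that picks uniformly could skip the counting and choose from the set of followers instead.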

==Some stats==

I was thinking of extending the Python example to weight words following two, three, ... other words, but after a while I realised that this would constrain things so much that you would only generate sentences that are actually in the book! <br>

I decided instead to find out how many of the generated random sentences exist in the book for the current Python code by appending this snippet:

<lang python>#%% Sentence counts
# Appended to the task's Python example, which already defines random,
# word2next, word2next2, sentence_ending, sentence_pausing and
# txt_with_pauses_and_endings.

def gen_simple_sentence(word2next=word2next, word2next2=word2next2) -> str:
    "No tidying up of generated word sequence of sentence"
    s = ['.']
    # First word: weighted by what follows a sentence end.
    s += random.choices(*zip(*word2next[s[-1]].items()))
    while True:
        # Later words: weighted by what follows the previous two words.
        s += random.choices(*zip(*word2next2[(s[-2], s[-1])].items()))
        if s[-1] in sentence_ending:
            break
    return ' '.join(s[1:])

if 1:
    N = 1_000
    words = ['.'] + txt_with_pauses_and_endings.strip().split()
    sent_count = sum(words.count(punct) for punct in sentence_ending) - 1
    pause_count = sum(words.count(punct) for punct in sentence_pausing)
    avg_words_in_sent = (len(words) - 1 - pause_count
                         - words.count('re') - words.count('s')) / sent_count
    print(f'\nSentences in the book have ~{avg_words_in_sent:.1f} words')
    book = ' '.join(words)  # Now sanitised
    # Count generated sentences that occur verbatim in the book text.
    copies = sum(gen_simple_sentence() in book for _ in range(N))
    print(f"Generating {N:_} random sentences produced {copies:_}"
          " that are actually in the book")</lang>

;The average sentence length is approx. 19 words and around 15% of the generated sentences actually occur in the book.
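
The worry about longer contexts can be tried out on a toy text. What follows is only a sketch under stated assumptions, not the task's Python example: the sentence-start token <code>START</code>, the function names and the toy text are all hypothetical, and a generated sentence counts as a copy only if it matches a whole sentence of the text, which is stricter than the substring test used in the snippet above.

<lang python>import random
from collections import Counter, defaultdict

START = '^'                # hypothetical sentence-start padding token
ENDINGS = {'.', '!', '?'}

def split_sentences(words):
    """Split a flat token list into sentences, each ending in an ENDINGS token."""
    sentences, current = [], []
    for word in words:
        current.append(word)
        if word in ENDINGS:
            sentences.append(tuple(current))
            current = []
    return sentences       # trailing tokens without an ending are dropped

def build_model(sentences, order):
    """Map every `order`-token context to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        padded = (START,) * order + sentence
        for i in range(order, len(padded)):
            model[padded[i - order:i]][padded[i]] += 1
    return model

def generate(model, order, rng):
    """Generate one sentence by weighted choice on the last `order` tokens."""
    out = [START] * order
    while out[-1] not in ENDINGS:
        followers, weights = zip(*model[tuple(out[-order:])].items())
        out += rng.choices(followers, weights=weights)
    return tuple(out[order:])

def copy_fraction(words, order, n=1_000, seed=0):
    """Fraction of `n` generated sentences that are whole sentences of the text."""
    sentences = split_sentences(words)
    model = build_model(sentences, order)
    known = set(sentences)
    rng = random.Random(seed)
    return sum(generate(model, order, rng) in known for _ in range(n)) / n

# Hypothetical toy text; the longest sentence has 9 tokens.
toy = ("the cat sat on the mat . the dog sat on the cat . "
       "the cat saw the dog on the mat .").split()
for order in (1, 2, 3, 9):
    print(order, copy_fraction(toy, order))</lang>

Once <code>order</code> reaches the length of the longest sentence, every context pins down the whole sentence so far, so each generated sentence has to be a copy. How quickly the fraction climbs before that point depends on the text.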