Word frequency
Revision as of 06:58, 15 August 2017
You are encouraged to solve this task according to the task description, using any language you may know.
- Task
Given a text file and an integer n, print the n most common words in the file (and the number of their occurrences) in decreasing frequency.
For the purposes of this task:
- A word is a sequence of one or more contiguous letters
- Uppercase letters are considered equivalent to their lowercase counterparts
- Words of equal frequency can be listed in any order
Show example output using Les Misérables from Project Gutenberg
as the text file input and display the top 10 most used words.
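The rules above can be sketched in Python as follows. This is only an illustration of the task definition, not one of the listed solutions; the function name `top_words` and the sample text are hypothetical. The pattern `[^\W\d_]+` is chosen so that only letters count as word characters, since a plain `\w+` would also accept digits and underscores.

```python
import re
from collections import Counter

# Letters only: \w with digits and the underscore excluded
# (Unicode-aware in Python 3, so accented letters still match).
WORD_RE = re.compile(r"[^\W\d_]+")


def top_words(text, n):
    """Return the n most common words in text as (word, count) pairs."""
    # Lowercasing first makes upper- and lowercase letters equivalent.
    words = WORD_RE.findall(text.lower())
    return Counter(words).most_common(n)


if __name__ == "__main__":
    # Hypothetical sample text; a real run would read the input file instead.
    sample = "The cat and the hat. THE END, and the_rest 42 times."
    for word, count in top_words(sample, 2):
        print(f"{word:>10} {count:>6}")
```

Words with equal counts may come back in any order, which the task explicitly allows.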
- History
This task was originally taken from the Programming Pearls column in Communications of the ACM, June 1986, Volume 29, Number 6,
in which Donald Knuth solves the problem using literate programming and Doug McIlroy then critiques the solution,
showing that the problem can be solved with a six-line Unix shell script.
Clojure
<lang clojure>(defn count-words [file n]
(->> file slurp clojure.string/lower-case (re-seq #"\w+") frequencies (sort-by val >) (take n)))</lang>
- Output:
user=> (count-words "135-0.txt" 10)
(["the" 41036] ["of" 19946] ["and" 14940] ["a" 14589] ["to" 13939] ["in" 11204] ["he" 9645] ["was" 8619] ["that" 7922] ["it" 6659])
Python
Python2.7
<lang python>import collections
import re
import string
import sys

def main():
    counter = collections.Counter(re.findall(r"\w+",open(sys.argv[1]).read().lower()))
    print counter.most_common(int(sys.argv[2]))

if __name__ == "__main__":
    main()</lang>
- Output:
$ python wordcount.py 135-0.txt 10
[('the', 41036), ('of', 19946), ('and', 14940), ('a', 14589), ('to', 13939), ('in', 11204), ('he', 9645), ('was', 8619), ('that', 7922), ('it', 6659)]
Python3.6
<lang python>from collections import Counter
from re import findall

les_mis_file = 'les_mis_135-0.txt'

def _count_words(fname):
    with open(fname) as f:
        text = f.read()
    words = findall(r'\w+', text.lower())
    return Counter(words)

def most_common_words_in_file(fname, n):
    counts = _count_words(fname)
    # Prepend a header row so it is printed with the same column widths
    for word, count in [['WORD', 'COUNT']] + counts.most_common(n):
        print(f'{word:>10} {count:>6}')

if __name__ == "__main__":
    n = int(input('How many?: '))
    most_common_words_in_file(les_mis_file, n)</lang>
- Output:
How many?: 10
      WORD  COUNT
       the  41036
        of  19946
       and  14940
         a  14586
        to  13939
        in  11204
        he   9645
       was   8619
      that   7922
        it   6659
UNIX Shell
<lang bash>#!/bin/sh
cat ${1} | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${2}q</lang>
- Output:
$ ./wordcount.sh 135-0.txt 10
  41089 the
  19949 of
  14942 and
  14608 a
  13951 to
  11214 in
   9648 he
   8621 was
   7924 that
   6661 it
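For comparison, the stages of the shell pipeline above can be mirrored step by step in Python. This is a rough sketch rather than a drop-in replacement; the function name `pipeline_counts` and the sample text are hypothetical. Like `tr -cs A-Za-z`, it treats only ASCII letters as word characters, so underscores, digits and accented letters all act as separators. That difference from `\w+` is a plausible reason for the slightly higher counts in the shell output, though this has not been checked against the input file.

```python
import re
from collections import Counter


def pipeline_counts(text, n):
    """Mimic the tr | tr | sort | uniq -c | sort -rn | sed pipeline."""
    # tr -cs A-Za-z '\n': each run of non-ASCII-letter characters is a separator.
    tokens = re.split(r"[^A-Za-z]+", text)
    # tr A-Z a-z: fold to lowercase; empty tokens from the split are dropped.
    words = [t.lower() for t in tokens if t]
    # sort | uniq -c | sort -rn | sed ${n}q: count, order by count, keep n.
    return Counter(words).most_common(n)


if __name__ == "__main__":
    # Hypothetical sample: "_the_" and "café" are split at non-ASCII-letters.
    sample = "The _the_ THE café 9"
    for word, count in pipeline_counts(sample, 2):
        print(f"{count:>7} {word}")
```

With a `\w+` pattern, `_the_` in the sample would be counted as its own word, whereas here it contributes to the count for `the`.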