Word frequency: Difference between revisions

Content added Content deleted

Inline

Revision as of 06:58, 15 August 2017

Task

Given a text file and an integer n, print the n most common words in the file (and the number of their occurrences) in decreasing frequency.

For the purposes of this task:

A word is a sequence of one or more contiguous letters
Uppercase letters are considered equivalent to their lowercase counterparts
Words of equal frequency can be listed in any order

Show example output using Les Misérables from Project Gutenberg as the text file input and display the top 10 most used words.

History

This task was originally taken from programming pearls from Communications of the ACM June 1986 Volume 29 Number 6 where this problem is solved by Donald Knuth using literate programming and then critiqued by Doug McIlroy, demonstrating solving the problem in a 6 line Unix shell script.

Clojure

<lang clojure>(defn count-words [file n]

 (->> file
   slurp
   clojure.string/lower-case
   (re-seq #"\w+")
   frequencies
   (sort-by val >)
   (take n)))</lang>

Output:

user=> (count-words "135-0.txt" 10)
(["the" 41036] ["of" 19946] ["and" 14940] ["a" 14589] ["to" 13939]
 ["in" 11204] ["he" 9645] ["was" 8619] ["that" 7922] ["it" 6659])

Python

Python2.7

<lang python>import collections import re import string import sys

def main():

 counter = collections.Counter(re.findall(r"\w+",open(sys.argv[1]).read().lower()))
 print counter.most_common(int(sys.argv[2]))

if __name__ == "__main__":

 main()</lang>

Output:

$ python wordcount.py 135-0.txt 10
[('the', 41036), ('of', 19946), ('and', 14940), ('a', 14589), ('to', 13939),
 ('in', 11204), ('he', 9645), ('was', 8619), ('that', 7922), ('it', 6659)]

Python3.6

<lang python>from collections import Counter from re import findall

les_mis_file = 'les_mis_135-0.txt'

def _count_words(fname):

   with open(fname) as f:
       text = f.read()
   words = findall(r'\w+', text.lower())
   return Counter(words)

def most_common_words_in_file(fname, n):

   counts = _count_words(fname)
   for word, count in 'WORD', 'COUNT' + counts.most_common(n):
       print(f'{word:>10} {count:>6}')

if __name__ == "__main__":

   n = int(input('How many?: '))
   most_common_words_in_file(les_mis_file, n)</lang>

Output:

How many?: 10
      WORD  COUNT
       the  41036
        of  19946
       and  14940
         a  14586
        to  13939
        in  11204
        he   9645
       was   8619
      that   7922
        it   6659

UNIX Shell

Works with: Bash

Works with: zsh

Output:

$ ./wordcount.sh 135-0.txt 10 
41089 the
19949 of
14942 and
14608 a
13951 to
11214 in
9648 he
8621 was
7924 that
6661 it