Text processing/2
Revision as of 04:01, 26 January 2010
You are encouraged to solve this task according to the task description, using any language you may know.
The following data shows a few lines from the file readings.txt (as used in the Data Munging task).
The data comes from a pollution monitoring station with twenty-four instruments monitoring twenty-four aspects of pollution in the air. Periodically a record is added to the file, constituting a line of 49 white-space separated fields, where white-space can be one or more space or tab characters.
The fields (from the left) are:
DATESTAMP [ VALUEn FLAGn ] * 24
i.e. a datestamp followed by twenty-four repetitions of a floating-point instrument value and that instrument's associated integer flag. Flag values are >= 1 if the instrument is working and < 1 if there is some problem with that instrument, in which case that instrument's value should be ignored.
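The layout above is easy to encode directly. As a quick illustration, here is a Python 3 sketch of a single-record parser (the names are illustrative, not taken from any of the solutions below):

```python
def parse_record(line):
    """Split one line into (datestamp, [(value, flag), ...])."""
    fields = line.split()              # any run of spaces/tabs separates fields
    assert len(fields) == 49, "expected DATESTAMP + 24 value/flag pairs"
    pairs = [(float(fields[i]), int(fields[i + 1])) for i in range(1, 49, 2)]
    return fields[0], pairs

date, pairs = parse_record("1991-03-30" + " 10.000 1" * 24)
good = all(flag >= 1 for _, flag in pairs)   # flag < 1 means: ignore that value
```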
A sample from the full data file readings.txt is:
1991-03-30	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1
1991-03-31	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	10.000	1	20.000	1	20.000	1	20.000	1	35.000	1	50.000	1	60.000	1	40.000	1	30.000	1	30.000	1	30.000	1	25.000	1	20.000	1	20.000	1	20.000	1	20.000	1	20.000	1	35.000	1
1991-03-31	40.000	1	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2	0.000	-2
1991-04-01	0.000	-2	13.000	1	16.000	1	21.000	1	24.000	1	22.000	1	20.000	1	18.000	1	29.000	1	44.000	1	50.000	1	43.000	1	38.000	1	27.000	1	27.000	1	24.000	1	23.000	1	18.000	1	12.000	1	13.000	1	14.000	1	15.000	1	13.000	1	10.000	1
1991-04-02	8.000	1	9.000	1	11.000	1	12.000	1	12.000	1	12.000	1	27.000	1	26.000	1	27.000	1	33.000	1	32.000	1	31.000	1	29.000	1	31.000	1	25.000	1	25.000	1	24.000	1	21.000	1	17.000	1	14.000	1	15.000	1	12.000	1	12.000	1	10.000	1
1991-04-03	10.000	1	9.000	1	10.000	1	10.000	1	9.000	1	10.000	1	15.000	1	24.000	1	28.000	1	24.000	1	18.000	1	14.000	1	12.000	1	13.000	1	14.000	1	15.000	1	14.000	1	15.000	1	13.000	1	13.000	1	13.000	1	12.000	1	10.000	1	10.000	1
The task:
- Confirm the general field format of the file.
- Identify any DATESTAMPs that are duplicated.
- Report the number of records that have good readings for all instruments.
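All three checks can be combined into a single pass over the file. The following is a hypothetical Python 3 sketch, independent of the solutions below; the regular expression mirrors the field format given above:

```python
import re

LINE_RE = re.compile(
    r'^\d{4}-\d{2}-\d{2}'                           # DATESTAMP
    r'(?:[ \t]+[-+]?\d+\.\d+[ \t]+-?\d+){24}$')     # 24 VALUE/FLAG pairs

def check(lines):
    seen, dups, bad_format, good = set(), [], 0, 0
    for line in lines:
        if not LINE_RE.match(line.rstrip('\r\n')):
            bad_format += 1                         # format check failed
            continue
        fields = line.split()
        date, flags = fields[0], [int(f) for f in fields[2::2]]
        if date in seen:
            dups.append(date)                       # duplicated DATESTAMP
        seen.add(date)
        if all(f >= 1 for f in flags):
            good += 1                               # all 24 readings good
    return bad_format, dups, good

rows = ["1991-03-30" + " 10.000 1" * 24,
        "1991-03-30" + " 10.000 1" * 23 + " 0.000 -2"]
print(check(rows))   # second row: duplicate date, one bad flag
```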
Ada
<lang ada>with Ada.Calendar;          use Ada.Calendar;
with Ada.Text_IO;           use Ada.Text_IO;
with Strings_Edit;          use Strings_Edit;
with Strings_Edit.Floats;   use Strings_Edit.Floats;
with Strings_Edit.Integers; use Strings_Edit.Integers;
with Generic_Map;
procedure Data_Munging_2 is
   package Time_To_Line is new Generic_Map (Time, Natural);
   use Time_To_Line;
   File    : File_Type;
   Line_No : Natural := 0;
   Count   : Natural := 0;
   Stamps  : Map;
begin
   Open (File, In_File, "readings.txt");
   loop
      declare
         Line    : constant String := Get_Line (File);
         Pointer : Integer := Line'First;
         Flag    : Integer;
         Year, Month, Day : Integer;
         Data    : Float;
         Stamp   : Time;
         Valid   : Boolean := True;
      begin
         Line_No := Line_No + 1;
         Get (Line, Pointer, SpaceAndTab);
         Get (Line, Pointer, Year);
         Get (Line, Pointer, Month);
         Get (Line, Pointer, Day);
         Stamp := Time_Of (Year_Number (Year), Month_Number (-Month), Day_Number (-Day));
         begin
            Add (Stamps, Stamp, Line_No);
         exception
            when Constraint_Error =>
               Put (Image (Year) & Image (Month) & Image (Day) & ": record at " & Image (Line_No));
               Put_Line (" duplicates record at " & Image (Get (Stamps, Stamp)));
         end;
         Get (Line, Pointer, SpaceAndTab);
         for Reading in 1..24 loop
            Get (Line, Pointer, Data);
            Get (Line, Pointer, SpaceAndTab);
            Get (Line, Pointer, Flag);
            Get (Line, Pointer, SpaceAndTab);
            Valid := Valid and then Flag >= 1;
         end loop;
         if Pointer <= Line'Last then
            Put_Line ("Unrecognized tail at " & Image (Line_No) & ':' & Image (Pointer));
         elsif Valid then
            Count := Count + 1;
         end if;
      exception
         when End_Error | Data_Error | Constraint_Error | Time_Error =>
            Put_Line ("Syntax error at " & Image (Line_No) & ':' & Image (Pointer));
      end;
   end loop;
exception
   when End_Error =>
      Close (File);
      Put_Line ("Valid records " & Image (Count) & " of " & Image (Line_No) & " total");
end Data_Munging_2;</lang>
Sample output:
1990-3-25: record at 85 duplicates record at 84
1991-3-31: record at 456 duplicates record at 455
1992-3-29: record at 820 duplicates record at 819
1993-3-28: record at 1184 duplicates record at 1183
1995-3-26: record at 1911 duplicates record at 1910
Valid records 5017 of 5471 total
AWK
A series of AWK one-liners are shown, as this is often how the job is done in practice. If this information were needed repeatedly (and this is not known), a more permanent shell script might be created that combined multi-line versions of the scripts below.

Gradually tie down the format. (In each case offending lines will be printed.)

If there are any scientific-notation fields then there will be an e in the file:
<lang awk>bash$ awk '/[eE]/' readings.txt
bash$</lang>
A quick check on the number of fields:
<lang awk>bash$ awk 'NF != 49' readings.txt
bash$</lang>
A full check on the file format using a regular expression:
<lang awk>bash$ awk '!(/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+)+$/ && NF==49)' readings.txt
bash$</lang>
A full check on the file format as above, but using regular expressions allowing intervals (gnu awk):
<lang awk>bash$ awk --re-interval '!(/^[0-9]{4}-[0-9]{2}-[0-9]{2}([ \t]+[-]?[0-9]+\.[0-9]+[\t ]+[-]?[0-9]+){24}$/)' readings.txt
bash$</lang>
Identify any DATESTAMPs that are duplicated.
Accomplished by counting how many times the first field occurs and noting any second occurrences.
<lang awk>bash$ awk '++count[$1]==2{print $1}' readings.txt
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
bash$</lang>
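The `++count[$1]==2` idiom (report a key exactly when its second occurrence is seen, so each duplicate is listed once) translates directly into other languages; a hypothetical Python 3 sketch for comparison:

```python
from collections import defaultdict

def second_occurrences(lines):
    """Yield each first field the moment its count reaches exactly 2,
    mirroring awk's ++count[$1]==2: each duplicate is reported once."""
    count = defaultdict(int)
    for line in lines:
        key = line.split()[0]
        count[key] += 1
        if count[key] == 2:
            yield key

sample = ["1991-03-31 1.0 1", "1991-03-31 2.0 1", "1991-03-31 3.0 1"]
print(list(second_occurrences(sample)))   # reported once despite three occurrences
```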
What number of records have good readings for all instruments.
<lang awk>bash$ awk '{rec++;ok=1; for(i=0;i<24;i++){if($(2*i+3)<1){ok=0}}; recordok += ok} END {print "Total records",rec,"OK records", recordok, "or", recordok/rec*100,"%"}' readings.txt
Total records 5471 OK records 5017 or 91.7017 %
bash$</lang>
C++
Library: Boost
<lang cpp>#include <boost/regex.hpp>
#include <fstream>
#include <iostream>
#include <vector>
#include <string>
#include <set>
#include <cstdlib>
#include <algorithm>
using namespace std ;
boost::regex e ( "\\s+" ) ;
int main( int argc , char *argv[ ] ) {
   ifstream infile( argv[ 1 ] ) ;
   vector<string> duplicates ;
   set<string> datestamps ; //for the datestamps
   if ( ! infile.is_open( ) ) {
      cerr << "Can't open file " << argv[ 1 ] << '\n' ;
      return 1 ;
   }
   int all_ok = 0 ;     //all_ok for lines in the given pattern e
   int pattern_ok = 0 ; //overall field pattern of record is ok
   while ( infile ) {
      string eingabe ;
      getline( infile , eingabe ) ;
      boost::sregex_token_iterator i ( eingabe.begin( ), eingabe.end( ) , e , -1 ), j ; //we tokenize on empty fields
      vector<string> fields( i, j ) ;
      if ( fields.size( ) == 49 ) //we expect 49 fields in a record
         pattern_ok++ ;
      else
         cout << "Format not ok!\n" ;
      if ( datestamps.insert( fields[ 0 ] ).second ) { //not duplicated
         int howoften = ( fields.size( ) - 1 ) / 2 ; //number of measurement devices and values
         for ( int n = 1 ; atoi( fields[ 2 * n ].c_str( ) ) >= 1 ; n++ ) {
            if ( n == howoften ) {
               all_ok++ ;
               break ;
            }
         }
      } else {
         duplicates.push_back( fields[ 0 ] ) ; //first field holds datestamp
      }
   }
   infile.close( ) ;
   cout << "The following " << duplicates.size() << " datestamps were duplicated:\n" ;
   copy( duplicates.begin( ) , duplicates.end( ) , ostream_iterator<string>( cout , "\n" ) ) ;
   cout << all_ok << " records were complete and ok!\n" ;
   return 0 ;
}</lang> The program produces the following output:
Format not ok!
The following 6 datestamps were duplicated:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
2004-12-31
F#
<lang fsharp>open System.Collections.Generic

let file = @"readings.txt"
let dates = HashSet(HashIdentity.Structural)
let mutable ok = 0
do
  for line in System.IO.File.ReadAllLines file do
    match String.split [' '; '\t'] line with
    | [] -> ()
    | date::xys ->
        if dates.Contains date then
          printf "Date %s is duplicated\n" date
        else
          dates.Add date |> ignore
        let f (b, t) h = not b, (if b then int h::t else t)
        let _, states = Seq.fold f (false, []) xys
        if Seq.forall (fun s -> s >= 1) states then
          ok <- ok + 1
  printf "%d records were ok\n" ok
</lang>
Prints:
<lang fsharp>
Date 1990-03-25 is duplicated
Date 1991-03-31 is duplicated
Date 1992-03-29 is duplicated
Date 1993-03-28 is duplicated
Date 1995-03-26 is duplicated
5017 records were ok
</lang>
Haskell
<lang haskell> import Data.List (nub, (\\))
data Record = Record {date :: String, recs :: [(Double, Int)]}
duplicatedDates rs = rs \\ nub rs
goodRecords = filter ((== 24) . length . filter ((>= 1) . snd) . recs)
parseLine l = let ws = words l in Record (head ws) (mapRecords (tail ws))
mapRecords [] = []
mapRecords [_] = error "invalid data"
mapRecords (value:flag:tail) = (read value, read flag) : mapRecords tail
main = do
  inputs <- (map parseLine . lines) `fmap` readFile "readings.txt"
  putStr (unlines ("duplicated dates:" : duplicatedDates (map date inputs)))
  putStrLn ("number of good records: " ++ show (length $ goodRecords inputs))
</lang>
this script outputs:
duplicated dates:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
number of good records: 5017
J
<lang j> require 'tables/dsv dates'
dat=: TAB readdsv jpath '~temp/readings.txt'
Dates=: getdate"1 >{."1 dat
Vals=: _99 ". >(1 + +: i.24){"1 dat
Flags=: _99 ". >(2 + +: i.24){"1 dat
# Dates NB. Total # lines
5471
+/ *./"1 ] 0 = Dates NB. # lines with invalid date formats
0
+/ _99 e."1 Vals,.Flags NB. # lines with invalid value or flag formats
0
+/ *./"1 [0 < Flags NB. # lines with only valid flags
5017
~. (#~ (i.~ ~: i:~)) Dates NB. Duplicate dates
1990 3 25 1991 3 31 1992 3 29 1993 3 28 1995 3 26</lang>
Java
<lang java5>import java.util.*;
import java.util.regex.*;
import java.io.*;
public class DataMunging2 {
public static final Pattern e = Pattern.compile("\\s+");
    public static void main(String[] args) {
        try {
            BufferedReader infile = new BufferedReader(new FileReader(args[0]));
            List<String> duplicates = new ArrayList<String>();
            Set<String> datestamps = new HashSet<String>(); //for the datestamps
            String eingabe;
            int all_ok = 0; //all_ok for lines in the given pattern e
            while ((eingabe = infile.readLine()) != null) {
                String[] fields = e.split(eingabe); //we tokenize on empty fields
                if (fields.length != 49) //we expect 49 fields in a record
                    System.out.println("Format not ok!");
                if (datestamps.add(fields[0])) { //not duplicated
                    int howoften = (fields.length - 1) / 2; //number of measurement devices and values
                    for (int n = 1; Integer.parseInt(fields[2*n]) >= 1; n++) {
                        if (n == howoften) {
                            all_ok++;
                            break;
                        }
                    }
                } else {
                    duplicates.add(fields[0]); //first field holds datestamp
                }
            }
            infile.close();
            System.out.println("The following " + duplicates.size() + " datestamps were duplicated:");
            for (String x : duplicates)
                System.out.println(x);
            System.out.println(all_ok + " records were complete and ok!");
        } catch (IOException e) {
            System.err.println("Can't open file " + args[0]);
            System.exit(1);
        }
    }
}</lang> The program produces the following output:
The following 5 datestamps were duplicated:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
5013 records were complete and ok!
JavaScript
<lang javascript>// wrap up the counter variables in a closure.
function analyze_func(filename) {
    var dates_seen = {};
    var format_bad = 0;
    var records_all = 0;
    var records_good = 0;
    return function() {
        var fh = new ActiveXObject("Scripting.FileSystemObject").openTextFile(filename, 1); // 1 = for reading
        while ( ! fh.atEndOfStream) {
            records_all ++;
            var allOK = true;
            var line = fh.ReadLine();
            var fields = line.split('\t');
            if (fields.length != 49) {
                format_bad ++;
                continue;
            }
            var date = fields.shift();
            if (has_property(dates_seen, date))
                WScript.echo("duplicate date: " + date);
            else
                dates_seen[date] = 1;
            while (fields.length > 0) {
                var value = parseFloat(fields.shift());
                var flag = parseInt(fields.shift(), 10);
                if (isNaN(value) || isNaN(flag)) {
                    format_bad ++;
                } else if (flag <= 0) {
                    allOK = false;
                }
            }
            if (allOK) records_good ++;
        }
        fh.close();
        WScript.echo("total records: " + records_all);
        WScript.echo("Wrong format: " + format_bad);
        WScript.echo("records with no bad readings: " + records_good);
    }
}
function has_property(obj, propname) {
return typeof(obj[propname]) == "undefined" ? false : true;
}
var analyze = analyze_func('readings.txt');
analyze();</lang>
OCaml
<lang ocaml>#load "str.cma"
open Str

let strip_cr str =
  let last = pred (String.length str) in
  if str.[last] <> '\r' then (str) else (String.sub str 0 last)

let map_records =
  let rec aux acc = function
    | value::flag::tail ->
        let e = (float_of_string value, int_of_string flag) in
        aux (e::acc) tail
    | _::[] -> invalid_arg "invalid data"
    | [] -> (List.rev acc)
  in
  aux []
;;

let duplicated_dates =
  let same_date (d1,_) (d2,_) = (d1 = d2) in
  let date (d,_) = d in
  let rec aux acc = function
    | a::b::tl when same_date a b -> aux (date a::acc) tl
    | _::tl -> aux acc tl
    | [] -> (List.rev acc)
  in
  aux []
;;

let record_ok (_,record) =
  let is_ok (_,v) = (v >= 1) in
  let sum_ok = List.fold_left (fun sum this -> if is_ok this then succ sum else sum) 0 record in
  (sum_ok = 24)

let num_good_records =
  List.fold_left (fun sum record -> if record_ok record then succ sum else sum) 0
;;

let parse_line line =
  let li = split (regexp "[ \t]+") line in
  let records = map_records (List.tl li)
  and date = (List.hd li) in
  (date, records)

let () =
  let ic = open_in "readings.txt" in
  let rec read_loop acc =
    try
      let line = strip_cr (input_line ic) in
      read_loop ((parse_line line) :: acc)
    with End_of_file ->
      close_in ic;
      (List.rev acc)
  in
  let inputs = read_loop [] in
  Printf.printf "%d total lines\n" (List.length inputs);
  Printf.printf "duplicated dates:\n";
  let dups = duplicated_dates inputs in
  List.iter print_endline dups;
  Printf.printf "number of good records: %d\n" (num_good_records inputs);
;;</lang>
this script outputs:
5471 total lines
duplicated dates:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
number of good records: 5017
Perl
<lang perl>use List::MoreUtils 'natatime';
use constant FIELDS => 49;
binmode STDIN, ':crlf';
# Read the newlines properly even if we're not running on
# Windows.
my ($line, $good_records, %dates) = (0, 0);
while (<>)
   {++$line;
    my @fs = split /\s+/;
    @fs == FIELDS or die "$line: Bad number of fields.\n";
    for (shift @fs)
       {/\d{4}-\d{2}-\d{2}/ or die "$line: Bad date format.\n";
        ++$dates{$_};}
    my $iterator = natatime 2, @fs;
    my $all_flags_okay = 1;
    while ( my ($val, $flag) = $iterator->() )
       {$val =~ /\d+\.\d+/ or die "$line: Bad value format.\n";
        $flag =~ /\A-?\d+/ or die "$line: Bad flag format.\n";
        $flag < 1 and $all_flags_okay = 0;}
    $all_flags_okay and ++$good_records;}
print "Good records: $good_records\n",
"Repeated timestamps:\n", map {" $_\n"} grep {$dates{$_} > 1} sort keys %dates;</lang>
Output:
Good records: 5017
Repeated timestamps:
 1990-03-25
 1991-03-31
 1992-03-29
 1993-03-28
 1995-03-26
PowerShell
<lang powershell>$dateHash = @{}
$goodLineCount = 0
get-content c:\temp\readings.txt |
    ForEach-Object {
        $line = $_.split(" |`t",2)
        if ($dateHash.containskey($line[0])) {
            $line[0] + " is duplicated"
        } else {
            $dateHash.add($line[0], $line[1])
        }
        $readings = $line[1].split()
        $goodLine = $true
        if ($readings.count -ne 48) {
            $goodLine = $false
            "incorrect line length : $line[0]"
        }
        for ($i=0; $i -lt $readings.count; $i++) {
            if ($i % 2 -ne 0) {
                if ([int]$readings[$i] -lt 1) {
                    $goodLine = $false
                }
            }
        }
        if ($goodLine) { $goodLineCount++ }
    }
[string]$goodLineCount + " good lines"</lang>
Output:
1990-03-25 is duplicated
1991-03-31 is duplicated
1992-03-29 is duplicated
1993-03-28 is duplicated
1995-03-26 is duplicated
5017
An alternative using regular expression syntax:
<lang powershell>$dateHash = @{}
$goodLineCount = 0
ForEach ($rawLine in ( get-content c:\temp\readings.txt) ){
    $line = $rawLine.split(" |`t",2)
    if ($dateHash.containskey($line[0])) {
        $line[0] + " is duplicated"
    } else {
        $dateHash.add($line[0], $line[1])
    }
    $readings = [regex]::matches($line[1],"\d+\.\d+\s-?\d")
    if ($readings.count -ne 24) {
        "incorrect number of readings for date " + $line[0]
    }
    $goodLine = $true
    foreach ($flagMatch in [regex]::matches($line[1],"\d\.\d*\s(?<flag>-?\d)")) {
        if ([int][string]$flagMatch.groups["flag"].value -lt 1) {
            $goodLine = $false
        }
    }
    if ($goodLine) { $goodLineCount++ }
}
[string]$goodLineCount + " good lines"</lang>
Output:
1990-03-25 is duplicated
1991-03-31 is duplicated
1992-03-29 is duplicated
1993-03-28 is duplicated
1995-03-26 is duplicated
5017 good lines
Python
<lang python>import re
import zipfile
import StringIO
def munge2(readings):
    datePat = re.compile(r'\d{4}-\d{2}-\d{2}')
    valuPat = re.compile(r'[-+]?\d+\.\d+')
    statPat = re.compile(r'-?\d+')
    allOk, totalLines = 0, 0
    datestamps = set([])
    for line in readings:
        totalLines += 1
        fields = line.split('\t')
        date = fields[0]
        pairs = [(fields[i],fields[i+1]) for i in range(1,len(fields),2)]

        lineFormatOk = datePat.match(date) and \
            all( valuPat.match(p[0]) for p in pairs ) and \
            all( statPat.match(p[1]) for p in pairs )
        if not lineFormatOk:
            print 'Bad formatting', line
            continue

        if len(pairs) != 24 or any( int(p[1]) < 1 for p in pairs ):
            print 'Missing values', line
            continue

        if date in datestamps:
            print 'Duplicate datestamp', line
            continue
        datestamps.add(date)
        allOk += 1

    print 'Lines with all readings: ', allOk
    print 'Total records: ', totalLines

#zfs = zipfile.ZipFile('readings.zip','r')
#readings = StringIO.StringIO(zfs.read('readings.txt'))
readings = open('readings.txt','r')
munge2(readings)</lang>
The results indicate 5013 good records, which differs from the AWK implementation. The final few lines of the output are as follows:
Missing values 2004-12-29	2.900	1	2.700	1	2.800	1	3.300	1	2.900	1	2.300	1	0.000	0	1.700	1	1.900	1	2.300	1	2.600	1	2.900	1	2.600	1	2.600	1	2.600	1	2.700	1	2.300	1	2.200	1	2.100	1	2.000	1	2.100	1	2.100	1	2.300	1	2.400	1
Missing values 2004-12-30	2.400	1	2.600	1	2.600	1	2.600	1	3.000	1	0.000	0	3.300	1	2.600	1	2.900	1	2.400	1	2.300	1	2.900	1	3.500	1	3.700	1	3.600	1	4.000	1	3.400	1	2.400	1	2.500	1	2.600	1	2.600	1	2.800	1	2.400	1	2.200	1
Missing values 2004-12-31	2.400	1	2.500	1	2.500	1	2.400	1	0.000	0	2.400	1	2.400	1	2.400	1	2.200	1	2.400	1	2.500	1	2.000	1	1.700	1	1.400	1	1.500	1	1.900	1	1.700	1	2.000	1	2.000	1	2.200	1	1.700	1	1.500	1	1.800	1	1.800	1
Lines with all readings:  5013
Total records:  5471
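The 5013-versus-5017 difference is a counting-policy difference, not a parsing bug: the continue statements above exclude duplicate-date lines from the allOk count, whereas the AWK one-liner counts any line whose 24 flags are all good. A minimal Python 3 sketch with hypothetical data illustrates the two policies:

```python
lines = [
    "1991-03-31\t10.000\t1",   # good flags
    "1991-03-31\t40.000\t1",   # duplicate date, but good flags
    "1991-04-01\t0.000\t-2",   # bad flag
]

def flags_ok(line):
    fields = line.split('\t')
    return all(int(fields[i]) >= 1 for i in range(2, len(fields), 2))

# AWK-style policy: count every line whose flags are all good
awk_style = sum(flags_ok(l) for l in lines)

# First-version policy: a duplicate date is skipped before counting
seen, strict = set(), 0
for l in lines:
    date = l.split('\t')[0]
    if flags_ok(l) and date not in seen:
        strict += 1
    seen.add(date)

print(awk_style, strict)   # the duplicate line is counted only by the first policy
```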
Second Version
Modification of the version above to:
- Remove continue statements so it counts as the AWK example does.
- Generate mostly summary information that is easier to compare to other solutions.
<lang python>import re
import zipfile
import StringIO
def munge2(readings, debug=False):
    datePat = re.compile(r'\d{4}-\d{2}-\d{2}')
    valuPat = re.compile(r'[-+]?\d+\.\d+')
    statPat = re.compile(r'-?\d+')
    totalLines = 0
    dupdate, badform, badlen, badreading = set(), set(), set(), 0
    datestamps = set([])
    for line in readings:
        totalLines += 1
        fields = line.split('\t')
        date = fields[0]
        pairs = [(fields[i],fields[i+1]) for i in range(1,len(fields),2)]
        lineFormatOk = datePat.match(date) and \
            all( valuPat.match(p[0]) for p in pairs ) and \
            all( statPat.match(p[1]) for p in pairs )
        if not lineFormatOk:
            if debug: print 'Bad formatting', line
            badform.add(date)
        if len(pairs) != 24 or any( int(p[1]) < 1 for p in pairs ):
            if debug: print 'Missing values', line
            if len(pairs) != 24:
                badlen.add(date)
            if any( int(p[1]) < 1 for p in pairs ):
                badreading += 1
        if date in datestamps:
            if debug: print 'Duplicate datestamp', line
            dupdate.add(date)
        datestamps.add(date)

    print 'Duplicate dates:\n ', '\n '.join(sorted(dupdate))
    print 'Bad format:\n ', '\n '.join(sorted(badform))
    print 'Bad number of fields:\n ', '\n '.join(sorted(badlen))
    print 'Records with good readings: %i = %5.2f%%\n' % (
        totalLines-badreading, (totalLines-badreading)/float(totalLines)*100 )
    print 'Total records: ', totalLines
readings = open('readings.txt','r') munge2(readings)</lang>
bash$ /cygdrive/c/Python26/python munge2.py
Duplicate dates:
  1990-03-25
  1991-03-31
  1992-03-29
  1993-03-28
  1995-03-26
Bad format:
Bad number of fields:
Records with good readings: 5017 = 91.70%
Total records:  5471
bash$
R
<lang R># Read in data from file
dfr <- read.delim("d:/readings.txt", colClasses=c("character", rep(c("numeric", "integer"), 24)))
dates <- strptime(dfr[,1], "%Y-%m-%d")

# Any bad values?
dfr[which(is.na(dfr))]

# Any duplicated dates?
dates[duplicated(dates)]

# Number of rows with no bad values
flags <- as.matrix(dfr[,seq(3,49,2)]) > 0
sum(apply(flags, 1, all))</lang>
Ruby
<lang ruby>require 'set'
def munge2(readings, debug=false)
  datePat = /^\d{4}-\d{2}-\d{2}/
  valuPat = /^[-+]?\d+\.\d+/
  statPat = /^-?\d+/
  totalLines = 0
  dupdate, badform, badlen, badreading = Set[], Set[], Set[], 0
  datestamps = Set[[]]
  for line in readings
    totalLines += 1
    fields = line.split(/\t/)
    date = fields.shift
    pairs = fields.enum_slice(2).to_a
    lineFormatOk = date =~ datePat &&
                   pairs.all? { |x,y| x =~ valuPat && y =~ statPat }
    if !lineFormatOk
      puts 'Bad formatting ' + line if debug
      badform << date
    end
    if pairs.length != 24 || pairs.any? { |x,y| y.to_i < 1 }
      puts 'Missing values ' + line if debug
    end
    if pairs.length != 24
      badlen << date
    end
    if pairs.any? { |x,y| y.to_i < 1 }
      badreading += 1
    end
    if datestamps.include?(date)
      puts 'Duplicate datestamp ' + line if debug
      dupdate << date
    end
    datestamps << date
  end

  puts 'Duplicate dates:', dupdate.sort.map { |x| ' ' + x }
  puts 'Bad format:', badform.sort.map { |x| ' ' + x }
  puts 'Bad number of fields:', badlen.sort.map { |x| ' ' + x }
  puts 'Records with good readings: %i = %5.2f%%' % [
    totalLines-badreading, (totalLines-badreading)/totalLines.to_f*100 ]
  puts
  puts 'Total records: %d' % totalLines
end
open('readings.txt','r') do |readings|
munge2(readings)
end</lang>
Scala
<lang scala>object DataMunging2 {
  import scala.io.Source
  import scala.collection.immutable.{TreeMap => Map}
val pattern = """^(\d+-\d+-\d+)""" + """\s+(\d+\.\d+)\s+(-?\d+)""" * 24 + "$" r;
  def main(args: Array[String]) {
    val files = args map (new java.io.File(_)) filter (file => file.isFile && file.canRead)
    val (numFormatErrors, numValidRecords, dateMap) =
      files.iterator.flatMap(file => Source fromFile file getLines ()).
        foldLeft((0, 0, new Map[String, Int] withDefaultValue 0)) {
          case ((nFE, nVR, dM), line) =>
            pattern findFirstMatchIn line map (_.subgroups) match {
              case Some(List(date, rawData @ _*)) =>
                val allValid = (rawData map (_ toDouble) iterator) grouped 2 forall (_.last > 0)
                (nFE, nVR + (if (allValid) 1 else 0), dM(date) += 1)
              case None => (nFE + 1, nVR, dM)
            }
        }

    dateMap foreach {
      case (date, repetitions) if repetitions > 1 => println(date+": "+repetitions+" repetitions")
      case _ =>
    }

    println("""|
               |Valid records: %d
               |Duplicated dates: %d
               |Duplicated records: %d
               |Data format errors: %d
               |Invalid data records: %d
               |Total records: %d""".stripMargin format (
      numValidRecords,
      dateMap filter { case (_, repetitions) => repetitions > 1 } size,
      dateMap.valuesIterable filter (_ > 1) map (_ - 1) sum,
      numFormatErrors,
      dateMap.valuesIterable.sum - numValidRecords,
      dateMap.valuesIterable.sum))
  }
}</lang>
Sample output:
1990-03-25: 2 repetitions
1991-03-31: 2 repetitions
1992-03-29: 2 repetitions
1993-03-28: 2 repetitions
1995-03-26: 2 repetitions

Valid records: 5017
Duplicated dates: 5
Duplicated records: 5
Data format errors: 0
Invalid data records: 454
Total records: 5471
Tcl
<lang tcl>set data [lrange [split [read [open "readings.txt" "r"]] "\n"] 0 end-1]
set total [llength $data]
set correct $total
set datestamps {}
foreach line $data {
    set formatOk true
    set hasAllMeasurements true

    set date [lindex $line 0]
    if {[llength $line] != 49} { set formatOk false }
    if {![regexp {\d{4}-\d{2}-\d{2}} $date]} { set formatOk false }
    if {[lsearch $datestamps $date] != -1} { puts "Duplicate datestamp: $date" } {lappend datestamps $date}

    foreach {value flag} [lrange $line 1 end] {
        if {$flag < 1} { set hasAllMeasurements false }
        if {![regexp -- {[-+]?\d+\.\d+} $value] || ![regexp -- {-?\d+} $flag]} { set formatOk false }
    }
    if {!$hasAllMeasurements} { incr correct -1 }
    if {!$formatOk} { puts "line \"$line\" has wrong format" }
}
puts "$correct records with good readings = [expr $correct * 100.0 / $total]%"
puts "Total records: $total"</lang>
$ tclsh munge2.tcl
Duplicate datestamp: 1990-03-25
Duplicate datestamp: 1991-03-31
Duplicate datestamp: 1992-03-29
Duplicate datestamp: 1993-03-28
Duplicate datestamp: 1995-03-26
5017 records with good readings = 91.7016998721%
Total records: 5471
Second version
To demonstrate a different method of iterating over the file, and different ways to verify data types:
<lang tcl>set total [set good 0]
array set seen {}
set fh [open readings.txt]
while {[gets $fh line] != -1} {
    incr total
    set fields [regexp -inline -all {[^ \t\r\n]+} $line]
    if {[llength $fields] != 49} {
        puts "bad format: not 49 fields on line $total"
        continue
    }
    if { ! [regexp {^(\d{4}-\d\d-\d\d)$} [lindex $fields 0] -> date]} {
        puts "bad format: invalid date on line $total: '$date'"
        continue
    }
    if {[info exists seen($date)]} {
        puts "duplicate date on line $total: $date"
    }
    incr seen($date)
    set line_format_ok true
    set readings_ignored 0
    foreach {value flag} [lrange $fields 1 end] {
        if { ! [string is double -strict $value]} {
            puts "bad format: value not a float on line $total: '$value'"
            set line_format_ok false
        }
        if { ! [string is int -strict $flag]} {
            puts "bad format: flag not an integer on line $total: '$flag'"
            set line_format_ok false
        }
        if {$flag < 1} {incr readings_ignored}
    }
    if {$line_format_ok && $readings_ignored == 0} {incr good}
}
close $fh
puts "total: $total"
puts [format "good: %d = %5.2f%%" $good [expr {100.0 * $good / $total}]]</lang>
Results:
duplicate date on line 85: 1990-03-25
duplicate date on line 456: 1991-03-31
duplicate date on line 820: 1992-03-29
duplicate date on line 1184: 1993-03-28
duplicate date on line 1911: 1995-03-26
total: 5471
good: 5017 = 91.70%
Ursala
Compiled and run in a single step, with the input file accessed as a list of strings pre-declared in readings_dot_txt.
<lang Ursala>#import std
#import nat

readings = (*F ~&c/;digits+ rlc ==+ ~~ -={` ,9%cOi&,13%cOi&}) readings_dot_txt
valid_format = all -&length==49,@tK27 all ~&w/`.&& ~&jZ\digits--'-.',@tK28 all ~&jZ\digits--'-'&-
duplicate_dates = :/'duplicated dates:'+ ~&hK2tFhhPS|| -[(none)]-!
good_readings = --' good readings'@h+ %nP+ length+ *~ @tK28 all ~='0'&& ~&wZ/`-
#show+
main = valid_format?(^C/good_readings duplicate_dates,-[invalid format]-!) readings</lang> output:
5017 good readings
duplicated dates:
1995-03-26
1993-03-28
1992-03-29
1991-03-31
1990-03-25
Vedit macro language
This implementation does the following checks:
- Checks for duplicate date fields. Note: duplicates can still be counted as valid records, as in other implementations.
- Checks date format.
- Checks that value fields have 1 or more digits followed by decimal point followed by 3 digits
- Reads flag value and checks if it is positive
- Requires 24 value/flag pairs on each line
<lang vedit>#50 = Buf_Num           // Current edit buffer (source data)
File_Open("|(PATH_ONLY)\output.txt")
#51 = Buf_Num           // Edit buffer for output file
Buf_Switch(#50)
#11 = #12 = #13 = #14 = #15 = 0
Reg_Set(15, "xxx")
While(!At_EOF) {
    #10 = 0
    #12++

    // Check for repeated date field
    if (Match(@15) == 0) {
        #20 = Cur_Line
        Buf_Switch(#51)         // Output file
        Reg_ins(15)
        IT(": duplicate record at ") Num_Ins(#20)
        Buf_Switch(#50)         // Input file
        #13++
    }

    // Check format of date field
    if (Match("|d|d|d|d-|d|d-|d|d|w", ADVANCE) != 0) {
        #10 = 1
        #14++
    }
    Reg_Copy_Block(15, BOL_pos, Cur_Pos-1)

    // Check data fields and flags:
    Repeat(24) {
        if ( Match("|d|*.|d|d|d|w", ADVANCE) != 0 || Num_Eval(ADVANCE) < 1) {
            #10 = 1
            #15++
            Break
        }
        Match("|W", ADVANCE)
    }
    if (#10 == 0) { #11++ }     // record was OK
    Line(1, ERRBREAK)
}
Buf_Switch(#51)                 // buffer for output data
IN
IT("Valid records:       ") Num_Ins(#11)
IT("Duplicates:          ") Num_Ins(#13)
IT("Date format errors:  ") Num_Ins(#14)
IT("Invalid data records:") Num_Ins(#15)
IT("Total records:       ") Num_Ins(#12)</lang>
Sample output:
<lang vedit>1990-03-25: duplicate record at 85
1991-03-31: duplicate record at 456
1992-03-29: duplicate record at 820
1993-03-28: duplicate record at 1184
1995-03-26: duplicate record at 1911

Valid records:        5017
Duplicates:           5
Date format errors:   0
Invalid data records: 454
Total records:        5471</lang>