count.pl

mbox counting at its... well, it works



Description | Features | Bugs | Requirements | Download | Output | Usage/Docs

Description

Wirten by:
MadHat at Unspecific.com [[|%^)
count.pl is to keep score on some of the mailing lists I have been on for a while. What it does is count the emails, domains or suffixes to tell how many emails, lines and new lines have been posted to the list. It reads a standard mbox format files.
I have tested it with mutt, pine, evolution, and Eudora.


Top | Description | Features | Bugs | Requirements | Download | Output | Usage/Docs

Features

  • count on email, domain or suffix
  • sort on number of messages (-E), number of lines (-L), number of new lines (-N) or the amount of grabage (-G) sent (quoted lines)
  • counts quoted lines if the line starts with > | : <TAB> blah blah blah
  • automatically strips duplicate messages (based on message ID)
  • can be limited to show only the top X entries (defalt is 50, see -m option)
  • can be limited to specific date range (see -f and -t)
  • can count on gzipped mbox files
  • can count on web based archives
  • can rip a list of URLs from mail (see -u)
  • can show % of email by each person that contain HTML, RTF, or attachments (see -l)
  • can show Total number of replies referencing each address (see -T)

Top | Description | Features | Bugs | Requirements | Download | Output | Usage/Docs

BUGS

  • some MS clients don't seem to properly attribute quoted lines, so it has trouble there, but there is nothing I can do about that.

Top | Description | Features | Bugs | Requirements | Download | Output | Usage/Docs

Requirements

perl (>5.6 prefered)
Date::Manip perl module
Mail::MboxParser perl moodule
Bundle::LWP if you are going to use the http get of mailboxes
gzip or gunzip if you are going to use compressed mailboxes. (if you don't know where to find them.... no comment)

Top | Description | Features | Bugs | Requirements | Download | Output | Usage/Docs

Download

count.pl - v2.4.0

Top | Description | Features | Bugs | Requirements | Download | Output | Usage/Docs

Output

An example of the output...
$ ./count.pl -e -f01/01/2002 -t01/31/2002 -m 5 subfolders/Folder/mbox
Start Date: Jan  1, 2002
End Date:   Jan 31, 2002
Total emails matched: 108
Total Unique Entries: 21
                                       Total    Total      Total   Sig/
    Address                           EMails    Lines       New    Noise
                                      Posted    Posted     Lines   Ratio
  1 someone@adomain.com                   20       842       267    32
  2 user@domain.org                       12       786       180    23
  3 another@domainfoo.com                  9       412        59    14
  4 user@some.collg.edu                    8       293        42    14
  5 someone@another.com                    8       149        62    42

$ ./count.pl -e -l -f01/01/2002 -t01/31/2002 -m 5 subfolders/Folder/mbox
Start Date: Jan  1, 2002
End Date:   Jan 31, 2002
Total emails matched: 108
Total Unique Entries: 21
                                     Total  Total   Total  Sig/
    Address                         EMails  Lines    New   Noise  Loser
                                    Posted  Posted  Lines  Ratio  Rating
  1 talks@not-msn.net                   10     288    142    49     25
  2 madhat_unspecificly@yahoo.com        9     354    215    61      0
  3 bob@someother.net                    5     124     41    33     20
  4 someone@bombed.net                   4     181     39    22      0
  5 luser@some.collg.edu                 3      54     39    72    100


$ ./count.pl -e -m 5 subfolders/Folder/mbox
Start Date: Dec 31, 1969
End Date:   Nov 10, 2002
Total Unique Entries: 12
                                     Total  Total   Total  Sig/
    Address                         EMails  Lines    New   Noise  Troll
                                    Posted  Posted  Lines  Ratio  Rating
  1 bobuser@domain.net                  10     288    142    49     15
  2 nonuser5@yahoo.com                   9     354    215    61      4
  3 latida@unsc.net                      5     124     41    33      1
  4 ugabooga@msn.com                     4     181     39    22      4
  5 loser@attbi.com                      3      54     39    72      0


Top | Description | Features | Bugs | Requirements | Download | Output | Usage/Docs

Usage (output from ./count.pl -h)

count - 2.2.1 - The email counter by: MadHat<madhat@unspecific.com>

./count <-e|-d|-s> [-ENLGTl] [-m#] [-f <from_date> -t <to_date>]\
     <mailbox | http://domain.com/archive/mailbox>
        -h Full Help
        -e email address
        -d count domains
        -s count suffix (.com, .org, etc...)
        -l Add loser rating
        -T Add troll rating
        -T Add list of URLs found in the date range
        -v Verbose output (DEBUG Output)
        -E sort on emails (DEFAULT)
        -L sort on total lines
        -N sort on numer of NEW lines (not part of reply)
        -G sort on Garbage (quoted lines)
        -m# max number of entries to show
        -fmm/dd/yyyy From date.  Start checking on this date [01/01/1970]
        -tmm/dd/yyyy To date. Stop checking after this date [today]
        <mailbox> is the mailbox to count in...


'count' will open the disgnated mailbox and sort through the emails counting
on the specified results.

-e, -d or -s are required as well as a mailbox.  All other flags are optional.

 -e will count on the whole email address
 -d will count only on the domain portion of the email (everything after the @) 
 -s will count on the suffix (evertthing past the last . - .com, .org...)

Present reporting fields include the designated count field (see above),
Total EMails Posted, Total Lines Posted, Total New Lines and Sig/Noise Ratio.

- Total EMails Pasted is just that, the total number of emails posted by
  that counted field.

- Total Lines Posted is the total number of messages lines, not including
  any header fields, posted by that counted field.

- Total New Lines is the total number of lines. not including any header
  fields, that are not part of a reply ot forward.  The way a line is
  determined to be new line, is that it is not started by one of the common
  characters for replied lines, > | # : or <TAB>.

  WARNING:  This is not accurate on some email client (some MS Clients) because
            they do not properly attribute lines in replies.
 
- Sig/Noise Ratio is the % of new info as compaired to total lines posted.
  This is calculated by taking the total new lines, deviding it by the total
  number of lines and multiplying by 100 (for percentage).
 
Other Options:
 
The default sort order is by Total Number of Emails (-E), but you can also
sort by other fields:
 
 -L to sort on total number of Lines posted.
 -N to sort on total number of New Lines posted.
 -G to sort on Garbage. Garbage is the number of non-new lines.
 
By default the maximum number of counted fields shown is 50.  This can be
changed with the -m flag.
 
By default the date range is from January 1, 1970 through 'today'.  You can
specify a date range using the -f and -t options
 
 -f From date.  Format is somewhat forgiving, but recomended is mm/dd/yyyy
 -t To date.  This is the date to stop on.  Same for format as above.

 -u Add list of URLs found in the date range
    create a list of URLs found, with Subject of the email listed for each URL

 -l Add loser rating.  I added this because I use this on mailing lists.
      Most mailing lists I am on, consider it bad to post HTML or attachments
      to the list, so this counts the number of HTML posting and attachments
      (other than things like PGP Sigs) and generates a number from 0 to 100
      which is the % of the mails that fall into this catagory.

 -T Add Troll rating.  I added this because some lists didn't have any
      obvious losers ;^) and didn't want to leave those lists out.
      This is simply the number of emails referencing a previous email
      The information is gathered from the 'In-Reply-To' and
      'Reference' headers.


Top | Description | Features | Bugs | Requirements | Download | Output | Usage/Docs

ChangeLog

  • 2.4
    • Fixed a problem with Dates (stripping last day in some instances)
    • Removed GPG sigs from being counted
    • Added less than 10% sig/noise to increase loser rating
    • Added "no subject" to increase loser rating
    • Added top posting to increase loser rating
    • Updated URL listing, grouping by subject (-u)
    • added contributers with URL listing (-u)
    • Fixed more debug messages
  • 2.3
    • fixed some email matching issues
    • cleaned up the debugging more... I think this happens every time
    • added gzip'ed mbox support
    • added troll rating based on Reference and In-Reply-To headers.
      Troll rating is the total number of emails that reference an email directly using one of those 2 header fields.
    • added http support for counting mail archives
    • added some new debug info when using -v
    • added -D for debug messages
      -D 1 low (shows basic routines and info)
      -D 2 detailed (shows more specific details about what is going on, plus everything above
      -D 3 annoying (shows every line of every email, pluse everything above)
    • made -v only give basic details abut what is going on because -D added for real DEBUGing info
    • cleaned u the quoted and new lines, like not including blank lines as new or quoted lines anymore.
    • added option for Troll rating (-T)
    • reworked how the format was generated
      so it is easier to add new features ;-)
    • added URL ripper to show list of URLs for date range (see -u)
    • after adding above, I went back and had it weed out all the footer style URLs, like for http://www.yahoo.com, http://messenger.msn.com, etc...
  • 2.2
    • fixed several errors
    • added loser rating OPTION based on attachments sent. Also includes non text (ie. HTML, RTF) mime types.
      Specifically, this is a % of mails that include one of these, since it is standard that you don't send attachments to lists and on the lists I am on, non-text mails are considered bad.
  • 2.1
    • INIT CHANGE LOG
 
 
 
 
http://www.unspecific.com/.go/count/

o S/N Ratio - % new lines to total lines posted.

o L - Loser Rating - % of emails that contain items deemed as bad list
etiquette, that includes top posting, posting in HTML, over-quoting,
etc...

o T - Troll Rating - # of emails that reply directly referencing your
email via the 'In-Reply-To' or 'Reference' headers