Sunday, February 20, 2011

DNA seen through the eyes of a coder

This has been around a while, but I just found it: an explanation by a gentleman named Bert Hubert of the similarities between genetics and computer systems, with the analogy of DNA to source code as a focal point. A very cool read: DNA seen through the eyes of a coder. It seems to have been written 9 years ago (in 2002), and I'm no molecular biologist, but I think there have been some interesting advances in epigenetics since then. I wonder how that fits into the analogy?

Monday, February 14, 2011

ID numbers are not integers

Here at $work we use a numeric identifier called the UFID. It's an 8-digit string that uniquely identifies an individual related to the University. It's protected information, according to $policy, so we have to be careful how we treat it.

Note I called it an 8-digit string. Unfortunately, I continue to see databases where this identifier is stored as an integer merely because it looks numeric. That is to say, int rather than char. This makes me sad.

While it's possible to get into an involved academic discussion of why this is wrong, I'll just enumerate two simple rules for when to use a numeric type, such as integer:

  • If the data is going to be used for arithmetic or statistical functions such as mean.

  • If the data serves as a counter, including auto-increment primary keys.


Note the second case is really an exception, and in the strictest sense should not be allowed either. But, in the spirit of pragmatism, it is easy enough to permit this very special, well-defined case without problems. What does cause problems is using an integer type for a string field. The most obvious problem is that converting from integer back to string drops the leading zeroes.

Yes, it's possible to instruct most databases to return the data with leading zeroes prepended even though it's an integer. That's an abomination. Not only that, but if your ORM "knows" this is an integer, its internal representation will probably ditch that padding. Now you have to make your code provide padding as well via sprintf or similar. Not very DRY.
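To make the leading-zero problem concrete, here is a minimal sketch in Python (not the Perl and SQL we actually use at $work). The ID value and the pad_ufid helper are made up for illustration; the point is only that the integer round trip loses a character which every consumer then has to remember to put back.

```python
# Hypothetical UFID with a leading zero; not a real identifier.
ufid = "01234567"

# Round-tripping through an integer (what an int column or an ORM does)
# silently loses the leading zero.
as_int = int(ufid)
round_tripped = str(as_int)      # "1234567", only 7 characters


def pad_ufid(value, width=8):
    """Re-pad an integer ID to a fixed width (hypothetical helper)."""
    return f"{value:0{width}d}"


repaired = pad_ufid(as_int)      # "01234567" again, but only because we padded

# Stored as a string, the value never needed repairing in the first place.
print(round_tripped, repaired, ufid)
```

Every place that reads the integer needs its own pad_ufid call, which is exactly the not-very-DRY part.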

The data is not an integer to begin with, so you should not have to shoe-horn it into a type to which it does not belong. What happens when one day they run out of IDs and start allowing letters in the ID? ...

Save yourself the worry. Store identifiers as strings.

Wednesday, January 27, 2010

SVN LoC and churn metrics

Wrote a small Perl script to grab svn lines of code metrics (added, modified, deleted) and churn (added + modified) metrics, as well as the number of files added, updated, or deleted by revision. So you get a flat file output (fixed width) of timestamp, username, revision, lines added, modified, churned, deleted, files added, updated, deleted.

Using that raw data it's a quick thing to parse it any way you like, such as applying math to predict your defect rate or simply graphing it over time.
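As an example of that kind of parsing, here is a rough Python sketch that totals churn per user. The column widths are assumed to mirror the printf format in svnloc's output_line (20, 18, 6, 7, 7, 7, 7, 5, 5, 5); the field names and both functions are my own invention. It will misread a line if a username is longer than 18 characters or a number overflows its column.

```python
from collections import defaultdict

# (name, width) pairs assumed to match svnloc's fixed-width output line.
FIELDS = [
    ("date", 20), ("user", 18), ("rev", 6),
    ("added", 7), ("modified", 7), ("churn", 7), ("deleted", 7),
    ("files_added", 5), ("files_updated", 5), ("files_deleted", 5),
]
TEXT_FIELDS = {"date", "user"}


def parse_line(line):
    """Slice one fixed-width data line into a dict of typed fields."""
    record, pos = {}, 0
    for name, width in FIELDS:
        raw = line[pos:pos + width].strip()
        pos += width
        record[name] = raw if name in TEXT_FIELDS else int(raw or 0)
    return record


def churn_by_user(lines):
    """Sum the churn column per username, skipping the header and blank lines."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("Date"):
            continue
        record = parse_line(line)
        totals[record["user"]] += record["churn"]
    return dict(totals)
```

Usage would be something like churn_by_user(open("svnloc.txt")), and from there it is a short step to graphing or feeding the totals into whatever defect model you prefer.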

Here's the script, creatively named svnloc:


#!/usr/bin/perl
use strict;
use warnings;

my $BARSIZE = 40; # Size of the progress bar
my @statuses = qw(A U D);

my $repo = shift();
my $outfile = shift() || "./svnloc.txt";
my $revision = shift();
my $latest_rev;
my %rev_users;
my %rev_dates;
my %rev_changes;
my %rev_diff;

if (not defined $repo or not -e $repo) {
    print <<END_USAGE;
Usage: svnloc repo [outfile [revision]]
    repo      the path to the svn repository
    outfile   the path for the output file, defaults to "./svnloc.txt"
    revision  if specified, will append data for that revision to the output;
              if not specified, all data for all revisions is obtained and the
              file is generated from scratch, overwriting the old file if it exists.
END_USAGE
    exit(1);
}

my $bl_filename = "svnloc.blacklist";
my @blacklist; # Don't count these revisions (one revision number per line)
if (-e $bl_filename) {
    open(my $bl, '<', $bl_filename) or die "Cannot read $bl_filename: $!";
    chomp(@blacklist = <$bl>);
    close $bl;
}

if (defined $revision) { # get info for our revision and append to output file
    get_info($revision);
    open(OUTPUT, '>>', $outfile) or die "Cannot append to $outfile: $!";
    output_line($revision);
    close OUTPUT;
} else { # generate output file from scratch
    my $history = `svnlook history $repo`;
    ($latest_rev) = $history =~ /(\d+)/s; # first number listed is the newest revision
    die "Could not determine the latest revision of $repo\n" unless defined $latest_rev;
    print "Latest revision: $latest_rev\n";

    rev_loop("Obtaining revision information...",\&get_info);

    open(OUTPUT, '>', $outfile) or die "Cannot write $outfile: $!";
    # One column per heading, with the same widths as the data lines
    printf OUTPUT ("%-20s%-18s%6s%7s%7s%7s%7s%5s%5s%5s\n",
        "Date","Username","Rev","Add","Mod","Chrn","Del",@statuses);

    rev_loop("Generating outputfile ($outfile)...",\&output_line);

    close OUTPUT;
}
print "Finished.\n";

sub get_info {
    my $rev = shift;

    # svnlook info prints author, date, log size and log message, one per line
    my $info = `svnlook info -r $rev $repo`;
    my ($user, $date) = split(/\n/,$info);
    $rev_users{$rev} = $user;
    $rev_dates{$rev} = $date;

    # Count changed paths by their first status letter (A, U, D)
    my $changed = `svnlook changed -r $rev $repo`;
    for my $s (split(/\n/,$changed)) {
        my $status = substr($s,0,1);
        $rev_changes{$rev}->{$status}++;
    }

    my $diff = `svnlook diff -r $rev $repo`;

    # A '+' line that follows pending '-' lines counts as a modification;
    # unmatched '+' lines are additions, unmatched '-' lines are deletions.
    my ($added,$modified,$deleted,$temp_deleted) = (0)x4;
    for my $line (split(/\n/,$diff)) {
        my $c = substr($line,0,1);
        next if ($line =~ /^(?:---|\+\+\+) /); # ignore file header lines (heuristic)
        if ($c eq '-') {
            $temp_deleted++;
        } elsif ($c eq '+') {
            if ($temp_deleted) {
                $temp_deleted--;
                $modified++;
            } else {
                $added++;
            }
        } else {
            $deleted += $temp_deleted;
            $temp_deleted = 0;
        }
    }
    $deleted += $temp_deleted; # flush deletions still pending when the diff ends

    $rev_diff{$rev}->{added} = $added;
    $rev_diff{$rev}->{modified} = $modified;
    $rev_diff{$rev}->{churn} = $added + $modified;
    $rev_diff{$rev}->{removed} = $deleted;
}

sub output_line {
    my $rev = shift;
    no warnings 'uninitialized'; # missing counts print as 0
    printf OUTPUT ("%20s%-18s%6d%7d%7d%7d%7d%5d%5d%5d\n",
        substr($rev_dates{$rev},0,20),
        $rev_users{$rev},
        $rev,
        $rev_diff{$rev}->{added},
        $rev_diff{$rev}->{modified},
        $rev_diff{$rev}->{churn},
        $rev_diff{$rev}->{removed},
        map { $rev_changes{$rev}->{$_} } @statuses);
}

sub rev_loop {
    my ($msg, $code) = @_;
    my $progress;
    print "$msg\n";
    start_progress(\$progress);
    for my $rev (1..$latest_rev) {
        tick_progress(\$progress,$latest_rev,$rev);
        next if (is_in($rev,@blacklist));
        $code->($rev);
    }
    end_progress();
}

sub start_progress {
    my $progress = shift();
    $$progress=0;
    print "[" . (" " x $BARSIZE) . "]\r";
}

sub tick_progress {
    # The revision is passed in explicitly rather than read from the caller's $_
    my ($progress, $max, $rev) = @_;
    my $ticks = int(($$progress++/$max) * $BARSIZE);
    my $spaces = $BARSIZE - $ticks;

    printf "[" . ("=" x $ticks)
        . (" " x $spaces)
        . "] %-10s\r",$rev;
}

sub end_progress {
    print "[" . ("=" x $BARSIZE) . "]\n\n";
}

sub is_in {
    my ($item, @list) = @_;
    my %seen;
    @seen{@list} = (1) x scalar @list;
    return $seen{$item};
}

Wednesday, February 11, 2009

Vim Color Schemes

Since deciding to get into Vim, I seem to have wasted a considerable amount of time on color schemes! After discovering how to get 256 colors in my terminal window (Hint: It works with PuTTY out of the box, just set your TERM environment variable to xterm-256color), I decided to put together my own 256-color themes. Well, sort of. I actually just copied the code from the Wombat theme by Lars H. Nielsen and modified the colors.

This is the one I'm using by default. It's high-contrast and it seems to work very well for the Perl/Template/JS/HTML editing I am doing most of the time. On account of the many bright colors (which I'm sure other people will think look ridiculous), I call this scheme Harlequin.




Inspired by the green/brown/white colors deployed by the marketing droids of the on-site coffee vendors where I work, I have created this scheme named Starbucks.




And of course to round things out and bring some balance, here is this truly evil dark-side scheme which I call Magma.




Of course, there are those who say that Starbucks® is the true evil, but I digress...

Friday, February 6, 2009

SVN deletion goodness

The process of getting a project which was not under version control into SVN can be a chore.  Usually the lack of source control has forced the creation of loads of temporary and backup files with silly names.  The easiest thing to do is to simply import the whole mess into the repository, and then go back and clean it up later.

That's what I was doing earlier this morning. I checked out a copy and started trimming, and by the time I was ready to commit, I realized I had been accidentally deleting files directly in the shell instead of using svn delete. Oops! Now I have to go back and re-delete them. But they're gone, and there were probably a hundred files and directories removed. Won't that be a huge pain?

Not really.  A little shell one-liner will take care of it for you:

svn status | grep '^!' | awk '{print $2}' | xargs svn del

Run that from the root of your working copy, and it will do the following:

  • Give you the status of all the files and directories in your working copy compared with the repository.
  • Extract only those lines which start with !, which is svn status's way of saying "Oh noes, I can't find that one!"
  • Feed those lines into awk so that it can get the second item on the line, the path.
  • Use xargs to run svn del on each of those paths.

Now all of your deletes will be properly reflected in the repo at the next commit.  Phew!
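One caveat: awk's $2 stops at the first space, so the one-liner trips over paths that contain spaces. Here is a rough Python sketch of the same idea that keeps the whole path. It assumes the status flags occupy the first seven or so columns of each svn status line and that the path is whatever follows; missing_paths and delete_missing are names I made up.

```python
import subprocess


def missing_paths(status_output):
    """Return the paths that `svn status` marks as missing ('!' in column 1)."""
    paths = []
    for line in status_output.splitlines():
        if line.startswith("!"):
            # Assumption: flag columns end by index 7 and the path follows,
            # so spaces inside the path survive intact.
            paths.append(line[7:].strip())
    return paths


def delete_missing(working_copy="."):
    """Run `svn delete` on every missing path (not exercised here)."""
    status = subprocess.run(
        ["svn", "status"], cwd=working_copy,
        capture_output=True, text=True, check=True,
    ).stdout
    paths = missing_paths(status)
    if paths:
        subprocess.run(["svn", "delete", *paths], cwd=working_copy, check=True)
    return paths
```

Calling delete_missing() from the root of the working copy should have the same effect as the one-liner, though for a few dozen ordinary filenames the shell version is less typing.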

Friday, January 16, 2009

Chat Hacking, Part II

So, got it to work.  Turns out we were both right:  Danny was correct in that we weren't using the JSJaC library properly, and I was right in that the server detected our switcheroo and didn't want to talk to us.

Firstly, we discovered there's an internal (but public) method on JSJaC's connection object called inherit, which allows you to utilize an existing http-bind session when you fire up the chat engine.  That turns out to be the right way to do things.  It expects a number of arguments passed in an argument object, three of which are vital to convincing the server that you are who you say you are: 

  • sid: (Session ID) This is generated by the server and sent in the connection phase. We already had this working fine in Perl, so no problems here.
  • key: The key is a hex-encoded sha1 hash, and it's used to verify that each subsequent request comes from the same client.  How?  Well, the keys form a chain in which each key is the sha1 hash of the one generated before it.  The client sends them in reverse order, so the server can check that the sha1 of the key it just received matches the key from the previous request.  If you transmit the wrong key, the server barfs on you.
  • rid: (Request ID) This is simply a sequential number, but it's important with respect to the key.  If your request ID is not in lock-step with your key sequence, the server again will barf on you.

Getting the key right was the trickiest part, but not too bad.  Essentially, JSJaC by default generates a list of 16 keys at a time to use.  Since we weren't initializing the session using JSJaC, we instead had Perl initialize those keys and use the first few to establish a connection.  The key list then gets injected into the web page where JSJaC can pick it up.  As long as there are no fencepost errors, the whole sequence proceeds along without a hitch, and the server happily talks to the JavaScript.
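For the curious, the key chain can be sketched like this. It's in Python rather than our Perl, it follows my reading of the key sequencing section of XEP-0124, and it is not JSJaC's actual code. The seed, the starting rid, and the function names are all made up for illustration.

```python
import hashlib


def sha1_hex(text):
    """Hex-encoded sha1 of an ASCII string."""
    return hashlib.sha1(text.encode("ascii")).hexdigest()


def generate_keys(seed, count=16):
    """Build a chain in which each key is the sha1 of the one before it."""
    keys, current = [], seed
    for _ in range(count):
        current = sha1_hex(current)
        keys.append(current)
    return keys


def server_accepts(previous_key, received_key):
    """The server's check: hashing the new key must give the previous one."""
    return sha1_hex(received_key) == previous_key


keys = generate_keys("hypothetical-seed")
send_order = list(reversed(keys))    # the last key generated is sent first

# The rid moves in lock-step with the keys: one increment per request.
first_rid = 1000                     # hypothetical starting request ID
requests = [(first_rid + i, key) for i, key in enumerate(send_order)]
```

If Perl uses up the first few entries of send_order, the JavaScript side has to start at exactly the next (rid, key) pair, which is where the fencepost errors creep in.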

It's actually not as fragile as we were afraid it might be, since the session/key/rid setup makes sure that each session is unique and can only be utilized in JS by the CGI that initiated it.  And it makes sure that all the credentials required for login are safely tucked away on the server side, where clients can't see them at all.  Now all we have to do is tweak (read: restrict or rewrite) the JWChat interface a little to give it fewer features, and add a bit of conversation logging, and we've basically accomplished what we set out to do.  

Which means we can have anonymous clients talking to our people on our internal chat server, in a controlled environment and without compromising any credentials.

Thursday, January 15, 2009

Chat Hacking

In a blog post by the same name on our (internal) bug tracking system, my colleague Danny describes our (thus far futile) efforts to create an unholy union between JWChat and a cgi script.  That's right, we're trying to do XMPP Chat via the web, which means we're trying to utilize something which implements XEP-0124 (BOSH).

The point is that we want outsiders to be able to talk to our people who are using the internal XMPP server.  Trouble is, the XMPP server requires local credentials to log in (LDAP), so we need them to log in via dummy accounts.  But, we don't want to just hand out the user/pass to those dummy accounts by putting them into the JavaScript source of a web-based chat client.  Hence, the madness begins.

One of our system administrators had the lovely idea of using a CGI script to initiate the connection, keeping our credentials server-side, over an http-bind proxy.  Then we'd hand the session data to our JavaScript client-side, again through the proxy, hopefully unbeknownst to the chat server.  

Yeah, it's not going so well.  As in, not at all.  For one thing, there are no existing libraries to implement XEP-0124 in Perl.  We can do regular socket connections, sure, but not BOSH.  So, we were faced with either using a library from another language (like JavaScript or C++) inside a Perl wrapper, or just "faking" the BOSH process by sending some pre-formatted XML over an LWP connection to the http-bind port.  
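To give an idea of what "pre-formatted XML" means here, this is roughly what a BOSH session creation request body looks like, sketched in Python rather than the Perl and LWP we actually used. The attribute names are the ones I understand XEP-0124 to define; the host name and rid are placeholders, and the sketch only builds the body without POSTing it anywhere.

```python
from xml.sax.saxutils import quoteattr

BOSH_NS = "http://jabber.org/protocol/httpbind"


def session_request(to, rid, wait=60, hold=1):
    """Build the <body/> element that asks the server for a new BOSH session."""
    attrs = [
        ("content", "text/xml; charset=utf-8"),
        ("hold", str(hold)),     # requests the server may keep waiting
        ("rid", str(rid)),       # initial request ID, normally a large random number
        ("to", to),              # the XMPP domain we want to reach
        ("wait", str(wait)),     # longest time in seconds the server may hold a request
        ("xml:lang", "en"),
        ("xmlns", BOSH_NS),
    ]
    return "<body " + " ".join(f"{k}={quoteattr(v)}" for k, v in attrs) + "/>"


# Placeholder domain and rid, for illustration only.
print(session_request("chat.example.org", 1573741820))
```

The server's reply to a request like this is where the sid comes from.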

Well, that part actually worked (surprisingly enough): we can connect server side and get session info.  But as of right now, we can't inject that info into the JavaScript client side and have it pick up the ball, so to speak.  I think maybe the server somehow can tell that we've pulled a fast one on it, and it's not willing to talk to the client masquerading as the server.   Danny thinks we probably just haven't covered all of our bases in initializing the JavaScript chat engine.  We're proceeding under the assumption that he's right, and it's still possible to get this to work.

Only time will tell if this crazy scheme of ours can work.