Sunday, April 28, 2013

A tale of one or more kludges

Let me tell you about a special incident I had while maintaining one of our projects. I wanted to write this while it was fresh in my mind, so that I could appropriately convey the various subtle levels of fail, not all of which I bothered to correct.

Please keep in mind this is for posterity; don't take it as a recommendation of how to go about things.

One of our customers has a business process that involves a MySQL database table which contains a list of users who are operating under a given department's license for a particular piece of software. The department wanted to provide a mass update of all their user records in that table, and the way this is accomplished is with a CSV file. They e-mail a CSV to my customer, who then massages it into a particular format and uses a web script which we maintain to "validate" and update the database table.

So far, WTF meters should be active, but in context, it's mostly shrugworthy. I mean, they need to do a bulk update, so why not provide a CSV update feature? As long as we validate the data, right?

Now the reason I even had to touch this code, which was written by a Master's student under the supervision of my group, was that it wasn't working. The uploaded CSV would not validate. Well, it seems that at some point someone decided that the table should have a Timestamp column added to it. No problem, right?

Not exactly. You see, whoever wrote the code using vanilla Perl DBI was unfamiliar with Perl hashes, or at least how to use them with DBI. Every column was indexed by its numerical position. And yes, of course, the Timestamp column was added at the beginning of the table definition.

It seems the programmer who did this tried to modify the supporting code. They added a check to make sure column zero had a valid date and time in it. But they didn't change the numerical indices for the range of columns on which to operate (so our records were truncated), nor did they change the indices for the validation. So when the code checked whether column 17 was non-null, I had to actually count the columns to see that this was a check for the primary key being non-null. Yes, you read that correctly: for some reason the primary key was at index 17. ... Which means now it's at position 18.
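The underlying fix for all this positional fragility is to stop addressing columns by number at all; Perl DBI has fetchrow_hashref for exactly that. Here's the same idea sketched in Python with sqlite3, using a made-up stand-in schema (table and column names here are hypothetical, not the customer's):

```python
import sqlite3

# Hypothetical schema standing in for the license-users table; the point is
# that adding a Timestamp column at the front doesn't break name-based access.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.execute("CREATE TABLE users (ts TEXT, name TEXT, ufid TEXT)")
conn.execute("INSERT INTO users VALUES ('2013-04-28 12:00:00', 'alice', '00012345')")

row = conn.execute("SELECT * FROM users").fetchone()
# Positional access breaks the moment a column is inserted ahead of it;
# name-based access survives schema reordering.
print(row["ufid"])
```

Had the original script done the equivalent, the new Timestamp column would have been a non-event.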

While fixing this, I noticed that the code which exports the data is duplicated... approximately 15 lines apart. I don't know why. At this point, my ability to care was minimal; I just fixed both copies.

So, with those fixes done, I tried again. The "validation" was still failing, even though my columns were properly numerically indexed at this point (did I just say "properly numerically indexed"? ...). I scrolled through the output the web-based tool was providing, and it seems that some of the rows had extra columns inserted in the middle for some reason. Of course, I had an inkling as to why this was. You see, some people hear "CSV" and believe "comma-separated" is the complete specification for that data format. I checked the export code, and sure enough it did nothing but dump the data from each field verbatim, with commas separating them. And of course some fields had internal commas.

Now if I had reported this to my customer at this point, I have no doubt he would have some of his student employees set to the task of eliminating commas from his database, rather than "bother me" with proper field handling. So I went ahead and fixed that, too.

So, all set, attempt an import again, and the validation says... every blank field is now set to the literal string "nullstring," and the fields are still being split on commas. Whaaaa....? So I look back at the import code, and of course it is simply splitting on commas. But it looks like it was written to take quotes into account? Well, of course, I should have known better; the input would never have had quotes, so why would the code handle them properly? The behavior was to split on commas first, then strip leading and trailing quotation marks from each field, and if any quotes were actually stripped and the field was now empty, set the field value to the literal "nullstring".
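For the record, the gap between "split on comma" and actual CSV handling is easy to demonstrate. A minimal sketch using Python's csv module, with a made-up record standing in for the real data:

```python
import csv
import io

# A row whose middle field contains an embedded comma, which a real CSV
# writer protects with quotes and a real CSV reader splits correctly.
record = ["00012345", "Dept of Chemistry, Gainesville", ""]

buf = io.StringIO()
csv.writer(buf).writerow(record)
line = buf.getvalue().strip()

# Naive "comma-separated" parsing shatters the quoted field...
naive = line.split(",")
print(len(naive))   # 4 pieces instead of 3

# ...while the csv reader round-trips it intact, empty field and all.
parsed = next(csv.reader([line]))
print(parsed == record)
```

Note the empty third field survives the round trip as an empty string, no "nullstring" sentinel required.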


Okay, so I fix that. Now it seems to be working. Out of curiosity, I look at the rest of the code base. Horrors. This is only one of at least four separate but virtually identical scripts that handle import/export in the same way. Copy/Paste FTL! ...

At this point, I remind myself that one of the guys on my team is currently engaged in writing the replacement for this system. All I have to do is put duct tape over the band-aid holding the rubber bands in place, and move on. There is more work to do on other projects.

Reluctantly, I inform the customer that the script is fixed, and he can now proceed, and it works great... You might be curious how exactly it works. Well, I've saved the best for last. You see, first the import script generates all of the SQL necessary to run the import. But rather than run the SQL through Perl DBI, it uses something I like to call SQL Injection as a Service. Yes, the SQL is then embedded in JavaScript and output back to the user's browser. What good does it do there? Well, you see, when the "validation" script reports that the data is good, the user clicks on "Import", and the system runs a JS loop which sends a single GET request for every row, where the parameter is the literal SQL to run on the server. And the server dutifully runs each literal SQL query as the JavaScript commands.
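For contrast, the conventional way to run such an import is entirely server-side, with the validated rows bound as parameters. A minimal sketch in Python's sqlite3 (the original used Perl DBI, where placeholders work the same way; the table and data here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (ufid TEXT PRIMARY KEY, name TEXT)")

# One of these "names" is a hostile string that would wreck string-built SQL.
rows = [("00012345", "alice"), ("00067890", "bob'); DROP TABLE users; --")]

# With bound parameters the driver keeps data and SQL separate, so hostile
# field contents are inserted as plain data, never executed.
conn.executemany("INSERT INTO users (ufid, name) VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)
```

Both rows land in the table and the table survives, which is more than can be said for SQL-over-GET.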

Like I said, seems to be working. I'm going to go sit in a corner and cry now.

Sunday, February 20, 2011

DNA seen through the eyes of a coder

This has been around a while, but I just found it. Explanation by a gentleman named Bert Hubert about the similarities between genetics and computer systems, with the analogy of DNA to source code as a focal point. A very cool read: DNA seen through the eyes of a coder. It seems to have been written 9 years ago (in 2002), and I'm no molecular biologist, but I think there have been some interesting advances in epigenetics since then. I wonder how that fits into the analogy?

Monday, February 14, 2011

ID numbers are not integers

Here at $work we use a numeric identifier called the UFID. It's an 8-digit string that uniquely identifies an individual related to the University. It's protected information, according to $policy, so we have to be careful how we treat it.

Note I called it an 8-digit string. Unfortunately, I continue to see databases where this identifier is stored as an integer merely because it looks numeric. That is to say, int rather than char. This makes me sad.

While it's possible to get into an involved academic discussion of why this is wrong, I'll just enumerate two simple rules for when to use a numeric type, such as integer:

  • If the data is going to be used for arithmetic or statistical functions such as mean.

  • If the data serves as a counter, including auto-increment primary keys.

Note the second case is really an exception, and in the strictest sense should not be allowed either. But in the spirit of pragmatism, it is easy enough to permit this very special, well-defined case without problems. What does cause problems is using an integer type for a string field. The most obvious problem is the conversion from integer to string dropping the leading zeroes.

Yes, it's possible to instruct most databases to return the data with leading zeroes prepended even though it's an integer. That's an abomination. Not only that, but if your ORM "knows" this is an integer, its internal representation will probably ditch that padding. Now you have to make your code provide padding as well via sprintf or similar. Not very DRY.
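The whole failure mode fits in a few lines, using a hypothetical UFID:

```python
# What integer storage does to an 8-digit identifier with leading zeroes.
ufid = "00012345"

as_int = int(ufid)         # the "it looks numeric" conversion
print(as_int)              # 12345 -- the zeroes are gone

# Round-tripping now requires re-padding at every point of use.
print(f"{as_int:08d}")     # 00012345, but only if everyone remembers to pad
print(str(as_int) == ufid)
```

That final False is every report, export, and string comparison in the system waiting to bite you.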

The data is not an integer to begin with; you should not have to shoe-horn it into a type to which it does not belong. What happens when one day they run out of IDs and start allowing letters in the ID? ...

Save yourself the worry. Store identifiers as strings.

Wednesday, January 27, 2010

SVN LoC and churn metrics

Wrote a small Perl script to grab svn lines-of-code metrics (added, modified, deleted) and churn (added + modified), as well as the number of files added, updated, or deleted, by revision. So you get a flat-file output (fixed width) of timestamp, username, revision, lines added, modified, churned, and deleted, and files added, updated, and deleted.

Using that raw data it's a quick thing to parse it any way you like, such as applying math to predict your defect rate or simply graphing it over time.

Here's the script, creatively named svnloc:

use strict;
use warnings;

my $BARSIZE = 40; # Size of the progress bar
my @statuses = qw(A U D);

my $repo = shift();
my $outfile = shift() || "./svnloc.txt";
my $revision = shift();
my $latest_rev;
my %rev_users;
my %rev_dates;
my %rev_changes;
my %rev_diff;

if (not defined $repo or not -e $repo) {
    print <<END_USAGE;
Usage: svnloc repo [outfile [revision]]
  repo      the path to the svn repository
  outfile   the path for the output file, defaults to "./svnloc.txt"
  revision  if specified, will append data for that revision to the output;
            if not specified, all data for all revisions is obtained and the
            file is generated from scratch, overwriting the old file if it exists.
END_USAGE
    exit 1;
}

my $bl_filename = "svnloc.blacklist";
my @blacklist; # Don't count these revisions
if (-e $bl_filename) {
    open BL, $bl_filename;
    do { chomp; push @blacklist, $_; } for (<BL>);
    close BL;
}

if (defined $revision) { # get info for our revision and append to output file
    get_info($revision);
    open OUTPUT, ">>$outfile";
    output_line($revision);
    close OUTPUT;
} else { # generate output file from scratch
    my $history = `svnlook history $repo`;
    ($latest_rev) = $history =~ /(\d+)/s;
    print "Latest revision: $latest_rev\n";

    rev_loop("Obtaining revision information...", \&get_info);

    open OUTPUT, ">$outfile";
    printf OUTPUT ("%-20s%-18s%6s%7s%7s%7s%7s%5s%5s%5s\n",
        qw(date user rev added mod churn del A U D));

    rev_loop("Generating output file ($outfile)...", \&output_line);

    close OUTPUT;
    print "Finished.\n";
}

sub get_info {
    my $rev = shift;

    my $info = `svnlook info -r $rev $repo`;
    my ($user, $date) = split(/\n/, $info);
    $rev_users{$rev} = $user;
    $rev_dates{$rev} = $date;

    # Tally files added/updated/deleted by the first status character
    my $changed = `svnlook changed -r $rev $repo`;
    for my $s (split(/\n/, $changed)) {
        my ($status) = substr($s, 0, 1);
        $rev_changes{$rev}->{$status}++;
    }

    my $diff = `svnlook diff -r $rev $repo`;

    # Count a '-' line immediately answered by a '+' line as a modification;
    # unmatched '+' lines are additions, unmatched '-' lines are deletions.
    my ($added, $modified, $deleted, $temp_deleted) = (0) x 4;
    for my $line (split(/\n/, $diff)) {
        my $c2 = substr($line, 0, 2);
        my ($c) = substr($line, 0, 1);
        next if ($c2 eq '--' || $c2 eq '++'); # ignore diff header lines
        if ($c eq '-') {
            $temp_deleted++;
        } elsif ($c eq '+') {
            if ($temp_deleted) {
                $temp_deleted--;
                $modified++;
            } else {
                $added++;
            }
        } else {
            $deleted += $temp_deleted;
            $temp_deleted = 0;
        }
    }
    $deleted += $temp_deleted;

    $rev_diff{$rev}->{added}    = $added;
    $rev_diff{$rev}->{modified} = $modified;
    $rev_diff{$rev}->{churn}    = $added + $modified;
    $rev_diff{$rev}->{removed}  = $deleted;
}

sub output_line {
    my $rev = shift;
    no warnings 'uninitialized';
    printf OUTPUT ("%-20s%-18s%6d%7d%7d%7d%7d%5d%5d%5d\n",
        $rev_dates{$rev}, $rev_users{$rev}, $rev,
        (map { $rev_diff{$rev}->{$_} } qw(added modified churn removed)),
        (map { $rev_changes{$rev}->{$_} } @statuses));
}

sub rev_loop {
    my ($msg, $code) = @_;
    my $progress;
    print "$msg\n";
    start_progress(\$progress);
    for (1..$latest_rev) {
        next if (is_in($_, @blacklist));
        $code->($_);
        tick_progress(\$progress, $latest_rev);
    }
    end_progress();
}

sub start_progress {
    my $progress = shift();
    $$progress = 0;
    print "[" . (" " x $BARSIZE) . "]\r";
}

sub tick_progress {
    my $progress = shift();
    my $max = shift();
    my $ticks = int(($$progress++ / $max) * $BARSIZE);
    my $spaces = $BARSIZE - $ticks;

    # $_ is the current revision from rev_loop's for loop
    printf "[" . ("=" x $ticks)
         . (" " x $spaces)
         . "] %-10s\r", $_;
}

sub end_progress {
    print "[" . ("=" x $BARSIZE) . "]\n\n";
}

sub is_in {
    my $item = shift;
    my @list = @_;
    my %seen;
    @seen{@list} = (1) x scalar @list;
    return $seen{$item};
}

Wednesday, February 11, 2009

Vim Color Schemes

Since deciding to get into Vim, I seem to have wasted a considerable amount of time on color schemes! After discovering how to get 256 colors in my terminal window (Hint: It works with PuTTY out of the box, just set your TERM environment variable to xterm-256color), I decided to put together my own 256-color themes. Well, sort of. I actually just copied the code from the Wombat theme by Lars H. Nielsen and modified the colors.

This is the one I'm using by default, it's high-contrast and it seems to work very well for the Perl/Template/JS/HTML editing I am doing most of the time. On account of the many bright colors (which I'm sure other people will think look ridiculous), I call this scheme Harlequin.

Inspired by the green/brown/white colors deployed by the marketing droids of the on-site coffee vendors where I work, I have created this scheme named Starbucks.

And of course to round things out and bring some balance, here is this truly evil dark-side scheme which I call Magma.

Of course, there are those who say that Starbucks® is the true evil, but I digress...

Friday, February 6, 2009

SVN deletion goodness

The process of getting a project which was not under version control into SVN can be a chore.  Usually the lack of source control has forced the creation of loads of temporary and backup files with silly names.  The easiest thing to do is to simply import the whole mess into the repository, and then go back and clean it up later.

That's what I was doing earlier this morning.  I checked out a copy and started trimming, and by the time I was ready to commit, I realized I had been accidentally deleting files directly in the shell instead of using svn delete.  Oops!  Now I have to go back and re-delete them.  But they're already gone, and there were probably a hundred files and directories removed.  Won't that be a huge pain?

Not really.  A little shell one-liner will take care of it for you:

svn status | grep '^!' | awk '{print $2}' | xargs svn del

Run that from the root of your working copy, and it will do the following:

  • Give you the status of all the files and directories in your working copy compared with the repository.
  • Extract only those lines which start with !, which is svn status's way of saying "Oh noes, I can't find that one!"
  • Feed those lines into awk so that it can get the second item on the line, the path.
  • Use xargs to run svn del on each of those paths.

Now all of your deletes will be properly reflected in the repo at the next commit.  Phew!
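If you want to see what each stage contributes without risking a real working copy, you can feed the pipeline some canned svn status output (the paths here are made up) and swap xargs's command for echo:

```shell
# Simulated `svn status` output: '!' marks items deleted outside svn's
# knowledge; M and A lines must pass through untouched.
status_output='!       lib/old_module.pm
M       bin/import.pl
!       docs/notes.txt
A       bin/export.pl'

# Same grep/awk stages as the one-liner, with xargs running echo so the
# final command is printed instead of executed.
printf '%s\n' "$status_output" | grep '^!' | awk '{print $2}' | xargs echo svn del
```

One caveat: awk '{print $2}' assumes paths contain no whitespace, so filenames with spaces will be mangled by this one-liner.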

Friday, January 16, 2009

Chat Hacking, Part II

So, got it to work.  Turns out we were both right:  Danny was correct in that we weren't using the JSJaC library properly, and I was right in that the server detected our switcheroo and didn't want to talk to us.

Firstly, we discovered there's an internal (but public) method on JSJaC's connection object called inherit, which allows you to utilize an existing http-bind session when you fire up the chat engine.  That turns out to be the right way to do things.  It expects a number of arguments passed in an argument object, three of which are vital to convincing the server that you are who you say you are: 

  • sid: (Session ID) This is generated by the server and sent in the connection phase, we already had this working fine in Perl, so no problems here.
  • key: The key is a hex-encoded sha1 hash sequence, and it's used to verify that each subsequent request comes from the same client.  How?  Well, each new key is the sha1 hash of the previous key.  If you transmit the wrong key, the server barfs on you.
  • rid: (Request ID) This is simply a sequential number, but it's important with respect to the key.  If your request ID is not in lock-step with your key sequence, the server again will barf on you.

Getting the key right was the trickiest part, but not too bad.  Essentially, JSJaC by default generates a list of 16 keys at a time to use.  Since we weren't initializing the session using JSJaC, we instead had Perl initialize those keys and use the first few to establish a connection.  The key list then gets injected into the web page where JSJaC can pick it up.  As long as there are no fencepost errors, the whole sequence proceeds along without a hitch, and the server happily talks to the JavaScript.
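The key chain itself is easy to reproduce. Here's a sketch of the scheme as described above in Python; make_key_chain is an illustrative helper, not JSJaC's or our actual code:

```python
import hashlib
import secrets

# Build a chain where each key is the SHA-1 of the one before it. The chain
# is then consumed in reverse, so the server can verify lock-step by hashing
# each key it receives and comparing against the previous one.
def make_key_chain(length=16, seed=None):
    key = seed or secrets.token_hex(20)
    chain = []
    for _ in range(length):
        key = hashlib.sha1(key.encode("ascii")).hexdigest()
        chain.append(key)
    return chain

chain = make_key_chain()
sent_order = list(reversed(chain))
# Hashing the second key sent reproduces the first one sent -- the check
# that makes a hijacked sid useless without the rest of the chain.
print(hashlib.sha1(sent_order[1].encode("ascii")).hexdigest() == sent_order[0])
```

Perl only needs to generate the same chain, spend the tail of it on the connection phase, and hand the remainder to the page.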

It's actually not as fragile as we were afraid it might be, since the session/key/rid setup makes sure that each session is unique and can only be utilized in JS by the CGI that initiated it.  And it makes sure that all the credentials required for login are safely tucked away on the server side, where clients can't see them at all.  Now all we have to do is tweak (read: restrict or rewrite) the JWChat interface a little to give it fewer features, and add a bit of conversation logging, and we've basically accomplished what we set out to do.  

Which means we can have anonymous clients talking to our people on our internal chat server, in a controlled environment and without compromising any credentials.