(Reading time: 3 – 4 minutes)
[Updated March 9, 2009: Added the code listing to show how the regular expression is used.]
Here is a little regular expression that will convert m/d/y dates from Excel spreadsheets into y/m/d dates used by SQL. Why I can’t export in this format from the Excel spreadsheet is a mystery. Another mystery is why SQL clients won’t accept dates in Excel spreadsheet format. As a user, I really don’t care whose format is “correct” because either are fine for me. It’s stupid. I could code up the relevant function in C in minutes.
In any case, here is a little regular expression for handling it.
1 | s:\([0-9]\+\)/\([0-9]\+\)/\(2006\):\3/\1/\2:g |
Example: applying the regex to 12/31/2006 returns 2006/12/31.
How I used this regular expression
My bank statements look about like the following:
1 2 3 | 10/22/2006,$13.13,Trader Joe's 10/31/2006,$17.19,Long's Drugs ...etc... |
What I need for the database is dates in the form “2006/10/22.” Since I have 12 of these statements for each bank, it makes a lot of sense to automate the whole procedure. First, I created a file named “fixyear.sed” which contained the regular expression. Then I wrote a shell script (which can run at the cygwin bash prompt in MS Windows) to invoke fixyear on each csv statement.
I put some comments in the listing below to help explain what and why; sh scripting is pretty easy and it could be translated in DOS batch language without too much trouble.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | #!/usr/bin/sh # Running sh script saves a lot of typing... echo Enter root password for MySQL database: read pass; # Convenient definitions for sh scripting: once you # have the command line syntax, no need to remember # or have to look it up. export db=" --database=finance" export user=" --user=imthebanker" export pw=" --password=$pass" export command="source mb_schema.sql" mysql $db $user $pw --execute="$command" # redefine command for convenience export command="load data local infile 'tmp' into table transactions_9005 fields terminated by ',' ignore 2 lines;" # Load each fixed CSV file into the database... for i in $( find . -type f -iname '*.csv' ); do # fixyear is where the regular expression lives, # tmp is overwritten each iteration ./fixyear.sed $i > tmp; mysql $db $user $pw --execute="$command" done # We don't need the tmp file anymore rm tmp |
Make regular expressions easier to find for everyone
Finding the exact regular expression you need can be time consuming. In the long run, if you do a lot of data processing, you will need to learn the fundamentals of regular expressions. And learning is faster—and more fun—with examples.
It would be really cool to have the metadata for little sed fragments like this, so search engines could find them more reliably. I looked into specifying a microformat to handle just that. After looking over the wiki for microformat “standards” it’s just too much bother. Too many requirements per line of metadata.
However, it would still be very useful to define a metadata format for regular expressions, perhaps using XML, perhaps using JSON or YAML. I’m still interested. Please leave a comment or send me an email if this strikes your fancy as well.








Comments on this entry are closed.