The joy of merge

A recent customer requirement was to have a process to import a list of data from CSV and incrementally update an existing table of the same data, adding/updating/deleting rows as required in order to get the two into sync. The production table has extra columns, with the CSV import being simply a list of (for argument’s sake) customer IDs and names. The technology here is .NET and SQL-Server.

In the past we might have used a process like this:

  • Import the data into a temporary table
  • Use a query to delete/archive any production data rows not present in the new imported list
  • Use a query to update any existing production data rows that are different in the imported list
  • Use a query to insert any new rows from the imported list into the production data
  • Present the results (“[n1] rows added, [n2] rows updated and [n3] rows deleted”) to the UI and/or system logs.

All well and good, and it’s a safe and established way of going about it. However… enter the SQL MERGE command, introduced in SQL-Server 2008. This allows us to perform all of the above data operations in a single SQL statement. Credit to this article (from 2014 - never tell me I’m not bang up to date!) for giving the bare bones on how to go about it, and that is definitely recommended reading for a deeper dive, as is the Microsoft documentation on MERGE. However, for a brief introduction, here is the syntax we’ve used:

CREATE PROCEDURE [dbo].[mergeCustomers]
    @tblNewCustomerList CustomerType READONLY

AS
  MERGE INTO Customers 
  USING @tblNewCustomerList NewCustomerList

  ON Customers.CustomerID = NewCustomerList.CustomerID

  /* For an existing row we just update the name details */
  WHEN MATCHED THEN
  UPDATE SET Customers.FirstName = NewCustomerList.FirstName,
                        Customers.LastName = NewCustomerList.LastName

  /* For a new row, insert the relevant data */
  WHEN NOT MATCHED BY TARGET THEN
  INSERT (CustomerID, FirstName, LastName)
  VALUES (NewCustomerList.CustomerID, NewCustomerList.FirstName, NewCustomerList.LastName)

  /* Row is in the data but not in the upload - delete it */
  WHEN NOT MATCHED BY SOURCE
  DELETE;

  /* Now return the number of rows affected */
  SELECT @@ROWCOUNT
 
RETURN

The first question this raises is: what’s that @tblNewCustomerList parameter? The answer is: it’s a user-defined table TYPE, which allows us to pass an entire “table” of data into the stored procedure. In .NET/C# terms this means we create an in-memory DataTable object, populate it with the contents of the CSV file, and then pass it to the stored procedure. In our case, data lengths and other validation is done in code, in order to ensure (as far as possible) that the data is compliant with any NULL rules or SQL field lengths before being passed, so that SQL won’t reject the entire update.

The C# code is of this form:

    DataTable dt = new DataTable();
    DataRow row; 
    int rows affected = 0;
    dt.Columns.Add(“CustomerID", typeof(int));
    dt.Columns.Add(“FirstName", typeof(string));
    dt.Columns.Add(“LastName", typeof(string));
    
    // Then repeated chunks of this sort of thing to add all our data to the datatable
    row = dt.NewRow()
    row[“CustomerID”] = 1;
    row[“FirstName”] = “Julian”;
    row[“LastName”] = “Woodward”;
    dt.Rows.Add(row);

    // Then when we’ve finished constructing the data
    string connString = [insert your SQL connection string here];
                
    using (SqlConnection conn = new SqlConnection(connString))
    {
        SqlParameter param = new SqlParameter("@tblNewCustomerList", SqlDbType.Structured)
        {
            TypeName = "dbo.CustomerType",
            Value = dt
        };

        using (SqlCommand cmd = new SqlCommand(“dbo.mergeCustomers", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.Add(param);

            conn.Open();
            rowsaffected = (int)cmd.ExecuteScalar();
        }
    }

The last piece of the jigsaw, then is this TYPE thing: how do we create that? The answer: go into (I’m using Visual Studio 2015) the “SQL Server Object Explorer” tool (via the [View] menu if it’s not already visible); and open the twisties in the order Databases -> [your database] -> Programmability -> Types -> User-Defined Table Types

Screenshot: SQL-Server User-Defined Table Types menu

Right-click to add a new one, and then define it pretty much as you would a new table. For instance:

CREATE TYPE [dbo].[CustomerType] AS TABLE
(
    [CustomerID]  INT           NOT NULL,
    [FirstName]   NVARCHAR (50) NULL
    [LastName]    NVARCHAR (50) NULL
)

That’s all folks! Or… is it?

Here’s the thing. That MERGE command is enormously powerful, but the information you get back from it is extremely limited: @@ROWCOUNT will tell you the number of rows it touched. But remember our original pattern, which allowed us to feed back to the to user the individual counts of rows added/updated/deleted? Let’s say we need to find a way to do that.

The first thing to fix is that at the moment we are touching ALL existing rows with the FirstName and LastName, even if they are in fact the same, so let’s change that so that we’re only touching the rows we need to touch. That’s actually a simple change to the piece of syntax that deals with the case when we have a match, replacing

  WHEN MATCHED THEN with

  WHEN MATCHED
  AND Customers.FirstName<>NewCustomerList.FirstName
  OR Customers.LastName<>NewCustomersList.LastName
  THEN

This now tells us the real number of necessary updates, in @@ROWCOUNT, but still doesn’t give us the separate counts of updates, insertions and deletions that we desire.

One approach might be to have a separate stored procedure which compared the new and existing data and built some reports on what the MERGE command will do when it’s run. But that’s potentially error-prone, particularly if there’s a chance the data will change in between the two SQL statements being executed.

A better approach involves changing the design of the table slightly to add an extra two columns:

MergeBatchID - char(36) - NULL
MergeAction - char(1) - NULL

Within the SQL procedure we can generate a batch ID as a GUID, thus:

  DECLARE @batchID AS char(36)
  SET @batchID = NEWID()

and then the updates and inserts can set it appropriately:

  UPDATE SET MergeBatchID = @batchID,
                        MergeAction = ‘U'
                        Customers.FirstName = NewCustomerList.FirstName,
                        Customers.LastName = NewCustomerList.LastName and

  INSERT (MergeBatchID, MergeAction, CustomerID, FirstName, LastName)
  VALUES (@batchID, 'C', NewCustomerList.CustomerID, NewCustomerList.FirstName, NewCustomerList.LastName)

Using this we can then have some further SQL within the same stored procedure to build the detail we need. Possibly this would be more efficient done as a GROUP BY query, but for our purposes we’re happy with a couple of quick-running COUNT(*) type queries. Of course, we can’t get the deleted number from the data, because the relevant data has been, um, deleted. But we can derive it from the total returned by @@ROWCOUNT and the Updates/Creates numbers we’ve already extracted:

  DECLARE @Total as int
  DECLARE @Creates as int, @Updates as int
  
  SELECT @Total = @@ROWCOUNT
  SELECT @Creates = Count(*) FROM ClientCompany cc WHERE [email protected] AND MergeAction='C'
  SELECT @Updates = Count(*) FROM ClientCompany cc WHERE [email protected] AND MergeAction='U'

  SELECT @Creates as CreatedCount, @Updates as UpdatedCount, 
         @Total-(@[email protected]) as DeletedCount, @Total as TotalCount

Naturally, the C# code will also need to modified to read the results. But, other than that, there we have it: the MERGE syntax, with full user-friendly stats.

In our production example, which isn’t so different from this, an incremental update of about 10,000 rows of imported CSV takes 1-2 seconds.

node.js marches on

Nothing in the world of programming ever stays still, and, as time moves on things only change faster and faster it seems.

In our world we do a lot of node.js development and we’ve recently been taking stock of where we are.

As a rule of thumb, you want to be running on even number “LTS” (Long Term Support) versions of node.js. As of today (July 2017), version 6 is where most people should be and then in October the default will switch to version 8. The transparency is great, but the change is relentless.

Some legacy code was, until recently running on the node.js 4.x codestream. This is still supported, but it won’t be after April 2018, so it was time to bring it up to date, and a recent security vulnerability was the perfect opportunity to upgrade the codebase.

It’s when you dig back into older code like that, that you realise how much the style of coding has changed in even just the last 3 or 4 years.

Back then, the default way of working with network calls was callbacks. When your code is nice and simple, callbacks make a lot of sense and they’re pretty easy to manage. But imagine working in our world where everything is based around network calls to our API, things quickly become hard to manage.

And this is where promises entered. Our internal standard is to use a package called bluebird which is a really pleasant way of working. Increasingly you’re able to make use of native promises in node.js as more and more support for ES6 is added, but we stick with Bluebird for the moment.

But of course, nothing stays the same. With the advent of node.js 8.x we have support for async/await. As ever, the grass is always greener, but even from a syntax point of view it looks cleaner, and when you’re dealing with large code bases, that’s an important consideration.

So, for the moment, we’re sticking with node.js 6.11.1, but paying close attention to releases as they come from the node.js foundation.

Three productivity tips

Over the years books have been written, blog posts too, all about that mythical “work-life balance” thing, and how one maintains an even keel whilst delivering on ever-growing deadlines and requirements.

How do the most productive people get things done? Well, I think it comes down to some simple, no-nonsense tips which build upon our post about “getting started” from a few weeks back:

Restrict yourself

There are 35 - 40 working hours in a week (unless you’re Mark). Consider reducing that number still further — say to 32, working four eight hour days and then stopping. This really focusses the mind. Funnily enough, when you hurl days, nights and weekends at something, you don’t necessarily get things done any quicker. An hour of laser-sharp focus trumps a day of faffing at the keyboard every time.

Keep it simple

I’ve run a company for a decade now, and since the start have used this baroque piece of accounting software that did everything: payroll, stock… all sorts. Didn’t need 80% of it. Recently I switched to a multi-tab spreadsheet for my accounts instead. Financial stuff gets done in a fraction of the time it took me to plug in my numbers before, and my VAT return details just fall out too: wonderful!

A spreadsheet — no apps, no monthly cloudy subscriptions — just one file for one job.

Plan away from the screen

It’s so easy to sit down at your editor of choice and hack away. Given that many of us are stationery geeks, why not simply grab a favourite pen and notebook, really think about what it is you want to do, and plan. When you reckon you’ve done enough thinking and planning, you probably haven’t. Ponder, gaze out of the window, go for a walk, and after all that write down some more. Then sit at your screen and get your work done.

Using Textract with node.js on Heroku

One of the projects we’re working on at the moment relies heavily on the use of file attachments. Not a problem, LDC Via does that! But we also needed to be able to render some preview text from the contents of the files, so if a PDF, Word document, Excel sheet, or even an image has some text in it, we wanted to be able to show a brief summary of the contents to the user.

Enter the Textract plugin. This is a Python utility which can dig into the contents of files and extract them. It’s also really simple to use in your local environment: you just need to make sure Python’s installed and npm install does the rest for you.

But then we get to the real world. A dev machine is one thing, but when your app needs to be deployed to Heroku things get a little more complex. Typically, your app will be of one type: node.js, Python, Ruby or whatever. But we now need to have an app that contains two different code-bases. Luckily it’s a relatively simple procedure to set it up.

First, you’ll want to configure your project. In our case today, we’re dealing with a node.js app, so we already have our package.json file which defines all our dependencies. However, because we also want to add some Python dependencies, we need to add a new file to the project, which is called requirements.txt:

# This file contains all Python dependencies that are required
# by the Textract package in order for it to properly work.

argcomplete==1.8.2
chardet==2.3.0
python-pptx==0.6.5
#pdfminer.six <-- go back to this after the shebang fix is released (see https://github.com/goulu/pdfminer/issues/27)
https://github.com/goulu/pdfminer/zipball/e6ad15af79a26c31f4e384d8427b375c93b03533#egg=pdfminer.six
docx2txt==0.6
beautifulsoup4==4.5.3
xlrd==1.0.0
EbookLib==0.15
SpeechRecognition==3.6.3
https://github.com/mattgwwalker/msg-extractor/zipball/master
six==1.10.0

Next, we need to add another file called Procfile. This is basically an instruction file for Heroku that tells it what to do when it’s starting your application. Normally, when you only have a single code-base you don’t need the file as Heroku can make some assumptions, but with two different strands of code we need to give some guidance. So, in our case the file will simply look like this:

web: npm start

Once you’ve committed these files to your code repository… nothing will happen. We’ve made all the code changes we need to, but now we need to go to Heroku. You can obviously do all this with the CLI, but for pretty picture’s sake, you’ll want to go to your application settings page and scroll down to the Buildpacks section:

Buildpacks

Click the Add Buildpack button and for this case, choose “Python” from the list of options.

Once that’s added, we can simply re-deploy the application in the normal way and Textract will magically work!

My editor is WebStorm

Header image: WebStorm, logo

Continuing our series where we talk about our preferred code editor

A lot has been said, written, and ranted, about programming editors and IDEs over the years. Allow us to add to the noise.

This week Mark has his rant about WebStorm


Traditionally I have used the big clunky IDEs based on Eclipse such as IBM RAD and My Eclipse However on a recommendation from Ben I switched to IntelliJ IDEA for my Java and Scala work. After that it was but a short step to IntelliJ’s companion IDE, WebStorm

My main criteria in an IDE is not to cost me time or make me scream in anger: WebStorm manages beautifully here. It is light and easy-going, opening and restoring with no fuss, and whilst I’m aware that I don’t use it to its full capacity, a few of the things I love about it are:

  • It does not fight with your source control. Given the long-term nature of a lot of my clients, their source control systems vary hugely over time. In addition, with so many cooks in the mix, you often have to clean up big mistakes — I find a file explorer as the easiest way to do this, so I use Tortoise svn and Tortoise git to deal with my source control. Unlike many other IDEs, WebStorm has no issue with this and doesn’t try to take control, it just keeps me informed of the changes without any set-up.
  • Search results update as you change items. This sounds like a simple thing, but when you are searching for an item across multiple files and are changing each one after a brief investigation, having live search results which keep track of those changes is great.
  • “Context joining” is very, very clever: jumping around files following object and function links works very smoothly, particularly for a non-static language like JavaScript. I find that this speeds up my work a great deal, although it makes you a little lazy on your function size as large functions don’t get confusing anything like as quickly.

Screenshot: Webstorm

As my work tends to involve a wide range of issues, not just code, I use a stack of “secondary” programs too:

  • SOAP UI - Still the best program for messing around with XML-based web services.
  • EditPad Pro - even with all the new contenders out there, on the Windows platform this is still the best text editor.
  • SQuirreL SQL Client - Not the best SQL client per se, but far and away the best one for debugging Java data source issues.
  • Keystore Explorer - This has saved my sanity more times that I can remember when it comes to dealing with complex SSL key issues.

Chrome extensions

  • Restlet - How anyone does any REST services work without this is beyond me.
  • Salesforce Advanced Code - picking up code from existing organisations is a right pain, this code searcher helps hugely.
  • Grammarly - Because both my spelling and grammar suck.

So there you have it: more an overview of my entire toolbox than a simple IDE post!