
Building an AWS Serverless system: Integration testing and accessing Private VPC resources

In this third blog post in my series about an API-based serverless service in AWS, I want to explain how I can test everything from everywhere. I want to test websites, check APIs and peek into VPC-private databases. I want to do the same thing from my local PC and from CI/CD in GitHub. I want to receive a message in Slack if something goes wrong. And all of it should be so easy that a developer only has to think about writing the essence of a test.

How do we test?

First things first: manual testing is not an option. Complex modern systems quickly become unmanageable and untestable manually. It is crucial to have a framework for Integration tests up and running early and place a firm requirement in your Definition of Done to have test coverage added at the development stage.

The list below comes from one of my favorite texts, "Things I believe":

  • I don’t care if you write the tests first, last, or in the middle, but all code must have good tests.
  • Tests should be performed at different levels of the system. Don’t get hung up on what these different levels of tests are called.
  • Absolutely all tests should be automated.
  • Test code should be written and maintained as carefully as production code.
  • Developers should write the tests.
  • Run tests on the production system too, to check it’s doing the right thing.

Testing must be a breeze, and it must be able to cover everything that exists in reality. We usually have a choice between so-called unit tests and integration tests: roughly, the former run against the codebase, for example by importing and testing a single function, while the latter run against several components working together, for example several deployed services in an environment.

I would use unit tests when I have some isolated business logic that I can call with limited mocking. On the other hand, testing a service by mocking its dependencies does not make much sense. Consider a Lambda that orchestrates several other endpoints: mocking all of these would take a lot of time, and there is no guarantee that the mocks will behave like the real services. So instead of writing such unit tests, I create integration tests (a sketch follows after this list), as they:

  • test not only the business logic but also the entire "texture", including network communication and the components involved;
  • require much less mocking to write and maintain;
  • exercise real code and can go all the way to end-to-end system testing, peeking into real databases and third-party services.
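
To make this concrete, here is a minimal sketch of what such an integration test can look like in jest. The endpoint, the API_BASE_URL variable and the axios dependency are illustrative assumptions rather than parts of the actual project; the point is that the test arranges its data through real services, exercises them and cleans up after itself.

// A sketch of an integration test against a deployed API.
// The endpoint, the API_BASE_URL variable and the response shape are hypothetical.
const axios = require("axios");

const baseUrl = process.env.API_BASE_URL; // e.g. the API Gateway URL of the environment under test

describe("GET /items", () => {
    let createdId;

    beforeAll(async () => {
        // Arrange: create test data through the real API
        const response = await axios.post(`${baseUrl}/items`, { name: "integration-test-item" });
        createdId = response.data.id;
    });

    afterAll(async () => {
        // Clean up so the environment stays tidy
        await axios.delete(`${baseUrl}/items/${createdId}`);
    });

    it("returns the item that was just created", async () => {
        const response = await axios.get(`${baseUrl}/items/${createdId}`);
        expect(response.status).toBe(200);
        expect(response.data.name).toBe("integration-test-item");
    });
});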

I confess I do not find test writing particularly exciting work. And boring work is often pushed back and procrastinated on, which does not help the stability of the system. Therefore, we should give developers a way to make test writing easy, so they can execute all testing operations with a simple setup and test exactly what needs to be tested, not something that merely resembles it.

Being a good CI/CD "citizen", a properly written test should set up and tear down the vast majority of its data. It is also a good idea to have a scheduled cleanup job that purges the databases tests write to, removing data left behind by tests that failed before a proper teardown.
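
Such a cleanup might look roughly like the following script, run on a schedule. The table name, the naming convention for test records and the one-day retention window are hypothetical, and the pg client together with the POSTGRES_CONNSTRING variable (which appears again later in this post) are assumed.

// A hypothetical scheduled cleanup script for data left behind by failed tests.
// Table, column and prefix are made up; adjust to your own schema.
const { Client } = require("pg");

async function purgeStaleTestData() {
    const client = new Client({ connectionString: process.env.POSTGRES_CONNSTRING });
    await client.connect();
    try {
        const result = await client.query(
            "DELETE FROM items WHERE name LIKE 'integration-test-%' AND created_at < now() - interval '1 day'"
        );
        console.log(`Purged ${result.rowCount} stale test rows`);
    } finally {
        await client.end();
    }
}

purgeStaleTestData().catch((err) => {
    console.error("Cleanup failed:", err);
    process.exit(1);
});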

It is a best practice in cloud architecture to hide databases in a private VPC subnet, so they have no public endpoint and therefore cannot be reached directly from the public Internet. Yet we still want to run all tests both from a CI/CD worker in GitHub Actions and from a local machine on the public Internet. Below, I explain how to set this up using authorized AWS connections.

As a side note, AWS SAM offers a way to test a Lambda locally against a dummy event, but this felt impractical for us, so we do not use this approach for development and testing.

What do we test?

The most common set includes:

  • my APIs
  • my Database entries
  • my website pages

Here is the blueprint of a system that I describe in more detail below:

Testing Architecture in AWS

In the image above, the "Identity provider" is an example of a typical third-party dependency: tests use it to obtain the tokens that the system requires.

For the testing stack I chose:

  • NodeJS, as it offers a great variety of npm packages and therefore makes it easy to borrow functionality for my tests, should I need to call a URL, visit a database or decode a JWT;
  • jest test runner;
  • jest-cucumber to be able to write Gherkin "features" in addition to native jest tests;
  • playwright to be able to test website pages.

Also, the "test report" is generated in a junit format so it is viewable later in GitHub. In my opinion, this toolkit fulfills most of the usual needs both for website side testing as well as for API layer testing.

I like to write my tests as Cucumber definitions, so the first file I create is a feature that looks like this:

Feature: Really Useful Data Retrieval

  To check if user can retrieve data

  Scenario Outline: A <UserType> user can retrieve data
    Given The <UserType> user is authorized
    When User requests data
    Then User <UserType> data is received correctly
    Examples:
      | UserType   |
      | Usertype_1 |
      | Usertype_2 |

Using jest-cucumber, I convert the lines found in the feature into jest test code that creates test data, calls APIs or interacts with web pages using jest-playwright.
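
As a rough illustration, the step definitions for the feature above might look like this, using jest-cucumber's loadFeature and defineFeature API. The feature file path, the authorize and requestData helpers and the response shape are hypothetical placeholders for the real test utilities.

// A sketch of step definitions for the feature above.
// The helpers and the feature file path are hypothetical.
const { loadFeature, defineFeature } = require("jest-cucumber");
const { authorize, requestData } = require("./helpers"); // hypothetical test utilities

const feature = loadFeature("./features/dataRetrieval.feature");

defineFeature(feature, (test) => {
    test("A <UserType> user can retrieve data", ({ given, when, then }) => {
        let token;
        let response;

        given(/^The (.+) user is authorized$/, async (userType) => {
            // e.g. obtain a token from the identity provider for this user type
            token = await authorize(userType);
        });

        when("User requests data", async () => {
            response = await requestData(token);
        });

        then(/^User (.+) data is received correctly$/, (userType) => {
            expect(response.status).toBe(200);
            expect(response.data.userType).toBe(userType);
        });
    });
});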

I stress that it is still possible to mix in native jest tests in case Gherkin features are not appropriate.

I also check the contents of a database in these tests. Interestingly, the Gherkin authors claim: "While it might be tempting to implement Then steps to look in the database - resist that temptation! You should only verify an outcome that is observable for the user (or external system), and changes to a database are usually not." To me, this is a very controversial claim that I cannot agree with. Following the YAGNI principle, we create only entities that matter, and databases are among them. I would say that I use Gherkin mostly as a nicer way to express the classical "Arrange, Act, Assert" test structure.
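
For completeness, a database assertion in such a test might look roughly like the following plain jest test. The table and column names and the TEST_REQUEST_ID variable are made up for the sketch, and the pg client with the POSTGRES_CONNSTRING connection string is assumed.

// A hypothetical example of peeking into a Postgres database from a test.
const { Client } = require("pg");

it("stores the processed request in the database", async () => {
    const client = new Client({ connectionString: process.env.POSTGRES_CONNSTRING });
    await client.connect();
    try {
        const { rows } = await client.query(
            "SELECT status FROM requests WHERE id = $1",
            [process.env.TEST_REQUEST_ID] // id of a request created earlier in the test run
        );
        expect(rows).toHaveLength(1);
        expect(rows[0].status).toBe("COMPLETED");
    } finally {
        await client.end();
    }
});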

JUnit test results

It is very handy to have a pretty GitHub page that shows your test results. It can be achieved in the following way:

  1. We configure our tests to report in the JUnit format in jest.config.js:

    module.exports = {
        ...
        reporters: [
            "default",
            ["jest-junit", {
                "outputDirectory": ".",
                "outputName": "jest-junit.xml",
            }]
        ],
        ...
    };
  2. In our GitHub CI/CD workflow, we attach a step that creates the desired report using the dorny/test-reporter Action:

    - name: Test Report
      uses: dorny/test-reporter@v1
      if: always()    # run this step even if previous step failed
      with:
        name: Integration tests 
        path: jest*.xml    # Path to test results
        reporter: jest-junit        # Format of test results

If you run Integration tests as part of Pull Request checks, the respective report will be added to the Run page; in all cases, you can recover the report by peeking into the action's log and looking for the Check run HTML line:

Creating test report Integration tests
Processing test results from jest-junit.xml
...
Check run HTML: https://github.com/<repo>/runs/12345678

Here's how it looks (from dorny/test-reporter): Example from dorny's GitHub

Accessing Private Resources from tests in CI/CD

Even though some resources stay inside private subnets, we want our tests to be able to peek into everything, so the testing is robust. With GitHub Actions, we have the option to create a custom Runner, a virtual machine that we can place in our VPC. It has to have the correct Security Group settings so it can communicate with GitHub over HTTPS and with the desired databases over their respective ports. This approach is implemented by the machulav/ec2-github-runner Action: it creates a Runner that is started before the job executes and terminated right after, so cloud resources are only used while a job is underway. The extra startup and teardown time is just a couple of minutes, so it does not add a significant delay.

As a parameter to the Action, we need an AMI image that will be spun up as a Runner. To make it, I used a default Amazon Linux image as a base and additionally installed node (using this AWS recipe), docker (using the ec2-github-runner README info) and the dependencies required for Chrome from this document (CentOS part); the latter are needed for playwright to work correctly. The AMI ID is stored in a GitHub Actions Secret and later passed to the Workflow. To avoid confusion: the "jump host" discussed in the "Accessing Private Resources from local PC" section and this GitHub Runner host are separate entities.

Other parameters for the Action include the VPC's Public Subnet ID to place the instance in and a Security Group with the correct permissions.

My GitHub job to run the tests follows the example from the ec2-github-runner README:

integration-tests:
    name: Run Integration Tests
    needs: start-runner # required to start the main job when the runner is ready
    runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
    continue-on-error: true # Continue to destroy the runner if the job has failed
    env:
      MY_TEST_PARAMETER_ENVIRONMENT_VARIABLE: ${{ secrets.MY_SECRET_WITH_TEST_PARAMETER }}     
    steps:
      - uses: actions/checkout@v2
      - name: Use Node.js 14.x
        uses: actions/setup-node@v1
        with:
          node-version: 14.x
      - run: npm install
      - run: npm test
      - name: Test Report 
        uses: dorny/test-reporter@v1
        if: always()    # run this step even if previous step failed
        with:
          name: Integration tests results 
          path: jest*.xml    
          reporter: jest-junit    

By default, the remaining Jobs in a Workflow stop being executed if a Job fails. Tests are bound to fail sometimes, therefore we use the option continue-on-error: true for the job that runs on the Runner: if the tests fail, the Runner instance is still destroyed and we do not keep incurring charges for a resource that is no longer used.

A final remark: should you want to replace the AMI, remember to deregister old AMIs when creating new ones.

Failing the Workflow and reporting to Slack

Since we used continue-on-error: true for the integration-tests job, the workflow would remain "green" even if that job fails, and we do not want that; we want to be notified that tests have failed. Hence, after the Runner is destroyed, we need to re-check the status of the job and "fail" the Workflow. This is done using yet another Action, technote-space/workflow-conclusion-action, which is green only if all jobs are green. In the snippet below, it is taken care of by the Fail if Integration tests are red step.

A failed Workflow might not be noticed instantly by developers, while it is very handy to let them know early that there is a problem with the tests. The step called Notify Slack therefore shows how the result of the job can also be reported to Slack:

name: Test Status -> Workflow Status
needs:
  # First we run tests and delete the Runner
  - start-runner
  - integration-tests
  - stop-runner
runs-on: ubuntu-latest
if: always()
steps:
  - name: Get current date
    id: date
    run: echo "::set-output name=date::$(date +'%Y-%m-%d %H:%M')"
  - uses: technote-space/workflow-conclusion-action@v2
  - name: Notify Slack
    # if: env.WORKFLOW_CONCLUSION == 'failure' can be added if we only want to 
    # report failures in Slack, to reduce noise.
    uses: 8398a7/action-slack@v3
    with:
      status: custom
      fields: all
      custom_payload: |
        {
          attachments: [{
            color:  process.env.WORKFLOW_CONCLUSION == 'failure'? 'danger': 'good',
            text: `${process.env.AS_REPO}: ${process.env.AS_WORKFLOW} ${process.env.WORKFLOW_CONCLUSION == 'failure'? '*failed*': 'succeeded'} at ${{ steps.date.outputs.date }}.`,
          }]
        }
    env:
      SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

  - name: Fail if Integration tests are red
    if: env.WORKFLOW_CONCLUSION == 'failure'
    run: exit 1

Accessing Private Resources from local PC

When resources are located in a private VPC subnet, it might sound counterintuitive to access them from the Internet, but it is possible over authenticated channels. Namely, SSM offers a way to establish a session to any instance where an "SSM Agent" is running, even if it is in a private subnet. Therefore, we can do the following:

  1. We launch a small Amazon Linux instance in a private VPC subnet, so no one except us can reach it from outside; we hold on to the private key for the instance.
  2. The vanilla instance usually does not have the SSM Agent configured. While the setup of the Agent is described in Session manager Steps for local machine and the instance and in this AWS blog post, I found these manuals hard to follow and struggled to produce a working setup. However, AWS offers an easy way to configure the instance: in the AWS Console, go to Systems Manager > Quick Setup, click Create, choose Host management, then under Choose how you want to target instances select manual, specify the Instance ID, and let the configuration run and complete its steps. To verify that the SSM connection to the instance works, go to your list of EC2 instances, select the desired instance, click "Connect" and choose "Session manager" (the tab should NOT display any errors saying the connection is impossible), then click "Connect" again and see a command prompt appear.
  3. After this, using the aws ssm command line on a local machine, we can establish a connection to the host itself. However, this connection does not support port forwarding to other VPC instances. To work around this, we issue an ssh tunneling command as discussed in this Reddit post and this manual: we connect to the instance using SSM, and over that connection we establish ssh tunnels to the databases. If we were doing things by hand on the command line, the call would look like this (the parameters are shown in square brackets):

     #> aws configure
     #> ssh -i [my-secret-key.pem] ec2-user@[i-instanceid] -L 1024:my-rds-postgres-url:5432  -o ProxyCommand="aws ssm start-session --document-name AWS-StartSSHSession --target  %h --parameters portNumber=%p"

The -L option establishes the tunnel, and you can supply several such options to a single ssh command. The command above forwards localhost:1024 to my-rds-postgres-url:5432.

Enable global setup and teardown

Everything should be automated as much as possible, so with the same jest command I would like to establish the tunnel, rewrite the URLs of private resources to point them at the tunnel, and run the tests as usual. In my case, the relevant code checks for tunnel-related environment variables: VPC_JUMPHOST_INSTANCEID (the instance ID used for the tunneling operation) and VPC_JUMPHOST_PEMKEY (the PEM key assigned to the instance when it was created).

The jest test framework offers a way to specify global setup and teardown files, which fits perfectly for establishing and dismantling tunnels. For the following piece of code to work, remember to run aws configure to authorize against AWS; the credentials will be stored in a local file and used by the AWS CLI command that the snippet invokes. The aws CLI (with its Session Manager plugin) and ssh should be available on the command line.

To run global setup and teardown, use the following lines in jest.config.js:

module.exports = {
    ...
    globalSetup: "./tests/globalSetup.js",
    globalTeardown: "./tests/globalTeardown.js"

};

Create tunnel before running tests and dismantle it after tests are finished

./tests/globalSetup.js contains the following code. It uses the tunnelForwardingData structure to list the environment variables holding connection strings that the script needs to rewrite; the host and port are extracted from each connection string by parsing it with the URL class.

const {exit} = require("shelljs");
const {exec} = require("child_process");

// Connection strings held in these environment variables will be rewritten
// to point at the local tunnel endpoints.
const tunnelForwardingData = [
    {
        envVariable: "POSTGRES_CONNSTRING" // a URI-formatted connection string, such as "postgresql://postgres:password@mycluster-somenumber.eu-west-1.rds.amazonaws.com:5432"
    }
]
module.exports = async () => {
    console.log("This is Global Setup. Checking environment and creating tunnel to VPC resources for local testing.");
    const VPC_PEMKEY = "VPC_JUMPHOST_PEMKEY";
    const pemKey = process.env[VPC_PEMKEY] || "";
    const VPC_INSTANCE_ID = "VPC_JUMPHOST_INSTANCEID";
    const instanceId = process.env[VPC_INSTANCE_ID] || "";
    const SSH_EXECUTABLE = "SSH_EXECUTABLE";
    const sshExecutable = process.env[SSH_EXECUTABLE] || "ssh"

    // if the variables are not set, continue without 
    // tunnel (CI/CD case, where we run tests inside VPC GitHub Runner)
    if (!pemKey || !instanceId) {
        console.log("Global Setup: " + VPC_PEMKEY + " and/or " + VPC_INSTANCE_ID + " are not set. Not starting tunnel to VPC resources.");
        return;
    }
    console.log(`Global Setup: Tunnel variables set. Will create SSM/SSH tunnel to VPC.\n${VPC_PEMKEY} = ${pemKey}\n${VPC_INSTANCE_ID} = ${instanceId}\n${SSH_EXECUTABLE} = ${sshExecutable}`);

    let sshTunnelCommand = "";  
    let localPort = 1024;
    tunnelForwardingData.forEach(tunnelItem => {     
        let connStringInUriFormat = process.env[tunnelItem.envVariable] || ""
        const localHost = "localhost:" + localPort;
        const dbUrl = new URL(connStringInUriFormat);
        const remoteHost = dbUrl.host; // host and port
        const remotePort = dbUrl.port;
        connStringInUriFormat = connStringInUriFormat.replace(remoteHost, localHost);
        sshTunnelCommand += ` -L ${localPort}:${dbUrl.hostname}:${remotePort}`;
        process.env[tunnelItem.envVariable] = connStringInUriFormat;
        localPort++; // for next cycle
    })

    // Creating child process with tunnel
    const command = `${sshExecutable} -i ${pemKey} ec2-user@${instanceId} ${sshTunnelCommand} -o ProxyCommand="aws ssm start-session --document-name AWS-StartSSHSession --target  %h --parameters portNumber=%p"`
    global.tunnelProcess = exec(command);

    // One can handle various events to get more information on screen.
    global.tunnelProcess.on('close', (code) => {
        console.log(`Tunnel: child process close all stdio with code ${code}`);
    });

    global.tunnelProcess.on('exit', (code) => {
        console.log(`Tunnel: child process exited with code ${code}.`);
        if (code > 0) {
            console.log("Tunnel command resulted in an error. Please check configuration variables.")
            exit(1)
        }
    });
};

The child process is created and stored in the global.tunnelProcess object, which is available at the teardown stage. So, when the tests have finished running, ./tests/globalTeardown.js is called:

module.exports = async () => {
    console.log("This is Global Teardown. Checking environment and deregisterting tunnel to VPC resources for local testing.");
    if (global.tunnelProcess) {
        console.log("Global Teardown: Found an active tunnel process, dismantling.");
        global.tunnelProcess.kill("SIGKILL");
    }
};

Conclusion

I would like to thank everyone who devoted some time to reading my blog posts; feel free to email me with any comments and suggestions.

Author

  • Portrait of Askar Ibragimov
    Askar Ibragimov
    Cloud Architect and Senior Developer