Elsevier Tech: Writing a native Java Lambda using Quarkus

July 27, 2023

By Neil Stevens

Photo of a woman at a computer with multiple screens displaying code (© istock.com/MTStock Studio)

An Elsevier software engineer writes about how to use Quarkus to improve Java functions with AWS Lambda

At Elsevier, we use Java a lot for our backend software. However, one weak point of Java is that it is not very well suited to implementing AWS Lambda functions. The reason is that a JVM (Java Virtual Machine) must start before the Lambda code can run, which makes cold starts slow. It’s fairly common for teams to periodically call their Java Lambda function in an attempt to keep instances alive. However, this is not ideal since it results in additional needless invocations of the Lambda.

Quarkus is an open-source Java framework designed to improve Java performance in both Kubernetes and serverless applications. One way in which it achieves this is by taking advantage of GraalVM's native compilation support, allowing developers to write Lambdas in Java that are compiled into standalone binary applications.

On the web I was able to find a fair number of examples of how to write a very simple hello-world style native Lambda but struggled to find anything much more complex. Most of the examples also use Apache Maven as their Java build tool, while the three squads I’ve worked on since joining Elsevier have all used Gradle.

I thought this would be an interesting topic to explore using development time and something to consider using for production code, so I have written this summary. A good way to investigate this would be to write a native Java Lambda for a typical task similar to what we have used Lambdas for in the past while working at Elsevier and then compare the performance of the native Lambda to an equivalent Lambda running on a JVM.

The problem

At Elsevier, we sometimes need to process CSV files that get updated periodically. Not long ago, I had to write a Lambda that would process a large CSV file every 24 hours and add some of the data from it to a DynamoDB table. A Lambda was well suited to this task because it would only run once per day and took a few minutes to complete.

To represent the above with a slightly more generic problem, I looked on Kaggle for some interesting CSV data to do some similar processing on. Eventually I found Rock & Metal Singles Chart Top 40 2003-2018 (from the United States). I wrote a Lambda to look over this CSV, pull out data about a few of my favorite bands and save it to a DynamoDB table. (Note: I filtered on a very small number of bands so only a small amount of data would be written to DynamoDB since I wanted to avoid DynamoDB throttling my write requests.)

The input CSV file contains records in the following format:

[entry number],[chart position],[artist name],[song title],[chart date],[year]

An example row is:

"3,4,STEREOPHONICS,MADAME HELGA,20030525,2003"

I thought it would be interesting to write result entries to DynamoDB with the following attributes:

[artist name]-[song title] (used as the hash key)
date (used as the sort key)
chart position

Having the data in this format means I can efficiently search for my favorite songs in DynamoDB by retrieving all entries with the hash key "[artist name]-[song title]".
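As a rough illustration of this key layout, the sketch below derives the three attributes from the example row shown earlier. It uses only the JDK; the class name ChartKeys and the naive comma split (which would not cope with quoted fields) are assumptions made for the example, not part of the attached implementation.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: derives the DynamoDB attributes described above from one CSV row.
// Column order: entry number, chart position, artist name, song title, chart date, year.
final class ChartKeys {

    /** Hash key in the form "[artist name]-[song title]". */
    static String hashKey(String csvRow) {
        String[] cols = csvRow.split(","); // naive split: assumes no quoted commas
        return cols[2] + "-" + cols[3];
    }

    /** Sort key: the chart date, parsed from the yyyyMMdd column. */
    static LocalDate sortKey(String csvRow) {
        String[] cols = csvRow.split(",");
        return LocalDate.parse(cols[4], DateTimeFormatter.BASIC_ISO_DATE);
    }

    /** Plain attribute: the chart position. */
    static int chartPosition(String csvRow) {
        return Integer.parseInt(csvRow.split(",")[1]);
    }

    public static void main(String[] args) {
        String row = "3,4,STEREOPHONICS,MADAME HELGA,20030525,2003";
        System.out.println(hashKey(row));       // STEREOPHONICS-MADAME HELGA
        System.out.println(sortKey(row));       // 2003-05-25
        System.out.println(chartPosition(row)); // 4
    }
}
```

Because every chart appearance of a song shares one hash key, a single query on that key returns the song's whole chart history, ordered by date.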

The implementation

The complete implementation is attached in this Zip file:

Download the Zip file

I'll explain a few interesting parts of the implementation that relate to problems I later encountered.

I have written a simple POJO (with the help of Lombok) to hold the necessary information:

    @Value
    public class ChartEntry {
        String artist;
        String song;
        LocalDate date;
        int position;
    }

A ChartEntryReader interface:

    public interface ChartEntryReader extends AutoCloseable {
        Stream<ChartEntry> getStream();
        void close();
    }

And a ChartEntryWriter interface:

    public interface ChartEntryWriter extends AutoCloseable {
        void write(ChartEntry chartEntry);
        void close();
    }

(Note: I extended the AutoCloseable interface on these so I could use them within a try-with-resources block without worrying about closing them myself.)

My main implementation logic is:

    chartEntryReader.getStream()
        .filter(chartEntry -> FILTERED_ARTISTS.contains(chartEntry.getArtist()))
        .forEach(chartEntryWriter::write);
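To show how the reader, writer and filter fit together inside a try-with-resources block, here is a minimal self-contained sketch. The in-memory reader and writer, the simplified ChartEntry class and the contents of FILTERED_ARTISTS are all assumptions for illustration; the real implementations in the Zip file read from S3 and write to DynamoDB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

final class ChartFilterSketch {

    // Simplified stand-in for the Lombok-based ChartEntry (artist and song only).
    static final class ChartEntry {
        private final String artist;
        private final String song;

        ChartEntry(String artist, String song) {
            this.artist = artist;
            this.song = song;
        }

        String getArtist() { return artist; }
        String getSong() { return song; }
    }

    interface ChartEntryReader extends AutoCloseable {
        Stream<ChartEntry> getStream();
        @Override void close(); // redeclared without "throws Exception"
    }

    interface ChartEntryWriter extends AutoCloseable {
        void write(ChartEntry chartEntry);
        @Override void close();
    }

    // Hypothetical filter set; the band used here is just the one from the example row.
    static final Set<String> FILTERED_ARTISTS = Set.of("STEREOPHONICS");

    /** Runs the filter pipeline over in-memory data and returns a log of what happened. */
    static List<String> run(List<ChartEntry> input) {
        List<String> log = new ArrayList<>();
        ChartEntryReader reader = new ChartEntryReader() {
            @Override public Stream<ChartEntry> getStream() { return input.stream(); }
            @Override public void close() { log.add("reader closed"); }
        };
        ChartEntryWriter writer = new ChartEntryWriter() {
            @Override public void write(ChartEntry entry) { log.add("write " + entry.getSong()); }
            @Override public void close() { log.add("writer closed"); }
        };
        // Both resources are closed automatically, in reverse order of declaration.
        try (ChartEntryReader r = reader; ChartEntryWriter w = writer) {
            r.getStream()
                    .filter(entry -> FILTERED_ARTISTS.contains(entry.getArtist()))
                    .forEach(w::write);
        }
        return log;
    }

    public static void main(String[] args) {
        List<ChartEntry> input = List.of(
                new ChartEntry("STEREOPHONICS", "MADAME HELGA"),
                new ChartEntry("SOME OTHER BAND", "SOME OTHER SONG"));
        System.out.println(run(input)); // [write MADAME HELGA, writer closed, reader closed]
    }
}
```

Redeclaring close() without a checked exception is what lets the try block compile without a catch clause.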

I could have read the entire file up-front, but if I had been dealing with a huge file, this could have caused an OutOfMemoryError. While I was not using a huge file, I decided to pretend I was. For the reader side I chose the OpenCSV library, as it supports parsing a CSV file as a Stream while reading it. OpenCSV maps columns using Java annotations:

    @Data
    public static class CsvChartEntry {
        @CsvBindByName(column = "Position")
        int position;

        @CsvBindByName(column = "Artist")
        String artist;

        @CsvBindByName(column = "Song")
        String song;

        @CsvBindByName(column = "ChartDate")
        String date;
    }
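The idea of binding columns by header name while streaming rows can be sketched with the JDK alone. This is a stand-in for what OpenCSV does, not its API: the class LazyCsvSketch is hypothetical, the header names "Id" and "Year" in the sample data are guesses (only Position, Artist, Song and ChartDate appear in the annotations above), and the comma split ignores quoting.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class LazyCsvSketch {

    /**
     * Returns the songs by the given artist. Rows are pulled from the reader one at a
     * time, so only the matching songs (not the whole file) are held in memory.
     * Assumes the input has a header line.
     */
    static List<String> songsBy(String artist, Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            // Look up column positions by header name, similar in spirit to @CsvBindByName.
            List<String> header = Arrays.asList(reader.readLine().split(","));
            int artistCol = header.indexOf("Artist");
            int songCol = header.indexOf("Song");
            return reader.lines() // lazy: lines are read as the stream is consumed
                    .map(line -> line.split(","))
                    .filter(cols -> cols[artistCol].equals(artist))
                    .map(cols -> cols[songCol])
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String csv = "Id,Position,Artist,Song,ChartDate,Year\n"
                + "3,4,STEREOPHONICS,MADAME HELGA,20030525,2003\n"
                + "4,5,SOME OTHER BAND,SOME OTHER SONG,20030525,2003\n";
        System.out.println(songsBy("STEREOPHONICS", new StringReader(csv))); // [MADAME HELGA]
    }
}
```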

The build files

My Gradle build file is fairly simple:

    plugins {
        id 'java'
        id 'io.quarkus' version "3.0.1.Final"
    }

    group 'org.example'
    version '1.0-SNAPSHOT'

    repositories {
        mavenCentral()
    }

    dependencies {
        implementation enforcedPlatform('io.quarkus.platform:quarkus-bom:3.0.1.Final')
        implementation 'io.quarkus:quarkus-amazon-lambda'
        implementation enforcedPlatform('io.quarkus.platform:quarkus-amazon-services-bom:3.0.1.Final')
        implementation 'io.quarkiverse.amazonservices:quarkus-amazon-dynamodb'
        implementation 'io.quarkiverse.amazonservices:quarkus-amazon-s3'
        implementation 'software.amazon.awssdk:url-connection-client:2.20.55'
        implementation 'org.jboss.logging:commons-logging-jboss-logging:1.0.0.Final'
        compileOnly 'org.projectlombok:lombok:1.18.26'
        annotationProcessor 'org.projectlombok:lombok:1.18.26'
        implementation 'com.opencsv:opencsv:5.7.1'
    }

When creating a binary, it should contain the smallest amount of code possible. Therefore, only classes that are used will be included, and only the methods and fields of those classes that are accessed will be included. Modern Java libraries often use reflection, which doesn't mix well with this approach. Because the Quarkus team is aware of this, they provide Quarkus extensions (https://quarkus.io/extensions/), which are essentially copies of common Java libraries that may have been modified and are thoroughly tested to work with Quarkus. The following dependencies are all Quarkus extensions and are used in exactly the same way as the standard AWS SDK libraries:

io.quarkus:quarkus-amazon-lambda
io.quarkiverse.amazonservices:quarkus-amazon-dynamodb
io.quarkiverse.amazonservices:quarkus-amazon-s3

The next two dependencies are required to avoid ClassNotFoundExceptions. Both are related to known issues with Quarkus:

software.amazon.awssdk:url-connection-client
org.jboss.logging:commons-logging-jboss-logging

The following is also needed in the gradle.properties file:

    quarkus.package.type=native
    quarkus.native.container-build=true
    quarkus.native.additional-build-args=-H\:IncludeResourceBundles\=opencsv,-H\:ReflectionConfigurationFiles\=reflectconfig.json

The first property is needed to produce a native build (since Quarkus can also be used to generate normal Java bytecode). The second property causes a GraalVM container to be used to build the native binary. This is the recommended way to generate a native binary according to the Quarkus website.

The third property passes build arguments to GraalVM. The first of these is "-H\:IncludeResourceBundles\=opencsv". As mentioned above, when compiling to a binary, only the required parts of each dependency are included. The compiler cannot determine whether anything in resources is required, so it assumes nothing is; however, we can force a dependency's resources (here the resource bundle named opencsv) to be included by using this option. (The functionality I use from OpenCSV depends on data in its resources directory.)

The second build argument, "-H\:ReflectionConfigurationFiles\=reflectconfig.json", references a JSON file listing classes, methods and fields that must be included in the build. Fields and methods invoked via reflection can be listed in this file, instructing GraalVM to include them in the native build even though it hasn't found any code directly invoking them.

The first element in the reflectconfig.json file is my CsvChartEntry, which is only invoked by OpenCSV using reflection:

    {
      "name" : "com.elsevier.example.music.quarkus.codec.CsvChartEntryDecoder$CsvChartEntry",
      "allDeclaredConstructors" : true,
      "allDeclaredMethods" : true,
      "allDeclaredFields" : true
    }
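To see why the compiler cannot find these usages on its own, consider the kind of code a binding library might run internally. The sketch below is a hypothetical, much-simplified binder (it is not OpenCSV's implementation): the constructor and fields of Row are reached only through names held in strings, so static analysis finds no direct reference to them.

```java
import java.lang.reflect.Field;
import java.util.Map;

final class ReflectiveBindingSketch {

    // Stand-in for CsvChartEntry: nothing in this file calls its constructor or
    // touches its fields directly.
    static class Row {
        int position;
        String artist;
    }

    /** Creates a bean and populates its fields by name, using reflection only. */
    static <T> T bind(Class<T> type, Map<String, String> values) {
        try {
            T bean = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, String> entry : values.entrySet()) {
                Field field = type.getDeclaredField(entry.getKey());
                field.setAccessible(true);
                if (field.getType() == int.class) {
                    field.setInt(bean, Integer.parseInt(entry.getValue()));
                } else {
                    field.set(bean, entry.getValue());
                }
            }
            return bean;
        } catch (ReflectiveOperationException e) {
            // In a native image, this is roughly where a missing reflection entry would surface.
            throw new IllegalStateException("Cannot bind " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Row row = bind(Row.class, Map.of("position", "4", "artist", "STEREOPHONICS"));
        System.out.println(row.position + " " + row.artist); // 4 STEREOPHONICS
    }
}
```

On a JVM this works without configuration because every class member is available at run time; in a native build the same members exist only if something, such as the reflection configuration file, tells GraalVM to keep them.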

There are also a number of array types specified, such as:

    {
      "name" : "[Ljava.lang.Double;",
      "allDeclaredConstructors" : true
    }

The name "[Ljava.lang.Double;" is, confusingly, the class name used for "Double[]" internally within Java. These elements are required because of the implementation of class org.apache.commons.beanutils.ConvertUtilsBean (part of apache-commons-beanutils, which is a dependency of OpenCSV). This class contains static initialization code that creates a number of arrays using reflection. This is a good example of the unexpected complexities caused by using third-party libraries with native compilation.
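The array naming can be checked with a few lines of plain Java. The sketch below prints the internal name of Double[] and creates an array through java.lang.reflect.Array, which is the general kind of reflective array creation described above; the class name ArrayClassNames is made up for the example, and the exact calls made inside ConvertUtilsBean are not reproduced here.

```java
import java.lang.reflect.Array;

final class ArrayClassNames {

    /** Internal (binary) name of the Double[] class. */
    static String doubleArrayName() {
        return Double[].class.getName();
    }

    /** Creates an array reflectively, with no "new Double[...]" visible to static analysis. */
    static Object newArray(Class<?> componentType, int length) {
        return Array.newInstance(componentType, length);
    }

    /** Looks up an array class from its internal name. */
    static Class<?> lookup(String internalName) {
        try {
            return Class.forName(internalName);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(doubleArrayName());                                 // [Ljava.lang.Double;
        System.out.println(newArray(Double.class, 0) instanceof Double[]);     // true
        System.out.println(lookup("[Ljava.lang.Double;") == Double[].class);   // true
    }
}
```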

Building and deploying the project

Since the Gradle wrapper is used, the only dependencies needed to build are:

Java SDK
Docker

You can build and deploy the software by downloading and unzipping the attached code and then executing the following:

    cd quarkus-experiment
    ./gradlew build

Deploying should be as simple as creating the S3 bucket, DynamoDB table and IAM role with the appropriate permissions (via the AWS web console) and then executing the following commands after logging in to the AWS CLI:

    export LAMBDA_ROLE_ARN=[IAM-ROLE-ARN]
    build/manage.sh native create

However, I found I needed to make a small edit to the script (probably because I was building in Git Bash on Windows). I changed the line:

ZIP_FILE=fileb:/// [ABSOLUTE-PATH]build/function.zip

to:

ZIP_FILE=fileb://build/function.zip

Performance comparisons

The following times are averages of three runs with the Lambda configured with 256 MB of RAM. The values are the "Billed duration" (initialization time plus execution duration):

JVM (cold start)	Native (cold start)	JVM (warm)	Native (warm)
54.3 seconds	11.2 seconds	3.5 seconds	3.4 seconds
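To put these figures in proportion, the short calculation below derives the speed-up ratios and a rough estimate of start-up overhead (cold minus warm) from the four averages in the table. Treating the cold/warm difference as initialization cost is an approximation made for this example, not something measured separately.

```java
import java.util.Locale;

final class ColdStartMath {

    /** How many times faster the native Lambda is than the JVM one. */
    static double speedup(double jvmSeconds, double nativeSeconds) {
        return jvmSeconds / nativeSeconds;
    }

    /** Rough start-up overhead: cold-start time minus warm time. */
    static double startupOverhead(double coldSeconds, double warmSeconds) {
        return coldSeconds - warmSeconds;
    }

    public static void main(String[] args) {
        double jvmCold = 54.3, nativeCold = 11.2, jvmWarm = 3.5, nativeWarm = 3.4;

        System.out.println(String.format(Locale.ROOT,
                "Cold-start speed-up: %.1fx", speedup(jvmCold, nativeCold)));            // 4.8x
        System.out.println(String.format(Locale.ROOT,
                "Warm speed-up: %.2fx", speedup(jvmWarm, nativeWarm)));                  // 1.03x
        System.out.println(String.format(Locale.ROOT,
                "JVM start-up overhead: %.1f s", startupOverhead(jvmCold, jvmWarm)));    // 50.8 s
        System.out.println(String.format(Locale.ROOT,
                "Native start-up overhead: %.1f s", startupOverhead(nativeCold, nativeWarm))); // 7.8 s
    }
}
```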

Conclusions

Quarkus binary Lambdas are significantly faster than normal Java Lambdas on a cold start; however they provide very similar performance for my example when both Lambdas are warm. Native Lambdas come with the following drawbacks:

Native compilation using a container is slow (over 2 minutes on my machine when the Docker build image has already been downloaded)
Behavior of native applications may differ from the Java bytecode version (which is a problem if unit tests are executed on the bytecode version)
Using third-party libraries which are not directly supported as Quarkus extensions can be problematic (classes/fields/methods internally referenced via reflection may change when dependencies are updated, requiring modifications to the reflection-config file)

Complete implementation code

Download the Zip file
