Node.js is terrible for data processing pipelines

Data processing, the kind we do at blockpulsar.com, usually comes down to a function that loads data in a specific format, transforms it into a different structure, and returns it. The logic is pretty simple. Sometimes it is so simple that you assume it will be fast, because all it does is rename a field or remove a few of them. In practice, all of those operations are synchronous, and that gets quite CPU intensive once you run many data transformations at the same time.

Node.js is quite good at handling async tasks because JavaScript’s Event Loop keeps executing other operations while it waits for an async function to return a result. For example, when your Node.js server makes an async query to a database, the application can keep running other parts of your code while it waits for the response. That same model is exactly what makes heavy synchronous work so challenging: each sync operation blocks the entire process, and even async callbacks cannot make progress until the sync work is done.
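To make that concrete, here is a minimal sketch (the endpoint names and the busy loop are made up for illustration) of how a single synchronous, CPU-bound function starves an otherwise async server:

import { createServer } from "node:http";

// A purely synchronous, CPU-bound loop: nothing else on the event loop
// can run until it returns.
function heavySyncTransform(): number {
  let sum = 0;
  for (let i = 0; i < 1e9; i++) sum += i;
  return sum;
}

createServer((req, res) => {
  if (req.url === "/health") {
    // Normally instant, but while /transform is running this request
    // just sits in the queue and waits.
    res.end("ok");
  } else if (req.url === "/transform") {
    res.end(String(heavySyncTransform()));
  } else {
    res.end("hello");
  }
}).listen(3000);

While /transform is computing, even a trivial /health check gets no response, which is exactly the “sync blocks everything” behavior described above.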

This issue comes up a lot even for relatively simple Node.js Web services that serve plain JSON responses. Eventually, as requests pile up, Node.js gets slower and slower, because JSON encoding and decoding are fairly heavy sync operations for JavaScript.

JSON parse/stringify performance (slow!)

This is one of the most common performance issues for JavaScript engines, because JSON.parse and JSON.stringify are sync operations and we use them constantly during Node.js execution. So if you haven’t yet dealt with JSON parse/stringify performance issues in Node.js, you probably haven’t had an application under real load yet 😃

It’s not that JSON.parse or JSON.stringify are super slow on their own; the problem is that they block the entire process. If you have many large objects to parse or stringify, that ends up delaying your async operations, like fetching data from the DB or responding to an HTTP request.
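Here is a small sketch to see the blocking for yourself; the payload size is arbitrary and the timings depend on your machine:

// Build an arbitrarily large payload; real pipelines hit this with wide
// records multiplied by high request volume.
const payload = Array.from({ length: 200_000 }, (_, i) => ({
  id: i,
  name: `row-${i}`,
  tags: ["a", "b", "c"],
}));

// A timer that should fire every 10ms as long as the event loop is free.
const timer = setInterval(() => console.log("tick"), 10);

// Schedule the heavy sync work slightly later so the timer gets going first.
setTimeout(() => {
  const start = Date.now();
  const json = JSON.stringify(payload); // sync: nothing else runs meanwhile
  const parsed = JSON.parse(json);      // same story on the way back in
  console.log(`encode+decode took ${Date.now() - start}ms for ${parsed.length} rows`);
  clearInterval(timer);
}, 50);

You will see a few ticks, then a silent gap while stringify and parse run, then the timing line. That gap is every other callback, including HTTP responses and DB results, waiting its turn.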

There are some hacky ways to offload JSON parse and stringify to Worker Threads, but the code feels awkward because Node.js is not really optimized for running multithreaded applications. I admit it is still doable, I did it once, but it complicates both production and local development.
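For the curious, here is roughly what that hack looks like. This is only a sketch, not our production code: the file name is hypothetical, it assumes the file has been compiled to plain JS before being loaded as a worker, and the parsed result still gets structured-cloned on its way back to the main thread, which is part of why it never felt like a clean win.

// json-worker.ts (hypothetical) — offloading JSON.parse to a Worker Thread.
import {
  Worker,
  isMainThread,
  parentPort,
  workerData,
} from "node:worker_threads";

if (isMainThread) {
  // Main thread: hand the raw JSON string to a worker so the event loop
  // stays free while parsing happens elsewhere.
  const parseInWorker = (json: string): Promise<unknown> =>
    new Promise((resolve, reject) => {
      const worker = new Worker(__filename, { workerData: json });
      worker.once("message", resolve);
      worker.once("error", reject);
    });

  parseInWorker('{"changedField": 100}').then((obj) =>
    console.log("parsed off the main thread:", obj)
  );
} else {
  // Worker thread: the actual JSON.parse runs here, then the result is
  // posted back (and copied via structured clone in the process).
  parentPort!.postMessage(JSON.parse(workerData as string));
}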

Abusing Object copy/spread in JS

If you have done some coding in lower-level languages like C++, Rust, or even Go (not exactly low level, but…), you will remember that copying memory from function to function leads to poor performance. That’s why developers in those languages mostly pass pointers or references and mutate the variable directly in place.

I’ll admit that a strict rule against mutating function arguments is good for code maintenance. Still, each time you return a spread copy of a given object with minor changes, the JS engine copies the object into new memory and later garbage-collects the old one, which of course adds more synchronous work.

// Not that performant, but good for code readability
function processData(data: Record<string, unknown>) {
  return { ...data, changedField: 100 };
}

// Terrible for code maintenance, but works faster
function processDataInPlace(data: Record<string, unknown>) {
  data.changedField = 100;
  return data;
}

We’ve been following our ESLint rules (no mutating arguments), and over time object spread/copy became a real performance bottleneck.
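Here is a rough micro-benchmark sketch of the two versions above; the row count is arbitrary and the absolute numbers depend entirely on your machine and V8 version, but the relative gap is what bit us:

// Compare spread-copy vs in-place mutation over a large array of rows.
const rows = Array.from({ length: 1_000_000 }, (_, i) => ({
  id: i,
  value: i * 2,
  changedField: 0,
}));

let start = Date.now();
// Allocates a brand new object per row and leaves the old ones to the GC.
const copied = rows.map((row) => ({ ...row, changedField: 100 }));
console.log(`spread copy:       ${Date.now() - start}ms (${copied.length} rows)`);

start = Date.now();
// Mutates in place: no new allocations, no extra GC pressure.
for (const row of rows) {
  row.changedField = 100;
}
console.log(`in-place mutation: ${Date.now() - start}ms`);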

Docker Image Size (1GB+)

You might think: why are we even talking about Docker image size, it has nothing to do with data processing. But it actually does! The critical thing for an ETL data processing system is to scale as fast as you can depending on data volume, especially during first-time ingestion, where there is a time-sensitive initial sync. And there is a real problem when your application image is so heavy that it takes a long time just to pull it and spin a container up.

Some libraries try to package a Node.js application into a single executable, and I have used a few of them. They are tough to manage and keep up to date because not every module supports being bundled outside of the node_modules folder itself. That is fine if you are building a simple Web API, but in data processing pipelines, where you usually rely on DB drivers with native binary dependencies, things can break pretty easily.

What is the solution, then?

At blockpulsar.com, we switched to Golang and C++ because it was too hard to properly optimize sync operations in JS. We still kept Node.js as our main Web App API service, because there are so many great libraries for handling Web integrations that rebuilding them with another backend technology would be time-consuming.

Parallel processing was the main reason we switched to another language for the data processing pipeline. Node.js with Worker Threads consumed around 10x more memory than a single regular process, which made me think we were using Node.js in a way it was never designed for.

We got shockingly good performance results and container images under 20MB, containing just a statically compiled binary. Of course, it takes more time to write the code, especially at first, but if you intend to drop data processing time from 10 seconds to 500ms, that’s the way to do it.

Conclusion

Node.js is great as a Web API, but when you put it under heavy data processing load, it will struggle a lot. There are so many libraries and community projects for Node.js and JavaScript in general that writing projects in Node.js is more enjoyable, but it might not be a good fit depending on the project type.

If you enjoyed this article, let me know how you struggled with Node in production 😃
