Building a serverless newsletter parser

NewsHub is a serverless newsletter parser that extracts all the links from a newsletter issue and publishes them to other services like Slack, Telegram or Pocket.

This past June we went on our yearly retreat. It was a fun week: I paired with other Codegrammers, went on some trekking trips and had a blast canyoning! I'd had a pet project on my mind for a while, so it was the perfect opportunity to start working on it, release a first working version and try some new things.

NewsHub

I like to stay up to date with technology and development news, so whenever I find a weekly newsletter about something related to web development I subscribe to it. Weeks pass, most of the issues keep piling up in my inbox, and I struggle to read and process all the links and information from the (probably too many) newsletters I’m subscribed to. I wondered whether it would be more comfortable to consume all that information differently.

Enter NewsHub, an email parser that extracts all the relevant links from a newsletter issue and publishes them to my preferred service 🎉 (a Telegram bot, a Slack channel, Pocket, or other services).

Drinking the AWS kool-aid

I’ve been playing with AWS Lambda for a while, but I had yet to deploy a full service with it in production. So, for this project, I went full AWS and combined different services to get a complete serverless experience. Yes, I know there are actual servers, but I don’t need to worry about them, so they don’t exist 🙉. I started the project using the Serverless framework and Node.

Here’s an overview of the infrastructure:

First, Amazon SES (Simple Email Service) receives every email sent to the address I use to subscribe to newsletters and archives it in an S3 bucket.

Then, a Lambda function is triggered each time there’s a new object in that S3 bucket. Keep in mind that when a Lambda function is triggered from an S3 event, you don’t get the actual object. You get a collection of records that reference the objects (bucket name and object key), and you need to download each one yourself.
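To make that concrete, here is a minimal sketch of reading the object references out of an S3 event. It is not NewsHub's actual code: `objectsFromS3Event`, `createHandler`, `downloadEmail` and `parseEmail` are hypothetical names, and the download and parsing steps are injected as placeholders rather than written against a specific SDK version.

```javascript
// Sketch: turn an S3 "ObjectCreated" event into a list of objects to download.
// The event carries references (bucket name + key), never the email body itself.
function objectsFromS3Event(event) {
  return (event.Records || []).map((record) => ({
    bucket: record.s3.bucket.name,
    // S3 URL-encodes keys in event notifications and uses "+" for spaces.
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
  }));
}

// Hypothetical handler factory: `downloadEmail` and `parseEmail` are injected
// placeholders for the S3 download and the email parsing step.
function createHandler({ downloadEmail, parseEmail }) {
  return async function handler(event) {
    const emails = [];
    for (const { bucket, key } of objectsFromS3Event(event)) {
      const raw = await downloadEmail(bucket, key);
      emails.push(await parseEmail(raw));
    }
    return emails;
  };
}
```

Injecting the download and parse functions keeps the event handling easy to test locally without touching AWS.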

Once the email is parsed, we need to find out if it’s an email to confirm the subscription or a newsletter issue. If it’s the former, the confirmation link is published to a Slack channel, so I can manually confirm it. If it’s the latter, all the links are extracted, and a new Lambda function is called with each link. This function unfurls each link to extract the title from the website and to avoid publishing redirect links instead of the final article or website. Finally, the link is published to a Slack channel, a Telegram bot and fed to the recommender service built by Txus.
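The classification and extraction steps could look roughly like the sketch below. Everything in it is an assumption for illustration: the subject-line heuristic is a guess at how a confirmation email might be detected, a regular expression stands in for the cheerio-based parsing so the example has no dependencies, and `isConfirmationEmail`, `extractLinks` and `routeEmail` are hypothetical names.

```javascript
// Heuristic (an assumption): confirmation emails usually say so in the subject.
function isConfirmationEmail(subject) {
  return /\b(confirm|verify|activate)/i.test(subject);
}

// Collect the http(s) targets of <a href="..."> tags, in document order.
// A regex is a simplification; a real HTML parser handles far more edge cases.
function extractLinks(html) {
  const links = [];
  const hrefPattern = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  for (const match of html.matchAll(hrefPattern)) {
    const href = match[1].trim();
    // Skip mailto:, tel:, anchors and relative links.
    if (/^https?:\/\//i.test(href)) links.push(href);
  }
  return links;
}

// Decide what to do with a parsed email: surface it for manual confirmation,
// or fan its links out to the next step.
function routeEmail({ subject, html }) {
  const links = extractLinks(html);
  return isConfirmationEmail(subject)
    ? { kind: 'confirmation', links }
    : { kind: 'issue', links };
}
```

Returning a plain description of what to do, instead of publishing from inside the function, keeps the Slack, Telegram and Lambda calls at the edges.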

All this was done using just a few Node packages, like async, cheerio, mailparser, aws-lambda-invoke and node-telegram-bot-api.

What’s next?

Overall I’m quite happy with the experience. The Serverless framework lets you focus on writing your functions without worrying about all the AWS configuration, and deploys are extremely easy (NewsHub is continuously deployed on every push to the master branch).

NewsHub as a service is far from perfect, but I no longer have to worry about all those emails in my inbox. The next steps will probably be migrating the code to TypeScript, adding some tests and adding more features, like detecting unique links in a single issue, storing the links in a database to avoid publishing the same link from different sources, and adding some buffering mechanism, since receiving 50 links at once can be a bit overwhelming.
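As a rough idea of what two of those features might involve, here is a sketch of per-issue deduplication and simple batching. None of this exists in NewsHub yet: `normalizeUrl`, `uniqueLinks` and `chunk` are hypothetical helpers, and treating `utm_*` parameters and fragments as noise is an assumption about what makes two links "the same".

```javascript
const TRACKING_PARAM = /^utm_/i;

// Strip the fragment and utm_* parameters so tracking variants compare equal.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = '';
  for (const name of [...url.searchParams.keys()]) {
    if (TRACKING_PARAM.test(name)) url.searchParams.delete(name);
  }
  return url.toString();
}

// Keep the first occurrence of each normalized link, preserving order.
function uniqueLinks(urls) {
  const seen = new Set();
  const result = [];
  for (const raw of urls) {
    const normalized = normalizeUrl(raw);
    if (!seen.has(normalized)) {
      seen.add(normalized);
      result.push(normalized);
    }
  }
  return result;
}

// Split a list into batches of at most `size` items, e.g. one message per batch.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

With something like this, 50 links could go out as `chunk(uniqueLinks(links), 10)`, one message per batch, instead of 50 separate notifications.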

Made it to the end? You can also read what Georgina, Marc, and Txus did during the retreat!
