Crawling websites with Elixir and Crawly

A quick introduction to how we can use the Crawly library to extract data from websites.

Written in Development by Edgar Latorre — August 13, 2020

In this post, I'd like to introduce Crawly, an Elixir library for crawling websites.

My idea for this post is to give a quick introduction to how we can use Crawly. For this little example, we're going to extract the titles of the latest posts on our website and write them to a file, so let's do it.

First of all, we need Elixir installed; in case you don't have it installed, you can check this guide.
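
If you're not sure whether Elixir is already available, a quick check from the terminal is enough:

elixir --version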

Once we have Elixir installed, let's create our Elixir application with a built-in supervisor:

mix new crawler --sup

To add the Crawly dependencies to our project, we need to change the deps function in the mix.exs file so it looks like this:

defp deps do
  [
    {:crawly, "~> 0.10.0"},
    {:floki, "~> 0.26.0"} # used to parse HTML
  ]
end

We need to install the dependencies we just added by running the command below:

mix deps.get

Let's create a spider file lib/crawler/blog_spider.ex that is going to make a request to our blog, query the HTML response to get the post titles, and then return a Crawly.ParsedItem, which contains items and requests. We are going to leave requests as an empty list to keep it simple.

defmodule BlogSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.codegram.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.codegram.com/blog"]] # urls that are going to be parsed

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    items =
      document
      |> Floki.find("h5.card-content__title") # query h5 elements with class card-content__title
      |> Enum.map(&Floki.text/1)
      |> Enum.map(fn title -> %{title: title} end)

    %Crawly.ParsedItem{items: items, requests: []}
  end
end
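
Before adding any pipelines, we can sanity-check the spider by hand. Here's a quick sketch, assuming an iex -S mix session: Crawly.fetch/1 downloads a single page, and we pass the response straight into our parse_item/1. The output shown in the comment is only illustrative.

# inside iex -S mix: fetch the blog page and run it through our spider
response = Crawly.fetch("https://www.codegram.com/blog")
BlogSpider.parse_item(response)
# => %Crawly.ParsedItem{items: [%{title: "..."}, ...], requests: []}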

Now that we have our spider created, it would be nice to save what we're extracting to a file. To do this, we can use a pipeline provided by Crawly called Crawly.Pipelines.WriteToFile. For that, we need a config folder and a config.exs file:

mkdir config # creates the config directory
touch config/config.exs # creates an empty file called config.exs inside the config folder
mkdir -p priv/output # creates output folder inside priv where we are going to store our files

Now let's add the configuration to config/config.exs so the items extracted by our spider get saved into a file:

use Mix.Config

config :crawly,
  pipelines: [
    Crawly.Pipelines.JSONEncoder, # encode each item into json
    {Crawly.Pipelines.WriteToFile, folder: "priv/output/", extension: "jl"} # stores the items into a file inside the folder specified
  ]

Now that we are good to go, we can open the Elixir REPL:

iex -S mix

And then we can execute our spider:

Crawly.Engine.start_spider(BlogSpider)
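
If you need to interrupt the crawl, for example while tweaking the spider, the engine also exposes a stop function:

Crawly.Engine.stop_spider(BlogSpider)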

The spider is going to be executed by a supervisor, and then we should see a new file inside the priv/output folder. In my case, the latest posts showing on the first page are:

{"title":"\"High tech, high touch\": A communication toolkit for virtual team"}
{"title":"My learning experience in a fully remote company as a Junior Developer"}
{"title":"Finding similar documents with transformers"}
{"title":"UX… What?"}
{"title":"Slice Machine from Prismic"}
{"title":"Stop (ab)using z-index"}
{"title":"Angular for Junior Backend Devs"}
{"title":"Jumping into the world of UX 🦄"}
{"title":"Gettin' jiggy wit' Git - Part 1"}

This is just a simple example of what's possible with Crawly. I hope you enjoyed this introduction, and remember to be responsible when extracting data from websites.
