Crawling websites with Elixir and Crawly
A quick introduction to using the Crawly library to extract data from websites.
In this post, I'd like to introduce Crawly, an Elixir library for crawling websites.
The idea is to give a quick introduction to how we can use Crawly. For this small example, we're going to extract the titles of the latest posts on our website and write them to a file, so let's do it.
First of all, we need Elixir installed. If you don't have it yet, you can check this guide.
Once Elixir is installed, let's create our Elixir application with a built-in supervisor:
mix new crawler --sup
To add Crawly and its companion dependencies to our project, we are going to change the deps function in the file mix.exs so that it looks like this:
defp deps do
  [
    {:crawly, "~> 0.10.0"},
    {:floki, "~> 0.26.0"} # used to parse html
  ]
end
We need to install the dependencies that we just added by running the command below:
mix deps.get
Let's create a spider file, lib/crawler/blog_spider.ex, that is going to make a request to our blog, query the HTML response to get the post titles, and then return a ParsedItem, which contains items and requests. To keep it simple, we are going to leave requests as an empty list.
defmodule BlogSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.codegram.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.codegram.com/blog"]] # urls that are going to be parsed

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    items =
      document
      |> Floki.find("h5.card-content__title") # query h5 elements with class card-content__title
      |> Enum.map(&Floki.text/1)
      |> Enum.map(fn title -> %{title: title} end)

    %Crawly.ParsedItem{items: items, requests: []}
  end
end
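To see what the Floki part of parse_item does on its own, here is a minimal sketch that runs the same pipeline against a hard-coded HTML string instead of a real response. The TitleExtractor module and the sample markup are hypothetical and only for illustration; the snippet assumes it is pasted into an IEx session started inside the project, so that Floki is available.

```elixir
# Hypothetical helper, not part of the spider: it mirrors the pipeline
# used in parse_item/1 but takes a plain HTML string as input.
defmodule TitleExtractor do
  def extract(html) do
    {:ok, document} = Floki.parse_document(html)

    document
    |> Floki.find("h5.card-content__title") # same selector as in the spider
    |> Enum.map(&Floki.text/1)
    |> Enum.map(fn title -> %{title: title} end)
  end
end

# Made-up markup: two matching titles and one h5 that should be ignored.
html = """
<div class="card">
  <h5 class="card-content__title">First post</h5>
  <h5 class="card-content__title">Second post</h5>
  <h5 class="other">Not a post title</h5>
</div>
"""

IO.inspect(TitleExtractor.extract(html))
```

Only the h5 elements carrying the card-content__title class end up in the list, which is why the spider depends on the blog keeping that class name.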
Now that we have our spider created, it would be nice to save what we're extracting into a file. To do this, we can use a pipeline provided by Crawly called Crawly.Pipelines.WriteToFile. For that, we need a config folder and a config.exs file:
mkdir config # creates the config directory
touch config/config.exs # creates an empty file called config.exs inside the config folder
mkdir -p priv/output # creates output folder inside priv where we are going to store our files
Now let's add the configuration to config/config.exs so that the items extracted by our spider are saved to a file.
use Mix.Config

config :crawly,
  pipelines: [
    Crawly.Pipelines.JSONEncoder, # encode each item into json
    {Crawly.Pipelines.WriteToFile, folder: "priv/output/", extension: "jl"} # stores the items into a file inside the folder specified
  ]
Now that we are good to go, we can open the Elixir REPL (IEx):
iex -S mix
And then we can start our spider:
Crawly.Engine.start_spider(BlogSpider)
The spider is going to be executed by a supervisor, and then we should see a new file inside the priv/output folder. In my case, the latest posts shown on the first page are:
{"title":"\"High tech, high touch\": A communication toolkit for virtual team"}
{"title":"My learning experience in a fully remote company as a Junior Developer"}
{"title":"Finding similar documents with transformers"}
{"title":"UX… What?"}
{"title":"Slice Machine from Prismic"}
{"title":"Stop (ab)using z-index"}
{"title":"Angular for Junior Backend Devs"}
{"title":"Jumping into the world of UX 🦄"}
{"title":"Gettin' jiggy wit' Git - Part 1"}
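If you'd rather check the result without leaving IEx, the following is a small sketch that prints every line of the .jl files found in the output folder. The OutputReader module is a hypothetical helper that uses only the Elixir standard library; it globs for *.jl files rather than assuming the exact file name that the pipeline generates.

```elixir
# Hypothetical helper: collects the non-empty lines of every *.jl file in `dir`.
defmodule OutputReader do
  def lines(dir) do
    dir
    |> Path.join("*.jl")
    |> Path.wildcard() # returns [] when the folder is missing or empty
    |> Enum.flat_map(fn path ->
      path
      |> File.stream!() # reads the file line by line
      |> Enum.map(&String.trim/1)
      |> Enum.reject(&(&1 == ""))
    end)
  end
end

# Print each stored item, one JSON document per line.
"priv/output"
|> OutputReader.lines()
|> Enum.each(&IO.puts/1)
```

Each line is a standalone JSON document, so any JSON library can decode the items one at a time if you want to process them further.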
This is just a simple example of what is possible using Crawly. I hope you enjoyed this introduction, and remember to be responsible when extracting data from websites.