document.querySelectorAll is one of the handiest tools in JavaScript
today. It’s a unified selector interface that puts a lot of power in the
hands of the end user. Its applications are nearly endless, from
replacing jQuery to web scraping. It lets you query the current document
with a CSS selector and returns a NodeList.
I personally don’t like the NodeList container because it doesn’t
support many of the functional methods I expect from an array-like
container in JavaScript. It supports plain old iteration with .forEach,
but who wants just that when an array gives you .map and .filter? So you
might as well convert it with Array.from if you’re doing anything
related to web scraping. I’ll go over some of the common tasks I do with
it, and how you can do this scraping without a browser.
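Here is a minimal sketch of that conversion. The helper name toArray is made up for illustration, and root can be anything that exposes querySelectorAll (the document or a single element).

```javascript
// Hypothetical helper: copy the NodeList returned by querySelectorAll
// into a real array so that map, filter, reduce, etc. are available.
function toArray(root, selector) {
  return Array.from(root.querySelectorAll(selector));
}

// Browser usage, assuming the page contains some links:
// const hrefs = toArray(document, "a").map((a) => a.href);
```

Once the result is a real array, every array method works on it, which is all the rest of this post relies on.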
Find all elements with some specified content pattern
Often you want to find the elements whose content matches a certain pattern. For this example, let’s say you want to get the content of all elements whose text satisfies a certain predicate. In this case we would use filter to find those that match and map to get the content.
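A sketch of that filter-then-map approach might look like the following. The function name matchingText, the p selector, and the year-matching predicate in the usage comment are all assumptions picked for illustration.

```javascript
// Return the trimmed text of every element matching `selector`
// whose text content satisfies `predicate`.
function matchingText(root, selector, predicate) {
  return Array.from(root.querySelectorAll(selector))
    .filter((el) => predicate(el.textContent))
    .map((el) => el.textContent.trim());
}

// Browser usage: the text of every paragraph that mentions a year.
// const withYear = matchingText(document, "p", (text) => /\b(19|20)\d{2}\b/.test(text));
```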
Using querySelectorAll just to select a single tag might not seem very
useful, especially when there are more efficient methods for getting
elements by tag name (such as getElementsByTagName), but the convenience
is worth it.
querySelectorAll can also be used on Nodes, not just whole documents
Let’s say you’ve already found a bunch of div containers, and now you
want to extract content from them. You can also call querySelectorAll on
a node, and it will search that node’s descendants, returning a
collection of the matching elements within it. Consider something like
the iframe below:
If you inspect that iframe, you’ll find several divs, each describing a movie. Everything is nicely tagged; you might not get this lucky in the real world, but it makes a good example.
Let’s say that you want to get the details of every movie listed there. The easiest approach is to inspect the HTML and see how it’s structured. In this case there is a div with an id of movieList, which contains individual divs, each describing a movie. Within each movie div there’s a heading with the title, a description, and a list of actors. So let’s say that we want to get an object with all the details for each movie in the list.
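One way this extraction could be sketched is shown here. Only the movieList id comes from the page described above; the inner selectors (h2 for the title, p for the description, li for each actor) and the function name extractMovies are assumptions about the markup, so adjust them to whatever the real HTML uses.

```javascript
// Build one plain object per movie found under the #movieList container.
// NOTE: "h2", "p" and "li" are assumed selectors, not taken from the page.
function extractMovies(doc) {
  const list = doc.querySelector("#movieList");
  if (!list) return [];
  // ":scope > div" limits the match to the direct child divs of the list.
  return Array.from(list.querySelectorAll(":scope > div")).map((movie) => {
    const title = movie.querySelector("h2");
    const description = movie.querySelector("p");
    return {
      title: title ? title.textContent.trim() : null,
      description: description ? description.textContent.trim() : null,
      // This query is scoped to the movie's own div, not the whole document.
      actors: Array.from(movie.querySelectorAll("li"), (li) => li.textContent.trim()),
    };
  });
}

// Browser usage:
// console.log(JSON.stringify(extractMovies(document), null, 2));
```

The point is that each inner query runs on the individual movie node, so titles and actors from different movies never get mixed together.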
Running that code on the above iframe should give you results like the below:
Now you have some nicely formatted JSON objects that you can do with as you please. You could store them in a database or process them further.
Using Wget and JSDom
Often you won’t have access to the DOM when you’re scraping or spidering
a website. This can be remedied by downloading the site with wget, then
using JSDom to build a DOM from your downloaded HTML. JSDom lets you
build a document object model, with support for many of the modern
selectors, from a string. It runs on Node.js. I’m not going to cover this in depth, but it can be a very useful strategy when planning a web scraper, so you don’t hit the site’s server repeatedly and get banned.
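As a rough sketch of that workflow, assuming the page was saved first with something like wget -O page.html followed by the URL, and that the jsdom package is installed from npm. The file name page.html and the helper names loadDocument and linkTargets are hypothetical.

```javascript
const fs = require("fs");

// Parse a saved HTML file into a document object. jsdom is a third-party
// package (npm install jsdom); it is required lazily here so the helper
// below can be used on any document-like object without it.
function loadDocument(htmlPath) {
  const { JSDOM } = require("jsdom");
  const html = fs.readFileSync(htmlPath, "utf8");
  return new JSDOM(html).window.document;
}

// The same querySelectorAll + Array.from pattern works on that document.
function linkTargets(doc) {
  return Array.from(doc.querySelectorAll("a[href]"), (a) => a.getAttribute("href"));
}

// Usage, with page.html being the hypothetical file saved by wget:
// const doc = loadDocument("page.html");
// console.log(linkTargets(doc));
```

Because the HTML is read from disk, you can rerun and tweak your selectors as often as you like without making another request to the site.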
I hope you found these techniques useful. I regularly use these
strategies when I’m pulling data in the browser or writing a scraper.
document.querySelectorAll is a very powerful tool for the modern
developer, and can easily be used for many complex tasks in modern web
development.