document.querySelectorAll is one of the handiest tools in JavaScript today. It takes a CSS selector, queries the current document, and returns a NodeList of the matching elements. That gives the web a single, unified selection interface, and its applications are near endless, from replacing jQuery to web scraping.

I personally don’t like the NodeList container because it doesn’t support many of the functional methods I expect from an array-like container in JS. It supports plain old iteration with .forEach, but who wants just that when an array gives you .map and .filter? So you might as well convert it with Array.from if you’re doing anything related to web scraping. I’ll go over some of the common tasks I use this for, and how you can do this kind of scraping without a browser.
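As a rough sketch of that conversion, the helper below (the name textsOf is made up for this example) takes anything that exposes querySelectorAll and returns a plain array of trimmed text. Array.from accepts an optional mapping function as its second argument, so the copy and the first map happen in one step.

```javascript
// Hypothetical helper: collect the trimmed text of every element matching
// `selector` under `root` (a document or an element) as a real array.
function textsOf(root, selector) {
    // Array.from copies the NodeList into an array; the second argument
    // is a mapping function applied to each element during the copy.
    return Array.from(root.querySelectorAll(selector), el => el.textContent.trim());
}
```

In a browser console you could then write textsOf(document, 'h2').filter(t => t.length > 0) and chain any other array method from there.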

Find all elements with some specified content pattern

Often you want to find elements whose content matches a certain pattern. For this example, let’s say you want the text of every element whose content satisfies some predicate. In that case we use filter to keep the elements that match and map to extract their content.

Array.from(document.querySelectorAll('p'))
    .filter(x => x.textContent.length > 75)
    .map(x => x.textContent)

It might not seem very useful to select a single tag like that, especially when there are more efficient methods for getting elements by tag name, such as getElementsByTagName, but the convenience of querySelectorAll is worth it.
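The length check is only one kind of predicate. As a sketch of matching on an actual content pattern, the hypothetical helper below (textsMatching is a name invented here) filters on a regular expression instead. It takes the root to search as a parameter, so it works on the document or on a single element.

```javascript
// Hypothetical helper: return the text of every element matching `selector`
// under `root` whose trimmed text content matches the regular expression
// `pattern`.
function textsMatching(root, selector, pattern) {
    // Pass a regex without the g flag: with g, RegExp.prototype.test keeps
    // state in lastIndex between calls and can skip matches.
    return Array.from(root.querySelectorAll(selector))
        .map(el => el.textContent.trim())
        .filter(text => pattern.test(text));
}
```

For example, textsMatching(document, 'p', /\$\d+/) would collect every paragraph that mentions a dollar amount.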

querySelectorAll can also be used on Nodes, not just whole documents

Let’s say you have already found a bunch of div containers, and now you want to extract content from them. You can also call querySelectorAll on an element, and it will search only that element’s descendants, returning a collection of the matching elements inside it. Take something like the iframe below:

If you inspect that iframe you’ll find several divs, each describing a movie. Everything is nicely tagged; you might not get this lucky in the real world, but it makes a good example.

Let’s say that you want the details of every movie listed there. The easiest approach is to inspect the HTML and see how it’s structured. In this case there is a div with an id of movieList, which contains individual divs that each describe a movie. Within each movie div there’s a heading with the title, a description, and an actor listing. What we want is an object with all the details for every movie in the list.

Array.from(document.querySelectorAll('div.movie'))
    .map(x => {
        return {
            "title": x.querySelector('.title').textContent,
            "description": x.querySelector('p').textContent,
            "actors": x.querySelector('p.actors').textContent.split('&')
        }
    });

Running that code on the above iframe should give you results like the below:

[
  {
    "title": "Foo: The Movie",
    "description": "lipsum 123465",
    "actors": [
      "Someguy ",
      " Blah"
    ]
  },
  {
    "title": "Bar: The Motion Picture",
    "description": "lipsum 123456",
    "actors": [
      "Another Famous Guy"
    ]
  },
  {
    "title": "FooBar: Things United",
    "description": "lipsum 12345. They just united.",
    "actors": [
      "Someguy ",
      " Another Famus Guy"
    ]
  }
]

Now you have some nicely structured, JSON-ready objects that you can do with as you please. You could stick them in a database or process them further.
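Notice that the actor names above still carry stray spaces from the split, and that the extraction would throw if a movie div were missing one of its pieces. A slightly more defensive sketch follows; it reuses the same selectors, while the helper names textOf and extractMovies are made up for this example.

```javascript
// Hypothetical helper: trimmed text of the first match under `el`,
// or '' when nothing matches.
function textOf(el, selector) {
    const match = el.querySelector(selector);
    return match ? match.textContent.trim() : '';
}

// Sketch of the movie extraction with trimming and missing-element handling.
// `root` is a document or an element that contains the movie divs.
function extractMovies(root) {
    return Array.from(root.querySelectorAll('div.movie'), movie => ({
        title: textOf(movie, '.title'),
        description: textOf(movie, 'p'),
        // Split on '&', trim each name, and drop empty entries.
        actors: textOf(movie, 'p.actors')
            .split('&')
            .map(name => name.trim())
            .filter(name => name.length > 0)
    }));
}
```

Calling extractMovies(document) in the browser returns the same shape as before, with trimmed actor names and empty values in place of exceptions.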

Using Wget and JSDom

Often you won’t have access to the DOM when you’re scraping or spidering a website. This can be remedied by downloading the site with wget, then using JSDom to build a DOM from the downloaded HTML. JSDom runs on Node.js and builds a document object model from a string, with support for many of the modern selectors. I’m not going to cover this in depth, but it can be a very useful strategy when planning a web scraper, since you don’t hit their server repeatedly and get banned.
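A minimal sketch of that workflow might look like the following. It assumes the jsdom package has been installed from npm and that a page has already been saved locally; the file name movies.html, the URL in the comment, and the function names are all placeholders rather than anything from a real site.

```javascript
// Assumed to have been downloaded beforehand with something like:
//   wget -O movies.html https://example.com/movies
// (both the file name and the URL are placeholders).

// Build a document from a saved HTML file. jsdom is required lazily so the
// extraction helper below can be loaded without it.
function loadDocument(path) {
    const fs = require('fs');
    const { JSDOM } = require('jsdom'); // npm install jsdom
    const html = fs.readFileSync(path, 'utf8');
    return new JSDOM(html).window.document;
}

// Works on any document, whether it came from a browser or from jsdom.
function movieTitles(document) {
    return Array.from(
        document.querySelectorAll('div.movie .title'),
        el => el.textContent.trim()
    );
}

// Example: console.log(movieTitles(loadDocument('movies.html')));
```

Because the extraction function only needs a document, the same code can be tried out in the browser console first and then moved into a Node.js script unchanged.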

I hope you found these techniques useful. I regularly use them when I’m pulling data in the browser or writing a scraper. document.querySelectorAll is a very powerful tool for the modern developer, and it can easily handle many complex tasks in modern web development.

document.querySelectorAll

2018 March 09
