DOSAYGO Studio

Selector Generalization: Bioinformatics for Web Scraping

By Cris, January 1, 2013

Selecting related elements across diverse web pages is a puzzle, often requiring manual CSS selector crafting. I solved this with Selector Generalization, adapting the Needleman-Wunsch algorithm from bioinformatics to align DOM trees and generate generalized selectors from user clicks. In the web’s tangled forest, where each page is a unique organism, this project is a genetic map, tracing patterns to reveal hidden connections.

By clicking example elements, users provide the algorithm with “DNA” to align, producing a selector that captures all similar elements. The process starts by mapping each element’s DOM path into a canonical sequence, then aligning these sequences to find common patterns.

path_lcs.get_canonical_path(node) {
  const path = [];
  const canonical_path = [];
  let code = next_code();
  while(!!node && node.tagName != 'HTML') {
    const canonical_level = {
      tags: new Set(),
      geometry: new Set(),
      classes: new Set(),
      ids: new Set()
    };
    let index_name = '';
    if(!!node.parentNode) {
      let count = 0;
      const siblings = Array.from(node.parentNode.childNodes);
      for(const sibling of siblings) {
        if(sibling.tagName == node.tagName) {
          count += 1;
        }
        if(sibling === node) {
          index_name = `:nth-of-type(${count})`;
          break;
        }
      }
    }
    if(!!node.id && node.id.length > 0) {
      canonical_level.ids.add(node.id);
    }
    let classes = node.getAttribute('class') || node.className;
    if(typeof classes !== 'string') {
      classes = Array.from(node.classList);
    } else {
      classes = classes.split(/\s+/);
    }
    classes.forEach(classword => {
      if(classword.length > 0) {
        canonical_level.classes.add(`.${safe(classword)}`);
      }
    });
    canonical_level.tags.add(node.tagName);
    canonical_level.geometry.add(index_name);
    canonical_level.code = code;
    code += 1;
    canonical_path.unshift(canonical_level);
    node = node.parentNode;
  }
  return {canonical: canonical_path};
}

This code maps an element’s DOM path into a canonical sequence, a key step in aligning elements to generate selectors. Selector Generalization’s interdisciplinary leap makes web scraping intuitive, as if decoding a genome to find common traits.

Selector Generalization interface: a DNA double helix morphing into a DOM tree, with highlighted web elements on a page

Creating Selector Generalization showed me the power of cross-disciplinary thinking. Like a scientist peering through a microscope, I found clarity in unexpected parallels. At DOSAYGO, we embrace such curiosity, blending fields to craft tools that surprise and delight. This project is a bridge between biology and code, inviting developers to explore the web’s structure with a biologist’s eye.