Selector Generalization: Bioinformatics for Web Scraping
Selecting related elements across diverse web pages is a puzzle, often requiring manual CSS selector crafting. I solved this with Selector Generalization, adapting the Needleman-Wunsch algorithm from bioinformatics to align DOM trees and generate generalized selectors from user clicks. In the web’s tangled forest, where each page is a unique organism, this project is a genetic map, tracing patterns to reveal hidden connections.
By clicking example elements, users provide the algorithm with “DNA” to align, producing a selector that captures all similar elements. The process starts by mapping each element’s DOM path into a canonical sequence, then aligning these sequences to find common patterns.
path_lcs.get_canonical_path(node) {
const path = [];
const canonical_path = [];
let code = next_code();
while(!!node && node.tagName != 'HTML') {
const canonical_level = {
tags: new Set(),
geometry: new Set(),
classes: new Set(),
ids: new Set()
};
let index_name = '';
if(!!node.parentNode) {
let count = 0;
const siblings = Array.from(node.parentNode.childNodes);
for(const sibling of siblings) {
if(sibling.tagName == node.tagName) {
count += 1;
}
if(sibling === node) {
index_name = `:nth-of-type(${count})`;
break;
}
}
}
if(!!node.id && node.id.length > 0) {
canonical_level.ids.add(node.id);
}
let classes = node.getAttribute('class') || node.className;
if(typeof classes !== 'string') {
classes = Array.from(node.classList);
} else {
classes = classes.split(/\s+/);
}
classes.forEach(classword => {
if(classword.length > 0) {
canonical_level.classes.add(`.${safe(classword)}`);
}
});
canonical_level.tags.add(node.tagName);
canonical_level.geometry.add(index_name);
canonical_level.code = code;
code += 1;
canonical_path.unshift(canonical_level);
node = node.parentNode;
}
return {canonical: canonical_path};
}
This code maps an element’s DOM path into a canonical sequence, a key step in aligning elements to generate selectors. Selector Generalization’s interdisciplinary leap makes web scraping intuitive, as if decoding a genome to find common traits.

Creating Selector Generalization showed me the power of cross-disciplinary thinking. Like a scientist peering through a microscope, I found clarity in unexpected parallels. At DOSAYGO, we embrace such curiosity, blending fields to craft tools that surprise and delight. This project is a bridge between biology and code, inviting developers to explore the web’s structure with a biologist’s eye.