Coveo Crawler Using Selenium WebDriver and ChromeDriver

May 31, 2024

Indexing a Salesforce LWR site for Coveo can be tricky, especially if the site requires authentication. Coveo Cloud offers web sources that can attempt to log in to a website, but they didn't work with the SSO setup ISACA uses. LWR also loads much of its content client-side, so a crawler needs to wait for each page to finish rendering. Finally, the default web sources can't skip sections of a page (like the main navigation) that generate false positives in search results.

To solve these problems, I set up a custom crawler using ChromeDriver, the Coveo Platform SDK, and Selenium WebDriver. These packages work together in a .NET application to retrieve and render pages, log in when necessary and maintain the session across requests, wait for page elements to load before capturing the page content, remove sections that aren't relevant for search, and then push the metadata and page source to the Coveo index.

Selenium WebDriver is a browser automation tool that works with ChromeDriver and other browser drivers. It lets you retrieve, render, capture, and interact with web pages. Through this tool, the crawler application enters credentials into the login form, requests pages, and waits for page elements to load before capturing the content. The example below waits for the login page to load and enters the username:

wait.Until(c => c.FindElement(By.CssSelector("input.username")));
var usernameInput = driver.FindElement(By.CssSelector("input.username"));
usernameInput.SendKeys(_settings.Username);
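
The wait object used above is Selenium's WebDriverWait (from OpenQA.Selenium.Support.UI), and driver is the ChromeDriver instance created as shown in the next section. Here's a minimal sketch of the wait setup plus the rest of a simple username/password submission; the password selector, submit button selector, and _settings.Password are illustrative assumptions, not ISACA's actual SSO form:

// Poll the DOM until the condition succeeds or the timeout expires
WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(30));

// Fill in the password and submit the login form (selectors are hypothetical)
var passwordInput = driver.FindElement(By.CssSelector("input.password"));
passwordInput.SendKeys(_settings.Password);
driver.FindElement(By.CssSelector("button[type='submit']")).Click();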

ChromeDriver is used as the browser engine for the actual retrieval and rendering of web pages. You can run it headless so it doesn't open a window or produce any visual output, which lets the application use less memory and fewer resources while running. An example of setting this in ChromeOptions:

ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.AddArguments("--headless=new");
// filePath is the directory that contains the chromedriver executable
ChromeDriver driver = new ChromeDriver(filePath, chromeOptions);

The Coveo Platform SDK makes it easier to communicate with a Coveo Cloud source, letting you add, update, or delete items individually or in bulk. This example shows how to retrieve the HTML source of a page and push a single document to Coveo:

driver.Navigate().GoToUrl(url);
// Wait for the LWR application root element to render before capturing the page
wait.Until(c => c.FindElement(By.TagName("webruntime-app")));
PushDocument document = new PushDocument(url) {
    ClickableUri = url,
    ModifiedDate = DateTime.UtcNow
};
// Compress the rendered page source into the document body and push it to the source
PushDocumentHelper.SetContentAndCompress(document, driver.PageSource);
client.DocumentManager.AddOrUpdateDocument(sourceId, document, null, cancellationToken);
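
The client and sourceId referenced above are created once at startup. The sketch below is an assumption based on the SDK's CoveoPlatformConfig and CoveoPlatformClient types; the exact constructor overloads vary by SDK version, so check the documentation for yours. The _settings properties are placeholders for your own configuration:

// Assumed setup: an API key with push privileges, your org ID, and the target source ID
CoveoPlatformConfig config = new CoveoPlatformConfig(_settings.ApiKey, _settings.OrganizationId);
ICoveoPlatformClient client = new CoveoPlatformClient(config);
string sourceId = _settings.SourceId;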

Once you have a custom crawler, there is a lot more functionality you can add. I've often used the CleanHtml processor from Coveo for Sitecore, which lets you mark up sections of your HTML with comments so they are excluded from the index. As an alternative for this project, I added a data-index="false" attribute to any LWC components that shouldn't be indexed, then used HtmlAgilityPack to find and remove those nodes before sending the source to Coveo.

var document = new HtmlDocument();
document.LoadHtml(driver.PageSource);

// Select every element marked with data-index="false" so it can be excluded from the index
var ignoredNodes = document.DocumentNode.SelectNodes("//*[@data-index='false']");

// Iterate backwards so removing a node doesn't shift the remaining indexes
for (int i = (ignoredNodes?.Count ?? 0); i > 0; i--)
    ignoredNodes[i - 1].Remove();

var cleanHtml = document.DocumentNode.OuterHtml;
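
From there, the push step only changes in what it sends: the cleaned HTML replaces the raw page source. A minimal sketch, assuming pushDocument is the PushDocument built as in the earlier example (renamed here so it doesn't clash with the HtmlDocument above):

// Compress and attach the cleaned HTML, then push it to the Coveo source
PushDocumentHelper.SetContentAndCompress(pushDocument, cleanHtml);
client.DocumentManager.AddOrUpdateDocument(sourceId, pushDocument, null, cancellationToken);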