Overview

Add and manage web scrapes, campaigns, banners, and knowledge bases. Learn about AI scraping frequency and customization options.

The Indexes view lists all types of indexes (web scrapes, campaigns and banners, knowledge bases, etc.) available in an account. Read more about integrations and indexes on our website.

Scraping Process

The scraper traverses all specified pages in an index (plus the sitemap for web scrapes) to locate and scrape all available text content. A page cannot be scraped if any of the following conditions applies:

  • page is not published
  • page is not searchable (page has noindex, nofollow tags)
  • page is not linked from any page AND not listed in the sitemap
  • the site’s robots.txt file contains a disallow rule for crawlers

If a page is published and searchable but is neither linked from another page nor included in the sitemap, it can still be scraped by manually adding its address as an entrypoint URL. Refer to Inclusions, or provide a list of articles (or the parent URL for articles under the same topic) to Raffle Support.
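The robots.txt and noindex/nofollow conditions above can be checked before contacting support. The following is a minimal sketch using only the Python standard library. The function names and the "*" user agent are assumptions made for illustration; the user agent that the Raffle scraper actually sends is not stated here.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def blocked_by_meta(html):
    """True if the page carries a noindex or nofollow robots meta directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return bool(parser.directives & {"noindex", "nofollow"})


def blocked_by_robots_txt(robots_txt, url, user_agent="*"):
    """True if the given robots.txt text disallows `url` for `user_agent`."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return not rules.can_fetch(user_agent, url)


# Hypothetical inputs; in practice, download the page and robots.txt first.
robots_txt = "User-agent: *\nDisallow: /private/\n"
page_html = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'

print(blocked_by_robots_txt(robots_txt, "https://www.website.com/private/page"))
print(blocked_by_meta(page_html))
```

A page that passes both checks may still be skipped if it is unpublished, or if it is neither linked from another page nor listed in the sitemap.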

Sync Frequency

After the initial scrape, the scraping process is repeated automatically on a regular basis (weekly or daily). All recent changes made on the host index (e.g. updated content) are then reflected in the Raffle scrape after a successful resync.

For websites with a sitemap (e.g. website.com/sitemap or website.com/sitemap.xml), Raffle uses the regularly updated <lastmod> tag to apply differential scraping: only articles with recent changes are re-scraped, which makes the syncing process faster.
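The idea behind differential scraping can be illustrated with a short sketch. It assumes a standard sitemap in the sitemaps.org 0.9 namespace. The function names and the choice to re-scrape entries without a last-modified date are assumptions, not a description of Raffle's actual implementation.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(value):
    """Parse a W3C datetime such as 2024-05-10 or 2024-05-10T08:00:00Z."""
    parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: dates without a timezone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def urls_changed_since(sitemap_xml, last_sync):
    """Return URLs modified after last_sync, plus URLs with no lastmod value."""
    changed = []
    root = ET.fromstring(sitemap_xml)
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        if lastmod is None or parse_lastmod(lastmod) > last_sync:
            changed.append(loc.strip())
    return changed


# Hypothetical sitemap content for illustration.
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.website.com/a</loc><lastmod>2024-05-10</lastmod></url>
  <url><loc>https://www.website.com/b</loc><lastmod>2024-04-01T08:00:00Z</lastmod></url>
  <url><loc>https://www.website.com/c</loc></url>
</urlset>"""

last_sync = datetime(2024, 5, 1, tzinfo=timezone.utc)
print(urls_changed_since(sitemap, last_sync))
```

With these inputs, only pages /a (modified after the last sync) and /c (no date available) are selected for re-scraping.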

Scrape Cleaning

By default, articles go through a standard “cleaning” process when first extracted from a connected index. This process involves deduplication checks and CSS blacklisting in order to exclude specific parts of an article from the scrape (e.g. breadcrumbs, footer, etc.).

For further adjustments, provide a list of pruning requests to Raffle Support.
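To make the two cleaning steps concrete, here is a rough sketch of how blacklisting and deduplication could work in principle. It is not Raffle's implementation: the class and function names are hypothetical, the blacklist matches only tag names and class names rather than full CSS selectors, and the HTML is assumed to be well formed.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag and must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BlacklistTextExtractor(HTMLParser):
    """Collects page text while skipping blacklisted elements and their children."""

    def __init__(self, blocked_tags, blocked_classes):
        super().__init__()
        self.blocked_tags = set(blocked_tags)
        self.blocked_classes = set(blocked_classes)
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        blocked = tag in self.blocked_tags or bool(classes & self.blocked_classes)
        if self.skip_depth or blocked:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def clean_text(html, blocked_tags=("footer",), blocked_classes=("breadcrumbs",)):
    """Return the text of a page with blacklisted parts removed."""
    extractor = BlacklistTextExtractor(blocked_tags, blocked_classes)
    extractor.feed(html)
    return " ".join(extractor.chunks)


def deduplicate(articles):
    """Keep the first (url, text) pair for each distinct normalized text."""
    seen, unique = set(), []
    for url, text in articles:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((url, text))
    return unique
```

A pruning request to Raffle Support corresponds roughly to extending the blacklist in this sketch with further page elements.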

Sectioning

During the scraping process, the contents of a page are divided into smaller subsections, which are matched against search queries. When a user searches, the model finds the most relevant sections in the scraped index and sorts them by relevance. The top (relevant) entries are returned as snippets in the search results list (one section is shown per answer), while the least relevant ones are entered in a different list called Content Gaps.
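The split-then-rank flow can be sketched as follows. This is a toy stand-in only: Raffle uses a model to judge relevance, whereas this example splits text on blank lines and scores sections by simple word overlap. All function names are hypothetical.

```python
import re


def split_into_sections(text):
    """Split page text into sections at blank lines."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_sections(query, sections):
    """Return (score, section) pairs sorted from most to least relevant.

    The score is the fraction of query words that occur in the section.
    """
    query_words = tokenize(query)
    scored = []
    for section in sections:
        overlap = len(query_words & tokenize(section))
        score = overlap / len(query_words) if query_words else 0.0
        scored.append((score, section))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical page text for illustration.
page = (
    "Shipping takes three days.\n\n"
    "Returns are accepted within 30 days.\n\n"
    "Contact us by email."
)

for score, section in rank_sections("returns within 30 days", split_into_sections(page)):
    print(f"{score:.2f}  {section}")
```

The highest-scoring section plays the role of the snippet shown in the search results.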

Whitelisting

Configurations set on a server may prevent Raffle from accessing and scraping the contents of a protected knowledge base, and whitelisting of Raffle IP addresses may be necessary.

Whitelisting is based on the principle of giving access to only trusted entities while blocking all others by default. A whitelist (allowlist) is an exclusive list containing IP addresses of trusted and authorized entities (e.g. users or devices) that can access sensitive data or systems. The process mainly involves assigning these addresses to a user or group of users as unique identifiers, and permitting them access to the target server. This list is set up and maintained by the customer’s IT administrators.

Contact Raffle Support for a list of Raffle IP addresses to be whitelisted.
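For IT administrators, the allowlist principle described above amounts to a membership check on the source address of each request. The sketch below uses placeholder addresses from the ranges reserved for documentation; they are not Raffle IP addresses, and the function name is hypothetical.

```python
import ipaddress

# Placeholder entries only; request the real addresses from Raffle Support.
ALLOWLIST = ["203.0.113.0/24", "198.51.100.7"]


def is_allowed(client_ip, allowlist=ALLOWLIST):
    """True if client_ip matches a single address or network on the allowlist."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(entry) for entry in allowlist)


print(is_allowed("203.0.113.25"))  # True: inside the /24 network
print(is_allowed("192.0.2.1"))     # False: not listed, so blocked by default
```

In practice this check is usually configured in a firewall, reverse proxy, or web server rather than written in application code.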

Direct Access

Alternatively, if a straightforward web scraping approach does not work, the customer can provide Raffle Support with the credentials for a supported integration so that Raffle can directly access and scrape the contents of a knowledge base.

Incomplete Scrape

If an article is not included in an index, try the following steps:

  • Verify if the affected article has the noindex, nofollow meta tag
    • By default, Raffle obeys the noindex, nofollow rule and automatically ignores articles with this tag
  • Check if the page is included in the sitemap (e.g. ‘website.com/sitemap’ or ‘website.com/sitemap.xml’)
    • An updated sitemap that lists all pages in the website is also used by the Raffle scraper to find pages that may not be linked from other pages
  • Refer to Inclusions to manually add pages in the scrape
  • Provide a list of articles to be included in the scrape to Raffle Support

Sync Status

The sync status displays an overview of the latest sync dates and statuses of available indexes, including those that may require attention and further adjustments.

Sync Status Overview

Add Indexes

Contact Raffle Support to add new indexes. For a list of supported integrations, refer to Integrations.

Add New Index

View Indexes

View scraped indexes and underlying scraped articles.

View Indexes (Start Page)

View Indexes (Filled List)

Filter

Filter indexes by title, type or date.

Filter Indexes

Sort

Sort items in ascending or descending order.

Sort Indexes

List

View scraped articles within an index.

View Scraped Articles

Search for articles within an index based on the whole or partial title or URL. Pagination and the total number of search hits are displayed at the bottom of the table.

Search in Scraped Articles

Preview

Read and inspect the contents of a scraped article.

View Scraped Content

Manage Indexes

Manage various types of indexes. For a list of supported integrations, refer to Integrations.

Field Guide:

  • Index Title: set the label of the index
  • Entrypoint URLs: starting path to be visited by the Raffle scraper e.g. ‘https://www.website.com/’ or ‘https://website.com/’
  • Allowed Domain: domain (or subdomain) of a website e.g. ‘www.website.com’ or ‘subdomain.website.com’ (without ‘https://’ or any ‘/’)
  • Included URLs: set of pages to be visited, followed and indexed as part of the scrape e.g. ‘^https://www.website.com/.*$’ (includes all pages in the website)
  • Excluded URLs: set of pages to be ignored and excluded from the scrape e.g. ‘^https://www.website.com/en.*$’ (excludes every page whose path starts with ‘/en’)

Ensure that each pattern is a valid regular expression and that all required fields are filled out.
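Patterns can be tested locally before they are saved. The sketch below assumes that exclusion rules take precedence over inclusion rules and that patterns follow Python-style regular expression syntax. Neither assumption is confirmed here, and the function name is hypothetical.

```python
import re


def should_scrape(url, included, excluded):
    """True if url matches an inclusion pattern and no exclusion pattern."""
    if any(re.search(pattern, url) for pattern in excluded):
        return False
    return any(re.search(pattern, url) for pattern in included)


included = [r"^https://www.website.com/.*$"]
excluded = [r"^https://www.website.com/en.*$"]

print(should_scrape("https://www.website.com/da/help", included, excluded))        # True
print(should_scrape("https://www.website.com/en/help", included, excluded))        # False
# '/en.*' also matches any other path that merely starts with '/en':
print(should_scrape("https://www.website.com/entertainment", included, excluded))  # False

# A stricter pattern that only excludes '/en' and pages below it,
# with the dots escaped so they match literal dots:
strict = [r"^https://www\.website\.com/en(/.*)?$"]
print(should_scrape("https://www.website.com/entertainment", included, strict))    # True
```

The third call shows a common pitfall: a broad exclusion pattern can silently drop unrelated pages that share the same prefix.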

Customize Index Settings

Renaming

By default, indexes are named according to the path of the index, e.g. the full URL for a web scrape. This and other details can be configured as follows:

Rename an Index

Inclusions

Raffle locates and scrapes published and searchable articles within a provided index. To manually include URLs:

Manually Include Articles

Alternatively, provide a list of articles to include to Raffle Support.

Exclusions

Raffle locates and scrapes published and searchable articles within a provided index. To manually exclude URLs:

Manually Exclude Articles

Alternatively, provide a list of articles to exclude to Raffle Support.

Rules Engine

Boost articles with specific keywords (using words, phrases, questions or suggestions).

Alternatively, provide a list of articles (include search words and target URL) to be boosted to Raffle Support.

Boost Articles with Search Phrases

Archive Indexes

Active indexes can be archived for a period of time before being permanently deleted.

Archive an Index

Restore Indexes

Archived indexes can be restored or permanently deleted under the ARCHIVED tab.

Restore an Index