Overview
The Indexes view lists all types of indexes (web scrapes, campaigns and banners, knowledge bases, etc.) available in an account. Read more about integrations and indexes on our website.
Scraping Process
The scraper traverses all specified pages in an index (plus the sitemap for web scrapes) to locate and scrape all available text content. Note that a page cannot be scraped if any of the following conditions applies:
- page is not published
- page is not searchable (page has noindex, nofollow tags)
- page is not linked from any page AND not listed in the sitemap
- the site’s robots.txt file contains a disallow rule for crawlers
If a page is published and searchable but is neither linked from another page nor included in the sitemap, it can still be scraped by manually adding its address as an entrypoint URL. Refer to Inclusions, or provide a list of articles (or the parent URL for articles under the same topic) to Raffle Support.
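The robots.txt condition listed above can be checked before requesting a scrape. The following is a minimal sketch using Python's standard urllib.robotparser module. The user agent name example-bot and the sample rules are placeholders, since the user agent of the Raffle scraper is not stated here.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only (not taken from a real site).
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

# "example-bot" is a hypothetical user agent name.
print(is_allowed(EXAMPLE_ROBOTS, "example-bot", "https://www.website.com/private/page"))  # False
print(is_allowed(EXAMPLE_ROBOTS, "example-bot", "https://www.website.com/help/page"))     # True
```

A result of False for a page that should be in the index suggests that the disallow rule needs to be adjusted by the site owner.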
Sync Frequency
After the initial scrape, the scraping process is repeated automatically on a regular basis (weekly or daily). All recent changes made on the host index (e.g. updated content) are then reflected in the Raffle scrape after a successful resync.
For websites with a sitemap (e.g. website.com/sitemap or website.com/sitemap.xml), Raffle uses the regularly updated <lastmod> tag to apply differential scraping, which ensures that only articles with recent changes are re-scraped, resulting in a faster syncing process.
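As an illustration of the idea behind differential scraping, the sketch below reads a sitemap and keeps only the URLs whose last-modified date is later than the previous sync. It is not Raffle's implementation. The function name changed_since, the sample sitemap, and the choice to re-scrape pages without a date are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def changed_since(sitemap_xml: str, last_sync: datetime) -> list[str]:
    """Return URLs modified after last_sync (which must be timezone-aware)."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        if lastmod is None:
            # Assumption: without a date, re-scrape to be safe.
            changed.append(loc.strip())
            continue
        # Note: a trailing "Z" is only accepted from Python 3.11 onwards.
        modified = datetime.fromisoformat(lastmod.strip())
        if modified.tzinfo is None:
            # Assumption: dates without a timezone are treated as UTC.
            modified = modified.replace(tzinfo=timezone.utc)
        if modified > last_sync:
            changed.append(loc.strip())
    return changed


EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.website.com/a</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://www.website.com/b</loc><lastmod>2024-05-10T08:00:00+00:00</lastmod></url>
  <url><loc>https://www.website.com/c</loc></url>
</urlset>"""

LAST_SYNC = datetime(2024, 5, 5, tzinfo=timezone.utc)
print(changed_since(EXAMPLE_SITEMAP, LAST_SYNC))
# ['https://www.website.com/b', 'https://www.website.com/c']
```

In this sketch, page a is skipped because it was last modified before the previous sync, which is what makes the resync faster.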
Note
For newly launched websites, contact Raffle Support for a quick resync. Otherwise, the scraped data will only be updated after the next successful resync (the next day for daily syncs, or the following weekend for weekly syncs).
Scrape Cleaning
By default, articles go through a standard “cleaning” process when first extracted from a connected index. This process involves deduplication checks and CSS blacklisting in order to exclude specific parts of an article from the scrape (e.g. breadcrumbs, footer, etc.).
For further adjustments, provide a list of pruning requests to Raffle Support.
Sectioning
During the scraping process, the contents of a page are divided into smaller sections, which are matched against search queries. When a user searches, the model finds the most relevant sections in the scraped index and sorts them by relevance. The top entries are returned as snippets in the search results list (one section is shown per answer), while the least relevant ones are entered in a separate list called Content Gaps.
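The general flow of sectioning and ranking can be pictured with a toy example. The sketch below splits a page on blank lines and ranks the sections by simple word overlap with the query. Word overlap is only a stand-in chosen for this illustration; it is not how the Raffle model scores relevance, and the function names are hypothetical.

```python
import re


def split_sections(page_text: str) -> list[str]:
    """Split page text into sections on blank lines."""
    return [s.strip() for s in re.split(r"\n\s*\n", page_text) if s.strip()]


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_sections(query: str, sections: list[str]) -> list[tuple[float, str]]:
    """Score each section by the share of query words it contains, best first."""
    query_words = tokens(query)
    scored = []
    for section in sections:
        overlap = len(query_words & tokens(section))
        score = overlap / len(query_words) if query_words else 0.0
        scored.append((score, section))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


PAGE = (
    "Reset your password from the login page.\n\n"
    "Opening hours are 9 to 17 on weekdays.\n\n"
    "Contact support by email."
)

ranked = rank_sections("how do I reset my password", split_sections(PAGE))
print(ranked[0][1])  # Reset your password from the login page.
```

Here the first section would be returned as the snippet, while sections with a score of zero sit at the bottom of the ranking.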
Whitelisting
Configurations set on a server may prevent Raffle from accessing and scraping the contents of a protected knowledge base, and whitelisting of Raffle IP addresses may be necessary.
Whitelisting is based on the principle of giving access to only trusted entities while blocking all others by default. A whitelist (allowlist) is an exclusive list containing IP addresses of trusted and authorized entities (e.g. users or devices) that can access sensitive data or systems. The process mainly involves assigning these addresses to a user or group of users as unique identifiers, and permitting them access to the target server. This list is set up and maintained by the customer’s IT administrators.
Contact Raffle Support for a list of Raffle IP addresses to be whitelisted.
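The default-deny principle described above can be sketched with Python's standard ipaddress module. The addresses below come from ranges reserved for documentation and are placeholders only; the actual Raffle IP addresses must be obtained from Raffle Support. The function name is_allowlisted is hypothetical.

```python
import ipaddress


def is_allowlisted(client_ip: str, allowlist: list[str]) -> bool:
    """Return True only if client_ip falls inside one of the allowlisted networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(entry) for entry in allowlist)


# Placeholder entries (documentation ranges), not real Raffle addresses.
ALLOWLIST = ["203.0.113.10/32", "198.51.100.0/24"]

print(is_allowlisted("198.51.100.42", ALLOWLIST))  # True: inside an allowed range
print(is_allowlisted("192.0.2.7", ALLOWLIST))      # False: blocked by default
```

Any address that is not explicitly listed is rejected, which matches the default-deny behaviour of an allowlist.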
Direct Access
Alternatively, if a straightforward web scraping approach does not work, the customer can provide Raffle Support with the credentials for a supported integration so that Raffle can directly access and scrape the contents of a knowledge base.
Incomplete Scrape
If an article is not included in an index, try the following steps:
- Verify whether the affected article has the noindex, nofollow meta tag
  - By default, Raffle obeys the noindex, nofollow rule and automatically ignores articles with this tag
- Check whether the page is included in the sitemap (e.g. ‘website.com/sitemap’ or ‘website.com/sitemap.xml’)
  - The Raffle scraper also uses an up-to-date sitemap that lists all pages in the website to find pages that may not be linked from other pages
- Refer to Inclusions to manually add pages to the scrape
- Provide a list of articles to be included in the scrape to Raffle Support
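The first check in the list above can be done by reading the page source. The following is a minimal sketch that uses Python's standard html.parser module to collect the directives of a robots meta tag. The class and function names, as well as the sample page, are made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for item in attributes.get("content", "").split(","):
            if item.strip():
                self.directives.add(item.strip().lower())


def robots_directives(html: str) -> set[str]:
    """Return the set of robots meta directives in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Sample page for illustration only.
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head><body>Hi</body></html>'

directives = robots_directives(PAGE)
print("noindex" in directives)  # True: this page would be ignored by the scraper
```

If noindex appears in the result, the tag has to be removed from the page before the article can be included in the index.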
Sync Status
The sync status displays an overview of the latest sync dates and statuses of available indexes, including those that may require attention and further adjustments.
Add Indexes
Contact Raffle Support to add new indexes. For a list of supported integrations, refer to Integrations.
View Indexes
View scraped indexes and underlying scraped articles.
Filter
Filter indexes by title, type or date.
Steps
- Go to Indexes
- Type a search phrase in the search bar
- Press ENTER
Sort
Sort items in ascending or descending order.
Steps
- Go to Indexes
- Click on a column title to sort a column
- Click the column title again to toggle between ascending, descending or no particular order
List
View scraped articles within an index.
Steps
- Go to Indexes
- Click VIEW CONTENT or click the index name
Search
Search for articles within an index based on the whole or partial title or URL. Pagination and the total number of search hits are displayed at the bottom of the table.
Steps
- Go to Indexes
- Click VIEW CONTENT or click the index name
- Type the full or partial article title or URL in the search bar
- Press ENTER
Preview
Read and inspect the contents of a scraped article.
Steps
- Go to Indexes
- Click VIEW CONTENT or click the index name
- Click the paper icon to view the contents of a scraped article
- Click OPEN ORIGINAL to open the original article in a new tab
- Click X to close the preview
Manage Indexes
Manage various types of indexes. For a list of supported integrations, refer to Integrations.
Field Guide:
- Index Title: set the label of the index
- Entrypoint URLs: starting path to be visited by the Raffle scraper, e.g. ‘https://www.website.com/’ or ‘https://website.com/’
- Allowed Domain: domain (or subdomain) of a website, e.g. ‘www.website.com’ or ‘subdomain.website.com’ (without ‘https://’ or any ‘/’)
- Included URLs: set of pages to be visited, followed and indexed as part of the scrape, e.g. ‘^https://www.website.com/.*$’ (includes all pages in the website)
- Excluded URLs: set of pages to be ignored and excluded from the scrape, e.g. ‘^https://www.website.com/en.*$’ (scrapes all pages except those whose path starts with ‘/en’)
Ensure that regular expression rules are followed and that all required fields are filled out.
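Before saving, it can help to test the include and exclude patterns against a few sample URLs. The sketch below uses Python's re module with the two patterns from the field guide. It assumes that exclusions take precedence over inclusions and that the regex syntax behaves like Python's; neither assumption is confirmed here, and the function name should_scrape is hypothetical.

```python
import re


def should_scrape(url: str, included: list[str], excluded: list[str]) -> bool:
    """Return True if url matches an include pattern and no exclude pattern."""
    is_included = any(re.search(pattern, url) for pattern in included)
    is_excluded = any(re.search(pattern, url) for pattern in excluded)
    # Assumption: an exclusion overrides an inclusion.
    return is_included and not is_excluded


# Patterns copied from the field guide examples.
INCLUDED = [r"^https://www.website.com/.*$"]
EXCLUDED = [r"^https://www.website.com/en.*$"]

print(should_scrape("https://www.website.com/da/help", INCLUDED, EXCLUDED))        # True
print(should_scrape("https://www.website.com/en/help", INCLUDED, EXCLUDED))        # False
print(should_scrape("https://www.website.com/entertainment", INCLUDED, EXCLUDED))  # False
print(should_scrape("https://other.com/page", INCLUDED, EXCLUDED))                 # False
```

The third result shows that ‘/en.*’ also matches any path that merely begins with ‘en’. To exclude only the ‘/en/’ folder, a pattern such as ‘^https://www.website.com/en/.*$’ is more precise.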
Renaming
By default, indexes are named according to the path of the index, e.g. the full URL for a web scrape. The title and other details can be configured as follows:
Steps
- Go to Indexes
- Click the gear icon to view the index settings
- Set the index title
- Click SAVE
Inclusions
Raffle locates and scrapes published and searchable articles within a provided index. To manually include URLs:
Steps
- Go to Indexes
- Click the gear icon to view the index settings
- Add domains or subdomains to include under ALLOWED DOMAINS
  - The scrape only indexes pages within the allowed domains
- Add URLs to include under INCLUDED URL FILTERS - (REGEX)
  - Ensure that regex rules are followed; refer to the text field placeholder for an example
- Click the plus icon to add an entry
- OPTIONAL: Click the delete icon to remove an entry
- Click SAVE
Alternatively, provide a list of articles to include to Raffle Support.
Exclusions
Raffle locates and scrapes published and searchable articles within a provided index. To manually exclude URLs:
Steps
- Go to Indexes
- Click the gear icon to view the index settings
- Add URLs to exclude under EXCLUDED URL FILTERS - (REGEX)
  - Ensure that regex rules are followed; refer to the text field placeholder for an example
- Click the plus icon to add an entry
- OPTIONAL: Click the delete icon to remove an entry
- Click SAVE
Alternatively, provide a list of articles to exclude to Raffle Support.
Rules Engine
Boost articles with specific keywords (using words, phrases, questions or suggestions).
Alternatively, provide a list of articles (include search words and target URL) to be boosted to Raffle Support.
Steps
- Go to Indexes
- Click VIEW CONTENT
- Click the question icon to launch the questions dialog
- Click ADD QUESTION to add a search term (one phrase per line)
- Add as many search words or phrases relevant to the scraped content as possible
- Add the title of the article as a question, for further boosting
- OPTIONAL: Click the delete icon to remove a search term from the list
- Click SAVE
Archive Indexes
Active indexes can be archived for a period of time before being permanently deleted.
Steps
- Go to Indexes
- Click the delete icon to archive an index
- Click CONFIRM
Restore Indexes
Archived indexes can be restored or permanently deleted under the ARCHIVED tab.
Steps
- Click the restore icon to restore an index
- OPTIONAL: Click the delete icon to permanently delete an index
- Click CONFIRM