Quick Start Guide - Everything you need to know
Thunderbit AI Web Scraper is a handy tool that uses AI to help you pull data from virtually any website or source and convert it into well-organized, structured information. It handles web pages, images, and files alike. It's like hiring someone to read the website and fill in your predefined table. (In this case, the website is the Data Source, and the table is the Scraper Template.)
Here's what you can do with it:
2-Clicks Data Extraction: With just two clicks, you can grab data from almost any webpage. The AI gives you suggestions to help you get perfectly formatted data tables.
Natural Language Queries: Just describe what you're looking for, like "Name" or "Email", and the AI will find it for you. It's like having one scraper that works across different webpages.
File and Image Uploads: If a website is tricky to capture, you can upload files or images for extraction.
AI-Powered Data Processing: The tool can rearrange and organize your data on the fly, delivering it exactly the way you need.
Let's get into the details of using the AI Web Scraper. If you haven't installed Thunderbit yet, check out the How to Set Up Thunderbit section. You're welcome to try this out in the free Playground, where you can practice without any cost. For more details, visit the Playground Guide.
By default, the AI Web Scraper uses the Current Page as the Data Source. If you're on a paid plan (or free trial), you can choose URLs or File & Image as your data input. Select the format that suits your data collection needs:
Current Page: Scrapes from the currently open browser tab. You can choose pagination or scrolling options. Learn more in the Pagination & Scrolling guide.
URLs: Input multiple URLs, and the AI Web Scraper will scrape data from each one sequentially. You can choose between browser or cloud server scraping. Each has its pros and cons, but we default to browser scraping. More details in the Browser vs. Background guide.
File & Image: Upload files or images. Supported formats include PDF, Scanned PDF, Images, MS Office, txt, and Markdown.
The scraper template tells the AI what information (data structure) you want, how you want it formatted, and what the final data table should look like. Think of it as the column names that you need to scrape from this page. You can build your scraper template in two ways:
AI Suggest Columns (Recommended): Click "+ New Scraper Template" (or use a blank template), then click the "AI Suggest Columns" button. The AI will come up with the column names based on the page on the left. You can add or remove columns as you wish.
Write Your Own Column Names: Manually write the column names if you have specific needs or naming conventions. For better extraction, specify the data format for each column by clicking the icon at the start of each column. Learn more about data formats in the Data Type guide.
Be specific about the data you want to extract. For example, "Article Content" is better than "Article", and "Product Spec" (if present on the page) is better than "Product Description".
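As a mental model only (templates are built in the extension, not in code), a scraper template can be pictured as an ordered list of columns, each with a name and a data type. The `Column` class, the `"Text"` and `"Date"` labels, and the `suggest_specific_names` helper below are hypothetical illustrations and not part of Thunderbit:

```python
from dataclasses import dataclass


@dataclass
class Column:
    """One column of a hypothetical scraper template."""
    name: str                # what you want extracted, e.g. "Article Content"
    data_type: str = "Text"  # assumed label; the real options are in the extension
    instruction: str = ""    # optional extra guidance for the AI


# The two vague names from the tip above, mapped to the more specific ones.
BETTER_NAMES = {
    "Article": "Article Content",
    "Product Description": "Product Spec",
}


def suggest_specific_names(columns):
    """Return (current, suggested) pairs for columns with a vague name."""
    return [(c.name, BETTER_NAMES[c.name]) for c in columns if c.name in BETTER_NAMES]


template = [
    Column("Title"),
    Column("Article"),
    Column("Published Date", data_type="Date"),
]

print(suggest_specific_names(template))
```

The point of the sketch is that the column name itself carries the request, so a more specific name is the simplest way to get a more specific result.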
Next, choose how you want your data to be output. You can display it as a table within the extension or export it directly to Google Sheets, Airtable, or Notion. See the Export Guide for step-by-step instructions.
With AI capabilities, one template can scrape multiple different pages (e.g., use a single SKU template for multiple e-commerce sites).
Click "scrape" to begin the extraction process. Depending on the amount of data, this may take some time. You can leave the task running in the background and find it in Notifications.
Results are displayed in a table format within the extension.
Preview the data and perform basic operations like sorting, filtering, and hiding columns.
Choose your export format. If you specified an export location earlier, click the button to view it. If you chose "Output as table," you can copy to clipboard or download as CSV. Most platforms support direct pasting or file import.
Bonus: If you export to Notion or Airtable, images in a field are displayed as images, using each platform's own image data type.
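If you download the results as CSV, the same sorting, filtering, and column hiding can be repeated outside the extension. The sketch below uses only Python's standard `csv` module; the column names and rows are made up, and an in-memory string stands in for the downloaded file:

```python
import csv
import io

# Stand-in for a downloaded CSV export; the columns and rows are made up.
exported_csv = """Name,Email,Price
Widget B,b@example.com,19.99
Widget A,a@example.com,5.00
Widget C,,12.50
"""

rows = list(csv.DictReader(io.StringIO(exported_csv)))

# Filter: keep only rows that have an email address.
with_email = [r for r in rows if r["Email"]]

# Sort: cheapest first (CSV values are strings, so convert before comparing).
by_price = sorted(with_email, key=lambda r: float(r["Price"]))

# Hide a column: drop "Email" from the view without touching the source rows.
view = [{k: v for k, v in r.items() if k != "Email"} for r in by_price]

print(view)
```

With a real export, replace the `io.StringIO(...)` wrapper with a file opened via `open(path, newline="")`.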
Now that you've got the hang of the AI Web Scraper, let's explore more AI features to tackle complex tasks.
Thunderbit AI Web Scraper works differently from traditional scrapers. We send the entire webpage to the AI, so each column is like a conversation with the AI. That's why we can take your data scraping to the next level.
Use simple column naming to perform basic text operations like "Summarize" and "Translate"—the AI will understand what you want it to do. Here are some examples:
Summarize: Add a column named "One-sentence Summary" to get a concise version of the content using ChatGPT.
Translate: If the page is in a language you don't read, name the column "xxx in Japanese" (replace "xxx" with the field and "Japanese" with your target language) to get accurate translations.
If you've used ChatGPT, you're familiar with prompts. Here, you can add Custom Instructions to columns for data processing tasks.
Here are some ways to use Custom Instructions:
Determine Ideal Customer Profile (ICP): Add a column named "Customer type" and write classification rules to identify ICP based on your criteria.
Format: To extract publication dates in a specific format, use a prompt like:
Calculate: To calculate a product's total price including shipping, add a column named "Total Price" with the instruction:
Categorize: To scrape popular Amazon products and categorize them, add a column named "Category" with the instruction:
The only limit to AI capabilities is your imagination. Explore various Custom Instructions to enhance data cleaning and processing!
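The instruction texts for the Format, Calculate, and Categorize examples are not reproduced here, and the sketch below does not try to guess them. It only shows, as a rough Python analogy, the kind of transformation each of those columns asks the AI to perform. The function names, the input date format, and the keyword rules are all assumptions made for illustration:

```python
from datetime import datetime


def format_date(raw, source_format="%d/%m/%Y"):
    """Format: normalize a date string to YYYY-MM-DD."""
    return datetime.strptime(raw, source_format).strftime("%Y-%m-%d")


def total_price(price, shipping):
    """Calculate: product price plus shipping, rounded to cents."""
    return round(price + shipping, 2)


def categorize(title, rules):
    """Categorize: first category whose keyword appears in the title."""
    lowered = title.lower()
    for category, keywords in rules.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "Other"


# Hypothetical classification rules; with Custom Instructions you would
# describe rules like these in plain language instead.
rules = {
    "Electronics": ["headphones", "charger"],
    "Kitchen": ["kettle", "pan"],
}

print(format_date("05/03/2024"))
print(total_price(24.99, 4.50))
print(categorize("Wireless Headphones", rules))
```

The difference in practice is that a Custom Instruction states the rule in natural language and the AI applies it per row, so no parsing code has to be written or maintained.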
If your Scraper Template includes URL columns, we offer a special subpage scraping feature. It extracts data from the subpages those URLs point to and merges it back into the table. For example, you can gather contact information from personal pages and add it to the main table. This works for all websites with secondary pages. Learn more in the Subpage Scraping guide.
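Conceptually, the merge step resembles a lookup keyed on the URL column. The sketch below is an illustration under that assumption, not Thunderbit's implementation; the data, the `merge_subpages` function, and the "Profile URL" column name are hypothetical:

```python
# Main table: each row has a URL column pointing at a detail page (made-up data).
main_table = [
    {"Name": "Ada", "Profile URL": "https://example.com/people/ada"},
    {"Name": "Lin", "Profile URL": "https://example.com/people/lin"},
]

# Stand-in for the fields extracted from each subpage, keyed by URL.
subpage_data = {
    "https://example.com/people/ada": {"Email": "ada@example.com"},
}


def merge_subpages(rows, extracted, url_column):
    """Return new rows with subpage fields added where a subpage was scraped."""
    merged = []
    for row in rows:
        extra = extracted.get(row[url_column], {})  # no subpage data: add nothing
        merged.append({**row, **extra})
    return merged


enriched = merge_subpages(main_table, subpage_data, "Profile URL")
print(enriched)
```

Rows whose subpage yielded nothing keep their original fields, so the main table never loses rows in the merge.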
To try all the features on your own work webpages, start with a free trial: you'll get 10 free pages with all AI Web Scraper features (Pre-built Scrapers, Subpage Scraping, Data Enrichment, Bulk Scraping, Pagination). Or subscribe to enjoy a limited-time 31% discount: for just $16.5/month, you'll get 30,000 Credits per year. (A Credit is the basic usage unit in Thunderbit. Basically, 1 Output Row = 1 Credit. Learn more at Pricing.)
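For rough budgeting, the figures above can be turned into a cost per output row. This is back-of-the-envelope arithmetic only: it assumes the $16.5 monthly price is paid for all 12 months and that every row costs exactly one Credit, and the `credits_needed` helper is hypothetical. Some features may be counted differently, so check Pricing for the actual rules:

```python
MONTHLY_PRICE_USD = 16.5   # from the offer above
CREDITS_PER_YEAR = 30_000  # from the offer above
CREDITS_PER_ROW = 1        # "1 Output Row = 1 Credit"

# Assumption: the monthly price is paid for 12 months.
yearly_cost = MONTHLY_PRICE_USD * 12
rows_per_year = CREDITS_PER_YEAR / CREDITS_PER_ROW
cost_per_row = yearly_cost / rows_per_year


def credits_needed(pages, rows_per_page):
    """Estimate Credits for a job, counting only output rows."""
    return pages * rows_per_page * CREDITS_PER_ROW


print(f"Yearly cost: ${yearly_cost:.2f}")
print(f"Cost per output row: ${cost_per_row:.4f}")
print(f"Credits for 40 pages of 25 rows: {credits_needed(40, 25)}")
```

Under these assumptions the plan works out to well under one cent per output row.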