Extract Text from HTML

Definition

The "Extract Text from HTML" action extracts the plain text content from an HTML string. It strips away the HTML tags, leaving only the textual content of the document. This is useful when you need to process or analyze the content of an HTML page without its formatting or structure. The action can also include the content of the <head> tag, which is useful for extracting metadata or title information.

Key Capabilities:

  • Extracts only the text content, removing HTML tags.
  • Optionally includes the content from the <head> tag of the HTML.
  • Simplifies working with HTML content by focusing on the relevant text.
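
To make the tag-stripping behavior concrete, here is a minimal sketch in Python using only the standard library's html.parser module. It is not the action's actual implementation; the names TextExtractor and strip_tags are hypothetical, and skipping <script> and <style> content is an assumption about what "relevant text" means. Handling of the <head> tag is left out of this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (illustrative only)."""

    # Assumption: script and style bodies are code, not readable text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside a skipped element.
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def strip_tags(html: str) -> str:
    """Return the text of `html`, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


print(strip_tags("<html><body><p>Hello, world!</p></body></html>"))
# prints: Hello, world!
```

Each text node becomes its own line in this sketch, so text split by inline tags such as <b> ends up on separate lines; a real implementation would likely join inline text more carefully.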

Example Use Cases

  1. Web Scraping for Text Content
    When scraping web pages, you may need to extract only the text (e.g., article content, blog posts, or product descriptions) while ignoring HTML formatting, images, and other non-text elements.

  2. Email Content Extraction
    If you are processing HTML emails, you can extract the main text body of an email to analyze its content, for example the message or details in a notification email.

  3. Data Extraction from HTML Reports
    In scenarios where HTML reports are generated, you can extract the textual information (such as summaries, results, or data points) from the report without the need to parse the HTML structure.

  4. Content Filtering and Text Processing
    For text analysis, sentiment analysis, or keyword extraction, you can use this action to first extract the raw text content from an HTML page and then process it further.

  5. Metadata Extraction from Web Pages
    When extracting metadata (like the title or description) from HTML pages, you can enable the option to include the <head> tag content to retrieve the meta tags, title, or other header information.
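
For the metadata use case, the following sketch shows one way the title and meta description could be read from a page in Python. It only illustrates the kind of information the <head> tag holds and is not how the action itself works; HeadMetadataParser and read_metadata are hypothetical names, and the parsing relies on the standard library's html.parser module.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Captures the <title> text and the description <meta> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # The description lives in the `content` attribute, not in a text node.
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def read_metadata(html: str) -> dict:
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "description": parser.description}


page = (
    "<html><head><title>Page Title</title>"
    '<meta name="description" content="This is a page description">'
    "</head><body><p>Hello</p></body></html>"
)
print(read_metadata(page))
```

Note that the meta description is stored in an attribute, so a plain tag stripper would not see it; it has to be read from the tag's attributes, as done here.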

Inputs

  1. HTML Payload
    This is the HTML string or document from which you want to extract text. It can be the raw HTML content of a webpage, email, or any HTML-formatted text. The action will parse this HTML and extract the text content while ignoring the HTML tags.

    • Example: If you have an HTML string like <html><body><p>Hello, world!</p></body></html>, this field would contain that string, and the action would extract "Hello, world!" as the output.

  2. Include Head Tag Content
    This is a checkbox that allows you to include the content found within the <head> tag of the HTML document. The <head> tag typically contains metadata such as the title, meta description, or other header information that is not usually visible on the page but may be relevant for certain use cases.

    • Example: If the HTML contains <head><title>Page Title</title><meta name="description" content="This is a page description"></head>, enabling this option would include "Page Title" and "This is a page description" in the output text. If left unchecked, only the text of the body content (e.g., paragraphs and headings) would be extracted.

Outputs

  1. Text
    The extracted text content from the provided HTML payload. This output contains the plain text, with all HTML tags removed. If the "Include Head Tag Content" option is enabled, it also includes the text content from the <head> section, such as the page title or meta description.

    • Example: Given an HTML input like:

      <html>
        <head><title>Page Title</title></head>
        <body><p>Hello, world!</p><p>Welcome to the page.</p></body>
      </html>

      With "Include Head Tag Content" enabled, the output will be:

      Page Title
      Hello, world!
      Welcome to the page.

      If "Include Head Tag Content" is disabled, the output will only include the body content:

      Hello, world!
      Welcome to the page.
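
The two outputs above can be reproduced with a short Python sketch. This is an approximation built on the standard library's html.parser module, not the action's real code; the extract_text function and its include_head parameter are hypothetical stand-ins for the "Include Head Tag Content" checkbox, and this sketch only handles text nodes inside <head> (such as the title), not meta tag attributes.

```python
from html.parser import HTMLParser


class HeadAwareExtractor(HTMLParser):
    """Collects text nodes, optionally dropping those inside <head>."""

    def __init__(self, include_head: bool):
        super().__init__()
        self.include_head = include_head
        self.lines = []
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # whitespace between tags
        if self._in_head and not self.include_head:
            return  # head content is excluded unless requested
        self.lines.append(text)


def extract_text(html: str, include_head: bool = False) -> str:
    parser = HeadAwareExtractor(include_head)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


PAGE = """<html>
  <head><title>Page Title</title></head>
  <body><p>Hello, world!</p><p>Welcome to the page.</p></body>
</html>"""

print(extract_text(PAGE, include_head=True))
print("---")
print(extract_text(PAGE, include_head=False))
```

With include_head=True the sketch prints the title followed by the two paragraphs; with include_head=False it prints only the two paragraphs, matching the example outputs.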
        

Example of Using the Action

Imagine you have a collection of HTML emails stored in your system, and you want to extract the plain text content from each email for analysis or processing.

For instance, the HTML emails might contain promotional messages with various HTML tags like <div>, <img>, and <p>. By using the "Extract Text from HTML" action, you can easily strip away the HTML tags and retrieve just the readable content, like the promotional text or offer details.

In this case, the action converts the HTML email content into clean, readable text for further processing, such as sentiment analysis or keyword extraction. This is particularly useful when dealing with large datasets of HTML-based content, where the raw text must be extracted and processed efficiently.
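
As a rough illustration of this workflow, the sketch below strips the tags from a small batch of emails and then counts the most frequent words as a simple form of keyword extraction. The sample emails, the email_to_text and top_keywords helpers, and the rule of ignoring words of three letters or fewer are all assumptions made for the example; in practice, the text would come from the action's Text output rather than from this stand-in parser.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Stand-in for the action: keeps text nodes, drops tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def email_to_text(html: str) -> str:
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    # Collapse the whitespace left behind where tags were removed.
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def top_keywords(emails, n=3):
    """Count words longer than three letters across all emails."""
    counts = Counter()
    for html in emails:
        words = re.findall(r"[a-z']+", email_to_text(html).lower())
        counts.update(w for w in words if len(w) > 3)
    return counts.most_common(n)


# Hypothetical sample data standing in for stored promotional emails.
EMAILS = [
    '<div><img src="banner.png"><p>Summer sale: 20% off shoes</p></div>',
    "<div><p>Final days of the summer sale</p></div>",
]

for html in EMAILS:
    print(email_to_text(html))
print(top_keywords(EMAILS, n=2))
```

The <img> tag contributes nothing to the text because it has no text node, which mirrors the way the action leaves only the readable content.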