## Purpose
Data stored on the Internet and accessible over HTTP can be interpreted as a data source using the internettable keyword. A depth-first scan is performed in which only unique URLs are returned. Starting at one initial URL, URLs are downloaded from the Internet; these consist of webpages and other content. Content is made available without deeper inspection. Webpages in HTML format are scanned for more URLs, by default using the following paths:
- `//a[@href]`: all hrefs in anchors;
- `//script[@src]`: all sources of scripts;
- `//link[@href]`: all hrefs of links;
- `//img[@src]`: all sources of images.
The startAtExpression specifies the initial webpage to retrieve data for.
A pre-defined list of columns is available per retrieved URL:
- URL: URL of page;
- Contents_char: the character contents, converted from the original character set into UTF-8;
- Contents_blob: the binary contents;
- Mime_type: MIME-type returned by the web server;
- Http_status_code: numeric HTTP response status code;
- Date_retrieval_utc: date/time when the response was received (UTC);
- Retrieval_duration_ms: time between the request and complete response in milliseconds;
- Bytes_retrieved: number of bytes retrieved;
- Depth: recursion depth, starting at 1 for the initial URL;
- Retrieval_successful: indicator whether the response was retrieved completely and successfully;
- Last_modified: date/time when the response's content was last modified;
- Etag: ETag of the content as returned by the web server;
- Content_disposition: preferred file name and encoding to be used;
- Cache_Control: contents of cache-control HTTP response header;
- Expires: contents of the Expires HTTP response header;
- Error_message_code: Invantive UniversalSQL engine error message code if any occurred;
- Error_message_text: Invantive UniversalSQL engine error message text if any occurred.
```sql
select t.*
from internettable
( 'https://www.invantive.com'
stay on site
max depth 2
) t
```
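The pre-defined columns can be selected and filtered individually. The following is a minimal sketch of a broken-link check. It assumes that the column names listed above can be referenced in lower case, that `retrieval_successful` can be compared with a boolean literal, and that `ignore errors` can follow `max depth` as the syntax diagram suggests.

```sql
-- Sketch: list URLs that failed to load or returned an HTTP error status.
-- Assumption: retrieval_successful is a boolean column.
select t.url
,      t.depth
,      t.http_status_code
,      t.error_message_text
from   internettable
       ( 'https://www.invantive.com'
         stay on site
         max depth 2
         ignore errors
       ) t
where  t.http_status_code >= 400
or     t.retrieval_successful = false
order
by     t.depth
,      t.url
```

Without `ignore errors` the scan stops on the first error, so a check like this would probably report at most one failing URL.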
## Syntax
```mermaid
%%{init: {
'theme': 'base',
'flowchart': { 'padding': '7', 'nodeSpacing': '20', 'rankSpacing': '20' },
'themeVariables': {
'fontSize': '11px',
'fontFamily': 'Arial'
}
}}%%
flowchart TD
internetTableSpec_start((START))
internetTableSpec_start --> internetTableSpec_0_1["INTERNETTABLE("]:::quoted
internetTableSpec_0_1 --> internetTableSpec_0_2[startAtExpression]
internetTableSpec_0_2 --> internetTableSpec_0_3[sitemapExpression]
internetTableSpec_0_3 --> internetTableSpec_0_4[excludeExpression]
internetTableSpec_0_4 --> internetTableSpec_0_5[internetTableOptions]
internetTableSpec_0_5 --> internetTableSpec_0_6[")"]:::quoted
internetTableSpec_0_6 --> internetTableSpec_end((END))
```
## startAtExpression
```mermaid
%%{init: {
'theme': 'base',
'flowchart': { 'padding': '7', 'nodeSpacing': '20', 'rankSpacing': '20' },
'themeVariables': {
'fontSize': '11px',
'fontFamily': 'Arial'
}
}}%%
flowchart LR
startAtExpression_start((START))
startAtExpression_start --> startAtExpression_0_0[START]:::quoted
startAtExpression_0_0 --> startAtExpression_0_1[AT_C]:::quoted
startAtExpression_0_1 --> startAtExpression_0_2[<a href="Invantive UniversalSQL/Grammar/Expression" class="internal-link">expression</a>]
startAtExpression_0_2 --> startAtExpression_end((END))
```
## sitemapExpression
```mermaid
%%{init: {
'theme': 'base',
'flowchart': { 'padding': '7', 'nodeSpacing': '20', 'rankSpacing': '20' },
'themeVariables': {
'fontSize': '11px',
'fontFamily': 'Arial'
}
}}%%
flowchart LR
sitemapExpression_start((START))
sitemapExpression_start --> sitemapExpression_0_0[SITEMAP]:::quoted
sitemapExpression_0_0 --> sitemapExpression_0_1[<a href="Invantive UniversalSQL/Grammar/Expression" class="internal-link">expression</a>]
sitemapExpression_0_1 --> sitemapExpression_end((END))
```
## excludeExpression
```mermaid
%%{init: {
'theme': 'base',
'flowchart': { 'padding': '7', 'nodeSpacing': '20', 'rankSpacing': '20' },
'themeVariables': {
'fontSize': '11px',
'fontFamily': 'Arial'
}
}}%%
flowchart LR
excludeExpression_start((START))
excludeExpression_start --> excludeExpression_0_0[EXCLUDE]:::quoted
excludeExpression_0_0 --> excludeExpression_0_1[EXCLUDING]:::quoted
excludeExpression_0_1 --> excludeExpression_0_2[<a href="Invantive UniversalSQL/Grammar/Expression" class="internal-link">expression</a>]
excludeExpression_0_2 --> excludeExpression_end((END))
```
## internetTableOptions
```mermaid
%%{init: {
'theme': 'base',
'flowchart': { 'padding': '7', 'nodeSpacing': '20', 'rankSpacing': '20' },
'themeVariables': {
'fontSize': '11px',
'fontFamily': 'Arial'
}
}}%%
flowchart TD
internetTableOptions_start((START))
internetTableOptions_start --> internetTableOptions_0_0[STAY]:::quoted
internetTableOptions_0_0 --> internetTableOptions_0_1[ON]:::quoted
internetTableOptions_0_1 --> internetTableOptions_0_2[SITE]:::quoted
internetTableOptions_0_2 --> internetTableOptions_0_3[MAX]:::quoted
internetTableOptions_0_3 --> internetTableOptions_0_4[MAXIMUM]:::quoted
internetTableOptions_0_4 --> internetTableOptions_0_5[DEPTH]:::quoted
internetTableOptions_0_5 --> internetTableOptions_0_6[numericConstant]
internetTableOptions_0_6 --> internetTableOptions_0_7[IGNORE]:::quoted
internetTableOptions_0_7 --> internetTableOptions_0_8[ERRORS]:::quoted
internetTableOptions_0_8 --> internetTableOptions_0_9[MAX]:::quoted
internetTableOptions_0_9 --> internetTableOptions_0_10[MAXIMUM]:::quoted
internetTableOptions_0_10 --> internetTableOptions_0_11[PARALLEL]:::quoted
internetTableOptions_0_11 --> internetTableOptions_0_12[numericConstant]
internetTableOptions_0_12 --> internetTableOptions_end((END))
```
## Purpose
The process can be controlled using options:
- Stay on site: when present, recursion is restricted to URLs with the same host name as the starting URL. Default behaviour is to branch out to other sites too.
- Maximum depth: limit the depth of the recursion to a specific number.
- Ignore errors: do not stop on the first error but continue. Default behaviour is to stop on the first error or on result completion, whichever comes first.
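The options can be combined in a single call. The sketch below summarizes the retrieved content per MIME type. It assumes that `maximum` is accepted as a synonym for `max`, as the syntax diagram suggests, and that the options are given in the order shown there. The column aliases `url_count` and `total_bytes` are illustrative names.

```sql
-- Sketch: count URLs and bytes per MIME type within one site.
-- Assumption: options appear in the order of the syntax diagram.
select t.mime_type
,      count(*) url_count
,      sum(t.bytes_retrieved) total_bytes
from   internettable
       ( 'https://www.invantive.com'
         stay on site
         maximum depth 3
         ignore errors
       ) t
group
by     t.mime_type
order
by     t.mime_type
```

Keeping `stay on site` together with a small depth limit bounds the number of URLs downloaded, since the default behaviour is to branch out to other sites.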