## Purpose

Data stored on the Internet and accessible through the HTTP protocol can be interpreted as a data source using the internettable keyword. A depth-first scan is performed in which only unique URLs are returned. Starting at a single URL, URLs are downloaded from the Internet, consisting of webpages and other content. Content other than webpages is made available without further inspection. Webpages in HTML format are scanned for more URLs, by default using the following paths:

- `//a[@href]`: all hrefs in anchors;
- `//script[@src]`: all sources of scripts;
- `//link[@href]`: all hrefs of links;
- `//img[@src]`: all sources of images.

The startAtExpression specifies the initial webpage to retrieve data for.

A pre-defined list of columns is available per retrieved URL:

- URL: URL of page;
- Contents_char: the character contents, converted from the original character set into UTF-8;
- Contents_blob: the binary contents;
- Mime_type: MIME type returned by the web server;
- Http_status_code: numeric HTTP response status code;
- Date_retrieval_utc: date/time when the response was received (UTC);
- Retrieval_duration_ms: time between the request and the complete response, in milliseconds;
- Bytes_retrieved: number of bytes retrieved;
- Depth: recursion depth, starting at 1 for the initial URL;
- Retrieval_successful: indicator whether the response was retrieved completely and successfully;
- Last_modified: date/time when the response's content was last modified;
- Etag: ETag of the content as returned by the web server;
- Content_disposition: preferred file name and encoding to be used;
- Cache_Control: contents of the Cache-Control HTTP response header;
- Expires: contents of the Expires HTTP response header;
- Error_message_code: Invantive UniversalSQL engine error message code, if any occurred;
- Error_message_text: Invantive UniversalSQL engine error message text, if any occurred.

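To illustrate the default scan paths listed in this section, the following is a minimal Python sketch of how URLs might be collected from one HTML page. It is not the engine's implementation: the names `DEFAULT_PATHS`, `LinkCollector` and `extract_urls` are hypothetical, and resolving relative URLs against the page URL is an assumption, since the text above does not describe how relative references are handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# (tag, attribute) pairs mirroring the default paths:
# //a[@href], //script[@src], //link[@href], //img[@src]
DEFAULT_PATHS = {("a", "href"), ("script", "src"), ("link", "href"), ("img", "src")}


class LinkCollector(HTMLParser):
    """Collects unique URLs from the tag/attribute pairs in DEFAULT_PATHS."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []  # unique URLs, in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        for name, value in attrs:
            if value and (tag, name) in DEFAULT_PATHS:
                # Assumption: relative references are resolved against the page URL.
                absolute = urljoin(self.base_url, value)
                if absolute not in self.urls:
                    self.urls.append(absolute)


def extract_urls(html, base_url):
    """Return the unique URLs referenced by one HTML page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.urls
```

Calling `extract_urls(page_html, page_url)` on the text of a downloaded page returns the candidate URLs for the next recursion level; elements without the relevant attribute, such as an anchor with only a `name`, are skipped.
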
```sql
select t.*
from   internettable
       ( 'https://www.invantive.com'
         stay on site
         max depth 2
       ) t
```

## Syntax

```mermaid
%%{init: { 'theme': 'base', 'flowchart': { 'padding': '7', 'nodeSpacing': '20', 'rankSpacing': '20' }, 'themeVariables': { 'fontSize': '11px', 'fontFamily': 'Arial' } }}%%
flowchart TD
internetTableSpec_start((START))
internetTableSpec_start --> internetTableSpec_0_1["INTERNETTABLE("]:::quoted
internetTableSpec_0_1 --> internetTableSpec_0_2[startAtExpression]
internetTableSpec_0_2 --> internetTableSpec_0_3[sitemapExpression]
internetTableSpec_0_3 --> internetTableSpec_0_4[excludeExpression]
internetTableSpec_0_4 --> internetTableSpec_0_5[internetTableOptions]
internetTableSpec_0_5 --> internetTableSpec_0_6[")"]:::quoted
internetTableSpec_0_6 --> internetTableSpec_end((END))
```

## startAtExpression

```mermaid
%%{init: { 'theme': 'base', 'flowchart': { 'padding': '7', 'nodeSpacing': '20', 'rankSpacing': '20' }, 'themeVariables': { 'fontSize': '11px', 'fontFamily': 'Arial' } }}%%
flowchart LR
startAtExpression_start((START))
startAtExpression_start --> startAtExpression_0_0[START]:::quoted
startAtExpression_0_0 --> startAtExpression_0_1[AT_C]:::quoted
startAtExpression_0_1 --> startAtExpression_0_2[<a href="Invantive UniversalSQL/Grammar/Expression" class="internal-link">expression</a>]
startAtExpression_0_2 --> startAtExpression_end((END))
```

## sitemapExpression

```mermaid
%%{init: { 'theme': 'base', 'flowchart': { 'padding': '7', 'nodeSpacing': '20', 'rankSpacing': '20' }, 'themeVariables': { 'fontSize': '11px', 'fontFamily': 'Arial' } }}%%
flowchart LR
sitemapExpression_start((START))
sitemapExpression_start --> sitemapExpression_0_0[SITEMAP]:::quoted
sitemapExpression_0_0 --> sitemapExpression_0_1[<a href="Invantive UniversalSQL/Grammar/Expression" class="internal-link">expression</a>]
sitemapExpression_0_1 --> sitemapExpression_end((END))
```

## excludeExpression

```mermaid
%%{init: { 'theme': 'base', 'flowchart': { 'padding': '7', 'nodeSpacing': '20', 'rankSpacing': '20' }, 'themeVariables': { 'fontSize': '11px', 'fontFamily': 'Arial' } }}%%
flowchart LR
excludeExpression_start((START))
excludeExpression_start --> excludeExpression_0_0[EXCLUDE]:::quoted
excludeExpression_0_0 --> excludeExpression_0_1[EXCLUDING]:::quoted
excludeExpression_0_1 --> excludeExpression_0_2[<a href="Invantive UniversalSQL/Grammar/Expression" class="internal-link">expression</a>]
excludeExpression_0_2 --> excludeExpression_end((END))
```

## internetTableOptions

```mermaid
%%{init: { 'theme': 'base', 'flowchart': { 'padding': '7', 'nodeSpacing': '20', 'rankSpacing': '20' }, 'themeVariables': { 'fontSize': '11px', 'fontFamily': 'Arial' } }}%%
flowchart TD
internetTableOptions_start((START))
internetTableOptions_start --> internetTableOptions_0_0[STAY]:::quoted
internetTableOptions_0_0 --> internetTableOptions_0_1[ON]:::quoted
internetTableOptions_0_1 --> internetTableOptions_0_2[SITE]:::quoted
internetTableOptions_0_2 --> internetTableOptions_0_3[MAX]:::quoted
internetTableOptions_0_3 --> internetTableOptions_0_4[MAXIMUM]:::quoted
internetTableOptions_0_4 --> internetTableOptions_0_5[DEPTH]:::quoted
internetTableOptions_0_5 --> internetTableOptions_0_6[numericConstant]
internetTableOptions_0_6 --> internetTableOptions_0_7[IGNORE]:::quoted
internetTableOptions_0_7 --> internetTableOptions_0_8[ERRORS]:::quoted
internetTableOptions_0_8 --> internetTableOptions_0_9[MAX]:::quoted
internetTableOptions_0_9 --> internetTableOptions_0_10[MAXIMUM]:::quoted
internetTableOptions_0_10 --> internetTableOptions_0_11[PARALLEL]:::quoted
internetTableOptions_0_11 --> internetTableOptions_0_12[numericConstant]
internetTableOptions_0_12 --> internetTableOptions_end((END))
```

## Purpose

The process can be controlled using options:

- Stay on site: when present, recursion is restricted to URLs on the same host name as the starting URL. Default behaviour is to branch out to other sites too.
- Maximum depth: limit the depth of the recursion to a specific number.
- Ignore errors: do not stop on the first error but continue. Default behaviour is to stop on the first error or on completion of the result, whichever comes first.
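
The interplay of the options above can be sketched in Python as a depth-first traversal over unique URLs. This is an illustration under stated assumptions, not the engine's implementation: `scan` and its parameters are hypothetical names, and `fetch` is a hypothetical callable that returns the URLs found on a page and raises an exception when retrieval fails. Depth starts at 1 for the initial URL, as in the Depth column.

```python
from urllib.parse import urlsplit


def scan(start_url, fetch, stay_on_site=False, max_depth=None, ignore_errors=False):
    """Depth-first scan returning (url, depth, error) rows, one per unique URL.

    fetch(url) is a hypothetical callable returning the list of URLs found
    on that page; it may raise an exception on failure.
    """
    start_host = urlsplit(start_url).hostname
    seen = set()
    rows = []

    def visit(url, depth):
        if url in seen:
            return  # only unique URLs are returned
        if max_depth is not None and depth > max_depth:
            return  # maximum depth: do not recurse any deeper
        if stay_on_site and urlsplit(url).hostname != start_host:
            return  # stay on site: skip other host names
        seen.add(url)
        try:
            links = fetch(url)
        except Exception as exc:
            if not ignore_errors:
                raise  # default: stop on the first error
            rows.append((url, depth, str(exc)))
            return
        rows.append((url, depth, None))
        for link in links:
            visit(link, depth + 1)

    visit(start_url, 1)
    return rows
```

With `stay_on_site=True` and `max_depth=2`, only the starting page and the same-host pages it links to directly are returned. With `ignore_errors=True`, a failed retrieval yields a row carrying the error text instead of aborting the scan.
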