util

`thml.util` ¶

Modules:

check_installation –
convert_doc –
cookie_tool –
show_text_diffs_in_jupyter –

Show the differences between two texts side by side in a Jupyter Notebook.
util_rag –

This module contains support functions for the RAG system.

`check_installation` ¶

Functions:

check_install_duckduckgo –
check_install_googlesearch –
check_install_g4f –
check_install_spacy –

`check_install_duckduckgo()` ¶

`check_install_googlesearch()` ¶

`check_install_g4f()` ¶

`check_install_spacy()` ¶

`convert_doc` ¶

Functions:

pdf2md –

Convert a PDF file to a markdown file.

`pdf2md(input_pdf: str, output_md: str, num_threads: int = 8, device: str = 'auto', do_ocr: bool = False, ocr_engine: str = 'easyocr', ocr_lang: list[str] = ['vi', 'en']) -> None` ¶

Convert a PDF file to a markdown file.

Parameters:

input_pdf (str) –

The local-path or URL to the PDF/documents.
output_md (str) –

The path to the output markdown file.
num_threads (int, default: 8 ) –

The number of threads to use. Defaults to 8.
device (str, default: 'auto' ) –

The accelerate device to use. Defaults to "auto".
do_ocr (bool, default: False ) –

Whether to perform OCR on the PDF. Defaults to False.
ocr_engine (str, default: 'easyocr' ) –

The OCR engine to use. Defaults to "easyocr". Choices: "easyocr" or "rapidocr".
ocr_lang (list[str], default: ['vi', 'en'] ) –

The list of languages to use for OCR. Defaults to ["vi", "en"].

Note

See docling examples, and supported formats.
To use OCR feature, you need to install ocr engine, see installation guide here.
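
A call might look like the sketch below. The import path `thml.util.convert_doc` is inferred from the module layout on this page, the file names are placeholders, and `pdf2md_kwargs` is a hypothetical helper (not part of `thml`) that only checks the documented `ocr_engine` choices before the call is made.

```python
from typing import Any, Optional

# The two OCR engines listed in the parameter description above.
OCR_ENGINES = ("easyocr", "rapidocr")


def pdf2md_kwargs(do_ocr: bool = False, ocr_engine: str = "easyocr",
                  ocr_lang: Optional[list[str]] = None) -> dict[str, Any]:
    """Hypothetical helper: build and validate OCR keyword arguments for pdf2md."""
    if ocr_engine not in OCR_ENGINES:
        raise ValueError(f"ocr_engine must be one of {OCR_ENGINES}, got {ocr_engine!r}")
    return {
        "do_ocr": do_ocr,
        "ocr_engine": ocr_engine,
        # Copy the list so callers never share a mutable default.
        "ocr_lang": list(ocr_lang) if ocr_lang is not None else ["vi", "en"],
    }


kwargs = pdf2md_kwargs(do_ocr=True, ocr_engine="rapidocr")

# With thml and the chosen OCR engine installed, the conversion would be:
# from thml.util.convert_doc import pdf2md
# pdf2md("paper.pdf", "paper.md", **kwargs)
```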

`cookie_tool` ¶

Functions:

search_cookie_files –

Search for all cookie files that match the given patterns.
first_cookie_file –

Select the first cookie file that matches the patterns.
read_cookies –

Read all cookie files, and select some cookies based on names.

`search_cookie_files(patterns: list[str] = ['*.json']) -> list[str]` ¶

Search for all cookie files that match the given patterns.

Parameters:

patterns (list[str], default: ['*.json'] ) –

list of patterns (strings) to search

Returns:

list[str] –

List of paths to the matched files.

`first_cookie_file(patterns: list[str]) -> str` ¶

Select the first cookie file that matches the patterns.

Parameters:

patterns (list[str]) –

list of patterns (strings) to search

Returns:

str ( str ) –

First path among the matched files.
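
The pattern-based lookup can be pictured with the standard `glob` module. This is a minimal sketch, not the library's implementation: `search_files` and `first_file` are hypothetical stand-ins, and the `root` argument, the sorted ordering, and the error raised on no match are assumptions, since this page does not say where the real functions search or how they order results.

```python
import glob
import tempfile
from pathlib import Path


def search_files(patterns: list[str], root: str = ".") -> list[str]:
    """Hypothetical stand-in: collect files under root matching any glob pattern."""
    matches: set[str] = set()
    for pattern in patterns:
        matches.update(glob.glob(str(Path(root) / pattern)))
    return sorted(matches)


def first_file(patterns: list[str], root: str = ".") -> str:
    """Hypothetical stand-in: return the first match in sorted order."""
    found = search_files(patterns, root)
    if not found:
        raise FileNotFoundError(f"no file matches {patterns}")
    return found[0]


# Demonstrate on a throwaway directory containing two cookie files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b_cookies.json", "a_cookies.json", "notes.txt"):
        (Path(tmp) / name).write_text("[]")
    found_names = [Path(p).name for p in search_files(["*.json"], tmp)]
    first_name = Path(first_file(["*_cookies.json"], tmp)).name
```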

`read_cookies(cookie_files: Union[str, list[str]], selected_names: list[str] = None) -> list[dict]` ¶

Read all cookie files, and select some cookies based on names.

Parameters:

cookie_files (Union[str, list[str]]) –

A cookie file path, or a list of cookie file paths.
selected_names (list[str], default: None ) –

Names of the cookies to select.

Returns:

list[dict] –

List of cookies.
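
As an illustration of reading and filtering, the sketch below assumes each cookie file is a JSON list of objects with a `name` key, and that passing `None` keeps every cookie. Both are assumptions for this example; `load_cookies` is a hypothetical stand-in, not the library's `read_cookies`.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional, Union


def load_cookies(cookie_files: Union[str, list[str]],
                 selected_names: Optional[list[str]] = None) -> list[dict]:
    """Hypothetical stand-in: read JSON cookie files and filter by cookie name."""
    if isinstance(cookie_files, str):
        cookie_files = [cookie_files]  # accept a single path as well as a list
    cookies: list[dict] = []
    for path in cookie_files:
        cookies.extend(json.loads(Path(path).read_text(encoding="utf-8")))
    if selected_names is None:
        return cookies
    return [c for c in cookies if c.get("name") in selected_names]


# Demonstrate on a temporary cookie file with two entries.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cookies.json"
    path.write_text(json.dumps([
        {"name": "session", "value": "abc"},
        {"name": "theme", "value": "dark"},
    ]), encoding="utf-8")
    everything = load_cookies(str(path))
    selected = load_cookies([str(path)], selected_names=["session"])
```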

`show_text_diffs_in_jupyter` ¶

Show the differences between two texts side by side in a Jupyter Notebook, following this article: https://skeptric.com/python-diffs/

Functions to create the diffs:

- Escape any HTML characters so that they display properly in HTML
- Align the texts at the sentence level
- Mark up the differences between the tokens in each pair of aligned sentences
- Output the marked-up and aligned sentences as side-by-side HTML

REF:

- Showing Side-by-Side Diffs in Jupyter

Functions:

html_diffs –

Return the side-by-side HTML of the differences between text_a and text_b.
align_sentences –

Align the sentences between two lists of sentences of text.
display_diffs_jupyter –

Display the differences between text_a and text_b in Jupyter Notebook.

Attributes:

Token –
TokenList –
whitespace –
end_sentence –

`Token = str` `module-attribute` ¶

`TokenList = list[Token]` `module-attribute` ¶

`whitespace = re.compile('\\s+')` `module-attribute` ¶

`end_sentence = re.compile('(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!\\:)\\s+')` `module-attribute` ¶
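
To show how patterns of this kind split text into sentences and tokens, here is a minimal sketch. It deliberately uses a simplified sentence pattern whose final look-behind is a single character class; that pattern, and the helper names `split_sentences` and `tokenize`, are assumptions made for the example rather than the module's own definitions.

```python
import re

whitespace = re.compile(r"\s+")
# Simplified variant for illustration: split on whitespace that follows
# '.', '?' or '!', unless it follows an abbreviation such as "e.g." or "Dr.".
end_sentence_simple = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.?!])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences at the boundaries matched by the pattern."""
    return [s for s in end_sentence_simple.split(text.strip()) if s]


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into tokens on runs of whitespace."""
    return [t for t in whitespace.split(sentence) if t]


sentences = split_sentences("Dr. Smith went home. He slept! Did he? Yes.")
tokens = tokenize("He   slept\tsoundly")
```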

`html_diffs(a: str, b: str) -> str` ¶

Return the side-by-side HTML of the differences between text_a and text_b.

Parameters:

a (str) –

The first string.
b (str) –

The second string.

Returns:

html ( str ) –

The side-by-side HTML of the differences between text_a and text_b.
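
The escape, mark-up, and side-by-side steps can be sketched with the standard library alone. This is an illustration under assumptions, not the module's code: it uses `difflib.SequenceMatcher` for the token comparison, `<del>`/`<ins>` tags for the markup, and a bare HTML table for the layout; `mark_tokens` and `side_by_side` are hypothetical names.

```python
import difflib
import html


def mark_tokens(tokens_a: list[str], tokens_b: list[str]) -> tuple[str, str]:
    """Escape all tokens for HTML and wrap the differing ones in <del>/<ins>."""
    out_a: list[str] = []
    out_b: list[str] = []
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        part_a = [html.escape(t) for t in tokens_a[i1:i2]]
        part_b = [html.escape(t) for t in tokens_b[j1:j2]]
        if tag == "equal":
            out_a += part_a
            out_b += part_b
        else:  # replace, delete, or insert
            out_a += [f"<del>{t}</del>" for t in part_a]
            out_b += [f"<ins>{t}</ins>" for t in part_b]
    return " ".join(out_a), " ".join(out_b)


def side_by_side(rows: list[tuple[str, str]]) -> str:
    """Render pairs of marked-up sentences as a two-column HTML table."""
    cells = "".join(f"<tr><td>{a}</td><td>{b}</td></tr>" for a, b in rows)
    return f"<table>{cells}</table>"


left, right = mark_tokens("a < b today".split(), "a < c today".split())
table = side_by_side([(left, right)])

# In a notebook, the table could be shown with IPython:
# from IPython.display import HTML, display
# display(HTML(table))
```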

`align_sentences(list1: list[str], list2: list[str]) -> Tuple[list[str], list[str]]` ¶

Align the sentences between two lists of sentences of text.

- Use a similarity score to compare sentences between the two lists
- Align them when the similarity score exceeds a certain threshold
- Insert empty sentences where necessary, without changing the order of sentences in the original lists

Parameters:

list1 (list[str]) –

The first list of sentences of text.
list2 (list[str]) –

The second list of sentences of text.

Returns:

aligned_list1 ( list[str] ) –

The aligned list1.
aligned_list2 ( list[str] ) –

The aligned list2.
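
One way to realise the alignment described above is a greedy two-pointer pass. The sketch below is an assumption-laden illustration, not the library's algorithm: it scores similarity with `difflib.SequenceMatcher.ratio()`, uses an arbitrary threshold of 0.6, and the names `similarity` and `align` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def align(list1: list[str], list2: list[str],
          threshold: float = 0.6) -> tuple[list[str], list[str]]:
    """Pad both lists with empty strings so similar sentences share an index."""
    out1: list[str] = []
    out2: list[str] = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if similarity(list1[i], list2[j]) >= threshold:
            out1.append(list1[i])
            out2.append(list2[j])
            i += 1
            j += 1
        elif any(similarity(list1[i], s) >= threshold for s in list2[j + 1:]):
            # list1[i] has a match further down list2, so list2[j] is extra.
            out1.append("")
            out2.append(list2[j])
            j += 1
        else:
            # list1[i] has no counterpart left in list2.
            out1.append(list1[i])
            out2.append("")
            i += 1
    # Whatever remains on either side is unmatched; original order is kept.
    for s in list1[i:]:
        out1.append(s)
        out2.append("")
    for s in list2[j:]:
        out1.append("")
        out2.append(s)
    return out1, out2


aligned1, aligned2 = align(
    ["The cat sat.", "It was warm."],
    ["The cat sat down.", "A new sentence here!", "It was warm."],
)
```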

`display_diffs_jupyter(a: str, b: str)` ¶

Display the differences between text_a and text_b in Jupyter Notebook.

Parameters:

a (str) –

The first string.
b (str) –

The second string.

`util_rag` ¶

This module contains support functions for the RAG system.

Classes:

Cookie –

Convenience class for Bing Cookie files, data, and configuration.

Attributes:

log –

`log = Log.BingChat.debug` `module-attribute` ¶

`Cookie` ¶

Convenience class for Bing Cookie files, data, and configuration. This class is updated dynamically by the Query class to allow cycling through more than one cookie/credentials file, e.g. when daily request limits (currently 200 per account per day) are exceeded.

Methods:

files –

Return a sorted list of all cookie files matching .search_pattern in cls.dir_path, plus any supplied files, minus any ignored files.
import_data –

Read the active cookie file and populate .current_file_path, .current_data, and .image_token.
import_next –

Cycle through to the next cookies file then import it.

Attributes:

current_file_index –
dir_path –
current_file_path –
search_pattern –
ignore_files –
request_count –
supplied_files –
rotate_cookies –

`current_file_index = 0` `class-attribute` `instance-attribute` ¶

`dir_path = Path.home().resolve() / 'bing_cookies'` `class-attribute` `instance-attribute` ¶

`current_file_path = dir_path` `class-attribute` `instance-attribute` ¶

`search_pattern = 'bing_cookies_*.json'` `class-attribute` `instance-attribute` ¶

`ignore_files = set()` `class-attribute` `instance-attribute` ¶

`request_count = {}` `class-attribute` `instance-attribute` ¶

`supplied_files = set()` `class-attribute` `instance-attribute` ¶

`rotate_cookies = True` `class-attribute` `instance-attribute` ¶

`files() -> list[Path]` `classmethod` ¶

Return a sorted list of all cookie files matching .search_pattern in cls.dir_path, plus any supplied files, minus any ignored files.
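
The "matches plus supplied minus ignored" rule maps naturally onto set operations. The following is a standalone sketch of that rule, assuming plain `pathlib` globbing; `cookie_files` is a hypothetical free function, whereas the real method reads these values from class attributes.

```python
import tempfile
from pathlib import Path


def cookie_files(dir_path: Path, search_pattern: str,
                 supplied: set[Path], ignored: set[Path]) -> list[Path]:
    """Sketch: files matching the pattern, plus supplied files, minus ignored ones."""
    found = set(dir_path.glob(search_pattern)) | supplied
    return sorted(found - ignored)


# Demonstrate on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("bing_cookies_1.json", "bing_cookies_2.json", "extra.json"):
        (root / name).write_text("[]")
    result = cookie_files(
        root,
        "bing_cookies_*.json",
        supplied={root / "extra.json"},
        ignored={root / "bing_cookies_2.json"},
    )
    names = [p.name for p in result]
```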

`import_data() -> None` `classmethod` ¶

Read the active cookie file and populate the following attributes:

- .current_file_path
- .current_data
- .image_token

`import_next(discard: bool = False) -> None` `classmethod` ¶

Cycle through to the next cookies file then import it.

Parameters:

discard (bool, default: False ) –

If True, mark the previous file to be ignored for the remainder of the current session. Otherwise, cycle through all available cookie files (sharing the workload and 'resting' when not in use).
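
The difference between rotating and discarding can be seen in a small model. This is a simplified, hypothetical class written for illustration; it tracks only file names and an index, and does not read cookie data or reproduce the real `Cookie` class.

```python
class CookieRotation:
    """Hypothetical, simplified model of cycling through cookie files."""

    def __init__(self, files: list[str]) -> None:
        self.all_files = list(files)
        self.ignored: set[str] = set()
        self.index = 0

    def files(self) -> list[str]:
        """Files still available in this session, in their original order."""
        return [f for f in self.all_files if f not in self.ignored]

    @property
    def current(self) -> str:
        return self.files()[self.index]

    def next(self, discard: bool = False) -> str:
        """Move to the next file; with discard=True, drop the current one first."""
        if discard:
            # Removing the current file shifts the next one into its slot.
            self.ignored.add(self.current)
        else:
            self.index += 1
        remaining = self.files()
        if not remaining:
            raise RuntimeError("no cookie files left")
        self.index %= len(remaining)  # wrap around to the first file
        return remaining[self.index]


rotation = CookieRotation(["a.json", "b.json", "c.json"])
order = [
    rotation.next(),              # a -> b
    rotation.next(discard=True),  # b is dropped, move on to c
    rotation.next(),              # wraps around to a
    rotation.next(),              # b stays skipped, so c again
]
```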
