util

thml.util

Modules:

check_installation

Functions:

check_install_duckduckgo()

check_install_googlesearch()

check_install_g4f()

check_install_spacy()

convert_doc

Functions:

  • pdf2md

    Convert a PDF file to a markdown file.

pdf2md(input_pdf: str, output_md: str, num_threads: int = 8, device: str = 'auto', do_ocr: bool = False, ocr_engine: str = 'easyocr', ocr_lang: list[str] = ['vi', 'en']) -> None

Convert a PDF file to a markdown file.

Parameters:

  • input_pdf (str) –

    The local path or URL of the PDF document.

  • output_md (str) –

    The path to the output markdown file.

  • num_threads (int, default: 8 ) –

    The number of threads to use. Defaults to 8.

  • device (str, default: 'auto' ) –

    The accelerator device to use. Defaults to "auto".

  • do_ocr (bool, default: False ) –

    Whether to perform OCR on the PDF. Defaults to False.

  • ocr_engine (str, default: 'easyocr' ) –

    The OCR engine to use. Defaults to "easyocr". Choices: "easyocr" or "rapidocr".

  • ocr_lang (list[str], default: ['vi', 'en'] ) –

    The list of languages to use for OCR. Defaults to ["vi", "en"].
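A call with OCR enabled might look like the following sketch. The import path `thml.util.convert_doc` is inferred from the module listing on this page, and `pdf2md_kwargs` and `convert` are hypothetical helpers that are not part of the library; they only check the documented `ocr_engine` choices and forward the documented defaults.

```python
import importlib

# Choices listed in the parameter description above.
VALID_OCR_ENGINES = ("easyocr", "rapidocr")


def pdf2md_kwargs(num_threads=8, device="auto", do_ocr=False,
                  ocr_engine="easyocr", ocr_lang=None):
    """Hypothetical helper: collect keyword arguments using the documented defaults."""
    if ocr_engine not in VALID_OCR_ENGINES:
        raise ValueError(
            f"ocr_engine must be one of {VALID_OCR_ENGINES}, got {ocr_engine!r}"
        )
    return {
        "num_threads": num_threads,
        "device": device,
        "do_ocr": do_ocr,
        "ocr_engine": ocr_engine,
        # Build a fresh list so callers never share a mutable default.
        "ocr_lang": list(ocr_lang) if ocr_lang is not None else ["vi", "en"],
    }


def convert(input_pdf, output_md, **options):
    """Hypothetical wrapper: validate the options, then call pdf2md."""
    # Assumed import path; adjust it if the package layout differs.
    module = importlib.import_module("thml.util.convert_doc")
    module.pdf2md(input_pdf, output_md, **pdf2md_kwargs(**options))


# Example call (requires the package and a real PDF):
# convert("paper.pdf", "paper.md", do_ocr=True, ocr_engine="rapidocr")
```

Validating the engine name before the call is a design choice of this sketch; the source does not say how `pdf2md` itself reacts to an unsupported value.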

cookie_tool

Functions:

Search all cookie files based on the search string.

Parameters:

  • patterns (list[str], default: ['*.json'] ) –

    list of patterns (strings) to search

Returns:

  • list[str]

    list[str]: list of paths of matched files.
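The pattern search described above can be sketched with `pathlib` globbing. This is a minimal illustration, not the library's implementation: the function name `find_cookie_files` and the `directory` argument are hypothetical, and treating the patterns as glob patterns is an assumption suggested by the `'*.json'` default.

```python
from pathlib import Path


def find_cookie_files(directory, patterns=("*.json",)):
    """Hypothetical sketch: return sorted paths of files matching any glob pattern."""
    matches = set()
    for pattern in patterns:
        # A set removes duplicates when a file matches more than one pattern.
        matches.update(str(p) for p in Path(directory).glob(pattern) if p.is_file())
    return sorted(matches)


def first_cookie_file(directory, patterns=("*.json",)):
    """Hypothetical sketch: return the first matched path, or None if nothing matches."""
    found = find_cookie_files(directory, patterns)
    return found[0] if found else None
```

Sorting the result makes "the first matched file" deterministic across runs, which matters when one file is selected from several.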

Select the first cookie file that matches the patterns.

Parameters:

  • patterns (list[str]) –

    list of patterns (strings) to search

Returns:

  • str ( str ) –

    First path among the matched files.

read_cookies(cookie_files: Union[str, list[str]], selected_names: list[str] = None) -> list[dict]

Read all cookie files, and select some cookies based on names.

Parameters:

  • cookie_files (Union[str, list[str]]) –

    A cookie file path, or a list of cookie file paths.

  • selected_names (list[str], default: None ) –

    select cookies by names

Returns:

  • list[dict]

    list[dict]: list of cookies
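The behavior described for `read_cookies` can be approximated as below. This is a sketch under stated assumptions: each cookie file is assumed to hold a JSON array of objects with a `"name"` key, which the source does not confirm, and `read_cookies_sketch` is a hypothetical name used to avoid implying it is the library function.

```python
import json
from pathlib import Path
from typing import Optional, Union


def read_cookies_sketch(
    cookie_files: Union[str, list[str]],
    selected_names: Optional[list[str]] = None,
) -> list[dict]:
    """Hypothetical sketch: load cookies from files and optionally filter by name."""
    # Accept a single path as well as a list, mirroring the documented signature.
    if isinstance(cookie_files, str):
        cookie_files = [cookie_files]

    cookies: list[dict] = []
    for path in cookie_files:
        # Assumed file format: a JSON array of cookie objects.
        cookies.extend(json.loads(Path(path).read_text(encoding="utf-8")))

    if selected_names is None:
        return cookies
    wanted = set(selected_names)
    return [cookie for cookie in cookies if cookie.get("name") in wanted]
```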

show_text_diffs_in_jupyter

Show the differences between two texts side-by-side in Jupyter Notebook. Following this article: https://skeptric.com/python-diffs/

Functions to create the diffs:

- Escape any HTML characters so that they display properly in HTML
- Align the texts at the sentence level
- Mark up the differences between the tokens in each pair of aligned sentences
- Output the marked-up and aligned sentences as side-by-side HTML

REF:

- Showing Side-by-Side Diffs in Jupyter

Functions:

  • html_diffs

    Return the side-by-side HTML of the differences between text_a and text_b.

  • align_sentences

    Align the sentences between two lists of sentences of text.

  • display_diffs_jupyter

    Display the differences between text_a and text_b in Jupyter Notebook.

Attributes:

Token = str module-attribute

TokenList = list[Token] module-attribute

whitespace = re.compile('\\s+') module-attribute

end_sentence = re.compile('(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!\\:)\\s+') module-attribute

html_diffs(a: str, b: str) -> str

Return the side-by-side HTML of the differences between text_a and text_b.

Parameters:

  • a (str) –

    The first string.

  • b (str) –

    The second string.

Returns:

  • html ( str ) –

    The side-by-side HTML of the differences between text_a and text_b.
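The escaping and token-markup steps can be illustrated with the standard library. This is a simplified sketch, not the module's code: `mark_token_diffs` and `side_by_side` are hypothetical names, the `<del>`/`<ins>` tags are an arbitrary markup choice, and sentence alignment is left out.

```python
import difflib
import html
import re

WHITESPACE = re.compile(r"\s+")


def mark_token_diffs(a: str, b: str) -> tuple[str, str]:
    """Return (a_html, b_html) with differing tokens wrapped in <del> / <ins>."""
    # Escape first so that the input text cannot inject markup.
    tokens_a = WHITESPACE.split(html.escape(a.strip()))
    tokens_b = WHITESPACE.split(html.escape(b.strip()))

    out_a, out_b = [], []
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out_a.extend(tokens_a[i1:i2])
            out_b.extend(tokens_b[j1:j2])
        else:
            # "replace", "delete" and "insert" all count as differences.
            out_a.extend(f"<del>{t}</del>" for t in tokens_a[i1:i2])
            out_b.extend(f"<ins>{t}</ins>" for t in tokens_b[j1:j2])
    return " ".join(out_a), " ".join(out_b)


def side_by_side(a: str, b: str) -> str:
    """Place the two marked-up texts in adjacent table cells."""
    left, right = mark_token_diffs(a, b)
    return f"<table><tr><td>{left}</td><td>{right}</td></tr></table>"
```

In a notebook, the returned string could be passed to `IPython.display.HTML` for rendering, which is presumably close to what `display_diffs_jupyter` does.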

align_sentences(list1: list[str], list2: list[str]) -> Tuple[list[str], list[str]]

Align the sentences between two lists of sentences of text.

- Use a similarity score to compare sentences between the two lists.
- Align sentences whose similarity score exceeds a certain threshold.
- Insert empty sentences where necessary, but do not change the order of sentences in the original lists.

Parameters:

  • list1 (list[str]) –

    The first list of sentences of text.

  • list2 (list[str]) –

    The second list of sentences of text.

Returns:

  • aligned_list1 ( list[str] ) –

    The aligned list1.

  • aligned_list2 ( list[str] ) –

    The aligned list2.
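One way to realize the alignment described above is a longest-common-subsequence style dynamic program in which two sentences "match" when their similarity reaches a threshold. This is a sketch only: the similarity measure (`difflib.SequenceMatcher.ratio`), the `0.6` threshold, and the name `align_sentences_sketch` are assumptions, and the library may use a different strategy.

```python
import difflib


def similarity(s1: str, s2: str) -> float:
    """Character-level similarity in [0, 1] (assumed measure)."""
    return difflib.SequenceMatcher(a=s1, b=s2, autojunk=False).ratio()


def align_sentences_sketch(list1, list2, threshold=0.6):
    """Pad both lists with "" so that similar sentences share an index."""
    n, m = len(list1), len(list2)
    similar = [[similarity(x, y) >= threshold for y in list2] for x in list1]

    # best[i][j]: maximum number of pairs alignable from list1[i:] and list2[j:].
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best[i][j] = max(best[i + 1][j], best[i][j + 1])
            if similar[i][j]:
                best[i][j] = max(best[i][j], 1 + best[i + 1][j + 1])

    out1, out2 = [], []
    i = j = 0
    while i < n and j < m:
        if similar[i][j] and best[i][j] == 1 + best[i + 1][j + 1]:
            out1.append(list1[i]); out2.append(list2[j])
            i += 1; j += 1
        elif best[i + 1][j] >= best[i][j + 1]:
            out1.append(list1[i]); out2.append("")  # no partner in list2
            i += 1
        else:
            out1.append(""); out2.append(list2[j])  # no partner in list1
            j += 1
    # Whatever remains in either list has no partner.
    for sentence in list1[i:]:
        out1.append(sentence); out2.append("")
    for sentence in list2[j:]:
        out1.append(""); out2.append(sentence)
    return out1, out2
```

Because sentences are only ever padded, never reordered, dropping the empty strings from either output recovers the original list.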

display_diffs_jupyter(a: str, b: str)

Display the differences between text_a and text_b in Jupyter Notebook.

Parameters:

  • a (str) –

    The first string.

  • b (str) –

    The second string.

util_rag

This module contains support functions for the RAG system.

Classes:

  • Cookie

    Convenience class for Bing Cookie files, data, and configuration. This Class

Attributes:

log = Log.BingChat.debug module-attribute

Cookie

Convenience class for Bing Cookie files, data, and configuration. This class is updated dynamically by the Query class to allow cycling through more than one cookie/credentials file, e.g. when daily request limits (currently 200 per account per day) are exceeded.

Methods:

  • files

    Return a sorted list of all cookie files matching .search_pattern in

  • import_data

    Read the active cookie file and populate the following attributes:

  • import_next

    Cycle through to the next cookies file then import it.

Attributes:

current_file_index = 0 class-attribute instance-attribute

dir_path = Path.home().resolve() / 'bing_cookies' class-attribute instance-attribute

current_file_path = dir_path class-attribute instance-attribute

search_pattern = 'bing_cookies_*.json' class-attribute instance-attribute

ignore_files = set() class-attribute instance-attribute

request_count = {} class-attribute instance-attribute

supplied_files = set() class-attribute instance-attribute

rotate_cookies = True class-attribute instance-attribute

files() -> list[Path] classmethod

Return a sorted list of all cookie files matching .search_pattern in cls.dir_path, plus any supplied files, minus any ignored files.

import_data() -> None classmethod

Read the active cookie file and populate the following attributes:

- .current_file_path
- .current_data
- .image_token

import_next(discard: bool = False) -> None classmethod

Cycle through to the next cookies file then import it.

Parameters:

  • discard (bool, default: False ) –

    If True, mark the previous file to be ignored for the remainder of the current session. Otherwise, cycle through all available cookie files (sharing the workload and 'resting' when not in use).
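The file discovery and rotation behavior documented for `Cookie` can be sketched as follows. This is an illustrative stand-in rather than the class itself: `CookieRotation` is a hypothetical name, it uses instance state instead of the class attributes and classmethods listed above, and it omits reading the cookie data.

```python
from pathlib import Path


class CookieRotation:
    """Hypothetical sketch of cookie-file discovery and cycling."""

    def __init__(self, dir_path, search_pattern="bing_cookies_*.json"):
        self.dir_path = Path(dir_path)
        self.search_pattern = search_pattern
        self.supplied_files = set()
        self.ignore_files = set()
        self.current_file_index = 0

    def files(self):
        """Sorted matches in dir_path, plus supplied files, minus ignored files."""
        found = set(self.dir_path.glob(self.search_pattern)) | self.supplied_files
        return sorted(found - self.ignore_files)

    def current(self):
        files = self.files()
        return files[self.current_file_index % len(files)] if files else None

    def next(self, discard=False):
        """Advance to the next file; with discard=True, drop the current one first."""
        files = self.files()
        if not files:
            return None
        index = self.current_file_index % len(files)
        if discard:
            # The following file slides into the same index once this one is ignored.
            self.ignore_files.add(files[index])
        else:
            index += 1
        files = self.files()
        if not files:
            return None
        self.current_file_index = index % len(files)
        return files[self.current_file_index]
```

With three files `a`, `b`, and `c`, calling `next()` moves from `a` to `b`; `next(discard=True)` then ignores `b` and moves to `c`, after which rotation alternates between `a` and `c` only.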