util
thml.util
Modules:
- check_installation
- convert_doc
- cookie_tool
- show_text_diffs_in_jupyter – Show the differences between two texts side by side in a Jupyter Notebook.
- util_rag – This module contains support functions for the RAG system.
check_installation
Functions:
convert_doc
Functions:
- pdf2md – Convert a PDF file to a markdown file.
pdf2md(input_pdf: str, output_md: str, num_threads: int = 8, device: str = 'auto', do_ocr: bool = False, ocr_engine: str = 'easyocr', ocr_lang: list[str] = ['vi', 'en']) -> None
Convert a PDF file to a markdown file.
Parameters:
- input_pdf (str) – The local path or URL of the PDF document.
- output_md (str) – The path to the output markdown file.
- num_threads (int, default: 8) – The number of threads to use.
- device (str, default: 'auto') – The accelerator device to use.
- do_ocr (bool, default: False) – Whether to perform OCR on the PDF.
- ocr_engine (str, default: 'easyocr') – The OCR engine to use. Choices: "easyocr" or "rapidocr".
- ocr_lang (list[str], default: ['vi', 'en']) – The list of languages to use for OCR.
Note
- See the docling examples and supported formats.
- To use the OCR feature, you need to install the OCR engine; see its installation guide.
cookie_tool
Functions:
- search_cookie_files – Search for all cookie files that match the given patterns.
- first_cookie_file – Select the first cookie file that matches the patterns.
- read_cookies – Read all cookie files and select cookies by name.
search_cookie_files(patterns: list[str] = ['*.json']) -> list[str]
Search for all cookie files that match the given patterns.
Parameters:
- patterns (list[str], default: ['*.json']) – List of patterns (strings) to search for.
Returns:
- list[str] – List of paths of the matched files.
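The pattern-based search described above can be illustrated with a minimal sketch. This is not the library's implementation: `search_files_sketch` and its `root` parameter are hypothetical names, and the sketch assumes the patterns are ordinary glob patterns resolved against a single directory.

```python
import glob
import os
import tempfile


def search_files_sketch(patterns, root="."):
    """Hypothetical stand-in: collect paths under `root` matching any glob pattern."""
    matched = []
    for pattern in patterns:
        matched.extend(glob.glob(os.path.join(root, pattern)))
    # De-duplicate (a file may match several patterns) and sort for a stable order.
    return sorted(set(matched))


# Demo in a throwaway directory so the example does not depend on local files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.json", "b.json", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in search_files_sketch(["*.json"], root=tmp)]

print(found)
```

Sorting the result makes "the first matched file" well defined, which is what a helper like first_cookie_file would need.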
first_cookie_file(patterns: list[str]) -> str
Select the first cookie file that matches the patterns.
Parameters:
- patterns (list[str]) – List of patterns (strings) to search for.
Returns:
- str – The first path among the matched files.
read_cookies(cookie_files: Union[str, list[str]], selected_names: list[str] = None) -> list[dict]
Read all cookie files and select cookies by name.
Parameters:
- cookie_files (Union[str, list[str]]) – A cookie file path or a list of cookie file paths.
- selected_names (list[str], default: None) – Names of the cookies to select.
Returns:
- list[dict] – List of cookies.
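As an illustration of reading and filtering cookies, here is a minimal sketch. It assumes each cookie file is a JSON array of objects that carry a "name" key (a common browser-export layout); the actual file format and the behaviour when selected_names is None are assumptions, and `read_cookies_sketch` is a hypothetical name.

```python
import json
import os
import tempfile
from typing import Optional, Union


def read_cookies_sketch(
    cookie_files: Union[str, list],
    selected_names: Optional[list] = None,
) -> list:
    """Hypothetical stand-in: load cookies from JSON files, optionally filtered by name."""
    if isinstance(cookie_files, str):
        cookie_files = [cookie_files]  # accept a single path as well as a list
    cookies = []
    for path in cookie_files:
        with open(path, encoding="utf-8") as fh:
            cookies.extend(json.load(fh))
    if selected_names is not None:
        # Assumption: None means "keep everything".
        cookies = [c for c in cookies if c.get("name") in selected_names]
    return cookies


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.json")
    second = os.path.join(tmp, "second.json")
    with open(first, "w", encoding="utf-8") as fh:
        json.dump([{"name": "session", "value": "abc"}, {"name": "theme", "value": "dark"}], fh)
    with open(second, "w", encoding="utf-8") as fh:
        json.dump([{"name": "session", "value": "def"}], fh)

    everything = read_cookies_sketch(first)
    sessions = read_cookies_sketch([first, second], selected_names=["session"])

print([c["value"] for c in sessions])
```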
show_text_diffs_in_jupyter
Show the differences between two texts side by side in a Jupyter Notebook, following this article: https://skeptric.com/python-diffs/
Functions to create the diffs:
- Escape any HTML characters so that they display properly in HTML
- Align the texts at the sentence level
- Mark up the differences between the tokens in each pair of aligned sentences
- Output the marked-up and aligned sentences as side-by-side HTML
REF:
- Showing Side-by-Side Diffs in Jupyter
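The escape and token-markup steps can be sketched with the standard library. This is an illustrative sketch rather than the module's code: `markup_diff_sketch`, the `<mark>` wrapper, and whitespace tokenisation are assumptions chosen for brevity.

```python
import difflib
import html


def markup_diff_sketch(a_tokens, b_tokens, start="<mark>", end="</mark>"):
    """Hypothetical helper: wrap tokens that differ between the two lists in a tag."""
    matcher = difflib.SequenceMatcher(a=a_tokens, b=b_tokens, autojunk=False)
    out_a, out_b = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out_a.extend(a_tokens[i1:i2])
            out_b.extend(b_tokens[j1:j2])
        else:
            # replace / delete / insert: highlight whatever is present on each side
            out_a.extend(start + t + end for t in a_tokens[i1:i2])
            out_b.extend(start + t + end for t in b_tokens[j1:j2])
    return out_a, out_b


# Escape first so that literal "<" and ">" in the text are not treated as markup.
a = html.escape("the cat sat <here>").split()
b = html.escape("the dog sat <here>").split()
left, right = markup_diff_sketch(a, b)

# One table row per aligned sentence pair gives the side-by-side layout.
row = "<tr><td>{}</td><td>{}</td></tr>".format(" ".join(left), " ".join(right))
print(row)
```

Escaping before adding the highlight tags matters: doing it afterwards would escape the highlight tags themselves.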
Functions:
- html_diffs – Return the side-by-side HTML of the differences between texts a and b.
- align_sentences – Align the sentences between two lists of sentences.
- display_diffs_jupyter – Display the differences between texts a and b in a Jupyter Notebook.
Attributes:
- Token
- TokenList
- whitespace
- end_sentence
Token = str
module-attribute
TokenList = list[Token]
module-attribute
whitespace = re.compile('\\s+')
module-attribute
end_sentence = re.compile('(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!\\:)\\s+')
module-attribute
html_diffs(a: str, b: str) -> str
Return the side-by-side HTML of the differences between texts a and b.
Parameters:
- a (str) – The first string.
- b (str) – The second string.
Returns:
- html (str) – The side-by-side HTML of the differences between a and b.
align_sentences(list1: list[str], list2: list[str]) -> Tuple[list[str], list[str]]
Align the sentences between two lists of sentences.
- Use a similarity score to compare sentences between the two lists.
- Align sentences whose similarity score exceeds a certain threshold.
- Insert empty sentences where necessary, without changing the order of the sentences in the original lists.
Parameters:
- list1 (list[str]) – The first list of sentences.
- list2 (list[str]) – The second list of sentences.
Returns:
- aligned_list1 (list[str]) – The aligned list1.
- aligned_list2 (list[str]) – The aligned list2.
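One way to realise threshold-based, order-preserving alignment is a greedy look-ahead, sketched below. The similarity measure (`difflib.SequenceMatcher.ratio`), the 0.6 threshold, and the name `align_sketch` are all assumptions; the documented function may use a different score and strategy.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed measure, not the module's)."""
    return difflib.SequenceMatcher(a=a, b=b, autojunk=False).ratio()


def align_sketch(list1, list2, threshold=0.6):
    """Greedy alignment: pad with "" so both outputs have equal length and keep order."""
    out1, out2 = [], []
    j = 0  # next unconsumed index in list2
    for s1 in list1:
        # Look ahead in list2 for the first sentence similar enough to s1.
        k = next(
            (k for k in range(j, len(list2)) if similarity(s1, list2[k]) > threshold),
            None,
        )
        if k is None:
            out1.append(s1)
            out2.append("")  # s1 has no counterpart
        else:
            for s2 in list2[j:k]:
                out1.append("")  # skipped list2 sentences have no counterpart
                out2.append(s2)
            out1.append(s1)
            out2.append(list2[k])
            j = k + 1
    for s2 in list2[j:]:
        out1.append("")
        out2.append(s2)
    return out1, out2


aligned1, aligned2 = align_sketch(
    ["Hello world.", "This is new.", "Goodbye now."],
    ["Hello world!", "Goodbye now."],
)
print(list(zip(aligned1, aligned2)))
```

The greedy look-ahead is simple but can skip over a better later match; a dynamic-programming alignment would be more robust at the cost of more code.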
display_diffs_jupyter(a: str, b: str)
Display the differences between texts a and b in a Jupyter Notebook.
Parameters:
- a (str) – The first string.
- b (str) – The second string.
util_rag
This module contains support functions for the RAG system.
Classes:
- Cookie – Convenience class for Bing Cookie files, data, and configuration.
Attributes:
- log
log = Log.BingChat.debug
module-attribute
Cookie
Convenience class for Bing Cookie files, data, and configuration. This class is updated dynamically by the Query class to allow cycling through more than one cookie/credentials file, e.g. when daily request limits (currently 200 per account per day) are exceeded.
Methods:
- files – Return a sorted list of all cookie files matching .search_pattern in cls.dir_path, plus any supplied files, minus any ignored files.
- import_data – Read the active cookie file and populate .current_file_path, .current_data, and .image_token.
- import_next – Cycle through to the next cookie file, then import it.
Attributes:
- current_file_index
- dir_path
- current_file_path
- search_pattern
- ignore_files
- request_count
- supplied_files
- rotate_cookies
current_file_index = 0
class-attribute
instance-attribute
dir_path = Path.home().resolve() / 'bing_cookies'
class-attribute
instance-attribute
current_file_path = dir_path
class-attribute
instance-attribute
search_pattern = 'bing_cookies_*.json'
class-attribute
instance-attribute
ignore_files = set()
class-attribute
instance-attribute
request_count = {}
class-attribute
instance-attribute
supplied_files = set()
class-attribute
instance-attribute
rotate_cookies = True
class-attribute
instance-attribute
files() -> list[Path]
classmethod
Return a sorted list of all cookie files matching .search_pattern in cls.dir_path, plus any supplied files, minus any ignored files.
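The "matching files, plus supplied, minus ignored" rule maps directly onto set operations. The sketch below is an assumption about how such a listing could be written with pathlib; `list_cookie_files_sketch` is a hypothetical free function, not the classmethod itself.

```python
import tempfile
from pathlib import Path


def list_cookie_files_sketch(dir_path, search_pattern, supplied_files=(), ignore_files=()):
    """Hypothetical stand-in: glob matches plus supplied files, minus ignored files, sorted."""
    found = set(Path(dir_path).glob(search_pattern))
    return sorted((found | set(supplied_files)) - set(ignore_files))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("bing_cookies_2.json", "bing_cookies_1.json", "other.json"):
        (root / name).write_text("[]")

    result = list_cookie_files_sketch(
        root,
        "bing_cookies_*.json",
        supplied_files={root / "other.json"},          # does not match the pattern, added explicitly
        ignore_files={root / "bing_cookies_2.json"},   # matches the pattern, but excluded
    )
    names = [p.name for p in result]

print(names)
```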
import_data() -> None
classmethod
Read the active cookie file and populate the following attributes:
- .current_file_path
- .current_data
- .image_token
import_next(discard: bool = False) -> None
classmethod
Cycle through to the next cookie file, then import it.
Parameters:
- discard (bool, default: False) – If True, mark the previous file to be ignored for the remainder of the current session. Otherwise, cycle through all available cookie files (sharing the workload and 'resting' each file when not in use).
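The rotation behaviour can be illustrated with a small stand-alone sketch. `CookieRotationSketch` is a hypothetical class that only tracks file names; it does not read files or talk to the Query class, and the exact index handling in the real Cookie class may differ.

```python
class CookieRotationSketch:
    """Hypothetical model of round-robin rotation with an optional discard."""

    def __init__(self, files):
        self.files = sorted(files)
        self.ignored = set()  # files dropped for the rest of the session
        self.index = 0

    def available(self):
        return [f for f in self.files if f not in self.ignored]

    def current(self):
        return self.available()[self.index]

    def next(self, discard=False):
        if discard:
            # Dropping the current file shifts the next one into the same index.
            self.ignored.add(self.current())
            step = 0
        else:
            step = 1
        remaining = self.available()
        if not remaining:
            raise RuntimeError("no cookie files left to rotate through")
        self.index = (self.index + step) % len(remaining)
        return remaining[self.index]


rotation = CookieRotationSketch(["a.json", "b.json", "c.json"])
order = [rotation.current(), rotation.next(), rotation.next(discard=True), rotation.next()]
print(order)
```

With discard=False every file gets a turn and wraps around; with discard=True the exhausted file is removed from the cycle, which matches the daily-limit scenario described for the class.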