Coverage for src/local_deep_research/document_loaders/__init__.py: 100%

5 statements  

coverage.py v7.13.4, created at 2026-02-25 01:07 +0000

"""
Document loaders module.

Provides centralized document loading functionality for both:
- Collection uploads (bytes from HTTP requests)
- Local search engine (file paths on disk)

Supported formats (35+):

Documents:
- PDF (.pdf)
- Text (.txt)
- Markdown (.md, .markdown)
- Word (.doc, .docx)
- ODT (.odt) - OpenDocument text
- RTF (.rtf) - Rich Text Format
- RST (.rst) - reStructuredText documentation

Presentations:
- PowerPoint (.ppt, .pptx)

Spreadsheets:
- Excel (.xls, .xlsx)
- CSV (.csv), TSV (.tsv)

Data formats:
- JSON (.json)
- YAML (.yaml, .yml)
- XML (.xml) - important for USPTO patent data
- TOML (.toml) - config files

Web content:
- HTML (.html, .htm)
- MHTML (.mhtml, .mht) - saved web pages

Images (OCR):
- PNG, JPG, JPEG, TIFF, BMP, HEIC

Research/Notes:
- Jupyter Notebooks (.ipynb)
- Evernote exports (.enex)
- EPUB (.epub) - ebooks (requires pandoc)
- Org (.org) - Emacs org-mode files
- Email (.eml) - email messages
"""

from .bytes_loader import extract_text_from_bytes, load_from_bytes
from .json_loader import SimpleJSONLoader
from .loader_registry import (
    get_loader_class_for_extension,
    get_loader_for_path,
    get_supported_extensions,
    is_extension_supported,
)
from .yaml_loader import YAMLLoader

__all__ = [
    # Bytes loading (for uploads)
    "load_from_bytes",
    "extract_text_from_bytes",
    # Path loading (for local files)
    "get_loader_for_path",
    # Registry functions
    "get_supported_extensions",
    "is_extension_supported",
    "get_loader_class_for_extension",
    # Custom loaders
    "YAMLLoader",
    "SimpleJSONLoader",
]

68 "YAMLLoader", 

69 "SimpleJSONLoader", 

70]