Megu Package
megu.cli
style
Contains some style definitions for the CLI.
ui
utils
megu.constants
Contains project-wide constants.
- megu.constants.STAGING_DIR¶
The directory path where the application downloads content fragments to.
- Type:
megu.env
Contains available environment configs and defaults.
- class megu.env.MeguEnv(cache_dir=PosixPath('/home/docs/.cache/megu'), log_dir=PosixPath('/home/docs/.cache/megu/log'), plugin_dir=PosixPath('/home/docs/.config/megu/plugins'), download_dir=PosixPath('/home/docs/Downloads'))[source]¶
Defines available environment configuration values.
- cache_dir¶
The directory where persistent caches should be stored. Read from MEGU_CACHE_DIR.
- Type:
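As a rough illustration of how an environment-backed value such as cache_dir might be resolved, the sketch below reads MEGU_CACHE_DIR and falls back to a default. The EnvSketch class, the load_env helper, and the fallback path are assumptions made for this example; they are not megu's implementation.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvSketch:
    """Hypothetical stand-in for an environment config holder."""

    cache_dir: Path


def load_env(environ: Optional[Mapping[str, str]] = None) -> EnvSketch:
    """Build the config from environment variables, falling back to a default."""
    if environ is None:
        environ = os.environ
    # Assumed default for illustration; the real default may be computed differently.
    default_cache_dir = Path.home() / ".cache" / "megu"
    return EnvSketch(cache_dir=Path(environ.get("MEGU_CACHE_DIR", default_cache_dir)))
```

Passing the mapping explicitly keeps the helper easy to exercise without touching the real process environment.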
megu.config
Contains project-wide configuration values.
megu.download
Contains the namespace for content downloaders.
base
Contains the abstractions necessary to build content downloaders.
- megu.download.base.DEFAULT_MAX_CONNECTIONS¶
The maximum number of connections permitted for a standard download.
- Type:
- class megu.download.base.BaseDownloader[source]¶
The base downloader that all content downloaders should inherit from.
- abstract classmethod can_handle(content)[source]¶
Check if some given content can be handled by the downloader.
- abstract download_content(content, max_connections=8, update_hook=None)[source]¶
Download the resources of some content to temporary storage.
- Parameters:
content (Content) – The content to download.
max_connections (int, optional) – The maximum number of connections to use when downloading the content. Defaults to DEFAULT_MAX_CONNECTIONS.
update_hook (Optional[Callable[[int], Any]], optional) – Callable for reporting downloaded chunk sizes. Defaults to None.
- Returns:
The manifest of downloaded content and local file artifacts.
- Return type:
discover
Contains the functionality to discover the currently available downloaders.
- megu.download.discover.discover_downloaders()[source]¶
Discover the available downloaders in the project.
- Yields:
Type[BaseDownloader] – The currently available downloaders.
- Return type:
http
Contains logic for handling HTTP downloads.
- megu.download.http.DEFAULT_CHUNK_SIZE¶
The default chunk size in bytes that the HTTP downloader should use for streaming content.
- Type:
- megu.download.http.DEFAULT_MAX_CONNECTIONS¶
The default maximum number of HTTP connections the downloader should use.
- Type:
- megu.download.http.CONTENT_RANGE_PATTERN¶
A compiled regex pattern to help match Content-Range header values.
- Type:
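For readers unfamiliar with Content-Range values, the following sketch shows one way such a header could be matched and split into its parts. The CONTENT_RANGE_SKETCH pattern and the parse_content_range helper are assumptions written for this example; the actual CONTENT_RANGE_PATTERN in megu may be defined differently.

```python
import re
from typing import Optional, Tuple

# Assumed pattern for illustration only; it covers the common "bytes" unit.
CONTENT_RANGE_SKETCH = re.compile(
    r"^bytes (?P<start>\d+)-(?P<end>\d+)/(?P<size>\d+|\*)$"
)


def parse_content_range(value: str) -> Optional[Tuple[int, int, Optional[int]]]:
    """Return (start, end, total size), or None if the value does not match."""
    match = CONTENT_RANGE_SKETCH.match(value.strip())
    if match is None:
        return None
    size = match.group("size")
    return (
        int(match.group("start")),
        int(match.group("end")),
        # A "*" means the server did not report the complete length.
        None if size == "*" else int(size),
    )
```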
- class megu.download.http.HttpDownloader[source]¶
Downloader for traditional HTTP resources.
- classmethod can_handle(content)[source]¶
Check if some given content can be handled by the HTTP downloader.
- download_content(content, max_connections=8, update_hook=None)[source]¶
Download the resource of some content to temporary storage.
- Parameters:
content (Content) – The content to download.
max_connections (int, optional) – The maximum number of connections to use when downloading the content. Defaults to DEFAULT_MAX_CONNECTIONS.
update_hook (Optional[Callable[[int], Any]], optional) – Callable for reporting downloaded chunk sizes. Defaults to None.
- Returns:
The manifest of downloaded content and local file artifacts.
- Return type:
- download_resource(resource, resource_index, to_path, chunk_size=4096, update_hook=None)[source]¶
Download some resource to a specific filepath.
- Parameters:
resource (HttpResource) – The resource to download.
resource_index (int) – The content’s index of the resource in its list of resources.
to_path (Path) – The filepath to download the resource to.
chunk_size (int, optional) – The byte size of chunks to stream the resource data in. Defaults to DEFAULT_CHUNK_SIZE.
update_hook (Optional[Callable[[int], Any]], optional) – Callable for reporting downloaded chunk sizes. Defaults to None.
- Raises:
ValueError – When attempting to download the resource fails for any reason.
- Returns:
A tuple containing the index, the resource, and the path the resource was downloaded to.
- Return type:
Tuple[int, HttpResource, Path]
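The interplay between chunk_size and update_hook can be sketched with the standard library alone. The stream_to_path helper below is hypothetical and copies from an in-memory stream rather than an HTTP response; it only illustrates how a hook might receive the size of each written chunk.

```python
import tempfile
from io import BytesIO
from pathlib import Path
from typing import Any, BinaryIO, Callable, List, Optional


def stream_to_path(
    source: BinaryIO,
    to_path: Path,
    chunk_size: int = 4096,
    update_hook: Optional[Callable[[int], Any]] = None,
) -> Path:
    """Copy a binary stream to a file in fixed-size chunks."""
    with to_path.open("wb") as file_io:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: source.read(chunk_size), b""):
            file_io.write(chunk)
            if update_hook is not None:
                # Report how many bytes were just written.
                update_hook(len(chunk))
    return to_path


# Collect the reported chunk sizes, as a progress bar might.
reported: List[int] = []
with tempfile.TemporaryDirectory() as dirname:
    target = stream_to_path(
        BytesIO(b"x" * 10000), Path(dirname) / "out.bin", update_hook=reported.append
    )
    written = target.stat().st_size
```

Because the hook receives sizes rather than totals, a caller can feed it directly into a progress indicator that accumulates increments.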
megu.exceptions
Contains definitions for custom project exceptions.
megu.filters
Contains some really basic content filters.
- megu.filters.best_content(content)[source]¶
Get the best quality content from the extracted content iterator.
- megu.filters.specific_content(content, **conditions)[source]¶
Apply many filters to an iterable of content instances.
With no conditions provided, no content will be filtered out and all content instances will be returned. When conditions are provided, matching filter handlers will be dynamically applied to filter out content instances.
- Parameters:
- Returns:
An iterator for filtered content.
- Return type:
Iterable[Content]
- Yields:
Content – Content instances which have passed all defined filters.
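The keyword-condition behaviour described for specific_content can be pictured with a small generator. Everything in the sketch below is hypothetical: ContentSketch stands in for the real content model, and the min_quality and type handlers are invented condition names rather than the filters megu actually ships.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator


@dataclass
class ContentSketch:
    """Hypothetical stand-in for a content model."""

    name: str
    quality: float
    type: str


# Hypothetical registry mapping a condition name to a filter handler.
FILTER_HANDLERS: Dict[str, Callable[[ContentSketch, Any], bool]] = {
    "min_quality": lambda content, value: content.quality >= value,
    "type": lambda content, value: content.type == value,
}


def filter_content(
    content: Iterable[ContentSketch], **conditions: Any
) -> Iterator[ContentSketch]:
    """Lazily yield the items that pass every requested condition."""
    for item in content:
        # all() over zero conditions is True, so nothing is filtered out.
        if all(
            FILTER_HANDLERS[name](item, value) for name, value in conditions.items()
        ):
            yield item
```

In this sketch an unknown condition name raises KeyError; whether the real function rejects or ignores unknown conditions is not stated here.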
megu.hasher
This module provides simple safe hashing functions.
We support only a subset of the hashing algorithms available in hashlib, as several of them (such as sha224) are rarely used.
Tip
The provided basic functions allow you to calculate multiple hashes at the same time, which means that your bottleneck will be the slowest hashing algorithm you request.
>>> from megu.hasher import hash_io, HashType
>>> with open("/home/user/A/PATH/TO/A/FILE", "rb") as file_io:
...     hashes = hash_io(file_io, {HashType.MD5, HashType.SHA256})
>>> hashes
{
    <HashType.SHA256: 'sha256'>: 'f0e4c2f76c58916ec258f246851bea091d14d4247a2f...',
    <HashType.MD5: 'md5'>: 'a46062d24103b87560b2dc0887a1d5de'
}
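To make the single-pass idea concrete, here is a minimal sketch built directly on hashlib. The hash_stream function and its string-based algorithm names are assumptions for this example and are not the megu.hasher API; the point is only that each chunk is read once and fed to every requested hasher.

```python
import hashlib
from io import BytesIO
from typing import BinaryIO, Dict, Iterable


def hash_stream(
    io: BinaryIO, names: Iterable[str], chunk_size: int = 65536
) -> Dict[str, str]:
    """Compute several hex digests while reading the stream only once."""
    hashers = {name: hashlib.new(name) for name in names}
    for chunk in iter(lambda: io.read(chunk_size), b""):
        # Every chunk goes to every hasher, so the slowest algorithm sets the pace.
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


digests = hash_stream(BytesIO(b"Hey, I'm a string"), {"md5", "sha256"}, chunk_size=4)
```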
- megu.hasher.DEFAULT_CHUNK_SIZE¶
The default size in bytes to chunk file streams for hashing.
- Type:
- class megu.hasher.HashType(value)[source]¶
Enumeration of supported hash types.
- property hasher: Callable[[bytes | bytearray | memoryview], hashlib._Hash]¶
Get the hasher callable for the current hash type.
- megu.hasher.hash_file(filepath, types, chunk_size=65536)[source]¶
Calculate the requested hash types for some given file path instance.
Basic usage of this function typically looks like the following:
>>> from pathlib import Path
>>> from megu.hasher import hash_file, HashType
>>> big_file_path = Path("/home/USER/A/PATH/TO/A/BIG/FILE")
>>> hash_file(big_file_path, {HashType("md5"), HashType.SHA256})
{
    <HashType.SHA256: 'sha256'>: 'f0e4c2f76c58916ec258f246851bea091d14d4247a2f...',
    <HashType.MD5: 'md5'>: 'a46062d24103b87560b2dc0887a1d5de'
}
- Parameters:
- Raises:
FileNotFoundError – If the given filepath does not point to an existing file.
ValueError – If one of the given types is not supported.
- Returns:
A dictionary mapping each requested hash type to its calculated hexdigest.
- Return type:
Dict[HashType, str]
- megu.hasher.hash_io(io, types, chunk_size=65536)[source]¶
Calculate the requested hash types for some given binary IO instance.
>>> from io import BytesIO
>>> from megu.hasher import hash_io, HashType
>>> hash_io(BytesIO(b"Hey, I'm a string"), {HashType("sha256"), HashType.MD5})
{
    <HashType.SHA256: 'sha256'>: 'f0e4c2f76c58916ec258f246851bea091d14d4247a2f...',
    <HashType.MD5: 'md5'>: '25cb7b2c4e2064c1deebac4b66195c9c'
}
If you need to hash a StringIO instead, it is up to you to do whatever conversion is needed to create a BytesIO instance. This typically involves reading the entire string and encoding it.
>>> from io import BytesIO, StringIO
>>> from megu.hasher import hash_io, HashType
>>> string_io = StringIO("Hey, I'm a string")
>>> byte_io = BytesIO(string_io.read().encode("utf-8"))
>>> hash_io(byte_io, {HashType.SHA256, HashType("md5")})
{
    <HashType.SHA256: 'sha256'>: 'f0e4c2f76c58916ec258f246851bea091d14d4247a2f...',
    <HashType.MD5: 'md5'>: '25cb7b2c4e2064c1deebac4b66195c9c'
}
- Parameters:
io (BinaryIO) – The IO to calculate hashes for.
types (Set[HashType]) – The set of hash types to calculate.
chunk_size (int) – The number of bytes to load from the buffer into memory at a time. Defaults to DEFAULT_CHUNK_SIZE.
- Raises:
ValueError – If one of the given types is not supported.
- Returns:
A dictionary mapping each requested hash type to its calculated hexdigest.
- Return type:
Dict[HashType, str]
megu.helpers
Contains helper methods that plugins can use to simplify usage.
- megu.helpers.disk_cache(cache_name)[source]¶
Context manager for creating or accessing a local disk cache.
We recommend that you avoid using a disk cache if at all possible. The feature to define and use a disk-persisted cache was introduced for the purpose of caching fetched API tokens between runs (such as OAuth Bearer tokens). You should not be caching content; you should be downloading content.
Important
As a relatively naive precaution, we don’t allow path separators or spaces in the cache name. To enforce this, the name of the cache must match the following pattern:
^[a-z]+[a-z0-9_-]{3,31}[a-z0-9]$
For this reason, we recommend that you use your plugin’s package name as the name for your plugin’s disk-persisted cache.
Warning
Please be reasonable about what you are caching. No one wants people taking advantage of their disk space.
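The cache name pattern quoted for disk_cache can be checked ahead of time with the re module. The is_valid_cache_name helper below is a hypothetical convenience written for this example; only the pattern itself comes from the documentation.

```python
import re

# Pattern quoted from the disk_cache documentation.
CACHE_NAME_PATTERN = re.compile(r"^[a-z]+[a-z0-9_-]{3,31}[a-z0-9]$")


def is_valid_cache_name(cache_name: str) -> bool:
    """Check a candidate cache name against the documented pattern."""
    # fullmatch also rejects a trailing newline, which "$" alone would tolerate.
    return CACHE_NAME_PATTERN.fullmatch(cache_name) is not None
```

Names such as "megu_plugin" pass, while names that contain a space or a path separator, start with an uppercase letter, end with an underscore, or are shorter than five characters do not.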
- megu.helpers.get_soup(markup)[source]¶
Get a BeautifulSoup instance for some HTML markup.
- Parameters:
markup (str) – The HTML markup to use when building a BeautifulSoup instance.
- Returns:
The parsed soup for the given HTML markup.
- Return type:
BeautifulSoup
- megu.helpers.http_session()[source]¶
Context manager for creating a requests HTTP session to make basic requests.
- megu.helpers.noop(*args, **kwargs)[source]¶
Noop function that does absolutely nothing.
- Return type:
- class megu.helpers.noop_class(**kwargs)[source]¶
Noop class that allows for everything but does nothing.
- megu.helpers.python_path(*paths)[source]¶
Context manager for temporarily adding directories to the Python search path.
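A context manager of this kind is commonly written by saving sys.path and restoring it on exit. The python_path_sketch function below is an assumption about the general technique, not a copy of megu.helpers.python_path; in particular, prepending rather than appending the directories is a choice made for this example.

```python
import sys
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def python_path_sketch(*paths: Path) -> Iterator[None]:
    """Temporarily prepend directories to sys.path."""
    original = sys.path[:]
    try:
        for path in paths:
            sys.path.insert(0, str(path))
        yield
    finally:
        # Restore the saved search path, even if the body raised.
        sys.path[:] = original
```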
- megu.helpers.temporary_directory(prefix, dirpath=None)[source]¶
Context manager for creating a temporary directory at the appropriate location.
- Parameters:
- Raises:
NotADirectoryError – When the provided dirpath does not exist.
- Yields:
Path – The temporary directory’s path.
- Return type:
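The documented behaviour (an optional parent directory, a NotADirectoryError when it is missing, and a yielded Path) can be approximated with tempfile. The temporary_directory_sketch function is a hypothetical re-creation for illustration and may differ from the real helper in where it places directories by default.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Optional


@contextmanager
def temporary_directory_sketch(
    prefix: str, dirpath: Optional[Path] = None
) -> Iterator[Path]:
    """Yield a temporary directory that is removed when the block exits."""
    if dirpath is not None and not dirpath.is_dir():
        raise NotADirectoryError(f"no such directory {dirpath}")
    with tempfile.TemporaryDirectory(prefix=prefix, dir=dirpath) as temp_dirname:
        yield Path(temp_dirname)


with temporary_directory_sketch("megu-") as temp_dirpath:
    existed_inside = temp_dirpath.is_dir()
# The directory is cleaned up once the with block has finished.
existed_after = temp_dirpath.exists()
```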
- megu.helpers.temporary_file(prefix, mode, dirpath=None)[source]¶
Context manager for opening a temporary file at the appropriate location.
- Parameters:
- Raises:
NotADirectoryError – When the provided dirpath does not exist.
- Yields:
Tuple[Path, IO] – A tuple containing the temporary file’s path and the file handle.
- Return type:
megu.log
Contains logger configuration and creation.
We use Loguru to handle all the complexities of logging. Loguru works with the concept of a single global logger which is used throughout the entire application. Since this project is a single tool that doesn’t need to handle complex threading or distributed processing, this style of a single global logger works fine.
Examples
Almost all usage of this logger should look like the following:
from .log import instance as log
log.debug("My logged message here")
If you need to re-configure the logger for debug logging or for other intricate logging handler settings, you should do so through the configure_logger() function:
from .log import configure_logger, instance
configure_logger(instance, debug=True)
- megu.log.instance¶
The configured global logger instance that should likely always be used.
- Type:
loguru.Logger
- megu.log.configure_logger(logger, level='CRITICAL', debug=False, record=False)[source]¶
Configure the global logger.
- Parameters:
logger (loguru.Logger) – The global logger instance to configure.
level (str, optional) – The string level to filter logging messages through. Defaults to “CRITICAL”.
debug (bool, optional) – If True, configures the logger with the debug configuration. Defaults to False.
record (bool, optional) – If True, logs will be recorded and written out to the log directory. Defaults to False.
- Returns:
The newly configured global logger
- Return type:
loguru.Logger
megu.models
Contains data models to use throughout the project.
content
Contains definitions of content types used throughout the project.
- class megu.models.content.Url[source]¶
A basic wrapper around a furl URL to keep things consistent between plugins and the internals of the package without declaring a direct dependency on a third-party package.
- class megu.models.content.Checksum(**data)[source]¶
Describes a checksum that should be used for content validation.
- Parameters:
type (HashType) – The type of checksum hash being defined.
hash (str) – The value of the checksum hash being defined.
data (Dict, optional) – Model parameter dictionary provided by pydantic. You should likely never use this property unless you need a keyword argument for a dictionary payload to construct the model.
- class megu.models.content.Content(**data)[source]¶
Describes some extracted content that can be downloaded.
- Parameters:
id (str) – The plugin-defined content-unique identifier for the content.
name (str) – The human-readable name to describe the content.
url (str) – The absolute URL from where the plugin extracted the content. This URL string gets translated into a Url instance.
quality (float) – The plugin-defined arbitrary quality of the content.
size (int) – The size in bytes the content will take up on the local filesystem.
type (str) – The appropriate mimetype of the content.
resources (List[Resource]) – The resources required to fetch and download the extracted content.
meta (Meta) – The structured metadata of the extracted content.
checksums (List[Checksum]) – A list of checksums that can be used to verify the downloaded content.
extra (Dict[str, Any]) – The unstructured metadata of the extracted content.
data (Dict, optional) – Model parameter dictionary provided by pydantic. You should likely never use this property unless you need a keyword argument for a dictionary payload to construct the model.
- class megu.models.content.Manifest(**data)[source]¶
Describes the downloaded artifacts ready to be merged.
- Parameters:
content (Content) – The content instance that was downloaded.
artifacts (List[Tuple[Resource, Path]]) – A list of (resource, path) tuples for the content resources that were downloaded to the local filesystem.
data (Dict, optional) – Model parameter dictionary provided by pydantic. You should likely never use this property unless you need a keyword argument for a dictionary payload to construct the model.
- class megu.models.content.Meta(**data)[source]¶
Describes some additional metadata about the extracted content.
- Parameters:
id (Optional[str], optional) – The site internal identifier for the extracted content.
title (Optional[str], optional) – The site defined title for the extracted content.
description (Optional[str], optional) – The site defined description for the extracted content.
publisher (Optional[str], optional) – The site defined publisher name for the extracted content.
published_at (Optional[datetime], optional) – The site defined datetime timestamp for when the extracted content was published.
filename (Optional[str], optional) – The site defined filename for the extracted content.
thumbnail (Optional[str], optional) – The URL for the thumbnail of the extracted content.
data (Dict, optional) – Model parameter dictionary provided by pydantic. You should likely never use this property unless you need a keyword argument for a dictionary payload to construct the model.
- class megu.models.content.Resource(**data)[source]¶
The base resource class that resource types must inherit from.
Important
This class is abstract and used as a typing interface for the Content model. Concrete implementations of this abstract class such as HttpResource must be provided to content in order for the application to understand how to fetch the content.
- Parameters:
data (Dict, optional) – You should never use this parameter. Since this is an abstract class, you should never be instantiating it.
- abstract property fingerprint: str¶
Get the unique identifier of a resource.
- Raises:
NotImplementedError – Subclasses must implement this property.
- Returns:
A string fingerprint of the resource.
- Return type:
http
Contains definitions of HTTP resource types used throughout the project.
- class megu.models.http.HttpMethod(value)[source]¶
Enumeration of the available HTTP methods that resources can use.
- class megu.models.http.HttpResource(**data)[source]¶
Describes a downloadable HTTP resource that is part of some local content.
- Parameters:
method (HttpMethod) – The HTTP method that should be used to fetch this resource.
url (str) – The URL that should be used to fetch this resource. This URL string gets translated into a Url instance.
headers (dict) – The dictionary of headers to use to fetch this resource (if any).
data (Optional[bytes], optional) – The data body to send in the resource request (if any).
auth (Optional[Callable[[requests.Request], requests.Request]], optional) – A callable that mutates a request to ensure it is authenticated for fetching the resource.
- fingerprint¶
Get a computed unique identifier for the resource.
- Returns:
The unique identifier for the resource.
- Return type:
- classmethod from_request(request)[source]¶
Produce a resource from an existing prepared request.
- Parameters:
request (PreparedRequest) – The request to construct a resource from.
- Returns:
The newly produced resource.
- Return type:
types
Contains custom model types to be used in model implementations.
- class megu.models.types.Url(url='', args=<object object>, path=<object object>, fragment=<object object>, scheme=<object object>, netloc=<object object>, origin=<object object>, fragment_path=<object object>, fragment_args=<object object>, fragment_separator=<object object>, host=<object object>, port=<object object>, query=<object object>, query_params=<object object>, username=<object object>, password=<object object>, strict=False)[source]¶
A URL validated by AnyHttpUrl and cast as a furl.
- classmethod validate(value, field, config)[source]¶
Validate and parse the given URL string value.
- Parameters:
value (Any) – The URL provided by a user.
field (ModelField) – The field instance the URL is using.
config (BaseConfig) – The config instance the URL is in.
- Returns:
The furl instance of the given URL string.
- Return type:
furl.furl.furl
megu.plugin
Contains logic for producing and loading plugins for the project.
base
Contains the abstractions necessary for the plugin discovery to work.
- class megu.plugin.base.BasePlugin[source]¶
The base plugin that all plugins should inherit from.
This class should mostly be excluded from testing as it should only ever define an interface and not provide much if any implementation.
- __str__()[source]¶
Build a human-friendly string representation of a plugin.
- Returns:
The human-friendly string representation of a plugin.
- Return type:
discover
Contains logic to discover and load compatible plugins from a directory.
- megu.plugin.discover.discover_plugins(package_dirpath, plugin_type=<class 'megu.plugin.base.BasePlugin'>)[source]¶
Discover and load plugins from a given directory of plugin modules.
- Parameters:
- Raises:
PluginFailure – When a discovered plugin fails to load.
- Yields:
Tuple[str, List[BasePlugin]] – A tuple of the plugin name and the instances of exported plugins from that plugin module.
- Return type:
- megu.plugin.discover.iter_available_plugins(plugin_dirpath=None, plugin_type=<class 'megu.plugin.base.BasePlugin'>)[source]¶
Get all available plugins from the given plugin directory.
- Parameters:
plugin_dirpath (Path, optional) – The path to the directory where plugins are installed. Defaults to PLUGIN_DIR.
plugin_type (Type, optional) – The type of plugins to load. Defaults to BasePlugin.
- Yields:
Tuple[str, List[BasePlugin]] – A tuple of the plugin name and the instances of exported plugins from available plugin modules.
- Return type:
- megu.plugin.discover.load_plugin(plugin_name, plugin_class)[source]¶
Load a plugin instance from a given plugin class.
- Parameters:
plugin_name (str) – The name of the plugin package
plugin_class (Type[BasePlugin]) – The plugin class from the plugin package
- Raises:
PluginFailure – When the plugin fails to load
- Returns:
The loaded plugin instance
- Return type:
- megu.plugin.discover.load_plugin_module(module_name)[source]¶
Load/import a plugin module given the module name.
- Parameters:
module_name (str) – The name of the plugin module
- Raises:
PluginFailure – When the plugin module fails to import
- Returns:
The imported plugin module
- Return type:
generic
Contains a very generic fallback plugin.
- class megu.plugin.generic.GenericPlugin[source]¶
A very generic fallback plugin.
This plugin assumes that the given URL can just be downloaded with a single HTTP GET request and that it produces a single artifact that only needs to be renamed.
- extract_content(url)[source]¶
Extract the content from the given Url instance.
This extraction makes a single HTTP HEAD request to fetch Content-Length and Content-Type. Beyond that, it returns a single content instance based on the hash of the given Url.
- merge_manifest(manifest, to_path)[source]¶
Merge the given manifest artifacts into a single filepath.
- Parameters:
- Raises:
ValueError – When the provided manifest contains more than 1 artifact.
- Returns:
The filepath the artifacts were merged into.
- Return type:
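Since the generic plugin expects exactly one artifact that only needs renaming, its merge step can be pictured as a guarded move. The merge_single_artifact helper below is hypothetical and works on plain paths instead of a Manifest; rejecting an empty list as well is an extra choice made for this sketch.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def merge_single_artifact(artifacts: List[Path], to_path: Path) -> Path:
    """Move the only artifact to its final path, rejecting any other count."""
    if len(artifacts) != 1:
        raise ValueError(f"expected exactly 1 artifact, received {len(artifacts)}")
    # shutil.move also works when source and target are on different filesystems.
    shutil.move(str(artifacts[0]), str(to_path))
    return to_path


with tempfile.TemporaryDirectory() as dirname:
    artifact = Path(dirname) / "artifact.part"
    artifact.write_bytes(b"payload")
    merged = merge_single_artifact([artifact], Path(dirname) / "final.bin")
    merged_bytes = merged.read_bytes()
    artifact_remains = artifact.exists()
```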
manage
Contains logic to install plugins into a directory.
- megu.plugin.manage.add_plugin(package, plugin_dirpath=None, silence_subprocess=False)[source]¶
Install a plugin utilizing pip.
Important
If your package is not installable via pip through any of the distribution methods that pip checks (pypi, git, local, etc.), installation of your plugin simply will not work.
- Parameters:
package (str) – The package identifier that pip should use to discover and install your plugin.
plugin_dirpath (Path, optional) – The directory the plugin should be installed to. Defaults to PLUGIN_DIR.
silence_subprocess (bool) – If set to True, will redirect output of subprocess calls to /dev/null. Defaults to False.
- Returns:
The directory the plugin was installed to.
- Return type:
- megu.plugin.manage.remove_plugin(package, plugin_dirpath=None)[source]¶
Remove the given package if it exists in the plugin directory.
- Parameters:
package (str) – The name of the package to remove.
plugin_dirpath (Path, optional) – The plugin directory to remove the package from. Defaults to PLUGIN_DIR.
- Raises:
NotADirectoryError – If the given package does not exist as a subdirectory within the given plugin directory.
megu.services
Contains helpful service functions that should really only be used during runtime.
- megu.services.get_downloader(content)[source]¶
Get the best available downloader for the given content.
- Parameters:
content (Content) – The content that the downloader should be able to handle.
- Returns:
The best available downloader instance for the given content.
- Return type:
BaseDownloader
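One plausible way to choose a downloader is to ask each candidate class whether it can handle the content and instantiate the first that says yes. The classes and the pick_downloader function below are stand-ins invented for this sketch, and plain URL strings replace the real Content model; how megu ranks the "best" downloader is not specified here.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Optional, Type


class DownloaderSketch(ABC):
    """Hypothetical stand-in for a base downloader."""

    @classmethod
    @abstractmethod
    def can_handle(cls, content: str) -> bool:
        """Report whether this downloader understands the content."""


class HttpSketch(DownloaderSketch):
    @classmethod
    def can_handle(cls, content: str) -> bool:
        return content.startswith(("http://", "https://"))


class FtpSketch(DownloaderSketch):
    @classmethod
    def can_handle(cls, content: str) -> bool:
        return content.startswith("ftp://")


def pick_downloader(
    content: str, candidates: Iterable[Type[DownloaderSketch]]
) -> Optional[DownloaderSketch]:
    """Instantiate the first candidate that reports it can handle the content."""
    for downloader_class in candidates:
        if downloader_class.can_handle(content):
            return downloader_class()
    return None
```

Keeping can_handle a classmethod means no downloader has to be constructed just to be rejected.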
- megu.services.get_plugin(url, plugin_dirpath=None)[source]¶
Get the best available plugin for a given URL.
- megu.services.iter_content(url, plugin)[source]¶
Shortcut to discover and iterate over content for a given URL.
- megu.services.merge_manifest(plugin, manifest, to_path)[source]¶
Merge a manifest with the given plugin and finalize content to the given path.
- Parameters:
- Raises:
FileExistsError – If the given output path already exists.
- Returns:
The path the merged content was finalized to.
- Return type:
megu.utils
Contains utilities for the framework to use.
These helper/utility functions should not be exposed to plugins.
- megu.utils.allocate_storage(to_path, size)[source]¶
Allocate a specific number of bytes to a non-existing filepath.
- Parameters:
- Raises:
FileExistsError – If the given filepath already exists
- Returns:
The given filepath
- Return type:
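Pre-allocating a file of a known size can be sketched with exclusive-create mode and truncate. The allocate_storage_sketch function is an assumption about the general approach and is not the body of megu.utils.allocate_storage.

```python
import tempfile
from pathlib import Path


def allocate_storage_sketch(to_path: Path, size: int) -> Path:
    """Create a new file of exactly `size` bytes."""
    # Mode "xb" creates the file and raises FileExistsError if it already exists.
    with to_path.open("xb") as file_io:
        # Growing an empty file with truncate() pads it (with zero bytes on most platforms).
        file_io.truncate(size)
    return to_path


with tempfile.TemporaryDirectory() as dirname:
    allocated = allocate_storage_sketch(Path(dirname) / "video.bin", 1024)
    allocated_size = allocated.stat().st_size
    try:
        allocate_storage_sketch(allocated, 1024)
        second_call_failed = False
    except FileExistsError:
        second_call_failed = True
```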