Content
Constructing content also needs some additional properties to describe the content. Before we get into the individual properties, check out the example below:
from mimetypes import guess_type

for post in api_response.json()["posts"]:
    # ... skip posts with no file details ...
    # ... construct HttpResource for image ...
    # ... construct Meta for content ...
    # ... get the MD5 checksum for image ...
    yield Content(
        id=f"4chan-{groups['board']}-{post['no']}",
        url=url.url,
        quality=1.0,
        size=post["fsize"],
        # guess_type() returns a (type, encoding) tuple; keep only the mimetype
        type=guess_type(image_url)[0],
        resources=[image_resource],
        meta=meta,
        checksums=[image_checksum],
        extra=post,
    )
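The example above leans on mimetypes.guess_type() for the type property. That function returns a (type, encoding) tuple, so the mimetype string is its first element. The sketch below shows the shape of the return value; the URLs and the "application/octet-stream" fallback are made up for illustration and are not something the framework prescribes.

```python
from mimetypes import guess_type

# Hypothetical URLs, used only to show the shape of the return value.
image_url = "https://i.4cdn.org/g/1234567890123.jpg"
archive_url = "https://example.com/files/backup.tar.gz"

# guess_type() returns a (type, encoding) tuple, not a bare string.
image_guess = guess_type(image_url)
archive_guess = guess_type(archive_url)

# The mimetype itself is the first element of the tuple.
image_type = image_guess[0]

# Unknown extensions yield (None, None), so a fallback may be needed.
# The fallback value here is an assumption, not a framework requirement.
unknown_url = "https://example.com/files/blob.zzzunknown"
unknown_type = guess_type(unknown_url)[0] or "application/octet-stream"
```

If the resource URL has no usable file extension, the guess comes back empty and the mimetype has to be supplied some other way.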
id
Is a unique identifier for the content regardless of the quality. There can be multiple content entries that use the same id but of different qualities. For example, an image may have a thumbnail. Both the image and the thumbnail represent the same remote content, so their ids are the same. However, their qualities are different.
url
Is the string source URL where the content was extracted from. In almost all cases this should just be the URL provided to the extract_content() method.
quality
Is a floating point number that represents the quality of the content in relation to other content using the same id. The higher the number, the better the quality of the content. Note that this number is relative to other content qualities. For example, the source image may use a quality of 1.0 and the thumbnail for that same image may be 0.0.
size
Is the size (in bytes) that will be taken up on the local file system if all resources are downloaded. There is some flexibility with this value, but try to get as close to the actual value as possible.
type
Is the mimetype of the content being fetched. This can usually be determined by the mimetypes.guess_type() function when given the resource URL; the function returns a (type, encoding) tuple, so the mimetype is the first element. However, you may need to construct this yourself depending on the type of resource the content uses.
resources
Is a list of Resource instances to use to download the remote content to the local file system. For each resource defined in this list, a request will be made and the response will be downloaded and bundled as an artifact in a manifest. This means that the number of resources provided in the content should equal the number of artifacts downloaded in the manifest.
meta
Is a Meta instance containing metadata taken from the media content host. This includes various descriptive information about the content that is not vital to downloading the content.
checksums
Is a list of Checksum instances used to verify the fetched content.
extra
Is a dictionary of miscellaneous data that can be used to store whatever data you might want. Keep it reasonable though.
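To make the relationship between id and quality concrete, here is a minimal sketch of how entries sharing an id could be compared by quality. ContentStub and best_by_id are hypothetical names invented for this illustration; this is not the framework's own Content class or its selection logic, which is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ContentStub:
    """Hypothetical stand-in carrying only the fields relevant here."""

    id: str
    quality: float
    size: int


def best_by_id(entries: Iterable[ContentStub]) -> Dict[str, ContentStub]:
    """Return the highest-quality entry seen for each id."""
    best: Dict[str, ContentStub] = {}
    for entry in entries:
        current = best.get(entry.id)
        if current is None or entry.quality > current.quality:
            best[entry.id] = entry
    return best


entries = [
    ContentStub(id="4chan-g-100", quality=0.0, size=4_096),      # thumbnail
    ContentStub(id="4chan-g-100", quality=1.0, size=1_048_576),  # source image
    ContentStub(id="4chan-g-101", quality=1.0, size=524_288),
]

best = best_by_id(entries)
```

Because quality is only meaningful relative to other entries with the same id, the comparison is always made within an id group and never across groups.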
So we end up with an implementation of extract_content() that looks like the following:
import re
from base64 import b64decode
from datetime import datetime
from mimetypes import guess_type

# BasePlugin, Url, http_session, HttpResource, HttpMethod, Meta, Checksum,
# HashType, and Content are provided by the framework (imports not shown here).


class ThreadPlugin(BasePlugin):

    name = "4chan Thread"
    domains = {"boards.4chan.org", "boards.4channel.org"}
    pattern = re.compile(r"^https?:\/\/(?:(?:www|boards)\.)?4chan(?:nel)?\.org\/(?P<board>\w+)\/thread\/(?P<thread>\d+)")

    def can_handle(self, url: Url) -> bool:
        return self.pattern.match(url.url) is not None

    def extract_content(self, url):
        match = self.pattern.match(url.url)
        if not match:
            raise ValueError(f"Failed to match url {url.url}")

        with http_session() as session:
            groups = match.groupdict()
            api_response = session.get(f"https://a.4cdn.org/{groups['board']}/thread/{groups['thread']}.json")
            if api_response.status_code != 200:
                raise ValueError(f"Failed to fetch API details for 4chan board {groups['board']} thread {groups['thread']}")

            for post in api_response.json()["posts"]:
                # skip posts with no file details
                if post.get("filename") is None or post.get("ext") is None:
                    continue

                # construct HttpResource for image
                image_url = f"https://i.4cdn.org/{groups['board']}/{post['tim']}{post['ext']}"
                image_resource = HttpResource(method=HttpMethod.GET, url=image_url)

                # construct Meta for content
                meta = Meta(
                    id=str(post["no"]),
                    description=post.get("com"),
                    publisher=post.get("name"),
                    published_at=(datetime.fromtimestamp(post["time"]) if "time" in post else None),
                    filename=post.get("filename"),
                    thumbnail=f"https://i.4cdn.org/{groups['board']}/{post['tim']}s.jpg",
                )

                # get the MD5 checksum for image (the API value is base64, so convert to hex)
                image_checksum = Checksum(type=HashType.MD5, hash=b64decode(post["md5"]).hex())

                yield Content(
                    id=f"4chan-{groups['board']}-{post['no']}",
                    url=url.url,
                    quality=1.0,
                    size=post["fsize"],
                    # guess_type() returns a (type, encoding) tuple; keep only the mimetype
                    type=guess_type(image_url)[0],
                    resources=[image_resource],
                    meta=meta,
                    checksums=[image_checksum],
                    extra=post,
                )
Now that we have yielded the fully constructed content, the rest of the framework can take it from there.
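The checksum step in the implementation above decodes the API's base64 MD5 value into a hex digest. The sketch below illustrates that conversion and one way a hex digest could be compared against downloaded bytes. The payload and the matches_md5 helper are made up for this example; how the framework itself performs verification is not shown here.

```python
import hashlib
from base64 import b64decode, b64encode

# Stand-in for downloaded image bytes (made up for this sketch).
payload = b"example image bytes"

# Simulate an API-style "md5" field: the raw MD5 digest, base64 encoded.
api_md5 = b64encode(hashlib.md5(payload).digest()).decode("ascii")

# Same conversion as in extract_content(): base64 -> raw bytes -> hex string.
expected_hex = b64decode(api_md5).hex()


def matches_md5(data: bytes, hex_digest: str) -> bool:
    """Hypothetical helper: compare the MD5 of data against a hex digest."""
    return hashlib.md5(data).hexdigest() == hex_digest


is_valid = matches_md5(payload, expected_hex)
is_tampered_valid = matches_md5(payload + b"!", expected_hex)
```

Storing the digest in hex keeps it directly comparable with the output of hashlib's hexdigest(), which is why the conversion happens once when the Checksum is built.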