ingress.w3c¶
-
class
bigbang.ingress.w3c.
W3CMailList
(name: str, source: Union[List[str], str], msgs: List[mailbox.mboxMessage])¶ Bases:
bigbang.ingress.abstract.AbstractMailList
This class handles the scraping of all public emails contained in a single mailing list in the hypermail format. More precisely, each contributor to a mailing list sends their message to an email address of the form <mailing_list_name>@w3.org. This class therefore contains all emails sent to a specific <mailing_list_name> (the email local part, such as "public-abcg" or "public-accesslearn-contrib").
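The address structure described above can be illustrated with a minimal sketch. The helper `split_list_address` is hypothetical and not part of bigbang; it only shows how a list address separates into the local part (the mailing list name) and the domain.

```python
def split_list_address(address):
    """Split 'local@domain' into (local part, domain).

    Hypothetical helper for illustration only; not a bigbang function.
    """
    localpart, sep, domain = address.rpartition("@")
    if not sep or not localpart or not domain:
        raise ValueError(f"not a valid address: {address!r}")
    return localpart, domain


if __name__ == "__main__":
    # The local part corresponds to <mailing_list_name> in the text above.
    print(split_list_address("public-abcg@w3.org"))
```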
- Parameters
name (The name of the list (e.g. public-2018-permissions-ws, ..)) –
source (Contains the information of the location of the mailing list.) – It can be either a URL where the list is hosted or a path to the file(s).
msgs (List of mboxMessage objects) –
Example
To scrape a W3C mailing list from a URL and store it in run-time memory, we do the following:
>>> mlist = W3CMailList.from_url(
...     name="public-bigdata",
...     url="https://lists.w3.org/Archives/Public/public-bigdata/",
...     select={
...         "years": 2015,
...         "months": "August",
...         "fields": "header",
...     },
... )
To save it as a *.mbox file, we do the following:
>>> mlist.to_mbox(path_to_file)
-
classmethod
from_mbox
(name: str, filepath: str) → bigbang.ingress.w3c.W3CMailList¶ Docstring in AbstractMailList.
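As background on the file format involved, the following sketch writes and re-reads an mbox file using only Python's standard `mailbox` module. It does not call bigbang; the function names, the placeholder sender, and the message contents are assumptions made for illustration.

```python
import mailbox
import os
import tempfile


def write_mbox(path, subjects):
    """Write one minimal message per subject into an mbox file at `path`."""
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        for subject in subjects:
            msg = mailbox.mboxMessage()
            msg["From"] = "someone@example.org"  # placeholder sender
            msg["Subject"] = subject
            msg.set_payload("placeholder body\n")
            box.add(msg)
    finally:
        box.close()  # flushes pending messages to disk


def read_subjects(path):
    """Return the Subject header of every message stored in the mbox file."""
    box = mailbox.mbox(path)
    try:
        return [msg["Subject"] for msg in box]
    finally:
        box.close()


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.mbox")
    write_mbox(demo_path, ["first", "second"])
    print(read_subjects(demo_path))
```

A file written by `to_mbox` should be readable in the same way, assuming it follows the standard mbox layout.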
-
classmethod
from_messages
(name: str, url: str, messages: Union[List[str], List[mailbox.mboxMessage]], fields: str = 'total') → bigbang.ingress.w3c.W3CMailList¶ Docstring in AbstractMailList.
-
classmethod
from_url
(name: str, url: str, select: Optional[dict] = {'fields': 'total'}) → bigbang.ingress.w3c.W3CMailList¶ Docstring in AbstractMailList.
-
static
get_all_periods_and_their_urls
(url: str) → Tuple[List[str], List[str]]¶ W3C groups messages into monthly time bundles. This method obtains all the URLs that lead to the messages of each time bundle.
- Returns
Returns a tuple of two lists that look like
(['April 2017', 'January 2001', …], ['url1', 'url2', …])
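Because the two returned lists are parallel, they can be combined into a lookup table. The sketch below uses placeholder values shaped like the return value shown above; `pair_periods_with_urls` is a hypothetical helper, not a bigbang function.

```python
def pair_periods_with_urls(periods, urls):
    """Map each period label to its URL, assuming the lists are parallel."""
    if len(periods) != len(urls):
        raise ValueError("periods and urls must have the same length")
    return dict(zip(periods, urls))


if __name__ == "__main__":
    # Placeholder data in the shape of the documented return value.
    periods = ["April 2017", "January 2001"]
    urls = ["url1", "url2"]
    print(pair_periods_with_urls(periods, urls))
```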
-
classmethod
get_message_urls
(name: str, url: str, select: Optional[dict] = None) → List[str]¶ Docstring in AbstractMailList.
-
classmethod
get_messages_urls
(name: str, url: str) → List[str]¶ - Parameters
name (Name of the W3C mailing list.) –
url (URL to group of messages that are within the same period.) –
- Returns
- Return type
List of URLs from which mboxMessage can be initialized.
-
static
get_name_from_url
(url: str) → str¶ Get name of mailing list.
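One plausible way to derive a list name from an archive URL is to take the last non-empty path segment. This is only a sketch under that assumption; the actual implementation of `get_name_from_url` may differ, and `list_name_from_archive_url` is a hypothetical name.

```python
from urllib.parse import urlparse


def list_name_from_archive_url(url):
    """Return the last non-empty path segment of `url`.

    Assumption: the archive URL ends with the mailing list name.
    """
    segments = [part for part in urlparse(url).path.split("/") if part]
    if not segments:
        raise ValueError(f"URL has no path segments: {url!r}")
    return segments[-1]


if __name__ == "__main__":
    print(list_name_from_archive_url(
        "https://lists.w3.org/Archives/Public/public-bigdata/"
    ))
```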
-
classmethod
get_period_urls
(url: str, select: Optional[dict] = None) → List[str]¶ All messages within a certain period (e.g. January 2021).
- Parameters
url (URL to the W3C list.) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year and month
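To make the period criterion concrete, the sketch below filters period labels such as "August 2015" with a `select` dictionary that uses the "years" and "months" keys from the examples in this module. The matching rules are assumptions for illustration; bigbang's own selection logic may accept other value formats, and `period_matches` is a hypothetical helper.

```python
def _as_string_set(value):
    """Normalise a scalar or a collection into a set of strings."""
    if value is None:
        return None
    if isinstance(value, (list, tuple, set)):
        return {str(item) for item in value}
    return {str(value)}


def period_matches(label, select=None):
    """Check a 'Month Year' label against optional 'years'/'months' criteria."""
    if not select:
        return True  # no criteria means every period is kept
    month, year = label.split()
    years = _as_string_set(select.get("years"))
    months = _as_string_set(select.get("months"))
    if years is not None and year not in years:
        return False
    if months is not None and month not in months:
        return False
    return True


if __name__ == "__main__":
    select = {"years": 2015, "months": "August", "fields": "header"}
    labels = ["August 2015", "April 2017", "January 2001"]
    print([label for label in labels if period_matches(label, select)])
```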
-
class
bigbang.ingress.w3c.
W3CMailListDomain
(name: str, url: str, lists: List[Union[bigbang.ingress.abstract.AbstractMailList, str]])¶ Bases:
bigbang.ingress.abstract.AbstractMailListDomain
This class handles the scraping of all public emails contained in a mail list domain that has the hypermail format, such as W3C. More precisely, each contributor to a mail list domain sends their message to an email address of the form <mailing_list_name>@w3.org. This class therefore contains all emails sent to <mail_list_domain_name> (the email domain name). These emails are held in a list of W3CMailList objects, so that it is known to which <mailing_list_name> (the email local part) each one was sent.
- Parameters
name (The name of the mailing list domain.) –
url (The URL where the mailing list domain lives) –
lists (A list containing the mailing lists as W3CMailList types) –
-
All methods in the `AbstractMailListDomain` class.
Example
To scrape a W3C mailing list domain from a URL and store it in run-time memory, we do the following:
>>> mlistdom = W3CMailListDomain.from_url(
...     name="W3C",
...     url_root="https://lists.w3.org/Archives/Public/",
...     select={
...         "years": 2015,
...         "months": "November",
...         "weeks": 4,
...         "fields": "header",
...     },
...     instant_save=False,
...     only_mlist_urls=False,
... )
To save it in the *.mbox format, we do the following:
>>> mlistdom.to_mbox(path_to_directory)
-
classmethod
from_mailing_lists
(name: str, url_root: str, url_mailing_lists: Union[List[str], List[bigbang.ingress.w3c.W3CMailList]], select: Optional[dict] = {'fields': 'total'}, only_mlist_urls: bool = True, instant_save: Optional[bool] = True) → bigbang.ingress.w3c.W3CMailListDomain¶ Docstring in AbstractMailListDomain.
-
classmethod
from_mbox
(name: str, directorypath: str, filedsc: str = '*.mbox') → bigbang.ingress.w3c.W3CMailListDomain¶ Docstring in AbstractMailListDomain.
-
classmethod
from_url
(name: str, url_root: str, url_home: Optional[str] = None, select: Optional[dict] = {'fields': 'total'}, instant_save: bool = True, only_mlist_urls: bool = True) → bigbang.ingress.w3c.W3CMailListDomain¶ Docstring in AbstractMailListDomain.
-
static
get_lists_from_url
(name: str, select: dict, url_root: str, url_home: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → List[Union[bigbang.ingress.w3c.W3CMailList, str]]¶ Docstring in AbstractMailListDomain.
-
exception
bigbang.ingress.w3c.
W3CMailListDomainWarning
¶ Bases:
BaseException
Base class for W3CMailListDomain class-specific exceptions.
-
exception
bigbang.ingress.w3c.
W3CMailListWarning
¶ Bases:
BaseException
Base class for W3CMailList class-specific exceptions.
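Since the warning classes above derive from BaseException rather than Exception, a plain `except Exception` handler does not catch them. The sketch below demonstrates this Python behaviour with a hypothetical stand-in class, `DemoListWarning`, so that it runs without bigbang installed.

```python
class DemoListWarning(BaseException):
    """Hypothetical stand-in for a class that derives from BaseException."""


def classify(exc):
    """Raise `exc` and report which handler catches it."""
    try:
        raise exc
    except Exception:
        return "caught by Exception"
    except BaseException:
        return "caught only by BaseException"


if __name__ == "__main__":
    print(classify(ValueError("ordinary error")))
    print(classify(DemoListWarning("list-specific problem")))
```

Callers that want to handle these classes therefore need to name them explicitly or catch BaseException.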
-
class
bigbang.ingress.w3c.
W3CMessageParser
(website=False, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None)¶ Bases:
bigbang.ingress.abstract.AbstractMessageParser
,email.parser.Parser
This class handles the creation of a mailbox.mboxMessage object (using the from_*() methods) and its storage in various other file formats (using the to_*() methods) that can be saved locally.
- Parameters
website (Set 'True' if messages are going to be scraped from websites,) – otherwise ‘False’ if read from local memory. This distinction needs to be made if missing messages should be added.
url_pref (URL to the 'Preferences'/settings page.) –
Example
To create an email message parser object, use the following syntax:
>>> msg_parser = W3CMessageParser(website=True)
To obtain the email message content and return it as an mboxMessage object, do the following:
>>> msg = msg_parser.from_url(
...     list_name="public-2018-permissions-ws",
...     url="https://lists.w3.org/Archives/Public/public-2018-permissions-ws/2019May/0000.html",
...     fields="total",
... )
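For readers unfamiliar with the object type involved, the sketch below builds a mailbox.mboxMessage from raw message text using only the standard library (`email.parser.Parser`, which is also one of this class's bases). It does not use W3CMessageParser itself; the raw text and the function name `text_to_mbox_message` are placeholders.

```python
import mailbox
from email.parser import Parser

# Placeholder message text: headers, a blank line, then the body.
RAW_MESSAGE = (
    "From: someone@example.org\n"
    "Subject: Test message\n"
    "\n"
    "Body text\n"
)


def text_to_mbox_message(raw_text):
    """Parse raw message text and wrap the result as an mboxMessage."""
    parsed = Parser().parsestr(raw_text)
    return mailbox.mboxMessage(parsed)


if __name__ == "__main__":
    msg = text_to_mbox_message(RAW_MESSAGE)
    print(msg["Subject"])
    print(msg.get_payload())
```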
-
empty_header
= {}¶
-
exception
bigbang.ingress.w3c.
W3CMessageParserWarning
¶ Bases:
BaseException
Base class for W3CMessageParser class-specific exceptions.
-
bigbang.ingress.w3c.
parse_dfn_header
(header_text)¶
-
bigbang.ingress.w3c.
text_for_selector
(soup: bs4.BeautifulSoup, selector: str)¶ Filter out the header or body field from a website and return it as a UTF-8 string.