ingress.w3c

class bigbang.ingress.w3c.W3CMailList(name: str, source: Union[List[str], str], msgs: List[mailbox.mboxMessage])

Bases: bigbang.ingress.abstract.AbstractMailList

This class handles the scraping of all public emails contained in a single mailing list in the hypermail format. More precisely, each contributor to a mailing list sends their message to an email address of the form <mailing_list_name>@w3.org. This class therefore contains all emails sent to a specific <mailing_list_name> (the email localpart, such as “public-abcg” or “public-accesslearn-contrib”).

Parameters
  • name – The name of the list (e.g. public-2018-permissions-ws).

  • source – The location of the mailing list. It can be either a URL where the list is hosted or a path to the file(s).

  • msgs – List of mboxMessage objects.

Example

To scrape a W3C mailing list from a URL and store it in run-time memory, we do the following:

>>> mlist = W3CMailList.from_url(
...     name="public-bigdata",
...     url="https://lists.w3.org/Archives/Public/public-bigdata/",
...     select={
...         "years": 2015,
...         "months": "August",
...         "fields": "header",
...     },
... )

To save it as a *.mbox file, we do the following:

>>> mlist.to_mbox(path_to_file)
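
A saved *.mbox file can be inspected with Python's standard `mailbox` module, independently of bigbang. The following is a minimal sketch under stated assumptions: the helper `read_subjects`, the file name, and the sample message are all made up for illustration, and the sketch writes its own small mbox file so that it runs without any scraping.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage
from typing import List


def read_subjects(path: str) -> List[str]:
    """Return the Subject header of every message in an mbox file.

    Hypothetical helper for illustration; not part of bigbang.
    """
    box = mailbox.mbox(path)
    try:
        return [msg["Subject"] for msg in box]
    finally:
        box.close()


# Build a throwaway mbox file containing one made-up message.
path = os.path.join(tempfile.mkdtemp(), "public-bigdata.mbox")
sample = EmailMessage()
sample["From"] = "alice@example.org"
sample["To"] = "public-bigdata@w3.org"
sample["Subject"] = "Hello"
sample.set_content("Test body")

box = mailbox.mbox(path)
box.add(sample)
box.flush()
box.close()

# Read the file back and list the subjects it contains.
subjects = read_subjects(path)
print(subjects)
```

The same `read_subjects` call should work on a file written by `to_mbox`, provided that file is a regular mbox archive.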

classmethod from_mbox(name: str, filepath: str) → bigbang.ingress.w3c.W3CMailList

Docstring in AbstractMailList.

classmethod from_messages(name: str, url: str, messages: Union[List[str], List[mailbox.mboxMessage]], fields: str = 'total') → bigbang.ingress.w3c.W3CMailList

Docstring in AbstractMailList.

classmethod from_url(name: str, url: str, select: Optional[dict] = {'fields': 'total'}) → bigbang.ingress.w3c.W3CMailList

Docstring in AbstractMailList.

static get_all_periods_and_their_urls(url: str) → Tuple[List[str], List[str]]

W3C groups messages into monthly time bundles. This method obtains all the URLs that lead to the messages of each time bundle.

Returns

  A tuple of two lists, the period names and the corresponding URLs, that looks like ([‘April 2017’, ‘January 2001’, …], [‘url1’, ‘url2’, …]).
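
The shape of this return value can be illustrated with a small standard-library sketch. It does not reproduce bigbang's implementation: the class `PeriodLinkParser`, the function `periods_and_urls`, and the HTML snippet are hypothetical, and the sketch assumes that each monthly bundle appears on the index page as a link whose text is a month name followed by a year.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


class PeriodLinkParser(HTMLParser):
    """Collect links whose text looks like '<Month> <Year>' (assumed layout)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.periods: List[str] = []
        self.urls: List[str] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join("".join(self._text).split())
            parts = label.split()
            # Keep only links labelled with a month name and a numeric year.
            if len(parts) == 2 and parts[0] in MONTHS and parts[1].isdigit():
                self.periods.append(label)
                self.urls.append(urljoin(self.base_url, self._href))
            self._href = None


def periods_and_urls(html: str, base_url: str) -> Tuple[List[str], List[str]]:
    parser = PeriodLinkParser(base_url)
    parser.feed(html)
    return parser.periods, parser.urls


# Made-up index page fragment, used only to exercise the sketch.
SAMPLE_HTML = """
<table>
  <tr><td><a href="2017Apr/">April 2017</a></td></tr>
  <tr><td><a href="2001Jan/">January 2001</a></td></tr>
  <tr><td><a href="/Help/">Help</a></td></tr>
</table>
"""
periods, urls = periods_and_urls(
    SAMPLE_HTML, "https://lists.w3.org/Archives/Public/public-bigdata/"
)
print(periods)
print(urls)
```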

classmethod get_message_urls(name: str, url: str, select: Optional[dict] = None) → List[str]

Docstring in AbstractMailList.

classmethod get_messages_urls(name: str, url: str) → List[str]

Parameters
  • name – Name of the W3C mailing list.

  • url – URL to a group of messages that are within the same period.

Returns

  List of URLs from which mboxMessage can be initialized.

static get_name_from_url(url: str) → str

Get name of mailing list.
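
As an illustration of the idea, and not of bigbang's actual implementation, the sketch below assumes that the list name is the last non-empty path segment of the archive URL, as in the `public-bigdata` example above. The function `list_name_from_url` is hypothetical.

```python
from urllib.parse import urlparse


def list_name_from_url(url: str) -> str:
    """Return the last non-empty path segment of a URL.

    Hypothetical helper; assumes the archive URL ends in the list name.
    """
    segments = [seg for seg in urlparse(url).path.split("/") if seg]
    if not segments:
        raise ValueError(f"No path segment found in URL: {url!r}")
    return segments[-1]


name = list_name_from_url(
    "https://lists.w3.org/Archives/Public/public-bigdata/"
)
print(name)
```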

classmethod get_period_urls(url: str, select: Optional[dict] = None) → List[str]

Get the URLs of all periods (e.g. January 2021) whose messages match the selection criteria.

Parameters
  • url – URL to the W3C list.

  • select – Selection criteria that can filter messages by:

    • content, i.e. header and/or body

    • period, i.e. written in a certain year and month
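
The period part of such a selection can be sketched as a plain filter over period labels. This is a hedged illustration only: the function `filter_periods` and the helper `_as_str_set` are hypothetical, the sketch assumes labels of the form "<Month> <Year>", and it handles only the `years` and `months` keys seen in the examples, each given as a single value or a list. The value forms accepted by bigbang itself may differ.

```python
from typing import List, Optional, Set


def _as_str_set(value) -> Optional[Set[str]]:
    """Normalise a single value or a collection into a set of strings."""
    if value is None:
        return None
    if isinstance(value, (list, tuple, set)):
        return {str(v) for v in value}
    return {str(value)}


def filter_periods(
    periods: List[str], urls: List[str], select: Optional[dict] = None
) -> List[str]:
    """Keep the URLs whose period label matches the selection.

    Hypothetical helper; the 'fields' key is ignored here because it
    concerns message content, not the period.
    """
    select = select or {}
    years = _as_str_set(select.get("years"))
    months = _as_str_set(select.get("months"))
    kept = []
    for period, url in zip(periods, urls):
        month, year = period.split()
        if years is not None and year not in years:
            continue
        if months is not None and month not in months:
            continue
        kept.append(url)
    return kept


# Made-up labels and placeholder URLs.
periods = ["August 2015", "July 2015", "August 2016"]
urls = ["url1", "url2", "url3"]
selected = filter_periods(
    periods, urls, {"years": 2015, "months": "August", "fields": "header"}
)
print(selected)
```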

class bigbang.ingress.w3c.W3CMailListDomain(name: str, url: str, lists: List[Union[bigbang.ingress.abstract.AbstractMailList, str]])

Bases: bigbang.ingress.abstract.AbstractMailListDomain

This class handles the scraping of all public emails contained in a mailing list domain that has the hypermail format, such as W3C. More precisely, each contributor to a mailing list domain sends their message to an email address of the form <mailing_list_name>@w3.org. This class therefore contains all emails sent to <mail_list_domain_name> (the email domain name). These emails are held in a list of W3CMailList objects, so that it is known to which <mailing_list_name> (the email localpart) each one was sent.

Parameters
  • name – The name of the mailing list domain.

  • url – The URL where the mailing list domain lives.

  • lists – A list containing the mailing lists as W3CMailList objects.

All methods of the `AbstractMailListDomain` class are available.

Example

To scrape a W3C mailing list domain from a URL and store it in run-time memory, we do the following:

>>> mlistdom = W3CMailListDomain.from_url(
...     name="W3C",
...     url_root="https://lists.w3.org/Archives/Public/",
...     select={
...         "years": 2015,
...         "months": "November",
...         "weeks": 4,
...         "fields": "header",
...     },
...     instant_save=False,
...     only_mlist_urls=False,
... )

To save it as *.mbox files, we do the following:

>>> mlistdom.to_mbox(path_to_directory)

classmethod from_mailing_lists(name: str, url_root: str, url_mailing_lists: Union[List[str], List[bigbang.ingress.w3c.W3CMailList]], select: Optional[dict] = {'fields': 'total'}, only_mlist_urls: bool = True, instant_save: Optional[bool] = True) → bigbang.ingress.w3c.W3CMailListDomain

Docstring in AbstractMailListDomain.

classmethod from_mbox(name: str, directorypath: str, filedsc: str = '*.mbox') → bigbang.ingress.w3c.W3CMailListDomain

Docstring in AbstractMailListDomain.

classmethod from_url(name: str, url_root: str, url_home: Optional[str] = None, select: Optional[dict] = {'fields': 'total'}, instant_save: bool = True, only_mlist_urls: bool = True) → bigbang.ingress.w3c.W3CMailListDomain

Docstring in AbstractMailListDomain.

static get_lists_from_url(name: str, select: dict, url_root: str, url_home: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → List[Union[bigbang.ingress.w3c.W3CMailList, str]]

Docstring in AbstractMailListDomain.

exception bigbang.ingress.w3c.W3CMailListDomainWarning

Bases: BaseException

Base class for W3CMailListDomain class specific exceptions

exception bigbang.ingress.w3c.W3CMailListWarning

Bases: BaseException

Base class for W3CMailList class specific exceptions

class bigbang.ingress.w3c.W3CMessageParser(website=False, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None)

Bases: bigbang.ingress.abstract.AbstractMessageParser, email.parser.Parser

This class handles the creation of a mailbox.mboxMessage object (using the from_*() methods) and its storage in various other file formats (using the to_*() methods) that can be saved locally.

Parameters
  • website – Set to True if messages are going to be scraped from websites, or to False if they are read from local memory. This distinction needs to be made if missing messages should be added.

  • url_pref – URL to the 'Preferences'/settings page.

Example

To create an email message parser object, use the following syntax:

>>> msg_parser = W3CMessageParser(website=True)

To obtain the email message content and return it as an mboxMessage object, do the following:

>>> msg = msg_parser.from_url(
...     list_name="public-2018-permissions-ws",
...     url="https://lists.w3.org/Archives/Public/public-2018-permissions-ws/2019May/0000.html",
...     fields="total",
... )
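
For orientation, the kind of object returned here can be built with the standard library alone. The sketch below is not what W3CMessageParser does internally; it only shows, under the assumption of a made-up raw message, how `email.parser.Parser` (one of the documented base classes) turns text into a message that `mailbox.mboxMessage` can wrap.

```python
import mailbox
from email.parser import Parser

# Made-up raw message text; headers and body are separated by a blank line.
raw = (
    "From: alice@example.org\n"
    "To: public-2018-permissions-ws@w3.org\n"
    "Subject: Workshop agenda\n"
    "\n"
    "Draft agenda attached.\n"
)

# Parse the text into an email.message.Message, then wrap it as mboxMessage.
parsed = Parser().parsestr(raw)
msg = mailbox.mboxMessage(parsed)

print(msg["Subject"])
print(msg.get_payload())
```

Header fields are then available by name, and the body through `get_payload()`.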

empty_header = {}

exception bigbang.ingress.w3c.W3CMessageParserWarning

Bases: BaseException

Base class for W3CMessageParser class specific exceptions

bigbang.ingress.w3c.parse_dfn_header(header_text)

bigbang.ingress.w3c.text_for_selector(soup: bs4.BeautifulSoup, selector: str)

Filter out the header or body field from the website and return it as a UTF-8 string.