ingress.abstract

class bigbang.ingress.abstract.AbstractMailList(name: str, source: Union[List[str], str], msgs: List[mailbox.mboxMessage])

Bases: abc.ABC

This class handles the scraping of all public Emails contained in a single mailing list. To be more precise, each contributor to a mailing list sends their message to an Email address that has the following structure: <mailing_list_name>@<mail_list_domain_name>. Thus, this class contains all Emails sent to a specific <mailing_list_name> (the Email localpart).
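The address structure described above can be illustrated with a minimal sketch. The helper name `split_list_address` and the example address are assumptions made for illustration; they are not part of bigbang.

```python
from typing import Tuple


def split_list_address(address: str) -> Tuple[str, str]:
    """Split '<mailing_list_name>@<mail_list_domain_name>' into its two parts.

    Hypothetical helper, not part of bigbang.
    """
    # rpartition splits on the last '@', so the domain never contains one.
    localpart, sep, domain = address.rpartition("@")
    if not sep or not localpart or not domain:
        raise ValueError(f"not a valid list address: {address!r}")
    return localpart, domain


# Made-up address, used only for illustration.
list_name, domain_name = split_list_address("public-bigdata@example.org")
```

In these terms, an AbstractMailList corresponds to one `list_name`, while an AbstractMailListDomain groups all lists that share a `domain_name`.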

Parameters
  • name (The name of the list (e.g. 3GPP_COMMON_IMS_XFER, IEEESCO-DIFUSION, ..)) –

  • source (Contains the information of the location of the mailing list.) – It can be either a URL where the list lives or a path to the file(s).

  • msgs (List of mboxMessage objects) –

from_url()
from_messages()
from_mbox()
get_message_urls()
get_messages_from_urls()
get_index_of_elements_in_selection()
get_name_from_url()
to_dict()
to_pandas_dataframe()
to_mbox()
__getitem__(index) → mailbox.mboxMessage

Get specific message at position index within the mailing list.

__iter__()

Iterate over each message within the mailing list.

__len__() → int

Get number of messages within the mailing list.
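The three special methods above make a mailing list behave like a Python sequence of messages. The following is a minimal sketch of that behaviour only; `SketchMailList`, `make_message`, and the example URL are hypothetical stand-ins and do not reproduce the actual bigbang implementation.

```python
import mailbox
from typing import Iterator, List


class SketchMailList:
    """Hypothetical stand-in that shows only the container behaviour."""

    def __init__(self, name: str, source: str, msgs: List[mailbox.mboxMessage]):
        self.name = name
        self.source = source
        self.messages = msgs

    def __getitem__(self, index: int) -> mailbox.mboxMessage:
        # Message at position `index` within the mailing list.
        return self.messages[index]

    def __iter__(self) -> Iterator[mailbox.mboxMessage]:
        # Iterate over each message within the mailing list.
        return iter(self.messages)

    def __len__(self) -> int:
        # Number of messages within the mailing list.
        return len(self.messages)


def make_message(subject: str) -> mailbox.mboxMessage:
    """Build a small placeholder message (illustration only)."""
    msg = mailbox.mboxMessage()
    msg["Subject"] = subject
    msg.set_payload("placeholder body")
    return msg


mlist = SketchMailList(
    name="public-bigdata",
    source="https://example.org/archives/public-bigdata",  # made-up URL
    msgs=[make_message("first"), make_message("second")],
)
subjects = [msg["Subject"] for msg in mlist]
```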

abstract classmethod from_mbox(name: str, filepath: str) → bigbang.ingress.abstract.AbstractMailList
Parameters
  • name (Name of the list of messages, e.g. '3GPP_TSG_SA_WG2_UPCON'.) –

  • filepath (Path to file in which mailing list is stored.) –

abstract classmethod from_messages(name: str, url: str, messages: Union[List[str], List[mailbox.mboxMessage]], fields: str = 'total', url_login: str = None, url_pref: str = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None) → bigbang.ingress.abstract.AbstractMailList
Parameters
  • name (Name of the list of messages, e.g. 'public-bigdata') –

  • url (URL to the Email list.) –

  • messages (Can either be a list of URLs to specific messages) – or a list of mboxMessage objects.

  • url_login (URL to the 'Log In' page.) –

  • url_pref (URL to the 'Preferences'/settings page.) –

  • login (Login credentials (username and password) that were used to set) – up AuthSession.

  • session (requests.Session() object for the Email list domain website.) –

abstract classmethod from_url(name: str, url: str, select: Optional[dict] = {'fields': 'total'}, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None) → bigbang.ingress.abstract.AbstractMailList
Parameters
  • name (Name of the mailing list.) –

  • url (URL to the mailing list.) –

  • select (Selection criteria that can filter messages by:) –

    • content, i.e. header and/or body

    • period, i.e. written in a certain year, month, week-of-month

  • url_login (URL to the 'Log In' page) –

  • url_pref (URL to the 'Preferences'/settings page) –

  • login (Login credentials (username and password) that were used to set) – up AuthSession.

  • session (requests.Session() object for the Email list domain website.) –

static get_index_of_elements_in_selection(times: List[Union[int, str]], urls: List[str], filtr: Union[tuple, list, int, str]) → List[int]

Filter out messages that were written in a specific period. Period here is a set containing units of year, month, and week-of-month which can have the following example elements:

  • years: (1992, 2010), [2000, 2008], 2021

  • months: [“January”, “July”], “November”

  • weeks: (1, 4), [1, 5], 2

Parameters
  • times (A list containing information of the period for each) – group of mboxMessage.

  • urls (Corresponding URLs of each group of mboxMessage of which the) – period info is contained in times.

  • filtr (Containing info on what should be filtered.) –

Returns

Indices of the elements in times/urls.

Return type

List[int]
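The filter forms listed above (tuple, list, single value) can be read as follows. This is a sketch under stated assumptions: a tuple is treated as an inclusive range, a list as an explicit set of allowed values, and a single int or str as an exact match. The function name `indices_in_selection` is hypothetical, and the `urls` argument of the documented method is omitted because it does not affect which indices are selected in this sketch.

```python
from typing import List, Union


def indices_in_selection(
    times: List[Union[int, str]],
    filtr: Union[tuple, list, int, str],
) -> List[int]:
    """Return the positions in `times` that match `filtr` (illustrative only)."""
    if isinstance(filtr, tuple):
        # Assumption: a tuple describes an inclusive range, e.g. (1992, 2010).
        lower, upper = min(filtr), max(filtr)

        def matches(value) -> bool:
            return lower <= value <= upper

    elif isinstance(filtr, list):
        # Assumption: a list enumerates the allowed values, e.g. [2000, 2008].
        def matches(value) -> bool:
            return value in filtr

    else:
        # A single int or str must match exactly, e.g. 2021 or "November".
        def matches(value) -> bool:
            return value == filtr

    return [idx for idx, value in enumerate(times) if matches(value)]


years = [1995, 2001, 2008, 2021]
months = ["January", "July", "November"]

in_range = indices_in_selection(years, (2000, 2010))
in_list = indices_in_selection(years, [1995, 2021])
exact = indices_in_selection(months, "November")
```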

abstract classmethod get_message_urls(name: str, url: str, select: Optional[dict] = None) → List[str]
Parameters
  • name (Name of the list of messages, e.g. 'public-bigdata') –

  • url (URL to the mailing list.) –

  • select (Selection criteria that can filter messages by:) –

    • content, i.e. header and/or body

    • period, i.e. written in a certain year and month

Returns

List of all selected URLs of the messages in the mailing list.

Return type

List[str]

static get_messages_from_urls(name: str, msg_urls: list, msg_parser, fields: Optional[str] = 'total') → List[mailbox.mboxMessage]

Generator that returns all messages within a certain period (e.g. January 2021, Week 5).

Parameters
  • name (Name of the list of messages, e.g. '3GPP_TSG_SA_WG2_UPCON') –

  • url (URL to the mailing list.) –

  • fields (Content, i.e. header and/or body) –

abstract get_name_from_url() → str
to_dict(include_body: bool = True) → Dict[str, List[str]]
Parameters

include_body (A boolean that indicates whether the message body should) – be included or not.

Returns

A Dictionary with the first key layer being the header field names and the “body” key. Each value field is a list containing the respective header field contents arranged by the order as they were scraped from the web. This format makes the conversion to a pandas.DataFrame easier.
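The dictionary layout described above (one key per header field plus a “body” key, each mapping to a list with one entry per message) can be sketched with the standard library alone. The function `messages_to_dict` is a hypothetical illustration, and filling missing header fields with None is an assumption rather than documented bigbang behaviour.

```python
import mailbox
from typing import Dict, List, Optional


def messages_to_dict(
    msgs: List[mailbox.mboxMessage],
    include_body: bool = True,
) -> Dict[str, List[Optional[str]]]:
    """Arrange messages column-wise: one list per header field (sketch)."""
    # Collect header field names in the order they are first seen.
    keys: List[str] = []
    for msg in msgs:
        for key in msg.keys():
            if key not in keys:
                keys.append(key)

    # msg[key] returns None when a message lacks that header field,
    # so every list has exactly one entry per message.
    table: Dict[str, List[Optional[str]]] = {
        key: [msg[key] for msg in msgs] for key in keys
    }
    if include_body:
        table["body"] = [msg.get_payload() for msg in msgs]
    return table


first = mailbox.mboxMessage()
first["Subject"] = "a"
first["From"] = "x@example.org"
first.set_payload("hello")

second = mailbox.mboxMessage()
second["Subject"] = "b"
second.set_payload("world")

table = messages_to_dict([first, second])
```

Because all lists have the same length, a dictionary of this shape can be passed directly to the pandas.DataFrame constructor.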

to_mbox(dir_out: str, filename: Optional[str] = None)

Save mailing list to .mbox files.
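Writing messages to an .mbox file can be done with the standard-library mailbox module. The sketch below assumes that the output file is named `<filename>.mbox` inside `dir_out`; the helper `save_messages_to_mbox` is hypothetical and is not the bigbang implementation.

```python
import mailbox
import os
import tempfile
from typing import Iterable


def save_messages_to_mbox(
    msgs: Iterable[mailbox.mboxMessage], dir_out: str, filename: str
) -> str:
    """Append all messages to '<dir_out>/<filename>.mbox' and return its path."""
    os.makedirs(dir_out, exist_ok=True)
    filepath = os.path.join(dir_out, filename + ".mbox")
    box = mailbox.mbox(filepath)  # creates the file if it does not exist yet
    try:
        for msg in msgs:
            box.add(msg)
        box.flush()  # write pending changes to disk
    finally:
        box.close()
    return filepath


example = mailbox.mboxMessage()
example["Subject"] = "first"
example.set_payload("placeholder body")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_messages_to_mbox([example], tmp_dir, "public-bigdata")
    # Read the file back to check that the message was stored.
    stored = mailbox.mbox(path)
    n_stored = len(stored)
    stored_subjects = [msg["Subject"] for msg in stored]
    stored.close()
```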

to_pandas_dataframe(include_body: bool = True) → pandas.core.frame.DataFrame
Parameters

include_body (A boolean that indicates whether the message body should) – be included or not.

Returns

A pandas.DataFrame object in which each row represents an Email.

class bigbang.ingress.abstract.AbstractMailListDomain(name: str, url: str, lists: List[Union[bigbang.ingress.abstract.AbstractMailList, str]])

Bases: abc.ABC

This class handles the scraping of all public Emails contained in a mail list domain. To be more precise, each contributor to a mailing archive sends their message to an Email address that has the following structure: <mailing_list_name>@<mail_list_domain_name>. Thus, this class contains all Emails sent to <mail_list_domain_name> (the Email domain name). These Emails are contained in a list of AbstractMailList types, such that it is known to which <mailing_list_name> (the Email localpart) each was sent.

Parameters
  • name (The mail list domain name (e.g. 3GPP, IEEE, W3C)) –

  • url (The URL where the archive lives) –

  • lists (A list containing the mailing lists as AbstractMailList types) –

from_url()
from_mailing_lists()
from_mbox()
get_lists_from_url()
to_dict()
to_pandas_dataframe()
to_mbox()
__getitem__(index)

Get specific mailing list at position index from the mail list domain.

__iter__()

Iterate over each mailing list within the mail list domain.

__len__()

Get number of mailing lists within the mail list domain.

abstract classmethod from_mailing_lists(name: str, url_root: str, url_mailing_lists: Union[List[str], List[bigbang.ingress.abstract.AbstractMailList]], select: Optional[dict] = None, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None, only_mlist_urls: bool = True, instant_save: Optional[bool] = True) → bigbang.ingress.abstract.AbstractMailListDomain

Create a mail list domain from a given list of AbstractMailList instances or URLs pointing to mailing lists.

Parameters
  • name (mail list domain name, such that multiple instances of) – AbstractMailListDomain can easily be distinguished.

  • url_root (The invariant root URL that does not change no matter what) – part of the mail list domain we access.

  • url_mailing_lists (This argument can either be a list of AbstractMailList) – objects or a list of strings containing the URLs to the mailing lists of interest.

  • url_login (URL to the 'Log In' page.) –

  • url_pref (URL to the 'Preferences'/settings page.) –

  • login (Login credentials (username and password) that were used to set) – up AuthSession.

  • session (requests.Session() object for the mail list domain website.) –

  • only_mlist_urls (Boolean giving the choice to collect only mailing list) – URLs or also their contents.

  • instant_save (Boolean giving the choice to save an AbstractMailList as) – soon as it is completely scraped or collect the entire mail list domain. The former is recommended if a large number of mailing lists are scraped, which can require a lot of memory and time.

abstract classmethod from_mbox(name: str, directorypath: str, filedsc: str = '*.mbox') → bigbang.ingress.abstract.AbstractMailListDomain
Parameters
  • name (mail list domain name, such that multiple instances of) – AbstractMailListDomain can easily be distinguished.

  • directorypath (Path to the folder in which AbstractMailListDomain is stored.) –

  • filedsc (Optional filter that only reads files matching the description.) – By default all files with an mbox extension are read.
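The default `filedsc` value `'*.mbox'` is a glob-style pattern. How such a pattern selects files in a directory can be sketched with pathlib; the helper `find_mbox_files` and the file names below are hypothetical, and whether bigbang uses pathlib or another mechanism internally is not stated here.

```python
import tempfile
from pathlib import Path
from typing import List


def find_mbox_files(directorypath: str, filedsc: str = "*.mbox") -> List[Path]:
    """Return all files in `directorypath` whose names match `filedsc`, sorted."""
    return sorted(Path(directorypath).glob(filedsc))


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few empty placeholder files; only the .mbox ones should match.
    for file_name in ("a.mbox", "b.mbox", "notes.txt"):
        Path(tmp_dir, file_name).touch()
    found = [path.name for path in find_mbox_files(tmp_dir)]
    only_a = [path.name for path in find_mbox_files(tmp_dir, "a*.mbox")]
```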

abstract classmethod from_url(name: str, url_root: str, url_home: Optional[str] = None, select: Optional[dict] = None, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → bigbang.ingress.abstract.AbstractMailListDomain

Create a mail list domain from a given URL.

Parameters
  • name (Email list domain name, such that multiple instances of) – AbstractMailListDomain can easily be distinguished.

  • url_root (The invariant root URL that does not change no matter what) – part of the mail list domain we access.

  • url_home (The 'home' space of the mail list domain. This is required as) – it contains the different sections which we obtain using get_sections().

  • select (Selection criteria that can filter messages by:) –

    • content, i.e. header and/or body

    • period, i.e. written in a certain year, month, week-of-month

  • url_login (URL to the 'Log In' page.) –

  • url_pref (URL to the 'Preferences'/settings page.) –

  • login (Login credentials (username and password) that were used to set) – up AuthSession.

  • session (requests.Session() object for the mail list domain website.) –

  • instant_save (Boolean giving the choice to save an AbstractMailList as) – soon as it is completely scraped or collect the entire mail list domain. The former is recommended if a large number of mailing lists are scraped, which can require a lot of memory and time.

  • only_mlist_urls (Boolean giving the choice to collect only AbstractMailList) – URLs or also their contents.

abstract classmethod get_lists_from_url(url_home: str, select: dict, session: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → List[Union[bigbang.ingress.abstract.AbstractMailList, str]]

Create a dictionary of all lists in the mail list domain.

Parameters
  • url_root (The invariant root URL that does not change no matter what) – part of the mail list domain we access.

  • url_home (The 'home' space of the mail list domain. This is required as) – it contains the different sections which we obtain using get_sections().

  • select (Selection criteria that can filter messages by:) –

    • content, i.e. header and/or body

    • period, i.e. written in a certain year, month, week-of-month

  • session (requests.Session() object for the mail list domain website.) –

  • instant_save (Boolean giving the choice to save an AbstractMailList as) – soon as it is completely scraped or collect the entire mail list domain. The former is recommended if a large number of mailing lists are scraped, which can require a lot of memory and time.

  • only_mlist_urls (Boolean giving the choice to collect only AbstractMailList) – URLs or also their contents.

Returns

archive_dict: the keys are the names of the lists and the values their URLs.

to_dict(include_body: bool = True) → Dict[str, List[str]]

Concatenates mailing list dictionaries created using AbstractMailList.to_dict().
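Concatenating per-list dictionaries of the shape produced by AbstractMailList.to_dict() can be sketched as below. The function `concatenate_list_dicts` is hypothetical, and padding header fields that are absent from one list with None is an assumption made so that all columns keep the same length; it is not documented bigbang behaviour.

```python
from typing import Dict, List, Optional

Column = List[Optional[str]]


def concatenate_list_dicts(dicts: List[Dict[str, Column]]) -> Dict[str, Column]:
    """Stack several column-wise dictionaries on top of each other (sketch)."""
    # Union of all keys, in the order they are first seen.
    keys: List[str] = []
    for table in dicts:
        for key in table:
            if key not in keys:
                keys.append(key)

    merged: Dict[str, Column] = {key: [] for key in keys}
    for table in dicts:
        # Number of messages in this table (0 if the table is empty).
        n_rows = len(next(iter(table.values()), []))
        for key in keys:
            # Assumption: fields missing from a table are padded with None.
            merged[key].extend(table.get(key, [None] * n_rows))
    return merged


list_a = {"Subject": ["a", "b"], "body": ["x", "y"]}
list_b = {"Subject": ["c"], "From": ["z@example.org"]}
merged = concatenate_list_dicts([list_a, list_b])
```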

to_mbox(dir_out: str)

Save mail list domain content to .mbox files

to_pandas_dataframe(include_body: bool = True) → pandas.core.frame.DataFrame

Concatenates mailing list pandas.DataFrames created using AbstractMailList.to_pandas_dataframe().

exception bigbang.ingress.abstract.AbstractMailListDomainWarning

Bases: BaseException

Base class for AbstractMailListDomain class specific exceptions

exception bigbang.ingress.abstract.AbstractMailListWarning

Bases: BaseException

Base class for AbstractMailList class specific exceptions

class bigbang.ingress.abstract.AbstractMessageParser(website=False, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None)

Bases: abc.ABC

This class handles the creation of a mailbox.mboxMessage object (using the from_*() methods) and its storage in various other file formats (using the to_*() methods) that can be saved locally.

create_email_message(archived_at: str, body: str, **header) → mailbox.mboxMessage
Parameters
  • archived_at (URL to the Email message.) –

  • body (String that contains the body of the message.) –

  • header (Dictionary that contains all available header fields of the) – message.
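How a message might be assembled from these three inputs can be sketched with the standard library. The function `sketch_create_email_message` is a hypothetical illustration, not the bigbang implementation; storing the URL in an Archived-At header field enclosed in angle brackets follows RFC 5064 and is an assumption about what the method does with `archived_at`.

```python
import mailbox


def sketch_create_email_message(
    archived_at: str, body: str, **header: str
) -> mailbox.mboxMessage:
    """Build an mboxMessage from a URL, a body, and header fields (sketch)."""
    msg = mailbox.mboxMessage()
    for field, value in header.items():
        msg[field] = value
    # Assumption: the message URL is kept in the Archived-At header field.
    msg["Archived-At"] = f"<{archived_at}>"
    msg.set_payload(body)
    return msg


msg = sketch_create_email_message(
    "https://example.org/archives/public-bigdata/0001.html",  # made-up URL
    "Hello list.",
    subject="Greetings",
    # 'from' is a Python keyword, so it is passed via dictionary unpacking.
    **{"from": "alice@example.org"},
)
```

Header lookups on the resulting message are case-insensitive, so `msg["Subject"]` and `msg["subject"]` return the same value.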

from_url(list_name: str, url: str, fields: str = 'total') → mailbox.mboxMessage
Parameters
  • list_name (The name of the mailing list.) –

  • url (URL of this Email) –

  • fields (Indicates whether to return 'header', 'body' or 'total'/both of) – the Email. The latter is the default.

static to_dict(msg: mailbox.mboxMessage) → Dict[str, List[str]]

Convert mboxMessage to a Dictionary

static to_mbox(msg: mailbox.mboxMessage, filepath: str)
Parameters
  • msg (The Email.) –

  • filepath (Path to file in which the Email will be stored.) –

static to_pandas_dataframe(msg: mailbox.mboxMessage) → pandas.core.frame.DataFrame

Convert mboxMessage to a pandas.DataFrame

exception bigbang.ingress.abstract.AbstractMessageParserWarning

Bases: BaseException

Base class for AbstractMessageParser class specific exceptions