ingress.abstract¶
-
class
bigbang.ingress.abstract.
AbstractMailList
(name: str, source: Union[List[str], str], msgs: List[mailbox.mboxMessage])¶ Bases:
abc.ABC
This class handles the scraping of all public Emails contained in a single mailing list. To be more precise, each contributor to a mailing list sends their message to an Email address that has the following structure: <mailing_list_name>@<mail_list_domain_name>. Thus, this class contains all Emails sent to a specific <mailing_list_name> (the Email localpart).
- Parameters
name (The name of the list (e.g. 3GPP_COMMON_IMS_XFER, IEEESCO-DIFUSION, ..)) –
source (Contains the information of the location of the mailing list.) – It can be either a URL where the list lives or a path to the file(s).
msgs (List of mboxMessage objects) –
-
from_url
()¶
-
from_messages
()¶
-
from_mbox
()¶
-
get_message_urls
()¶
-
get_messages_from_urls
()¶
-
get_index_of_elements_in_selection
()¶
-
get_name_from_url
()¶
-
to_dict
()¶
-
to_pandas_dataframe
()¶
-
to_mbox
()¶
-
__getitem__
(index) → mailbox.mboxMessage¶ Get specific message at position index within the mailing list.
-
__iter__
()¶ Iterate over each message within the mailing list.
-
__len__
() → int¶ Get number of messages within the mailing list.
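The three special methods above make the class behave like a read-only sequence of messages. A minimal sketch of that container protocol follows, using only the standard library. `MessageList` and `make_message` are hypothetical stand-ins for illustration, not part of bigbang.

```python
import mailbox
from typing import Iterator, List


class MessageList:
    """Hypothetical stand-in that mimics the container methods listed above."""

    def __init__(self, name: str, msgs: List[mailbox.mboxMessage]):
        self.name = name
        self.messages = msgs

    def __getitem__(self, index: int) -> mailbox.mboxMessage:
        # Message at position `index` within the list.
        return self.messages[index]

    def __iter__(self) -> Iterator[mailbox.mboxMessage]:
        # Iterate over the messages in their stored order.
        return iter(self.messages)

    def __len__(self) -> int:
        # Number of messages in the list.
        return len(self.messages)


def make_message(subject: str) -> mailbox.mboxMessage:
    """Build a tiny example message with one header and a body."""
    msg = mailbox.mboxMessage()
    msg["Subject"] = subject
    msg.set_payload("example body")
    return msg


mlist = MessageList("public-bigdata", [make_message("first"), make_message("second")])
subjects = [msg["Subject"] for msg in mlist]
```

With these methods in place, `len(mlist)`, `mlist[1]` and a plain `for` loop all work on the object directly.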
-
abstract classmethod
from_mbox
(name: str, filepath: str) → bigbang.ingress.abstract.AbstractMailList¶ - Parameters
name (Name of the list of messages, e.g. '3GPP_TSG_SA_WG2_UPCON'.) –
filepath (Path to file in which mailing list is stored.) –
-
abstract classmethod
from_messages
(name: str, url: str, messages: Union[List[str], List[mailbox.mboxMessage]], fields: str = 'total', url_login: str = None, url_pref: str = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None) → bigbang.ingress.abstract.AbstractMailList¶ - Parameters
name (Name of the list of messages, e.g. 'public-bigdata') –
url (URL to the Email list.) –
messages (Can either be a list of URLs to specific messages) – or a list of mboxMessage objects.
url_login (URL to the 'Log In' page.) –
url_pref (URL to the 'Preferences'/settings page.) –
login (Login credentials (username and password) that were used to set) – up AuthSession.
session (requests.Session() object for the Email list domain website.) –
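Because the constructors are declared as abstract classmethods, a concrete subclass has to implement them before it can be instantiated. The sketch below shows that mechanism with a reduced, hypothetical base class (`MailListBase`) and subclass (`InMemoryMailList`). It is not the bigbang implementation and omits the login and session parameters.

```python
import mailbox
from abc import ABC, abstractmethod
from typing import List


class MailListBase(ABC):
    """Hypothetical, reduced stand-in for an abstract mailing-list class."""

    def __init__(self, name: str, source: str, msgs: List[mailbox.mboxMessage]):
        self.name = name
        self.source = source
        self.messages = msgs

    @classmethod
    @abstractmethod
    def from_messages(
        cls, name: str, url: str, messages: List[mailbox.mboxMessage]
    ) -> "MailListBase":
        """Build a list object from already-parsed messages."""


class InMemoryMailList(MailListBase):
    """Concrete subclass that simply stores the messages it is given."""

    @classmethod
    def from_messages(cls, name, url, messages):
        return cls(name, url, list(messages))


msg = mailbox.mboxMessage()
msg["Subject"] = "hello"
mlist = InMemoryMailList.from_messages(
    "public-bigdata", "https://example.org/public-bigdata", [msg]
)

# The abstract base itself cannot be instantiated.
try:
    MailListBase("x", "y", [])
    abstract_is_instantiable = True
except TypeError:
    abstract_is_instantiable = False
```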
-
abstract classmethod
from_url
(name: str, url: str, select: Optional[dict] = {'fields': 'total'}, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None) → bigbang.ingress.abstract.AbstractMailList¶ - Parameters
name (Name of the mailing list.) –
url (URL to the mailing list.) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
url_login (URL to the 'Log In' page) –
url_pref (URL to the 'Preferences'/settings page) –
login (Login credentials (username and password) that were used to set) – up AuthSession.
session (requests.Session() object for the Email list domain website.) –
-
static
get_index_of_elements_in_selection
(times: List[Union[int, str]], urls: List[str], filtr: Union[tuple, list, int, str]) → List[int]¶ Filter for messages that were written in a specific period. Period here is a set containing units of year, month, and week-of-month, which can have the following example elements:
years: (1992, 2010), [2000, 2008], 2021
months: [“January”, “July”], “November”
weeks: (1, 4), [1, 5], 2
- Parameters
times (A list containing information of the period for each) – group of mboxMessage.
urls (Corresponding URLs of each group of mboxMessage of which the) – period info is contained in times.
filtr (Containing info on what should be filtered.) –
- Returns
- Return type
Indices of the elements in times/urls.
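The example elements above suggest three filter shapes: a tuple, a list, and a single value. The following sketch illustrates one plausible reading. It assumes that a tuple denotes an inclusive range, a list enumerates accepted values, and a scalar must match exactly; the documentation does not state these semantics. `matches` and `indices_in_selection` are hypothetical helpers and ignore the `urls` argument.

```python
from typing import List, Union

Period = Union[int, str]


def matches(value: Period, filtr: Union[tuple, list, int, str]) -> bool:
    """Decide whether one period value passes the filter (illustrative only)."""
    if isinstance(filtr, tuple):
        # Assumption: a tuple is an inclusive (start, end) range.
        return min(filtr) <= value <= max(filtr)
    if isinstance(filtr, list):
        # Assumption: a list enumerates the accepted values.
        return value in filtr
    # A single int or str must match exactly.
    return value == filtr


def indices_in_selection(
    times: List[Period], filtr: Union[tuple, list, int, str]
) -> List[int]:
    """Return the positions in `times` that pass the filter."""
    return [idx for idx, value in enumerate(times) if matches(value, filtr)]


years = [1998, 2003, 2008, 2021]
months = ["January", "July", "November", "July"]

in_range = indices_in_selection(years, (2000, 2010))
listed = indices_in_selection(years, [2000, 2008])
single = indices_in_selection(years, 2021)
by_month = indices_in_selection(months, ["January", "July"])
```

The returned indices can then be used to pick the corresponding entries from a parallel list of URLs.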
-
abstract classmethod
get_message_urls
(name: str, url: str, select: Optional[dict] = None) → List[str]¶ - Parameters
name (Name of the list of messages, e.g. 'public-bigdata') –
url (URL to the mailing list.) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year and month
- Returns
- Return type
List of all selected URLs of the messages in the mailing list.
-
static
get_messages_from_urls
(name: str, msg_urls: list, msg_parser, fields: Optional[str] = 'total') → List[mailbox.mboxMessage]¶ Generator that returns all messages within a certain period (e.g. January 2021, Week 5).
- Parameters
name (Name of the list of messages, e.g. '3GPP_TSG_SA_WG2_UPCON') –
url (URL to the mailing list.) –
fields (Content, i.e. header and/or body) –
-
abstract
get_name_from_url
() → str¶
-
to_dict
(include_body: bool = True) → Dict[str, List[str]]¶ - Parameters
include_body (A boolean that indicates whether the message body should) – be included or not.
- Returns
A Dictionary with the first key layer being the header field names and
the “body” key. Each value field is a list containing the respective
header field contents arranged by the order as they were scraped from
the web. This format makes the conversion to a pandas.DataFrame easier.
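The column-wise layout described above can be sketched with the standard library alone. `messages_to_dict` is a hypothetical helper, and storing `None` for a header that a message lacks is an assumption of this sketch rather than documented behaviour.

```python
import mailbox
from typing import Dict, List, Optional


def messages_to_dict(
    msgs: List[mailbox.mboxMessage], include_body: bool = True
) -> Dict[str, List[Optional[str]]]:
    """Collect header fields (and optionally bodies) column-wise."""
    # Union of header names across all messages, in first-seen order.
    fields: List[str] = []
    for msg in msgs:
        for key in msg.keys():
            if key not in fields:
                fields.append(key)

    table: Dict[str, List[Optional[str]]] = {key: [] for key in fields}
    if include_body:
        table["body"] = []

    for msg in msgs:
        for key in fields:
            # Assumption: a header missing from a message is stored as None.
            table[key].append(msg[key])
        if include_body:
            table["body"].append(msg.get_payload())
    return table


first = mailbox.mboxMessage()
first["From"] = "alice@example.org"
first["Subject"] = "Hello"
first.set_payload("Hi")

second = mailbox.mboxMessage()
second["From"] = "bob@example.org"
second.set_payload("Bye")

with_body = messages_to_dict([first, second])
headers_only = messages_to_dict([first, second], include_body=False)
```

Since every value is a list of equal length, a dictionary of this shape can be passed to a tabular constructor without further reshaping.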
-
to_mbox
(dir_out: str, filename: Optional[str] = None)¶ Save mailing list to .mbox files.
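Writing messages to an .mbox file can be done with the standard library `mailbox` module. The sketch below writes two messages and reads them back. `save_to_mbox` is a hypothetical helper, and it assumes the file is named `<filename>.mbox` inside `dir_out`, which the documentation does not confirm.

```python
import mailbox
import os
import tempfile
from typing import List


def save_to_mbox(msgs: List[mailbox.mboxMessage], dir_out: str, filename: str) -> str:
    """Write `msgs` to <dir_out>/<filename>.mbox and return the file path."""
    os.makedirs(dir_out, exist_ok=True)
    filepath = os.path.join(dir_out, f"{filename}.mbox")
    box = mailbox.mbox(filepath)  # creates the file if it does not exist
    box.lock()
    try:
        for msg in msgs:
            box.add(msg)
        box.flush()
    finally:
        box.unlock()
        box.close()
    return filepath


def make_message(subject: str) -> mailbox.mboxMessage:
    """Build a tiny example message."""
    msg = mailbox.mboxMessage()
    msg["Subject"] = subject
    msg.set_payload("example body")
    return msg


with tempfile.TemporaryDirectory() as tmp:
    path = save_to_mbox([make_message("first"), make_message("second")], tmp, "public-bigdata")
    written_name = os.path.basename(path)
    # Read the file back to check that both messages were stored.
    reloaded = mailbox.mbox(path, create=False)
    stored_subjects = [m["Subject"] for m in reloaded]
    reloaded.close()
```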
-
to_pandas_dataframe
(include_body: bool = True) → pandas.core.frame.DataFrame¶ - Parameters
include_body (A boolean that indicates whether the message body should) – be included or not.
- Returns
Converts the mailing list into a pandas.DataFrame object in which each
row represents an Email.
-
class
bigbang.ingress.abstract.
AbstractMailListDomain
(name: str, url: str, lists: List[Union[bigbang.ingress.abstract.AbstractMailList, str]])¶ Bases:
abc.ABC
This class handles the scraping of all public Emails contained in a mail list domain. To be more precise, each contributor to a mailing archive sends their message to an Email address that has the following structure: <mailing_list_name>@<mail_list_domain_name>. Thus, this class contains all Emails sent to <mail_list_domain_name> (the Email domain name). These Emails are contained in a list of AbstractMailList types, such that it is known to which <mailing_list_name> (the Email localpart) each was sent.
- Parameters
name (The mail list domain name (e.g. 3GPP, IEEE, W3C)) –
url (The URL where the archive lives) –
lists (A list containing the mailing lists as AbstractMailList types) –
-
from_url
()¶
-
from_mailing_lists
()¶
-
from_mbox
()¶
-
get_lists_from_url
()¶
-
to_dict
()¶
-
to_pandas_dataframe
()¶
-
to_mbox
()¶
-
__getitem__
(index)¶ Get specific mailing list at position index from the mail list domain.
-
__iter__
()¶ Iterate over each mailing list within the mail list domain.
-
__len__
()¶ Get number of mailing lists within the mail list domain.
-
abstract classmethod
from_mailing_lists
(name: str, url_root: str, url_mailing_lists: Union[List[str], List[bigbang.ingress.abstract.AbstractMailList]], select: Optional[dict] = None, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None, only_mlist_urls: bool = True, instant_save: Optional[bool] = True) → bigbang.ingress.abstract.AbstractMailListDomain¶ Create a mail list domain from a given list of AbstractMailList instances or URLs pointing to mailing lists.
- Parameters
name (mail list domain name, such that multiple instances of) – AbstractMailListDomain can easily be distinguished.
url_root (The invariant root URL that does not change no matter what) – part of the mail list domain we access.
url_mailing_lists (This argument can either be a list of AbstractMailList) – objects or a list of strings containing the URLs of the mailing lists of interest.
url_login (URL to the 'Log In' page.) –
url_pref (URL to the 'Preferences'/settings page.) –
login (Login credentials (username and password) that were used to set) – up AuthSession.
session (requests.Session() object for the mail list domain website.) –
only_mlist_urls (Boolean giving the choice to collect only mailing list) – URLs or also their contents.
instant_save (Boolean giving the choice to save an AbstractMailList as) – soon as it is completely scraped or to collect the entire mail list domain. The former is recommended if a large number of mailing lists are scraped, which can require a lot of memory and time.
-
abstract classmethod
from_mbox
(name: str, directorypath: str, filedsc: str = '*.mbox') → bigbang.ingress.abstract.AbstractMailListDomain¶ - Parameters
name (mail list domain name, such that multiple instances of) – AbstractMailListDomain can easily be distinguished.
directorypath (Path to the folder in which AbstractMailListDomain is stored.) –
filedsc (Optional filter that only reads files matching the description.) – By default all files with an mbox extension are read.
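The `filedsc` pattern behaves like a shell-style file filter over a directory. A minimal sketch of that idea with `glob` and `mailbox` follows. `read_mbox_directory` is a hypothetical helper, and using the file stem as the list name is an assumption.

```python
import glob
import mailbox
import os
import tempfile
from typing import Dict, List


def read_mbox_directory(
    directorypath: str, filedsc: str = "*.mbox"
) -> Dict[str, List[mailbox.mboxMessage]]:
    """Map each matching file's stem to the messages it contains."""
    lists: Dict[str, List[mailbox.mboxMessage]] = {}
    for filepath in sorted(glob.glob(os.path.join(directorypath, filedsc))):
        # Assumption: the file name without extension is the list name.
        list_name = os.path.splitext(os.path.basename(filepath))[0]
        box = mailbox.mbox(filepath, create=False)
        lists[list_name] = list(box)
        box.close()
    return lists


# One message in mbox format: a "From " separator line, headers, blank line, body.
RAW = "From sender@example.org Fri Jan  1 00:00:00 2021\nSubject: hello\n\nbody text\n"

with tempfile.TemporaryDirectory() as tmp:
    for fname in ("list-a.mbox", "list-b.mbox", "notes.txt"):
        with open(os.path.join(tmp, fname), "w", encoding="ascii") as fh:
            fh.write(RAW)
    domain = read_mbox_directory(tmp)
    only_a = read_mbox_directory(tmp, filedsc="list-a*")
```

Only the two `.mbox` files are picked up by the default pattern, and a narrower pattern restricts the result further.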
-
abstract classmethod
from_url
(name: str, url_root: str, url_home: Optional[str] = None, select: Optional[dict] = None, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → bigbang.ingress.abstract.AbstractMailListDomain¶ Create a mail list domain from a given URL.
- Parameters
name (Email list domain name, such that multiple instances of) – AbstractMailListDomain can easily be distinguished.
url_root (The invariant root URL that does not change no matter what) – part of the mail list domain we access.
url_home (The 'home' space of the mail list domain. This is required as) – it contains the different sections which we obtain using get_sections().
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
url_login (URL to the 'Log In' page.) –
url_pref (URL to the 'Preferences'/settings page.) –
login (Login credentials (username and password) that were used to set) – up AuthSession.
session (requests.Session() object for the mail list domain website.) –
instant_save (Boolean giving the choice to save an AbstractMailList as) – soon as it is completely scraped or to collect the entire mail list domain. The former is recommended if a large number of mailing lists are scraped, which can require a lot of memory and time.
only_mlist_urls (Boolean giving the choice to collect only AbstractMailList) – URLs or also their contents.
-
abstract classmethod
get_lists_from_url
(url_home: str, select: dict, session: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → List[Union[bigbang.ingress.abstract.AbstractMailList, str]]¶ Create a dictionary of all lists in the mail list domain.
- Parameters
url_root (The invariant root URL that does not change no matter what) – part of the mail list domain we access.
url_home (The 'home' space of the mail list domain. This is required as) – it contains the different sections which we obtain using get_sections().
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
session (requests.Session() object for the mail list domain website.) –
instant_save (Boolean giving the choice to save an AbstractMailList as) – soon as it is completely scraped or to collect the entire mail list domain. The former is recommended if a large number of mailing lists are scraped, which can require a lot of memory and time.
only_mlist_urls (Boolean giving the choice to collect only AbstractMailList) – URLs or also their contents.
- Returns
archive_dict
- Return type
The keys are the names of the lists and the values are their URLs.
-
to_dict
(include_body: bool = True) → Dict[str, List[str]]¶ Concatenates mailing list dictionaries created using AbstractMailList.to_dict().
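Concatenating per-list dictionaries means appending their columns while keeping all columns the same length. The sketch below shows one way to do that under two assumptions that the documentation does not confirm: a `"mailing-list"` column records which list each row came from, and a column missing from one list is padded with `None`. `concat_list_dicts` is a hypothetical helper.

```python
from typing import Dict, List, Optional

Table = Dict[str, List[Optional[str]]]


def concat_list_dicts(named_tables: Dict[str, Table]) -> Table:
    """Append per-list column dictionaries into one table."""
    # Assumption: an extra column records the originating list of each row.
    merged: Table = {"mailing-list": []}
    total_rows = 0
    for list_name, table in named_tables.items():
        n_rows = len(next(iter(table.values()), []))
        # Columns seen for the first time are padded for all earlier rows.
        for key in table:
            if key not in merged:
                merged[key] = [None] * total_rows
        for key in merged:
            if key == "mailing-list":
                merged[key].extend([list_name] * n_rows)
            else:
                # Columns absent from this list are padded with None.
                merged[key].extend(table.get(key, [None] * n_rows))
        total_rows += n_rows
    return merged


combined = concat_list_dicts(
    {
        "list-a": {"subject": ["a1", "a2"], "body": ["x", "y"]},
        "list-b": {"subject": ["b1"], "from": ["bob@example.org"]},
    }
)
```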
-
to_mbox
(dir_out: str)¶ Save mail list domain content to .mbox files
-
to_pandas_dataframe
(include_body: bool = True) → pandas.core.frame.DataFrame¶ Concatenates mailing list pandas.DataFrames created using AbstractMailList.to_pandas_dataframe().
-
exception
bigbang.ingress.abstract.
AbstractMailListDomainWarning
¶ Bases:
BaseException
Base class for AbstractMailListDomain class specific exceptions
-
exception
bigbang.ingress.abstract.
AbstractMailListWarning
¶ Bases:
BaseException
Base class for AbstractMailList class specific exceptions
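One practical consequence of deriving from BaseException rather than Exception is that a generic `except Exception` handler does not catch these exceptions. The sketch below demonstrates this with a hypothetical class, `ExampleListWarning`, that has the same base as the exceptions documented here.

```python
class ExampleListWarning(BaseException):
    """Hypothetical exception that derives from BaseException."""


def classify(handler_base: type) -> str:
    """Raise the example exception and report which handler caught it."""
    try:
        try:
            raise ExampleListWarning("no messages found")
        except handler_base:
            return "caught by inner handler"
    except BaseException:
        return "escaped inner handler"


with_exception_handler = classify(Exception)
with_specific_handler = classify(ExampleListWarning)
```

Callers that want to handle these exceptions therefore need to name the specific class (or BaseException) in their `except` clause.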
-
class
bigbang.ingress.abstract.
AbstractMessageParser
(website=False, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None)¶ Bases:
abc.ABC
This class handles the creation of a mailbox.mboxMessage object (using the from_*() methods) and its storage in various other file formats (using the to_*() methods) that can be saved locally.
-
create_email_message
(archived_at: str, body: str, **header) → mailbox.mboxMessage¶ - Parameters
archived_at (URL to the Email message.) –
body (String that contains the body of the message.) –
header (Dictionary that contains all available header fields of the) – message.
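Assembling an mboxMessage from a URL, a body and a set of header fields can be sketched with the standard library. `build_message` is a hypothetical helper. Storing the URL in an `Archived-At` header wrapped in angle brackets is an assumption about how `archived_at` is used, not something this page states.

```python
import mailbox


def build_message(archived_at: str, body: str, **header: str) -> mailbox.mboxMessage:
    """Create an mboxMessage from a body and arbitrary header fields."""
    msg = mailbox.mboxMessage()
    for field, value in header.items():
        msg[field] = value
    # Assumption: the message URL is recorded in an Archived-At header.
    msg["Archived-At"] = f"<{archived_at}>"
    msg.set_payload(body)
    return msg


# Header names with hyphens are passed by unpacking a dictionary.
msg = build_message(
    "https://example.org/msg/1",
    "Body text",
    **{"Subject": "Hello", "From": "alice@example.org", "Message-ID": "<1@example.org>"},
)
```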
-
from_url
(list_name: str, url: str, fields: str = 'total') → mailbox.mboxMessage¶ - Parameters
list_name (The name of the mailing list.) –
url (URL of this Email) –
fields (Indicates whether to return 'header', 'body' or 'total'/both of) – the Email. The latter is the default.
-
static
to_dict
(msg: mailbox.mboxMessage) → Dict[str, List[str]]¶ Convert mboxMessage to a Dictionary
-
static
to_mbox
(msg: mailbox.mboxMessage, filepath: str)¶ - Parameters
msg (The Email.) –
filepath (Path to file in which the Email will be stored.) –
-
static
to_pandas_dataframe
(msg: mailbox.mboxMessage) → pandas.core.frame.DataFrame¶ Convert mboxMessage to a pandas.DataFrame
-
exception
bigbang.ingress.abstract.
AbstractMessageParserWarning
¶ Bases:
BaseException
Base class for AbstractMessageParser class specific exceptions