ingress.listserv

class bigbang.ingress.listserv.ListservMailList(name: str, source: Union[List[str], str], msgs: List[mailbox.mboxMessage])

Bases: bigbang.ingress.abstract.AbstractMailList

This class handles the scraping of all public Emails contained in a single mailing list in the LISTSERV 16.5 and 17 format. To be more precise, each contributor to a mailing list sends their message to an Email address that has the following structure: <mailing_list_name>@LIST.ETSI.ORG. Thus, this class contains all Emails sent to a specific <mailing_list_name> (the Email localpart, such as “3GPP_TSG_CT_WG1” or “3GPP_TSG_CT_WG3_108E_MAIN”).
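The address structure described above can be illustrated with a short sketch. The helper name `split_list_address` is hypothetical and not part of bigbang; it only shows how the localpart (the mailing list name) and the domain relate.

```python
from typing import Tuple


def split_list_address(address: str) -> Tuple[str, str]:
    """Split an Email address into (localpart, domain).

    Hypothetical helper for illustration only, not a bigbang function.
    """
    localpart, _, domain = address.rpartition("@")
    if not localpart or not domain:
        raise ValueError(f"not a valid list address: {address!r}")
    return localpart, domain


# The localpart is the mailing list name, the domain identifies the server.
name, domain = split_list_address("3GPP_TSG_CT_WG1@LIST.ETSI.ORG")
```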

Parameters
  • name (The name of the list (e.g. 3GPP_COMMON_IMS_XFER, IEEESCO-DIFUSION, ..)) –

  • source (Contains the information of the location of the mailing list.) – It can be either a URL where the list is hosted or a path to the file(s).

  • msgs (List of mboxMessage objects) –

Example

To scrape a Listserv mailing list from a URL and store it in run-time memory, we do the following:

>>> from bigbang.ingress.listserv import ListservMailList
>>> mlist = ListservMailList.from_url(
...     name="IEEE-TEST",
...     url="https://listserv.ieee.org/cgi-bin/wa?A0=IEEE-TEST",
...     select={
...         "years": 2015,
...         "months": "November",
...         "weeks": 4,
...         "fields": "header",
...     },
...     login={"username": <your_username>, "password": <your_password>},
... )

To save it as a *.mbox file, we do the following:

>>> mlist.to_mbox(path_to_file)
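For readers unfamiliar with the mbox format, the following is a minimal sketch of what storing a list of mailbox.mboxMessage objects in an *.mbox file involves, using only the Python standard library. It is not bigbang's implementation of to_mbox(); the function name `save_messages_as_mbox` and the sample message are assumptions made for illustration.

```python
import mailbox
import os
import tempfile


def save_messages_as_mbox(msgs, filepath):
    """Append every message to an mbox file, creating the file if needed.

    Hypothetical helper; bigbang's own to_mbox() may work differently.
    """
    box = mailbox.mbox(filepath)
    try:
        for msg in msgs:
            box.add(msg)
        box.flush()  # write pending changes to disk
    finally:
        box.close()


# Build one sample message (made-up content).
msg = mailbox.mboxMessage()
msg["From"] = "alice@example.org"
msg["Subject"] = "Test message"
msg.set_payload("Hello list\n")

path = os.path.join(tempfile.mkdtemp(), "IEEE-TEST.mbox")
save_messages_as_mbox([msg], path)

# Read the file back to confirm the message was stored.
stored = list(mailbox.mbox(path))
```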

classmethod from_listserv_directories(name: str, directorypaths: List[str], filedsc: str = '*.LOG?????', select: Optional[dict] = None) → bigbang.ingress.listserv.ListservMailList

This method is required if the files that contain the list messages were directly exported from LISTSERV 16.5 (e.g. by a member of 3GPP). Each mailing list has its own directory and is split over multiple files with an extension starting with LOG and ending with five digits.

Parameters
  • name (Name of the list of messages, e.g. '3GPP_TSG_SA_WG2_UPCON'.) –

  • directorypaths (List of directory paths where LISTSERV formatted messages are.) –

  • filedsc (A description of the relevant files, e.g. *.LOG?????) –

  • select (Selection criteria that can filter messages by:) –

    • content, i.e. header and/or body

    • period, i.e. written in a certain year, month, week-of-month
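The role of filedsc can be sketched as a glob pattern applied inside each directory. This is an illustration under that assumption, not the library's code; `collect_listserv_files` and the sample file names are hypothetical.

```python
import glob
import os
import tempfile
from typing import List


def collect_listserv_files(
    directorypaths: List[str], filedsc: str = "*.LOG?????"
) -> List[str]:
    """Return all files in the given directories that match filedsc.

    Hypothetical helper; each '?' in the pattern matches exactly one character.
    """
    filepaths = []
    for directory in directorypaths:
        filepaths.extend(glob.glob(os.path.join(directory, filedsc)))
    return sorted(filepaths)


# Create a throwaway directory with two LISTSERV-style files and one other file.
demo_dir = tempfile.mkdtemp()
for filename in (
    "3GPP_TSG_SA_WG2_UPCON.LOG1405D",
    "3GPP_TSG_SA_WG2_UPCON.LOG1406A",
    "notes.txt",
):
    open(os.path.join(demo_dir, filename), "w").close()

found = [os.path.basename(p) for p in collect_listserv_files([demo_dir])]
```

Only the two files whose extension is LOG followed by five characters are returned; notes.txt is ignored.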

classmethod from_listserv_files(name: str, filepaths: List[str], select: Optional[dict] = None) → bigbang.ingress.listserv.ListservMailList

This method is required if the files that contain the list messages were directly exported from LISTSERV 16.5 (e.g. by a member of 3GPP). Each mailing list has its own directory and is split over multiple files with an extension starting with LOG and ending with five digits. Compared to ListservMailList.from_listserv_directories(), this method reads messages from single files, instead of all the files contained in a directory.

Parameters
  • name (Name of the list of messages, e.g. '3GPP_TSG_SA_WG2_UPCON') –

  • filepaths (List of file paths where LISTSERV formatted messages are.) – Such files can have a file extension of the form: *.LOG1405D

  • select (Selection criteria that can filter messages by:) –

    • content, i.e. header and/or body

    • period, i.e. written in a certain year, month, week-of-month

classmethod from_mbox(name: str, filepath: str) → bigbang.ingress.listserv.ListservMailList

Docstring in AbstractMailList.

classmethod from_messages(name: str, url: str, messages: Union[List[str], List[mailbox.mboxMessage]], fields: str = 'total', url_login: str = 'https://list.etsi.org/scripts/wa.exe?LOGON', url_pref: str = 'https://list.etsi.org/scripts/wa.exe?PREF', login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None) → bigbang.ingress.listserv.ListservMailList

Docstring in AbstractMailList.

classmethod from_url(name: str, url: str, select: Optional[dict] = {'fields': 'total'}, url_login: str = 'https://list.etsi.org/scripts/wa.exe?LOGON', url_pref: str = 'https://list.etsi.org/scripts/wa.exe?PREF', login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None) → bigbang.ingress.listserv.ListservMailList

Docstring in AbstractMailList.

static get_all_periods_and_their_urls(url: str) → Tuple[List[str], List[str]]

LISTSERV groups messages into weekly time bundles. This method obtains all the URLs that lead to the messages of each time bundle.

Returns

A tuple of two lists that look like ([‘April 2017, 2’, ‘January 2001’, …], [‘url1’, ‘url2’, …]).
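Because the two returned lists are parallel, they can be combined into a lookup table. The sketch below assumes that the i-th period label belongs to the i-th URL; the helper name `pair_periods_with_urls` and the placeholder URLs are hypothetical.

```python
from typing import Dict, List, Tuple


def pair_periods_with_urls(result: Tuple[List[str], List[str]]) -> Dict[str, str]:
    """Map each period label to its URL (hypothetical helper).

    Assumes both lists are ordered the same way.
    """
    periods, urls = result
    if len(periods) != len(urls):
        raise ValueError("periods and URLs must have the same length")
    return dict(zip(periods, urls))


# Placeholder values in the shape described above.
mapping = pair_periods_with_urls((["April 2017, 2", "January 2001"], ["url1", "url2"]))
```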

classmethod get_line_numbers_of_header_starts(content: List[str]) → List[int]

By definition, LISTSERV logs separate new messages by a row of 73 equal signs.

Parameters

content (The content of one LISTSERV file.) –

Returns

List of line numbers where a header starts.

Return type

List[int]
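The separator rule can be sketched in a few lines. This is an illustration rather than the library's code: the name `find_header_starts` is hypothetical, and reporting the zero-based index of the separator row itself (rather than the line after it) is an assumption.

```python
from typing import List

# A row of exactly 73 equal signs marks the start of a new message.
SEPARATOR = "=" * 73


def find_header_starts(content: List[str]) -> List[int]:
    """Return the zero-based indices of all separator rows (hypothetical helper)."""
    return [
        line_nr
        for line_nr, line in enumerate(content)
        if line.rstrip("\r\n") == SEPARATOR
    ]


# Made-up file content with two messages.
content = [
    SEPARATOR + "\n",
    "Subject: First\n",
    "\n",
    "Hello list\n",
    SEPARATOR + "\n",
    "Subject: Second\n",
    "\n",
    "Bye\n",
]
starts = find_header_starts(content)
```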

classmethod get_message_urls(name: str, url: str, select: Optional[dict] = None) → List[str]

Docstring in AbstractMailList.

This routine is needed for Listserv 16.5

static get_name_from_url(url: str) → str

Get name of mailing list.

classmethod get_period_urls(url: str, select: Optional[dict] = None) → List[str]

All messages within a certain period (e.g. January 2021, Week 5).

Parameters
  • url (URL to the LISTSERV list.) –

  • select (Selection criteria that can filter messages by:) –

    • content, i.e. header and/or body

    • period, i.e. written in a certain year, month, week-of-month
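How a period selection might be applied can be sketched as follows. The select keys "years", "months" and "weeks" come from the usage example of this class, and the label format follows the output of get_all_periods_and_their_urls(). The function `period_matches`, the regular expression, the acceptance of lists as values, and the rule that labels without a week number pass the "weeks" filter are all assumptions.

```python
import re
from typing import Optional

# Matches labels such as "April 2017, 2" (month, year, week) or "January 2001".
PERIOD_PATTERN = re.compile(
    r"^(?P<month>[A-Za-z]+) (?P<year>\d{4})(?:, (?P<week>\d+))?$"
)


def _as_list(value):
    """Wrap a single value in a list so it can be tested with 'in'."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def period_matches(period: str, select: Optional[dict]) -> bool:
    """Check whether a period label satisfies the selection (hypothetical helper)."""
    if not select:
        return True
    match = PERIOD_PATTERN.match(period)
    if match is None:
        return False
    if "years" in select and int(match["year"]) not in _as_list(select["years"]):
        return False
    if "months" in select and match["month"] not in _as_list(select["months"]):
        return False
    week = match["week"]
    if "weeks" in select and week is not None:
        if int(week) not in _as_list(select["weeks"]):
            return False
    return True


periods = ["November 2015, 4", "November 2015, 3", "April 2017, 2", "January 2001"]
select = {"years": 2015, "months": "November", "weeks": 4, "fields": "header"}
kept = [p for p in periods if period_matches(p, select)]
```

The "fields" key is ignored here because it concerns message content, not the period.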

class bigbang.ingress.listserv.ListservMailListDomain(name: str, url: str, lists: List[Union[bigbang.ingress.abstract.AbstractMailList, str]])

Bases: bigbang.ingress.abstract.AbstractMailListDomain

This class handles the scraping of all public Emails contained in a mail list domain that has the LISTSERV 16.5 and 17 format, such as 3GPP. To be more precise, each contributor to a mail list domain sends their message to an Email address that has the following structure: <mailing_list_name>@<mail_list_domain_name>. Thus, this class contains all Emails sent to <mail_list_domain_name> (the Email domain name). These Emails are contained in a list of ListservMailList types, such that it is known to which <mailing_list_name> (the Email localpart) each was sent.

Parameters
  • name (The mailing list domain name (e.g. 3GPP, IEEE, ..)) –

  • url (The URL where the mailing list domain lives) –

  • lists (A list containing the mailing lists as ListservMailList types) –

All methods in the `AbstractMailListDomain` class in addition to:
from_listserv_directory()
get_sections()

Example

To scrape a Listserv mailing list domain from a URL and store it in run-time memory, we do the following:

>>> from bigbang.ingress.listserv import ListservMailListDomain
>>> mlistdom = ListservMailListDomain.from_url(
...     name="IEEE",
...     url_root="https://listserv.ieee.org/cgi-bin/wa?",
...     url_home="https://listserv.ieee.org/cgi-bin/wa?HOME",
...     select={
...         "years": 2015,
...         "months": "November",
...         "weeks": 4,
...         "fields": "header",
...     },
...     login={"username": <your_username>, "password": <your_password>},
...     instant_save=False,
...     only_mlist_urls=False,
... )

To save it as *.mbox files, we do the following:

>>> mlistdom.to_mbox(path_to_directory)

classmethod from_listserv_directory(name: str, directorypath: str, folderdsc: str = '*', filedsc: str = '*.LOG?????', select: Optional[dict] = None) → bigbang.ingress.listserv.ListservMailListDomain

This method is required if the files that contain the mail list domain messages were directly exported from LISTSERV 16.5 (e.g. by a member of 3GPP). Each mailing list has its own subdirectory and is split over multiple files with an extension starting with LOG and ending with five digits.

Parameters
  • name (Mail list domain name, such that multiple instances of ListservMailListDomain can easily be distinguished.) –

  • directorypath (Where the ListservMailListDomain can be initialised.) –

  • folderdsc (A description of the relevant folders) –

  • filedsc (A description of the relevant files, e.g. *.LOG?????) –

  • select (Selection criteria that can filter messages by:) –

    • content, i.e. header and/or body

    • period, i.e. written in a certain year, month, week-of-month
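The directory layout described above (one subdirectory per mailing list, each holding several LOG files) can be walked with two glob patterns. This is a sketch under the assumption that folderdsc and filedsc behave like glob patterns; `map_lists_to_files` and the sample names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def map_lists_to_files(
    directorypath: str, folderdsc: str = "*", filedsc: str = "*.LOG?????"
) -> Dict[str, List[str]]:
    """Map each mailing list subdirectory to its LISTSERV files.

    Hypothetical helper; subdirectories without matching files are skipped.
    """
    lists: Dict[str, List[str]] = {}
    for folder in sorted(Path(directorypath).glob(folderdsc)):
        if not folder.is_dir():
            continue
        files = sorted(str(path) for path in folder.glob(filedsc))
        if files:
            lists[folder.name] = files
    return lists


# Build a throwaway domain directory with two populated lists and one empty one.
root = Path(tempfile.mkdtemp())
layout = {
    "3GPP_TSG_CT_WG1": ["LOG1405D"],
    "3GPP_TSG_SA_WG2_UPCON": ["LOG1405D", "LOG1406A"],
    "EMPTY_LIST": [],
}
for list_name, suffixes in layout.items():
    folder = root / list_name
    folder.mkdir()
    for suffix in suffixes:
        (folder / f"{list_name}.{suffix}").touch()
(root / "README.txt").touch()

lists = map_lists_to_files(str(root))
```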

classmethod from_mailing_lists(name: str, url_root: str, url_mailing_lists: Union[List[str], List[bigbang.ingress.listserv.ListservMailList]], select: Optional[dict] = {'fields': 'total'}, url_login: str = 'https://list.etsi.org/scripts/wa.exe?LOGON', url_pref: str = 'https://list.etsi.org/scripts/wa.exe?PREF', login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None, only_mlist_urls: bool = True, instant_save: Optional[bool] = True) → bigbang.ingress.listserv.ListservMailListDomain

Docstring in AbstractMailListDomain.

classmethod from_mbox(name: str, directorypath: str, filedsc: str = '*.mbox') → bigbang.ingress.listserv.ListservMailList

Docstring in AbstractMailListDomain.

classmethod from_url(name: str, url_root: str, url_home: str, select: Optional[dict] = {'fields': 'total'}, url_login: str = 'https://list.etsi.org/scripts/wa.exe?LOGON', url_pref: str = 'https://list.etsi.org/scripts/wa.exe?PREF', login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → bigbang.ingress.listserv.ListservMailListDomain

Docstring in AbstractMailListDomain.

static get_lists_from_url(url_root: str, url_home: str, select: dict, session: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → List[Union[bigbang.ingress.listserv.ListservMailList, str]]

Docstring in AbstractMailListDomain.

get_sections(url_home: str) → int

Get different sections of mail list domain. On the Listserv 16.5 website they look like: [3GPP] [3GPP–AT1] [AT2–CONS] [CONS–EHEA] [EHEA–ERM_] … On the Listserv 17 website they look like: [<<][<]1-50(798)[>][>>]

Returns

If sections exist, it returns their URLs and names. Otherwise it returns the url_home.

exception bigbang.ingress.listserv.ListservMailListDomainWarning

Bases: BaseException

Base class for ListservMailListDomain class specific exceptions

exception bigbang.ingress.listserv.ListservMailListWarning

Bases: BaseException

Base class for ListservMailList class specific exceptions

class bigbang.ingress.listserv.ListservMessageParser(website=False, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None)

Bases: bigbang.ingress.abstract.AbstractMessageParser, email.parser.Parser

This class handles the creation of a mailbox.mboxMessage object (using the from_*() methods) and its storage in various other file formats (using the to_*() methods) that can be saved in local memory.

Parameters
  • website (Set 'True' if messages are going to be scraped from websites, otherwise 'False' if read from local memory.) –

  • url_login (URL to the 'Log In' page.) –

  • url_pref (URL to the 'Preferences'/settings page.) –

  • login (Login credentials (username and password) that were used to set up AuthSession.) – You can create your own for the 3GPP mail list domain.

  • session (requests.Session() object for the mail list domain website.) –

from_url()
from_listserv_file()
_get_header_from_html()
_get_body_from_html()
_get_header_from_listserv_file()
_get_body_from_listserv_file()

Example

To create an Email message parser object, use the following syntax:

>>> from bigbang.ingress.listserv import ListservMessageParser
>>> msg_parser = ListservMessageParser(
...     website=True,
...     login={"username": <your_username>, "password": <your_password>},
... )

To obtain the Email message content and return it as an mboxMessage object, you need to do the following:

>>> msg = msg_parser.from_url(
...     list_name="3GPP_TSG_RAN_DRAFTS",
...     url="https://list.etsi.org/scripts/wa.exe?A2=ind2010B&L=3GPP_TSG_RAN_DRAFTS&O=D&P=29883",
...     fields="total",
... )

empty_header = {}
from_listserv_file(list_name: str, file_path: str, header_start_line_nr: int, fields: str = 'total') → mailbox.mboxMessage

This method is required if the message is inside a file that was directly exported from LISTSERV 16.5 (e.g. by a member of 3GPP). Such files have an extension starting with LOG and ending with five digits.

Parameters
  • list_name (The name of the LISTSERV Email list.) –

  • file_path (Path to file that contains the Email list.) –

  • header_start_line_nr (Line number in the file on which a new message starts.) –

  • fields (Indicates whether to return the 'header', the 'body', or 'total'/both of the Email.) –
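The interplay of header_start_line_nr and fields can be sketched with the standard library. This is not the library's parser: the function `message_from_lines` is hypothetical, it works on a list of lines instead of a file path, and it assumes that each message consists of RFC 822 style headers and a body placed between rows of 73 equal signs.

```python
import email
import mailbox
from typing import List

SEPARATOR = "=" * 73


def message_from_lines(
    content: List[str], header_start_line_nr: int, fields: str = "total"
) -> mailbox.mboxMessage:
    """Build an mboxMessage from the message starting at the given line.

    Hypothetical helper working on already loaded lines.
    """
    # Skip the separator row itself if the start line points at it.
    start = header_start_line_nr
    if content[start].rstrip("\r\n") == SEPARATOR:
        start += 1
    # The message ends at the next separator row or at the end of the file.
    end = len(content)
    for line_nr in range(start, len(content)):
        if content[line_nr].rstrip("\r\n") == SEPARATOR:
            end = line_nr
            break
    parsed = email.message_from_string("".join(content[start:end]))
    msg = mailbox.mboxMessage()
    if fields in ("header", "total"):
        for key, value in parsed.items():
            msg[key] = value
    if fields in ("body", "total"):
        msg.set_payload(parsed.get_payload())
    return msg


# Made-up file content with two messages.
content = [
    SEPARATOR + "\n",
    "From: alice@example.org\n",
    "Subject: First\n",
    "\n",
    "Hello list\n",
    SEPARATOR + "\n",
    "Subject: Second\n",
    "\n",
    "Bye\n",
]
first = message_from_lines(content, 0)
second_header = message_from_lines(content, 5, fields="header")
```

With fields="header" only the header fields are copied and the payload stays empty.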

exception bigbang.ingress.listserv.ListservMessageParserWarning

Bases: BaseException

Base class for ListservMessageParser class specific exceptions