ingress.listserv¶
-
class
bigbang.ingress.listserv.
ListservMailList
(name: str, source: Union[List[str], str], msgs: List[mailbox.mboxMessage])¶ Bases:
bigbang.ingress.abstract.AbstractMailList
This class handles the scraping of all public Emails contained in a single mailing list in the LISTSERV 16.5 and 17 format. To be more precise, each contributor to a mailing list sends their message to an Email address that has the following structure: <mailing_list_name>@LIST.ETSI.ORG. Thus, this class contains all Emails sent to a specific <mailing_list_name> (the Email localpart, such as “3GPP_TSG_CT_WG1” or “3GPP_TSG_CT_WG3_108E_MAIN”).
- Parameters
name (The name of the list (e.g. 3GPP_COMMON_IMS_XFER, IEEESCO-DIFUSION, ..)) –
source (Contains the information of the location of the mailing list.) – It can be either a URL where the list lives or a path to the file(s).
msgs (List of mboxMessage objects) –
Example
To scrape a Listserv mailing list from a URL and store it in run-time memory, we do the following:
>>> mlist = ListservMailList.from_url(
...     name="IEEE-TEST",
...     url="https://listserv.ieee.org/cgi-bin/wa?A0=IEEE-TEST",
...     select={
...         "years": 2015,
...         "months": "November",
...         "weeks": 4,
...         "fields": "header",
...     },
...     login={"username": <your_username>, "password": <your_password>},
... )
To save it as a *.mbox file, we do the following:
>>> mlist.to_mbox(path_to_file)
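Since the output is an mbox file, it should be readable with Python's standard mailbox module. The sketch below does not call BigBang: it writes one made-up message to a temporary file (the file name, sender and subject are hypothetical stand-ins for what to_mbox() would produce) and reads it back.

```python
import mailbox
import os
import tempfile

# Hypothetical stand-in for a file produced by mlist.to_mbox(...).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "IEEE-TEST.mbox")

# Write one illustrative message so the example is self-contained.
box = mailbox.mbox(path)
msg = mailbox.mboxMessage()
msg["From"] = "alice@example.org"
msg["Subject"] = "Test message"
msg.set_payload("Hello list\n")
box.add(msg)
box.flush()
box.close()

# Read the archive back and collect the subject of every message.
subjects = [m["Subject"] for m in mailbox.mbox(path)]
print(subjects)
```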
-
classmethod
from_listserv_directories
(name: str, directorypaths: List[str], filedsc: str = '*.LOG?????', select: Optional[dict] = None) → bigbang.ingress.listserv.ListservMailList¶ This method is required if the files that contain the list messages were directly exported from LISTSERV 16.5 (e.g. by a member of 3GPP). Each mailing list has its own directory and is split over multiple files with an extension starting with LOG and ending with five digits.
- Parameters
name (Name of the list of messages, e.g. '3GPP_TSG_SA_WG2_UPCON'.) –
directorypaths (List of directory paths where LISTSERV formatted messages are.) –
filedsc (A description of the relevant files, e.g. *.LOG?????) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
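The default filedsc value reads like a shell-style wildcard, where each ? stands for exactly one character. The sketch below checks a few hypothetical file names against it with the standard fnmatch module. Whether BigBang itself matches files this way is an assumption; only the pattern is taken from the signature above.

```python
from fnmatch import fnmatch

FILEDSC = "*.LOG?????"  # default value of the filedsc parameter

# Hypothetical file names, chosen only to exercise the pattern.
names = [
    "3GPP_TSG_SA_WG2_UPCON.LOG1405D",  # LOG + 5 characters: matches
    "3GPP_TSG_SA_WG2_UPCON.LOG2101A",  # LOG + 5 characters: matches
    "3GPP_TSG_SA_WG2_UPCON.LOG21",     # LOG + 2 characters: no match
    "README.txt",                      # unrelated file: no match
]

matching = [n for n in names if fnmatch(n, FILEDSC)]
print(matching)
```

Note that ? matches any character, not only digits, which is why a name such as *.LOG1405D is accepted.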
-
classmethod
from_listserv_files
(name: str, filepaths: List[str], select: Optional[dict] = None) → bigbang.ingress.listserv.ListservMailList¶ This method is required if the files that contain the list messages were directly exported from LISTSERV 16.5 (e.g. by a member of 3GPP). Each mailing list has its own directory and is split over multiple files with an extension starting with LOG and ending with five digits. Compared to ListservMailList.from_listserv_directories(), this method reads messages from single files, instead of all the files contained in a directory.
- Parameters
name (Name of the list of messages, e.g. '3GPP_TSG_SA_WG2_UPCON') –
filepaths (List of file paths where LISTSERV formatted messages are.) – Such files can have a file extension of the form: *.LOG1405D
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
-
classmethod
from_mbox
(name: str, filepath: str) → bigbang.ingress.listserv.ListservMailList¶ Docstring in AbstractMailList.
-
classmethod
from_messages
(name: str, url: str, messages: Union[List[str], List[mailbox.mboxMessage]], fields: str = 'total', url_login: str = 'https://list.etsi.org/scripts/wa.exe?LOGON', url_pref: str = 'https://list.etsi.org/scripts/wa.exe?PREF', login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None) → bigbang.ingress.listserv.ListservMailList¶ Docstring in AbstractMailList.
-
classmethod
from_url
(name: str, url: str, select: Optional[dict] = {'fields': 'total'}, url_login: str = 'https://list.etsi.org/scripts/wa.exe?LOGON', url_pref: str = 'https://list.etsi.org/scripts/wa.exe?PREF', login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None) → bigbang.ingress.listserv.ListservMailList¶ Docstring in AbstractMailList.
-
static
get_all_periods_and_their_urls
(url: str) → Tuple[List[str], List[str]]¶ LISTSERV groups messages into weekly time bundles. This method obtains all the URLs that lead to the messages of each time bundle.
- Returns
Returns a tuple of two lists that look like
([‘April 2017, 2’, ‘January 2001’, …], [‘url1’, ‘url2’, …])
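The two lists are parallel, so a period label and its URL share the same index. A minimal sketch of pairing them and splitting the labels into components follows. It assumes the number after the comma is the week of the month and that labels always have the form shown above; parse_period and the placeholder URLs are hypothetical.

```python
import re
from typing import Optional, Tuple

# Illustrative return value; the URLs are placeholders.
periods = ["April 2017, 2", "January 2001"]
urls = ["url1", "url2"]

# Month name, four-digit year, and an optional ", <week>" suffix.
LABEL = re.compile(r"^(?P<month>[A-Za-z]+) (?P<year>\d{4})(?:, (?P<week>\d+))?$")


def parse_period(label: str) -> Tuple[int, str, Optional[int]]:
    """Split a label such as 'April 2017, 2' into (year, month, week)."""
    match = LABEL.match(label)
    if match is None:
        raise ValueError(f"unexpected period label: {label!r}")
    week = match.group("week")
    return (
        int(match.group("year")),
        match.group("month"),
        int(week) if week else None,
    )


# Map each parsed period to the URL at the same index.
url_by_period = {parse_period(p): u for p, u in zip(periods, urls)}
print(url_by_period)
```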
-
classmethod
get_line_numbers_of_header_starts
(content: List[str]) → List[int]¶ By definition, LISTSERV logs separate new messages by a row of 73 equal signs.
- Parameters
content (The content of one LISTSERV file.) –
- Returns
List of line numbers where each header starts.
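The separator rule can be illustrated with a few lines of standard Python. This is a sketch, not the library's implementation: separator_line_numbers is a hypothetical helper that returns the 0-based index of each separator row, whereas the real method may report the line that follows it.

```python
from typing import List

SEPARATOR = "=" * 73  # the row of 73 equal signs described above


def separator_line_numbers(content: List[str]) -> List[int]:
    """Return the 0-based indices of lines made up of exactly 73 '=' signs."""
    return [
        i for i, line in enumerate(content)
        if line.rstrip("\r\n") == SEPARATOR
    ]


# Made-up file content holding two short messages.
content = [
    SEPARATOR + "\n",
    "Date: Mon, 2 Nov 2015 10:00:00 +0000\n",
    "Subject: first\n",
    "\n",
    "body one\n",
    SEPARATOR + "\n",
    "Subject: second\n",
    "\n",
    "body two\n",
]

starts = separator_line_numbers(content)
print(starts)
```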
-
classmethod
get_message_urls
(name: str, url: str, select: Optional[dict] = None) → List[str]¶ Docstring in AbstractMailList.
This routine is needed for Listserv 16.5
-
static
get_name_from_url
(url: str) → str¶ Get name of mailing list.
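In the example URLs on this page, the list name appears in the query string, either as the A0 parameter (list home page) or as the L parameter (single message). The sketch below extracts it with urllib.parse. That these two parameters always carry the name is inferred from those examples, and list_name_from_url is a hypothetical helper, not the library's code.

```python
from urllib.parse import parse_qs, urlsplit


def list_name_from_url(url: str) -> str:
    """Return the list name found in the 'A0' or 'L' query parameter."""
    query = parse_qs(urlsplit(url).query)
    for key in ("A0", "L"):  # assumed parameter names, see lead-in
        if key in query:
            return query[key][0]
    raise ValueError(f"no list name found in {url!r}")


home = "https://listserv.ieee.org/cgi-bin/wa?A0=IEEE-TEST"
message = (
    "https://list.etsi.org/scripts/wa.exe"
    "?A2=ind2010B&L=3GPP_TSG_RAN_DRAFTS&O=D&P=29883"
)
print(list_name_from_url(home))
print(list_name_from_url(message))
```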
-
classmethod
get_period_urls
(url: str, select: Optional[dict] = None) → List[str]¶ All messages within a certain period (e.g. January 2021, Week 5).
- Parameters
url (URL to the LISTSERV list.) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
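To make the period part of select concrete, the sketch below checks whether one period satisfies a select dictionary. The key names ("years", "months", "weeks", "fields") come from the from_url example on this page. The matching rules (exact equality, or membership when a list is given) are an assumption, and matches is a hypothetical helper; the library may accept richer forms.

```python
from typing import Optional


def matches(select: dict, year: int, month: str, week: Optional[int]) -> bool:
    """Return True if the period satisfies every period key present in select."""
    actual_values = {"years": year, "months": month, "weeks": week}
    for key, actual in actual_values.items():
        if key not in select:
            continue  # no constraint on this component
        allowed = select[key]
        if not isinstance(allowed, list):
            allowed = [allowed]  # treat a single value as a one-element list
        if actual not in allowed:
            return False
    return True


select = {"years": 2015, "months": "November", "weeks": 4, "fields": "header"}
print(matches(select, 2015, "November", 4))
print(matches(select, 2015, "October", 4))
```

The "fields" key is ignored here because it selects message content (header and/or body) rather than a period.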
-
class
bigbang.ingress.listserv.
ListservMailListDomain
(name: str, url: str, lists: List[Union[bigbang.ingress.abstract.AbstractMailList, str]])¶ Bases:
bigbang.ingress.abstract.AbstractMailListDomain
This class handles the scraping of all public Emails contained in a mail list domain that has the LISTSERV 16.5 and 17 format, such as 3GPP. To be more precise, each contributor to a mail list domain sends their message to an Email address that has the following structure: <mailing_list_name>@w3.org. Thus, this class contains all Emails sent to <mail_list_domain_name> (the Email domain name). These Emails are contained in a list of ListservMailList types, such that it is known to which <mailing_list_name> (the Email localpart) each was sent.
- Parameters
name (The mailing list domain name (e.g. 3GPP, IEEE, ..)) –
url (The URL where the mailing list domain lives) –
lists (A list containing the mailing lists as ListservMailList types) –
-
All methods in the `AbstractMailListDomain` class in addition to:
-
from_listserv_directory
()¶
-
get_sections
()¶
Example
To scrape a Listserv mailing list domain from a URL and store it in run-time memory, we do the following:
>>> mlistdom = ListservMailListDomain.from_url(
...     name="IEEE",
...     url_root="https://listserv.ieee.org/cgi-bin/wa?",
...     url_home="https://listserv.ieee.org/cgi-bin/wa?HOME",
...     select={
...         "years": 2015,
...         "months": "November",
...         "weeks": 4,
...         "fields": "header",
...     },
...     login={"username": <your_username>, "password": <your_password>},
...     instant_save=False,
...     only_mlist_urls=False,
... )
To save it as *.mbox files, we do the following:
>>> mlistdom.to_mbox(path_to_directory)
-
classmethod
from_listserv_directory
(name: str, directorypath: str, folderdsc: str = '*', filedsc: str = '*.LOG?????', select: Optional[dict] = None) → bigbang.ingress.listserv.ListservMailListDomain¶ This method is required if the files that contain the mail list domain messages were directly exported from LISTSERV 16.5 (e.g. by a member of 3GPP). Each mailing list has its own subdirectory and is split over multiple files with an extension starting with LOG and ending with five digits.
- Parameters
name (Mail list domain name, such that multiple instances of ListservMailListDomain can easily be distinguished.) –
directorypath (Where the ListservMailListDomain can be initialised.) –
folderdsc (A description of the relevant folders) –
filedsc (A description of the relevant files, e.g. *.LOG?????) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
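The expected layout, one subdirectory per mailing list with several LOG files inside, can be sketched with pathlib. The example builds a throwaway directory tree with made-up file names and then collects the files using the default folderdsc and filedsc patterns. Treating both as glob patterns is an assumption about how the method interprets them.

```python
import tempfile
from pathlib import Path

# Build a hypothetical export: one subdirectory per mailing list.
root = Path(tempfile.mkdtemp())
layout = {
    "3GPP_TSG_CT_WG1": ["3GPP_TSG_CT_WG1.LOG1405D", "3GPP_TSG_CT_WG1.LOG1406A"],
    "3GPP_TSG_SA_WG2_UPCON": ["3GPP_TSG_SA_WG2_UPCON.LOG1405D", "notes.txt"],
}
for list_name, file_names in layout.items():
    (root / list_name).mkdir()
    for file_name in file_names:
        (root / list_name / file_name).write_text("")

folderdsc, filedsc = "*", "*.LOG?????"  # defaults from the signature

# For every matching subdirectory, list the files that match filedsc.
files_by_list = {
    folder.name: sorted(p.name for p in folder.glob(filedsc))
    for folder in sorted(root.glob(folderdsc))
    if folder.is_dir()
}
print(files_by_list)
```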
-
classmethod
from_mailing_lists
(name: str, url_root: str, url_mailing_lists: Union[List[str], List[bigbang.ingress.listserv.ListservMailList]], select: Optional[dict] = {'fields': 'total'}, url_login: str = 'https://list.etsi.org/scripts/wa.exe?LOGON', url_pref: str = 'https://list.etsi.org/scripts/wa.exe?PREF', login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None, only_mlist_urls: bool = True, instant_save: Optional[bool] = True) → bigbang.ingress.listserv.ListservMailListDomain¶ Docstring in AbstractMailList.
-
classmethod
from_mbox
(name: str, directorypath: str, filedsc: str = '*.mbox') → bigbang.ingress.listserv.ListservMailList¶ Docstring in AbstractMailList.
-
classmethod
from_url
(name: str, url_root: str, url_home: str, select: Optional[dict] = {'fields': 'total'}, url_login: str = 'https://list.etsi.org/scripts/wa.exe?LOGON', url_pref: str = 'https://list.etsi.org/scripts/wa.exe?PREF', login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → bigbang.ingress.listserv.ListservMailListDomain¶ Docstring in AbstractMailList.
-
static
get_lists_from_url
(url_root: str, url_home: str, select: dict, session: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → List[Union[bigbang.ingress.listserv.ListservMailList, str]]¶ Docstring in AbstractMailList.
-
get_sections
(url_home: str) → int¶ Get different sections of mail list domain. On the Listserv 16.5 website they look like: [3GPP] [3GPP–AT1] [AT2–CONS] [CONS–EHEA] [EHEA–ERM_] … On the Listserv 17 website they look like: [<<][<]1-50(798)[>][>>]
- Returns
If sections exist, it returns their URLs and names. Otherwise it returns the url_home.
-
exception
bigbang.ingress.listserv.
ListservMailListDomainWarning
¶ Bases:
BaseException
Base class for ListservMailListDomain class specific exceptions
-
exception
bigbang.ingress.listserv.
ListservMailListWarning
¶ Bases:
BaseException
Base class for ListservMailList class specific exceptions
-
class
bigbang.ingress.listserv.
ListservMessageParser
(website=False, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None)¶ Bases:
bigbang.ingress.abstract.AbstractMessageParser
,email.parser.Parser
This class handles the creation of a mailbox.mboxMessage object (using the from_*() methods) and its storage in various other file formats (using the to_*() methods) that can be saved on the local memory.
- Parameters
website (Set 'True' if messages are going to be scraped from websites, otherwise 'False' if read from local memory.) –
url_login (URL to the 'Log In' page.) –
url_pref (URL to the 'Preferences'/settings page.) –
login (Login credentials (username and password) that were used to set up AuthSession. You can create your own for the 3GPP mail list domain.) –
session (requests.Session() object for the mail list domain website.) –
-
from_url
()¶
-
from_listserv_file
()¶
-
_get_header_from_html
()¶
-
_get_body_from_html
()¶
-
_get_header_from_listserv_file
()¶
-
_get_body_from_listserv_file
()¶
Example
To create an Email message parser object, use the following syntax:
>>> msg_parser = ListservMessageParser(
...     website=True,
...     login={"username": <your_username>, "password": <your_password>},
... )
To obtain the Email message content and return it as an mboxMessage object, you need to do the following:
>>> msg = msg_parser.from_url(
...     list_name="3GPP_TSG_RAN_DRAFTS",
...     url="https://list.etsi.org/scripts/wa.exe?A2=ind2010B&L=3GPP_TSG_RAN_DRAFTS&O=D&P=29883",
...     fields="total",
... )
-
empty_header
= {}¶
-
from_listserv_file
(list_name: str, file_path: str, header_start_line_nr: int, fields: str = 'total') → mailbox.mboxMessage¶ This method is required if the message is inside a file that was directly exported from LISTSERV 16.5 (e.g. by a member of 3GPP). Such files have an extension starting with LOG and ending with five digits.
- Parameters
list_name (The name of the LISTSERV Email list.) –
file_path (Path to file that contains the Email list.) –
header_start_line_nr (Line number in the file on which a new message starts.) –
fields (Indicates whether to return 'header', 'body' or 'total'/both of the Email.) –
-
exception
bigbang.ingress.listserv.
ListservMessageParserWarning
¶ Bases:
BaseException
Base class for ListservMessageParser class specific exceptions