analysis.listserv¶
-
class
bigbang.analysis.listserv.
ListservMailList
(name: str, filepath: str, msgs: pandas.core.frame.DataFrame)¶ Bases:
object
Note
Loading 3GPP_TSG_RAN_WG1, which is 3.3 GB in size, is known to cause issues.
-
__iter__
()¶ Iterate over each message within the mailing list.
-
__len__
() → int¶ Get the number of messages within the mailing list.
-
add_thread_info
()¶ Add an extra column to the pd.DataFrame that identifies which thread each message belongs to.
-
add_weight_to_edge
(dic: dict, key1: str, key2: str) → dict¶ - Parameters
dic –
key1 –
key2 –
-
static
contract
(count: numpy.array, label: list, contract: float) → Dict[str, int]¶ Contract all domain names whose contribution to a mailing list falls below the contract threshold into one entity called Others. For example, if contract=3 and nokia.com, nokia.com, t-mobile.at each wrote fewer than three Emails to the mailing list in question, their contributions are summed into one entity denoted as Others.
- Parameters
count (Number of Emails sent to the mailing list.) –
label (Names of contributors to the mailing list.) –
contract (Threshold below which all contributions will be summed.) –
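A minimal sketch of the contraction described above, written in plain Python instead of NumPy. The helper name `contract_counts` and the sample counts are hypothetical, and the sketch assumes that "below" means strictly less than the threshold; it is not the library's implementation.

```python
from typing import Dict, List


def contract_counts(count: List[int], label: List[str], contract: float) -> Dict[str, int]:
    """Sum all contributions below `contract` into a single 'Others' entry."""
    contracted: Dict[str, int] = {}
    others = 0
    for n, name in zip(count, label):
        if n < contract:
            others += n  # assumption: strictly below the threshold
        else:
            contracted[name] = n
    if others:
        contracted["Others"] = others
    return contracted


# Hypothetical sample data: two domains fall below the threshold of 3.
result = contract_counts([10, 2, 1], ["etsi.org", "nokia.com", "t-mobile.at"], contract=3)
print(result)
```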
-
create_sender_receiver_digraph
(nw: Optional[dict] = None, entity_in_focus: Optional[list] = None, node_attributes: Optional[Dict[str, list]] = None)¶ Create directed graph from messaging network created with ListservMailList.get_sender_receiver_dict().
- Parameters
nw (dictionary created with self.get_sender_receiver_dict()) –
entity_in_focus (This can be a list of domain names or localparts. If) – such a list is provided, the created di-graph will only focus on their relations.
-
crop_by_address
(header_field: str, per_address_field: Dict[str, List[str]]) → bigbang.analysis.listserv.ListservMailList¶ - Parameters
header_field (For a Listserv mailing list the most representative) – header fields for senders and receivers are ‘from’ and ‘comments-to’ respectively.
per_address_field (Filter by 'local-part' or 'domain' part of an address.) –
- The data structure of the argument should be, e.g.:
{‘localpart’: [string-1, string-2, …]}
- Returns
- Return type
ListservMailList object cropped to specification.
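The filtering idea behind per_address_field can be sketched with the standard library alone. The helper `address_matches`, the sample header values, and the choice to keep a message when any listed value matches are assumptions made for illustration; the library may combine the criteria differently.

```python
from email.utils import parseaddr
from typing import Dict, List


def address_matches(address: str, per_address_field: Dict[str, List[str]]) -> bool:
    """Return True if the address matches any listed local-part or domain."""
    _, addr = parseaddr(address)
    localpart, _, domain = addr.lower().partition("@")
    parts = {"localpart": localpart, "domain": domain}
    # Assumption: one matching value under any key is enough to keep the message.
    return any(parts.get(key) in values for key, values in per_address_field.items())


# Hypothetical 'from' header values.
senders = [
    "Joern Krause <joern.krause@etsi.org>",
    "juergen.hofmann@nokia.com",
]
kept = [s for s in senders if address_matches(s, {"domain": ["nokia.com"]})]
print(kept)
```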
-
crop_by_subject
(match=<class 'str'>, place: int = 2) → bigbang.analysis.listserv.ListservMailList¶ - Parameters
match (Only keep messages with subject lines containing match string.) –
place (Define how to filter for match. Use one of the following methods:) – 0 = Using Regex expression 1 = String ends with match 2 =
- Returns
- Return type
ListservMailList object cropped to message subject.
-
crop_by_year
(yrs: Union[int, list]) → bigbang.analysis.listserv.ListservMailList¶ Filter self.df DataFrame by year in message date.
- Parameters
yrs (Specify a specific year, such as 2021, or a range of years, such) – as [2011, 2021].
- Returns
- Return type
ListservMailList object cropped to specification.
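The two accepted forms of yrs can be illustrated with a small predicate. The function name `year_matches` is hypothetical, and treating the two-element list as an inclusive range is an assumption not confirmed by the text above.

```python
from typing import List, Union


def year_matches(year: int, yrs: Union[int, List[int]]) -> bool:
    """Check a message year against a single year or a [start, end] range."""
    if isinstance(yrs, int):
        return year == yrs
    start, end = yrs
    return start <= year <= end  # assumption: both bounds are inclusive


# Hypothetical message years.
message_years = [2009, 2011, 2015, 2021, 2022]
print([y for y in message_years if year_matches(y, 2021)])
print([y for y in message_years if year_matches(y, [2011, 2021])])
```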
-
crop_dic_to_entity_in_focus
(dic: dict, entity_in_focus: list) → dict¶ - Parameters
entity_in_focus (This can be a list of domain names or localparts.) –
-
classmethod
from_mbox
(name: str, filepath: str, include_body: bool = True) → bigbang.analysis.listserv.ListservMailList¶
-
classmethod
from_pandas_dataframe
(df: pandas.core.frame.DataFrame, name: Optional[str] = None, filepath: Optional[str] = None) → bigbang.analysis.listserv.ListservMailList¶
-
get_domains
(header_fields: List[str], return_msg_counts: bool = False, df: Optional[pandas.core.frame.DataFrame] = None) → dict¶ Get contribution of members per affiliation.
Note
For a Listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.
- Parameters
header_fields (Indicate which Email header field to process) – (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.
return_msg_counts (If 'True', return # of messages per domain.) –
-
get_domainscount
(header_fields: List[str], per_year: bool = False) → dict¶ - Parameters
header_fields (Indicate which Email header field to process) – (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.
per_year (Aggregate results for each year.) –
-
get_graph_prop_per_domain_per_year
(years: Optional[tuple] = None, func=<function betweenness_centrality>, **args) → dict¶ - Parameters
years –
func –
-
get_localparts
(header_fields: List[str], per_domain: bool = False, return_msg_counts: bool = False, df: Optional[pandas.core.frame.DataFrame] = None) → dict¶ Get contribution of members per affiliation.
- Parameters
header_fields (Indicate which Email header field to process) – (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.
per_domain –
return_msg_counts (If 'True', return # of messages per localpart.) –
-
get_localpartscount
(header_fields: List[str], per_domain: bool = False, per_year: bool = False) → dict¶ - Parameters
header_fields (Indicate which Email header field to process) – (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.
per_domain (Aggregate results for each domain.) –
per_year (Aggregate results for each year.) –
-
get_messagescount
(header_fields: Optional[List[str]] = None, per_address_field: Optional[str] = None, per_year: bool = False) → dict¶ - Parameters
header_fields (Indicate which Email header field to process) – (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.
per_year (Aggregate results for each year.) –
-
get_messagescount_per_timezone
(percentage: bool = False) → Dict[str, int]¶ Get contribution of messages per time zone.
- Parameters
percentage (Whether to return the message counts as a percentage of the total.) –
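As a rough illustration of counting messages per time zone, the sketch below groups date strings by their UTC offset. The function name, the sample dates, and the use of the offset string (such as '+0100') as dictionary key are assumptions; the library may label time zones differently.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List

DATE_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def messages_per_timezone(dates: List[str], percentage: bool = False) -> Dict[str, float]:
    """Count messages per UTC offset, optionally as a percentage of the total."""
    counts = Counter(datetime.strptime(d, DATE_FORMAT).strftime("%z") for d in dates)
    if not percentage:
        return dict(counts)
    total = sum(counts.values())
    return {tz: 100 * n / total for tz, n in counts.items()}


# Hypothetical message dates.
dates = [
    "Thu, 26 Mar 2020 21:41:27 +0000",
    "Thu, 26 Mar 2020 21:00:08 +0000",
    "Wed, 01 Apr 2020 10:08:58 +0100",
    "Wed, 01 Apr 2020 18:30:00 +0800",
]
print(messages_per_timezone(dates))
print(messages_per_timezone(dates, percentage=True))
```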
-
static
get_name_localpart_domain
(string: str) → tuple¶ Split an address field, which ideally has a format such as ‘Heinrich von Kleist <Heinrich.vonKleist@SELBST.org>’, into name, local-part, and domain. All strings are returned in lower case to avoid duplicates.
Note
Test whether the incorporation of email.utils.parseaddr() can improve this function.
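One way to explore the idea in the note is sketched below using email.utils.parseaddr() from the standard library. The helper name `split_address` is hypothetical, and this is an illustration of the approach, not the function's current implementation.

```python
from email.utils import parseaddr
from typing import Tuple


def split_address(field: str) -> Tuple[str, str, str]:
    """Split an address field into lower-cased name, local-part, and domain."""
    name, addr = parseaddr(field)
    localpart, _, domain = addr.partition("@")
    return name.lower(), localpart.lower(), domain.lower()


parts = split_address("Heinrich von Kleist <Heinrich.vonKleist@SELBST.org>")
print(parts)
```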
-
get_sender_receiver_dict
(address_field: str = 'domain', entity_in_focus: Optional[list] = None, df: Optional[pandas.core.frame.DataFrame] = None) → Dict¶ - Parameters
address_field –
entity_in_focus (This can be a list of domain names or localparts. If such) – a list is provided, the created dictionary will only contain their information.
- Returns
Nested dictionary with first layer the ‘from’ domain keys and
the second layer the ‘comments-to’ domain keys with the
integer indicating the number of messages between them.
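The nested structure described under Returns can be sketched as follows. The function name and the sample sender/receiver pairs are hypothetical; the sketch only shows the shape of the result, not how the library extracts the pairs from message headers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sender_receiver_counts(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count messages for each (sender domain, receiver domain) pair."""
    nested = defaultdict(lambda: defaultdict(int))
    for sender, receiver in pairs:
        nested[sender][receiver] += 1
    # Convert to plain dictionaries for a predictable return type.
    return {sender: dict(receivers) for sender, receivers in nested.items()}


# Hypothetical ('from' domain, 'comments-to' domain) pairs, one per message.
messages = [
    ("etsi.org", "nokia.com"),
    ("etsi.org", "nokia.com"),
    ("nokia.com", "etsi.org"),
]
network = sender_receiver_counts(messages)
print(network)
```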
-
get_threads
(return_length: bool = False) → dict¶ Collect all messages that belong to the same thread.
Note
Computationally very intensive.
- Parameters
return_length – If ‘True’, the returned dictionary will be of the form {‘subject1’: # of messages, ‘subject2’: # of messages, …}. If ‘False’, the returned dictionary will be of the form {‘subject1’: list of indices, ‘subject2’: list of indices, …}.
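The two return shapes can be compared side by side in a small sketch. The sample subjects are hypothetical, and grouping by stripping a leading 'Re: ' is a simplification chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical subject lines; the list position stands in for the message index.
subjects = ["Agenda", "Re: Agenda", "Minutes", "Re: Agenda"]

grouped = defaultdict(list)
for index, subject in enumerate(subjects):
    # Simplification: only a leading 'Re: ' marks a reply.
    root = subject[4:] if subject.lower().startswith("re: ") else subject
    grouped[root].append(index)

threads = dict(grouped)  # shape when return_length is False
thread_lengths = {root: len(indices) for root, indices in threads.items()}  # shape when True
print(threads)
print(thread_lengths)
```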
-
get_threadsroot
(per_address_field: Optional[str] = None, df: Optional[pandas.core.frame.DataFrame] = None) → dict¶ Find all unique message subjects. Replies are not treated as new subjects.
Note
The most reliable way to find the beginning of a thread is to check whether the subject line of a message starts with an element of reply_labels. Checking whether the header field ‘comments-to’ is empty is not reliable, as ‘reply-all’ is often chosen by mistake, as seen here:
2020-04-01 10:08:58+00:00 joern.krause@etsi.org, juergen.hofmann@nokia.com
2020-03-26 21:41:27+00:00 joern.krause@etsi.org NaN
2020-03-26 21:00:08+00:00 joern.krause@etsi.org juergen.hofmann@nokia.com
- Some Emails start with ‘AW:’, which comes from German and has
the same meaning as ‘Re:’.
Some Emails start with ‘=?utf-8?b?J+WbnuWkjTo=?=’ or ‘=?utf-8?b?J+etlOWkjTo=?=’, which are UTF-8 encodings of the Chinese characters ‘回复’ and ‘答复’ both of which have the same meaning as ‘Re:’.
Leading strings such as ‘FW:’ are treated as new subjects.
- Parameters
per_address_field –
- Returns
A dictionary of the form {‘subject1’: index of message, ‘subject2’: …}
is returned. If per_address_field is specified, the subjects are sorted
into the domain or localpart from which they originate.
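The reply check described in the note can be sketched with the standard library's header decoding. The contents of `REPLY_LABELS` and the helper `is_reply` are assumptions derived from the labels mentioned above; the library's own reply_labels may differ.

```python
from email.header import decode_header, make_header

# Assumption: reply labels taken from the note above, in lower case.
REPLY_LABELS = ("re:", "aw:", "回复:", "答复:")


def is_reply(subject: str) -> bool:
    """Return True if the decoded subject starts with a known reply label."""
    decoded = str(make_header(decode_header(subject)))
    # Drop leading whitespace and stray quote characters before comparing.
    cleaned = decoded.strip().lstrip("'\"").lower()
    return cleaned.startswith(REPLY_LABELS)


for subject in ["Agenda", "Re: Agenda", "AW: Agenda", "FW: Agenda",
                "=?utf-8?b?J+WbnuWkjTo=?= Agenda"]:
    print(repr(subject), is_reply(subject))
```

Subjects for which `is_reply` returns False, including those starting with ‘FW:’, would count as thread roots under this sketch.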
-
get_threadsrootcount
(per_address_field: Optional[str] = None, per_year: bool = False) → Union[int, dict]¶ Identify the number of conversation threads in the mailing list.
- Parameters
per_address_field (Aggregate results for each address field, which can) – be, e.g., from, send-to, received-by.
per_year (Aggregate results for each year.) –
-
static
iterator_name_localpart_domain
(li: list) → tuple¶ Generator for the self.get_name_localpart_domain() function.
-
period_of_activity
(format: str = '%a, %d %b %Y %H:%M:%S %z') → list¶ Return a list containing the datetime of the first and last message written in the mailing list.
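A minimal sketch of finding the first and last message date, assuming the format string describes how the raw date strings are parsed. The sample dates are hypothetical, and the library may return the result in a different form.

```python
from datetime import datetime

DATE_FORMAT = "%a, %d %b %Y %H:%M:%S %z"

# Hypothetical raw 'date' header values.
raw_dates = [
    "Thu, 26 Mar 2020 21:41:27 +0000",
    "Wed, 01 Apr 2020 10:08:58 +0000",
    "Thu, 26 Mar 2020 21:00:08 +0000",
]

parsed = [datetime.strptime(d, DATE_FORMAT) for d in raw_dates]
period = [min(parsed), max(parsed)]  # first and last message
print([p.isoformat() for p in period])
```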
-
static
to_percentage
(arr: numpy.array) → numpy.array¶
-
-
class
bigbang.analysis.listserv.
ListservMailListDomain
(name: str, filedsc: str, lists: pandas.core.frame.DataFrame)¶ Bases:
object
- Parameters
name – The name of the organisation to which the mail list domain belongs (e.g. 3GPP, IEEE, …)
filedsc – The file description of the mail list domain
lists – A list containing the mailing lists as ListservMailList types
-
classmethod
from_mbox
(name: str, directorypath: str, filedsc: str = '*.mbox') → bigbang.analysis.listserv.ListservMailListDomain¶
-
get_mlistscount_per_institution
() → Dict[str, int]¶ Get a dictionary that lists the mailing lists/working groups in which an institute/company is active.
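Given the Dict[str, int] return type, one plausible reading is a count of mailing lists per institution, sketched below. The mailing list names, the domains, and the counting logic are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mapping from mailing list name to the domains active in it.
domains_per_list = {
    "RAN_WG1": {"nokia.com", "etsi.org"},
    "RAN_WG2": {"nokia.com"},
}

# Count in how many mailing lists each domain appears.
mlists_per_institution = dict(
    Counter(domain for domains in domains_per_list.values() for domain in domains)
)
print(mlists_per_institution)
```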
-
exception
bigbang.analysis.listserv.
ListservMailListDomainWarning
¶ Bases:
BaseException
Base class for ListservMailListDomain class specific exceptions
-
exception
bigbang.analysis.listserv.
ListservMailListWarning
¶ Bases:
BaseException
Base class for ListservMailList class specific exceptions