Welcome to BigBang’s documentation!¶
BigBang is a toolkit for studying processes of open collaboration and deliberation, especially with respect to the production of digital infrastructures, to make them more transparent and accountable. This is achieved by utilising public communication channels and documents to reveal which actors are leading, following, or left out. It enables the analysis and visualisation of relationships, discourses, time series, and knowledge networks.
BigBang is also a community of researchers sharing code and best practices. BigBang comes with a large directory of examples showcasing what kinds of questions can be answered through the communications data. These examples are designed as tutorials and support data science and research pedagogy.
Motivation¶
BigBang was originally designed to study communities developing open source scientific software. It was quickly thereafter adopted by researchers studying Internet and network standards-setting. Organizations such as the IETF conduct much of their technical work via open mailing lists and document records. There is tremendous opportunity to learn from these data sources. BigBang codifies the processes of collecting and cleaning that data, making it available to researchers.
More broadly, BigBang is designed to give researchers and students more transparent insight into the sociotechnical governance and infrastructure processes that shape their world. BigBang is a telescope designed to study the originating singularities that gave rise to the Internet, scientific computation, and other pivotal technological developments.
License¶
BigBang is open source software. It is released under the MIT license.
Installation¶
conda¶
You can use Anaconda. This will also install the conda package management system, which you can use to complete the installation.
Install Anaconda with Python version 3.*. If you choose not to use Anaconda, you may run into versioning issues with Python. Add the conda installation directory to your path during installation.
You also need to have Git and pip (for Python 3) installed.
Run the following commands:
git clone https://github.com/datactive/bigbang.git
cd bigbang
bash conda-setup.sh
python3 setup.py develop --user
pip¶
git clone https://github.com/datactive/bigbang.git
# optionally create a new virtualenv here
pip3 install -r requirements.txt
python3 setup.py develop --user
Video Tutorial¶
If you have problems installing, you might want to have a look at the video tutorial below (clicking on the image will take you to YouTube).
Datasets¶
This section outlines how various public mailing lists can be scraped from the web and stored to disk for further processing. Currently, the BigBang repository does not contain personally identifiable information of any kind. The datasets included in BigBang pertain to organizational entities and provide ancillary data useful in preprocessing and analysis of those entities. As the mailing-list archives are large and time-consuming to scrape from the web, we are working on a GDPR-compliant method to share the datasets with other researchers.
Mailinglists¶
Below we describe how the public mailing lists of each of the Internet standards-developing organisations can be scraped from the web. Some mailing lists reach back to 1998 and are multiple GBs in size. Therefore, it can take a considerable amount of time to scrape an entire mailing list. This process cannot be sped up, since doing so would amount to a DDoS attack on the hosting servers. So be prepared to leave your machine running over (multiple) night(s).
IETF¶
To scrape the public mailing lists of the Internet Engineering Task Force (IETF), there are two options, outlined below.
Public Mailman Web Archive¶
BigBang comes with a script for collecting files from public Mailman web archives. An example of this is the scipy-dev mailing list page. To collect the archives of the scipy-dev mailing list, run the following command from the root directory of this repository:
python3 bin/collect_mail.py -u http://mail.python.org/pipermail/scipy-dev/
You can also give this command a file with several urls, one per line. One of these is provided in the examples/ directory.
python3 bin/collect_mail.py -f examples/urls.txt
Once the data has been collected, BigBang has functions to support analysis.
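For example, a minimal sketch of loading a collected archive for analysis, assuming the scipy-dev data ended up under the default archive directory (see the archive module in the Reference section below):
from bigbang.archive import Archive, load_data

data = load_data("scipy-dev")      # attempts {archives-directory}/scipy-dev.csv
archive = Archive(data)
activity = archive.get_activity()  # activity matrix: senders per ordinal date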
Datatracker¶
BigBang can also be used to analyze data from IETF RFC drafts. It does this using the Glasgow IPL group’s ietfdata tool.
The script takes one argument, the working group acronym:
python3 bin/collect_draft_metadata.py -w httpbis
W3C¶
The World Wide Web Consortium (W3C) mailing archive is managed using the Hypermail software and is hosted at:
https://lists.w3.org/Archives/Public/
There are two ways to scrape public mailing lists from that domain. First, one can write their own Python script containing a variation of:
from bigbang.ingress import W3CMailList

mlist = W3CMailList.from_url(
    name="public-testtwf",
    url="https://lists.w3.org/Archives/Public/public-testtwf/",
    select={"years": 2014, "fields": "header"},
)
mlist.to_mbox(path_to_file)
Or one can use the command line script and a file containing all mailing-list URLs one wants to scrape:
python bin/collect_mail.py -f examples/url_collections/W3C.txt
3GPP¶
The 3rd Generation Partnership Project (3GPP) mailing archive is managed using the LISTSERV software and is hosted at:
https://list.etsi.org/scripts/wa.exe?HOME
In order to successfully scrape all public mailing lists, one needs to create an account here: https://list.etsi.org/scripts/wa.exe?GETPW1=&X=&Y=
There are two ways to scrape public mailing lists from that domain. First, one can write their own Python script containing a variation of:
from bigbang.ingress import ListservMailList

mlist = ListservMailList.from_url(
    name="3GPP_TSG_SA_WG2_EMEET",
    url="https://list.etsi.org/scripts/wa.exe?A0=3GPP_TSG_SA_WG2_EMEET",
    select={"fields": "header"},
    url_login="https://list.etsi.org/scripts/wa.exe?LOGON=INDEX",
    url_pref="https://list.etsi.org/scripts/wa.exe?PREF",
    login=auth_key,
)
mlist.to_mbox(path_to_file)
Or one can use the command line script and a file containing all mailing-list URLs one wants to scrape:
python bin/collect_mail.py -f examples/url_collections/listserv.3GPP.txt
IEEE¶
The Institute of Electrical and Electronics Engineers (IEEE) mailing archive is managed using the LISTSERV software and is hosted at:
https://listserv.ieee.org/cgi-bin/wa?INDEX
There are two ways to scrape public mailing lists from that domain. First, one can write their own Python script containing a variation of:
from bigbang.ingress import ListservMailList

mlist = ListservMailList.from_url(
    name="IEEE-TEST",
    url="https://listserv.ieee.org/cgi-bin/wa?A0=IEEE-TEST",
    select={"fields": "header"},
    url_login="https://listserv.ieee.org/cgi-bin/wa?LOGON",
    url_pref="https://listserv.ieee.org/cgi-bin/wa?PREF",
    login=auth_key,
)
mlist.to_mbox(path_to_file)
Or one can use the command line script and a file containing all mailing-list URLs one wants to scrape:
python bin/collect_mail.py -f examples/url_collections/listserv.IEEE.txt
Ancillary Datasets¶
In addition to providing tools for gathering data from public sources, BigBang also includes some datasets that have been curated by contributors and researchers.
General¶
Email domain categories¶
BigBang comes with a partial list of email domains, categorized as:
Generic. A domain associated with a generic email provider. E.g. gmail.com
Personal. A domain associated with a single individual. E.g. csperkins.org
Company. A domain associated with a particular company. E.g. apple.com
Academic. A domain associated with a university or academic professional organization. E.g. mit.edu
SDO. A domain associated with a Standards Development Organization. E.g. ietf.org
This data can be loaded as a Pandas DataFrame, with email domains as the index and categories in the category column, with the following code:
import bigbang.datasets.domains as domains
domain_data = domains.load_data()
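Continuing from domain_data above, one can, for example, look up the category of a single domain (an illustrative query; gmail.com is listed under the generic category above):
# index: email domain; column "category" holds the label
domain_data.loc["gmail.com"]["category"]  # -> "generic"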
The sources of this data are a hand-curated list of domains provided by BigBang contributors and a list of generic email domain providers provided by this public gist.
Organization Metadata¶
BigBang comes with a curated list of metadata about organizations. This data is provided as a DataFrame with the following columns:
name. Organization name. E.g. gmail.com
Category. Kind of organization. E.g. Infrastructure Company
subsidiary. This column describes when a company is the subsidiary of another company in the list. If the cell in this column is empty, this company can be understood as the parent company. E.g. apple.com
stakeholdergroup. Stakeholder groups as they have been defined in the WSIS process and the Tunis Agenda.
nationality. The country in which the stakeholder or subsidiary is registered.
email domain names. Email domains associated with the organization. May include multiple, comma-separated domain names.
Membership Organization. Membership of regional SDOs, derived from 3GPP data.
This data can be loaded as a Pandas DataFrame with the following code:
import bigbang.datasets.organizations as organizations
organization_data = organizations.load_data()
IETF¶
Publication date of protocols.
3GPP¶
Release dates of standards.
Data Source - Git¶
After the git repositories have been cloned locally, you will be able to start analyzing them. To do this, you will need a GitRepo object, a convenient wrapper that does the work of extracting and generating git information and storing it internally in a pandas dataframe. You can then use this GitRepo object’s methods to gain access to the large pandas dataframe.
There are many ways to generate a GitRepo object for a repository, using RepoLoader:
Bash scripts (in the bigbang directory):
single url
python bin/collect_git.py -u https://github.com/scipy/scipy.git
file of urls
python bin/collect_git.py -f examples/git_urls.txt
Github organization name
python bin/collect_git.py -g glass-bead-labs
Single Repo:
remote
get_repo("https://github.com/sbenthall/bigbang.git", in_type="remote")
local
get_repo("~/urap/bigbang/archives/sample_git_repos/bigbang", in_type="local")
name
get_repo("bigbang", in_type="name")
Multiple Repos:
With repo names:
get_multi_repo(repo_names=["bigbang", "django"])
With repo objects:
get_multi_repo(repos=[repo1, repo2])  # a list of existing GitRepo objects
With Github Organization names:
get_org_multirepo("glass-bead-labs")
Repo Locations¶
As of now, repos are cloned into archives/sample_git_repos/{repo_name}. Their caches are stored at archives/sample_git_repos/{repo_name}_backup.csv.
Caches¶
Caches are stored at archives/sample_git_repos/{repo_name}_backup.csv. They are the dumped .csv files of a GitRepo object’s commit_data attribute, which is a pandas dataframe of all commit information. We can initialize a GitRepo object by feeding the cache’s Pandas dataframe into the GitRepo init function. However, the init function will need to do some processing before it can use the cache as its commit data. It needs to convert the "Touched File" attribute of the cache dataframe from the string "[file1, file2, file3]" to an actual list ["file1", "file2", "file3"]. It will also need to convert the time index of the cache from string to datetime.
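As an illustration, a minimal sketch of that post-processing, assuming the cache layout described above (the exact parsing inside BigBang may differ):
import pandas as pd

cache = pd.read_csv("archives/sample_git_repos/bigbang_backup.csv", index_col=0)
# "[file1, file2, file3]" -> ["file1", "file2", "file3"]
cache["Touched File"] = cache["Touched File"].apply(
    lambda s: [f.strip() for f in s.strip("[]").split(",")]
)
# string timestamps -> datetime index
cache.index = pd.to_datetime(cache.index)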
Bash Scripts¶
Run the following commands while in the bigbang directory. The repo information will go into the default repo location.
python bin/collect_git.py -u https://github.com/scipy/scipy.git
You can also give this command a file with several urls, one per line. One of these is provided in the examples/ directory.
python bin/collect_git.py -f examples/git_urls.txt
This command will load all of the repos of a github organization. Make sure that the name is exactly as it appears on Github.
python bin/collect_git.py -g glass-bead-labs
Single Repos¶
Here, we can load a repo in three ways: with a github url, a local path to a repo, or the name of a repo. All of these return a GitRepo object. Here is an example, with explanations below.
from bigbang import repo_loader  # the module that handles most loading

repo = repo_loader.get_repo("https://github.com/sbenthall/bigbang.git", in_type="remote")
# repo = repo_loader.get_repo("../", in_type="local")  # commented out because it may take too long
repo = repo_loader.get_repo("bigbang", in_type="name")
repo.commit_data  # the pandas dataframe of commit data
Remote¶
A remote call to get_repo will extract the repo’s name from its git url. Thus, https://github.com/sbenthall/bigbang.git will yield bigbang as its name. It will check whether the repo already exists; if it doesn’t, it will send a shell command to clone the remote repository to archives/sample_git_repos/{repo_name}. It will then return get_repo({name}, in_type="name"). Before returning, however, it will cache the GitRepo object at archives/sample_git_repos/{repo_name}_backup.csv to make loading faster the next time.
Local¶
A local call is the simplest. It will first extract the repo name from the filepath. Thus, ~/urap/bigbang/archives/sample_git_repos/bigbang will yield bigbang. It will check to see if a git repo exists at the given address. If it does, it will initialize a GitPython object, which only needs a name and a filepath to a Git repo. Note that this option does not check or create a cache.
Name¶
This is the preferred and easiest way to load a git repository. It works under the assumptions above about where a git repo and its cache should be stored. It will check to see if a cache exists. If it does, then it will load a GitPython object using that cache.
If a cache is not found, then the function constructs a filepath from the name, using the above rule about where repos are located. It will pass the call off to get_repo(filepath, in_type="local"). Before returning the answer, it will cache the result.
MultiRepos¶
These are the ways we can get MultiGitRepo objects. MultiGitRepo objects are GitRepos that were created with a list of GitRepos. Basically, a MultiGitRepo’s commit_data contains the commit_data from all of its GitRepos. The only difference is that each entry has an extra attribute, Repo Name, that tells us which repo that commit is initially from. Here are some examples, with explanations below. Note that the examples below will not work if you don’t have an internet connection, and they may take some time to process. The first call may also fail if you do not have all of the repositories.
from bigbang import repo_loader  # the module that handles most loading

# using the GitHub API
multirepo = repo_loader.get_org_multirepo("glass-bead-labs")

# list of repo names
multirepo = repo_loader.get_multi_repo(repo_names=["bigbang", "bead.glass"])

# list of actual repos
repo1 = repo_loader.get_repo("bigbang", in_type="name")
repo2 = repo_loader.get_repo("bead.glass", in_type="name")
multirepo = repo_loader.get_multi_repo(repos=[repo1, repo2])

multirepo.commit_data  # the pandas dataframe of commit data
List of Repos / List of Repo Names (get_multi_repo)¶
This is rather simple. We can call the get_multi_repo method with either a list of repo names ["bigbang", "django", "scipy"] or a list of actual GitRepo objects. This returns the merged MultiGitRepo. Please note that this will not work if a local clone / cache does not exist for every repo name (e.g. if you ask for ["bigbang", "django", "scipy"], you must already have a local copy of those in your sample_git_repos directory).
Github Organization’s Repos (get_org_multirepo)¶
This is more useful to us. We can use this method to get a MultiGitRepo that contains the information from every repo in a Github organization. This requires that we input the organization’s name exactly as it appears on Github (edX, glass-bead-labs, codeforamerica, etc.).
It will look for examples/{org_name}_urls.txt, which should be a file that contains all of the git urls of the projects that belong to that organization. If this file doesn’t yet exist, it will make a call to the Github API. This requires a stable internet connection, and it may randomly stall on requests that do not time out.
The function will then use the list of git urls and the get_repo method to get each repo. It will use this list of repos to create a MultiGitRepo object, using get_multi_repo.
Analysis¶
3GPP¶
This page introduces a collection of simple functions with which one can gain a comprehensive overview of 3GPP mailing lists ingressed using bigbang/ingress/listserv.py. Without extensive editing, these functions should also be applicable to IETF, ICANN, W3C, and IEEE mailing lists; however, this has not been tested yet.
To start, a ListservList class instance needs to be created using either .from_mbox() or .from_pandas_dataframe(). Using the former as an example:
from bigbang.analysis.listserv import ListservList

mlist_name = "3GPP_TSG_CT_WG1_122E_5G"
mlist = ListservList.from_mbox(
    name=mlist_name,
    filepath=f"/path/to/{mlist_name}.mbox",
    include_body=True,
)
The function argument include_body is True by default, but if one has to work with a large quantity of Emails, it might be necessary to set it to False to avoid out-of-memory errors.
Cropping of mailinglist¶
If one is interested in specific subgroups contained in a mailinglist, then the ListservList class instance can be cropped using the following functions:
# select Emails sent in a specific year
mlist.crop_by_year(yrs=[2011])
# select Emails sent within a period
mlist.crop_by_year(yrs=[2011, 2021])
# select Emails sent from specified addresses
mlist.crop_by_address(
    header_field='from',
    per_address_field={'domain': ['t-mobile.at', 'nokia.com']}
)
# select Emails containing a string in the subject
mlist.crop_by_subject(match='OpenPGP')
In the crop_by_address example, the function has a per_address_field argument. This argument is a dictionary in which the top-level keys can be localpart and domain, where the former is the part of an Email address that stands in front of the @ and the latter the part that comes after it. Thus, for Heinrich.vonKleist@selbst.org, the localpart is Heinrich.vonKleist and the domain is selbst.org.
Who is sending/receiving?¶
To gain insight into which actors are involved in a mailing list, a ListservList class instance can return the unique email domains and the unique email localparts per domain, for multiple header fields:
mlist.get_domains(header_fields=['from', 'reply-to'])
mlist.get_localparts(header_fields=['from', 'reply-to'])
This will return a dictionary in which each key (both ‘from’ and ‘reply-to’) contains a list of all domains. If one wants to see not just who contributes, but also how much, change the default argument return_msg_counts=False to True:
mlist.get_domains(header_fields=['from', 'reply-to'], return_msg_counts=True)
Alternatively, one can also get the number of Emails sent or received by a certain address via:
mlist.get_messagescount(
    header_fields=['from', 'reply-to'],
    per_address_field={
        'domain': ['t-mobile.at', 'nokia.com'],
        'localpart': ['ian.hacking', 'victor.klemperer'],
    }
)
Communication Network¶
For a more in-depth view into who is sending (receiving) to (from) whom in a mailing list, one can create a directed communication graph as follows:
mlist.create_sender_receiver_digraph()
This will create a new networkx.DiGraph() instance attribute for mlist, which can be used to perform a number of standard calculations using the networkx Python package:
import networkx as nx
nx.betweenness_centrality(mlist.dg, weight="weight")
nx.closeness_centrality(mlist.dg)
nx.degree_centrality(mlist.dg)
Time-series¶
To study, e.g., the continuity of an actor’s contributions to a mailing list, many functions have an optional per_year boolean argument.
To simply find out during which period Emails were sent to a mailing list, one can call mlist.period_of_activity().
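For example (get_localpartscount() is shown in the Visualisation section below; its per_year argument resolves the result by year):
# period during which Emails were sent to the list
mlist.period_of_activity()

# number of unique localparts per domain, resolved by year
mlist.get_localpartscount(
    header_fields=['from'],
    per_domain=True,
    per_year=True,
)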
Networks¶
Documentation for the analysis and preprocessing scripts of BigBang.
Timeseries¶
Documentation for the analysis and preprocessing scripts of BigBang.
Visualisation¶
Lines¶
To help visualise, e.g., time-series data obtained through 3GPP, we provide a number of support functions. If, for example, one has executed the mlist.get_localpartscount() command with per_year=True, one can use lines.evolution_of_participation_1D() to visualise how the number of localparts changed over time for each domain, which is related to the number of participants belonging to each organisation:
import matplotlib.pyplot as plt
from bigbang.analysis.listserv import ListservMailList
from bigbang.visualisation import lines

mlist_name = "3GPP_TSG_SA_WG3_LI"
filepath = f"/home/christovis/InternetGov/bigbang-archives/3GPP/{mlist_name}.mbox"
mlist = ListservMailList.from_mbox(
    name=mlist_name,
    filepath=filepath,
)
dic = mlist.get_localpartscount(
    header_fields=['from'],
    per_domain=True,
    per_year=True,
)
entities_in_focus = [
    'catt.cn',
    'chinaunicom.cn',
    'huawei.com',
    'chinatelecom.cn',
    'chinamobile.com',
]
fig, axis = plt.subplots()
lines.evolution_of_participation_1D(
    dic['from'],
    ax=axis,
    entity_in_focus=entities_in_focus,
    percentage=False,
)
axis.set_xlabel('Year')
axis.set_ylabel('Nr of senders')
The above code produces a line plot showing, for each domain in focus, the evolution of the number of senders per year.
Alternatively, it can also be visualised as a heat map using lines.evolution_of_participation_2D(). Similarly, one can plot the evolution of, e.g., different types of centrality of domain names in the communication network:
import networkx as nx
import matplotlib.pyplot as plt
from bigbang.visualisation import lines

dic = mlist.get_graph_prop_per_domain_per_year(func=nx.degree_centrality)
fig, axis = plt.subplots()
lines.evolution_of_graph_property_by_domain(
    dic,
    "year",
    "degree_centrality",
    entity_in_focus=entities_in_focus,
    ax=axis,
)
axis.set_xlabel('Year')
axis.set_ylabel(r'$C_{\rm D}$')
Histograms¶
Documentation for the visualization scripts of BigBang.
Graphs¶
To help visualise the results obtained in Communication Network, we provide support functions such that the thickness of graph edges and the size of nodes can be adjusted. Assuming that one has already executed the mlist.create_sender_receiver_digraph() command, we can use graphs.edge_thickness() to highlight the relation between specific actors, or graphs.node_size() to let the node size increase with their betweenness centrality.
import networkx as nx
import matplotlib.pyplot as plt
from bigbang.visualisation import graphs

# node positions need to be computed first, e.g. with a spring layout
pos = nx.spring_layout(mlist.dg)

edges, edge_width = graphs.edge_thickness(
    mlist.dg,
    entity_in_focus=['t-mobile.at', 'nokia.com'],
)
node_size = graphs.node_size(mlist.dg)

nx.draw_networkx_nodes(
    mlist.dg, pos,
    node_size=node_size,
)
nx.draw_networkx_edges(
    mlist.dg, pos,
    width=edge_width,
    edgelist=edges,
    edge_color=edge_width,
    edge_cmap=plt.cm.rainbow,
)
Reference¶
archive¶
This module supports the Archive class, a generic structure representing a collection of archived emails, typically from a single mailing list.
class bigbang.archive.Archive(data, archive_dir='/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/v0.4.0/archives/', mbox=False)¶
Bases: object
A representation of a mailing list archive.
Initialize an Archive object.
The behavior of the constructor depends on the type of its first argument, data.
If data is a Pandas DataFrame, it is treated as a representation of email messages with columns for Message-ID, From, Date, In-Reply-To, References, and Body. The created Archive becomes a wrapper around a copy of the input DataFrame.
If data is a string, then it is interpreted as a path to either a single .mbox file (if the optional argument single_file is True) or else to a directory of .mbox files (also in .mbox format). Note that the file extensions need not be .mbox; frequently they will be .txt.
Upon initialization, the Archive object drops duplicate entries and sorts its member variable data by Date.
- Parameters
data (pandas.DataFrame, or str) –
archive_dir (str, optional) – Defaults to CONFIG.mail_path
mbox (bool) –
activity = None¶
add_affiliation(rel_email_affil)¶
Uses a DataFrame of email affiliation information and adds it to the archive’s data table.
The email affiliation data is expected to have a regular format, with columns:
email - strings, complete email addresses
affilation - strings, names of organizations of affiliation
min_date - datetime, the starting date of the affiliation
max_date - datetime, the end date of the affiliation
Note that this mutates the dataframe in self.data to add the affiliation data.
rel_email_affil : pandas.DataFrame
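A hedged sketch of a call to this method, assuming an Archive instance named archive and using the documented column names (the example values are hypothetical):
import pandas as pd

rel_email_affil = pd.DataFrame({
    "email": ["jane.doe@example.org"],          # complete email address
    "affilation": ["Example Organization"],     # column name as documented above
    "min_date": [pd.Timestamp("2010-01-01")],   # start of the affiliation
    "max_date": [pd.Timestamp("2015-12-31")],   # end of the affiliation
})
archive.add_affiliation(rel_email_affil)  # mutates archive.data in place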
compute_activity(clean=True)¶
Return the computed activity.
data = None¶
entities = None¶
get_activity(resolved=False)¶
Get the activity matrix of an Archive.
Columns of the returned DataFrame are the Senders of emails. Rows are indexed by ordinal date. Cells are the number of emails sent by each sender on each date.
If resolved is true, then default entity resolution is run on the activity matrix before it is returned.
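For example, a brief sketch of working with the activity matrix (assuming archive is an Archive instance):
activity = archive.get_activity()
# rows: ordinal dates, columns: senders, cells: number of emails
total_by_sender = activity.sum(0)  # total messages per sender
daily_volume = activity.sum(1)     # total messages per day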
get_personal_headers(header='From')¶
Returns a dataframe with a row for every message of the archive, containing column entries for:
The personal header specified. Defaults to “From”. Could be “Reply-To”.
The email address extracted from the From field
The domain of the From field
This dataframe is computed the first time this method is called and then cached.
- Parameters
header (string, default "From") –
- Returns
data
- Return type
pandas.DataFrame
get_threads(verbose=False)¶
Get threads.
preprocessed = None¶
resolve_entities(inplace=True)¶
Return data with resolved entities.
- Parameters
inplace (bool, default True) –
- Returns
Returns None if inplace == True
- Return type
pandas.DataFrame or None
save(path, encoding='utf-8')¶
Save data to csv file.
threads = None¶
exception bigbang.archive.ArchiveWarning¶
Bases: BaseException
Base class for Archive class specific exceptions
exception bigbang.archive.MissingDataException(value)¶
Bases: Exception
bigbang.archive.archive_directory(base_dir, list_name)¶
Creates a new archive directory for the given list_name unless one already exists. Returns the path of the archive directory.
Returns the footer of a DataFrame of emails.
A footer is a string occurring at the tail of most messages. Messages can be a DataFrame or a Series
bigbang.archive.load(path)¶
bigbang.archive.load_data(name: str, archive_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/v0.4.0/archives/', mbox: bool = False)¶
Load the data associated with an archive name, given as a string.
Attempt to open {archives-directory}/NAME.csv as data.
Failing that, if the name is a URL, it will try to derive the list name from that URL and load the .csv again.
- Parameters
name (str) –
archive_dir (str, default CONFIG.mail_path) –
mbox (bool, default False) – If true, expects and opens an mbox file at this path
- Returns
data
- Return type
pandas.DataFrame
bigbang.archive.messages_to_dataframe(messages)¶
Turn a list of parsed messages into a dataframe of message data, indexed by message-id, with column-names from headers.
bigbang.archive.open_list_archives(archive_name: str, archive_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/v0.4.0/archives/', mbox: bool = False) → pandas.core.frame.DataFrame¶
Return a list of all email messages contained in the specified directory.
- Parameters
archive_name (str) – the name of a subdirectory of the directory specified in argument archive_dir. This directory is expected to contain files with extensions .txt, .mail, or .mbox. These files are all expected to be in mbox format– i.e. a series of blocks of text starting with headers (colon-separated key-value pairs) followed by an email body.
archive_dir (str) – directory containing all messages.
mbox (bool, default False) – True if there’s an mbox file already available for this archive.
- Returns
data
- Return type
pandas.DataFrame
bigbang_io¶
bigbang.bigbang_io.email_to_dict(msg: mailbox.mboxMessage) → Dict[str, str]¶
Handles data type transformation from mailbox.mboxMessage to Dictionary.
bigbang.bigbang_io.email_to_mbox(msg: mailbox.mboxMessage, filepath: str, mode: str = 'w') → None¶
Saves mailbox.mboxMessage as .mbox file.
bigbang.bigbang_io.email_to_pandas_dataframe(msg: mailbox.mboxMessage) → pandas.core.frame.DataFrame¶
Handles data type transformation from mailbox.mboxMessage to pandas.DataFrame.
bigbang.bigbang_io.get_paths_to_dirs_in_directory(directory: str, folder_dsc: str = '*') → List[str]¶
Get paths of all directories matching folder_dsc in directory.
bigbang.bigbang_io.get_paths_to_files_in_directory(directory: str, file_dsc: str = '*') → List[str]¶
Get paths of all files matching file_dsc in directory.
bigbang.bigbang_io.mlist_from_mbox(filepath: str) → list¶
Reads mailbox.mboxMessage objects from .mbox file. For a clearer definition of what a mailing list is, see: bigbang.ingress.abstract.AbstractList
bigbang.bigbang_io.mlist_from_mbox_to_pandas_dataframe(filepath: str) → pandas.core.frame.DataFrame¶
Reads mailbox.mboxMessage objects from .mbox file and transforms them into a pandas.DataFrame. For a clearer definition of what a mailing list is, see: bigbang.ingress.abstract.AbstractList
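A brief usage sketch of the two readers above (the file path is hypothetical):
from bigbang.bigbang_io import mlist_from_mbox, mlist_from_mbox_to_pandas_dataframe

msgs = mlist_from_mbox("/path/to/list.mbox")                    # List[mailbox.mboxMessage]
df = mlist_from_mbox_to_pandas_dataframe("/path/to/list.mbox")  # pandas.DataFrame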
bigbang.bigbang_io.mlist_to_dict(msgs: List[mailbox.mboxMessage], include_body: bool = True) → Dict[str, List[str]]¶
Handles data type transformation from a List[mailbox.mboxMessage] to a Dictionary. For a clearer definition of what a mailing list is, see: bigbang.ingress.abstract.AbstractList
bigbang.bigbang_io.mlist_to_mbox(msgs: List[mailbox.mboxMessage], dir_out: str, filename: str) → None¶
Saves a List[mailbox.mboxMessage] as .mbox file. For a clearer definition of what a mailing list is, see: bigbang.ingress.abstract.AbstractList
bigbang.bigbang_io.mlist_to_pandas_dataframe(msgs: List[mailbox.mboxMessage], include_body: bool = True) → pandas.core.frame.DataFrame¶
Handles data type transformation from a List[mailbox.mboxMessage] to a pandas.DataFrame. For a clearer definition of what a mailing list is, see: bigbang.ingress.abstract.AbstractList
bigbang.bigbang_io.mlistdom_to_dict(mlists: List[List[mailbox.mboxMessage]], include_body: bool = True) → Dict[str, List[str]]¶
Handles data type transformation from a List[AbstractList] to a Dictionary. For a clearer definition of what a mailing archive is, see: bigbang.ingress.abstract.AbstractArchive
bigbang.bigbang_io.mlistdom_to_mbox(mlists: List[List[mailbox.mboxMessage]], dir_out: str)¶
Saves a List[AbstractList] as .mbox file. For a clearer definition of what a mailing archive is, see: bigbang.ingress.abstract.AbstractArchive
bigbang.bigbang_io.mlistdom_to_pandas_dataframe(mlists: List[List[mailbox.mboxMessage]], include_body: bool = True) → pandas.core.frame.DataFrame¶
Handles data type transformation from a List[AbstractList] to a pandas.DataFrame. For a clearer definition of what a mailing archive is, see: bigbang.ingress.abstract.AbstractArchive
parse¶
bigbang.parse.clean_from(m_from)¶
Return a person’s name extracted from the ‘From’ field of an email, based on heuristics.
bigbang.parse.clean_mid(mid)¶
bigbang.parse.clean_name(name)¶
Clean just the name portion from email.utils.parseaddr.
Returns None if the name portion is missing anything name-like. Otherwise, returns the cleaned name.
bigbang.parse.get_date(message)¶
bigbang.parse.get_refs(refs)¶
bigbang.parse.get_text(msg)¶
Get text from a message.
bigbang.parse.guess_first_name(cleaned_from)¶
Attempt to extract a person’s first name from the cleaned version of their name (from a ‘From’ field). This may or may not be the given name. Returns None if the heuristic doesn’t recognize a separable first name.
bigbang.parse.normalize_email_address(address)¶
Takes a valid email address and returns a normalized one, for matching purposes.
bigbang.parse.split_references(refs)¶
bigbang.parse.tokenize_name(clean_name)¶
Create a tokenized version of a name, good for comparison and sorting for entity resolution.
Takes a Unicode name already cleaned of most punctuation and spurious characters, hopefully.
utils¶
Miscellaneous utility functions used in other modules.
bigbang.utils.add_freq(idx, freq=None)¶
Add a frequency attribute to idx, through inference or directly.
Returns a copy. If freq is None, it is inferred.
bigbang.utils.clean_message(mess)¶
bigbang.utils.get_common_foot(str1, str2, delimiter=None)¶
bigbang.utils.get_common_head(str1, str2, delimiter=None)¶
bigbang.utils.get_paths_to_dirs_in_directory(directory: str, folder_dsc: str = '*') → List[str]¶
Get paths of all directories matching folder_dsc in directory.
bigbang.utils.get_paths_to_files_in_directory(directory: str, file_dsc: str = '*') → List[str]¶
Get paths of all files matching file_dsc in directory.
bigbang.utils.labeled_blockmodel(g, partition)¶
Perform blockmodel transformation on graph g and partition represented by dictionary partition. Values of partition are used to partition the graph. Keys of partition are used to label the nodes of the new graph.
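A hedged sketch of a call, assuming the values of partition are lists of nodes of g and its keys are the block labels (the toy graph is illustrative):
import networkx as nx
from bigbang.utils import labeled_blockmodel

g = nx.path_graph(4)  # toy graph with nodes 0, 1, 2, 3
partition = {"left": [0, 1], "right": [2, 3]}
blocks = labeled_blockmodel(g, partition)  # new graph with nodes "left" and "right"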
bigbang.utils.remove_quoted(mess)¶
bigbang.utils.repartition_dataframe(df, partition)¶
Create a new dataframe with the same index as argument dataframe df, where columns are the keys of dictionary partition. The data of the returned dataframe are the combinations of columns listed in the keys of partition.
analysis.repo_loader¶
exception bigbang.analysis.repo_loader.RepoLoaderWarning¶
Bases: BaseException
Base class for RepoLoader class specific exceptions
bigbang.analysis.repo_loader.cache_path(name)¶
Takes in a name (bigbang). Returns where its cached file should be (../sample_git_repos/bigbang_backup.csv).
bigbang.analysis.repo_loader.create_graph(dic)¶
Converts a dictionary of dependencies into a NetworkX DiGraph.
bigbang.analysis.repo_loader.fetch_repo(url)¶
Takes in a git url and uses shell commands to clone the git repo into sample_git_repos/.
TODO: We shouldn’t use this with shell=True because of security concerns.
bigbang.analysis.repo_loader.filepath_to_name(filepath)¶
Converts a filepath (../archives/sample_git_repos/{name}) to a name. Note that this will fail if the filepath ends in a “/”. It must end in the name of the folder. Thus, it should be ../archives/sample_git_repos/{name} not ../archives/sample_git_repos/{name}/
bigbang.analysis.repo_loader.get_cache(name)¶
Takes in a name (bigbang). Returns a GitRepo object containing the cache data if the cache exists; returns None otherwise.
bigbang.analysis.repo_loader.get_dependency_network(filepath)¶
Given a directory, collects all Python and IPython files and uses the Python AST to create a dictionary of dependencies from them. Returns the dependencies converted into a NetworkX graph.
bigbang.analysis.repo_loader.get_files(filepath)¶
Returns a list of the Python files in a directory, and converts IPython notebooks into Python source code and includes them with the Python files.
bigbang.analysis.repo_loader.get_multi_repo(repo_names=None, repos=None)¶
As of now, this only accepts names/repos, not local urls. TODO: This could be optimized.
bigbang.analysis.repo_loader.get_org_multirepo(org_name)¶
bigbang.analysis.repo_loader.get_org_repos(org_name)¶
Checks to see if we have the urls for a given org. If we don’t, it fetches them. Once we do, it returns a list of GitRepo objects from the urls.
bigbang.analysis.repo_loader.get_repo(repo_in, in_type='name', update=False)¶
Takes three different options for in_type:
remote: basically a git url
name (default): a name like ‘scipy’ which the method can expand to a url
local: a filepath to a file on the local system (basically an existing git directory on this computer)
This returns an initialized GitRepo object with its data and name already loaded.
bigbang.analysis.repo_loader.load_org_repos(org_name)¶
Fetches a list of all repos in an organization from github and gathers their URLs (of the form *.git). It dumps these into ../examples/{org_name}_urls.txt
bigbang.analysis.repo_loader.name_to_filepath(name)¶
Converts a name of a repo to its filepath. Currently, these go to ../archives/sample_git_repos/{name}/
bigbang.analysis.repo_loader.repo_already_exists(filepath)¶
bigbang.analysis.repo_loader.url_to_name(url)¶
Converts a github url (e.g. https://github.com/sbenthall/bigbang.git) to a human-readable name (bigbang) by looking at the word between the last “/” and “.git”.
get_dependencies¶
datasets.domains¶
This submodule is responsible for making data about the classification of email domains available in Python memory.
The data is stored in a CSV file that is provided with the BigBang repository.
This file was generated using a script that is provided with the library for reproducibility.
The script can be found in Create Domain-Category Data.ipynb
bigbang.datasets.domains.domains.load_data()¶
Returns a dataframe with email domains labeled by category.
Categories include: generic, personal, company, academic, sdo
- Returns
data
- Return type
pandas.DataFrame
ingress.abstract¶
class bigbang.ingress.abstract.AbstractMailList(name: str, source: Union[List[str], str], msgs: List[mailbox.mboxMessage])¶
Bases: abc.ABC
This class handles the scraping of all public Emails contained in a single mailing list. To be more precise, each contributor to a mailing list sends their message to an Email address that has the following structure: <mailing_list_name>@<mail_list_domain_name>. Thus, this class contains all Emails sent to a specific <mailing_list_name> (the Email localpart).
- Parameters
name (The name of the list (e.g. 3GPP_COMMON_IMS_XFER, IEEESCO-DIFUSION, ..)) –
source (Contains the information of the location of the mailing list.) – It can be either a URL where the list is hosted or a path to the file(s).
msgs (List of mboxMessage objects) –
from_url()¶
from_messages()¶
from_mbox()¶
get_message_urls()¶
get_messages_from_url()¶
get_index_of_elements_in_selection()¶
get_name_from_url()¶
to_dict()¶
to_pandas_dataframe()¶
to_mbox()¶
__getitem__(index) → mailbox.mboxMessage¶
Get specific message at position index within the mailing list.
__iter__()¶
Iterate over each message within the mailing list.
__len__() → int¶
Get number of messages within the mailing list.
abstract classmethod from_mbox(name: str, filepath: str) → bigbang.ingress.abstract.AbstractMailList¶
- Parameters
name (Name of the list of messages, e.g. '3GPP_TSG_SA_WG2_UPCON'.) –
filepath (Path to file in which mailing list is stored.) –
abstract classmethod from_messages(name: str, url: str, messages: Union[List[str], List[mailbox.mboxMessage]], fields: str = 'total', url_login: str = None, url_pref: str = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None) → bigbang.ingress.abstract.AbstractMailList¶
- Parameters
name (Name of the list of messages, e.g. 'public-bigdata') –
url (URL to the Email list.) –
messages (Can either be a list of URLs to specific messages) – or a list of mboxMessage objects.
url_login (URL to the 'Log In' page.) –
url_pref (URL to the 'Preferences'/settings page.) –
login (Login credentials (username and password) that were used to set) – up AuthSession.
session (requests.Session() object for the Email list domain website.) –
abstract classmethod from_url(name: str, url: str, select: Optional[dict] = {'fields': 'total'}, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None) → bigbang.ingress.abstract.AbstractMailList¶
- Parameters
name (Name of the mailing list.) –
url (URL to the mailing list.) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
url_login (URL to the 'Log In' page) –
url_pref (URL to the 'Preferences'/settings page) –
login (Login credentials (username and password) that were used to set) – up AuthSession.
session (requests.Session() object for the Email list domain website.) –
static get_index_of_elements_in_selection(times: List[Union[int, str]], urls: List[str], filtr: Union[tuple, list, int, str]) → List[int]¶
Filter out messages that were sent in a specific period. Period here is a set containing units of year, month, and week-of-month which can have the following example elements:
years: (1992, 2010), [2000, 2008], 2021
months: [“January”, “July”], “November”
weeks: (1, 4), [1, 5], 2
- Parameters
times (A list containing information of the period for each) – group of mboxMessage.
urls (Corresponding URLs of each group of mboxMessage of which the) – period info is contained in times.
filtr (Containing info on what should be filtered.) –
- Returns
- Return type
Indices of the elements in times/urls.
abstract classmethod get_message_urls(name: str, url: str, select: Optional[dict] = None) → List[str]¶
- Parameters
name (Name of the list of messages, e.g. 'public-bigdata') –
url (URL to the mailing list.) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year and month
- Returns
- Return type
List of all selected URLs of the messages in the mailing list.
static get_messages_from_urls(name: str, msg_urls: list, msg_parser, fields: Optional[str] = 'total') → List[mailbox.mboxMessage]¶
Generator that returns all messages within a certain period (e.g. January 2021, Week 5).
- Parameters
name (Name of the list of messages, e.g. '3GPP_TSG_SA_WG2_UPCON') –
url (URL to the mailing list.) –
fields (Content, i.e. header and/or body) –
abstract get_name_from_url() → str¶
to_dict(include_body: bool = True) → Dict[str, List[str]]¶
- Parameters
include_body (A boolean that indicates whether the message body should) – be included or not.
- Returns
A Dictionary with the first key layer being the header field names and
the “body” key. Each value field is a list containing the respective
header field contents arranged by the order as they were scraped from
the web. This format makes the conversion to a pandas.DataFrame easier.
to_mbox(dir_out: str, filename: Optional[str] = None)¶
Save mailing list to .mbox files.
to_pandas_dataframe(include_body: bool = True) → pandas.core.frame.DataFrame¶
- Parameters
include_body (A boolean that indicates whether the message body should) – be included or not.
- Returns
Converts the mailing list into a pandas.DataFrame object in which each
row represents an Email.
class bigbang.ingress.abstract.AbstractMailListDomain(name: str, url: str, lists: List[Union[bigbang.ingress.abstract.AbstractMailList, str]])¶
Bases: abc.ABC
This class handles the scraping of all public Emails contained in a mail list domain. To be more precise, each contributor to a mailing archive sends their message to an Email address that has the following structure: <mailing_list_name>@<mail_list_domain_name>. Thus, this class contains all Emails sent to <mail_list_domain_name> (the Email domain name). These Emails are contained in a list of AbstractMailList types, such that it is known to which <mailing_list_name> (the Email localpart) they were sent.
- Parameters
name (The mail list domain name (e.g. 3GPP, IEEE, W3C)) –
url (The URL where the archive lives) –
lists (A list containing the mailing lists as AbstractMailList types) –
from_url()¶
from_mailing_lists()¶
from_mbox()¶
get_lists_from_url()¶
to_dict()¶
to_pandas_dataframe()¶
to_mbox()¶
__getitem__(index)¶
Get specific mailing list at position index from the mail list domain.
__iter__()¶
Iterate over each mailing list within the mail list domain.
__len__()¶
Get number of mailing lists within the mail list domain.
abstract classmethod from_mailing_lists(name: str, url_root: str, url_mailing_lists: Union[List[str], List[bigbang.ingress.abstract.AbstractMailList]], select: Optional[dict] = None, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None, only_mlist_urls: bool = True, instant_save: Optional[bool] = True) → bigbang.ingress.abstract.AbstractMailListDomain¶
Create a mail list domain from a given list of AbstractMailList instances or URLs pointing to mailing lists.
- Parameters
name (mail list domain name, such that multiple instances of) – AbstractMailListDomain can easily be distinguished.
url_root (The invariant root URL that does not change no matter what) – part of the mail list domain we access.
url_mailing_lists (This argument can either be a list of AbstractMailList) – objects or a list of string containing the URLs to the mailing list of interest.
url_login (URL to the 'Log In' page.) –
url_pref (URL to the 'Preferences'/settings page.) –
login (Login credentials (username and password) that were used to set) – up AuthSession.
session (requests.Session() object for the mail list domain website.) –
only_list_urls (Boolean giving the choice to collect only mailing list) – URLs or also their contents.
instant_save (Boolean giving the choice to save a AbstractMailList as) – soon as it is completely scraped or collect entire mail list domain. The prior is recommended if a large number of mailing lists are scraped which can require a lot of memory and time.
abstract classmethod from_mbox(name: str, directorypath: str, filedsc: str = '*.mbox') → bigbang.ingress.abstract.AbstractMailListDomain¶
- Parameters
name (mail list domain name, such that multiple instances of) – AbstractMailListDomain can easily be distinguished.
directorypath (Path to the folder in which AbstractMailListDomain is stored.) –
filedsc (Optional filter that only reads files matching the description.) – By default all files with an mbox extension are read.
abstract classmethod from_url(name: str, url_root: str, url_home: Optional[str] = None, select: Optional[dict] = None, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → bigbang.ingress.abstract.AbstractMailListDomain¶
Create a mail list domain from a given URL.
- Parameters
name (mail list domain name, such that multiple instances of AbstractMailListDomain can easily be distinguished.) –
url_root (The invariant root URL that does not change no matter what part of the mail list domain we access.) –
url_home (The 'home' space of the mail list domain. This is required as it contains the different sections which we obtain using get_sections().) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
- Parameters
url_login (URL to the 'Log In' page.) –
url_pref (URL to the 'Preferences'/settings page.) –
login (Login credentials (username and password) that were used to set) – up AuthSession.
session (requests.Session() object for the mail list domain website.) –
instant_save (Boolean giving the choice to save a AbstractMailList as) – soon as it is completely scraped or collect entire mail list domain. The prior is recommended if a large number of mailing lists are scraped which can require a lot of memory and time.
only_list_urls (Boolean giving the choice to collect only AbstractMailList) – URLs or also their contents.
abstract classmethod get_lists_from_url(url_home: str, select: dict, session: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → List[Union[bigbang.ingress.abstract.AbstractMailList, str]]¶
Creates a dictionary of all lists in the mail list domain.
- Parameters
url_root (The invariant root URL that does not change no matter what) – part of the mail list domain we access.
url_home (The 'home' space of the mail list domain. This is required as) – it contains the different sections which we obtain using get_sections().
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
session (requests.Session() object for the mail list domain website.) –
instant_save (Boolean giving the choice to save a AbstractMailList as) – soon as it is completely scraped or collect entire mail list domain. The prior is recommended if a large number of mailing lists are scraped which can require a lot of memory and time.
only_list_urls (Boolean giving the choice to collect only AbstractMailList) – URLs or also their contents.
- Returns
archive_dict
- Return type
the keys are the names of the lists and the values their url
to_dict(include_body: bool = True) → Dict[str, List[str]]¶
Concatenates mailing list dictionaries created using AbstractMailList.to_dict().
to_mbox(dir_out: str)¶
Save mail list domain content to .mbox files.
to_pandas_dataframe(include_body: bool = True) → pandas.core.frame.DataFrame¶
Concatenates mailing list pandas.DataFrames created using AbstractMailList.to_pandas_dataframe().
exception bigbang.ingress.abstract.AbstractMailListDomainWarning¶
Bases: BaseException
Base class for AbstractMailListDomain class specific exceptions
exception bigbang.ingress.abstract.AbstractMailListWarning¶
Bases: BaseException
Base class for AbstractMailList class specific exceptions
class bigbang.ingress.abstract.AbstractMessageParser(website=False, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None)¶
Bases: abc.ABC
This class handles the creation of a mailbox.mboxMessage object (using the from_*() methods) and its storage in various other file formats (using the to_*() methods) that can be saved on the local memory.
create_email_message(archived_at: str, body: str, **header) → mailbox.mboxMessage¶
- Parameters
archived_at (URL to the Email message.) –
body (String that contains the body of the message.) –
header (Dictionary that contains all available header fields of the) – message.
from_url(list_name: str, url: str, fields: str = 'total') → mailbox.mboxMessage¶
- Parameters
list_name (The name of the mailing list.) –
url (URL of this Email) –
fields (Indicates whether to return the 'header', 'body', or 'total'/both of the Email. The latter is the default.) –
static to_dict(msg: mailbox.mboxMessage) → Dict[str, List[str]]¶
Convert mboxMessage to a Dictionary.
static to_mbox(msg: mailbox.mboxMessage, filepath: str)¶
- Parameters
msg (The Email.) –
filepath (Path to file in which the Email will be stored.) –
static to_pandas_dataframe(msg: mailbox.mboxMessage) → pandas.core.frame.DataFrame¶
Convert mboxMessage to a pandas.DataFrame.
exception bigbang.ingress.abstract.AbstractMessageParserWarning¶
Bases: BaseException
Base class for AbstractMessageParser class specific exceptions
ingress.listserv¶
class bigbang.ingress.listserv.ListservMailList(name: str, source: Union[List[str], str], msgs: List[mailbox.mboxMessage])¶
Bases: bigbang.ingress.abstract.AbstractMailList
This class handles the scraping of all public Emails contained in a single mailing list in the LISTSERV 16.5 and 17 format. To be more precise, each contributor to a mailing list sends their message to an Email address that has the following structure: <mailing_list_name>@LIST.ETSI.ORG. Thus, this class contains all Emails sent to a specific <mailing_list_name> (the Email localpart, such as “3GPP_TSG_CT_WG1” or “3GPP_TSG_CT_WG3_108E_MAIN”).
- Parameters
name (The name of the list (e.g. 3GPP_COMMON_IMS_XFER, IEEESCO-DIFUSION, ..)) –
source (Contains the information of the location of the mailing list.) – It can be either a URL where the list is hosted or a path to the file(s).
msgs (List of mboxMessage objects) –
Example
To scrape a Listserv mailing list from a URL and store it in run-time memory, we do the following:
>>> mlist = ListservMailList.from_url(
>>>     name="IEEE-TEST",
>>>     url="https://listserv.ieee.org/cgi-bin/wa?A0=IEEE-TEST",
>>>     select={
>>>         "years": 2015,
>>>         "months": "November",
>>>         "weeks": 4,
>>>         "fields": "header",
>>>     },
>>>     login={"username": <your_username>, "password": <your_password>},
>>> )
To save it as a *.mbox file we do the following:
>>> mlist.to_mbox(path_to_file)
classmethod from_listserv_directories(name: str, directorypaths: List[str], filedsc: str = '*.LOG?????', select: Optional[dict] = None) → bigbang.ingress.listserv.ListservMailList¶
This method is required if the files that contain the list messages were directly exported from LISTSERV 16.5 (e.g. by a member of 3GPP). Each mailing list has its own directory and is split over multiple files with an extension starting with LOG and ending with five digits.
- Parameters
name (Name of the list of messages, e.g. '3GPP_TSG_SA_WG2_UPCON'.) –
directorypaths (List of directory paths where LISTSERV formatted) – messages are.
filedsc (A description of the relevant files, e.g. *.LOG?????) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
classmethod from_listserv_files(name: str, filepaths: List[str], select: Optional[dict] = None) → bigbang.ingress.listserv.ListservMailList¶
This method is required if the files that contain the list messages were directly exported from LISTSERV 16.5 (e.g. by a member of 3GPP). Each mailing list has its own directory and is split over multiple files with an extension starting with LOG and ending with five digits. Compared to ListservMailList.from_listserv_directories(), this method reads messages from single files, instead of all the files contained in a directory.
- Parameters
name (Name of the list of messages, e.g. '3GPP_TSG_SA_WG2_UPCON') –
filepaths (List of file paths where LISTSERV formatted messages are.) – Such files can have a file extension of the form: *.LOG1405D
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
classmethod from_mbox(name: str, filepath: str) → bigbang.ingress.listserv.ListservMailList¶
Docstring in AbstractMailList.
classmethod from_messages(name: str, url: str, messages: Union[List[str], List[mailbox.mboxMessage]], fields: str = 'total', url_login: str = 'https://list.etsi.org/scripts/wa.exe?LOGON', url_pref: str = 'https://list.etsi.org/scripts/wa.exe?PREF', login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None) → bigbang.ingress.listserv.ListservMailList¶
Docstring in AbstractMailList.
classmethod from_url(name: str, url: str, select: Optional[dict] = {'fields': 'total'}, url_login: str = 'https://list.etsi.org/scripts/wa.exe?LOGON', url_pref: str = 'https://list.etsi.org/scripts/wa.exe?PREF', login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None) → bigbang.ingress.listserv.ListservMailList¶
Docstring in AbstractMailList.
static get_all_periods_and_their_urls(url: str) → Tuple[List[str], List[str]]¶
LISTSERV groups messages into weekly time bundles. This method obtains all the URLs that lead to the messages of each time bundle.
- Returns
Returns a tuple of two lists that look like
([‘April 2017, 2’, ‘January 2001’, …], [‘url1’, ‘url2’, …])
classmethod get_line_numbers_of_header_starts(content: List[str]) → List[int]¶
By definition, LISTSERV logs separate new messages by a row of 73 equal signs.
- Parameters
content (The content of one LISTSERV file.) –
- Returns
- Return type
List of line numbers where header starts
classmethod get_message_urls(name: str, url: str, select: Optional[dict] = None) → List[str]¶
Docstring in AbstractMailList.
This routine is needed for Listserv 16.5
static get_name_from_url(url: str) → str¶
Get name of mailing list.
classmethod get_period_urls(url: str, select: Optional[dict] = None) → List[str]¶
All messages within a certain period (e.g. January 2021, Week 5).
- Parameters
url (URL to the LISTSERV list.) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
class bigbang.ingress.listserv.ListservMailListDomain(name: str, url: str, lists: List[Union[bigbang.ingress.abstract.AbstractMailList, str]])¶
Bases: bigbang.ingress.abstract.AbstractMailListDomain
This class handles the scraping of all public Emails contained in a mail list domain that has the LISTSERV 16.5 and 17 format, such as 3GPP. To be more precise, each contributor to a mail list domain sends their message to an Email address that has the following structure: <mailing_list_name>@LIST.ETSI.ORG. Thus, this class contains all Emails sent to <mail_list_domain_name> (the Email domain name). These Emails are contained in a list of ListservMailList types, such that it is known to which <mailing_list_name> (the Email localpart) they were sent.
- Parameters
name (The mailing list domain name (e.g. 3GPP, IEEE, ..)) –
url (The URL where the mailing list domain lives) –
lists (A list containing the mailing lists as ListservMailList types) –
-
All methods in the `AbstractMailListDomain` class in addition to:
-
from_listserv_directory
()¶
-
get_sections
()¶
Example
To scrape a Listserv mailing list domain from a URL and store it in run-time memory, we do the following
>>> mlistdom = ListservMailListDomain.from_url(
>>>     name="IEEE",
>>>     url_root="https://listserv.ieee.org/cgi-bin/wa?",
>>>     url_home="https://listserv.ieee.org/cgi-bin/wa?HOME",
>>>     select={
>>>         "years": 2015,
>>>         "months": "November",
>>>         "weeks": 4,
>>>         "fields": "header",
>>>     },
>>>     login={"username": <your_username>, "password": <your_password>},
>>>     instant_save=False,
>>>     only_mlist_urls=False,
>>> )
To save it as a *.mbox file we do the following
>>> mlistdom.to_mbox(path_to_directory)
-
classmethod
from_listserv_directory
(name: str, directorypath: str, folderdsc: str = '*', filedsc: str = '*.LOG?????', select: Optional[dict] = None) → bigbang.ingress.listserv.ListservMailListDomain¶ This method is required if the files that contain the mail list domain messages were directly exported from LISTSERV 16.5 (e.g. by a member of 3GPP). Each mailing list has its own subdirectory and is split over multiple files with an extension starting with LOG and ending with five digits.
- Parameters
name (mail list domain name, such that multiple instances of) – ListservMailListDomain can easily be distinguished.
directorypath (Where the ListservMailListDomain can be initialised.) –
folderdsc (A description of the relevant folders) –
filedsc (A description of the relevant files, e.g. *.LOG?????) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year, month, week-of-month
-
classmethod
from_mailing_lists
(name: str, url_root: str, url_mailing_lists: Union[List[str], List[bigbang.ingress.listserv.ListservMailList]], select: Optional[dict] = {'fields': 'total'}, url_login: str = 'https://list.etsi.org/scripts/wa.exe?LOGON', url_pref: str = 'https://list.etsi.org/scripts/wa.exe?PREF', login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None, only_mlist_urls: bool = True, instant_save: Optional[bool] = True) → bigbang.ingress.listserv.ListservMailListDomain¶ Docstring in AbstractMailList.
-
classmethod
from_mbox
(name: str, directorypath: str, filedsc: str = '*.mbox') → bigbang.ingress.listserv.ListservMailList¶ Docstring in AbstractMailList.
-
classmethod
from_url
(name: str, url_root: str, url_home: str, select: Optional[dict] = {'fields': 'total'}, url_login: str = 'https://list.etsi.org/scripts/wa.exe?LOGON', url_pref: str = 'https://list.etsi.org/scripts/wa.exe?PREF', login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → bigbang.ingress.listserv.ListservMailListDomain¶ Docstring in AbstractMailList.
-
static
get_lists_from_url
(url_root: str, url_home: str, select: dict, session: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → List[Union[bigbang.ingress.listserv.ListservMailList, str]]¶ Docstring in AbstractMailList.
-
get_sections
(url_home: str) → int¶ Get different sections of mail list domain. On the Listserv 16.5 website they look like: [3GPP] [3GPP–AT1] [AT2–CONS] [CONS–EHEA] [EHEA–ERM_] … On the Listserv 17 website they look like: [<<][<]1-50(798)[>][>>]
- Returns
If sections exist, it returns their urls and names. Otherwise it returns
the url_home.
-
exception
bigbang.ingress.listserv.
ListservMailListDomainWarning
¶ Bases:
BaseException
Base class for ListservMailListDomain class specific exceptions
-
exception
bigbang.ingress.listserv.
ListservMailListWarning
¶ Bases:
BaseException
Base class for ListservMailList class specific exceptions
-
class
bigbang.ingress.listserv.
ListservMessageParser
(website=False, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None)¶ Bases:
bigbang.ingress.abstract.AbstractMessageParser
,email.parser.Parser
This class handles the creation of a mailbox.mboxMessage object (using the from_*() methods) and its storage in various other file formats (using the to_*() methods) that can be saved to local storage.
- Parameters
website (Set 'True' if messages are going to be scraped from websites,) – otherwise ‘False’ if read from local memory.
url_login (URL to the 'Log In' page.) –
url_pref (URL to the 'Preferences'/settings page.) –
login (Login credentials (username and password) that were used to set) – up AuthSession. You can create your own for the 3GPP mail list domain.
session (requests.Session() object for the mail list domain website.) –
-
from_url
()¶
-
from_listserv_file
()¶
-
_get_header_from_html
()¶
-
_get_body_from_html
()¶
-
_get_header_from_listserv_file
()¶
-
_get_body_from_listserv_file
()¶
Example
To create an Email message parser object, use the following syntax:
>>> msg_parser = ListservMessageParser(
>>>     website=True,
>>>     login={"username": <your_username>, "password": <your_password>},
>>> )
To obtain the Email message content and return it as an mboxMessage object, you need to do the following:
>>> msg = msg_parser.from_url(
>>>     list_name="3GPP_TSG_RAN_DRAFTS",
>>>     url="https://list.etsi.org/scripts/wa.exe?A2=ind2010B&L=3GPP_TSG_RAN_DRAFTS&O=D&P=29883",
>>>     fields="total",
>>> )
-
empty_header
= {}¶
-
from_listserv_file
(list_name: str, file_path: str, header_start_line_nr: int, fields: str = 'total') → mailbox.mboxMessage¶ This method is required if the message is inside a file that was directly exported from LISTSERV 16.5 (e.g. by a member of 3GPP). Such files have an extension starting with LOG and ending with five digits.
- Parameters
list_name (The name of the LISTSERV Email list.) –
file_path (Path to file that contains the Email list.) –
header_start_line_nr (Line number in the file on which a new message starts.) –
fields (Indicates whether to return the 'header', 'body', or 'total'/both of) – the Email.
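For illustration, a hedged sketch of a call, reusing the msg_parser object from the Example above (the file path and header line number are hypothetical):
>>> msg = msg_parser.from_listserv_file(
>>>     list_name="3GPP_TSG_RAN_DRAFTS",
>>>     file_path="/path/to/3GPP_TSG_RAN_DRAFTS.LOG2010B",
>>>     header_start_line_nr=1,
>>>     fields="total",
>>> )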
-
exception
bigbang.ingress.listserv.
ListservMessageParserWarning
¶ Bases:
BaseException
Base class for ListservMessageParser class specific exceptions
ingress.w3c¶
-
class
bigbang.ingress.w3c.
W3CMailList
(name: str, source: Union[List[str], str], msgs: List[mailbox.mboxMessage])¶ Bases:
bigbang.ingress.abstract.AbstractMailList
This class handles the scraping of all public Emails contained in a single mailing list in the hypermail format. To be more precise, each contributor to a mailing list sends their message to an Email address that has the following structure: <mailing_list_name>@w3.org. Thus, this class contains all Emails sent to a specific <mailing_list_name> (the Email localpart, such as “public-abcg” or “public-accesslearn-contrib”).
- Parameters
name (The name of the list (e.g. public-2018-permissions-ws, ..)) –
source (Contains the information of the location of the mailing list.) – It can be either a URL where the list lives or a path to the file(s).
msgs (List of mboxMessage objects) –
Example
To scrape a W3C mailing list from a URL and store it in run-time memory, we do the following
>>> mlist = W3CMailList.from_url(
>>>     name="public-bigdata",
>>>     url="https://lists.w3.org/Archives/Public/public-bigdata/",
>>>     select={
>>>         "years": 2015,
>>>         "months": "August",
>>>         "fields": "header",
>>>     },
>>> )
To save it as a *.mbox file we do the following
>>> mlist.to_mbox(path_to_file)
-
classmethod
from_mbox
(name: str, filepath: str) → bigbang.ingress.w3c.W3CMailList¶ Docstring in AbstractMailList.
-
classmethod
from_messages
(name: str, url: str, messages: Union[List[str], List[mailbox.mboxMessage]], fields: str = 'total') → bigbang.ingress.w3c.W3CMailList¶ Docstring in AbstractMailList.
-
classmethod
from_url
(name: str, url: str, select: Optional[dict] = {'fields': 'total'}) → bigbang.ingress.w3c.W3CMailList¶ Docstring in AbstractMailList.
-
static
get_all_periods_and_their_urls
(url: str) → Tuple[List[str], List[str]]¶ W3C groups messages into monthly time bundles. This method obtains all the URLs that lead to the messages of each time bundle.
- Returns
Returns a tuple of two lists that look like
([‘April 2017’, ‘January 2001’, …], [‘url1’, ‘url2’, …])
-
classmethod
get_message_urls
(name: str, url: str, select: Optional[dict] = None) → List[str]¶ Docstring in AbstractMailList.
-
classmethod
get_messages_urls
(name: str, url: str) → List[str]¶ - Parameters
name (Name of the W3C mailing list.) –
url (URL to group of messages that are within the same period.) –
- Returns
- Return type
List of URLs from which mboxMessage can be initialized.
-
static
get_name_from_url
(url: str) → str¶ Get name of mailing list.
-
classmethod
get_period_urls
(url: str, select: Optional[dict] = None) → List[str]¶ All messages within a certain period (e.g. January 2021).
- Parameters
url (URL to the W3C list.) –
select (Selection criteria that can filter messages by:) –
content, i.e. header and/or body
period, i.e. written in a certain year and month
-
class
bigbang.ingress.w3c.
W3CMailListDomain
(name: str, url: str, lists: List[Union[bigbang.ingress.abstract.AbstractMailList, str]])¶ Bases:
bigbang.ingress.abstract.AbstractMailListDomain
This class handles the scraping of all public Emails contained in a mail list domain that has the hypermail format, such as W3C. To be more precise, each contributor to a mail list domain sends their message to an Email address that has the following structure: <mailing_list_name>@w3.org. Thus, this class contains all Emails sent to <mail_list_domain_name> (the Email domain name). These Emails are held in a list of W3CMailList types, such that it is known to which <mailing_list_name> (the Email localpart) each message was sent.
- Parameters
name (The name of the mailing list domain.) –
url (The URL where the mailing list domain lives) –
lists (A list containing the mailing lists as W3CMailList types) –
-
All methods in the `AbstractMailListDomain` class.
Example
To scrape a W3C mailing list domain from a URL and store it in run-time memory, we do the following
>>> mlistdom = W3CMailListDomain.from_url(
>>>     name="W3C",
>>>     url_root="https://lists.w3.org/Archives/Public/",
>>>     select={
>>>         "years": 2015,
>>>         "months": "November",
>>>         "weeks": 4,
>>>         "fields": "header",
>>>     },
>>>     instant_save=False,
>>>     only_mlist_urls=False,
>>> )
To save it as a *.mbox file we do the following
>>> mlistdom.to_mbox(path_to_directory)
-
classmethod
from_mailing_lists
(name: str, url_root: str, url_mailing_lists: Union[List[str], List[bigbang.ingress.w3c.W3CMailList]], select: Optional[dict] = {'fields': 'total'}, only_mlist_urls: bool = True, instant_save: Optional[bool] = True) → bigbang.ingress.w3c.W3CMailListDomain¶ Docstring in AbstractMailListDomain.
-
classmethod
from_mbox
(name: str, directorypath: str, filedsc: str = '*.mbox') → bigbang.ingress.w3c.W3CMailListDomain¶ Docstring in AbstractMailListDomain.
-
classmethod
from_url
(name: str, url_root: str, url_home: Optional[str] = None, select: Optional[dict] = {'fields': 'total'}, instant_save: bool = True, only_mlist_urls: bool = True) → bigbang.ingress.w3c.W3CMailListDomain¶ Docstring in AbstractMailListDomain.
-
static
get_lists_from_url
(name: str, select: dict, url_root: str, url_home: Optional[str] = None, instant_save: bool = True, only_mlist_urls: bool = True) → List[Union[bigbang.ingress.w3c.W3CMailList, str]]¶ Docstring in AbstractMailListDomain.
-
exception
bigbang.ingress.w3c.
W3CMailListDomainWarning
¶ Bases:
BaseException
Base class for W3CMailListDomain class specific exceptions
-
exception
bigbang.ingress.w3c.
W3CMailListWarning
¶ Bases:
BaseException
Base class for W3CMailList class specific exceptions
-
class
bigbang.ingress.w3c.
W3CMessageParser
(website=False, url_login: Optional[str] = None, url_pref: Optional[str] = None, login: Optional[Dict[str, str]] = {'password': None, 'username': None}, session: Optional[requests.sessions.Session] = None)¶ Bases:
bigbang.ingress.abstract.AbstractMessageParser
,email.parser.Parser
This class handles the creation of a mailbox.mboxMessage object (using the from_*() methods) and its storage in various other file formats (using the to_*() methods) that can be saved to local storage.
- Parameters
website (Set 'True' if messages are going to be scraped from websites,) – otherwise ‘False’ if read from local memory. This distinction needs to be made if missing messages should be added.
url_pref (URL to the 'Preferences'/settings page.) –
Example
To create an Email message parser object, use the following syntax:
>>> msg_parser = W3CMessageParser(website=True)
To obtain the Email message content and return it as an mboxMessage object, you need to do the following:
>>> msg = msg_parser.from_url(
>>>     list_name="public-2018-permissions-ws",
>>>     url="https://lists.w3.org/Archives/Public/public-2018-permissions-ws/2019May/0000.html",
>>>     fields="total",
>>> )
-
empty_header
= {}¶
-
exception
bigbang.ingress.w3c.
W3CMessageParserWarning
¶ Bases:
BaseException
Base class for W3CMessageParser class specific exceptions
-
bigbang.ingress.w3c.
parse_dfn_header
(header_text)¶
-
bigbang.ingress.w3c.
text_for_selector
(soup: bs4.BeautifulSoup, selector: str)¶ Filter out header or body field from website and return them as utf-8 string.
ingress.git_repo¶
-
class
bigbang.ingress.git_repo.
GitRepo
(name, url=None, attribs=['HEXSHA', 'Committer Name', 'Committer Email', 'Commit Message', 'Time', 'Parent Commit', 'Touched File'], cache=None)¶ Bases:
object
Store a git repository given the address to that repo relative to this file. It returns the data in many forms.
The raw form of the repo’s commit data is stored as a Pandas DataFrame indexed by time. Each row in this table is a commit, and each column represents an attribute of that commit (e.g. time, message, committer name, committer email, commit hexsha).
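A minimal usage sketch, assuming a local clone of a repository at a hypothetical relative path:
>>> repo = GitRepo(name="bigbang", url="../bigbang")
>>> df = repo.commit_data        # Pandas DataFrame, one row per commit
>>> weekly = repo.commits_per_week()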
-
by_committer
()¶ Return commit data grouped by committer.
-
property
commit_data
¶ Return commit data.
-
commits_for_committer
(committer_name)¶ Return commits for a committer, given the committer name.
-
commits_per_day
()¶ Return commits grouped by day.
-
commits_per_day_full
()¶ Return commits grouped by day and by committer.
-
commits_per_week
()¶ Return commits grouped by week.
-
gen_data
(repo, raw)¶ Generate data for the repo.
-
merge_with_repo
(other)¶ Append the commits of another repo to this repo.
-
populate_data
(attribs=['HEXSHA', 'Committer Name', 'Committer Email', 'Commit Message', 'Time', 'Parent Commit', 'Touched File'])¶ Populate data.
-
-
class
bigbang.ingress.git_repo.
MultiGitRepo
(repos, attribs=['HEXSHA', 'Committer Name', 'Committer Email', 'Commit Message', 'Time', 'Parent Commit', 'Touched File'])¶ Bases:
bigbang.ingress.git_repo.GitRepo
Repos must have a “Repo Name” column.
The raw form of each repo’s commit data is stored as a Pandas DataFrame indexed by time. Each row in this table is a commit, and each column represents an attribute of that commit (e.g. time, message, committer name, committer email, commit hexsha).
-
bigbang.ingress.git_repo.
cache_fixer
(r)¶ Adds info from row to graph.
ingress.mailman¶
-
exception
bigbang.ingress.mailman.
InvalidURLException
(value)¶ Bases:
Exception
-
bigbang.ingress.mailman.
access_provenance
(directory)¶ Return an object with provenance information located in the given directory, or None if no provenance was found.
-
bigbang.ingress.mailman.
collect_archive_from_url
(url: Union[list, str], archive_dir='/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/v0.4.0/archives/', notes=None)¶ Collect archive files (generally .tar.gz) from a mailman archive page.
Return True if archives were downloaded, False otherwise (for example if the page lists no accessible archive files).
-
bigbang.ingress.mailman.
collect_from_file
(urls_file: str, archive_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/v0.4.0/archives/', notes=None)¶ Collect urls from a file.
-
bigbang.ingress.mailman.
collect_from_url
(url: Union[list, str], archive_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/v0.4.0/archives/', notes=None)¶ Collect data from a given url.
-
bigbang.ingress.mailman.
get_list_name
(url)¶ Return the ‘list name’ from a canonical mailman archive url.
Otherwise return the same URL.
-
bigbang.ingress.mailman.
normalize_archives_url
(url)¶ Normalize a URL.
Will try to infer, find or guess the most useful archives URL, given a URL.
Return normalized URL, or the original URL if no improvement is found.
-
bigbang.ingress.mailman.
open_activity_summary
(url, archive_dir='/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/v0.4.0/archives/')¶ Open the message activity summary for a particular mailing list (as specified by url).
Return the dataframe, or return None if no activity summary export file is found.
-
bigbang.ingress.mailman.
populate_provenance
(directory, list_name, list_url, notes=None)¶ Create a provenance metadata file for current mailing list collection.
-
bigbang.ingress.mailman.
recursive_get_payload
(x)¶ Get payloads recursively.
-
bigbang.ingress.mailman.
unzip_archive
(url, archive_dir='/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/v0.4.0/archives/')¶ Unzip archive files.
-
bigbang.ingress.mailman.
update_provenance
(directory, provenance)¶ Update provenance file with given object.
-
bigbang.ingress.mailman.
urls_to_collect
(urls_file: str)¶ Collect urls given urls in a file.
ingress.utils¶
Miscellaneous utility functions used in other modules.
-
bigbang.ingress.utils.
ask_for_input
(request: str) → Optional[str]¶
-
bigbang.ingress.utils.
get_auth_session
(url_login: str, username: str, password: str) → requests.sessions.Session¶ Create AuthSession.
- There are three ways to create an AuthSession:
- pass username & password directly to the method,
- create a /bigbang/config/authentication.yaml file that contains the keys,
- type them into the terminal when prompted by ‘get_login_from_terminal’.
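For the first of these options, a sketch using the placeholder-credential convention of the Examples above:
>>> from bigbang.ingress.utils import get_auth_session
>>> session = get_auth_session(
>>>     url_login="https://list.etsi.org/scripts/wa.exe?LOGON",
>>>     username=<your_username>,
>>>     password=<your_password>,
>>> )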
-
bigbang.ingress.utils.
get_login_from_terminal
(username: Optional[str], password: Optional[str], file_auth: str = '/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/v0.4.0/bigbang/config/authentication.yaml') → Tuple[Optional[str]]¶ Get login key from user during run time if ‘username’ and/or ‘password’ is ‘None’. Return ‘None’ if no reply within 15 sec.
-
bigbang.ingress.utils.
get_website_content
(url: str, session: Optional[requests.sessions.Session] = None) → Union[str, bs4.BeautifulSoup]¶ Get HTML code from a website.
Note
Servers don’t like it when one sends too many requests from the same IP address in a short period of time. Therefore we need to:
- catch ‘requests.exceptions.RequestException’ errors (which includes all possible errors, to be on the safe side),
- save intermediate results,
- continue where we left off at a later stage.
-
bigbang.ingress.utils.
loginkey_to_file
(username: str, password: str, file_auth: str) → None¶ Save login key to a YAML file.
-
bigbang.ingress.utils.
set_website_preference_for_header
(url_pref: str, session: requests.sessions.Session) → requests.sessions.Session¶ Set the ‘Email Headers’ of the ‘Archive Preferences’ for the auth session to ‘Show All Headers’. Otherwise only a restricted list of header fields is shown.
analysis.attendance¶
-
bigbang.analysis.attendance.
name_email_affil_relations_from_IETF_attendance
(meeting_range=[106, 107, 108], threshold=None)¶ Extract and infer from IETF attendance records the relations between full names, email addresses, and affiliations.
In the returned dataframes, each row represents a relation between two of these forms of entity, along with the maximum and minimum date associated with it in the data.
Two forms of inference are used when generating these relational tables:
Missing values in time are filled forward, then filled backward
TODO: Affiliations are run through the entity resolution script to reduce them to a ‘canonical form’.
- Parameters
meeting_range (list of ints) – The numbers of the IETF meetings to use for source data
threshold (float) – Defaults to None. If not None, activate entity resolution on the affiliations. Threshold value is used for the entity resolution.
- Returns
rel_name_affil (pandas.DataFrame)
rel_email_affil (pandas.DataFrame)
rel_name_email (pandas.DataFrame)
analysis.listserv¶
-
class
bigbang.analysis.listserv.
ListservMailList
(name: str, filepath: str, msgs: pandas.core.frame.DataFrame)¶ Bases:
object
Note
Issues loading 3GPP_TSG_RAN_WG1 which is 3.3Gb large
-
__iter__
()¶ Iterate over each message within the mailing list.
-
__len__
() → int¶ Get the number of messages within the mailing list.
-
add_thread_info
()¶ Edit pd.DataFrame to include extra column to identify which thread a message belongs to.
-
add_weight_to_edge
(dic: dict, key1: str, key2: str) → dict¶ - Parameters
dic –
key1 –
key2 –
-
static
contract
(count: numpy.array, label: list, contract: float) → Dict[str, int]¶ This function contracts all domain names that contributed to a mailing list below the contract threshold into one entity called Others. Meaning, if contract=3 and nokia.com and t-mobile.at each wrote fewer than three Emails to the mailing list in question, their contributions are summed into one entity denoted as Others.
- Parameters
count (Number of Emails send to mailinglist.) –
label (Names of contributers to mailinglist.) –
contract (Threshold below which all contributions will be summed.) –
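A standalone sketch of this contraction rule (illustrative, with hypothetical domains; not the library's exact code):
>>> def contract_sketch(count, label, threshold):
>>>     # sum every contributor below the threshold into 'Others'
>>>     contracted = {"Others": 0}
>>>     for c, lab in zip(count, label):
>>>         if c < threshold:
>>>             contracted["Others"] += c
>>>         else:
>>>             contracted[lab] = c
>>>     return contracted
>>> contract_sketch([1, 2, 10], ["nokia.com", "t-mobile.at", "ericsson.com"], 3)
{'Others': 3, 'ericsson.com': 10}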
-
create_sender_receiver_digraph
(nw: Optional[dict] = None, entity_in_focus: Optional[list] = None, node_attributes: Optional[Dict[str, list]] = None)¶ Create directed graph from messaging network created with ListservMailList.get_sender_receiver_dict().
- Parameters
nw (dictionary created with self.get_sender_receiver_dict()) –
entity_in_focus (This can be a list of domain names or localparts. If) – such a list is provided, the created digraph will only focus on their relations.
-
crop_by_address
(header_field: str, per_address_field: Dict[str, List[str]]) → bigbang.analysis.listserv.ListservMailList¶ - Parameters
header_field (For a Listserv mailing list the most representative) – header fields for senders and receivers are ‘from’ and ‘comments-to’ respectively.
per_address_field (Filter by 'local-part' or 'domain' part of an address.) –
- The data structure of the argument should be, e.g.:
{‘localpart’: [string-1, string-2, …]}
- Returns
- Return type
ListservMailList object cropped to specification.
-
crop_by_subject
(match=<class 'str'>, place: int = 2) → bigbang.analysis.listserv.ListservMailList¶ - Parameters
match (Only keep messages with subject lines containing match string.) –
place (Define how to filter for match. Use one of the following methods:) – 0 = using a regex expression, 1 = string ends with match, 2 =
- Returns
- Return type
ListservMailList object cropped to message subject.
-
crop_by_year
(yrs: Union[int, list]) → bigbang.analysis.listserv.ListservMailList¶ Filter self.df DataFrame by year in message date.
- Parameters
yrs (Specify a specific year, such as 2021, or a range of years, such) – as [2011, 2021].
- Returns
- Return type
ListservMailList object cropped to specification.
-
crop_dic_to_entity_in_focus
(dic: dict, entity_in_focus: list) → dict¶ - Parameters
entity_in_focus (This can be a list of domain names or localparts.) –
-
classmethod
from_mbox
(name: str, filepath: str, include_body: bool = True) → bigbang.analysis.listserv.ListservMailList¶
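A usage sketch, assuming a previously scraped list was saved to a hypothetical *.mbox path:
>>> from bigbang.analysis.listserv import ListservMailList
>>> mlist = ListservMailList.from_mbox(
>>>     name="3GPP_TSG_SA_WG2_UPCON",
>>>     filepath="/path/to/3GPP_TSG_SA_WG2_UPCON.mbox",
>>> )
>>> mlist_recent = mlist.crop_by_year([2019, 2021])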
-
classmethod
from_pandas_dataframe
(df: pandas.core.frame.DataFrame, name: Optional[str] = None, filepath: Optional[str] = None) → bigbang.analysis.listserv.ListservMailList¶
-
get_domains
(header_fields: List[str], return_msg_counts: bool = False, df: Optional[pandas.core.frame.DataFrame] = None) → dict¶ Get contribution of members per affiliation.
Note
For a Listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.
- Parameters
header_fields (Indicate which Email header field to process) – (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.
return_msg_counts (If 'True', return # of messages per domain.) –
-
get_domainscount
(header_fields: List[str], per_year: bool = False) → dict¶ - Parameters
header_fields (Indicate which Email header field to process) – (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.
per_year (Aggregate results for each year.) –
-
get_graph_prop_per_domain_per_year
(years: Optional[tuple] = None, func=<function betweenness_centrality>, **args) → dict¶ - Parameters
years –
func –
-
get_localparts
(header_fields: List[str], per_domain: bool = False, return_msg_counts: bool = False, df: Optional[pandas.core.frame.DataFrame] = None) → dict¶ Get contribution of members per affiliation.
- Parameters
header_fields (Indicate which Email header field to process) – (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.
per_domain –
return_msg_counts (If 'True', return # of messages per localpart.) –
-
get_localpartscount
(header_fields: List[str], per_domain: bool = False, per_year: bool = False) → dict¶ - Parameters
header_fields (Indicate which Email header field to process) – (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.
per_domain (Aggregate results for each domain.) –
per_year (Aggregate results for each year.) –
-
get_messagescount
(header_fields: Optional[List[str]] = None, per_address_field: Optional[str] = None, per_year: bool = False) → dict¶ - Parameters
header_fields (Indicate which Email header field to process) – (e.g. ‘from’, ‘reply-to’, ‘comments-to’). For a listserv mailing list the most representative header fields of senders and receivers are ‘from’ and ‘comments-to’ respectively.
per_year (Aggregate results for each year.) –
-
get_messagescount_per_timezone
(percentage: bool = False) → Dict[str, int]¶ Get contribution of messages per time zone.
- Parameters
percentage (Whether to return count of messages percentage w.r.t. total.) –
-
static
get_name_localpart_domain
(string: str) → tuple¶ Split an address field which (ideally) has a format such as ‘Heinrich von Kleist <Heinrich.vonKleist@SELBST.org>’ into name, local-part, and domain. All strings are returned in lower case only, to avoid duplicates.
Note
Test whether the incorporation of email.utils.parseaddr() can improve this function.
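Following the note above, a sketch of how email.utils.parseaddr() could perform this split (illustrative only):
>>> from email.utils import parseaddr
>>> name, addr = parseaddr("Heinrich von Kleist <Heinrich.vonKleist@SELBST.org>")
>>> localpart, _, domain = addr.lower().partition("@")
>>> (name.lower(), localpart, domain)
('heinrich von kleist', 'heinrich.vonkleist', 'selbst.org')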
-
get_sender_receiver_dict
(address_field: str = 'domain', entity_in_focus: Optional[list] = None, df: Optional[pandas.core.frame.DataFrame] = None) → Dict¶ - Parameters
address_field –
entity_in_focus (This can be a list of domain names or localparts. If such) – a list is provided, the created dictionary will only contain their information.
- Returns
Nested dictionary with first layer the ‘from’ domain keys and
the second layer the ‘comments-to’ domain keys with the
integer indicating the number of messages between them.
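A hedged sketch of how this method feeds into create_sender_receiver_digraph(), reusing the mlist object from the from_mbox sketch above:
>>> nw = mlist.get_sender_receiver_dict(address_field="domain")
>>> mlist.create_sender_receiver_digraph(nw=nw)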
-
get_threads
(return_length: bool = False) → dict¶ Collect all messages that belong to the same thread.
Note
Computationally very intensive.
- Parameters
return_length – If ‘True’, the returned dictionary will be of the form {‘subject1’: # of messages, ‘subject2’: # of messages, …}. If ‘False’, the returned dictionary will be of the form {‘subject1’: list of indices, ‘subject2’: list of indices, …}.
-
get_threadsroot
(per_address_field: Optional[str] = None, df: Optional[pandas.core.frame.DataFrame] = None) → dict¶ Find all unique message subjects. Replies are not treated as new subjects.
Note
The most reliable way to find the beginning of threads is to check whether the subject line of a message contains an element of reply_labels at the beginning. Checking whether the header field ‘comments-to’ is empty is not reliable, as ‘reply-all’ is often chosen by mistake, as seen here:
2020-04-01 10:08:58+00:00  joern.krause@etsi.org, juergen.hofmann@nokia.com
2020-03-26 21:41:27+00:00  joern.krause@etsi.org  NaN
2020-03-26 21:00:08+00:00  joern.krause@etsi.org  juergen.hofmann@nokia.com
- Some Emails start with ‘AW:’, which comes from German and has
the same meaning as ‘Re:’.
Some Emails start with ‘=?utf-8?b?J+WbnuWkjTo=?=’ or ‘=?utf-8?b?J+etlOWkjTo=?=’, which are UTF-8 encodings of the Chinese characters ‘回复’ and ‘答复’ both of which have the same meaning as ‘Re:’.
Leading strings such as ‘FW:’ are treated as new subjects.
- Parameters
per_address_field –
- Returns
A dictionary of the form {‘subject1’: index of message, ‘subject2’: …}
is returned. If per_address_field is specified, the subjects are sorted
into the domain or localpart from which they originate.
-
get_threadsrootcount
(per_address_field: Optional[str] = None, per_year: bool = False) → Union[int, dict]¶ Identify the number of conversation threads in the mailing list.
- Parameters
per_address_field (Aggregate results for each address field, which can) – be, e.g., from, send-to, received-by.
per_year (Aggregate results for each year.) –
-
static
iterator_name_localpart_domain
(li: list) → tuple¶ Generator for the self.get_name_localpart_domain() function.
-
period_of_activity
(format: str = '%a, %d %b %Y %H:%M:%S %z') → list¶ Return a list containing the datetime of the first and last message written in the mailing list.
-
static
to_percentage
(arr: numpy.array) → numpy.array¶
-
-
class
bigbang.analysis.listserv.
ListservMailListDomain
(name: str, filedsc: str, lists: pandas.core.frame.DataFrame)¶ Bases:
object
- Parameters
name – The name of the mail list domain (e.g. 3GPP, IEEE, …)
filedsc – The file description of the mail list domain
lists – A list containing the mailing lists as ListservMailList types
-
get_mlistscount_per_institution
()¶
-
classmethod
from_mbox
(name: str, directorypath: str, filedsc: str = '*.mbox') → bigbang.analysis.listserv.ListservMailListDomain¶
-
get_mlistscount_per_institution
() → Dict[str, int]¶ Get a dictionary that lists the mailing lists/working groups in which an institute/company is active.
-
exception
bigbang.analysis.listserv.
ListservMailListDomainWarning
¶ Bases:
BaseException
Base class for ListservMailListDomain class specific exceptions
-
exception
bigbang.analysis.listserv.
ListservMailListWarning
¶ Bases:
BaseException
Base class for ListservMailList class specific exceptions
analysis.thread¶
-
class
bigbang.analysis.thread.
Node
(ID, data=None, parent=None)¶ Bases:
object
Form a Node object. ID: Message ID, data: Information about that message, parent: the message’s reply-to
-
add_successor
(successor: list)¶ Add a node which has a message that is a reply to this node
-
get_data
()¶ Return the Information about this message
-
get_id
()¶ Return message ID
-
get_parent
()¶ Return Information in the data set about this message
-
get_successors
()¶ Return a list of nodes of messages which are replies to this node
-
properties
()¶ Return various properties about the tree with this node as root.
-
-
class
bigbang.analysis.thread.
Thread
(root, known_root=True)¶ Bases:
object
Form a thread object. root: the node of the message that starts the thread; known_root: indicator of whether the root node is in our data set.
-
get_content
()¶
-
get_duration
()¶ Return the time duration of the thread
-
get_leaves
()¶
-
get_not_leaves
()¶
-
get_num_messages
()¶ Return the number of messages in the thread
-
get_num_people
()¶ Return the number of people in the thread
-
get_root
()¶ Return the root node.
-
analysis.entity_resolution¶
Tools for resolving entities in a data set, in particular individual persons based on their name and email address.
-
bigbang.analysis.entity_resolution.
entity_resolve
(row, emailCol, nameCol)¶ Return a row with name and email by ID.
-
bigbang.analysis.entity_resolution.
getID
(name, email)¶ Get ID from a name and email.
-
bigbang.analysis.entity_resolution.
name_for_id
(id)¶ Return name by ID.
-
bigbang.analysis.entity_resolution.
store
(id, name, email)¶ Store name and email by ID.
analysis.graph¶
Tools for studying the graph of interactions between message senders. An interaction, for the purposes of this module, is a direct reply.
-
bigbang.analysis.graph.
ascendancy
(am)¶ Ulanowicz ecosystem health measure. Input is a weighted adjacency matrix.
-
bigbang.analysis.graph.
capacity
(am)¶ Return the capacity given an adjacency matrix.
-
bigbang.analysis.graph.
compute_ascendancy
(messages, duration=50)¶ Compute ascendancy given messages.
-
bigbang.analysis.graph.
interaction_graph_to_matrix
(dg)¶ Turn an interaction graph into a weighted edge matrix.
-
bigbang.analysis.graph.
messages_to_interaction_graph
(messages, verbose=False, clean=True)¶ Return an interaction graph given messages.
-
bigbang.analysis.graph.
messages_to_reply_graph
(messages)¶ Return a graph given messages.
-
bigbang.analysis.graph.
overhead
(am)¶ Return the overhead given an adjacency matrix.
analysis.process¶
-
bigbang.analysis.process.
ai
(m, parts, i)¶
-
bigbang.analysis.process.
bi
(m, parts, i)¶
-
bigbang.analysis.process.
consolidate_senders_activity
(activity_df, to_consolidate)¶ Takes a DataFrame in the format returned by activity and a list of tuples of the format (‘from 1’, ‘from 2’) to consolidate. Returns the consolidated DataFrame (a copy, not in place).
-
bigbang.analysis.process.
containment_distance
(a, b)¶ A case-insensitive distance measure on strings.
- Returns
0 if strings are identical
positive infinity if neither string contains the other
1 / (minimum string length) if one string contains the other.
Good for organizations. E.g. “cisco”, “Cisco”, and “Cisco Systems” are all ‘close’ (≤ .2).
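A standalone sketch consistent with the return values listed above (not necessarily the library's exact implementation):
>>> def containment_distance_sketch(a, b):
>>>     a, b = a.lower(), b.lower()
>>>     if a == b:
>>>         return 0
>>>     if a in b or b in a:
>>>         # one string contains the other
>>>         return 1 / min(len(a), len(b))
>>>     return float("inf")
>>> containment_distance_sketch("Cisco", "Cisco Systems")
0.2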
-
bigbang.analysis.process.
domain_name_from_email
(name)¶
-
bigbang.analysis.process.
eij
(m, parts, i, j)¶
-
bigbang.analysis.process.
from_header_distance
(a, b, verbose=False)¶ A distance measure specifically for the ‘From’ header of emails. Normalizes based on common differences in client handling of email, then computes Levenshtein distance between components of the field.
-
bigbang.analysis.process.
matricize
(series, func)¶ Create a matrix by applying func to pairwise combinations of elements in a Series. Returns a square matrix as a DataFrame. It should return a symmetric matrix if func(a, b) == func(b, a), and the identity matrix if func == ‘==’.
-
bigbang.analysis.process.
minimum_but_not_self
(column, dataframe)¶
-
bigbang.analysis.process.
modularity
(m, parts)¶ Compute modularity of an adjacency matrix. Use metric from:
Zanetti, M. and Schweitzer, F. 2012. “A Network Perspective on Software Modularity” ARCS Workshops 2012, pp. 175-186.
-
bigbang.analysis.process.
resolve_entities
(significance, distance_function, threshold=0)¶ Takes a Series mapping entities (index) to significance (values, numerical).
Resolves the entities based on a lexical distance function.
Returns a dictionary of labeled (keys) entity lists (values). Key is the most significant member of the entity list.
-
bigbang.analysis.process.
resolve_sender_entities
(act, lexical_distance=0)¶ Given an Archive’s activity matrix, return a dict of lists, each containing message senders (‘From’ fields) that have been groups to be probably the same entity.
-
bigbang.analysis.process.
sorted_matrix
(from_dataframe, limit=None, sort_key=None)¶ Takes a dataframe with ‘from’ fields for column headers.
Returns a sorted distance matrix for the column headers, using from_header_distance (see method).
analysis.twopeople¶
twopeople.py
Written by Raj Agrawal and Ki Deuk Kim
Contains functions used to analyze communication between two people in a mailing list. Examples can be found in the IPython notebook “Collaboration Robustness” in the examples folder. Each function needs a pandas DataFrame called “exchanges” that contains every two-pair communication between participants in a mailing list.
-
bigbang.analysis.twopeople.
duration
(exchanges, A, B)¶ Takes the two target people A and B and returns the amount of time they communicated on the mailing list, as a Timedelta.
-
bigbang.analysis.twopeople.
num_replies
(exchanges, A, B)¶ Returns the number of replies that two people A and B sent to each other in a tuple (# of replies from A to B, # of replies from B to A)
-
bigbang.analysis.twopeople.
panda_allpairs
(exchanges, pairs)¶ With given pairs of communication, returns a Pandas DataFrame that contains communication information between two people A and B in every pair
-
bigbang.analysis.twopeople.
panda_pair
(exchanges, A, B)¶ Forms a new Pandas DataFrame that contains information about communication between a pair A and B using functions provided above and returns the result
-
bigbang.analysis.twopeople.
reciprocity
(exchanges, A, B)¶ Returns the reciprocity of communication between two people A and B in float type. This expresses how interactively they communicated to each other
-
bigbang.analysis.twopeople.
unique_pairs
(exchanges)¶ Finds every unique pair (A, B) from the pandas DataFrame “exchanges” and returns them in set data type
analysis.utils¶
-
bigbang.analysis.utils.
clean_addresses
(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶
-
bigbang.analysis.utils.
clean_datetime
(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶
-
bigbang.analysis.utils.
clean_subject
(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶
-
bigbang.analysis.utils.
domain_entropy
(domain, froms)¶ Compute the entropy of the distribution of counts of email prefixes within the given archive.
- Parameters
domain (string) – An email domain
froms (pandas.DataFrame) – A pandas.DataFrame with From fields, email addresses, and domains. See the Archive method
get_froms()
- Returns
entropy
- Return type
float
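As a hedged illustration of the underlying computation (the log base used by the library is not shown here; this sketch uses the natural log):
>>> import numpy as np
>>> def entropy_sketch(counts):
>>>     # Shannon entropy of a distribution of counts
>>>     p = np.asarray(counts, dtype=float)
>>>     p = p / p.sum()
>>>     return float(-(p * np.log(p)).sum())
>>> round(entropy_sketch([1, 1, 2]), 4)
1.0397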
-
bigbang.analysis.utils.
extract_domain
(from_field)¶ Returns the domain of an email address from a string.
-
bigbang.analysis.utils.
extract_email
(from_field)¶ Returns an email address from a string.
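A hedged sketch of the kind of extraction these two helpers perform (the regex is illustrative, not the library's own):
>>> import re
>>> def extract_email_sketch(from_field):
>>>     # find the first substring that looks like an email address
>>>     m = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", from_field)
>>>     return m.group(0).lower() if m else None
>>> extract_email_sketch("Jane Doe <Jane.Doe@example.com>")
'jane.doe@example.com'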
-
bigbang.analysis.utils.
get_index_of_msgs_with_datetime
(df: pandas.core.frame.DataFrame, return_boolmask: bool = False) → numpy.array¶
-
bigbang.analysis.utils.
get_index_of_msgs_with_subject
(df: pandas.core.frame.DataFrame, return_boolmask: bool = False) → numpy.array¶
visualisation.lines¶
visualisation.plot¶
Complex plotting functions.
-
bigbang.visualisation.plot.
draw_adjacency_matrix
(G, node_order=None, partitions=[], colors=[], cmap='Greys', figsize=(6, 6))¶ G is a networkx graph.
node_order (optional) is a list of nodes, where each node in G appears exactly once.
partitions is a list of node lists, where each node in G appears in exactly one node list.
colors is a list of strings indicating what color each partition should be.
If partitions is specified, the same number of colors needs to be specified.
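A usage sketch with a built-in networkx example graph:
>>> import networkx as nx
>>> from bigbang.visualisation.plot import draw_adjacency_matrix
>>> G = nx.karate_club_graph()
>>> draw_adjacency_matrix(G, cmap="Greys", figsize=(6, 6))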
-
bigbang.visualisation.plot.
stack
(df, partition=None, smooth=1, figsize=(12.5, 7.5), time=True, cm=<matplotlib.colors.ListedColormap object>)¶ Plots a stackplot based on a dataframe. Includes support for partitioning and convolution.
df – a dataframe
partition – a dictionary or list of lists of columns of df; if a dictionary, keys are used as labels
smooth – an integer amount of convolution
visualisation.graphs¶
visualisation.utils¶
visualisation.stackedareachart¶
Contributing¶
The BigBang community welcomes contributions.
Release Procedure¶
When the community decides that it is time to cut a new release, the Core Developers select somebody to act as release manager. That release manager then performs the following steps.
Determine the next release number via the standards of semantic versioning.
Solicit a worthy name for the release.
Address any remaining tickets in the GitHub milestone corresponding to the release, perhaps moving them to other milestones.
Consult the GitHub records of merged PRs and issues to write release notes documenting the changes made in this release.
If the dependencies in main are not already frozen, use
pip freeze
to create a new frozen dependency list. Consider testing the code against unfrozen dependencies first to update version numbers.
Use the GitHub Releases interface to cut a new release from the main branch, with the selected name and number. This should create a new tag corresponding to the release commit.
Write a message to the BigBang development list announcing the new release, including the release notes.
README for BigBang Docs¶
To build the docs, go to the docs/ directory and run
make html
The built docs will be deposited in docs/_build
Release notes¶
v0.4.0 Syzygy¶
Work on this release was supported by the Prototype Fund.
Robust ListServ data ingress #460 #459 #457
New ReadTheDocs/Sphinx based documentation: domain name and organization metadata #414 #499 #548
New code submodule organization
3GPP analysis #465
datasets submodule for ancillary data #509
Integration of IETF datatracker source and analysis of IETF attendance data #368 #560 #434
IETF draft analysis #370
Tools to identify the institution of email senders #25
Improved test coverage #343
Change from nose to unittest for testing framework #366
Updates and corrections to example notebooks #364
Bug fixes #538 #553 #555 #390
Preliminary work towards entity resolution #405
v0.3.0 Joie de vivre¶
This release converted the codebase to Python 3 and introduced the DataTracker and LISTSERV data sources, along with several new scientific notebooks and maintenance improvements.
Converted to Python 3 (#347, #373, #382, #388)
Installation improvements (#345, #410, #423)
Tenure calculation (#355)
Improved documentation (#351, #389, #450)
Improved testing (#372, #443)
W3C data source improvements (#344, #381)
Organization entity resolution (#385)
Integration of IETF DataTracker data (#386, #394, #444)
Organization and affiliation analysis notebooks (#396)
Code style pre-commit hooks (#403)
LISTSERV data source (#409, #442, #454, #456)
0.2.0 Tulip Revolution¶
We have released BigBang v0.2.0 Tulip Revolution.
This release marks a new milestone in BigBang development.
Gender participation estimation
Improved support for IETF mailing list ingest
Extensive gardening of the example notebooks
Upgraded all notebooks to Jupyter 4
Improved installation process based on user testing
En route to this milestone, the BigBang community made a number of changes to its procedures. These include:
The adoption of a Governance document for guiding decision-making.
The adoption of a Code of Conduct establishing norms of respectful behavior within the community.
The creation of an ombudsteam for handling personal disputes.
We have also for this milestone adopted by community decision the GNU Affero General Public License v3.0.
0.1.0 Anteplanck I¶
An initial public release of BigBang. Proof of concept.