What this website is about
This website lets you explore so-called
"word embeddings" (you can find a good introduction by Jay Alammar
here).
These are the first processing step in most natural language processing (NLP) AI systems: the words of a sentence are converted into vectors (you can think of a vector as a list of numbers; more information
here).
These vectors are then used by the AI system (a chatbot, translator, text generator, etc.) to compute its output from the language input.
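As a toy illustration of this first step, here is a tiny lookup table that maps words to short vectors. All the numbers and words below are invented for illustration only; the real embeddings on this website have 300 dimensions per word.

```python
# Toy word embeddings: each word maps to a short list of numbers.
# Real word2vec vectors have 300 dimensions; these values are invented.
embeddings = {
    "king":  [0.8, 0.3, 0.1],
    "queen": [0.7, 0.4, 0.9],
    "apple": [0.1, 0.9, 0.2],
}

def embed_sentence(sentence):
    """Convert a sentence into a list of vectors, word by word."""
    return [embeddings[word] for word in sentence.split()]

vectors = embed_sentence("king queen")
```

An AI system never sees the words themselves, only lists of vectors like `vectors` above.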
"Word embeddings" are computed from large corpora of existing text (e.g. Wikipedia, Twitter, blogs).
I hope that visitors to this website can experience that these "word embeddings" are already biased, in the sense that they convey stereotypes.
It is therefore likely that AI systems built on such embeddings behave in ethically objectionable ways (e.g. exhibit racial or gender bias).
The word vectors calculated by
word2vec (the popular word-embedding method used for this website) allow the calculation of semantic distances between words.
These distances can be visualized or used to calculate lists of semantically similar words.
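A minimal sketch of how such similarity lists can be computed, using cosine similarity on toy vectors (all values invented; Gensim's `most_similar` does essentially this over the full vocabulary):

```python
import math

# Toy vectors (invented values; real embeddings are 300-dimensional).
embeddings = {
    "king":  [0.8, 0.3, 0.1],
    "queen": [0.7, 0.4, 0.2],
    "apple": [0.1, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar(word, topn=2):
    """Rank all other words by similarity to the given word."""
    others = [w for w in embeddings if w != word]
    ranked = sorted(
        others,
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
        reverse=True,
    )
    return ranked[:topn]
```

With these made-up vectors, `most_similar("king", topn=1)` returns `["queen"]`, because "queen" points in a more similar direction than "apple".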
How to interpret the 3D visualization
When a 300-dimensional geometry (i.e. points in the word2vec space) is projected into 3-dimensional space, distances can in most cases not be preserved exactly.
The visible distances between the word spheres are therefore only approximately correct. The lines connect the closest neighbours (in the 300-dimensional word2vec space, not in the visible space!).
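This distortion can be seen in a tiny sketch: projecting 2-dimensional points onto a single axis (here simply by dropping a coordinate, a crude stand-in for the real dimensionality-reduction step) can change which points appear closest. The point names and coordinates are made up.

```python
import math

# Three toy points in 2D (the real case: 300D points projected to 3D).
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.5, 2.0)}

def nearest(name, coords):
    """Return the closest other point under Euclidean distance."""
    others = {n: c for n, c in coords.items() if n != name}
    return min(others, key=lambda n: math.dist(coords[name], others[n]))

# In the full 2D space, B is A's nearest neighbour.
full = nearest("A", points)

# Crude "projection": drop the second coordinate of every point.
projected = {n: (x,) for n, (x, y) in points.items()}
low = nearest("A", projected)
```

Here `full` is `"B"` but `low` is `"C"`: after the projection, A's apparent nearest neighbour is no longer its true one. This is why the connecting lines on this website are computed in the original 300-dimensional space.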
Text data used
The word embeddings for this website were built using the
Blog Authorship Corpus.
It consists of the posts of 19,320 bloggers collected from blogger.com in August 2004, with equal numbers of male and female bloggers.
Technical details
The word2vec word embeddings were created using the Python library
Gensim.
The database contains about 47,000 words; only the more common words from the text corpus are included.
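In Gensim, filtering out rare words is controlled by the `min_count` parameter of `Word2Vec`; the underlying idea can be sketched in plain Python (the corpus and threshold below are made up for illustration):

```python
from collections import Counter

# Toy corpus; the real one is the Blog Authorship Corpus.
sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

# Keep only words occurring at least this often (assumed threshold).
MIN_COUNT = 2

counts = Counter(word for sentence in sentences for word in sentence)
vocabulary = {word for word, n in counts.items() if n >= MIN_COUNT}
```

With these toy sentences, `vocabulary` is `{"the", "cat", "sat"}`: "dog" and "ran" appear only once and are dropped, just as rare words from the blog corpus were excluded from this website's database.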
The algorithm used for the 3D visualization can produce different results across runs.