What this website is about

This website lets you explore so-called "word embeddings" (you can find a good introduction by Jay Alammar here). They are the first processing step in most natural language processing (NLP) A.I. systems: the words of a sentence are converted into vectors (you can think of a vector as a list of numbers, more information here). These vectors are then used by the A.I. (chatbots, translators, text generators etc.) to compute its output from language input. Word embeddings are calculated from large corpora of existing text data (e.g. Wikipedia, Twitter, blogs).

I hope this website lets people experience that these word embeddings are already biased in the sense that they convey stereotypes. It is also likely that A.I. based on such embeddings behaves in ethically objectionable ways (e.g. becomes racist or gender-biased).

The word vectors calculated by word2vec (the popular method for calculating word embeddings that is used for this website) allow the calculation of semantic distances between words. These distances can be visualized or used to compile lists of semantically similar words.
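To make the idea of a semantic distance more concrete, here is a minimal sketch in Python. It assumes cosine similarity as the measure, which is the usual choice with word2vec; the text above does not say which measure this website uses. The three-dimensional `toy_vectors` and the helper names are made up for illustration and are not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=3):
    """Rank all other words by their cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors; real word2vec vectors have hundreds of dimensions.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
    "bus": [0.0, 0.8, 0.4],
}

print(most_similar("cat", toy_vectors, topn=2))
```

With these toy numbers "dog" ranks first for "cat", because the two vectors point in almost the same direction. A list of semantically similar words is simply such a ranking over the whole vocabulary.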

How to interpret the 3D visualization

When a 300-dimensional geometry (i.e. points in the word2vec space) is projected into 3-dimensional space, distances can in most cases not be preserved accurately. The visible distances between the word spheres are therefore only approximately correct. The lines connect the closest neighbours (as measured in the 300-dimensional word2vec space, not in the visible space!).
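The following sketch illustrates why the visible distances are only approximate. It is not the algorithm used for this website: as a stand-in it uses the crudest possible projection (keeping only the first three coordinates), and the points are random numbers rather than real word vectors. The neighbours are picked in the full 300-dimensional space, as the lines in the visualization are.

```python
import math
import random


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbour(index, points):
    """Index of the point closest to points[index], measured in the full space."""
    others = (i for i in range(len(points)) if i != index)
    return min(others, key=lambda i: euclidean(points[index], points[i]))


rng = random.Random(0)  # fixed seed so the sketch gives the same numbers each run

# Hypothetical stand-ins for word vectors: 5 random points in 300 dimensions.
high_dim = [[rng.gauss(0.0, 1.0) for _ in range(300)] for _ in range(5)]

# Naive projection (an assumption for illustration): keep the first 3 coordinates.
low_dim = [point[:3] for point in high_dim]

for i in range(len(high_dim)):
    j = nearest_neighbour(i, high_dim)  # neighbour chosen in 300 dimensions
    full = euclidean(high_dim[i], high_dim[j])
    visible = euclidean(low_dim[i], low_dim[j])
    print(f"point {i} -> neighbour {j}: 300-D {full:.2f}, 3-D {visible:.2f}")
```

Each pair ends up with a different ratio between its 300-dimensional and its 3-dimensional distance, so a point that looks closest on screen is not necessarily the closest one in the word2vec space. Better projection methods reduce this distortion but generally cannot remove it.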

Text data used

The word embeddings for this website were built using the Blog Authorship Corpus. It consists of the posts of 19,320 bloggers collected from blogger.com in August 2004. There are equal numbers of male and female bloggers.

Technical details

The word2vec word embeddings were created using the Python library Gensim. The database contains about 47k words. Only the more common words from the text corpus are included in the database. The algorithm used for the 3D visualization can produce different results on different runs.