What this website is about
This website lets you explore so-called
"word embeddings" (you can find a good introduction by Jay Alammar
here).
These are the first processing step in most natural language processing (NLP) AI systems: the words of a sentence are converted into vectors (you can think of a vector as a list of numbers; more information
here).
These vectors are then used by the AI system (a chatbot, translator, text generator, etc.) to compute its output from the language input.
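As a toy illustration of this first step, here is a tiny lookup table that maps words to short vectors. All the numbers and words below are invented for illustration only; the real embeddings on this website have 300 dimensions per word.

```python
# Toy word embeddings: each word maps to a short list of numbers.
# Real word2vec vectors have 300 dimensions; these values are invented.
embeddings = {
    "king":  [0.8, 0.3, 0.1],
    "queen": [0.7, 0.4, 0.9],
    "apple": [0.1, 0.9, 0.2],
}

def embed_sentence(sentence):
    """Convert a sentence into a list of vectors, word by word."""
    return [embeddings[word] for word in sentence.split()]

vectors = embed_sentence("king queen")
```

An AI system never sees the words themselves, only lists of vectors like `vectors` above.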
"Word embeddings" are computed from large corpora of existing text (e.g. Wikipedia, Twitter, blogs).
I hope that visitors to this website can experience that these "word embeddings" are already biased, in the sense that they convey stereotypes.
It is therefore likely that AI systems built on such embeddings behave in ethically objectionable ways (e.g. exhibit racial or gender bias).
The word vectors calculated by
word2vec (the popular word-embedding method used for this website) allow the calculation of semantic distances between words.
These distances can be visualized or used to calculate lists of semantically similar words.
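A minimal sketch of how such similarity lists can be computed, using cosine similarity on toy vectors (all values invented; Gensim's `most_similar` does essentially this over the full vocabulary):

```python
import math

# Toy vectors (invented values; real embeddings are 300-dimensional).
embeddings = {
    "king":  [0.8, 0.3, 0.1],
    "queen": [0.7, 0.4, 0.2],
    "apple": [0.1, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar(word, topn=2):
    """Rank all other words by similarity to the given word."""
    others = [w for w in embeddings if w != word]
    ranked = sorted(
        others,
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
        reverse=True,
    )
    return ranked[:topn]
```

With these made-up vectors, `most_similar("king", topn=1)` returns `["queen"]`, because "queen" points in a more similar direction than "apple".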
How to interpret the 3D visualization
When a 300-dimensional geometry (i.e. points in the word2vec space) is projected into 3-dimensional space, distances can in most cases not be preserved exactly.
The visible distances between the word spheres are therefore only approximately correct. The lines connect the closest neighbours (in the 300-dimensional word2vec space, not in the visible space!).
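This distortion can be seen in a tiny sketch: projecting 2-dimensional points onto a single axis (here simply by dropping a coordinate, a crude stand-in for the real dimensionality-reduction step) can change which points appear closest. The point names and coordinates are made up.

```python
import math

# Three toy points in 2D (the real case: 300D points projected to 3D).
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.5, 2.0)}

def nearest(name, coords):
    """Return the closest other point under Euclidean distance."""
    others = {n: c for n, c in coords.items() if n != name}
    return min(others, key=lambda n: math.dist(coords[name], others[n]))

# In the full 2D space, B is A's nearest neighbour.
full = nearest("A", points)

# Crude "projection": drop the second coordinate of every point.
projected = {n: (x,) for n, (x, y) in points.items()}
low = nearest("A", projected)
```

Here `full` is `"B"` but `low` is `"C"`: after the projection, A's apparent nearest neighbour is no longer its true one. This is why the connecting lines on this website are computed in the original 300-dimensional space.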
Text data used
The word embeddings for this website were built using the
Blog Authorship Corpus.
It consists of the posts of 19,320 bloggers collected from blogger.com in August 2004, with equal numbers of male and female bloggers.
Technical details
The word2vec word embeddings were created using the Python library
Gensim.
The database contains about 47,000 words; only the more common words from the text corpus are included.
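In Gensim, filtering out rare words is controlled by the `min_count` parameter of `Word2Vec`; the underlying idea can be sketched in plain Python (the corpus and threshold below are made up for illustration):

```python
from collections import Counter

# Toy corpus; the real one is the Blog Authorship Corpus.
sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

# Keep only words occurring at least this often (assumed threshold).
MIN_COUNT = 2

counts = Counter(word for sentence in sentences for word in sentence)
vocabulary = {word for word, n in counts.items() if n >= MIN_COUNT}
```

With these toy sentences, `vocabulary` is `{"the", "cat", "sat"}`: "dog" and "ran" appear only once and are dropped, just as rare words from the blog corpus were excluded from this website's database.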
The algorithm used for the 3D visualization can produce different results across runs.