How do we create a well-labeled database of clothing images?
Our original plan was to create a curated shopping experience using machine learning; however, we ran into one huge issue fairly early in our development: where do we get our data? This “data problem” is not unique to us; it is a common issue for machine learning scientists. Computers are not intelligent beings; they cannot “deduce” what humans can. Some things cannot be automated without the help of humans, and labeling clothes (at least in the way we needed them to be labeled) was one of those things. In order to train a model to “recognize” certain aspects of a clothing image, we needed a dataset of images that were already well labeled. We were unable to find such a dataset, so we set out to create one ourselves.
The goal is to create a well-documented database of clothing images, specifically images taken from clothing websites. Each image will be stored with the name of the clothing item it shows, the gender (as labeled on the website), and the fiber make-up of the item (e.g., 25% cotton, 75% polyester). We will also label the category (pants, dress, etc.), subcategory (jeans, maxi, mini, etc.), pattern (plaid, polka dot, etc.), fabric (denim, fleece, tulle, etc.), color, tags, category tags, and subcategory tags. Category and subcategory tags will be specific to a category or subcategory, while general tags, such as ruffles, buttons, cutouts, or other embellishments, can apply to any item. We want the tags to allow for flexibility while still having a clear, straightforward organization system. There will be a fixed set of tags to check for each item; that is, it will be a discriminative, not generative, tagging system.
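To make the labeling scheme concrete, here is a minimal sketch of what one record and its fixed tag vocabulary might look like in Python. Everything in it is an assumption made for illustration: the class name `ClothingRecord`, the field names, the `validate` helper, and the tiny vocabularies (seeded only with the examples mentioned above, with jeans placed under pants and maxi and mini under dress as a guess). It is not the schema we have settled on.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical fixed vocabularies, seeded only with examples named in the text.
# The category -> subcategory grouping is an assumption for illustration.
GENERAL_TAGS = frozenset({"ruffles", "buttons", "cutouts"})
CATEGORIES = {
    "pants": frozenset({"jeans"}),
    "dress": frozenset({"maxi", "mini"}),
}


@dataclass
class ClothingRecord:
    image_path: str
    name: str
    gender: str                     # as labeled on the source website
    fiber_makeup: Dict[str, float]  # fiber -> percentage, e.g. {"cotton": 25.0}
    category: str
    subcategory: str
    pattern: str = ""
    fabric: str = ""
    color: str = ""
    tags: Set[str] = field(default_factory=set)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record is consistent."""
        problems = []
        if self.category not in CATEGORIES:
            problems.append(f"unknown category: {self.category}")
        elif self.subcategory not in CATEGORIES[self.category]:
            problems.append(f"unknown subcategory: {self.subcategory}")
        # Discriminative tagging: only tags from the fixed vocabulary are allowed.
        unknown = self.tags - GENERAL_TAGS
        if unknown:
            problems.append(f"unknown tags: {sorted(unknown)}")
        total = sum(self.fiber_makeup.values())
        if abs(total - 100.0) > 1e-6:
            problems.append(f"fiber percentages sum to {total}, not 100")
        return problems
```

Checking tags against a closed vocabulary is what makes the system discriminative: a labeler can only confirm or reject tags from the list, never invent new ones. Category and subcategory tags could be handled the same way, with one vocabulary per category.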
In order to develop our database, we plan to use Amazon Mechanical Turk, paid for with grant money. Essentially, this will allow us to hire people to label the images for us, since having our team of three label millions of images would take far too long. It will speed up labeling to the point where we can have enough labeled images to serve not only our own data needs but also those of other people pursuing similar projects. We plan to make this database accessible and downloadable to anyone by storing it on Amazon Web Services, which we will also pay for with grant money.
While we are very excited about this project, it was not our original plan. We learned that ambitions are often thwarted by problems we believe to already be solved. Instead of giving up, we decided to tackle one of those problems head-on, solving a problem that will not only help us fulfill our goals, but also help others achieve theirs in a fashionable way.
Once we have developed this database, we plan to pursue further projects. Not to reveal too many trade secrets, but we will use the dataset to create an algorithm that can curate a wardrobe matching your desired aesthetic. We want to gamify the shopping experience and make a “Tinder” for clothes. We will use your swipe data to improve the clothing suggestions, as most curated advertising does. We are also planning to use the database to develop web extensions that recommend cheaper or more sustainable alternatives to the clothing you are currently shopping for.
A special thanks to our mentor Bil Leon for being our rock throughout this process and the Tanner Dean’s Scholar program for funding our project.