For years, Babbel's customers learning English had been requesting a way to test out what they were learning with us and, ideally, get some sort of official recognition of their level of English. Around 2017, Duolingo, our biggest competitor, had released its own English assessment and certificate; however, it wasn't an official certification. So, when Babbel was approached by Cambridge English about the possibility of collaborating on an official digital English test, it seemed like the perfect time and opportunity.
I co-led this project together with a Product Manager from the New Business Initiatives team, and built an MVP from scratch to gather feedback.
Our goals included:
(a) Collaborate successfully with Cambridge English to meet their rigorous assessment requirements and set a foundation for continued collaboration.
(b) Build and release an online, moderated English assessment with a certificate confirming the result.
(c) Bundle the English test with Babbel's course learning product.
At Babbel, I was responsible for all things to do with learning English. As Babbel's language with the most learners, English was the natural choice when considering building an assessment. In addition, we had received a lot of requests for English tests, so when Cambridge English contacted us about a potential partnership, it was the perfect time to explore the opportunity. I worked as the product lead from my department (Didactics, responsible for the learning product), collaborating with a PM from the New Business Initiatives team, alongside a team of engineers and designers.
Now, friends, it's probably no surprise that Cambridge English and Babbel approach language learning very differently. For our first workshop together in Berlin, we on the Babbel side prepared sleek slides and post-its for ideating. Our partners at Cambridge, by contrast, brought binders full of photocopied exam papers and directories of rigorous guidelines for conducting thorough and fair examinations. We had envisaged the test running like a standard Babbel lesson, whereas they had pictured something more like a downloadable PDF to be printed, completed, and scanned in for marking.
Rather than dive right into designing our test, we used those first days to learn from each other. From Cambridge, we learned just how much goes into exam design, including how tricky it is to write a deceptively simple "fill in the gap"-style question. In return, we shared insights into how users interact with our digital platform, including their motivations and their requests for a test.
In a follow-up workshop – in Cambridge, this time – we were able to combine our approaches and design a simple, accurate, and technically feasible exam covering all beginner and intermediate levels.
Check out our model for the assessment below. All test takers start at the same point, with a set of questions at A2 (upper beginner) level. Depending on their score, they either continue with questions at the same level or are served intermediate-level questions. The experience was seamless, with no visible jumps between levels for the user.
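If you're curious how that routing might look in code, here's a minimal sketch of the idea. The threshold, the level list, and the function name are my own illustrative assumptions, not the actual exam logic.

```python
# Hypothetical sketch of the adaptive routing described above: everyone starts
# with a block of A2 questions, and the score on that block decides whether the
# next block stays at the same level or moves up. The threshold and levels here
# are illustrative, not the real exam parameters.

LEVEL_UP_THRESHOLD = 0.7  # illustrative pass mark for moving up a level
LEVELS = ["A2", "B1", "B2"]  # upper beginner through intermediate


def next_level(current_level: str, block_score: float) -> str:
    """Decide the level of the next question block from the last block's score."""
    if block_score >= LEVEL_UP_THRESHOLD and current_level != LEVELS[-1]:
        return LEVELS[LEVELS.index(current_level) + 1]  # move up a level
    return current_level  # otherwise keep testing at the same level


print(next_level("A2", 0.8))  # -> "B1"
print(next_level("A2", 0.5))  # -> "A2"
```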
If you're interested in some of the peculiarities of how exams intersect with user needs and product design, here are a couple of highlights.
No two exams will be the same. Why? We had a bank of many, many, many (many) questions that could be shown, so it was virtually impossible for two users to receive the exact same questions in the exact same order. Additionally, we tagged each question with two categories: (a) the skill being assessed, for example, word choice or grammatical structure; and (b) the topic, for example, business English or holiday-related English. The test was designed to surface a mixture of topics while serving the number of questions needed to fairly test each skill.
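To make the tagging idea a little more concrete, here's a rough sketch of drawing a mixed question set from a tagged bank. The data shapes, field names, and per-skill quotas are assumptions made up for the example, not Babbel's actual schema.

```python
import random
from collections import defaultdict

# Illustrative only: each question carries a skill tag and a topic tag, and the
# test draws the required number of questions per skill, shuffling the result so
# topics are mixed and no two test takers see the same questions in the same order.
question_bank = [
    {"id": 1, "skill": "word_choice", "topic": "business", "level": "A2"},
    {"id": 2, "skill": "grammar", "topic": "travel", "level": "A2"},
    # ... in reality, a bank of many, many more questions
]

questions_per_skill = {"word_choice": 5, "grammar": 5}  # assumed per-skill quotas


def draw_questions(bank, quotas, level):
    """Pick a randomised question set that still covers each skill fairly."""
    by_skill = defaultdict(list)
    for question in bank:
        if question["level"] == level:
            by_skill[question["skill"]].append(question)
    selection = []
    for skill, count in quotas.items():
        pool = by_skill[skill]
        selection.extend(random.sample(pool, min(count, len(pool))))
    random.shuffle(selection)  # mix skills and topics across the test
    return selection
```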
We needed to moderate. When I did exams at school in the UK, we sat in one big hall at desks spaced apart, with moderators walking around to prevent cheating. Obviously this kind of moderation isn't possible for a digital assessment, but we couldn't risk people cheating to unfairly earn a certificate, or taking screenshots of the questions to share online. The test was built so that screenshots were not possible, and users were instructed not to leave the browser tab or window, or risk their test being invalidated. Additionally, we integrated technology to monitor test takers through their webcams, which was a requirement to take the test. A still was taken at regular intervals, and we were able to spot when test takers left their computers for extended periods of time, or had other people come and help them. This, in combination with the large bank of questions, helped ensure the test was as fair as possible.
Certificates matter! We carried out research to identify exactly what users would want in a test, and while it was important that the test was rigorous and accurate, one thing mattered even more: the proof. We worked with a designer to test multiple iterations of our certificate, opting for one mimicking school qualifications that users said they'd feel "proud" to share on social media or use as proof for education or work.
Once is not enough. Most of our users took the test because they wanted the certificate as proof of English for work, i.e. an extrinsic motivator. Even so, we got many requests from test takers for recommendations on how to improve, and about 20% of users returned within two months to retake the test in the hope of achieving a higher level. This suggested the test was also an effective tool for helping learners develop the intrinsic motivation to challenge themselves and improve.
While the test itself got largely positive feedback, customer service received daily requests from test takers wanting feedback on their results. The difficulty was that, due to the nature of the assessment, we were not able to share which answers test takers got right and wrong. Additionally, our development team's capacity was needed for improving the test's moderation functionality.
For some users, I had time to manually check through the CSV exports of their answers and assess their areas of strength and weakness. I would then summarise these and recommend Babbel lessons that would help them improve. However, this took 2-3 hours per test taker, so it wasn't sustainable.
However, I knew that each answer was tagged with a skill and a topic, and that I had a Google Sheet listing all questions with their IDs. So I set to work abusing the VLOOKUP formula, creating a sheet where I could upload the CSV answer key for a test taker and generate a list of which skills and topics they'd answered correctly and incorrectly. I then graded each skill and topic into one of three bands: "strong", indicating 90%+ correct answers; "fair", indicating between 70% and 89% correct; and "improve", indicating less than 70% correct. I wrote a paragraph of feedback for each skill and topic across these three bands. This meant that for each answer key I uploaded, I could generate a report with feedback across all areas assessed for that individual.
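For the curious, here's roughly what that spreadsheet logic looks like translated into a small Python sketch. The CSV columns, question IDs, and tag lookup are placeholders I've invented for illustration; the real thing lived entirely in VLOOKUP formulas.

```python
import csv
from collections import defaultdict

# A rough Python equivalent of what the Google Sheet did: join each answer to its
# question's skill and topic tags, compute a success rate per category, and map
# that rate to one of the three feedback bands.
QUESTION_TAGS = {
    "Q101": ("grammar", "business"),
    "Q102": ("word_choice", "travel"),
}  # illustrative stand-in for the real question list


def band(rate: float) -> str:
    """Map a success rate to the 'strong' / 'fair' / 'improve' bands."""
    if rate >= 0.9:
        return "strong"
    if rate >= 0.7:
        return "fair"
    return "improve"


def feedback_report(answer_csv_path: str) -> dict:
    """Build a per-skill and per-topic band report from one test taker's answers."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(answer_csv_path) as f:
        # assumed columns: question_id, is_correct ("1" or "0")
        for row in csv.DictReader(f):
            skill, topic = QUESTION_TAGS[row["question_id"]]
            for category in (skill, topic):
                total[category] += 1
                correct[category] += int(row["is_correct"] == "1")
    return {category: band(correct[category] / total[category]) for category in total}
```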
I then associated each "improvement area" with a concrete recommendation for a Babbel course. For example, if a user had problems with a certain grammatical tense, I would recommend the corresponding lessons in our grammar courses.
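Continuing the sketch above, the mapping step is just a lookup from "improve" areas to courses; the course names here are made up for illustration.

```python
# Hypothetical mapping from improvement areas to course recommendations.
COURSE_RECOMMENDATIONS = {
    "grammar": "Grammar course: past tenses refresher",
    "business": "Business English course, lessons 3-5",
}


def recommend(report: dict) -> list:
    """Turn a band report into a list of Babbel course recommendations."""
    return [
        COURSE_RECOMMENDATIONS[category]
        for category, level in report.items()
        if level == "improve" and category in COURSE_RECOMMENDATIONS
    ]
```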
So, ultimately, my minimum viable Google Sheet (MVGS is a thing now) was able to generate a fully personalised feedback profile with tailored course recommendations for each user in about 2 minutes (give or take a few more if the Sheet crashed, which it did with some regularity). I checked the first couple of dozen reports myself to make sure they were correct, and with a couple of tweaks, we were able to adapt the test's value proposition to include feedback. With the reports in place, 98% of feedback on our test experience was positive.
Additionally, this let us more closely bundle the test with Babbel's core offering, and upsell a second (discounted) test attempt to users who followed our recommendations in the feedback reports.
As a side note – I still get slightly triggered whenever anyone mentions VLOOKUP. ✌️
I went to secondary school in Cambridge, and during our workshop, stayed at a hotel right next door to where I studied, which was bizarre.
Even more random: one of the product leads from the Cambridge side was my childhood best friend's mum. It's a small world.