Taking Precise Measurements

salvēte, amīcī et sodālēs! Before we go on to another set of exercises and quizzes, or even another story from the Tres Columnae project, I wanted to take some time to think through a very important, but often unexamined, issue in teaching languages, and especially in teaching Latin and Greek. It’s an issue of measurement and assessment – a critical one, in fact: how do we know that our measurements (the quizzes and tests we give our students) are actually measuring what we want them to measure? Statisticians and experts in assessment refer to this idea as validity; it’s closely linked with a related concept called reliability, which has to do with how close a learner’s scores would be if he/she took the test or quiz more than once. The closer the scores, the more reliable; the more the instrument measures what it’s supposed to measure, the more valid.
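(For those who like to see such ideas made concrete: one common way statisticians put a number on reliability is to correlate the scores from two administrations of the same instrument, the so-called test-retest approach. Here is a minimal sketch in Python; the scores are invented purely for illustration.)

```python
# Test-retest reliability, estimated as the Pearson correlation between two
# sittings of the same quiz. The scores below are invented for illustration.
from statistics import mean, stdev

first_try  = [8, 6, 9, 5, 7, 10, 4, 6]   # each learner's score, first sitting
second_try = [7, 6, 9, 4, 8, 10, 5, 5]   # the same learners, second sitting

def pearson_r(xs, ys):
    """Pearson correlation: the closer to 1.0, the more consistent the scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

print(f"test-retest reliability estimate: {pearson_r(first_try, second_try):.2f}")
```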

When I closed yesterday’s post, I made this point about translation, both as an instructional tool and as an assessment:

I also think it’s a tool that can easily be overused … or even used when it’s not the best tool for the job. Not even a Swiss Army knife is the perfect tool for every job; for example, it would be hard to use one to light up a darkened room! 🙂

My primary concern with over-using translation as an instructional tool is that it keeps our language learners focused on their first language rather than on the language they’re learning. After all, if the only thing you do with a Latin passage is to translate it into, say, English, that would seem to imply a couple of things. First, it implies that English (or whichever language you’re translating into) is the “real” or “primary” language, while Latin is simply a complicated code from which you have to extract the “English meaning.” Second, and consequently, it implies that English (or whichever language you’re translating into) is superior and Latin is inferior. I’ve run into too many advanced Latin students (not mine, usually) who think the Romans actually thought in English but translated their thoughts into Latin! 😦 Of course, that’s a common belief among learners of any language, but it needs to be dispelled, not encouraged. My fear is that overuse of translation in instruction actually confirms this belief, and my hope is that regular communicative interactions in the language (even the simple multiple-choice responses we’ve looked at in this series of posts) will help learners overcome this and other false preconceptions about the relationships between languages. In keeping with our tool metaphor, translation would be a useful but specialized tool for instruction – more like a set of metric sockets than a Swiss Army knife. (You don’t need them every day, but as I was reminded recently when I had to replace the battery in a Volvo, when you need them, you really need them!)

So much for the overuse of translation in instruction. My larger concern is the overuse of translation in assessment, which is why I’ve taken such pains in this series of posts to demonstrate other ways (including a bunch of Latin-only ways) to assess both reading comprehension and grammatical analysis without using translation. My biggest concern with translation as an assessment tool – whether for comprehension or for analytical work with the grammar of the language – is that translation is too complicated a task to satisfy anyone’s criteria for validity or reliability. Specifically, I think there are too many variables, both in the learner’s task and in the assessor’s, and the criteria for an acceptable performance are often too vague. (I think of the plaintive questions about “how to grade translations” – usually asked after the translations have been assigned – on the Latinteach listserv over the years, and the perennial questions about “is this translation acceptable” on the AP-Latin listserv.)

For example, consider this sentence from the Tres Columnae story we’ve focused on since Friday:

haec tamen pauca tibi et sorōrī explicāre possum.

  • What criteria for accuracy of translation would you establish for this sentence?
  • How would you communicate them to a learner, in advance, without “giving away” the translation of the sentence to them?
  • What kinds of feedback would you give for “translation errors” produced by a student?

And how would you convert the student’s response into a numeric score?

In the context of Lectiō Octāva, the “new things” to be tested are the datives (tibi and sorōrī). The relatively new things that might still cause trouble for learners are the complementary infinitive explicāre and the meanings of the words haec and possibly possum. In the Tres Columnae system, we’d ask direct questions about these specific items, if they were what we wanted to measure. For example, to test grammatical analysis, we might ask:

  1. cuius modī est explicāre?
    1. indicātīvī
    2. coniunctīvī
    3. imperātīvī
    4. infinītīvī
  2. cuius cāsūs est sorōrī?
    1. nōminātīvī
    2. genitīvī
    3. datīvī
    4. accūsātīvī

To test comprehension, we might ask:

  1. quid Impigra facere vult?
    1. rem nārrāre
    2. rem audīre
    3. līberōs laudāre
    4. līberōs pūnīre
  2. quis hanc periodum audit?
    1. Rapidus
    2. Rapida
    3. et Rapidus et Rapida
    4. nec Rapidus nec Rapida

Depending on the learner’s patterns of correct and incorrect responses (which would be tracked, of course, in the Tres Columnae Moodle course), it would be easy for the teacher – and the learner herself – to see patterns of errors and to determine the logical next area of focus for the learner. It would also be fairly easy to assess the reliability and validity of any given question by comparing it with others that, ostensibly, measure the same skill.
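To make that tracking idea a little more concrete, here is a minimal sketch of what such a response log and the resulting error-pattern report might look like. To be clear, this is not Moodle’s actual data model; the learner names, item IDs, and skill tags are invented for illustration.

```python
# A toy response log: each record is (learner, item_id, skill_tag, correct?).
# These names and tags are hypothetical, not Moodle's real data structures.
from collections import defaultdict

responses = [
    ("Anna",   "Q1", "infinitive-mood", True),
    ("Anna",   "Q2", "dative-case",     False),
    ("Anna",   "Q3", "dative-case",     False),
    ("Brutus", "Q1", "infinitive-mood", True),
    ("Brutus", "Q2", "dative-case",     True),
    ("Brutus", "Q3", "dative-case",     False),
]

# Per-learner, per-skill accuracy: where should each learner focus next?
per_learner = defaultdict(lambda: defaultdict(list))
for learner, item, skill, correct in responses:
    per_learner[learner][skill].append(correct)

for learner, skills in per_learner.items():
    for skill, results in skills.items():
        pct = 100 * sum(results) / len(results)
        print(f"{learner:8s} {skill:16s} {pct:5.1f}% correct")

# Per-item difficulty: an item that behaves very differently from others
# tagged with the same skill is worth a second look (a validity red flag).
per_item = defaultdict(list)
for _, item, skill, correct in responses:
    per_item[(skill, item)].append(correct)

for (skill, item), results in sorted(per_item.items()):
    print(f"{skill:16s} {item}: {100 * sum(results) / len(results):.0f}% correct")
```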

But how, exactly, do you “test” these things with a translation? And how do you give useful feedback?

For example, suppose the student, assigned to translate this sentence, says or writes,

“These few things are possible to be explained to you and your sister.”

It’s a “wrong translation” because of how it handles explicāre and possum and how it doesn’t handle tamen. And yet the student apparently has grasped the function of the two datives; has some idea that explicāre is an infinitive; has correctly determined that haec and pauca go together; and has a general idea of what Impigra is saying to Rapidus and Rapida.

Even if the teacher used a rubric for grading translations – and if that rubric had been shared with the learners – scoring might be a bit problematic. But what if the teacher uses “points” or marks rather than a rubric? How would you convert those problems into a grade – or into meaningful feedback?

Some teachers might choose the “point per word” method. But does that give credit for haec … pauca (accusative in the original, but the subject of the student’s English version)? And what about explicāre, which is almost, but not quite, “to be explained”? Depending on the teacher, this sentence might end up with a score of 3/8 (for tibi et sorōrī), 4.5/8 (half credit for explicāre, haec, and pauca), or even 5/8 (an additional half credit for possum) … or anywhere from 37.5% to 62.5% credit. That’s a big range of scores … and a very low set of scores, too, given that the learner did, in fact, understand what was going on with the sentence.
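Here is that arithmetic laid out as a tiny script, just to show how far the grade swings on those judgment calls. The three partial-credit policies are the hypothetical ones described above, not anyone’s official scoring guide.

```python
# Point-per-word scoring of "haec tamen pauca tibi et sorōrī explicāre possum"
# under three hypothetical partial-credit policies a teacher might apply.
words = ["haec", "tamen", "pauca", "tibi", "et", "sorōrī", "explicāre", "possum"]

strict   = {"tibi": 1, "et": 1, "sorōrī": 1}                          # 3/8
moderate = {**strict, "explicāre": 0.5, "haec": 0.5, "pauca": 0.5}    # 4.5/8
lenient  = {**moderate, "possum": 0.5}                                # 5/8

for label, credit in [("strict", strict), ("moderate", moderate), ("lenient", lenient)]:
    score = sum(credit.get(w, 0) for w in words)
    print(f"{label:8s}: {score:g}/{len(words)} = {100 * score / len(words):.1f}%")
```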

Other teachers might choose a segment-scored or chunk-scored method like the one used by the Advanced Placement Program. In that case, the segments would probably be:

  1. haec pauca
  2. tamen
  3. tibi et sorōrī
  4. explicāre possum.

Again, the student gets credit for one segment (tibi et sorōrī), for a score of 1/4, or 25%. Or, if the teacher is “kind” and gives partial credit for partly-correct segments, the score might be 1.5/4 (half credit for haec pauca) or even 2/4 (an additional half credit for explicāre possum). Again a wide range of scores – but a very different one from the point-per-word range – and still quite low, given that the student did, in fact, understand the point of the sentence!
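Here is the same kind of sketch for the chunk-scored approach, again with hypothetical partial-credit decisions. Same student, same understanding of the sentence, yet the grade can land anywhere from 25% to 62.5% depending on the method and the grader.

```python
# Segment ("chunk") scoring of the same sentence under three hypothetical
# credit decisions; the segment divisions follow the list above.
segments = ["haec pauca", "tamen", "tibi et sorōrī", "explicāre possum"]

policies = {
    "strict":   {"tibi et sorōrī": 1},                                                # 1/4
    "moderate": {"tibi et sorōrī": 1, "haec pauca": 0.5},                             # 1.5/4
    "lenient":  {"tibi et sorōrī": 1, "haec pauca": 0.5, "explicāre possum": 0.5},    # 2/4
}

for label, credit in policies.items():
    score = sum(credit.get(seg, 0) for seg in segments)
    print(f"{label:8s}: {score:g}/{len(segments)} = {100 * score / len(segments):.1f}%")
```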

Unfortunately, when translation is used as the only assessment tool for comprehension and grammatical analysis, it’s very difficult for teachers (or other assessors) to be consistent in their scoring … and this tends to make test designers, who are worried about validity and reliability, very nervous. That’s one reason why so many test designers and publishers, especially in the current U.S. climate, use multiple-choice responses so heavily: they may not be perfect, but at least the machine scoring the responses will do so with consistency. Assessors can also be trained to apply a rubric pretty consistently – the fewer levels in the rubric, the more reliable it will be – but non-rubric-scored, non-forced-choice responses will always raise some validity or reliability concerns.
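If you are curious why fewer rubric levels tends to mean more consistent scoring, here is a small simulation sketch. The noise model is my own assumption for illustration, not data from real raters.

```python
# Two raters judge the same pile of translations. Each rater perceives the
# "true" quality with a little random noise, then assigns a rubric level.
# With the same amount of noise, a coarser rubric yields more exact agreement.
import random

random.seed(42)

def agreement_rate(levels, trials=10_000, noise=0.08):
    agree = 0
    for _ in range(trials):
        true_quality = random.random()                      # latent quality in [0, 1)
        seen_a = true_quality + random.gauss(0, noise)      # rater A's noisy impression
        seen_b = true_quality + random.gauss(0, noise)      # rater B's noisy impression
        rater_a = min(max(int(seen_a * levels), 0), levels - 1)
        rater_b = min(max(int(seen_b * levels), 0), levels - 1)
        agree += (rater_a == rater_b)
    return agree / trials

for levels in (2, 4, 6, 9):
    print(f"{levels}-level rubric: exact agreement in {agreement_rate(levels):.0%} of cases")
```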

quid respondētis, amīcī?

  • What do you think of my concerns about the validity and reliability of translation?
  • Or do they just make you angry because “we’ve always done it that way” and I seem to be upsetting the apple cart?
  • Do you see ways to make translation-type assessments more valid and more reliable?
  • What do you think of our alternatives to translation?
  • What concerns about validity or reliability do they raise for you?

Tune in next time, when we’ll respond to your concerns, share some more questions, and preview the next series of posts. intereā, grātiās maximās omnibus iam legentibus et respondentibus.