What is EDI ?
EDI has been shrouded in obscurity for ages. It is a cumbersome acronym for something that has been defined in as many ways as there are EDI standards. Remember the hero with a thousand faces ?
Let's assume that, for the purpose of this article, EDI is a collective term for a data format or a markup language. Just like XML. It's a fair analogy.
Everybody knows XML. It's easy to manipulate it programmatically and it's free. But what about EDI ? I'll try to lay it out from a programmer's perspective.
EDI Document Structure
There are multiple EDI standards, each targeting either a business vertical (manufacturing, automotive, retail, healthcare) or a geographic location (North America, Europe, UK, Germany, etc.). The prevalent ones are X12 (mostly North America) and EDIFACT (mostly Europe), including their derivatives (HIPAA, EANCOM, etc.). The standards differ substantially in their tags and formal description; however, the underlying principles are very similar.
A picture is worth a thousand words they say. Here is the one that describes EDI X12 envelopes:
and one for EDIFACT envelopes:
For more information regarding the structure of EDI please take a look at the following resources:
That's pretty straightforward. EDI documents are a combination of transactions (or messages or business data) wrapped up in functional groups, which are themselves wrapped up in interchange envelopes. Three levels of nesting. Let's move on to the purpose of each level.
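To make the three levels of nesting concrete, here is a minimal sketch in Python. The class and field names (Interchange, FunctionalGroup, Transaction and so on) are my own, chosen for illustration; they don't come from any EDI standard or library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transaction:
    """Innermost level: one business document, e.g. an invoice."""
    identifier: str                                   # e.g. "810" or "INVOIC"
    segments: List[str] = field(default_factory=list)


@dataclass
class FunctionalGroup:
    """Middle level: a batch of transactions."""
    transactions: List[Transaction] = field(default_factory=list)


@dataclass
class Interchange:
    """Outermost level: the envelope with sender, receiver and control number."""
    sender: str
    receiver: str
    control_number: str
    groups: List[FunctionalGroup] = field(default_factory=list)

    def transaction_count(self) -> int:
        # Walk all groups and count the transactions they contain.
        return sum(len(group.transactions) for group in self.groups)


# Hypothetical values, for demonstration only.
interchange = Interchange(
    sender="SENDERID",
    receiver="RECEIVERID",
    control_number="000000131",
    groups=[FunctionalGroup(transactions=[Transaction("810"), Transaction("810")])],
)
print(interchange.transaction_count())
```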
EDI envelopes (or interchanges) carry information about the sender and the receiver, e.g. the sender and receiver codes and qualifiers. Every EDI exchange is set up as an agreement in both the sender's and the receiver's software systems. The EDI codes uniquely identify the two parties, the same way the sender's and receiver's addresses are written on postcards or paper envelopes.
Envelopes also tell you the date and time when the document was created and the control number of the envelope. The combination of the sender code, the receiver code and the control number is unique and is used to rule out any duplicates. It's like a temporary universal primary key for each envelope. I say temporary because envelope control numbers are reset every 30 days or so, but this varies between partners.
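A duplicate check along these lines could be sketched as follows. This is an in-memory illustration only; the DuplicateDetector class is hypothetical, and a real system would most likely persist the keys and expire them according to whatever reset period was agreed with the partner.

```python
class DuplicateDetector:
    """Remembers (sender, receiver, control number) keys already seen."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, sender: str, receiver: str, control_number: str) -> bool:
        # Envelope fields are often space padded, so normalize before comparing.
        key = (sender.strip(), receiver.strip(), control_number.strip())
        if key in self._seen:
            return True
        self._seen.add(key)
        return False

    def reset(self) -> None:
        # Call this when the partner restarts their control numbers.
        self._seen.clear()


detector = DuplicateDetector()
first = detector.is_duplicate("SENDERID", "RECEIVERID", "000000131")
second = detector.is_duplicate("SENDERID       ", "RECEIVERID", "000000131")
print(first, second)
```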
Envelopes also drive the automatic generation of technical acknowledgments, e.g. whether one is required or not, and indicate whether the data contained is for testing or production purposes. The trailer contains something like a checksum: a control number that must tally with the one in the header, and a count of the included items.
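For X12 the tally could be checked roughly like this. In X12 the trailer is the IEA segment: IEA01 holds the number of included functional groups and IEA02 repeats the interchange control number from ISA13. The function name and the list-of-elements input are my own assumptions; the segments are expected to be already split into data elements.

```python
from typing import List


def check_x12_envelope(isa: List[str], iea: List[str], group_count: int) -> List[str]:
    """Return a list of problems found; an empty list means the envelope tallies.

    isa and iea are segments already split into elements, tag at index 0.
    group_count is the number of functional groups actually parsed.
    """
    problems = []
    # ISA13 (header control number) must equal IEA02 (trailer control number).
    if isa[13].strip() != iea[2].strip():
        problems.append("control number mismatch")
    # IEA01 declares how many functional groups the interchange contains.
    declared = iea[1].strip()
    if not declared.isdigit() or int(declared) != group_count:
        problems.append("functional group count mismatch")
    return problems


# Made-up envelope: only the elements the check reads are filled in.
isa_elements = ["ISA"] + [""] * 12 + ["000000131", "0", "T", ">"]
iea_elements = ["IEA", "1", "000000131"]
print(check_x12_envelope(isa_elements, iea_elements, group_count=1))
```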
In terms of parsing or translating EDI documents programmatically, the envelopes are important because they also define the set of delimiters (or separators). Without the delimiters no EDI document can be interpreted.
EDI delimiters (or separators) are a notable difference across the EDI standards and each standard has its own way of defining them. I'll only cover X12 and EDIFACT here.
Let's begin with X12 (also identical for all HIPAA transactions such as 837 or 835).
All separators are defined in the ISA interchange header. The ISA is positional and contains 106 characters. The first 3 are the segment tag 'ISA' and immediately after is the data element separator. The last two characters in the ISA are the component data element separator and the segment terminator. Sometimes segments are followed by extra characters, and things such as CR/LF can appear at the end of each segment.
If we have the sample document:
The separators are (positions within the ISA segment):
data element separator at position 4 *
component data element separator at position 105 >
segment terminator at position 106 ~
The repetition separator is at position 83, but only if the ISA version, which starts at position 85, is greater than '00401'. In the sample above it is not (the version is '00204') and therefore the 'U' at position 83 is the standards identifier instead of a repetition separator.
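The positional rules above translate into a few lines of Python. This is a sketch, not a full parser: the x12_separators and make_isa names are hypothetical, and make_isa only builds a made-up, correctly padded ISA so that the example is self-contained.

```python
def x12_separators(isa: str) -> dict:
    """Read the delimiters from a 106-character ISA segment."""
    if not isa.startswith("ISA") or len(isa) < 106:
        raise ValueError("expected a 106-character ISA segment")
    separators = {
        "data_element": isa[3],    # position 4
        "component": isa[104],     # position 105
        "segment": isa[105],       # position 106
        "repetition": None,
    }
    version = isa[84:89]           # positions 85 to 89
    # Zero-padded 5-digit versions compare correctly as plain strings.
    if version.isdigit() and version > "00401":
        separators["repetition"] = isa[82]   # position 83
    return separators


def make_isa(version: str, isa11: str) -> str:
    """Build a made-up ISA with fixed-width elements, for demonstration only."""
    elements = [
        "ISA", "00", " " * 10, "00", " " * 10,
        "ZZ", "SENDERID".ljust(15), "ZZ", "RECEIVERID".ljust(15),
        "071101", "1701", isa11, version, "000000131", "0", "T", ">",
    ]
    return "*".join(elements) + "~"


old = x12_separators(make_isa("00401", "U"))   # 'U' is a standards identifier
new = x12_separators(make_isa("00501", "^"))   # '^' is a repetition separator
print(old)
print(new)
```

Because the ISA is fixed width, there is no need to know any delimiter up front; everything is read by position.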
EDIFACT and EANCOM share the same approach to delimiters. By default all envelopes use the following delimiters unless explicitly requested otherwise:
data element separator +
component data element separator :
repetition separator *
segment terminator '
release indicator ?
When an envelope is prefixed with a UNA segment (sitting just before the interchange header UNB), all delimiters are derived from the UNA. The structure of the UNA is positional and contains 9 characters. The first 3 are the segment tag 'UNA' and then:
component data element separator at position 4 :
data element separator at position 5 +
decimal notation at position 6 .
release indicator at position 7 ?
reserved (not used) at position 8
segment terminator at position 9 '
This is a sample of an EDIFACT envelope with UNA segment:
UNA:+.? ' UNB+UNOB:1+102096559TEST:16:ZZUK+PARTNERID:01:ZZUK+071101:1701+131++INVOIC++1++1'
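Reading the EDIFACT delimiters could be sketched like this, using the sample envelope above. The function name is hypothetical. Following the layout described here, position 8 is treated as reserved, so the repetition separator stays at its default; the default decimal notation of '.' is my assumption.

```python
EDIFACT_DEFAULTS = {
    "component": ":",
    "data_element": "+",
    "decimal": ".",        # assumed default, not listed among the delimiters above
    "release": "?",
    "repetition": "*",
    "segment": "'",
}


def edifact_separators(document: str) -> dict:
    """Return the delimiters from the UNA segment, or the defaults without one."""
    separators = dict(EDIFACT_DEFAULTS)
    if document.startswith("UNA"):
        if len(document) < 9:
            raise ValueError("UNA segment must be 9 characters long")
        separators.update({
            "component": document[3],      # position 4
            "data_element": document[4],   # position 5
            "decimal": document[5],        # position 6
            "release": document[6],        # position 7
            # position 8 is reserved, so repetition keeps its default
            "segment": document[8],        # position 9
        })
    return separators


sample = ("UNA:+.? ' UNB+UNOB:1+102096559TEST:16:ZZUK+PARTNERID:01:ZZUK"
          "+071101:1701+131++INVOIC++1++1'")
print(edifact_separators(sample))
```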
EDIFACT allows for escaping delimiters by using a release indicator ('?' by default) before the escaped delimiter. X12 does not.
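The practical consequence is that EDIFACT can't be split with a plain string split. A minimal sketch of a release-aware split follows; the split_escaped and unescape helpers are hypothetical names. The split keeps escape pairs intact so the pieces can be split again at the next level, and unescape is only applied to the final values.

```python
from typing import List


def split_escaped(text: str, delimiter: str, release: str = "?") -> List[str]:
    """Split on delimiter, ignoring delimiters preceded by the release indicator."""
    parts, current = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == release and i + 1 < len(text):
            # Copy the release indicator and the escaped character untouched.
            current.append(text[i:i + 2])
            i += 2
        elif ch == delimiter:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def unescape(value: str, release: str = "?") -> str:
    """Remove release indicators from a final data value."""
    out = []
    i = 0
    while i < len(value):
        if value[i] == release and i + 1 < len(value):
            out.append(value[i + 1])
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)


# Made-up segment: the second element contains an escaped '+'.
elements = split_escaped("FTX+ACME ?+ SONS+10??", "+")
print(elements)
print([unescape(e) for e in elements])
```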
EDI Functional Groups
EDI envelopes are the outer-most wrapper of EDI documents. They define the separators, identify the sender and the receiver, and stamp the document with control number, date and time.
The next level is the functional groups. They are used to group transactions of the same type (only invoices or only purchase orders, but not both) and of the same version. The groups are logical containers and are mandatory for X12 but optional, and somewhat obsolete, for EDIFACT. The reason for this is that in X12 the functional groups define the version of the transactions, whereas in EDIFACT the version is defined within each transaction.
Acknowledgments differ too. X12 acknowledges functional groups, e.g. generates a 997 or 999 at the end of each group. EDIFACT acknowledges envelopes, e.g. generates a CONTRL at the end of each envelope.
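When producing an X12 interchange, the grouping rule boils down to bucketing transactions by type and version. A small sketch, assuming each transaction is a plain dict with hypothetical "type" and "version" keys:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_transactions(transactions: List[dict]) -> Dict[Tuple[str, str], List[dict]]:
    """Bucket transactions so every bucket holds one type and one version."""
    groups = defaultdict(list)
    for transaction in transactions:
        groups[(transaction["type"], transaction["version"])].append(transaction)
    return dict(groups)


# Made-up transactions, for demonstration only.
outgoing = [
    {"type": "810", "version": "004010", "id": "0001"},
    {"type": "850", "version": "004010", "id": "0002"},
    {"type": "810", "version": "004010", "id": "0003"},
]
for key, members in group_transactions(outgoing).items():
    print(key, [m["id"] for m in members])
```

Each bucket would then become one functional group in the outgoing interchange.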
The inner-most level of EDI documents contains the business transactions. This is where all business data such as claims or invoices or purchase orders is stacked up. This is the core of every EDI document.
EDI transactions are defined by the organization governing the respective EDI standard, e.g. ASC X12 for X12, UNECE for EDIFACT, GS1 for EANCOM, etc. They release new versions of the transactions every year.
Unfortunately the versions are not backward compatible in either X12 or EDIFACT. This means that an invoice 810 in version 003010 is not compatible with an invoice 810 in version 004010. This is primarily because of the EDI codes that can be attached to data elements (the list of values each element can take). This is a major issue with EDI and presents one of the biggest challenges for every implementation.
Although there are many versions in circulation the most popular are 004010 for X12 and D96A for EDIFACT. If you are ever going to exchange messages with a trading partner it's highly likely that they will be using one of these two.
The first step in parsing EDI was to identify the separators. The second step is to identify the version for each transaction.
The X12 version is the last element in the GS segment or the ST segment. When present, the version in the ST segment takes precedence.
The version in the ISA segment is for the ISA itself and is irrelevant here.
The EDIFACT version is the combination of the version number and release number component elements following the message type in the UNH segment, e.g. 'D' and '96A' in INVOIC:D:96A:UN.
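As a rough sketch, version detection for both standards might look like this. The function names and the sample segments are made up for illustration. Segments are assumed to be already separated and stripped of their terminator, and the plain split ignores EDIFACT release indicators, which is good enough for header segments like these.

```python
def x12_version(gs_segment: str, st_segment: str, element_sep: str = "*") -> str:
    """Version from ST03 when present, otherwise from the last GS element."""
    gs = gs_segment.split(element_sep)
    st = st_segment.split(element_sep)
    if len(st) > 3 and st[3].strip():
        return st[3].strip()
    return gs[-1].strip()


def edifact_version(unh_segment: str, element_sep: str = "+",
                    component_sep: str = ":") -> str:
    """Version and release components that follow the message type in UNH."""
    message_id = unh_segment.split(element_sep)[2].split(component_sep)
    return message_id[1] + message_id[2]


gs = "GS*IN*SENDERID*RECEIVERID*20071101*1701*1*X*004010"
print(x12_version(gs, "ST*810*0001"))
print(x12_version(gs, "ST*837*0001*005010X222A1"))
print(edifact_version("UNH+1+INVOIC:D:96A:UN"))
```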
EDI transactions are identified by a transaction identifier. For example in X12 invoices are denoted with '810' and in EDIFACT with 'INVOIC'.
The X12 transaction identifier is the first data element in the ST segment, e.g. ST*Transaction identifier*, or '837' in the sample above.
The EDIFACT transaction identifier is the first component of the second data element in the UNH segment, e.g. UNH+some data+Transaction identifier:, or 'INVOIC' in the sample above.
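Pulling the identifier out is a matter of indexing into the split segment. A minimal sketch with hypothetical function names and made-up segments; the separators default to the common ones but should really come from the envelope:

```python
def x12_transaction_id(st_segment: str, element_sep: str = "*") -> str:
    """First data element of ST (index 0 is the segment tag itself)."""
    return st_segment.split(element_sep)[1]


def edifact_transaction_id(unh_segment: str, element_sep: str = "+",
                           component_sep: str = ":") -> str:
    """First component of the second data element of UNH."""
    return unh_segment.split(element_sep)[2].split(component_sep)[0]


print(x12_transaction_id("ST*837*0001"))
print(edifact_transaction_id("UNH+1+INVOIC:D:96A:UN"))
```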
EDI is (Un)Dead
The conundrum is not how to parse EDI but what to parse it to ? What would be a suitable data structure that can represent hierarchical data and is programming language agnostic ? XML springs to mind.
There is no standardized approach to defining EDI transactions. Well, there kind of is and isn't. XML has XSD. Interestingly EDI transactions are also provided as XSDs by their governing organizations. That's the IS. Almost nobody uses those XSDs. That's the ISN'T.
This is one of the pitfalls of EDI. This is why EDI is considered cumbersome and why programmers given the task of dealing with it feel they've drawn the short straw. Every major company that offers EDI software of some kind has taken its own proprietary approach to defining EDI. Some still use XSDs, but they are not the same as the ones provided by ASC X12 or UNECE. Some have implemented meta-scripts such as SEF, which are so complex that virtually no one dares to use them.
That's the first issue with EDI - there is no well-established, generic approach to defining business transactions programmatically. It's pretty much every company for itself.
The second issue is with the business transactions themselves. Although they are governed, in reality companies have taken the liberty to amend them whenever they see fit. The bigger the company, the easier it is to apply changes and then force all trading partners to comply. Deviating from the standards resulted in a proliferation of representations of business transactions, e.g. the structure of an EDI invoice for company A is different from that for company B, nullifying the fact that they both use the same EDI version and EDI transaction.
All this basically defeats the purpose of the EDI standard. It was meant to be the common ground, or the canonical version of truth. Straying from this exacerbated the situation. There was no truth anymore. This led to the premature conclusion that EDI was dead. The attempt to standardize the exchange of business documents had failed.
If companies can exchange data modeled according to their internal representations, then why would they need EDI on top of it ? Why would they need this extra layer of complexity ? This is the dirty secret of EDI. No one needs it in the state it is in today, however it has spread its tentacles so wide and deep and has accumulated such a mass that it is a force to be reckoned with. EDI is here to stay and there is no way around it.
EDI can be viewed as an extra compression or encoding of data. That's pretty much what it is from a programmer's perspective. When a purchase order is pulled out from a database and you need to send it to another application what is the general approach?
I bet that the receiving application exposes a RESTful or SOAP endpoint and you only have to consume it. There will be mapping code at your end to map the purchase order from the database to the purchase order structure defined at the service endpoint. Why did we need EDI again ?
Companies exchanging EDI would rarely expose service endpoints. Instead they would have a mailbox, some sort of FTP (SFTP, FTPS, etc.), good old AS2 over HTTP, or something else. Or simply a file share. The upshot is that they will provide a transport interface that requires EDI documents to be exchanged as files or streams. It's pretty old school.
From an architectural perspective this isn't too bad. You will still pull the purchase order out of the database, but then what do you map it to ? How do you produce an EDI file out of it ?
The same is valid for the receiver. They'll get an EDI file via FTP or AS2 but how would they read it ? What would be the medium ?
EDI to What ?
Earlier on I alluded to the similarity between EDI and XML. Both can be used to represent hierarchical data structures. Both use meta-formats to define these data structures (XSD, etc.).
Understandably all X12 and EDIFACT transactions are provided as XSDs by their governing organizations. XSD is language agnostic, and there are good parsers for XML in every programming language. Coupling EDI to XML sounds like a good idea and the effort to get the ball rolling seems justifiable. If EDI can be defined with XSD and represented as XML, a good bet is that EDI would become obsolete at some point in the future. The two would slowly merge into one, with XML absorbing EDI. Or so people thought. To the surprise of many, nothing like this has happened and XML never prevailed. Both continue to co-exist in the same troubled relationship.
The thing is that EDI still needs to be converted to and from XML. It is uncommon for applications to represent data internally using XML. They model their domain data structures in the programming languages they use, be it C++, C#, Java, etc. Therefore XML is predominantly used at the outer borders, in data exchange. Internal objects are natively serialized to or deserialized from XML.
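To illustrate that last point, here is a small sketch using Python's standard xml.etree.ElementTree module. The PurchaseOrder class and its fields are made up; the point is only that the domain object lives in the language, and XML appears at the border.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields


@dataclass
class PurchaseOrder:
    """A made-up internal domain object."""
    number: str
    buyer: str
    total: str


def to_xml(order: PurchaseOrder) -> str:
    # One child element per dataclass field, named after the field.
    root = ET.Element(type(order).__name__)
    for f in fields(order):
        ET.SubElement(root, f.name).text = str(getattr(order, f.name))
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text: str) -> PurchaseOrder:
    # Rebuild the object from the child elements' tags and text.
    root = ET.fromstring(xml_text)
    return PurchaseOrder(**{child.tag: child.text for child in root})


order = PurchaseOrder(number="PO-1", buyer="ACME", total="100.00")
xml_text = to_xml(order)
print(xml_text)
print(from_xml(xml_text) == order)
```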
Most of the existing EDI translators rely on this forced marriage between EDI and XML. They convert EDI to XML and vice versa. Then XML can be manipulated using XML DOM and exported to whatever. I must admit that this does appear to be intuitive. I, as others before me, based my first EDI parsers on XML. I believed that XML is the structure of choice when it comes to EDI. I was wrong.
This is the first part of this article. It was meant to be an intro to EDI. In part two I'll go into the nitty-gritty details of how to represent EDI, and the optimal algorithm to parse EDI documents.
Go to Part 2 - EDI Translator.